


Title:
A METHOD AND TECHNICAL EQUIPMENT FOR RENDERING MEDIA CONTENT
Document Type and Number:
WIPO Patent Application WO/2018/109266
Kind Code:
A1
Abstract:
The invention relates to a method, comprising receiving media content relating to a video data sequence, the media content comprising a reference sparse voxel octree (900) with identifications and frame change sets for frames of the video data sequence; storing data of the media content to one or more buffers, wherein identifications of the reference sparse voxel octree are used as indices between the reference sparse voxel octree and frame subtrees of the frame change sets; for a selected one or more nodes of the reference sparse voxel octree, using an identification of a node for determining a subtree lookup table (910) defining addresses for subtree root nodes for the frame subtrees within a buffer (920); when the determined subtree lookup table comprises an address of a subtree root node, performing raycasting using the nodes found in the frame subtrees and the selected one or more nodes. The invention also relates to technical equipment for implementing the method.

Inventors:
KERÄNEN JAAKKO (FI)
Application Number:
PCT/FI2017/050847
Publication Date:
June 21, 2018
Filing Date:
November 30, 2017
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H04N19/96; G06T15/06; G06T15/08; G06T17/00; H04N13/00
Foreign References:
US5579455A1996-11-26
Other References:
KAMPE, V. ET AL.: "Exploiting Coherence in Time-Varying Voxel Data", PROCEEDINGS OF THE 20TH ACM SIGGRAPH SYMPOSIUM ON INTERACTIVE 3D GRAPHICS AND GAMES (I3D '16), 28 February 2016, pages 15-21, XP058079601, retrieved from the Internet [retrieved on 2018-03-27]
KAMMERL, J. ET AL.: "Real-time Compression of Point Cloud Streams", IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 18 May 2012, pages 778-785, XP032194242, retrieved from the Internet [retrieved on 2018-04-04]
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (FI)
Claims:

1. A method, comprising:

- receiving media content relating to a video data sequence, the media content comprising a unique reference sparse voxel octree where said unique reference sparse voxel octree is used as a reference for all other frames of the video sequence with identifications and frame change sets for frames of the video data sequence compared to the reference sparse voxel octree;

- storing data of the media content to one or more buffers, wherein identifications of the reference sparse voxel octree are used as indices in one or more mapping tables;

- for a selected one or more nodes of the reference sparse voxel octree

• selecting a mapping table;

• based on an identification of the node, reading an address of a subtree from the mapping table, wherein the read address comprises an address of a subtree root node, performing raycasting using the nodes found in the frame subtrees and the selected one or more nodes.

2. The method according to claim 1, comprising maintaining alternative subtree lookup tables for previous and/or following frames, the alternative subtree lookup tables being used when a rendered pixel turns out empty.

3. The method according to claim 1 or 2, wherein a node is a selected node when it intersects a line from a viewer's eye to a rendered pixel.

4. The method according to claim 3, wherein a deleted node is not a selected node.

5. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- to receive media content relating to a video data sequence, the media content comprising a unique reference sparse voxel octree where said unique reference sparse voxel octree is used as a reference for all other frames of the video sequence with identifications and frame change sets for frames of the video data sequence compared to the reference sparse voxel octree;

- to store data of the media content to one or more buffers, wherein identifications of the reference sparse voxel octree are used as indices in one or more mapping tables;

- for a selected one or more nodes of the reference sparse voxel octree

• to select a mapping table;

• based on an identification of the node, to read an address of a subtree from the mapping table, wherein the read address comprises an address of a subtree root node, to perform raycasting using the nodes found in the frame subtrees and the selected one or more nodes.

6. The apparatus according to claim 5, comprising computer program code to cause the apparatus to maintain alternative subtree lookup tables for previous and/or following frames, the alternative subtree lookup tables being used when a rendered pixel turns out empty.

7. The apparatus according to claim 5 or 6, wherein a node is a selected node when it intersects a line from a viewer's eye to a rendered pixel.

8. A computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:

- receive media content relating to a video data sequence, the media content comprising a unique reference sparse voxel octree where said unique reference sparse voxel octree is used as a reference for all other frames of the video sequence with identifications and frame change sets for frames of the video data sequence compared to the reference sparse voxel octree;

- store data of the media content to one or more buffers, wherein identifications of the reference sparse voxel octree are used as indices in one or more mapping tables;

- for a selected one or more nodes of the reference sparse voxel octree

• to select a mapping table;

• based on an identification of the node, to read an address of a subtree from the mapping table, wherein the read address comprises an address of a subtree root node, to perform raycasting using the nodes found in the frame subtrees and the selected one or more nodes.

Description:
A METHOD AND TECHNICAL EQUIPMENT FOR RENDERING MEDIA CONTENT

Technical Field

The present solution generally relates to rendering media content. In particular, the solution relates to volumetric rendering and virtual reality (VR).

Background

Since the beginning of photography and cinematography, the most common type of image and video content has been captured and displayed as a two-dimensional (2D) rectangular scene. The main reason for this is that cameras are mainly directional, i.e., they capture only a limited angular field of view (the field of view towards which they are directed).

More recently, new image and video capture devices have become available. These devices are able to capture visual and audio content all around themselves, i.e. they can capture the whole angular field of view, sometimes referred to as a 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all axes). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being "immersed" in the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.

Summary

Now there has been invented an improved method and technical equipment implementing the method, for real-time computer graphics and virtual reality. Various aspects of the invention include a method, an apparatus, and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.

According to a first aspect, there is provided a method comprising receiving media content relating to a video data sequence, the media content comprising a reference sparse voxel octree with identifications and frame change sets for frames of the video data sequence; storing data of the media content to one or more buffers, wherein identifications of the reference sparse voxel octree are used as indices between the reference sparse voxel octree and frame subtrees of the frame change sets; for a selected one or more nodes of the reference sparse voxel octree, using an identification of a node for determining a subtree lookup table defining addresses for subtree root nodes for the frame subtrees within a buffer; when the determined subtree lookup table comprises an address of a subtree root node, performing raycasting using the nodes found in the frame subtrees and the selected one or more nodes.

According to an embodiment, the method comprises maintaining alternative subtree lookup tables for previous and/or following frames, the alternative subtree lookup tables being used when a rendered pixel turns out empty.

According to an embodiment, a node is a selected node when it intersects a line from a viewer's eye to a rendered pixel.

According to an embodiment, a deleted node is not a selected node.

According to a second aspect, there is provided an apparatus comprising at least one processor and memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: to receive media content relating to a video data sequence, the media content comprising a reference sparse voxel octree with identifications and frame change sets for frames of the video data sequence; to store data of the media content to one or more buffers, wherein identifications of the reference sparse voxel octree are used as indices between the reference sparse voxel octree and frame subtrees of the frame change sets; for a selected one or more nodes of the reference sparse voxel octree, to use an identification of a node for determining a subtree lookup table defining addresses for subtree root nodes for the frame subtrees within a buffer; when the determined subtree lookup table comprises an address of a subtree root node, to perform raycasting using the nodes found in the frame subtrees and the selected one or more nodes.

According to an embodiment, the apparatus further comprises computer program code to cause the apparatus to maintain alternative subtree lookup tables for previous and/or following frames, the alternative subtree lookup tables being used when a rendered pixel turns out empty.

According to an embodiment, a node is a selected node when it intersects a line from a viewer's eye to a rendered pixel.

According to a third aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive media content relating to a video data sequence, the media content comprising a reference sparse voxel octree with identifications and frame change sets for frames of the video data sequence; store data of the media content to one or more buffers, wherein identifications of the reference sparse voxel octree are used as indices between the reference sparse voxel octree and frame subtrees of the frame change sets; for a selected one or more nodes of the reference sparse voxel octree, to use an identification of a node for determining a subtree lookup table defining addresses for subtree root nodes for the frame subtrees within a buffer; when the determined subtree lookup table comprises an address of a subtree root node, to perform raycasting using the nodes found in the frame subtrees and the selected one or more nodes.

Description of the Drawings

In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows a system and apparatuses for stereo viewing;

Fig. 2a shows a camera device for stereo viewing;

Fig. 2b shows a head-mounted display for stereo viewing;

Fig. 3 shows a camera according to an embodiment;

Figs. 4a, 4b show examples of a multicamera capturing device;

Figs. 5a, 5b show an encoder and a decoder according to an embodiment;

Fig. 6 illustrates an example of processing steps of manipulating volumetric video data;

Fig. 7 shows an example of a volumetric video pipeline;

Fig. 8 shows an example of an output of a video encoding;

Fig. 9 shows an example of data prepared by a renderer;

Fig. 10 is a flowchart of a method according to an embodiment; and

Fig. 11 shows an apparatus according to an embodiment.

Description of Example Embodiments

The present embodiments relate to real-time computer graphics and virtual reality (VR).

Volumetric video may be captured using one or more 3D cameras. Volumetric video is to virtual reality what traditional video is to 2D/3D displays. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.

The present embodiments are discussed in relation to media content captured with one or more multicamera devices. A multicamera device comprises two or more cameras, wherein the two or more cameras may be arranged in pairs in said multicamera device. Each said camera has a respective field of view, and each said field of view covers the view direction of the multicamera device. The multicamera device may comprise cameras at locations corresponding to at least some of the eye positions of a human head at normal anatomical posture, eye positions of the human head at maximum flexion anatomical posture, eye positions of the human head at maximum extension anatomical postures, and/or eye positions of the human head at maximum left and right rotation anatomical postures. The multicamera device may comprise at least three cameras, the cameras being disposed such that their optical axes in the direction of the respective camera's field of view fall within a hemispheric field of view, the multicamera device comprising no cameras having their optical axes outside the hemispheric field of view, and the multicamera device having a total field of view covering a full sphere.

The multicamera device described here may have cameras with wide-angle lenses. The multicamera device may be suitable for creating stereo viewing image data and/or multiview video, comprising a plurality of video sequences for the plurality of cameras. The multicamera device may be such that any pair of cameras of the at least two cameras has a parallax corresponding to the parallax (disparity) of human eyes for creating a stereo image. At least two cameras may have overlapping fields of view such that an overlap region for which every part is captured by said at least two cameras is defined, and such an overlap area can be used in forming the image for stereo viewing.

Fig. 1 shows a system and apparatuses for stereo viewing, that is, for 3D video and 3D audio digital capture and playback. The task of the system is that of capturing sufficient visual and auditory information from a specific location such that a convincing reproduction of the experience, or presence, of being in that location can be achieved by one or more viewers physically located in different locations and optionally at a time later in the future. Such reproduction requires more information than can be captured by a single camera or microphone, in order that a viewer can determine the distance and location of objects within the scene using their eyes and their ears. To create a pair of images with disparity, two camera sources are used. In a similar manner, for the human auditory system to be able to sense the direction of sound, at least two microphones are used (the commonly known stereo sound is created by recording two audio channels). The human auditory system can detect cues, e.g. a timing difference between the audio signals, to detect the direction of sound.

The system of Fig. 1 may consist of three main parts: image sources, a server and a rendering device. A video capture device SRC1 comprises multiple cameras CAM1, CAM2, CAMN with overlapping fields of view so that regions of the view around the video capture device are captured by at least two cameras. The device SRC1 may comprise multiple microphones to capture the timing and phase differences of audio originating from different directions. The device SRC1 may comprise a high resolution orientation sensor so that the orientation (direction of view) of the plurality of cameras can be detected and recorded. The device SRC1 comprises or is functionally connected to a computer processor PROC1 and memory MEM1, the memory comprising computer program PROGR1 code for controlling the video capture device. The image stream captured by the video capture device may be stored on a memory device MEM2 for use in another device, e.g. a viewer, and/or transmitted to a server using a communication interface COMM1. It needs to be understood that although an 8-camera cubical setup is described here as part of the system, another multicamera (e.g. a stereo camera) device may be used instead as part of the system.

Alternatively or in addition to the video capture device SRC1 creating an image stream, or a plurality of such, one or more sources SRC2 of synthetic images may be present in the system. Such sources of synthetic images may use a computer model of a virtual world to compute the various image streams it transmits. For example, the source SRC2 may compute N video streams corresponding to N virtual cameras located at a virtual viewing position. When such a synthetic set of video streams is used for viewing, the viewer may see a three-dimensional virtual world. The device SRC2 comprises or is functionally connected to a computer processor PROC2 and memory MEM2, the memory comprising computer program PROGR2 code for controlling the synthetic sources device SRC2. The image stream captured by the device may be stored on a memory device MEM5 (e.g. memory card CARD1 ) for use in another device, e.g. a viewer, or transmitted to a server or the viewer using a communication interface COMM2. There may be a storage, processing and data stream serving network in addition to the capture device SRC1 . For example, there may be a server SERVER or a plurality of servers storing the output from the capture device SRC1 or computation device SRC2. The device SERVER comprises or is functionally connected to a computer processor PROC3 and memory MEM3, the memory comprising computer program PROGR3 code for controlling the server. The device SERVER may be connected by a wired or wireless network connection, or both, to sources SRC1 and/or SRC2, as well as the viewer devices VIEWER1 and VIEWER2 over the communication interface COMM3.

For viewing the captured or created video content, there may be one or more viewer devices VIEWER1 and VIEWER2. These devices may have a rendering module and a display module, or these functionalities may be combined in a single device. The devices may comprise or be functionally connected to a computer processor PROC4 and memory MEM4, the memory comprising computer program PROG4 code for controlling the viewing devices. The viewer (playback) devices may consist of a data stream receiver for receiving a video data stream from a server and for decoding the video data stream. The data stream may be received over a network connection through communications interface COMM4, or from a memory device MEM6 like a memory card CARD2. The viewer devices may have a graphics processing unit for processing of the data to a suitable format for viewing. The viewer VIEWER1 comprises a high-resolution stereo-image head-mounted display for viewing the rendered stereo video sequence. The head-mounted display may have an orientation sensor DET1 and stereo audio headphones. According to an embodiment, the viewer VIEWER2 comprises a display enabled with 3D technology (for displaying stereo video), and the rendering device may have a head-orientation detector DET2 connected to it. Alternatively, the viewer VIEWER2 may comprise a 2D display, since the volumetric video rendering can be done in 2D by rendering the viewpoint from a single eye instead of a stereo eye pair. Any of the devices (SRC1 , SRC2, SERVER, RENDERER, VIEWER1 , VIEWER2) may be a computer or a portable computing device, or be connected to such. Such rendering devices may have computer program code for carrying out methods according to various examples described in this text.

Fig. 2a shows a camera device 200 for stereo viewing. The camera comprises two or more cameras that are configured into camera pairs 201 for creating the left and right eye images, or that can be arranged into such pairs. The distances between cameras may correspond to the usual (or average) distance between the human eyes. The cameras may be arranged so that they have significant overlap in their fields of view. For example, wide-angle lenses of 180 degrees or more may be used, and there may be 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, or 20 cameras. The cameras may be regularly or irregularly spaced to access the whole sphere of view, or they may cover only part of the whole sphere. For example, there may be three cameras arranged in a triangle and having different directions of view towards one side of the triangle such that all three cameras cover an overlap area in the middle of the directions of view. As another example, eight cameras having wide-angle lenses may be arranged regularly at the corners of a virtual cube, covering the whole sphere such that the whole or essentially the whole sphere is covered in all directions by at least 3 or 4 cameras. In Fig. 2a three stereo camera pairs 201 are shown. Multicamera devices with other types of camera layouts may be used. For example, a camera device with all cameras in one hemisphere may be used. The number of cameras may be e.g., 2, 3, 4, 6, 8, 12, or more. The cameras may be placed to create a central field of view where stereo images can be formed from image data of two or more cameras, and a peripheral (extreme) field of view where one camera covers the scene and only a normal non-stereo image can be formed.

Fig. 2b shows a head-mounted display (HMD) for stereo viewing. The head-mounted display comprises two screen sections or two screens DISP1 and DISP2 for displaying the left and right eye images. The displays are close to the eyes, and therefore lenses are used to make the images easily viewable and for spreading the images to cover as much as possible of the eyes' field of view. The device is attached to the head of the user so that it stays in place even when the user turns his head. The device may have an orientation detecting module ORDET1 for determining the head movements and direction of the head. The head-mounted display gives a three-dimensional (3D) perception of the recorded/streamed content to a user.

Fig. 3 illustrates a camera CAM1 . The camera has a camera detector CAMDET1 , comprising a plurality of sensor elements for sensing intensity of the light hitting the sensor element. The camera has a lens OBJ1 (or a lens arrangement of a plurality of lenses), the lens being positioned so that the light hitting the sensor elements travels through the lens to the sensor elements. The camera detector CAMDET1 has a nominal center point CP1 that is a middle point of the plurality of sensor elements, for example for a rectangular sensor the crossing point of the diagonals. The lens has a nominal center point PP1 , as well, lying for example on the axis of symmetry of the lens. The direction of orientation of the camera is defined by the line passing through the center point CP1 of the camera sensor and the center point PP1 of the lens. The direction of the camera is a vector along this line pointing in the direction from the camera sensor to the lens. The optical axis of the camera is understood to be this line CP1 -PP1 .

The system described above may function as follows. Time-synchronized video, audio and orientation data is first recorded with the capture device. This can consist of multiple concurrent video and audio streams as described above. These are then transmitted immediately or later to the storage and processing network for processing and conversion into a format suitable for subsequent delivery to playback devices. The conversion can involve post-processing steps to the audio and video data in order to improve the quality and/or reduce the quantity of the data while preserving the quality at a desired level. Finally, each playback device receives a stream of the data from the network, and renders it into a stereo viewing reproduction of the original location which can be experienced by a user with the head-mounted display and headphones.

Figs. 4a and 4b show an example of a camera device for being used as a source for media content, such as images and/or video. To create a full 360 degree stereo panorama, every direction of view needs to be photographed from two locations, one for the left eye and one for the right eye. In the case of a video panorama, these images need to be shot simultaneously to keep the eyes in sync with each other. As one camera cannot physically cover the whole 360 degree view, at least without being obscured by another camera, there need to be multiple cameras to form the whole 360 degree panorama. Additional cameras however increase the cost and size of the system and add more data streams to be processed. This problem becomes even more significant when mounting cameras on a sphere or platonic solid shaped arrangement to get more vertical field of view. However, even by arranging multiple camera pairs on, for example, a sphere or platonic solid such as an octahedron or a dodecahedron, the camera pairs will not achieve free angle parallax between the eye views. The parallax between eyes is fixed to the positions of the individual cameras in a pair, that is, in the direction perpendicular to the camera pair, no parallax can be achieved. This is problematic when the stereo content is viewed with a head-mounted display that allows free rotation of the viewing angle around the z-axis as well. The requirement for multiple cameras covering every point around the capture device twice would require a very large number of cameras in the capture device. In this technique, lenses are used with a field of view of 180 degrees (hemisphere) or greater, and the cameras are arranged with a carefully selected arrangement around the capture device. Such an arrangement is shown in Fig. 4a, where the cameras have been positioned at the corners of a virtual cube, having orientations DIR_CAM1, DIR_CAM2, DIR_CAMN pointing away from the center point of the cube. Naturally, other shapes, e.g. the shape of a cuboctahedron, or other arrangements, even irregular ones, can be used.

A video codec consists of an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. Typically, the encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at a lower bitrate). An example of an encoding process is illustrated in Figure 5a. Figure 5a illustrates an image to be encoded (I_n); a predicted representation of an image block (P'_n); a prediction error signal (D_n); a reconstructed prediction error signal (D'_n); a preliminary reconstructed image (I'_n); a final reconstructed image (R'_n); a transform (T) and inverse transform (T^-1); a quantization (Q) and inverse quantization (Q^-1); entropy encoding (E); a reference frame memory (RFM); inter prediction (P_inter); intra prediction (P_intra); mode selection (MS) and filtering (F). An example of a decoding process is illustrated in Figure 5b. Figure 5b illustrates a predicted representation of an image block (P'_n); a reconstructed prediction error signal (D'_n); a preliminary reconstructed image (I'_n); a final reconstructed image (R'_n); an inverse transform (T^-1); an inverse quantization (Q^-1); an entropy decoding (E^-1); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).

Figure 6 demonstrates an example of processing steps of manipulating volumetric video data, starting from raw camera frames (from various locations within the world) and ending with a frame rendered at a freely-selected 3D viewpoint. The starting point 610 is media content obtained from one or more camera devices. The media content may comprise raw camera frame images, depth maps, and camera 3D positions. The recorded media content, i.e. image data, is used to construct an animated 3D model 620 of the world. The viewer is then freely able to choose his/her position and orientation within the world when the volumetric video is being played back 630.

A sparse voxel octree is a central data structure on which the present embodiments are based. A "voxel" of a three-dimensional world corresponds to a pixel of a two-dimensional world. Voxels exist in a 3D grid layout. An octree is a tree data structure used to partition a three-dimensional space. Octrees are the three-dimensional analog of quadtrees. A sparse voxel octree describes a volume of a space containing a set of solid voxels of varying sizes. Empty areas within the volume are absent from the tree, which is why it is called "sparse".

A volumetric video frame is a complete sparse voxel octree that models the world at a specific point in time in a video sequence. Voxel attributes contain information like color, opacity, surface normal vectors, and surface material properties. These are referenced in the sparse voxel octrees (e.g. the color of a solid voxel), but can also be stored separately.

In computer graphics, one use for voxel octrees is "raycasting". Raycasting can be used for determining which voxel a 3D ray collides with inside an entire voxel volume. This entails traversing all the octree nodes that intersect a given ray, until an intersecting solid voxel is discovered. The octree can be traversed either recursively or in a single loop. Sparse voxel octrees make this a more efficient operation because all empty spaces can be skipped.
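As a rough illustration of the data structure and traversal just described, the following C++ sketch shows a simple pointer-based octree node and a recursive raycast that implicitly skips empty space by never descending into absent children. The node layout, the convention that a Location ID of 0 means "no ID", and the intersectsOctant() placeholder are illustrative assumptions, not the encoding used by the embodiments.

// Minimal sketch (assumptions, not the source's layout) of a sparse voxel
// octree node and a recursive raycast.
#include <array>
#include <cstdint>
#include <memory>

struct Ray { float origin[3]; float dir[3]; };

struct SvoNode {
    bool solid = false;                              // a solid leaf voxel a ray can hit
    uint32_t locationId = 0;                         // hypothetical convention: 0 = no Location ID
    std::array<std::unique_ptr<SvoNode>, 8> child;   // absent children represent empty space
};

// Placeholder: a real renderer would test the ray against the child octant's box.
static bool intersectsOctant(const Ray&, const SvoNode&, int /*octant*/) { return true; }

// Returns true if the ray hits a solid voxel somewhere in this subtree.
static bool raycast(const Ray& ray, const SvoNode& node) {
    if (node.solid) return true;                     // intersecting solid voxel discovered
    for (int i = 0; i < 8; ++i) {                    // a real traversal would visit children near-to-far
        if (node.child[i] && intersectsOctant(ray, node, i) && raycast(ray, *node.child[i]))
            return true;
    }
    return false;                                    // empty space is skipped implicitly
}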

Volumetric video is composed of large amounts of data. VR rendering has high frame rate and resolution requirements, so GPU (Graphics Processing Unit) hardware may be used. In that case, the data has to be transferred from system memory to GPU memory before it can be rendered. However, large memory transfers hinder rendering performance - GPUs are optimized for data that stays unchanged in GPU memory as long as possible.

As mentioned, a volumetric video makes it possible for a viewer to move freely inside the virtual world. The viewer can also move to areas outside the captured footage, or see objects from angles that were not captured by any of the cameras. These "occlusions" lead to blank spaces in the final rendered view.

The present embodiments are based on an encoded volumetric video sequence. When raycasting in the volumetric reference frame's sparse voxel octree, the Location IDs stored in the octree nodes during encoding are checked. The Location IDs are used as indices to a lookup table. The raycasting operation may then switch between the reference frame's octree and the subtrees located via the lookup table. This technique can be applied to re-cast rays that did not hit any solid voxels simply by swapping the lookup table and repeating the raycasting operation. The re-casting allows filling occlusions (i.e. blank areas) in the final rendered 2D frames with valid information from another point in time within the video sequence. According to an embodiment, the lookup table can be disabled by using information from the reference octree. This can be done as a fallback option, if a ray fails to hit anything. Depending on how the reference octree was built, this may result in occlusions being filled with data from another point in time or some manner of synthesized data.

Figure 7 illustrates an example of a volumetric video pipeline. The present embodiments are targeted at the "Frame Selection" and "Temporal Augmentation" stages of Voxel Rendering 790 in the pipeline.

At the beginning of the process, multiple cameras 710 capture video data of the world, which video data is input 720 to the pipeline. The video data comprises camera frames, positions and depth maps 730, which are transmitted to the Voxel Encoding 740.

During the "Video Sequencing" stage of the Voxel Encoding 740, the input video material has been divided into shorter sequences of volumetric frames. A single volumetric reference frame may have been chosen for each sequence. The reference frame can be the first frame in the sequence, any one of the other frames in the sequence, or it may have been synthesized based on one or more frames in the sequence.

The encoder has produced a sparse voxel octree for the sequence's volumetric reference frame, and for the volumetric frame currently being encoded. At the "Change Detection" stage, the encoder is configured to process each frame in the sequence separately. Each frame may be compared against the one reference frame chosen for the sequence. The comparison results in a change set, where some nodes of the tree may have been deleted, some nodes in the tree may have been added, and/or some nodes may have changed their content.

As an output for each frame, the encoder produces a frame change set. This is also illustrated in Figure 8. The change sets comprise at least a frame number (e.g. within the encoded sequence of frames) and a set of Location IDs 801, 802, each associated with a sparse voxel subtree 805. A deleted subtree is encoded as a special value that identifies that no subtree exists for that location ("X" in Figure 8). If no changes were detected in the compared frames, the change set can be omitted from the output entirely. The output data for the entire video sequence contains the full reference octree (that contains the Location IDs) and all the frame change sets. In the output data, attributes can be shared between the octrees/subtrees to reduce total data size.

According to an embodiment, the renderer receives a SVOX (Sparse VOXel) file 750 comprising the sparse voxel octree of volumetric reference data and a number of frame change sets as illustrated in Figure 8.
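A minimal sketch of how the encoder output described above could be held in memory is given below; the type and field names are hypothetical, and the subtree payload is left as an opaque byte blob rather than a concrete serialization format.

// Hypothetical in-memory layout of the encoder output: a full reference octree
// plus one change set per frame, each mapping Location IDs to subtrees.
// A deleted subtree is marked with a flag instead of the special "X" value.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct SubtreeBlob {
    std::vector<uint8_t> encodedNodes;   // serialized sparse voxel subtree (805)
    bool deleted = false;                // no subtree exists at this location in this frame
};

struct FrameChangeSet {
    uint32_t frameNumber = 0;                              // within the encoded sequence
    std::unordered_map<uint32_t, SubtreeBlob> byLocation;  // Location ID (801, 802) -> subtree
};

struct EncodedSequence {
    std::vector<uint8_t> referenceOctree;     // full reference octree containing the Location IDs
    std::vector<FrameChangeSet> changeSets;   // frames with no detected changes may be omitted
};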

The renderer is configured to prepare the data relevant for the current time within the video sequence. This can be done for multiple consecutive frames at a time, with the assumption that the renderer will keep the data for several frames available in GPU memory at once. Figure 9 illustrates the data that the renderer prepares in GPU buffers 900, 910, 920. The contents of the Reference Octree buffer 900 are copied to GPU memory without any further processing. The Location IDs 905 of the Reference Octree are stored as indices in a lookup table 910. The Reference Octree data was produced by the voxel encoder (Fig. 7: 740). The entire video sequence may use the same reference octree, so the buffers 900, 910, 920 can remain untouched for longer periods of time.

The Frame Subtrees buffer 920 contains one or more subtrees as found in the encoded Frame Change Sets. The memory locations of each subtree's root node are noted when the subtrees are copied into one or more GPU buffers. The Subtree Lookup Tables 910 are small buffers (e.g. 65536 entries, 256 x 256 texels) that are filled in with the addresses of the subtree root nodes within the Frame Subtrees buffer 920. The lookup tables use node Location IDs as indices, using the Location IDs that are stored in the Reference Octree 900. Deleted locations are written as special invalid values (e.g., -1). Values for all remaining unused Location IDs are set to zero. Since the lookup tables are small, there can be many of them in memory at once, and swapping in new ones is a fast and simple operation.
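The following sketch shows one way a Subtree Lookup Table could be filled on the CPU before upload to the GPU; the 65536-entry size matches the example above, while the int32 entry format, the function name and the way subtree addresses are passed in are assumptions.

// Sketch of building one Subtree Lookup Table: 0 = Location ID not animated,
// -1 = node deleted in this frame, anything else = address of the subtree root
// node inside the Frame Subtrees buffer (920).
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr int32_t kNotAnimated = 0;
constexpr int32_t kDeleted = -1;

std::vector<int32_t> buildSubtreeLookupTable(
        const std::unordered_map<uint32_t, int32_t>& subtreeRootAddress,  // Location ID -> offset
        const std::vector<uint32_t>& deletedLocationIds,
        std::size_t tableSize = 65536) {               // e.g. a 256 x 256 texel buffer
    std::vector<int32_t> table(tableSize, kNotAnimated);
    for (const auto& [locationId, address] : subtreeRootAddress)
        table[locationId] = address;                   // addresses noted when subtrees were copied to the GPU
    for (uint32_t locationId : deletedLocationIds)
        table[locationId] = kDeleted;
    return table;                                      // small enough to keep many frames resident at once
}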

In a Frame Selection phase, the renderer is configured to pick which Subtree Lookup Table 910 is the active one for the current point of time. For each on-screen pixel, the renderer casts a view ray into the voxel volume. This is a recursive operation, so all nodes of the octree are handled in the same way. The raycasting begins with the root node of the reference octree, and proceeds into all of the child nodes that intersect the view ray. The child that is nearest to the starting point of the view ray may be handled first. When the current node has a Location ID 905, the currently active Subtree Lookup Table 911 is checked. There are three possible outcomes for this check:

1) The lookup table entry contains a zero value, which means that the node is not animated. Raycasting continues as usual.

2) The lookup table entry contains a special invalid value (e.g. -1), indicating that the node has been deleted in this frame. The node is treated as empty and raycasting returns to the parent node.

3) The lookup table entry contains the address of a subtree root node (as shown in Figure 9). In this case, the current octree is switched and raycasting continues using the nodes found in the Frame Subtrees buffer 920 at the given address.

If the raycasting fails to hit any solid voxels in a frame subtree, it returns back to the parent node in the reference octree. The address of this parent node was saved in the frame subtree by the encoder.
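A compact sketch of the three-way check described above might look like the following; the entry values mirror the earlier lookup-table sketch and the names are assumptions rather than the shader-side code that would be used in practice.

// Sketch of the per-node decision during raycasting, using the conventions
// from the lookup-table sketch above (0 = not animated, -1 = deleted).
#include <cstdint>
#include <vector>

enum class LookupOutcome { NotAnimated, Deleted, SwitchToSubtree };

struct LookupDecision {
    LookupOutcome outcome;
    int32_t subtreeAddress;   // valid only when outcome == SwitchToSubtree
};

LookupDecision checkActiveLookupTable(uint32_t locationId,
                                      const std::vector<int32_t>& activeTable) {
    if (locationId == 0 || locationId >= activeTable.size())
        return {LookupOutcome::NotAnimated, 0};          // node carries no Location ID
    const int32_t entry = activeTable[locationId];
    if (entry == 0)
        return {LookupOutcome::NotAnimated, 0};          // outcome 1: continue raycasting as usual
    if (entry == -1)
        return {LookupOutcome::Deleted, 0};              // outcome 2: treat as empty, return to parent
    return {LookupOutcome::SwitchToSubtree, entry};      // outcome 3: continue in the Frame Subtrees buffer
}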

It is likely that some areas in the rendered frame turn out empty because the view ray failed to hit any solid voxels. This is possible because the captured video footage may contain occluded regions (i.e. the voxel representation of the world is missing some parts of the world). This is also particularly relevant when the viewer moves to different locations inside the virtual world and is thus free to examine the world from all directions.

Using a set of alternative Subtree Lookup Tables 910, in the Temporal Augmentation phase, the renderer is able to retry casting view rays. In practice, if the renderer keeps a number of consecutive frames in memory at once, it already has the lookup tables for a number of previous and/or following frames available. In the case of occlusions created by moving objects, the previous or following frames may provide useful unoccluded information.

Finally, it should be noted that ray recasting can take advantage of the fact that it is already known that the view ray did not hit any solid voxels in the non-animated nodes of the octree. Therefore, it is sufficient to recast the ray only in the nodes that have a Location ID. This may save time during rendering.
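A sketch of the retry loop implied by the two paragraphs above is shown here; castWithTable() is a placeholder standing in for the actual traversal, which on the retries would only revisit nodes that carry a Location ID.

// Sketch of temporal augmentation: if the view ray finds nothing with the
// current frame's lookup table, recast it with the tables of neighbouring
// frames that are already resident in GPU memory.
#include <cstdint>
#include <vector>

struct ViewRay { float origin[3]; float dir[3]; };

// Placeholder for the real traversal (see the raycasting sketches above).
static bool castWithTable(const ViewRay&, const std::vector<int32_t>&) { return false; }

static bool castWithTemporalFallback(const ViewRay& ray,
                                     const std::vector<int32_t>& currentTable,
                                     const std::vector<const std::vector<int32_t>*>& alternativeTables) {
    if (castWithTable(ray, currentTable))
        return true;                              // normal case: a solid voxel was hit
    for (const auto* table : alternativeTables)   // previous and/or following frames
        if (castWithTable(ray, *table))
            return true;                          // occlusion filled from another point in time
    return false;                                 // the rendered pixel stays empty
}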

Figure 10 is a flowchart illustrating a method according to an embodiment. The method comprises receiving 1010 media content relating to a video data sequence, the media content comprising a reference sparse voxel octree with identifications and frame change sets for frames of the video data sequence; storing 1020 data of the media content to one or more buffers, wherein identifications of the reference sparse voxel octree are used as indices between the reference sparse voxel octree and subtrees of the frame change sets; for a selected one or more nodes of the reference sparse voxel octree 1030: using 1040 an identification of a node for determining a subtree lookup table defining addresses for subtree root nodes within a buffer; when the determined subtree lookup table comprises an address of a subtree root node, performing raycasting 1050 using the nodes found in the frame subtrees and the selected one or more nodes.

An apparatus according to an embodiment comprises means for receiving media content relating to a video data sequence, the media content comprising a reference sparse voxel octree with identifications and frame change sets for frames of the video data sequence; means for storing data of the media content to one or more buffers, wherein identifications of the reference sparse voxel octree are used as indices between the reference sparse voxel octree and subtrees of the frame change sets; for a selected one or more nodes of the reference sparse voxel octree: means for using an identification of a node for determining a subtree lookup table defining addresses for subtree root nodes within a buffer; when the determined subtree lookup table comprises an address of a subtree root node, means for performing raycasting using the nodes found in the frame subtrees and the selected one or more nodes. The means comprise at least one processor, a memory, and computer program code residing in the memory.

Figure 11 shows a computer graphics system suitable to be used in image processing according to an embodiment. The generalized structure of the computer graphics system will be explained in accordance with the functional blocks of the system. Several functionalities can be carried out with a single physical device, e.g. all calculation procedures can be performed in a single processor if desired. A data processing system of an apparatus according to the example of Fig. 11 comprises a main processing unit 1100, a memory 1102, a storage device 1104, an input device 1106, an output device 1108, and a graphics subsystem 1110, which are all connected to each other via a data bus 1112. The main processing unit 1100 is a conventional processing unit arranged to process data within the data processing system. The memory 1102, the storage device 1104, the input device 1106, and the output device 1108 are conventional components as recognized by those skilled in the art. The memory 1102 and storage device 1104 store data within the data processing system 1100. Computer program code resides in the memory 1102 for implementing, for example, a computer vision process. The input device 1106 inputs data into the system while the output device 1108 receives data from the data processing system and forwards the data, for example to a display. The data bus 1112 is a conventional data bus and while shown as a single line it may be any combination of the following: a processor bus, a PCI bus, a graphical bus, an ISA bus. Accordingly, a skilled person readily recognizes that the apparatus may be any conventional data processing device, such as a computer device, a personal computer, a server computer, a mobile phone, a smart phone or an Internet access device, for example an Internet tablet computer.

The various embodiments may provide advantages. For example, the present embodiments make it efficient to render volumetric video sequences. The embodiments also provide a way to access the voxel data of other frames during the rendering without incurring any additional costs.

A sparse voxel octree is a simple structure, which makes resolution adjustments and spatial subdivision trivial: resolution can be changed simply by limiting the depth of the tree, and subdivision can be done by picking specific subtrees. When compared to triangle meshes, sparse voxel data has a simpler, recursive overall structure. It is essentially composed of a sequence of integers, while triangle meshes are more complicated (3D floating-point coordinates, edges, faces). Although GPUs have been primarily designed to work with triangle meshes, modern graphics APIs provide enough programmability (shader programs) that enable rendering sparse voxel data in real time. Another big advantage over triangle meshes is that mipmapping can be achieved trivially. This is important in 3D graphics because objects in the distance should be rendered using a lower level of detail to avoid wasting processing time.
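As an illustration of the "limit the depth of the tree" point, the sketch below counts the voxels that would be drawn when traversal of a hypothetical pointer-based octree is cut at a chosen depth; the node layout and function name are assumptions rather than anything specified by the embodiments.

// Sketch of depth-limited traversal: cutting the octree at maxDepth renders a
// node as a single coarse voxel, which is what gives the trivial mipmapping /
// resolution control mentioned above.
#include <array>
#include <cstddef>
#include <memory>

struct LodNode {
    bool solid = false;
    std::array<std::unique_ptr<LodNode>, 8> child;
};

std::size_t countRenderedVoxels(const LodNode& node, int maxDepth) {
    bool hasChildren = false;
    for (const auto& c : node.child)
        if (c) { hasChildren = true; break; }
    if (!hasChildren)
        return node.solid ? 1 : 0;         // leaf voxel at full resolution
    if (maxDepth == 0)
        return 1;                          // cut here: the whole subtree becomes one coarse voxel
    std::size_t total = 0;
    for (const auto& c : node.child)
        if (c) total += countRenderedVoxels(*c, maxDepth - 1);
    return total;
}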

The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.