

Title:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND VIDEO DECODING
Document Type and Number:
WIPO Patent Application WO/2021/191495
Kind Code:
A1
Abstract:
The embodiments relate to a method comprising receiving an uncompressed first data (002) and corresponding uncompressed second data (001); generating (102) first set of patches with metadata (204) from the uncompressed first data, and generating (101) second set of patches with metadata (203) from the uncompressed second data, wherein the metadata (203) comprises at least visibility cone information providing patch visibility information; transmitting the generated first set of patches (202) and the second set of patches (201) for packing (300) into video frames (402); encoding the packed video frames by video encoder (502) into a video bitstream (602); encoding associated metadata (401) by an atlas encoder (501) into an atlas bitstream (601); and encapsulating (700) the video bitstream and the atlas bitstream according to visual volumetric video-based coding into a bitstream. The embodiments also relate to a method for decoding and rendering, as well as to technical equipment for implementing the methods.

Inventors:
KONDRAD LUKASZ (DE)
ILOLA LAURI (DE)
ROIMELA KIMMO (FI)
MALAMAL VADAKITAL VINOD KUMAR (FI)
Application Number:
PCT/FI2021/050164
Publication Date:
September 30, 2021
Filing Date:
March 05, 2021
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H04N19/597; G06T15/40; H04N13/117; H04N13/178; H04N19/70
Domestic Patent References:
WO2020008106A1, 2020-01-09
WO2019202207A1, 2019-10-24
Other References:
ANONYMOUS: "Test Model 4 for Immersive Video. MPEG document N19002", ISO/IEC JTC 1/SC 29/WG 11, MPEG 129TH MEETING, 1 March 2020 (2020-03-01), Brussels, pages 1 - 45, XP055862369, Retrieved from the Internet [retrieved on 20210603]
MIKA PESONEN, SEBASTIAN SCHWARZ, KIMMO ROIMELA: "[V-PCC] [new proposal] Visibility cones", 126. MPEG MEETING; 20190325 - 20190329; GENEVA; (MOTION PICTURE EXPERT GROUP OR ISO/IEC JTC1/SC29/WG11), 20 March 2019 (2019-03-20), XP030211332
RICARDO L. DE QUEIROZ (IEEE), GUSTAVO SANDRI (IEEE), PHILIP A. CHOU (IEEE): "Signaling view dependent point cloud information by SEI", 123. MPEG MEETING; 20180716 - 20180720; LJUBLJANA; (MOTION PICTURE EXPERT GROUP OR ISO/IEC JTC1/SC29/WG11), 14 July 2018 (2018-07-14), XP030197247
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (FI)
Claims:

1. A method, comprising:

- receiving an uncompressed first data and corresponding uncompressed second data;

- generating first set of patches with metadata from the uncompressed first data and generating second set of patches with metadata from the uncompressed second data, wherein the metadata comprises at least visibility cone information providing patch visibility information;

- transmitting the generated first set of patches and the second set of patches for packing into video frames;

- encoding the packed video frames by video encoder into a video bitstream;

- encoding associated metadata by an atlas encoder into an atlas bitstream; and

- encapsulating the video bitstream and the atlas bitstream according to visual volumetric video-based coding into a bitstream.

2. A method for decoding and rendering, comprising

- receiving an atlas bitstream and corresponding video bitstream having been extracted from a visual volumetric video-based coded bitstream;

- outputting video frames containing first data patches and second data patches and associated metadata, wherein the metadata comprises at least visibility cone information providing patch visibility information;

- passing the visibility cone information for rendering;

- providing a position and an orientation of a viewer for rendering; and

- based on the visibility cone information and the position and the orientation of a viewer, determining a final attribute information for a pixel according to second data patches.

3. An apparatus for encoding comprising

- means for receiving an uncompressed first data and corresponding uncompressed second data;

- means for generating first set of patches with metadata from the uncompressed first data and generating second set of patches with metadata from the uncompressed second data, wherein the metadata comprises at least visibility cone information providing patch visibility information;

- means for transmitting the generated first set of patches and the second set of patches for packing into video frames;

- means for encoding the packed video frames by video encoder into a video bitstream;

- means for encoding associated metadata by an atlas encoder into an atlas bitstream; and

- means for encapsulating the video bitstream and the atlas bitstream according to visual volumetric video-based coding into a bitstream.

4. The apparatus according to claim 3, wherein the first data is geometry data and second data is texture data.

5. The apparatus according to claim 3 or 4, further comprising means for packing the first set of patches and the second set of patches to one video frame.

6. The apparatus according to any of the claims 3 or 4, further comprising means for packing the first set of patches and the second set of patches to separate video frames.

7. The apparatus according to any of the claims 3 to 6, wherein the metadata further comprises one or more of the following information:

- tile grouping information of patches;

- supplemental patch information on how to interpret/process unique patches;

- visibility cone information for tile groups;

- visibility cone information for atlases; and

- mapping between common, unique and geometry patches.

8. An apparatus for decoding and rendering, comprising

- means for receiving an atlas bitstream and corresponding video bitstream having been extracted from a visual volumetric video-based coded bitstream;

- means for outputting video frames containing first data patches and second data patches and associated metadata, wherein the metadata comprises at least visibility cone information providing patch visibility information;

- means for passing the visibility cone information for rendering;

- means for providing a position and an orientation of a viewer for rendering; and

- based on the visibility cone information and the position and the orientation of a viewer, means for determining a final attribute information for a pixel according to second data patches.

9. The apparatus according to claim 8, wherein the first data patches comprises geometry patches and the second data patches comprises texture patches.

10. The apparatus according to claim 8 or 9, further comprising means for decoding the first data patches and the second data patches from one video frame.

11. The apparatus according to any of the claims 8 or 9, further comprising means for decoding the first set of patches and the second set of patches from separate video frames.

12. The apparatus according to any of the claims 8 to 11, wherein the metadata further comprises one or more of the following information:

- tile grouping information of patches;

- supplemental patch information on how to interpret/process unique patches;

- visibility cone information for tile groups;

- visibility cone information for atlases; and

- mapping between common, unique and geometry patches.

13. The apparatus according to any of the claims 8 to 12, further comprising

- means for determining a viewing direction from three-dimensional location of point to a rendering camera;

- means for selecting texture patches having viewing direction inside the visibility cone;

- means for determining

• per-texture weights based on amount of inclusion inside the view cone, or

• per-texture weights based on relative angle from the center of the visibility cone, or

• per-texture weights similarly to unstructured lumigraph rendering; and

- means for determining a final colour attribute of the pixel as a weighted blend between the texture patches.

14. An apparatus for encoding comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive an uncompressed first data and corresponding uncompressed second data;

- generate first set of patches with metadata from the uncompressed first data and generating second set of patches with metadata from the uncompressed second data, wherein the metadata comprises at least visibility cone information providing patch visibility information;

- transmit the generated first set of patches and the second set of patches for packing into video frames;

- encode the packed video frames by video encoder into a video bitstream;

- encode associated metadata by an atlas encoder into an atlas bitstream; and

- encapsulate the video bitstream and the atlas bitstream according to visual volumetric video-based coding into a bitstream.

15. An apparatus for decoding and rendering comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive an atlas bitstream and corresponding video bitstream having been extracted from a visual volumetric video-based coded bitstream;

- output video frames containing first data patches and second data patches and associated metadata, wherein the metadata comprises at least visibility cone information providing patch visibility information;

- pass the visibility cone information for rendering;

- provide a position and an orientation of a viewer for rendering; and

- based on the visibility cone information and the position and the orientation of a viewer, determine a final attribute information for a pixel according to second data patches.

Description:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND VIDEO DECODING

Technical Field

The present solution generally relates to volumetric video.

Background

Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with a relatively narrow field of view and displayed as a rectangular scene on flat displays. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed).

More recently, new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.

For volumetric video, a scene may be captured using one or more 3D (three-dimensional) cameras. The cameras are in different positions and orientations within a scene. One issue to consider is that compared to 2D (two-dimensional) video content, volumetric 3D video content has much more data, so viewing it requires lots of bandwidth (with or without transferring it from a storage location to a viewing device): disk I/O, network traffic, memory bandwidth, GPU (Graphics Processing Unit) upload. Capturing volumetric content also produces a lot of data, particularly when there are multiple capture devices used in parallel.

Summary

The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.

Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.

According to a first aspect, there is provided a method, comprising receiving an uncompressed first data and corresponding uncompressed second data; generating first set of patches with metadata from the uncompressed first data and generating second set of patches with metadata from the uncompressed second data, wherein the metadata comprises at least visibility cone information providing patch visibility information; transmitting the generated first set of patches and the second set of patches for packing into video frames; encoding the packed video frames by video encoder into a video bitstream; encoding associated metadata by an atlas encoder into an atlas bitstream; and encapsulating the video bitstream and the atlas bitstream according to visual volumetric video-based coding into a bitstream.

According to a second aspect, there is provided a method for decoding and rendering, comprising receiving an atlas bitstream and corresponding video bitstream having been extracted from a visual volumetric video-based coded bitstream; outputting video frames containing first data patches and second data patches and associated metadata, wherein the metadata comprises at least visibility cone information providing patch visibility information; passing the visibility cone information for rendering; providing a position and an orientation of a viewer for rendering; and based on the visibility cone information and the position and the orientation of a viewer, determining a final attribute information for a pixel according to second data patches.

According to a third aspect, there is provided an apparatus for encoding comprising means for receiving an uncompressed first data and corresponding uncompressed second data; means for generating first set of patches with metadata from the uncompressed first data and generating second set of patches with metadata from the uncompressed second data, wherein the metadata comprises at least visibility cone information providing patch visibility information; means for transmitting the generated first set of patches and the second set of patches for packing into video frames; means for encoding the packed video frames by video encoder into a video bitstream; means for encoding associated metadata by an atlas encoder into an atlas bitstream; and means for encapsulating the video bitstream and the atlas bitstream according to visual volumetric video-based coding into a bitstream.

According to an embodiment, the first data is geometry data and the second data is texture data.

According to an embodiment, the apparatus further comprises means for packing the first set of patches and the second set of patches to one video frame.

According to an embodiment, the apparatus further comprises means for packing the first set of patches and the second set of patches to separate video frames.

According to an embodiment, the metadata further comprises one or more of the following information:

- tile grouping information of patches;

- supplemental patch information on how to interpret/process unique patches;

- visibility cone information for tile groups;

- visibility cone information for atlases;

- mapping between common, unique and geometry patches.

According to a fourth aspect, there is provided an apparatus for decoding and rendering, comprising means for receiving an atlas bitstream and corresponding video bitstream having been extracted from a visual volumetric video-based coded bitstream; means for outputting video frames containing first data patches and second data patches and associated metadata, wherein the metadata comprises at least visibility cone information; means for passing the visibility cone information for rendering; means for providing a position and an orientation of a viewer for rendering; and based on the visibility cone information and the position and the orientation of a viewer, means for determining a final attribute information for a pixel according to second data patches.

According to an embodiment, the first data patches comprises geometry patches and the second data patches comprises texture patches.

According to an embodiment, the apparatus further comprises means for decoding the first data patches and the second data patches from one video frame.

According to an embodiment, the apparatus further comprises means for decoding the first set of patches and the second set of patches from separate video frames.

According to an embodiment, the metadata further comprises one or more of the following information:

- tile grouping information of patches;

- supplemental patch information on how to interpret/process unique patches;

- visibility cone information for tile groups;

- visibility cone information for atlases;

- mapping between common, unique and geometry patches.

According to an embodiment, the apparatus further comprises

- means for determining a viewing direction from three-dimensional location of point to a rendering camera;

- means for selecting texture patches having viewing direction inside the visibility cone;

- means for determining

• per-texture weights based on amount of inclusion inside the view cone, or

• per-texture weights based on relative angle from the center of the visibility cone, or

• per-texture weights similarly to unstructured lumigraph rendering;

- means for determining a final colour attribute of the pixel as a weighted blend between the texture patches.

According to a fifth aspect, there is provided an apparatus for encoding comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive an uncompressed first data and corresponding uncompressed second data; generate first set of patches with metadata from the uncompressed first data and generating second set of patches with metadata from the uncompressed second data, wherein the metadata comprises at least visibility cone information providing patch visibility information; transmit the generated first set of patches and the second set of patches for packing into video frames; encode the packed video frames by video encoder into a video bitstream; encode associated metadata by an atlas encoder into an atlas bitstream; and encapsulate the video bitstream and the atlas bitstream according to visual volumetric video-based coding into a bitstream.

According to a sixth aspect, there is provided an apparatus for decoding and rendering comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive an atlas bitstream and corresponding video bitstream having been extracted from a visual volumetric video-based coded bitstream; output video frames containing first data patches and second data patches and associated metadata, wherein the metadata comprises at least visibility cone information providing patch visibility information; pass the visibility cone information for rendering; provide a position and an orientation of a viewer for rendering; based on the visibility cone information and the position and the orientation of a viewer, determine a final attribute information for a pixel according to second data patches.

According to a seventh aspect, there is provided a computer program product for encoding comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive an uncompressed first data and corresponding uncompressed second data; generate first set of patches with metadata from the uncompressed first data and generating second set of patches with metadata from the uncompressed second data, wherein the metadata comprises at least visibility cone information providing patch visibility information; transmit the generated first set of patches and the second set of patches for packing into video frames; encode the packed video frames by video encoder into a video bitstream; encode associated metadata by an atlas encoder into an atlas bitstream; and encapsulate the video bitstream and the atlas bitstream according to visual volumetric video-based coding into a bitstream.

According to an eighth aspect, there is provided a computer program product for decoding and rendering comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive an atlas bitstream and corresponding video bitstream having been extracted from a visual volumetric video-based coded bitstream; output video frames containing first data patches and second data patches and associated metadata, wherein the metadata comprises at least visibility cone information providing patch visibility information; pass the visibility cone information for rendering; provide a position and an orientation of a viewer for rendering; and based on the visibility cone information and the position and the orientation of a viewer, determine a final attribute information for a pixel according to second data patches.

According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.

Description of the Drawings

In the following, various embodiments will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows an example of an encoding process;

Fig. 2 shows an example of a decoding process;

Fig. 3 shows an example of a compression process of a volumetric video;

Fig. 4 shows an example of a de-compression process of a volumetric video;

Fig. 5 shows an example of a 3VC bitstream structure;

Fig. 6 shows an example of a visibility cone;

Fig. 7 shows an example of packing geometry and texture patches into a video frame;

Fig. 8 shows an example of a specular highlight modelled as a smaller cone inside a bigger one;

Fig. 9 illustrates an encoder according to an embodiment;

Fig. 10 illustrates a decoder according to an embodiment;

Fig. 11 illustrates an example of sectors of a visibility cone;

Fig. 12 is a flowchart illustrating a method according to an embodiment;

Fig. 13 is a flowchart illustrating a method according to another embodiment; and

Fig. 14 illustrates an apparatus according to an embodiment.

Description of Example Embodiments

In the following, several embodiments will be described in the context of volumetric video encoding and decoding. In particular, the several embodiments relate to packing and signalling view-dependent attribute information for immersive video.

A video codec comprises an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that can un-compress the compressed video representation back into a viewable form. An encoder may discard some information in the original video sequence in order to represent the video in a more compact form (i.e. at lower bitrate). Figure 1 illustrates an encoding process of an image as an example. Figure 1 shows an image to be encoded (In); a predicted representation of an image block (P'n); a prediction error signal (Dn); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); a transform (T) and inverse transform (T⁻¹); a quantization (Q) and inverse quantization (Q⁻¹); entropy encoding (E); a reference frame memory (RFM); inter prediction (Pinter); intra prediction (Pintra); mode selection (MS); and filtering (F). An example of a decoding process is illustrated in Figure 2. Figure 2 illustrates a predicted representation of an image block (P'n); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); an inverse transform (T⁻¹); an inverse quantization (Q⁻¹); an entropy decoding (E⁻¹); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).

Volumetric video refers to a visual content that may have been captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.

Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane. Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a two-dimensional (2D) plane. However, only a relatively small part of the volume changes over time. Therefore, it may be possible to reduce the total amount of data by only coding information about an initial state and changes which may occur between frames. Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR (Light Detection and Ranging), for example.

Volumetric video data represents a three-dimensional scene or object and can be used as input for AR (Augmented Reality), VR (Virtual Reality) and MR (Mixed Reality) applications. Such data describes geometry (shape, size, position in 3D space) and respective attributes (e.g. colour, opacity, reflectance, ...), plus any possible temporal changes of the geometry and attributes at given time instances (like frames in 2D video). Volumetric video is either generated from 3D models, i.e. CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g. multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Representation formats for such volumetric data comprise, for example, triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” in 2D video, or other means, e.g. position of an object as a function of time.

Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR, or MR applications, especially for providing 6DOF viewing capabilities.

Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is coding this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.

In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression becomes essential. Standard volumetric video representation formats, such as point clouds, meshes and voxels, suffer from poor temporal compression performance. Identifying correspondences for motion-compensation in 3D-space is an ill-defined problem, as both geometry and respective attributes may change. For example, temporally successive “frames” do not necessarily have the same number of meshes, points or voxels. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview and depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.

Instead of the above-mentioned approach, a 3D scene, represented as meshes, points, and/or voxels, can be projected onto one, or more, geometries. These geometries are “unfolded” onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. Relevant projection geometry information is transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).

Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression. Thus, coding efficiency is increased greatly. Using geometry-projections instead of prior-art 2D-video based approaches, i.e. multiview and depth, provides better coverage of the scene (or object). Thus, 6DOF capabilities are improved. Using several geometries for individual objects improves the coverage of the scene further. Furthermore, standard video encoding hardware can be utilized for real-time compression/decompression of the projected planes. The projection and reverse projection steps are of low complexity.

Figure 3 illustrates an overview of an example of a compression process of a volumetric video. Such process may be applied for example in MPEG Point Cloud Coding (PCC). The process starts with an input point cloud frame 301 that is provided for patch generation 302, geometry image generation 304 and texture image generation 305.

The patch generation 302 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. For patch generation, the normal at every point can be estimated. An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:

- (1.0, 0.0, 0.0),

- (0.0, 1.0, 0.0),

- (0.0, 0.0, 1.0),

- (-1.0, 0.0, 0.0),

- (0.0, -1.0, 0.0), and

- (0.0, 0.0, -1.0).

More precisely, each point may be associated with the plane that has the closest normal (i.e. maximizes the dot product of the point normal and the plane normal).

The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The final step may comprise extracting patches by applying a connected component extraction procedure.
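A minimal sketch of this initial clustering step is shown below, assuming unit-length point normals; the helper names and the use of NumPy are illustrative assumptions rather than the reference implementation.

```python
# Illustrative sketch: assign each point normal to the oriented plane whose
# normal maximizes the dot product with it (the "closest normal" rule above).
import numpy as np

PLANE_NORMALS = np.array([
    ( 1.0,  0.0,  0.0),
    ( 0.0,  1.0,  0.0),
    ( 0.0,  0.0,  1.0),
    (-1.0,  0.0,  0.0),
    ( 0.0, -1.0,  0.0),
    ( 0.0,  0.0, -1.0),
])

def initial_clustering(point_normals: np.ndarray) -> np.ndarray:
    """Return, for each point normal (N x 3), the index of the oriented plane
    whose normal maximizes the dot product with it."""
    scores = point_normals @ PLANE_NORMALS.T   # (N, 6) dot products
    return np.argmax(scores, axis=1)           # cluster index per point

# Example: a normal pointing mostly along -y is assigned to plane index 4.
print(initial_clustering(np.array([[0.1, -0.9, 0.2]])))  # -> [4]
```

The iterative refinement and connected-component extraction mentioned above would then operate on these initial cluster indices.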

Patch info determined at patch generation 302 for the input point cloud frame 301 is delivered to packing process 303, to geometry image generation 304 and to texture image generation 305. The packing process 303 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g. 16x16) block of the grid is associated with a unique patch. It should be noticed that T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder.

The simple packing strategy iteratively tries to insert patches into a WxH grid. W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid may be temporarily doubled, and the search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
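A rough sketch of this raster-scan placement search is given below; the block-level occupancy grid and function names are assumptions made for illustration, not the normative packing procedure.

```python
# Sketch of the exhaustive raster-order search for an overlap-free patch location.
import numpy as np

def place_patch(occupied: np.ndarray, patch_h: int, patch_w: int):
    """Return the first (row, col) in raster order where a patch_h x patch_w
    patch fits without overlap, marking the cells as used; None if no fit."""
    H, W = occupied.shape
    for v0 in range(H - patch_h + 1):
        for u0 in range(W - patch_w + 1):
            window = occupied[v0:v0 + patch_h, u0:u0 + patch_w]
            if not window.any():                       # overlap-free insertion
                occupied[v0:v0 + patch_h, u0:u0 + patch_w] = True
                return v0, u0
    return None                                        # caller may double H and retry

grid = np.zeros((4, 4), dtype=bool)                    # 4x4 blocks of TxT pixels
print(place_patch(grid, 2, 3))                         # -> (0, 0)
print(place_patch(grid, 2, 3))                         # -> (2, 0)
```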

The geometry image generation 304 and the texture image generation 305 are configured to generate geometry images and texture images respectively. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called a near layer, stores the point of H(u, v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:

• Geometry: WxH YUV420-8bit,

• Texture: WxH YUV420-8bit,

It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
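The two-layer depth projection described above can be pictured roughly as follows; the data layout (a list of pixel/depth pairs per patch) is an assumption made purely for illustration.

```python
# Hedged sketch of near/far layer selection: per pixel, the near layer keeps the
# lowest depth D0 and the far layer keeps the highest depth within [D0, D0 + delta].
from collections import defaultdict

def build_layers(projected_points, delta):
    """projected_points: iterable of ((u, v), depth) pairs for one patch.
    Returns (near_layer, far_layer) as dicts keyed by pixel (u, v)."""
    per_pixel = defaultdict(list)
    for (u, v), depth in projected_points:
        per_pixel[(u, v)].append(depth)               # H(u, v): points hitting this pixel

    near, far = {}, {}
    for pixel, depths in per_pixel.items():
        d0 = min(depths)                              # near layer: lowest depth
        near[pixel] = d0
        far[pixel] = max(d for d in depths if d <= d0 + delta)  # far layer within thickness
    return near, far

points = [((3, 5), 10), ((3, 5), 12), ((3, 5), 30)]
print(build_layers(points, delta=4))                  # near 10, far 12 (30 is outside D0+delta)
```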

The geometry images and the texture images may be provided to image padding 307. The image padding 307 may also receive as an input an occupancy map (OM) 306 to be used with the geometry images and texture images. The occupancy map 306 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 303.

The padding process 307, to which the present embodiments relate, aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g. 16x16) pixels is compressed independently. If the block is empty (i.e. unoccupied, i.e. all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e. occupied, i.e. no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e. an edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
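The edge-block rule can be illustrated with the sketch below; it assumes the block contains at least one occupied pixel and is not meant as the normative padding procedure.

```python
# Illustrative sketch of edge-block padding: empty pixels are filled iteratively
# with the average of their already-filled 4-neighbours.
import numpy as np

def pad_edge_block(block: np.ndarray, occupancy: np.ndarray) -> np.ndarray:
    """block: TxT values; occupancy: TxT booleans (True = occupied pixel).
    Assumes at least one occupied pixel, i.e. an edge block."""
    block = block.astype(float)
    filled = occupancy.copy()
    T = block.shape[0]
    while not filled.all():
        for y in range(T):
            for x in range(T):
                if filled[y, x]:
                    continue
                neigh = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
                vals = [block[j, i] for j, i in neigh
                        if 0 <= j < T and 0 <= i < T and filled[j, i]]
                if vals:                               # average of non-empty neighbours
                    block[y, x] = sum(vals) / len(vals)
                    filled[y, x] = True
    return block

blk = np.array([[8.0, 0.0], [0.0, 0.0]])
occ = np.array([[True, False], [False, False]])
print(pad_edge_block(blk, occ))                        # empty pixels converge towards 8
```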

The padded geometry images and padded texture images may be provided for video compression 308. The generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters. The video compression 308 also generates reconstructed geometry images to be provided for smoothing 309, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 302. The smoothed geometry may be provided to texture image generation 305 to adapt the texture images.

The patch may be associated with auxiliary information being encoded/decoded for each patch as metadata. The auxiliary information may comprise index of the projection plane, 2D bounding box, 3D location of the patch.

For example, the following metadata may be encoded/decoded for every patch:

- index of the projection plane
o Index 0 for the planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0)
o Index 1 for the planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0)
o Index 2 for the planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0)

- 2D bounding box (u0, v0, u1, v1)

- 3D location (x0, y0, z0) of the patch represented in terms of depth δ0, tangential shift s0 and bitangential shift r0. According to the chosen projection planes, (δ0, s0, r0) may be calculated as follows:
o Index 0
o Index 1
o Index 2

Also, mapping information providing for each TxT block its associated patch index may be encoded as follows:

- For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.

- The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.

- Let I be the index of the patch to which the current TxT block belongs, and let J be the position of I in L. Instead of explicitly coding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.
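The candidate-list construction can be sketched as follows; the bounding boxes are given in block units and the helper names are illustrative only.

```python
# Sketch of the block-to-patch mapping: instead of coding the patch index I
# directly, its position J in the per-block candidate list L is coded.
def candidate_list(block, bounding_boxes):
    """L: index 0 (empty space) plus indexes of patches whose 2D bounding box
    contains the block; order follows the order the bounding boxes were coded."""
    bx, by = block
    L = [0]                                            # special index 0 = empty space
    for idx, (u0, v0, u1, v1) in enumerate(bounding_boxes, start=1):
        if u0 <= bx < u1 and v0 <= by < v1:
            L.append(idx)
    return L

bboxes = [(0, 0, 4, 4), (2, 2, 6, 6)]                  # patch 1 and patch 2 (in block units)
L = candidate_list((3, 3), bboxes)                     # both boxes contain block (3, 3)
I = 2                                                  # the patch this block belongs to
J = L.index(I)                                         # position actually coded (here 2)
print(L, J)
```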

The occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. One cell of the 2D grid produces a pixel during the image generation process.

The occupancy map compression 310 leverages the auxiliary information described in the previous section in order to detect the empty TxT blocks (i.e. blocks with patch index 0). The remaining blocks may be encoded as follows: The occupancy map can be encoded with a precision of B0xB0 blocks. B0 is a configurable parameter. In order to achieve lossless encoding, B0 may be set to 1. In practice, B0=2 or B0=4 results in visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map.

The compression process may comprise one or more of the following example operations:

• Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block. A value of 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1 it is said to be full, otherwise it is an empty sub-block.

• If all the sub-blocks of a TxT block are full (i.e., have value 1), the block is said to be full. Otherwise, the block is said to be non-full.

• Binary information may be encoded for each TxT block to indicate whether it is full or not.

• If the block is non-full, extra information indicating the location of the full/empty sub-blocks may be encoded as follows:
o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.
o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.
o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy.

The binary value of the initial sub-block is encoded.

Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder. The number of detected runs is encoded.

The length of each run, except the last one, is also encoded.
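A rough sketch of this run-length coding of sub-block values is shown below, assuming the traversal order has already been applied; the actual entropy coding of the values is omitted.

```python
# Sketch of the sub-block run-length coding: record the initial value, the
# number of runs and the run lengths (except the last one).
def encode_subblocks(bits):
    """bits: list of 0/1 values of B0xB0 sub-blocks in the chosen traversal order."""
    runs = []
    run_len = 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            run_len += 1
        else:
            runs.append(run_len)
            run_len = 1
    runs.append(run_len)
    return {
        "initial_value": bits[0],       # binary value of the initial sub-block
        "num_runs": len(runs),          # number of detected runs
        "run_lengths": runs[:-1],       # length of each run except the last
    }

print(encode_subblocks([1, 1, 1, 0, 0, 1, 1, 1]))
# {'initial_value': 1, 'num_runs': 3, 'run_lengths': [3, 2]}
```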

Figure 4 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 401 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 402. In addition, the de-multiplexer 401 transmits the compressed occupancy map to occupancy map decompression 403. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 404. Decompressed geometry video from the video decompression 402 is delivered to geometry reconstruction 405, as are the decompressed occupancy map and decompressed auxiliary patch information. The point cloud geometry reconstruction 405 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.

The reconstructed geometry image may be provided for smoothing 406, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 407, which also receives a decompressed texture video from video decompression 402. The texture reconstruction 407 outputs a reconstructed point cloud. The texture values for the texture reconstruction are directly read from the texture images.

The point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v), and let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:

δ(u, v) = δ0 + g(u, v)
s(u, v) = s0 - u0 + u
r(u, v) = r0 - v0 + v

where g(u, v) is the luma component of the geometry image.

For the texture reconstruction, the texture values can be directly read from the texture images. The result of the decoding process is a 3D point cloud reconstruction.
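The reconstruction equations above can be illustrated with a small sketch; the parameter names are chosen for readability, and the mapping of (depth, tangential, bitangential) back to the (x, y, z) axes, which depends on the projection plane index, is omitted.

```python
# Sketch of per-pixel reconstruction for one occupied pixel (u, v) of a patch
# with 3D location (d0, s0, r0) and 2D bounding box origin (u0, v0).
def reconstruct_point(u, v, g_uv, d0, s0, r0, u0, v0):
    depth = d0 + g_uv                # delta(u, v) = delta0 + g(u, v), g = geometry luma
    tangential = s0 - u0 + u         # s(u, v) = s0 - u0 + u
    bitangential = r0 - v0 + v       # r(u, v) = r0 - v0 + v
    return depth, tangential, bitangential

# e.g. pixel (12, 7) with luma 40 in a patch at depth 100, bounding box origin (8, 4):
print(reconstruct_point(12, 7, 40, d0=100, s0=20, r0=30, u0=8, v0=4))  # (140, 24, 33)
```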

Visual volumetric video-based Coding (3VC) relates to a core part shared between ISO/IEC 23090-5 (formerly V-PCC (Video-based Point Cloud Compression)) and ISO/IEC 23090-12 (formerly MIV (MPEG Immersive Video)). 3VC will not be issued as a separate document, but as part of ISO/IEC 23090-5 (expected to include clauses 1-8 of the current V-PCC text). ISO/IEC 23090-12 will refer to this common part. ISO/IEC 23090-5 will be renamed to 3VC PCC, and ISO/IEC 23090-12 renamed to 3VC MIV. At the highest level, 3VC metadata is carried in vpcc_units, which consist of header and payload pairs. An example of a 3VC bitstream structure is shown in Figure 5. Below is the syntax for vpcc_units and vpcc_unit_header structures.

General V-PCC unit syntax is presented below:

The following table represents a general V-PCC unit header syntax:

A general VPCC unit payload syntax is presented below:

3VC metadata may be contained in atlas_sub_bitstream(), which may contain a sequence of NAL units including header and payload data. nal_unit_header() is used to define how to process the payload data. NumBytesInNalUnit specifies the size of the NAL unit in bytes. This value is required for decoding of the NAL unit. Some form of demarcation of NAL unit boundaries is necessary to enable inference of NumBytesInNalUnit. One such demarcation method is specified in Annex C (23090-5) for the sample stream format.

The 3VC atlas coding layer (ACL) is specified to efficiently represent the content of the patch data. The NAL is specified to format that data and provide header information in a manner appropriate for conveyance on a variety of communication channels or storage media. Substantially all the data is contained in NAL units, each of which contains an integer number of bytes. A NAL unit specifies a generic format for use in both packet-oriented and bitstream systems. The format of NAL units for both packet-oriented transport and sample streams is identical, except that in the sample stream format specified in Annex C (23090-5) each NAL unit can be preceded by an additional element that specifies the size of the NAL unit.

The following table represents a general NAL unit syntax:

The following table represents a general NAL unit header syntax:

In the nal_unit_header() syntax, nal_unit_type specifies the type of the RBSP data structure contained in the NAL unit as specified in Table 7-3 of 23090-5. nal_layer_id specifies the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies. The value of nal_layer_id shall be in the range of 0 to 62, inclusive. The value of 63 may be specified in the future by ISO/IEC. Decoders conforming to a profile specified in Annex A of the current version of 23090-5 shall ignore (i.e., remove from the bitstream and discard) all NAL units with values of nal_layer_id not equal to 0. rbsp_byte[ i ] is the i-th byte of an RBSP. An RBSP is specified as an ordered sequence of bytes as follows:

The RBSP contains a string of data bits (SODB) as follows:

• If the SODB is empty (i.e., zero bits in length), the RBSP is also empty.

• Otherwise, the RBSP contains the SODB as follows:
o The first byte of the RBSP contains the first (most significant, left-most) eight bits of the SODB; the next byte of the RBSP contains the next eight bits of the SODB, etc., until fewer than eight bits of the SODB remain.
o The rbsp_trailing_bits( ) syntax structure is present after the SODB as follows:

The first (most significant, left-most) bits of the final RBSP byte contain the remaining bits of the SODB (if any).

The next bit consists of a single bit equal to 1 (i.e., rbsp_stop_one_bit).

When the rbsp_stop_one_bit is not the last bit of a byte-aligned byte, one or more bits equal to 0 (i.e. instances of rbsp_alignment_zero_bit) are present to result in byte alignment.
o One or more cabac_zero_word 16-bit syntax elements equal to 0x0000 may be present in some RBSPs after the rbsp_trailing_bits( ) at the end of the RBSP.

Syntax structures having these RBSP properties are denoted in the syntax tables using an "_rbsp" suffix. These structures are carried within NAL units as the content of the rbsp_byte[ i ] data bytes. The following are examples of such content:

• atlas_sequence_parameter_set_rbsp( ), which is used to carry parameters related to a sequence of 3VC frames.

• atlas_frame_parameter_set_rbsp( ), which is used to carry parameters related to a specific frame. atlas_frame_parameter_set_rbsp( ) can be applied for a sequence of frames as well.

• sei_rbsp( ) is used to carry SEI messages in NAL units.

• atlas_tile_group_layer_rbsp( ) is used to carry patch layout information for tile groups.

When the boundaries of the RBSP are known, the decoder can extract the SODB from the RBSP by concatenating the bits of the bytes of the RBSP and discarding the rbsp_stop_one_bit, which is the last (least significant, right-most) bit equal to 1, and discarding any following (less significant, farther to the right) bits that follow it, which are equal to 0. The data necessary for the decoding process is contained in the SODB part of the RBSP; a sketch of this extraction is given after the syntax examples below. The following tables describe examples of relevant RBSP syntaxes.

An example of an atlas tile group layer RBSP syntax is presented below:

An example of an atlas tile group header syntax is presented below:

An example of a general atlas tile group data unit syntax is presented below:

An example of a patch information data syntax is presented below:

An example of a patch data unit syntax is presented below:
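As an informal illustration of the SODB extraction described earlier, the sketch below concatenates the RBSP bits, drops the trailing alignment zeros and the final rbsp_stop_one_bit; handling of cabac_zero_word elements is ignored for simplicity.

```python
# Hedged sketch of SODB extraction from an RBSP byte sequence.
def extract_sodb(rbsp_bytes: bytes) -> str:
    bits = "".join(f"{b:08b}" for b in rbsp_bytes)     # concatenate RBSP bits
    bits = bits.rstrip("0")                            # discard trailing alignment zeros
    if bits.endswith("1"):
        bits = bits[:-1]                               # discard rbsp_stop_one_bit
    return bits                                        # remaining bits form the SODB

# SODB '10110' + stop bit '1' + two alignment zeros -> one byte 0b10110100
print(extract_sodb(bytes([0b10110100])))               # -> '10110'
```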

Annex F of the 3VC V-PCC specification (23090-5) describes different SEI messages that have been defined for 3VC MIV purposes. SEI messages assist in processes related to decoding, reconstruction, display, or other purposes. Annex F (23090-5) defines two types of SEI messages: essential and non-essential. 3VC SEI messages are signalled in sei_rbsp(), which is presented below.

Non-essential SEI messages are those that are not required by the decoding process. Conforming decoders are not required to process this information for output order conformance.

The specification for the presence of non-essential SEI messages is also satisfied when those messages (or some subset of them) are conveyed to decoders (or to the HRD) by other means not specified in the 3VC V-PCC specification (23090-5). When present in the bitstream, non-essential SEI messages shall obey the syntax and semantics as specified in Annex F (23090-5). When the content of a non-essential SEI message is conveyed for the application by some means other than presence within the bitstream, the representation of the content of the SEI message is not required to use the same syntax specified in Annex F (23090-5). For the purpose of counting bits, only the appropriate bits that are actually present in the bitstream are counted.

Essential SEI messages are an integral part of the V-PCC bitstream and should not be removed from the bitstream. The essential SEI messages are categorized into two types:

• Type-A essential SEI messages: Type-A SEI messages contain information required to check bitstream conformance and for output timing decoder conformance. A V-PCC decoder conforming to point A should not discard any relevant Type-A essential SEI messages and shall consider them for bitstream conformance and for output timing decoder conformance.

• Type-B essential SEI messages: V-PCC decoders that wish to conform to a particular reconstruction profile should not discard any relevant Type- B essential SEI messages and shall consider them for 3D point cloud reconstruction and conformance purposes.

So-called "visibility cones" are used to provide patch visibility information, i.e. data indicative of where in the volumetric space the forward surface of the patch can be seen. Patch visibility information for each patch may be calculated from the patch normal of each respective patch. This can be implemented by providing a plurality of patches representing part of a volumetric scene; by providing, for each patch, patch visibility information indicative of a set of directions from which a forward surface of the patch is visible; by providing one or more viewing positions associated with a client device; and by processing one or more of the patches dependent on whether the patch visibility information indicates that the forward surface of the one or more patches is visible from the one or more viewing positions. The visibility cones may be based on geometry, i.e. calculated based on the normal of a surface.

In ISO/IEC 23090-5 visibility cones can be assigned to a patch using patch_information() and scene_object_information() SEI messages as described in document m52705.
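A simple way to picture such a visibility test is sketched below; the cone representation (axis direction plus opening half-angle) and the function names are assumptions for illustration and not the 23090-5 syntax.

```python
# Illustrative sketch: test whether a patch's forward surface is visible from a
# viewing position, given a visibility cone around the patch.
import numpy as np

def visible_from(cone_axis, cone_half_angle, patch_center, view_position):
    """True if the direction from the patch towards the viewer lies inside the cone."""
    to_viewer = np.asarray(view_position, float) - np.asarray(patch_center, float)
    to_viewer /= np.linalg.norm(to_viewer)
    axis = np.asarray(cone_axis, float)
    axis /= np.linalg.norm(axis)
    return np.dot(axis, to_viewer) >= np.cos(cone_half_angle)

# A cone opening along +z with a 45-degree half-angle:
print(visible_from((0, 0, 1), np.radians(45), (0, 0, 0), (0, 0, 5)))   # True
print(visible_from((0, 0, 1), np.radians(45), (0, 0, 0), (5, 0, -1)))  # False
```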

Atlas layouts may be separated for video encoded components in order to reduce video bitrates and pixel rates, thus enabling higher-quality experiences and wider support for platforms with limited decoding capabilities. The reduction of pixel rate and bitrate is mainly possible because of the different characteristics of the video encoded components. Certain packing strategies may be applied for geometry or occupancy information, whereas different strategies make more sense for texture information. Similarly, other components like normal or PBRT maps may benefit from a specific packing design, which further increases the opportunities gained by enabling separate atlas layouts. Some examples of possible applications are listed in the following:

• Down-sampling flat geometries

• In certain conditions scaling down patches representing flat geometries may become viable. This will help reducing the overall pixel rate required by the geometry channel at minimal impact on output quality.

• Partial meshing of geometry

• Instead of signaling depth maps for every patch, it may be beneficial to signal geometry as a mesh for individual patches; thus, the ability to remove patches from the geometry frame should be considered.

• Uniform color tiles

• In some cases (e.g. Hijack) certain patches may contain uniform values for color data, thus signaling uniform values in the metadata instead of the color tile may be considered. Also scaling down uniform color tiles or color tiles containing only smooth gradients may be equally valid.

• Patch merging

• In some cases it may be possible to signal smaller patches inside larger patches, provided that the larger patch contains the same or visually similar data as the smaller patch.

• Future-proofing MIV and V-PCC

• There may be other non-foreseeable opportunities in atlas packing that require separation of patch layouts. The current design does not allow taking advantage of such capabilities, and some flexibility in packing should be introduced.

For example, packing color tiles in a way that aligns same-color edges of tiles next to each other may help improve the compression performance of the color component. Similar methods for the depth component may exist but cannot be accommodated because of fixed patch layouts between different components. Providing tools for separating the patch layout of different components should thus be considered to provide further flexibility for encoders to optimize packing based on content.

To signal information when separation of atlas layouts for video encoded components is used in ISO/IEC 23090-5, the following examples may be used:

• New 3VC specific SEI messages can be created for V-PCC bitstream, e.g.

"separate_atlas_component()"
o SEI message inserted in the NAL stream signalling which component the following or preceding NAL units are applied to.
o SEI message may be defined as prefix or suffix.
o If said SEI message does not exist in the sample atlas_sub_bitstream, NAL units are applied to all video encoded components.
o This design will provide flexibility to signal per-component NAL units, which enables signalling different layouts and parameter sets for each video encoded component.
o The new SEI message should contain at least component_type as defined in 23090-5 Table 7.1 V-PCC Unit Types, as well as attribute_type.

• Definition of component type in NAL unit header()
o Adding an indication of which video encoded component each NAL unit should be applied to allows flexibility for signalling different atlas layouts.
o A default value for the component type could be assigned to indicate that NAL units are applied to all video encoded components.

• Atlas layouts may be signalled in separate tracks.
o Separate tracks of timed metadata per video encoded component describing the patch layout (not likely, but possible).

• Mapping of an atlas layer to a video component or to a group of video components may be signalled.
o Each layer of an atlas may contain a different patch layout. Each video component or group of video components is assigned to a different layer of an atlas (distinguished by nuh_layer_id).
o The linkage of atlas nuh_layer_id and video component can be done on V-PCC parameter set level (V-PCC unit type of VPCC_VPS), on atlas sequence parameter set level or atlas frame parameter set level. All the parameter sets have an extension mechanism that can be utilized to provide such information.

Volumetric video can be packed in one video component with the related signalling information. The signalling method can also contain information on how to separate the signalling of patch information. Below are some examples of the signalling methods.

1. New vuh_unit_type defined and new packed_video() structure in vpcc_parameter_set()

• A new vpcc_unit_type is defined

• packed_video() structure provides information about the packing regions

2. Special use case where only attributes are packed in one video frame

• A new identifier is defined that informs a decoder that a number of attributes are packed in a video bitstream.

• New SEI message provides information about the packing regions.

3. New packed_patches() syntax structure in atlas_sequence_parameter_set()

• Constraints on tile groups of an atlas to be aligned with regions of the packed video

• Patches are mapped based on patch index in a given tile group

• Way of interpreting patches as 2D and 3D patches

4. New patch modes in patch_information_data and new patch data unit structures.

• Patch data type can be signalled in patch itself,

• Or the patch is mapped to video regions signalled in the packed_video() structure.

Up to now, it has been assumed in V-PCC and MIV that the apparent brightness of a surface to an observer is the same regardless of the observer’s angle of view. This is a so-called Lambertian surface. In this case, a reflection, for example, is calculated by taking the dot product of the surface's normal vector N and a normalized light-direction vector L pointing from the surface to a light source. The resulting value is then multiplied by the color of the surface and the intensity of the light hitting the surface. The angle between the directions of the two vectors L and N determines the intensity, which is highest when the normal vector points in the same direction as the light vector.
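For illustration, the following is a minimal sketch of the Lambertian (diffuse) reflection described above. The vector type and function names are illustrative only and are not part of V-PCC, MIV or any other standard.

```python
# Minimal sketch of Lambertian (diffuse) reflection: brightness depends only on the
# angle between the surface normal N and the light direction L, not on the observer.
from dataclasses import dataclass
import math

@dataclass
class Vec3:
    x: float
    y: float
    z: float

    def dot(self, other: "Vec3") -> float:
        return self.x * other.x + self.y * other.y + self.z * other.z

    def normalized(self) -> "Vec3":
        n = math.sqrt(self.dot(self))
        return Vec3(self.x / n, self.y / n, self.z / n)

def lambertian_shade(normal: Vec3, light_dir: Vec3, surface_color: Vec3, light_intensity: float) -> Vec3:
    """Dot product of N and L, clamped at zero, scales the surface color and light intensity."""
    n_dot_l = max(0.0, normal.normalized().dot(light_dir.normalized()))
    return Vec3(surface_color.x * light_intensity * n_dot_l,
                surface_color.y * light_intensity * n_dot_l,
                surface_color.z * light_intensity * n_dot_l)
```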

However, V-PCC and MIV do not address non-Lambertian surfaces, i.e. surfaces from which light is not reflected equally towards all directions.

Thus, the purpose of the present embodiments is to provide functionality that allows assigning attribute information to geometry based on the viewing direction. This can be implemented by combining separation of patch layouts for one or more video components with the visibility cone functionality, which enables transmission and processing of view-dependent surface attributes.

An MIV/V-PCC decoder can decide to display given geometry patch information based on a visibility cone, and decide what attribute (e.g. color, normal, reflectance) to assign to the visible geometry based on other visibility cones assigned to attribute patches. The same geometrical region can have more than one texture assigned, wherein the texture to use is selected based on the viewer position. The texture can also be interpolated from a number of patches.

Figure 6 shows an example of different (diffuse) colors reflected in different directions. In Figure 6, a geometry patch has one visibility cone 601, and a surface represented by this geometry can have three different textures 603 assigned depending on the viewer position. Each texture has its own visibility cone representation 603, shown in Figure 6 as areas A - C. Figure 7 shows how geometry and texture patches can be packed into a video frame and the corresponding atlas “frame”. Patch metadata of index 1 in the geometry data would have three corresponding patches of index 1, 2, 3 in the texture data. Similarly, patch metadata of index 2 in the geometry data would have three corresponding patches of index 4, 5, 6 in the texture data, and patch metadata of index 3 in the geometry data would have three corresponding patches of index 7, 8, 9 in the texture data.

Visibility cones of texture and other attribute patches belonging to one surface could also overlap. This is illustrated in Figure 8, where a smaller cone is inside a bigger one. This example can be used to describe a specular highlight on a modelled surface. In Figure 8, the visibility cones of texture patches are referred to by 801, and the visibility cone of a geometry patch is referred to by 802.
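As a small illustration of the overlapping-cone case of Figure 8, one possible containment test is sketched below: a (specular) texture cone lies fully inside the geometry patch cone when the angle between the two cone axes plus the smaller half-angle does not exceed the larger half-angle. The function names and the cone representation are assumptions made for this sketch.

```python
# Check whether one visibility cone is fully contained inside another.
import math

def angle_between_axes(a, b):
    """Angle between two unit axis vectors given as (x, y, z) tuples."""
    dot = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))

def cone_inside(inner_axis, inner_angle, outer_axis, outer_angle):
    """Angles are full opening angles in radians; axes are assumed normalized."""
    return angle_between_axes(inner_axis, outer_axis) + inner_angle / 2.0 <= outer_angle / 2.0
```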

Figure 9 illustrates an encoder according to an embodiment. Uncompressed geometry data 002 and corresponding uncompressed texture data 001 are provided as input. Texture data 001 can have N variants representing different variants of the texture based on the position of the observer.

Geometry patch generator 102 operates according to the MIV/V-PCC standard and provides geometry patches 202 together with associated metadata 204 that, among other things, can include visibility cone information of patches.

Texture patch generator 101 generates texture patches for each texture data separately, together with associated metadata 203. The metadata 203 may comprise visibility cone information for each patch, assigned based on the texture visibility from the viewer position. The texture patch generator 101 can generate common texture patches (i.e. texture patches visible from all defined viewer positions) plus unique texture patches (i.e. texture patches visible from a given viewer position) together with associated metadata 203. Such metadata 203 can include visibility cone information for each patch, assigned based on the texture visibility from the viewer position, as well as how such unique patches should be interpreted and to which geometry patches they relate. One or more texture patches may exist for a given geometry patch, and not all geometry patches need to have the same number of corresponding texture patches.

Geometry patch generator 102 and texture patch generator 101 are synchronized, and they can exchange information between each other using the interface 103.

Geometry patches 202 and texture patches 201 are provided to a packing module 300. The packing module 300 is responsible for placing patch information into video frames 402 and generating associated metadata 401. The packing module 300 may pack geometry patches 202 and texture patches 201 into separate video frames. Alternatively, the packing module 300 can pack geometry patches 202 into one video frame, or pack common texture patches from all views into one video frame and all unique texture patches into another video frame. As a further alternative, the packing module 300 can pack geometry patches 202 into one video frame and pack common texture patches from all views together with all unique texture patches into another video frame. The packing module 300 may have an algorithm for grouping texture patches into common tile groups based on the visibility cone information. Packed geometry and texture video frames 402 are provided to video encoders 502, while the metadata related to the patches 401 is provided to the atlas encoder 501.
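One possible grouping heuristic for the packing module is sketched below: quantize each texture patch's visibility cone direction into coarse direction bins and place patches that share a bin into the same tile group. This heuristic is an assumption made for illustration; the embodiments do not mandate any particular grouping algorithm, and the names used here are not standard identifiers.

```python
# Group texture patches into tile groups by quantized visibility cone direction.
from collections import defaultdict

def direction_bin(direction, bins_per_axis=4):
    """Map a normalized (x, y, z) cone direction to a coarse integer bin per axis."""
    return tuple(min(bins_per_axis - 1, int((c + 1.0) / 2.0 * bins_per_axis)) for c in direction)

def group_patches_by_cone(texture_patches):
    """texture_patches: list of dicts with 'id' and a normalized 'cone_dir' tuple."""
    groups = defaultdict(list)
    for patch in texture_patches:
        groups[direction_bin(patch["cone_dir"])].append(patch["id"])
    return list(groups.values())  # each inner list becomes one candidate tile group
```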

Metadata 401 may contain one or more of the following information:

• Visibility cone information per patch

• Tile grouping information of patches

• Supplemental patch information on how to interpret/process unique patches (e.g. whether they should replace the common texture or be blended with the common texture, etc.), if the texture patch generator 101 generated common texture patches and unique texture patches

• Visibility cone information for tile groups

• Visibility cone information for atlases

• Mapping between common, unique and geometry patches

Atlas encoder 501 is configured to encode patch data according to the MIV/V-PCC standard, and additional information about tile grouping of patches and its mapping may be provided in ASPS, AFPS or SEI messages.

Video encoders 502 are configured to encode video frames according to a chosen video standard.

Atlas NAL bitstream 601 and Video NAL bitstreams 602 are provided to 3VC bitstream encapsulator 700, and they are encapsulated according to 3VC standard.

Figure 10 illustrates a decoder according to an embodiment. V-PCC/MIV decoder 2001 gets an atlas NAL bitstream 1001 and a corresponding video NAL bitstream 1002 as input. The NAL bitstreams 1001, 1002 are extracted from the 3VC bitstream 700, which contains the encoded video and metadata 601, 602.

V-PCC/MIV decoder 2001 outputs video frames containing geometry patches 3001, 402, video frames containing texture patches 3002, 402 and related atlas metadata 3003, 401. Metadata 3003 provides at least visibility cone information for each geometry and texture patch. The visibility cone information is passed to a renderer 4001.

Viewer position module 6001 provides the position and orientation of a viewer to the renderer 4001 through interface 5001. The viewer position module can get the information from user input (e.g. mouse movement) or from external sensors (e.g. sensors of HTC Vive or Oculus headsets). The renderer 4001 processes the input data for each output point or surface pixel. The operations may be as follows:

• Compute viewing direction from 3D location of point to rendering camera.

• Select all texture patches where the viewing direction is inside the view cone.

• Compute
o per-texture weights based on amount of inclusion inside the view cone, or
o per-texture weights based on relative angle from the center of the view cone, or
o per-texture weights similarly to unstructured lumigraph rendering

• Compute the final colour of the pixel as a weighted blend between the texture patches.
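The following sketch illustrates the per-pixel steps listed above, using the relative angle from the cone centre as the weighting rule. The patch representation, function names and the exact weighting formula are assumptions made for this sketch, not a normative renderer.

```python
# Select texture patches whose view cone contains the viewing direction and blend them.
import math

def angle_between(a, b):
    """Angle between two (x, y, z) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

def blend_textures(view_dir, texture_patches):
    """texture_patches: list of dicts with 'cone_dir', 'cone_angle' (full opening
    angle in radians) and 'color' (r, g, b) sampled for the current point."""
    weights, colors = [], []
    for patch in texture_patches:
        a = angle_between(view_dir, patch["cone_dir"])
        half = patch["cone_angle"] / 2.0
        if a <= half:                       # viewing direction inside the view cone
            weights.append(1.0 - a / half)  # weight by relative angle from cone centre
            colors.append(patch["color"])
    if not weights:
        return None                         # no view-dependent texture applies
    total = sum(weights)
    return tuple(sum(w * c[i] for w, c in zip(weights, colors)) / total for i in range(3))
```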

Alternatively, the renderer may select the required patches based on geometry patch visibility and use such information to select the needed texture patches. Texture patches may have more fine-grained visibility cones, which allow the renderer to select the best fitting texture patches for final colour synthesis.

A method for signalling, according to an embodiment, can comprise the following steps:

- packing geometry patches 202 into one or more video frames that are part of the geometry bitstream carried by VPCC_GVD;

- packing common texture patches from all views into other video frames that are part of the attribute bitstream carried by VPCC_AVD that map to ai_attribute_type_id equal to ATTR_TEXTURE; and

- packing unique texture patches that are view-dependent into other video frames that are part of the attribute bitstream carried by VPCC_AVD that map to the ai_attribute_type_id identifier equal to ATTR_TEXTURE_EL.

Geometry patches and texture patches with ai_attribute_type_id equal to ATTR_TEXTURE use the same patch information carried in atlas sub-bitstream.

Identifier ATTR_TEXTURE_EL informs a decoder that a VPCC_AVD unit with this identifier contains enhancement information for texture, and that the patch layout and patch indexes may not be the same as for geometry, occupancy and other attributes. However, it should be possible to map patches of ATTR_TEXTURE_EL to the geometry patches which they cover.

A method for signalling, according to another embodiment, can comprise defining an SEI message to provide mapping of patches carried in a VPCC_AVD unit with identifier ATTR_TEXTURE_EL to other patches carried in atlas data. An example of such an SEI message is given in the following:

In the enhancement_patch_information structure, a definition el_method() provides information on how to process the enhancement layer patch data. For example, in this invention the structure would provide information for the renderer 4001 on how to compute per-texture weights, e.g.

• per-texture weights based on amount of inclusion inside the view cone, or

• per-texture weights based on relative angle from the center of the view cone, or

• per-texture weights similarly to unstructured lumigraph rendering

In the enhancement_patch_information structure, a definition epi_num_tile_group indicates the number of tile groups in a related atlas frame or group of atlas frames.

In the enhancement_patch_information structure, a definition epi_tile_group_address[ j ] indicates the tile group address of the j-th tile group.

In the enhancement_patch_information structure, a definition epi_num_patches[ j ] indicates the number of patches that are in the tile group indicated by epi_tile_group_address[ j ].

In the enhancement_patch_information structure, a definition epi_num_el_patches[ j ][ i ] indicates the number of enhancement patches that are related to the patch with the i-th index in the j-th tile group indicated by epi_tile_group_address[ j ].

In the enhancement_patch_information structure, a definition epi_pdu_2d_pos_x[ j ][ i ][ k ] specifies the x-coordinate of the top-left corner of the patch bounding box for patch p in the current atlas tile group, expressed as a multiple of PatchPackingBlockSize.

The value of pdu_2d_pos_x[ p ] shall be in the range of 0 to Min( 2^( afps_2d_pos_x_bit_count_minus1 + 1 ) - 1, asps_frame_width / PatchPackingBlockSize - 1 ), inclusive. The number of bits used to represent pdu_2d_pos_x[ p ] is afps_2d_pos_x_bit_count_minus1 + 1.

In the enhancement_patch_information structure, a definition epi_pdu_2d_pos_y[ j ][ i ][ k ] specifies the y-coordinate of the top-left corner of the patch bounding box for patch p in the current atlas tile group, expressed as a multiple of PatchPackingBlockSize.

The value of pdu_2d_pos_y[ p ] shall be in the range of 0 to Min( 2^( afps_2d_pos_y_bit_count_minus1 + 1 ) - 1, asps_frame_height / PatchPackingBlockSize - 1 ), inclusive. The number of bits used to represent pdu_2d_pos_y[ p ] is afps_2d_pos_y_bit_count_minus1 + 1.

In the enhancement_patch_information structure, a definition epi_pdu_2d_delta_size_x[ j ][ i ][ k ], when p is equal to 0, specifies the width value of the patch with index 0 in the current atlas tile group. When p is larger than 0, pdu_2d_delta_size_x[ p ] specifies the difference of the width values of the patch with index p and the patch with index (p - 1). It is a requirement of bitstream conformance that, when p is equal to 0, pdu_2d_delta_size_x[ p ] shall be greater than 0.

In the enhancement_patch_information structure, a definition epi_pdu_2d_delta_size_y[ j ][ i ][ k ], when p is equal to 0, specifies the height value of the patch with index 0 in the current atlas tile group. When p is larger than 0, pdu_2d_delta_size_y[ p ] specifies the difference of the height values of the patch with index p and the patch with index (p - 1). It is a requirement of bitstream conformance that, when p is equal to 0, pdu_2d_delta_size_y[ p ] shall be greater than 0.
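The SEI syntax table itself is not reproduced here. The following is a minimal sketch, inferred from the field descriptions above, of how the indices [ j ][ i ][ k ] nest and how the delta-coded sizes would be reconstructed. All class and function names are illustrative and are not defined by ISO/IEC 23090-5.

```python
# Sketch of the nesting of enhancement_patch_information fields and delta decoding.
from dataclasses import dataclass, field
from typing import List

@dataclass
class EnhancementPatch:
    pos_x: int           # epi_pdu_2d_pos_x[ j ][ i ][ k ], multiple of PatchPackingBlockSize
    pos_y: int           # epi_pdu_2d_pos_y[ j ][ i ][ k ]
    delta_size_x: int    # epi_pdu_2d_delta_size_x[ j ][ i ][ k ]
    delta_size_y: int    # epi_pdu_2d_delta_size_y[ j ][ i ][ k ]

@dataclass
class TileGroupInfo:
    address: int                                   # epi_tile_group_address[ j ]
    el_patches: List[List[EnhancementPatch]] = field(default_factory=list)
    # el_patches[i][k]: k-th enhancement patch of the i-th base patch in this tile group

def absolute_sizes(patches: List[EnhancementPatch]):
    """Delta decoding: the patch with index 0 carries its size directly; each later
    patch carries the difference to the previous patch's size."""
    sizes, w, h = [], 0, 0
    for p in patches:
        w += p.delta_size_x
        h += p.delta_size_y
        sizes.append((w, h))
    return sizes
```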

A structure visibility_cone() provides visibility cone information of an enhancement patch. An example of such is given in the following:

In the visibility_cone structure, a definition vc_direction_x[ i ] indicates the normalized x-component value of the direction vector for the visibility cone of an enhancement patch. vc_direction_x is stored as a normalized unsigned integer, where UINT32_MIN maps to -1.0 and UINT32_MAX maps to 1.0.

In the visibility_cone structure, a definition vc_direction_y[ i ] indicates the normalized y-component value of the direction vector for the visibility cone of an enhancement patch. vc_direction_y is stored as a normalized unsigned integer, where UINT32_MIN maps to -1.0 and UINT32_MAX maps to 1.0.

In the visibility_cone structure, a definition vc_direction_z[ i ] indicates the normalized z-component value of the direction vector for the visibility cone of an enhancement patch. vc_direction_z is stored as a normalized unsigned integer, where UINT32_MIN maps to -1.0 and UINT32_MAX maps to 1.0.

In the visibility_cone structure, a definition vc_angle[ i ] indicates the angle of the visibility cone along the direction vector. The value of vc_angle is stored as a normalized unsigned integer, where UINT16_MIN corresponds to an angle value of 0 and UINT16_MAX corresponds to 180 degrees.
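For illustration, the mappings described above can be expressed as simple conversions: each direction component is a 32-bit unsigned integer mapped linearly to [-1.0, 1.0], and the cone angle is a 16-bit unsigned integer mapped to [0, 180] degrees. The function names below are illustrative only.

```python
# Convert the normalized unsigned integer fields of visibility_cone() to floats.
UINT32_MAX = 2**32 - 1
UINT16_MAX = 2**16 - 1

def decode_direction_component(value: int) -> float:
    """0 (UINT32_MIN) maps to -1.0, UINT32_MAX maps to 1.0."""
    return value / UINT32_MAX * 2.0 - 1.0

def decode_cone_angle_degrees(value: int) -> float:
    """0 (UINT16_MIN) maps to 0 degrees, UINT16_MAX maps to 180 degrees."""
    return value / UINT16_MAX * 180.0

def encode_direction_component(value: float) -> int:
    """Inverse mapping used when writing the field."""
    return round((value + 1.0) / 2.0 * UINT32_MAX)
```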

According to another embodiment, a generic ai_attribute_type_id is defined for view-dependent video components.

A SEI message is defined to link a video component based on the attribute index. This embodiment allows packing enhancement patches in another video component. An example of such an SEI message is given below:

In the enhancement_patch_information structure, a definition epi_num_video_comp[ j ][ i ] indicates in how many video components enhancement patches are present for a patch with index i in the tile group with address epi_tile_group_address[ j ].

In the enhancement_patch_information structure, a definition epi_video_comp_index[ j ][ i ][ k ] indicates an index of an attribute (vuh_attribute_index) in which the enhancement patches are present.

According to an embodiment, a new type of a patch, I_EL, is defined together with a related data structure el_patch_data_unit( patchIdx ). Information from the el_patch_data_unit structure relates only to attributes with a predefined ai_attribute_type_id (for example ATTR_VIEW_DEPENDENT_EL). The el_patch_data_unit structure does not relate to the occupancy sub-bitstream, the geometry sub-bitstream, or any attribute sub-bitstream with an ai_attribute_type_id other than the one predefined.

According to an embodiment, the el_patch_data_unit() structure is defined as follows:

In the el_patch_data_unit() structure, a definition el_attribute_index[ i ] indicates an index of an attribute (vuh_attribute_index) in which the patch data is stored.

In the el_patch_data_unit() structure, a definition el_pdu_2d_pos_x[ i ] specifies the x-coordinate of the top-left corner of the patch bounding box.

In the el_patch_data_unit() structure, a definition el_pdu_2d_pos_y[ i ] specifies the y-coordinate of the top-left corner of the patch bounding box.

In the el_patch_data_unit() structure, a definition el_pdu_2d_delta_size_x[ i ] specifies the width value of the patch.

In the el_patch_data_unit() structure, a definition el_pdu_2d_delta_size_y[ i ] specifies the height value of the patch.

The el_patch_data_unit() structure follows the patch data unit that it enhances. A visibility cone is assigned to this patch using the same functionality as for the other patches.

According to an embodiment, the el_patch_data_unit() structure is defined as follows to ensure linkage between enhancement patches and geometry patches:

In the el_patch_data_unit() structure, a definition el_ref_tile_group_address[ i ] indicates a tile group address in which a base patch is present.

In the el_patch_data_unit() structure, a definition el_ref_patch_index[ i ] indicates an index of a base patch which is enhanced by the patch whose position is signalled in this el_patch_data_unit.

In the el_patch_data_unit() structure, a definition el_attribute_index[ i ] indicates an index of an attribute (vuh_attribute_index) in which the patch data is stored.

In the el_patch_data_unit() structure, a definition el_pdu_2d_pos_x[ i ] specifies the x-coordinate of the top-left corner of the patch bounding box.

In the el_patch_data_unit() structure, a definition el_pdu_2d_pos_y[ i ] specifies the y-coordinate of the top-left corner of the patch bounding box.

In the el_patch_data_unit() structure, a definition el_pdu_2d_delta_size_x[ i ] specifies the width value of the patch.

In the el_patch_data_unit() structure, a definition el_pdu_2d_delta_size_y[ i ] specifies the height value of the patch.

A visibility cone is assigned to this patch using the same functionality as for the other patches.
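The following is a hedged sketch of the second el_patch_data_unit() variant described above, which carries an explicit reference to the base patch it enhances. The field names follow the text; the container classes and lookup function are illustrative assumptions, not the normative syntax.

```python
# Sketch of an el_patch_data_unit with explicit base patch linkage and its resolution.
from dataclasses import dataclass

@dataclass
class ElPatchDataUnit:
    el_ref_tile_group_address: int  # tile group in which the base patch is present
    el_ref_patch_index: int         # index of the base patch being enhanced
    el_attribute_index: int         # vuh_attribute_index of the attribute carrying the patch data
    el_pdu_2d_pos_x: int            # x-coordinate of the top-left corner of the patch bounding box
    el_pdu_2d_pos_y: int            # y-coordinate of the top-left corner of the patch bounding box
    el_pdu_2d_delta_size_x: int     # width value of the patch
    el_pdu_2d_delta_size_y: int     # height value of the patch

def resolve_base_patch(el_patch: ElPatchDataUnit, tile_groups: dict):
    """tile_groups: mapping from tile group address to its list of base patches."""
    return tile_groups[el_patch.el_ref_tile_group_address][el_patch.el_ref_patch_index]
```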

According to another embodiment, aps_el_patches_extension_flag is introduced. When the flag is present, an el_method() structure is present. el_method() provides information on how to process the enhancement layer patch data. For example, in these embodiments the structure would provide information for the renderer 4001 on how to compute per-texture weights in view-dependent texture patches, e.g.

• per-texture weights based on amount of inclusion inside the view cone, or

• per-texture weights based on relative angle from the center of the view cone, or

• per-texture weights similarly to unstructured lumigraph rendering

According to another embodiment, in some circumstances the data required to signal sub-visibility cones for enhancement patches may be optimized. Instead of signalling sub-visibility cones as cones, it may be beneficial to only signal the direction vector, which may be used to identify the best fitting enhancement layers. Thus, signalling of vc_angle is no longer necessary, and the renderer uses the direction vectors to produce the best colour estimate for the final output.
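As a sketch of this simplified selection, when only direction vectors are signalled the renderer can pick the enhancement patch whose direction vector best matches the viewing direction, e.g. by maximum dot product. The patch representation and function name below are assumptions made for illustration.

```python
# Pick the enhancement patch whose signalled direction best matches the viewing direction.
def best_enhancement_patch(view_dir, enhancement_patches):
    """enhancement_patches: list of dicts with a unit 'direction' vector (x, y, z)."""
    return max(enhancement_patches,
               key=lambda p: sum(v * d for v, d in zip(view_dir, p["direction"])),
               default=None)
```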

Other mechanisms for signalling the direction vector may exist. As an example, the direction vector may be derived from the original geometry patch visibility cone by providing delta coded information to offset the enhancement patch direction vector from the original direction vector.

Another option would be to signal only sectors of the original geometry visibility cone. Considering that enhancement layer patches may be linked to the geometry patch they belong to, it is possible to subdivide the original geometry visibility cone into sectors. Such sector ids may then be linked to the enhancement patches. This requires adding a sector_count attribute to the original geometry visibility cone.

In the visibility_cone( ) structure, a definition vc_sector_count signals the number of sectors for the visibility cone. An example of sectoring is provided in Figure 11. The indices 1 - 8 for sectors may be encoded as part of the enhancement patch information instead of providing precise visibility cones for each enhancement patch. The ordering of the sectors for the visibility cone may be defined per need. Also, the number of visibility cone sectors may change from one geometry patch to another. This makes it possible to provide more detail for highly specular patches.
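The following is an illustrative sketch of sector-based signalling: the geometry patch's visibility cone is subdivided into vc_sector_count angular sectors around the cone axis, and a direction vector is mapped to a sector index. The construction of the axis frame and the ordering of the sectors are assumptions; as noted above, the ordering may be defined per need.

```python
# Map a direction vector to a sector index of a geometry patch visibility cone.
import math

def sector_index(direction, cone_axis, vc_sector_count):
    """Return a sector index in [1, vc_sector_count] for a unit direction vector."""
    ax, ay, az = cone_axis
    # Build an arbitrary but deterministic basis (u, v) perpendicular to the cone axis.
    ref = (0.0, 0.0, 1.0) if abs(az) < 0.9 else (1.0, 0.0, 0.0)
    u = (ay * ref[2] - az * ref[1], az * ref[0] - ax * ref[2], ax * ref[1] - ay * ref[0])
    un = math.sqrt(sum(c * c for c in u))
    u = tuple(c / un for c in u)
    v = (ay * u[2] - az * u[1], az * u[0] - ax * u[2], ax * u[1] - ay * u[0])
    # The azimuth of the direction around the cone axis selects the sector.
    azimuth = math.atan2(sum(d * c for d, c in zip(direction, v)),
                         sum(d * c for d, c in zip(direction, u)))
    return int(((azimuth + math.pi) / (2 * math.pi)) * vc_sector_count) % vc_sector_count + 1
```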

The method according to an embodiment is shown in Figure 12. The method generally comprises receiving 1210 an uncompressed first data and corresponding uncompressed second data; generating 1220 a first set of patches with metadata from the uncompressed first data and generating a second set of patches with metadata from the uncompressed second data, wherein the metadata comprises at least visibility cone information; transmitting 1230 the generated first set of patches and the second set of patches for packing into video frames; encoding 1240 the packed video frames by video encoder into a video bitstream; encoding 1250 associated metadata by an atlas encoder into an atlas bitstream; and encapsulating 1260 the video bitstream and the atlas bitstream according to visual volumetric video-based coding into a bitstream. Each of the steps can be implemented by a respective module of a computer system.

An apparatus according to an embodiment comprises means for receiving an uncompressed first data and corresponding uncompressed second data; means for generating a first set of patches with metadata from the uncompressed first data and means for generating a second set of patches with metadata from the uncompressed second data, wherein the metadata comprises at least visibility cone information; means for transmitting the generated first set of patches and the second set of patches for packing into video frames; means for encoding the packed video frames by video encoder into a video bitstream; means for encoding associated metadata by an atlas encoder into an atlas bitstream; and means for encapsulating the video bitstream and the atlas bitstream according to visual volumetric video-based coding into a bitstream. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 12 according to various embodiments.

The method according to an embodiment is shown in Figure 13. The method is for decoding and generally comprises receiving 1310 an atlas bitstream and corresponding video bitstream having been extracted from a visual volumetric video-based coded bitstream; outputting 1320 video frames containing first data patches and second data patches and associated metadata, wherein the metadata comprises at least visibility cone information; passing 1330 the visibility cone information for rendering; providing 1340 a position and an orientation of a viewer for rendering; based on the visibility cone information and the position and the orientation of a viewer, determining 1350 a final attribute information for a pixel according to second data patches.

An apparatus according to an embodiment comprises means for receiving an atlas bitstream and corresponding video bitstream having been extracted from a visual volumetric video-based coded bitstream; means for outputting video frames containing first data patches and second data patches and associated metadata, wherein the metadata comprises at least visibility cone information; means for passing the visibility cone information for rendering; means for providing a position and an orientation of a viewer for rendering; based on the visibility cone information and the position and the orientation of a viewer, means for determining a final attribute information for a pixel according to second data patches. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 13 according to various embodiments.

An example of an apparatus is disclosed with reference to Figure 14. Figure 14 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder. The electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device. The electronic device 50 may be also comprised at a local or a remote server or a graphics processing unit of a computer. The device may be also comprised as part of a head-mounted display device. The apparatus 50 may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. The camera 42 may be a multi-lens camera system having at least two camera sensors. The camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.

The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.

The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network. The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es). The apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection.

The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. The computer program code comprises one or more operational characteristics. Said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises features according to either of the embodiments shown in Figure 12 or 13.

A computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications, which may be made without departing from the scope of the present disclosure as defined in the appended claims.