


Title:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND VIDEO DECODING
Document Type and Number:
WIPO Patent Application WO/2022/074286
Kind Code:
A1
Abstract:
The embodiments relate to a method, comprising receiving as an input a volumetric video frame comprising volumetric content (510); projecting the volumetric content to at least two patches in a temporal order, wherein a patch comprises a volumetric video data component (520); creating atlas images including at least two patches, wherein each atlas image includes the patches from the same temporal projection (530); packing the information on several patches and/or atlas images, wherein the packed information comprises a set of parameters relating to said patches and/or atlas images to be shared among more than one patches/atlas images (540); signaling, in or along a bitstream, at least an indication that the atlas images comprise packed information (550); and transmitting the encoded bitstream to a storage for rendering (560). The embodiments also relate to an apparatus and to a computer program product.

Inventors:
AFLAKI-BENI PAYMAN (FI)
SCHWARZ SEBASTIAN (DE)
Application Number:
PCT/FI2021/050630
Publication Date:
April 14, 2022
Filing Date:
September 24, 2021
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H04N19/597; G06T3/60; G06T7/73; H04N13/178; H04N19/174; H04N19/70; H04N19/85
Domestic Patent References:
WO2020150148A12020-07-23
Foreign References:
US20200151913A12020-05-14
Other References:
ARASH VOSOUGHI, BYEONGDOO CHOI, SEHOON YEA, STEPHAN WENGER, SHAN LIU: "[V-PCC][CE2.19 related][New proposal] Dynamic point cloud partition packing using tile groups", 127. MPEG MEETING; 20190708 - 20190712; GOTHENBURG; (MOTION PICTURE EXPERT GROUP OR ISO/IEC JTC1/SC29/WG11), 4 July 2019 (2019-07-04), XP030207640
"V-PCC Codec description", 128. MPEG MEETING; 20191007 - 20191011; GENEVA; (MOTION PICTURE EXPERT GROUP OR ISO/IEC JTC1/SC29/WG11), 30 December 2019 (2019-12-30), XP030225590
"3 DG OF ISO/IEC JTC1/SC29/WG11 W19329 Information technology- Coded Representation of Immersive Media - Part 5: Visual Volumetric Video-based Coding (V3C) and Video-based Point Cloud Compression (V-PCC", THE 130TH MEETING OF MPEG, 9 May 2020 (2020-05-09), Alpbach, Retrieved from the Internet [retrieved on 20211222]
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (FI)

Claims:

1. A method, comprising:

- receiving as an input a volumetric video frame comprising volumetric content;

- projecting the volumetric content to at least two patches in a temporal order, wherein a patch comprises a volumetric video data component;

- creating atlas images including at least two patches, wherein each atlas image includes the patches from the same temporal projection;

- packing the information on several patches and/or atlas images, wherein the packed information comprises a set of parameters relating to said patches and/or atlas images to be shared among more than one patches/atlas images;

- signaling, in or along a bitstream, at least an indication that the atlas images comprise packed information; and

- transmitting the encoded bitstream to a storage for rendering.

2. The method according to claim 1, further comprising signaling, in or along the bitstream, another indication on whether a group of pictures comprises packed information on several patches.

3. The method according to claim 1 or 2, wherein the packed information comprises a list of atlas image identifications and a list of atlas image information being shared by all such atlas images.

4. The method according to any of the claims 1 to 3, wherein the packed information comprises rotation or width of patches to be shared.

5. The method according to any of the claims 1 to 4, further comprising indicating in or along a bitstream a reference atlas image and/or reference region comprising the packing information.

6. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive as an input a volumetric video frame comprising volumetric content;

- project the volumetric content to at least two patches in a temporal order, wherein a patch comprises a volumetric video data component;

- create atlas images including at least two patches, wherein each atlas image includes the patches from the same temporal projection;

- pack the information on several patches and/or atlas images, wherein the packed information comprises a set of parameters relating to said patches and/or atlas images to be shared among more than one patches/atlas images;

- signal, in or along a bitstream, at least an indication that the atlas images comprise packed information; and

- transmit the encoded bitstream to a storage for rendering.

7. An apparatus comprising:

- means for receiving as an input a volumetric video frame comprising volumetric content;

- means for projecting the volumetric content to at least two patches in a temporal order, wherein a patch comprises a volumetric video data component;

- means for creating atlas images including at least two patches, wherein each atlas image includes the patches from the same temporal projection;

- means for packing the information on several patches and/or atlas images, wherein the packed information comprises a set of parameters relating to said patches and/or atlas images to be shared among more than one patches/atlas images;

- means for signaling, in or along a bitstream, at least an indication that the atlas images comprise packed information; and

- means for transmitting the encoded bitstream to a storage for rendering.

8. The apparatus according to claim 7, further comprising means for signaling, in or along the bitstream, another indication on whether a group of pictures comprises packed information on several patches.

9. The apparatus according to claim 7 or 8, wherein the packed information comprises a list of atlas image identifications and a list of atlas image information being shared by such atlas images.

10. The apparatus according to claim 7 or 8, wherein the packed information comprises a list of patch identifications and a list of patch information being shared by such patches.

11. The apparatus according to any of the claims 7 to 10, wherein the packed information comprises rotation or width of patches to be shared.

12. The apparatus according to any of the claims 7 to 11, further comprising means for indicating in or along a bitstream a reference atlas image and/or reference region comprising the packing information.

13. The apparatus according to any of the claims 7 to 12, further comprising communicating atlas image packed information prior to a first atlas image included in the list of atlas images.

14. The apparatus according to any of the claims 7 to 13, wherein the packed information for the atlas images is the shared tile information.

15. A computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:

- receive as an input a volumetric video frame comprising volumetric content;

- project the volumetric content to at least two patches in a temporal order, wherein a patch comprises a volumetric video data component;

- create atlas images including at least two patches, wherein each atlas image includes the patches from the same temporal projection;

- pack the information on several patches and/or atlas images, wherein the packed information comprises a set of parameters relating to said patches and/or atlas images to be shared among more than one patches/atlas images;

- signal, in or along a bitstream, at least an indication that the atlas images comprise packed information; and

- transmit the encoded bitstream to a storage for rendering.

Description:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND VIDEO DECODING

Technical Field

The present solution generally relates to encoding and decoding of digital volumetric video.

Background

Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with relatively narrow field of view and displayed as a rectangular scene on flat displays. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed).

More recently, new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.

For volumetric video, a scene may be captured using one or more 3D (three-dimensional) cameras. The cameras are in different positions and orientations within a scene. One issue to consider is that compared to 2D (two-dimensional) video content, volumetric 3D video content has much more data, so viewing it requires lots of bandwidth (with or without transferring it from a storage location to a viewing device): disk I/O, network traffic, memory bandwidth, GPU (Graphics Processing Unit) upload. Capturing volumetric content also produces a lot of data, particularly when there are multiple capture devices used in parallel.

Summary

The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.

Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.

According to a first aspect, there is provided a method comprising:

- receiving as an input a volumetric video frame comprising volumetric content;

- projecting the volumetric content to at least two patches in a temporal order, wherein a patch comprises a volumetric video data component;

- creating atlas images including at least two patches, wherein each atlas image includes the patches from the same temporal projection;

- packing the information on several patches and/or atlas images, wherein the packed information comprises a set of parameters relating to said patches and/or atlas images to be shared among more than one patches/atlas images;

- signaling, in or along a bitstream, at least an indication that the atlas images comprise packed information; and

- transmitting the encoded bitstream to a storage for rendering.

According to a second aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive as an input a volumetric video frame comprising volumetric content;

- project the volumetric content to at least two patches in a temporal order, wherein a patch comprises a volumetric video data component;

- create atlas images including at least two patches, wherein each atlas image includes the patches from the same temporal projection;

- pack the information on several patches and/or atlas images, wherein the packed information comprises a set of parameters relating to said patches and/or atlas images to be shared among more than one patches/atlas images;

- signal, in or along a bitstream, at least an indication that the atlas images comprise packed information; and

- transmit the encoded bitstream to a storage for rendering.

According to a third aspect, there is provided an apparatus comprising:

- means for receiving as an input a volumetric video frame comprising volumetric content;

- means for projecting the volumetric content to at least two patches in a temporal order, wherein a patch comprises a volumetric video data component;

- means for creating atlas images including at least two patches, wherein each atlas image includes the patches from the same temporal projection;

- means for packing the information on several patches and/or atlas images, wherein the packed information comprises a set of parameters relating to said patches and/or atlas images to be shared among more than one patches/atlas images;

- means for signaling, in or along a bitstream, at least an indication that the atlas images comprise packed information; and

- means for transmitting the encoded bitstream to a storage for rendering.

According to a fourth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:

- receive as an input a volumetric video frame comprising volumetric content;

- project the volumetric content to at least two patches in a temporal order, wherein a patch comprises a volumetric video data component;

- create atlas images including at least two patches, wherein each atlas image includes the patches from the same temporal projection;

- pack the information on several patches and/or atlas images, wherein the packed information comprises a set of parameters relating to said patches and/or atlas images to be shared among more than one patches/atlas images;

- signal, in or along a bitstream, at least an indication that the atlas images comprise packed information; and

- transmit the encoded bitstream to a storage for rendering.

According to an embodiment, another indication on whether a group of pictures comprises packed information on several patches is signaled, in or along the bitstream.

According to an embodiment, the packed information comprises a list of atlas image identifications and a list of atlas image information being shared by such atlas images.

According to an embodiment, the packed information comprises a list of patch identifications and a list of patch information being shared by such patches.

According to an embodiment, the packed information comprises rotation or width of patches to be shared.

According to an embodiment, a reference atlas image and/or reference region comprising the packing information is indicated in or along a bitstream.

According to an embodiment, atlas image packed information is communicated prior to a first atlas image included in the list of atlas images.

According to an embodiment, the packed information for the atlas images is the shared tile information.

Description of the Drawings

In the following, various embodiments will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows an example of an encoding process;

Fig. 2 shows an example of a decoding process;

Fig. 3 shows an example of a compression process of a volumetric video;

Fig. 4 shows an example of a de-compression process of a volumetric video;

Fig. 5 is a flowchart illustrating a method according to an embodiment; and

Fig. 6 shows an apparatus according to an embodiment.

Description of Example Embodiments

In the following, several embodiments will be described in the context of volumetric video encoding and decoding. In particular, the several embodiments enable packing and signaling volumetric video in one video component.

A video codec comprises an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that can un-compress the compressed video representation back into a viewable form. An encoder may discard some information in the original video sequence in order to represent the video in a more compact form (i.e. at lower bitrate). Figure 1 illustrates an encoding process of an image as an example. Figure 1 shows an image to be encoded (In); a predicted representation of an image block (P'n); a prediction error signal (Dn); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); a transform (T) and inverse transform (T⁻¹); a quantization (Q) and inverse quantization (Q⁻¹); entropy encoding (E); a reference frame memory (RFM); inter prediction (Pinter); intra prediction (Pintra); mode selection (MS) and filtering (F). An example of a decoding process is illustrated in Figure 2. Figure 2 illustrates a predicted representation of an image block (P'n); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); an inverse transform (T⁻¹); an inverse quantization (Q⁻¹); an entropy decoding (E⁻¹); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).

Volumetric video refers to visual content that may have been captured using one or more three-dimensional (3D) cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world. Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane. Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a two-dimensional (2D) plane. However, only a relatively small part of the volume changes over time. Therefore, it may be possible to reduce the total amount of data by only coding information about an initial state and changes which may occur between frames. Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR (Light Detection and Ranging), for example.

Volumetric video data represents a three-dimensional scene or object, and can be used as input for AR (Augmented Reality), VR (Virtual Reality) and MR (Mixed Reality) applications. Such data describes geometry (shape, size, position in three-dimensional space) and respective attributes (e.g. color, opacity, reflectance, ...), plus any possible temporal changes of the geometry and attributes at given time instances (like frames in two-dimensional (2D) video). Volumetric video is either generated from three-dimensional (3D) models, i.e. CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g. multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Representation formats for such volumetric data comprise triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” in 2D video, or other means, e.g. position of an object as a function of time.

Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR or MR applications, especially for providing 6DOF viewing capabilities.

Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real-world 3D scenes, where the topology is not necessarily a 2D manifold. Another way to represent 3D data is to code it as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multilevel surface maps.

In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression becomes essential. Standard volumetric video representation formats, such as point clouds, meshes, and voxels, suffer from poor temporal compression performance. Identifying correspondences for motion compensation in 3D space is an ill-defined problem, as both geometry and respective attributes may change. For example, temporally successive “frames” do not necessarily have the same number of meshes, points or voxels. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video-based approaches for compressing volumetric data, i.e. multiview and depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.

Instead of the above-mentioned approach, a 3D scene represented as meshes, points, and/or voxels can be projected onto one or more geometries. These geometries are “unfolded” onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. Relevant projection geometry information is transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).

Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression. Thus, coding efficiency may be increased greatly. Using geometry projections instead of prior-art 2D-video-based approaches, i.e. multiview and depth, provides better coverage of the scene (or object). Thus, 6DOF capabilities may be improved. Using several geometries for individual objects improves the coverage of the scene further. Furthermore, standard video encoding hardware can be utilized for real-time compression/de-compression of the projected planes. The projection and reverse projection steps are of low complexity.

Figure 3 illustrates an overview of an example of a compression process of a volumetric video. Such a process may be applied, for example, in MPEG Point Cloud Coding (PCC). The process starts with an input point cloud frame 301 that is provided for patch generation 302, geometry image generation 304 and texture image generation 305.

The patch generation 302 process aims at decomposing the point cloud frame by converting 3D samples to 2D samples on a given projection plane using a strategy that provides the best compression. The patch generation process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing reconstruction error.

For patch generation, the normal at every point can be estimated. The tangent plane and its corresponding normal are defined for each point, based on the point's m nearest neighbors within a predefined search distance. A K-D tree is used to separate the data and find the neighbors in the vicinity of a point p_i, and the barycenter c = p̄ of that set of points is used to define the normal. The barycenter c is computed as follows:

c = p̄ = (1/m) · Σ_{i=1..m} p_i

The normal is estimated from the eigen decomposition of the covariance matrix of the defined point cloud,

(1/m) · Σ_{i=1..m} (p_i − p̄)(p_i − p̄)^T,

as the eigenvector associated with its smallest eigenvalue.

Based on this information, each point is associated with a corresponding plane of a point cloud bounding box. Each plane is defined by a corresponding normal n_p with values:

- (1.0, 0.0, 0.0),

- (0.0, 1.0, 0.0),

- (0.0, 0.0, 1.0),

- (-1.0, 0.0, 0.0),

- (0.0, -1.0, 0.0), and

- (0.0, 0.0, -1.0).

More precisely, each point may be associated with the plane that has the closest normal (i.e. the plane that maximizes the dot product of the point normal and the plane normal n_p).

The sign of the normal is defined depending on the point’s position in relationship to the “center”.

The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The next step may comprise extracting patches by applying a connected component extraction procedure.
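
As an illustration of the normal estimation and initial clustering described above, the following Python sketch estimates a per-point normal by PCA over the m nearest neighbors (the eigenvector of the neighborhood covariance with the smallest eigenvalue) and assigns each point to the bounding-box plane whose normal maximizes the dot product. This is a minimal sketch under the stated assumptions, not the V-PCC test model implementation; function and variable names are chosen here for illustration only.

import numpy as np
from scipy.spatial import cKDTree

# The six candidate plane normals of the point cloud bounding box.
PLANE_NORMALS = np.array([
    [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0],
])

def estimate_normals(points, m=16):
    """points: (N, 3) float array. Estimate a normal per point from the eigen
    decomposition of the covariance of its m nearest neighbors."""
    tree = cKDTree(points)
    normals = np.empty_like(points)
    for i, p in enumerate(points):
        _, idx = tree.query(p, k=m)
        nbrs = points[idx]
        c = nbrs.mean(axis=0)                      # barycenter of the neighborhood
        cov = (nbrs - c).T @ (nbrs - c) / m        # covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
        normals[i] = eigvecs[:, 0]                 # eigenvector of the smallest eigenvalue
    return normals

def initial_clustering(normals):
    """Associate each point with the plane whose normal maximizes the dot product."""
    scores = normals @ PLANE_NORMALS.T             # (N, 6) dot products
    return scores.argmax(axis=1)                   # plane index per point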

Patch info determined at patch generation 302 for the input point cloud frame 301 is delivered to packing process 303, to geometry image generation 304 and to texture image generation 305. The packing process 303 aims at generating the geometry and texture maps, by appropriately considering the generated patches and by trying to efficiently place the geometry or texture data that corresponds to each patch onto a 2D grid of size WxH. Such placement also accounts for a minimum size block TxT (e.g. 16 x 16), which specifies the minimum distance between distinct patches as placed on this 2D grid. It should be noticed that T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder.

The packing method may use a search algorithm as follows:

Initially, patches may be placed on a 2D grid in a manner that guarantees a non-overlapping insertion. Samples belonging to a patch (rounded to a value that is a multiple of T) are considered as occupied blocks. In addition, a safeguard between adjacent patches is enforced as a distance of at least one block, being a multiple of T. Patches may be processed in an orderly manner, based on the patch index list. Each patch from the list is iteratively placed on the grid. The grid resolution depends on the original point cloud size, and its width (W) and height (H) may be encoded in the bitstream and transmitted to the decoder. In the case that there is no empty space available for the next patch, the height value of the grid is initially doubled, and the insertion of this patch is evaluated again. If insertion of all patches is successful, then the height is trimmed to the minimum needed value. However, this value is not allowed to be set lower than the originally specified value in the encoder. The final values for W and H correspond to the frame resolution that is used to encode the texture and geometry video signals using the appropriate video codec.
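
As a rough illustration of this packing loop, the sketch below performs a greedy, block-aligned, non-overlapping placement and doubles the grid height when a patch does not fit. The one-block safeguard between patches and the patch ordering are omitted for brevity; all names are illustrative, and this is not the test-model algorithm.

import numpy as np

def pack_patches(patch_sizes, W=1024, H=1024, T=16):
    """patch_sizes: list of (width, height) in pixels. Returns the top-left pixel
    position of each patch and the grid height actually used (trimmable)."""
    gw = W // T
    occupied = np.zeros((H // T, gw), dtype=bool)  # occupancy of TxT blocks
    placements = []
    for w, h in patch_sizes:
        bw, bh = -(-w // T), -(-h // T)            # patch size rounded up to blocks
        assert bw <= gw, "patch wider than the grid"
        placed = False
        while not placed:
            for y in range(occupied.shape[0] - bh + 1):
                for x in range(gw - bw + 1):
                    if not occupied[y:y + bh, x:x + bw].any():
                        occupied[y:y + bh, x:x + bw] = True
                        placements.append((x * T, y * T))
                        placed = True
                        break
                if placed:
                    break
            if not placed:                         # no free space: double the height
                occupied = np.vstack([occupied, np.zeros_like(occupied)])
    rows = np.flatnonzero(occupied.any(axis=1))
    used_h = (int(rows.max()) + 1) * T if rows.size else 0
    return placements, used_h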

The geometry image generation 304 and the texture image generation 305 are configured to generate geometry images and texture images, respectively. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called the near layer, stores the point of H(u, v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:

• Geometry: WxH YUV420-8bit,

• Texture: WxH YUV420-8bit,

It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
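
As a rough illustration of the two-layer idea described above, the sketch below splits the points that project onto one pixel into a near-layer depth (the minimum D0) and a far-layer depth (the largest depth within [D0, D0+Δ]). It assumes the 3D-to-2D mapping of each point is already available and is not meant to reproduce the exact test-model behaviour.

def near_far_layers(projected, surface_thickness):
    """projected: iterable of ((u, v), depth) pairs for one patch.
    Returns two dicts mapping pixel -> depth for the near and the far layer."""
    near, far = {}, {}
    for (u, v), d in projected:                        # near layer: lowest depth D0
        if (u, v) not in near or d < near[(u, v)]:
            near[(u, v)] = d
    for (u, v), d in projected:                        # far layer: highest depth in [D0, D0+delta]
        d0 = near[(u, v)]
        if d0 <= d <= d0 + surface_thickness:
            if (u, v) not in far or d > far[(u, v)]:
                far[(u, v)] = d
    return near, far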

The geometry images and the texture images may be provided to image padding 307. The image padding 307 may also receive as an input an occupancy map (OM) 306 to be used with the geometry images and texture images. The occupancy map 306 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image where the occupied and non-occupied pixels are distinguished and depicted, respectively. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 303. The padding process 307, to which the present embodiments relate, aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g. 16x16) pixels is processed independently. If the block is empty (i.e. unoccupied, i.e. all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e. occupied, i.e. no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e. an edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
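
The simple padding strategy described above can be sketched as follows. The sketch assumes a single-channel image with an occupancy mask of the same size; blocks are traversed in plain raster order, and only the copy-from-previous-block and neighbor-averaging rules are illustrated.

import numpy as np

def pad_image(img, occupancy, T=16):
    """Fill the empty space between patches block by block (TxT): copy from the
    previous block when a block is fully empty, otherwise iteratively fill empty
    pixels with the average of their non-empty neighbors."""
    out = img.astype(np.float32).copy()
    occ = occupancy.astype(bool).copy()
    H, W = img.shape
    for by in range(0, H, T):
        for bx in range(0, W, T):
            blk = (slice(by, min(by + T, H)), slice(bx, min(bx + T, W)))
            if not occ[blk].any():                       # fully empty block
                if bx >= T:                              # copy last column of previous block
                    out[blk] = out[blk[0], bx - 1:bx]
                elif by >= T:                            # or last row of the block above
                    out[blk] = out[by - 1:by, blk[1]]
                occ[blk] = True
            elif not occ[blk].all():                     # edge block: average non-empty neighbors
                while not occ[blk].all():
                    filled = occ[blk].copy()
                    for y in range(blk[0].start, blk[0].stop):
                        for x in range(blk[1].start, blk[1].stop):
                            if not occ[y, x]:
                                ys = slice(max(y - 1, 0), min(y + 2, H))
                                xs = slice(max(x - 1, 0), min(x + 2, W))
                                if occ[ys, xs].any():
                                    out[y, x] = out[ys, xs][occ[ys, xs]].mean()
                                    filled[y - blk[0].start, x - blk[1].start] = True
                    occ[blk] = filled
    return out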

The padded geometry images and padded texture images may be provided for video compression 308. The generated images/layers may be stored as video frames and compressed using for example the H.265 video codec according to the video codec configurations provided as parameters. The video compression 308 also generates reconstructed geometry images to be provided for smoothing 309, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 302. The smoothed geometry may be provided to texture image generation 305 to adapt the texture images.

The patch may be associated with auxiliary information being encoded/decoded for each patch as metadata. The auxiliary information may comprise the index of the projection plane, the 2D bounding box, and the 3D location of the patch.

For example, the following metadata may be encoded/decoded for every patch:

- index of the projection plane

o Index 0 for the planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0)

o Index 1 for the planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0)

o Index 2 for the planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0)

- 2D bounding box (u0, v0, u1, v1)

- 3D location (x0, y0, z0) of the patch, represented in terms of depth δ0, tangential shift s0 and bitangential shift r0. According to the chosen projection planes, (δ0, s0, r0) may be calculated as follows:

o Index 0: δ0 = x0, s0 = z0 and r0 = y0

o Index 1: δ0 = y0, s0 = z0 and r0 = x0

o Index 2: δ0 = z0, s0 = x0 and r0 = y0

Also, mapping information providing for each TxT block its associated patch index may be encoded as follows:

- For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.

- The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.

- Let I be the index of the patch to which the current TxT block belongs, and let J be the position of I in L. Instead of explicitly coding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.
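
As an illustration of this mapping (not of any normative syntax), the sketch below builds, for one TxT block, the candidate list L from the patches whose 2D bounding box covers the block, with the special index 0 for empty space, and returns the position J that would be arithmetically encoded. For simplicity, only the block's top-left corner is tested against each bounding box.

def candidate_list(block_xy, patches, T=16):
    """patches: list of dicts with 'u0', 'v0', 'u1', 'v1' (2D bounding box),
    in the same order as their bounding boxes were encoded."""
    bx, by = block_xy
    L = [0]  # the empty-space "patch" is always a candidate
    for idx, p in enumerate(patches, start=1):
        if p['u0'] <= bx * T < p['u1'] and p['v0'] <= by * T < p['v1']:
            L.append(idx)
    return L

def index_to_code(block_xy, patch_index, patches, T=16):
    """Return the position J of the block's patch index inside its candidate
    list; J, not the index itself, is what would be arithmetically encoded."""
    return candidate_list(block_xy, patches, T).index(patch_index)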

General V3C parameter set MIV extension syntax is presented below:

The occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. One cell of the 2D grid produces a pixel during the image generation process.

The occupancy map compression 310 leverages the auxiliary information described in the previous section in order to detect the empty TxT blocks (i.e. blocks with patch index 0). The remaining blocks may be encoded as follows: the occupancy map can be encoded with a precision of B0xB0 blocks, where B0 is a configurable parameter. In order to achieve lossless encoding, B0 may be set to 1. In practice, B0=2 or B0=4 results in visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map.

The compression process may comprise one or more of the following example operations:

• Binary values may be associated with the B0xB0 sub-blocks belonging to the same TxT block. A value 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1, it is said to be full; otherwise it is an empty sub-block.

• If all the sub-blocks of a TxT block are full (i.e., have value 1), the block is said to be full. Otherwise, the block is said to be non-full.

• Binary information may be encoded for each TxT block to indicate whether it is full or not.

• If the block is non-full, extra information indicating the location of the full/empty sub-blocks may be encoded as follows:

o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.

o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.

o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy.

■ The binary value of the initial sub-block is encoded.

■ Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder.

■ The number of detected runs is encoded.

■ The length of each run, except for the last one, is also encoded.
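
A minimal sketch of this per-block occupancy coding is shown below. It produces, for one TxT block, the full/non-full flag and, for a non-full block, the chosen traversal order, the initial sub-block value, the number of runs, and all run lengths except the last one. Only row-wise and column-wise traversals are shown, the entropy coder itself is omitted, and the structure and names are illustrative.

import numpy as np

def encode_block_occupancy(block_occ, B0=4):
    """block_occ: TxT boolean occupancy of one block (T assumed divisible by B0).
    Returns a dict with the symbols that would be entropy coded for this block."""
    T = block_occ.shape[0]
    n = T // B0
    sub = np.array([[block_occ[y*B0:(y+1)*B0, x*B0:(x+1)*B0].any()
                     for x in range(n)] for y in range(n)])     # 1 if sub-block is full
    if sub.all():
        return {'full': True}
    traversals = {'horizontal': sub.flatten(), 'vertical': sub.T.flatten()}
    # encoder choice: pick the traversal that yields the fewest runs and signal its index
    best = min(traversals,
               key=lambda k: np.count_nonzero(np.diff(traversals[k].astype(np.int8))) + 1)
    seq = traversals[best]
    runs, count = [], 1
    for a, b in zip(seq[:-1], seq[1:]):                         # run-length encode the sequence
        if a == b:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return {'full': False, 'traversal': best, 'first_value': bool(seq[0]),
            'num_runs': len(runs), 'run_lengths': runs[:-1]}    # last run length is not coded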

An atlas is a collection of 2D bounding boxes, i.e. patches, projected into a rectangular frame that corresponds to a 3D bounding box in 3D space, which may be a subset of a point cloud. The patch in the V-PCC notation is a rectangular region within an atlas, i.e. a collection of information that represents a 3D bounding box of the point cloud and associated geometry and attribute description along with the atlas information that is required to reconstruct the 3D point positions and their corresponding attributes from the 2D projections. An atlas frame may be partitioned into tiles. The partitioned tiles may be presented as one or more tile rows and one or more tile columns. A tile is a rectangular region of an atlas frame. The tiles can further be divided into tile groups. Only rectangular tile groups are supported. In this mode, a tile group contains a number of tiles of an atlas frame that collectively form a rectangular region of the atlas frame.

Figure 4 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 401 receives a compressed bitstream and, after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 402. In addition, the de-multiplexer 401 transmits a compressed occupancy map to occupancy map decompression 403. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 404. Decompressed geometry video from the video decompression 402 is delivered to geometry reconstruction 405, as are the decompressed occupancy map and the decompressed auxiliary patch information. The point cloud geometry reconstruction 405 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.

The reconstructed geometry image may be provided for smoothing 406, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 407, which also receives a decompressed texture video from video decompression 402. The texture reconstruction 407 outputs a reconstructed point cloud. The texture values for the texture reconstruction are directly read from the texture images.

The point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v), and let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:

δ(u, v) = δ0 + g(u, v)

s(u, v) = s0 - u0 + u

r(u, v) = r0 - v0 + v

where g(u, v) is the luma component of the geometry image.

For the texture reconstruction, the texture values can be directly read from the texture images. The result of the decoding process is a 3D point cloud reconstruction.
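
The reconstruction equations above translate directly into the following sketch, which recovers the 3D position of each occupied pixel of one patch from the decoded geometry image and reads its colour from the texture image. The patch dictionary and its 'axes' field (mapping depth, tangential and bitangential shifts onto the x/y/z axes, e.g. [0, 2, 1] for projection index 0) are illustrative assumptions.

import numpy as np

def reconstruct_patch_points(geometry, texture, occupancy, patch):
    """geometry: decoded geometry image (luma = g(u, v)); texture: decoded texture
    image; occupancy: boolean map; patch: dict with d0, s0, r0, u0, v0, u1, v1, axes."""
    points, colors = [], []
    for v in range(patch['v0'], patch['v1']):
        for u in range(patch['u0'], patch['u1']):
            if not occupancy[v, u]:
                continue
            d = patch['d0'] + int(geometry[v, u])      # delta(u, v) = delta0 + g(u, v)
            s = patch['s0'] - patch['u0'] + u          # s(u, v) = s0 - u0 + u
            r = patch['r0'] - patch['v0'] + v          # r(u, v) = r0 - v0 + v
            xyz = np.empty(3)
            xyz[list(patch['axes'])] = (d, s, r)       # place (d, s, r) on the x/y/z axes
            points.append(xyz)
            colors.append(texture[v, u])
    return np.array(points), np.array(colors)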

One way to compress a time-varying volumetric scene/object is to project 3D surfaces on to some number of pre-defined 2D planes. Regular 2D video compression algorithms can then be used to compress various aspects of the projected surfaces. Such projection is presented using different patches. Each set of patches may represent a specific object or specific parts of a scene. One or more patches may create a tile and it is possible to create a group of tiles, including one or more tiles.

Visual volumetric video-based Coding (V3C, sometimes also 3VC) relates to a core part shared between ISO/IEC 23090-5 (formerly V-PCC (Video-based Point Cloud Compression)) and ISO/IEC 23090-12 (formerly MIV (MPEG Immersive Video)). V3C will not be issued as a separate document, but as part of ISO/IEC 23090-5 (expected to include clauses 1-8 of the current V-PCC text). ISO/IEC 23090-12 will refer to this common part. ISO/IEC 23090-5 is expected to be renamed to V3C PCC, and ISO/IEC 23090-12 to V3C MIV. A vpcc_unit consists of header and payload pairs. Below is the syntax for the vpcc_unit and vpcc_unit_header structures.

General V-PCC unit syntax is presented below:

The following table represents a general V-PCC unit header syntax:

A general VPCC unit payload syntax is presented below:

The present embodiments are targeted at packing information on the patches and/or atlas images in an atlas image or a tile. Such efficient packing improves the compression efficiency of the codec, allowing the same number of patches to be encoded with a smaller number of bits. It is to be noticed that the specification of V3C MIV comprises a parameter vme_packed_video_present_flag, which enables packing of some general information of each atlas. vme_packed_video_present_flag[ j ] equal to 0 indicates that the atlas with ID j does not have packed data. vps_packed_video_present_flag[ j ] equal to 1 indicates that the atlas with ID j has packed data. When vps_packed_video_present_flag[ j ] is not present, it is inferred to be equal to 0.

However, no signalling for the patch information for packing is presented or introduced in the specification.

Thus, the present embodiments target introducing a set of parameters to be included as packed information to share specific characteristics of the patches and/or atlas image in the atlas image or tile. Such information is similar or identical among the patches and/or atlas images and hence, there is no need to send it separately for each patch and/or atlas image.

The present embodiments consider that at least two patches are present in the current atlas. The present embodiments introduce a set of parameters to be included in the packed information for each atlas image, where the parameters belong to the patches and/or atlas images and are similar between them.

The present embodiments have advantages, since they reduce the amount of overhead sent for each atlas image or group of images. The packed information may share the same information on patches among all patches that have similar criteria. Similarly, the packed information may share the same information on atlas images among all atlas images that have similar criteria.

Therefore, the atlas bitstream, according to an embodiment, may contain a signal that informs the decoder whether or not the packed information exists, wherein the packed information is packed patch information or packed atlas image information. There may be a signal (e.g. a flag) indicating existence of packed information for patches and another signal (e.g. another flag) indicating existence of packed information for atlas images for a group of pictures.

The flag indicating the similar criteria for atlas images for a group of pictures may be assigned for a specific number of atlas images, e.g. 16 or 32. Alternatively, it may follow the number of images included in an encoding GOP decided at the encoder side. Alternatively, it can be adaptive to the content, meaning that it can continue until a certain criterion is met. The criterion may be a sudden change in the content, radical rotation of an object, a big change in the illumination of the scene, an object entering or exiting the scene, scene cuts, etc.

The signal will be received at the decoder side, and the respective packed information will be used for the respective patches or tile images.
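
As a sketch of these two indications, the helper below emits one flag for packed patch information and another for packed atlas-image information of a group of pictures, followed by the corresponding payloads only when the flags are set. The flag names and the list-of-symbols representation are illustrative assumptions, not the normative syntax.

def packed_info_indications(patch_packed_info, atlas_packed_info):
    """Return the ordered symbols an encoder would signal in or along the bitstream;
    a decoder reads the two flags first and then only the payloads that are present."""
    symbols = [
        ('patch_packed_info_present_flag', int(patch_packed_info is not None)),
        ('atlas_packed_info_present_flag', int(atlas_packed_info is not None)),
    ]
    if patch_packed_info is not None:
        symbols.append(('patch_packed_info', patch_packed_info))
    if atlas_packed_info is not None:
        symbols.append(('atlas_packed_info', atlas_packed_info))
    return symbols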

According to an embodiment, the packed information presented in the V3C MIV standard includes some of the information related to patches/atlas images to be shared among all or a specific number of patches/atlas images, e.g. the rotation or width of patches to be shared. Therefore, in the packed information, there may be a list of patch entityIDs that share one or more specific criteria. Alternatively, it may be assigned that one or more specific criteria are shared among some patches, which are presented with their patch entityIDs.
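
As an illustration of this grouping idea (not of any normative syntax), the sketch below collects the patches that share the same values of a chosen set of parameters, e.g. rotation and width, and emits one packed-information entry listing their entity IDs together with the shared values, so that those values are sent only once. All names are illustrative.

from collections import defaultdict

def pack_shared_patch_info(patches, shared_keys=('rotation', 'width')):
    """patches: list of dicts, each with an 'entity_id' and per-patch parameters.
    Returns packed-information entries: the shared parameter values plus the list
    of entity IDs of the patches that share them."""
    groups = defaultdict(list)
    for p in patches:
        key = tuple(p[k] for k in shared_keys)
        groups[key].append(p['entity_id'])
    return [{'shared': dict(zip(shared_keys, key)), 'entity_ids': ids}
            for key, ids in groups.items() if len(ids) > 1]   # only worth sharing among >1 patches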

It is appreciated that the present embodiments are not limited to patches, and the packed information may belong to tiles or, in general, to atlas images as well. When the packed information belongs to tiles, the packed information is defined for two or more tiles in an atlas image. When the packed information belongs to atlas images, then the packed information is defined per sequence of atlas images, e.g. one GOP (group of pictures).

According to an embodiment, the packed information proposed in the present disclosure only belongs to patches. In such an embodiment, the packed information includes, but is not limited to, the following (the syntax elements which have a descriptor assigned to them):

General Patch data unit syntax is presented below:

General RAW patch data unit syntax is presented below:

General Patch data unit MIV extension syntax is presented below:

According to another embodiment, the packed information included may represent the atlas image information which is packed to be shared between a series of atlas images. In such an embodiment, the atlas image packed information should be communicated prior to the first atlas image included in the list of atlas images with packed information. Thus, in such an embodiment, the packed information includes, but is not limited to, the following: atgdu_patch_mode[ p ]

According to another embodiment, the atlas images may share similar tile information. In this embodiment, the packed information includes, but is not limited to, the following (the syntax elements which have a descriptor assigned to them):

General Supplemental enhancement information message syntax is presented below:

According to an embodiment, the aforementioned parameters will be included in the respective presentation of packed information, i.e. at atlas level (for patch information) or at sequence-of-pictures level (for atlas image information).

According to another embodiment, it is clarified which atlas images are going to share the respective packed atlas image information, or similarly, which patches are going to share the respective packed patch information. This means that, at the beginning of the packed information for the sequence of images (or atlas images), a list of atlas image IDs is introduced and then the atlas image information that is to be shared by all such atlas images is listed; or similarly, at the beginning of the packed information for patches, a list of patch IDs is introduced and then the patch information that is to be shared by all such patches is listed.

An example embodiment of the introduced syntax elements on packed information is the following:

New syntax element pi_ref_atlas_id [ j ] indicates the reference atlas to get packed information from. If pi_ref_atlas_id [ j ] is equal to j then new packed information is signalled. Otherwise, packed information is copied from the packed information for the atlas with the indicated index.

New syntax element pi_ref_region_id [ i ] indicates the reference region to get packed information from. If pi_ref_region_id [ i ] is equal to i then new packed information is signalled. Otherwise, packed information is copied from the packed information for the region with the indicated region id.
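
The copy-or-signal behaviour of these two syntax elements could be handled on the decoder side along the lines of the following sketch; the surrounding data structures are illustrative assumptions, and only the atlas-level element is shown (the region-level element would behave analogously).

def resolve_packed_info(atlas_id, pi_ref_atlas_id, signalled_info, packed_info_store):
    """If pi_ref_atlas_id equals the current atlas id, new packed information is
    signalled and stored; otherwise it is copied from the atlas with the indicated index."""
    if pi_ref_atlas_id == atlas_id:
        packed_info_store[atlas_id] = signalled_info
    else:
        packed_info_store[atlas_id] = packed_info_store[pi_ref_atlas_id]
    return packed_info_store[atlas_id]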

It is appreciated that the embodiments are presented above with reference to particular syntax elements. It needs to be understood that the naming of the presented syntax elements is given as an example, and embodiments can be formed similarly by alternative naming.

The method according to an embodiment is shown in Figure 5. The method generally comprises receiving 510 as an input a volumetric video frame comprising volumetric content; projecting 520 the volumetric content to at least two patches in a temporal order, wherein a patch comprises a volumetric video data component; creating 530 atlas images including at least two patches, wherein each atlas image includes the patches from the same temporal projection; packing 540 the information on several patches and/or atlas images, wherein the packed information comprises a set of parameters relating to said patches and/or atlas images to be shared among more than one patches/atlas images; signaling 550, in or along a bitstream, at least an indication that the atlas images comprise packed information; and transmitting 560 the encoded bitstream to a storage for rendering. Each of the steps can be implemented by a respective module of a computer system.

An apparatus according to an embodiment comprises means for receiving as an input a volumetric video frame comprising volumetric content; means for projecting the volumetric content to at least two patches in a temporal order, wherein a patch comprises a volumetric video data component; means for creating atlas images including at least two patches, wherein each atlas image includes the patches from the same temporal projection; means for packing the information on several patches and/or atlas images, wherein the packed information comprises a set of parameters relating to said patches and/or atlas images to be shared among more than one patches/atlas images; means for signaling, in or along a bitstream, at least an indication that the atlas images comprise packed information; and means for transmitting the encoded bitstream to a storage for rendering. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 5 according to various embodiments.

The various embodiments may provide advantages. For example, the present embodiments enable reducing the required bitrate to signal the patch information. In addition, the present embodiments enable reducing the required bitrate to signal the atlas image information.

The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment. The computer program code comprises one or more operational characteristics. Said operational characteristics are being defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises receiving as an input a volumetric video frame comprising volumetric content; projecting the volumetric content to at least two patches in a temporal order, wherein a patch comprises a volumetric video data component; creating atlas images including at least two patches, wherein each atlas image includes the patches from the same temporal projection; packing the information on several patches and/or atlas images, wherein the packed information comprises a set of parameters relating to said patches and/or atlas images to be shared among more than one patches/atlas images; signaling, in or along a bitstream, at least an indication that the atlas images comprise packed information; and transmitting the encoded bitstream to a storage for rendering.

A computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.