

Title:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND VIDEO DECODING
Document Type and Number:
WIPO Patent Application WO/2023/047021
Kind Code:
A2
Abstract:
The embodiments relate to a method for encoding and decoding. The method for encoding comprises receiving a volumetric visual object being defined with a mesh representing a surface and a set of interconnected parts; deriving a plurality of parts of the mesh which projection planes are aligned to; identifying the most relevant part from said plurality of parts for projection alignment; identifying most relevant surface normal of said most relevant part; updating an object rotation for every frame by aligning the identified most relevant surface normal in parallel to a surface normal of a projection plane; generating a bitstream containing information on object's alignment with a most relevant surface normal to be signaled to a decoder.

Inventors:
SCHWARZ SEBASTIAN (DE)
BACHHUBER CHRISTOPH (DE)
RONDAO ALFACE PATRICE (BE)
KONDRAD LUKASZ (DE)
ILOLA LAURI (FI)
MARTEMIANOV ALEKSEI (FI)
Application Number:
PCT/FI2022/050637
Publication Date:
March 30, 2023
Filing Date:
September 22, 2022
Assignee:
NOKIA TECHNOLOGIES OY (FI)
Attorney, Agent or Firm:
BERGGREN OY (FI)

Claims:

1. An apparatus comprising:

- means for receiving a sequence of volumetric video frames comprising a volumetric visual object being defined with a mesh representing a surface and a set of interconnected clusters;

- means for deriving a plurality of clusters of the mesh which projection planes are aligned to;

- means for identifying the most relevant clusters from said plurality of clusters for projection alignment;

- means for identifying most relevant surface normals of said most relevant clusters;

- means for updating the most relevant surface normals in consecutive frames;

- means for rotating the volumetric video object to align the most relevant surface normals with a surface normal of a projection plane; and

- means for generating a bitstream containing information on the projection of the volumetric visual object and on object’s rotation parameters, said bitstream to be signaled to a decoder.

2. The apparatus according to claim 1, wherein the plurality of relevant clusters is derived by determining when a set of connected vertices connects to one or more clusters, said set of connected vertices is added to such a cluster, and when a set of connected vertices does not connect to any cluster, the apparatus comprises means for generating a new cluster.

3. The apparatus according to claim 1 or 2, wherein the most relevant clusters are identified by one or more of the following: a) selecting a cluster with the largest number of connected faces overall; b) selecting a cluster with the largest number of UV map pixels covered; c) selecting a cluster with the highest/lowest density of faces; d) selecting a cluster with the highest/lowest variance in UV map texture; e) selecting a cluster with the highest/lowest variance in 3D vertex locations; f) selecting a cluster with the lowest variance in surface normals; g) selecting a cluster based on a certain UV map texture distribution; h) selecting a cluster previously identified in a preceding frame.

4. The apparatus according to any of the previous claims 1 to 3, further comprising means for indicating a presence of rotation parameters to align with a most relevant surface normal in the bitstream.

5. The apparatus according to any of the previous claims 1 to 4, further comprising means for indicating in the bitstream coordinate components for the rotation to align with the most relevant surface normal.

6. A method, comprising:

- receiving a sequence of volumetric video frames comprising a volumetric visual object being defined with a mesh representing a surface and a set of interconnected clusters;

- deriving a plurality of clusters of the mesh which projection planes are aligned to;

- identifying the most relevant clusters from said plurality of clusters for projection alignment;

- identifying most relevant surface normals of said most relevant clusters;

- updating the most relevant surface normals in consecutive frames;

- rotating the volumetric video object to align the most relevant surface normals with a surface normal of a projection plane; and

- generating a bitstream containing information on the projection of the volumetric visual object and on object’s rotation parameters, said bitstream to be signaled to a decoder.

7. The method according to claim 6, wherein the plurality of relevant clusters is derived by determining when a set of connected vertices connects to one or more clusters, said set of connected vertices is added to such cluster, and when a set of connected vertices does not connect to any cluster, the method comprises generating a new cluster.

8. The method according to claim 6 or 7, wherein the most relevant clusters are identified by one or more of the following: a) selecting a cluster with the largest number of connected faces overall; b) selecting a cluster with the largest number of UV map pixels covered; c) selecting a cluster with the highest/lowest density of faces; d) selecting a cluster with the highest/lowest variance in UV map texture; e) selecting a cluster with the highest/lowest variance in 3D vertex locations; f) selecting a cluster with the lowest variance in surface normals; g) selecting a cluster based on a certain UV map texture distribution; h) selecting a cluster previously identified in a preceding frame.

9. The method according to any of the previous claims 6 to 8, further comprising indicating a presence of rotation parameters to align with a most relevant surface normal in the bitstream.

10. The method according to any of the previous claims 6 to 9, further comprising indicating in the bitstream coordinate components for the rotation to align with the most relevant surface normal.

11. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive a sequence of volumetric video frames comprising a volumetric visual object being defined with a mesh representing a surface and a set of interconnected clusters;

- derive a plurality of clusters of the mesh which projection planes are aligned to;

- identify the most relevant clusters from said plurality of clusters for projection alignment;

- identify most relevant surface normals of said most relevant clusters;

- update the most relevant surface normals in consecutive frames;

- rotate the volumetric video object to align the most relevant surface normals with a surface normal of a projection plane; and

- generate a bitstream containing information on the projection of the volumetric visual object and on object’s rotation parameters, said bitstream to be signaled to a decoder.

12. The apparatus according to claim 11, wherein for deriving the plurality of relevant clusters the apparatus is further caused to determine when a set of connected vertices connects to one or more clusters, the apparatus is caused to add said set of connected vertices to such cluster; and to determine when a set of connected vertices does not connect to any cluster, the apparatus is caused to generate a new cluster.

13. The apparatus according to claim 11 or 12, wherein the most relevant clusters are identified by one or more of the following: a) selecting a cluster with the largest number of connected faces overall; b) selecting a cluster with the largest number of UV map pixels covered; c) selecting a cluster with the highest/lowest density of faces; d) selecting a cluster with the highest/lowest variance in UV map texture; e) selecting a cluster with the highest/lowest variance in 3D vertex locations; f) selecting a cluster with the lowest variance in surface normals; g) selecting a cluster based on a certain UV map texture distribution; h) selecting a cluster previously identified in a preceding frame.

14. The apparatus according to any of the previous claims 11 to 13, further comprising computer program code configured to cause the apparatus to indicate a presence of rotation parameters to align with a most relevant surface normal in the bitstream.

15. The apparatus according to any of the previous claims 11 to 14, further comprising computer program code configured to cause the apparatus to indicate in the bitstream coordinate components for the rotation to align with the most relevant surface normal.

16. A computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:

- receive a sequence of volumetric video frames comprising a volumetric visual object being defined with a mesh representing a surface and a set of interconnected clusters;

- derive a plurality of clusters of the mesh which projection planes are aligned to;

- identify the most relevant clusters from said plurality of clusters for projection alignment;

- identify most relevant surface normals of said most relevant clusters;

- update the most relevant surface normals in consecutive frames;

- rotate the volumetric video object to align the most relevant surface normals with a surface normal of a projection plane; and

- generate a bitstream containing information on the projection of the volumetric visual object and on object’s rotation parameters, said bitstream to be signaled to a decoder.

17. An apparatus comprising:

- means for receiving a bitstream containing information on the projection of the volumetric visual object and on object’s rotation parameters to align with a most relevant surface normal, said most relevant surface normal having been identified from a most relevant part of a mesh for projection alignment;

- means for reconstructing a volumetric visual object according to received information on object’s alignment;

- means for determining an object rotation by aligning the most relevant surface normal in parallel to a surface normal of a projection plane; and

- means for signalling the volumetric visual object to a renderer to be rendered with an applied rotation.

18. A method comprising

- receiving a bitstream containing information on the projection of the volumetric visual object and on object’s rotation parameters to align with a most relevant surface normal, said most relevant surface normal having been identified from a most relevant part of a mesh for projection alignment;

- reconstructing a volumetric visual object according to received information on object’s alignment;

- determining an object rotation by aligning the most relevant surface normal in parallel to a surface normal of a projection plane; and

- signalling the volumetric visual object to a renderer to be rendered with an applied rotation.

19. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive a bitstream containing information on the projection of the volumetric visual object and on object’s rotation parameters to align with a most relevant surface normal, said most relevant surface normal having been identified from a most relevant part of a mesh for projection alignment;

- reconstruct a volumetric visual object according to received information on object’s alignment;

- determine an object rotation by aligning the most relevant surface normal in parallel to a surface normal of a projection plane; and

- signal the volumetric visual object to a renderer to be rendered with an applied rotation.

20. A computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:

- receive a bitstream containing information on the projection of the volumetric visual object and on object’s rotation parameters to align with a most relevant surface normal, said most relevant surface normal having been identified from a most relevant part of a mesh for projection alignment;

- reconstruct a volumetric visual object according to received information on object’s alignment;

- determine an object rotation by aligning the most relevant surface normal in parallel to a surface normal of a projection plane; and

- signal the volumetric visual object to a renderer to be rendered with an applied rotation.

Description:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND VIDEO DECODING

Technical Field

The present solution generally relates to volumetric video data encoding and decoding.

Background

Volumetric video data represents a three-dimensional (3D) scene or object, and can be used as input for AR (Augmented Reality), VR (Virtual Reality), and MR (Mixed Reality) applications. Such data describes geometry (shape, size, position in 3D space) and respective attributes (e.g., color, opacity, reflectance, ...), and any possible temporal transformations of the geometry and attributes at given time instances (like frames in 2D video). Volumetric video can be generated from 3D models, also referred to as volumetric visual objects, i.e., CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g., multi-camera, laser scan, combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Examples of representation formats for volumetric data comprise triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e., “frames” in 2D video, or other means, e.g., position of an object as a function of time.

Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR or MR applications, especially for providing 6DOF viewing capabilities.

Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight, and structured light are examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real-world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is to code this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.

Summary

The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.

Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.

According to a first aspect, there is provided an apparatus comprising means for receiving a volumetric visual object being defined with a mesh representing a surface and a set of interconnected parts; means for deriving a plurality of parts of the mesh which projection planes are aligned to; means for identifying the most relevant part from said plurality of parts for projection alignment; means for identifying most relevant surface normal of said most relevant part; means for updating an object rotation for every frame by aligning the identified most relevant surface normal in parallel to a surface normal of a projection plane; means for generating a bitstream containing information on object’s alignment with a most relevant surface normal to be signaled to a decoder.

According to a second aspect, there is provided a method, comprising receiving a volumetric visual object being defined with a mesh representing a surface and a set of interconnected parts; deriving a plurality of parts of the mesh which projection planes are aligned to; identifying the most relevant part from said plurality of parts for projection alignment; identifying most relevant surface normal of said most relevant part; updating an object rotation for every frame by aligning the identified most relevant surface normal in parallel to a surface normal of a projection plane; and generating a bitstream containing information on object’s alignment with a most relevant surface normal to be signaled to a decoder.

According to a third aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive a volumetric visual object being defined with a mesh representing a surface and a set of interconnected parts; derive a plurality of parts of the mesh which projection planes are aligned to; identify the most relevant part from said plurality of parts for projection alignment; identify most relevant surface normal of said most relevant part; update an object rotation for every frame by aligning the identified most relevant surface normal in parallel to a surface normal of a projection plane; and generate a bitstream containing information on object’s alignment with a most relevant surface normal to be signaled to a decoder.

According to a fourth aspect, there is provided computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive a volumetric visual object being defined with a mesh representing a surface and a set of interconnected parts; derive a plurality of parts of the mesh which projection planes are aligned to; identify the most relevant part from said plurality of parts for projection alignment; identify most relevant surface normal of said most relevant part; update an object rotation for every frame by aligning the identified most relevant surface normal in parallel to a surface normal of a projection plane; and generate a bitstream containing information on object’s alignment with a most relevant surface normal to be signaled to a decoder.

According to a fifth aspect, there is provided an apparatus comprising means for receiving a bitstream containing information on object's alignment with a most relevant surface normal, said most relevant surface normal having been identified from a most relevant part of a mesh for projection alignment; means for reconstructing a volumetric visual object according to received information on object's alignment; means for determining an object rotation by aligning the most relevant surface normal in parallel to a surface normal of a projection plane; and means for signalling the volumetric visual object to a renderer to be rendered with an applied rotation.

According to a sixth aspect, there is provided a method comprising receiving a bitstream containing information on object's alignment with a most relevant surface normal, said most relevant surface normal having been identified from a most relevant part of a mesh for projection alignment; reconstructing a volumetric visual object according to received information on object's alignment; determining an object rotation by aligning the most relevant surface normal in parallel to a surface normal of a projection plane; and signalling the volumetric visual object to a renderer to be rendered with an applied rotation.

According to a seventh aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive a bitstream containing information on object's alignment with a most relevant surface normal, said most relevant surface normal having been identified from a most relevant part of a mesh for projection alignment; reconstruct a volumetric visual object according to received information on object's alignment; determine an object rotation by aligning the most relevant surface normal in parallel to a surface normal of a projection plane; and signal the volumetric visual object to a renderer to be rendered with an applied rotation.

According to an eighth aspect, there is provided computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive a bitstream containing information on object's alignment with a most relevant surface normal, said most relevant surface normal having been identified from a most relevant part of a mesh for projection alignment; reconstruct a volumetric visual object according to received information on object's alignment; determine an object rotation by aligning the most relevant surface normal in parallel to a surface normal of a projection plane; and signal the volumetric visual object to a renderer to be rendered with an applied rotation.

According to an embodiment, the plurality of relevant parts is derived by determining when a set of connected vertices connects to one or more parts, said set of connected vertices is added to such part, and when a set of connected vertices does not connect to any part, the apparatus comprises means for generating a new part.

According to an embodiment, a relevant part is a cluster, whereupon the most relevant clusters are identified by one or more of the following: a) selecting a cluster with the largest number of connected faces overall; b) selecting a cluster with the largest number of UV map pixels covered; c) selecting a cluster with the highest/lowest density of faces; d) selecting a cluster with the highest/lowest variance in UV map texture; e) selecting a cluster with the highest/lowest variance in 3D vertex locations; f) selecting a cluster with the lowest variance in surface normals; g) selecting a cluster based on a certain UV map texture distribution; h) selecting a cluster previously identified in a preceding frame.

According to an embodiment, a presence of rotation parameters to align with a most relevant surface normal is indicated in a bitstream.

According to an embodiment, coordinate components for the rotation to align with the most relevant surface normal are indicated.

According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.

Description of the Drawings

In the following, various embodiments will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows an example of a compression process of a volumetric video;

Fig. 2 shows an example of a de-compression of a volumetric video;

Fig. 3a shows an example of volumetric media conversion at an encoder; Fig. 3b shows an example of volumetric media reconstruction at a decoder;

Fig. 4 shows an example of block to patch mapping;

Fig. 5a shows an example of an atlas coordinate system;

Fig. 5b shows an example of a local 3D patch coordinate system;

Fig. 5c shows an example of a final target 3D coordinate system;

Fig. 6 shows a V-PCC extension for mesh encoding;

Fig. 7 shows a V-PCC extension for mesh decoding;

Fig. 8 shows example meshes and their white bounding boxes;

Fig. 9 shows an example of a UV map of sequence Mitch with main projection directions per vertex;

Fig. 10 shows an example of surface normal distribution of UV map patches;

Figs. 11a, 11b are flowcharts illustrating methods according to embodiments; and

Fig. 12 shows an apparatus according to an embodiment.

Description of Example Embodiments

The following description and drawings are illustrative and are not to be construed as unnecessarily limiting. The specific details are provided for a thorough understanding of the disclosure. However, in certain instances, well-known or conventional details are not described in order to avoid obscuring the description. References to one or an embodiment in the present disclosure can be, but not necessarily are, references to the same embodiment and such references mean at least one of the embodiments.

Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure.

In the following, a short reference of ISO/IEC DIS 23090-5 Visual Volumetric Video-based Coding (V3C) and Video-based Point Cloud Compression (V-PCC) 2nd Edition is given. Visual volumetric video comprising a sequence of visual volumetric frames, if uncompressed, may be represented by a large amount of data, which can be costly in terms of storage and transmission. This has led to the need for a high coding efficiency standard for the compression of visual volumetric data.

Figure 1 illustrates an overview of an example of a compression process of a volumetric video. Such process may be applied for example in MPEG Point Cloud Coding (PCC). The process starts with an input point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105.

The patch generation 102 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. For patch generation, the normal at every point can be estimated. An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:

- (1.0, 0.0, 0.0),

- (0.0, 1.0, 0.0),

- (0.0, 0.0, 1.0),

- (-1.0, 0.0, 0.0),

- (0.0, -1.0, 0.0), and

- (0.0, 0.0, -1.0)

More precisely, each point may be associated with the plane that has the closest normal (i.e., maximizes the dot product of the point normal and the plane normal).

The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The final step may comprise extracting patches by applying a connected component extraction procedure.
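For illustration, a minimal sketch of this initial assignment step is given below. It assumes per-point unit normals are already available as a NumPy array (the normal estimation step itself is not shown) and simply picks, for each point, the oriented plane whose normal maximizes the dot product with the point normal.

import numpy as np

# The six oriented projection plane normals listed above.
PLANE_NORMALS = np.array([
    [ 1.0,  0.0,  0.0],
    [ 0.0,  1.0,  0.0],
    [ 0.0,  0.0,  1.0],
    [-1.0,  0.0,  0.0],
    [ 0.0, -1.0,  0.0],
    [ 0.0,  0.0, -1.0],
])

def initial_clustering(point_normals):
    """Assign each point to the plane with the closest normal.

    point_normals: (N, 3) array of unit surface normals.
    Returns an (N,) array of cluster indices in [0, 5].
    """
    # The plane maximizing the dot product is the closest in orientation.
    scores = point_normals @ PLANE_NORMALS.T   # shape (N, 6)
    return np.argmax(scores, axis=1)

The refinement and connected-component extraction described above would then operate on the returned cluster indices.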

Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to packing process 103, to geometry image generation 104 and to texture image generation 105. The packing process 103 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g., 16x16) block of the grid is associated with a unique patch. It should be noticed that T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder.

The simple packing strategy used here iteratively tries to insert patches into a WxH grid. W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid may be temporarily doubled, and the search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
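A rough sketch of this first-fit raster-scan packing is shown below. It is purely illustrative: patch footprints are simplified to rectangular block footprints, each patch is assumed to fit within the grid width W, and the final clipping of H is omitted.

def pack_patches(patch_sizes, W, H):
    """Place patches, given as (w, h) block sizes, on a WxH block grid.

    Returns a list of (u0, v0) positions; the grid height is doubled
    whenever a patch does not fit, mirroring the strategy above.
    """
    used = [[False] * W for _ in range(H)]
    positions = []
    for (w, h) in patch_sizes:
        placed = False
        while not placed:
            for v0 in range(len(used) - h + 1):          # raster scan order
                for u0 in range(W - w + 1):
                    if all(not used[v0 + j][u0 + i]
                           for j in range(h) for i in range(w)):
                        for j in range(h):                # mark cells as used
                            for i in range(w):
                                used[v0 + j][u0 + i] = True
                        positions.append((u0, v0))
                        placed = True
                        break
                if placed:
                    break
            if not placed:                                # no space: double H
                used.extend([[False] * W for _ in range(len(used))])
    return positions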

The geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images respectively. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called a near layer, stores the point of H(u, v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:

• Geometry: WxH YUV420-8bit,

• Texture: WxH YUV420-8bit.

It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.

The geometry images and the texture images may be provided to image padding 107. The image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images. The occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 103.

The padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g., 16x16) pixels is compressed independently. If the block is empty (i.e., unoccupied, i.e., all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e., occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e., edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
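The edge-block rule can be sketched as follows; this is an illustrative simplification that only covers blocks containing both occupied and empty pixels (the copy-from-previous-block rule for fully empty blocks is not shown).

import numpy as np

def pad_edge_block(block, occupied):
    """Fill unoccupied pixels of a TxT block with the average of their
    occupied 4-neighbours, repeating until every pixel is filled.

    block: (T, T) array of sample values; occupied: (T, T) boolean array.
    """
    block = np.asarray(block, dtype=np.float64).copy()
    occupied = np.asarray(occupied, dtype=bool).copy()
    T = block.shape[0]
    while not occupied.all():
        for y in range(T):
            for x in range(T):
                if occupied[y, x]:
                    continue
                vals = [block[ny, nx]
                        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                        if 0 <= ny < T and 0 <= nx < T and occupied[ny, nx]]
                if vals:
                    # Iterative fill with the average of filled neighbours.
                    block[y, x] = sum(vals) / len(vals)
                    occupied[y, x] = True
    return block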

The padded geometry images and padded texture images may be provided for video compression 108. The generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters. The video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102. The smoothed geometry may be provided to texture image generation 105 to adapt the texture images.

The patch may be associated with auxiliary information being encoded/decoded for each patch as metadata. The auxiliary information may comprise the index of the projection plane, the 2D bounding box, and the 3D location of the patch.

For example, the following metadata may be encoded/decoded for every patch:

- index of the projection plane
  o Index 0 for the planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0)
  o Index 1 for the planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0)
  o Index 2 for the planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0)

- 2D bounding box (u0, v0, u1, v1)

- 3D location (x0, y0, z0) of the patch represented in terms of depth δ0, tangential shift s0 and bitangential shift r0. According to the chosen projection planes, (δ0, s0, r0) may be calculated as follows:
  o Index 0: δ0 = x0, s0 = z0 and r0 = y0
  o Index 1: δ0 = y0, s0 = z0 and r0 = x0
  o Index 2: δ0 = z0, s0 = x0 and r0 = y0

Also, mapping information providing for each TxT block its associated patch index may be encoded as follows:

- For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.

- The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.

- Let I be the index of the patch to which the current TxT block belongs, and let J be the position of I in L. Instead of explicitly coding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.

The occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. One cell of the 2D grid produces a pixel during the image generation process.

The occupancy map compression 110 leverages the auxiliary information described in the previous section, in order to detect the empty TxT blocks (i.e., blocks with patch index 0). The remaining blocks may be encoded as follows: The occupancy map can be encoded with a precision of B0xB0 blocks, where B0 is a configurable parameter. In order to achieve lossless encoding, B0 may be set to 1. In practice, B0=2 or B0=4 results in visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map.

The compression process may comprise one or more of the following example operations:

• Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block. A value of 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1, it is said to be full; otherwise it is an empty sub-block.

• If all the sub-blocks of a TxT block are full (i.e., have value 1), the block is said to be full. Otherwise, the block is said to be non-full.

• A binary value may be encoded for each TxT block to indicate whether it is full or not.

• If the block is non-full, extra information indicating the location of the full/empty sub-blocks may be encoded as follows:
  o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.
  o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.
  o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy:
    ■ The binary value of the initial sub-block is encoded.
    ■ Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder.
    ■ The number of detected runs is encoded.
    ■ The length of each run, except the last one, is also encoded.
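As an illustration of the run-length step, the sketch below encodes the sub-block values of one block after they have been ordered according to the chosen traversal; the arithmetic coding of the resulting values is outside the scope of this sketch.

def encode_subblock_runs(subblock_values):
    """Run-length encode a list of binary sub-block values that follow
    the chosen traversal order.

    Returns (initial_value, number_of_runs, run_lengths), where the
    length of the last run is omitted as described above.
    """
    runs = []
    current, length = subblock_values[0], 1
    for v in subblock_values[1:]:
        if v == current:
            length += 1
        else:
            runs.append(length)
            current, length = v, 1
    runs.append(length)
    # The last run length is not encoded; it is implied by the block size.
    return subblock_values[0], len(runs), runs[:-1]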

Figure 2 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 201 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 202. In addition, the de-multiplexer 201 transmits the compressed occupancy map to occupancy map decompression 203. It may also transmit compressed auxiliary patch information to auxiliary patch-info compression 204. Decompressed geometry video from the video decompression 202 is delivered to geometry reconstruction 205, as are the decompressed occupancy map and decompressed auxiliary patch information. The point cloud geometry reconstruction 205 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.

The reconstructed geometry image may be provided for smoothing 206, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 207, which also receives a decompressed texture video from video decompression 202. The texture reconstruction 207 outputs a reconstructed point cloud. The texture values for the texture reconstruction are directly read from the texture images.

The point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v) and let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:

δ(u, v) = δ0 + g(u, v)
s(u, v) = s0 - u0 + u
r(u, v) = r0 - v0 + v

where g(u, v) is the luma component of the geometry image.

For the texture reconstruction, the texture values can be directly read from the texture images. The result of the decoding process is a 3D point cloud reconstruction.
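Written as code, the per-pixel reconstruction above amounts to the following illustrative helper; variable names mirror the formulas (d0 stands for δ0), and the re-mapping of (depth, tangential, bi-tangential) back to X, Y, Z according to the projection plane index is not shown.

def reconstruct_point(u, v, g_uv, d0, s0, r0, u0, v0):
    """Reconstruct the (depth, tangential, bi-tangential) coordinates of
    the point projected to pixel (u, v) of a patch.

    g_uv is the luma sample of the geometry image at (u, v); (d0, s0, r0)
    is the 3D location of the patch and (u0, v0) its 2D bounding box origin.
    """
    depth = d0 + g_uv            # delta(u, v) = delta0 + g(u, v)
    tangential = s0 - u0 + u     # s(u, v)
    bitangential = r0 - v0 + v   # r(u, v)
    return depth, tangential, bitangential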

Visual volumetric video-based Coding (V3C) relates to a core part shared between ISO/IEC 23090-5 (formerly V-PCC (Video-based Point Cloud Compression)) and ISO/IEC 23090-12 (formerly MIV (MPEG Immersive Video)). V3C will not be issued as a separate document, but as part of ISO/IEC 23090-5 (expected to include clauses 1-8 of the current V-PCC text). ISO/IEC 23090-12 will refer to this common part. ISO/IEC 23090-5 will be renamed to V3C PCC, ISO/IEC 23090-12 renamed to V3C MIV.

V3C enables the encoding and decoding processes of a variety of volumetric media by using video and image coding technologies. This is achieved through first a conversion of such media from their corresponding 3D representation to multiple 2D representations, also referred to as V3C video components, before coding such information. Such representations may include occupancy, geometry, and attribute components. The occupancy component can inform a V3C decoding and/or rendering system of which samples in the 2D components are associated with data in the final 3D representation. The geometry component contains information about the precise location of 3D data in space, while attribute components can provide additional properties, e.g., texture or material information, of such 3D data. An example is shown in Figures 3a and 3b, where Figure 3a presents volumetric media conversion at an encoder, and where Figure 3b presents volumetric media reconstruction at a decoder side. The 3D media is converted to a series of 2D representations: occupancy 301 , geometry 302, and attributes 303. Additional information may also be included in the bitstream to enable inverse reconstruction. Additional information that allows associating all these V3C video components, and enables the inverse reconstruction from a 2D representation back to a 3D representation is also included in a special component, referred to in this document as the atlas 304. An atlas 304 consists of multiple elements, named as patches. Each patch identifies a region in all available 2D components and contains information necessary to perform the appropriate inverse projection of this region back to the 3D space. The shape of such regions is determined through a 2D bounding box associated with each patch as well as their coding order. The shape of these regions is also further refined after the consideration of the occupancy information.

Atlases may be partitioned into patch packing blocks of equal size. The 2D bounding boxes of patches and their coding order determine the mapping between the blocks of the atlas image and the patch indices. Figure 4 shows an example of block to patch mapping with 4 projected patches onto an atlas when asps_patch_precedence_order_flag is equal to 0. Projected points are represented with dark grey. The area that does not contain any projected points is represented with light grey. Patch packing blocks are represented with dashed lines. The number inside each patch packing block represents the patch index of the patch to which it is mapped.

Axes orientations are specified for internal operations. For instance, the origin of the atlas coordinates is located on the top-left corner of the atlas frame. For the reconstruction step, an intermediate axes definition for a local 3D patch coordinate system is used. The 3D local patch coordinate system is then converted to the final target 3D coordinate system using appropriate transformation steps.

Figure 5a shows an example of a single patch 520 packed onto an atlas image 510. This patch 520 is then converted to a local 3D patch coordinate system (U, V, D) defined by the projection plane with origin O’, tangent (U), bi-tangent (V), and normal (D) axes. For an orthographic projection, the projection plane is equal to the sides of an axis-aligned 3D bounding box 530, as shown in Figure 5b. The location of the bounding box 530 in the 3D model coordinate system, defined by a left-handed system with axes (X, Y, Z), can be obtained by adding offsets TilePatch3dOffsetU, TilePatch3DOffsetV, and TilePatch3DOffsetD, as illustrated in Figure 5c. Coded V3C video components are referred to in this disclosure as video bitstreams, while a coded atlas is referred to as the atlas bitstream. Video bitstreams and atlas bitstreams may be further split into smaller units, referred to here as video and atlas sub-bitstreams, respectively, and may be interleaved together, after the addition of appropriate delimiters, to construct a V3C bitstream.

V3C patch information is contained in atlas bitstream, atlas_sub_bitstream(), which contains a sequence of NAL units. NAL unit is specified to format data and provide header information in a manner appropriate for conveyance on a variety of communication channels or storage media. All data are contained in NAL units, each of which contains an integer number of bytes. A NAL unit specifies a generic format for use in both packet-oriented and bitstream systems. The format of NAL units for both packet-oriented transport and sample streams is identical except that in the sample stream format specified in Annex D of ISO/IEC 23090-5 each NAL unit can be preceded by an additional element that specifies the size of the NAL unit.

NAL units in the atlas bitstream can be divided into atlas coding layer (ACL) and non-atlas coding layer (non-ACL) units. The former are dedicated to carrying patch data, while the latter carry data necessary to properly parse the ACL units or any additional auxiliary data.

In the nal_unit_header() syntax nal_unit_type specifies the type of the RBSP data structure contained in the NAL unit as specified in Table 4 of ISO/IEC 23090-5. nal_layer_id specifies the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies. The value of nal_layer_id shall be in the range of 0 to 62, inclusive. The value of 63 may be specified in the future by ISO/IEC. Decoders conforming to a profile specified in Annex A of ISO/IEC 23090-5 shall ignore (i.e., remove from the bitstream and discard) all NAL units with values of nal_layer_id not equal to 0.

rbsp_byte[ i ] is the i-th byte of an RBSP. An RBSP is specified as an ordered sequence of bytes as follows: The RBSP contains a string of data bits (SODB) as follows:

• If the SODB is empty (i.e., zero bits in length), the RBSP is also empty.

• Otherwise, the RBSP contains the SODB as follows:
  o The first byte of the RBSP contains the first (most significant, leftmost) eight bits of the SODB; the next byte of the RBSP contains the next eight bits of the SODB, etc., until fewer than eight bits of the SODB remain.
  o The rbsp_trailing_bits( ) syntax structure is present after the SODB as follows:

■ The first (most significant, left-most) bits of the final RBSP byte contain the remaining bits of the SODB (if any).

■ The next bit consists of a single bit equal to 1 (i.e., rbsp_stop_one_bit).

■ When the rbsp_stop_one_bit is not the last bit of a byte-aligned byte, one or more bits equal to 0 (i.e., instances of rbsp_alignment_zero_bit) are present to result in byte alignment.

One or more cabac_zero_word 16-bit syntax elements equal to 0x0000 may be present in some RBSPs after the rbsp_trailing_bits( ) at the end of the RBSP.

Syntax structures having these RBSP properties are denoted in the syntax tables using an "_rbsp" suffix. These structures are carried within NAL units as the content of the rbsp_byte[ i ] data bytes. As an example, the following may be considered as typical content:

• atlas_sequence_parameter_set_rbsp( ), which is used to carry parameters related to atlas on a sequence level.

• atlas_frame_parameter_set_rbsp( ), which is used to carry parameters related to atlas on a frame level and are valid for one or more atlas frames.

• sei_rbsp( ), used to carry SEI messages in NAL units.

• atlas_tile_group_layer_rbsp( ), used to carry patch layout information for tile groups.

When the boundaries of the RBSP are known, the decoder can extract the SODB from the RBSP by concatenating the bits of the bytes of the RBSP and discarding the rbsp_stop_one_bit, which is the last (least significant, right-most) bit equal to 1, and discarding any following (less significant, farther to the right) bits that follow it, which are equal to 0. The data necessary for the decoding process is contained in the SODB part of the RBSP.

atlas_tile_group_layer_rbsp() contains metadata information for a list of tile groups, which represent sections of a frame. Each tile group may contain several patches for which the metadata syntax is described below.

Annex F of the V3C V-PCC specification (23090-5) describes different SEI messages that have been defined for V3C MIV purposes. SEI messages assist in processes related to decoding, reconstruction, display, or other purposes. Annex F (23090-5) defines two types of SEI messages: essential and non-essential. V3C SEI messages are signaled in sei_rbsp(), which is documented below.

Non-essential SEI messages are not required by the decoding process. Conforming decoders are not required to process this information for output order conformance.

Specification for presence of non-essential SEI messages is also satisfied when those messages (or some subset of them) are conveyed to decoders (or to the HRD) by other means not specified in V3C V-PCC specification (23090-5). When present in the bitstream, non-essential SEI messages shall obey the syntax and semantics as specified in Annex F (23090-5). When the content of a non-essential SEI message is conveyed for the application by some means other than presence within the bitstream, the representation of the content of the SEI message is not required to use the same syntax specified in Annex F (23090-5). For the purpose of counting bits, only the appropriate bits that are present in the bitstream are counted.

Essential SEI messages are an integral part of the V3C bitstream and should not be removed from the bitstream. The essential SEI messages are categorized into two types:

• Type-A essential SEI messages: These SEIs contain information required to check bitstream conformance and for output timing decoder conformance. Every V3C decoder conforming to point A should not discard any relevant Type-A essential SEI messages and shall consider them for bitstream conformance and for output timing decoder conformance.

• Type-B essential SEI messages: V3C decoders that wish to conform to a particular reconstruction profile should not discard any relevant Type-B essential SEI messages and shall consider them for 3D point cloud reconstruction and conformance purposes.

A polygon mesh is a collection of vertices, edges and faces that defines the shape of a polyhedral object in 3D computer graphics and solid modelling. The faces usually consist of triangles (triangle mesh), quadrilaterals (quads), or other simple convex polygons (n-gons), since this simplifies rendering, but may also be more generally composed of concave polygons, or even polygons with holes. Objects created with polygon meshes are represented by different types of elements. These include vertices, edges, faces, polygons, and surfaces. In many applications, only vertices, edges and either faces or polygons are stored.

Polygon meshes are defined by the following elements:

• Vertex: A position in 3D space defined as (x, y, z) along with other information such as color (r, g, b), normal vector and texture coordinates.

• Edge: A connection between two vertices.

• Face: A closed set of edges, in which a triangle face has three edges, and a quad face has four edges. A polygon is a coplanar set of faces. In systems that support multi-sided faces, polygons and faces are equivalent. Mathematically a polygonal mesh may be considered an unstructured grid, or undirected graph, with additional properties of geometry, shape and topology.

• Surfaces: or smoothing groups, are useful, but not required to group smooth regions.

• Groups: Some mesh formats contain groups, which define separate elements of the mesh, and are useful for determining separate subobjects for skeletal animation or separate actors for non-skeletal animation.

• Materials: defined to allow different portions of the mesh to use different shaders when rendered.

• UV coordinates: Most mesh formats also support some form of UV coordinates which are a separate 2D representation of the mesh "unfolded" to show what portion of a 2-dimensional texture map applies to different polygons of the mesh.

It is also possible for meshes to contain other vertex attribute information such as color, tangent vectors, weight maps to control animation, etc. (sometimes also called channels).

Figure 6 and Figure 7 show the extensions to the V3C encoder and decoder to support mesh encoding and mesh decoding.

In the encoder extension, shown in Figure 6, the input mesh data 610 is demultiplexed 620 into vertex coordinate and attributes data 625 and mesh connectivity 627, where the mesh connectivity comprises vertex connectivity information. The vertex coordinate and attributes data 625 is coded using MPEG-I V-PCC 630 (such as shown in Figure 1), whereas the mesh connectivity data 627 is coded in mesh connectivity encoder 635 as auxiliary data. Both of these are multiplexed 640 to create the final compressed output bitstream 650. Vertex ordering is carried out on the reconstructed vertex coordinates at the output of MPEG-I V-PCC to reorder the vertices for optimal mesh connectivity encoding.

At the decoder, shown in Figure 7, the input bitstream 750 is demultiplexed 740 to generate the compressed bitstreams for vertex coordinates and attributes data and mesh connectivity. The vertex coordinates and attributes data are decompressed using MPEG-I V-PCC decoder 730. Vertex reordering 725 is carried out on the reconstructed vertex coordinates at the output of MPEG-I V-PCC decoder 730 to match the vertex order at the encoder. Mesh connectivity data is decompressed using mesh connectivity decoder 735. The decompressed data is multiplexed 720 to generate the reconstructed mesh 710.

Certain defects have been realized with 3D patch generation, when the initial pose of the input volumetric visual object, i.e., the input model, is not well aligned with the bounding box, i.e., the projection planes used for the 3D to 2D projection. As shown in Figure 8, volumetric visual objects being represented by mesh models may start with some rotation against the mesh bounding boxes (marked in white). As the bounding box defines the projection planes for the V3C 3D to 2D projection, patches representing important areas, e.g., faces, are split and this creates unpleasant artefacts in the reconstruction. Furthermore, larger patches can be created when the dominant planes of the mesh are aligned to the bounding box, leading to improved compression efficiency. The V3C standard allows sending camera rotation information in the atlas_camera_parameters syntax. This functionality supports an initial adjustment of the model rotation to the projection planes. However, this solution does not fully support changes in model orientation on consecutive frames: atlas_camera_parameters is stored in the Atlas adaptation parameter set (AAPS), which can be sent per frame. The pivot point is sent in the VUI, which is sent only once per sequence.

Atlas camera parameters syntax is shown below:

The present embodiments provide an implementation to identify relevant parts of a dynamic 3D mesh to which 3D projection planes are aligned, in order to:

- maximize reconstruction quality, by creating larger consecutive patches;

- improve coding performance, by creating larger patches, resulting in a reduced total number of patches. The advantage of larger patches is that the texture on the mesh's surface is not broken up into many small parts, which facilitates a video codec's intra prediction; and

- improve coding performance by enhancing temporal stability between consecutive frames.

In addition, the present embodiments provide relevant signalling of per-frame model rotations to support the above-mentioned functionality in V3C-based standards (not limited to mesh coding). The embodiments disclosed in this specification relate to an implementation for determining the necessary model rotation to achieve the best alignment with the projection planes for efficient V3C 3D to 2D projection, as well as its relevant signalling.

The implementation enables identification of relevant parts, for example clusters, of a dynamic 3D mesh to which 3D projection planes are aligned. Figure 9 shows an example UV map for sequence Mitch, where all 3D vertices are color coded according to their projection direction. The projection direction may be derived from the vertex's surface normal. The projection plane with the smallest error to the vertex surface normal may be selected for projection. In total there are six projection planes (the six sides of the bounding box) in this example.
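One possible, purely illustrative way to derive such per-vertex projection directions is sketched below: a vertex normal is accumulated from the (area-weighted) normals of its incident faces, and the bounding-box plane with the smallest angular error, i.e., the largest dot product, is selected.

import numpy as np

# The six bounding-box plane normals (one per side of the box).
BOX_NORMALS = np.array([
    [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0], [0.0, -1.0, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.0, -1.0],
])

def vertex_projection_planes(vertices, faces):
    """For each vertex, derive a surface normal from its incident faces
    and pick the projection plane with the smallest angular error.

    vertices: (N, 3) float array; faces: (M, 3) array of vertex indices.
    Returns an (N,) array of plane indices in [0, 5].
    """
    vertices = np.asarray(vertices, dtype=np.float64)
    normals = np.zeros_like(vertices)
    for a, b, c in faces:
        # Face normal (left unnormalized: weights the sum by face area).
        n = np.cross(vertices[b] - vertices[a], vertices[c] - vertices[a])
        normals[a] += n
        normals[b] += n
        normals[c] += n
    lengths = np.linalg.norm(normals, axis=1, keepdims=True)
    normals = normals / np.maximum(lengths, 1e-12)
    # Smallest error to a plane normal == largest dot product.
    return np.argmax(normals @ BOX_NORMALS.T, axis=1)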

The cluster of faces (connected vertices) covering one consecutive UV map texture area can be derived as follows:

cluster_connected_faces(list_of_faces):
    for face in list_of_faces
        if face doesn't connect to any existing cluster, start a new cluster
        if face connects to an existing cluster, add face to this cluster
        if face connects two or more clusters, add face and merge clusters
    end
    return list of clusters
end
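One possible realization of the above pseudocode is sketched below. It assumes faces are given as tuples of vertex indices and treats two faces as connected when they share at least one vertex (connectivity could equally be defined via shared edges); a small union-find structure handles the cluster merging.

def cluster_connected_faces(list_of_faces):
    """Group faces (tuples of vertex indices) into clusters of connected
    faces, where two faces are connected when they share a vertex."""
    parent = {}

    def find(x):
        # Union-find with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Union all vertices of each face; faces sharing a vertex end up in the
    # same set, which merges clusters transitively.
    for face in list_of_faces:
        for v in face:
            parent.setdefault(v, v)
        for v in face[1:]:
            union(face[0], v)

    clusters = {}
    for face in list_of_faces:
        clusters.setdefault(find(face[0]), []).append(face)
    return list(clusters.values())

For example, cluster_connected_faces([(0, 1, 2), (2, 3, 4), (7, 8, 9)]) returns two clusters: one containing the first two faces and one containing the last face.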

Identify best alignment for V3C projection

1.1 Identify most relevant part(s) of the 3D mesh

According to an embodiment, one or more most relevant parts (e.g., clusters) for projection alignment are identified.

Based on identified relevant parts, such as clusters, as described above, it is possible to determine the most relevant parts for 3D projection. This can be done in various ways, for example, when a cluster is used as an example of a relevant part, by a) selecting the cluster with the largest number of connected faces overall; or b) selecting the cluster with the largest number of UV map pixels covered (not necessarily the same as (a)); or c) selecting the cluster with the highest/lowest density of faces, i.e., reflecting the face size on the UV map (for example, high interest areas are represented with more, but smaller, faces); or d) selecting the cluster with the highest/lowest variance in UV map texture; or e) selecting the cluster with the highest/lowest variance in 3D vertex locations; or f) selecting the cluster with the lowest variance in surface normals; or g) selecting a cluster based on a certain UV map texture distribution (e.g., detect skin colors); or h) selecting a cluster previously identified in a preceding frame; or i) any combination of the above.

It is also possible to eliminate clusters from the selection process, e.g., based on the criteria mentioned above and then search only the remaining clusters.

An example implementation method can comprise steps of

- removing all clusters with a size less than n faces from the list of candidates (n can be expressed as a percentage of overall faces);

- for each remaining cluster, determining main surface normal distribution (see Figure 10);

- selecting the cluster with the largest number of faces assigned to one single projection plane; a sketch of this selection is given below.
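A minimal sketch of this example method, assuming the clusters from the clustering sketch above, per-face unit normals, and the helper select_projection_plane introduced earlier (all names are illustrative):

import numpy as np

def select_most_relevant_cluster(clusters, face_normals, min_faces):
    # clusters:     list of lists of face indices
    # face_normals: (num_faces, 3) array of unit face normals
    # min_faces:    clusters smaller than this are removed from the candidates
    best_cluster, best_count = None, -1
    for cluster in clusters:
        if len(cluster) < min_faces:
            continue  # remove small clusters from the list of candidates
        # Main surface normal distribution: assign every face in the cluster
        # to one of the six projection planes and count the assignments.
        planes = [select_projection_plane(face_normals[f]) for f in cluster]
        counts = np.bincount(planes, minlength=6)
        if counts.max() > best_count:
            best_count, best_cluster = int(counts.max()), cluster
    return best_cluster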

Another example of an implementation method can comprise steps of

- removing all clusters with a size less than n faces from the list of candidates (n can be expressed as a percentage of overall faces);

- for each remaining cluster, computing average texture value (e.g., RGB);

- determining certain texture value range(s), e.g., standard ranges for skin tones, removing all clusters outside these ranges;

- for each remaining cluster, determining main surface normal distribution (see Figure 10);

- selecting cluster with largest number of faces assigned to one single projection plane.

Yet another example of an implementation method can perform the following operations to ensure a large number of faces can be aligned in between two projection planes:

- removing all clusters with a size less than n faces from the list of candidates (n can be expressed as a percentage of overall faces);

- for each remaining cluster, computing average texture value (e.g., RGB);

- determining certain texture value range(s), e.g., standard ranges for skin tones, removing all clusters outside these ranges;

- for each remaining cluster, determining main surface normal distribution (see Figure 10);

- selecting the cluster with a) the largest number of faces assigned to one single projection plane; b) the largest density, e.g., the weighted number of faces assigned to one single projection plane, where the weight is defined by the area of the face divided by the total area of the faces in this cluster; or c) the largest density of faces with normal direction aligned with the best fitting plane of the cluster. The best fitting plane can be estimated with Principal Component Analysis (PCA) on the 3D coordinates of the face vertices, as sketched below.
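A minimal sketch of this PCA-based plane estimation, assuming the cluster is given as an array of 3D face-vertex coordinates; the plane normal is the direction of least variance, i.e., the right singular vector with the smallest singular value:

import numpy as np

def best_fitting_plane(points):
    # points: (N, 3) array of 3D coordinates of the cluster's face vertices.
    # Returns (centroid, unit plane normal).
    centroid = points.mean(axis=0)
    # The right singular vectors of the centered points are the principal axes;
    # the last one (smallest singular value) is the plane normal.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    normal = vt[-1]
    return centroid, normal / np.linalg.norm(normal)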

1.2 Identify most relevant surface normal

UV-map patches may be wrapped around a 3D mesh, so averaging the surface normal over the complete patches will not yield reliable results to align with 3D projection planes. Instead, one or more of the following embodiments provide more reliable rotation information.

According to an embodiment, the geometric center of mass (centroid) of the 2D patch is found. The surface normal of the face covering the centroid is selected.

According to another embodiment, the average surface normal of the face covering the centroid (initial face) and of its directly connected faces is selected. The number of connected neighboring faces can be grown recursively to cover a larger area.

According to another embodiment, an exclusion condition is introduced, excluding the addition of new faces if their individual surface normal is too different from the current average surface normal. This condition can also be used as a stopping condition to limit the growth of the pool of faces from which the surface normal is averaged, as sketched below.
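A minimal sketch of this region growing, assuming a face adjacency list and unit face normals; the angular threshold used as the exclusion/stopping condition is an assumed parameter:

import numpy as np
from collections import deque

def grow_face_pool(initial_face, neighbors, face_normals, max_angle_deg=30.0):
    # neighbors:    dict mapping face index -> list of adjacent face indices
    # face_normals: (num_faces, 3) array of unit face normals
    cos_thresh = np.cos(np.radians(max_angle_deg))
    pool = [initial_face]
    normal_sum = face_normals[initial_face].copy()
    queue = deque([initial_face])
    visited = {initial_face}
    while queue:
        f = queue.popleft()
        for nb in neighbors.get(f, []):
            if nb in visited:
                continue
            visited.add(nb)
            avg = normal_sum / np.linalg.norm(normal_sum)
            if np.dot(face_normals[nb], avg) < cos_thresh:
                continue  # exclusion condition: normal deviates too much from the average
            pool.append(nb)
            normal_sum += face_normals[nb]
            queue.append(nb)
    return pool, normal_sum / np.linalg.norm(normal_sum)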

According to another embodiment, a pre-defined number N of iterations of mesh smoothing operations is applied to the cluster's 3D vertex positions, and face normals are re-estimated based on this smoothed geometry. This enables handling clusters with high-varying curvature, e.g., ridges and crests such as clothes, hair, or the brain cortex, for which the normal of the centroid may not be oriented in a representative manner for the faces of the 2D patch. The number of mesh smoothing iterations can be made adaptive based on the observed variance of the normal directions. Mesh smoothing can be defined as follows:

mesh_smoothing(mesh):
    for each vertex v(x, y, z)
        set v to v(x', y', z'), where the x', y', z' coordinates are the
        (possibly area-weighted) average of the x, y, z coordinates of the
        vertex's neighboring vertices
    end
end

Several iterations of smoothing reduce the noise in the mesh coordinates and reduce the variance of the normal directions. When the number of iterations is set too large, the geometry is not preserved, as the mesh coordinates all converge towards the center of the mesh. A sketch of such a smoothing operation is given below.
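A minimal sketch of the smoothing defined above, using a uniform (not area-weighted) average and assuming the mesh is given as a vertex array and a vertex adjacency list:

import numpy as np

def mesh_smoothing(vertices, vertex_neighbors, iterations=1):
    # vertices:         (N, 3) array of vertex coordinates
    # vertex_neighbors: dict mapping vertex index -> list of neighboring vertex indices
    v = vertices.copy()
    for _ in range(iterations):
        smoothed = v.copy()
        for i, nbs in vertex_neighbors.items():
            if nbs:
                # Replace each vertex by the average of its neighboring vertices.
                smoothed[i] = v[nbs].mean(axis=0)
        v = smoothed
    return v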

According to another embodiment, the geometric center of mass is found in 3D space. The face closest in 3D space is chosen as the initial face. Pools of faces can be created as described above.

According to another embodiment, the histogram of surface normals of the patch is derived (see Figure 8). The average surface normal of the largest histogram bin is selected.

1.3 Updating most relevant surface normals in consecutive frames

Once the most relevant surface normal has been identified as described in 1.2, the object's rotation can be updated for every frame. However, UV maps may change over time, and it may become difficult to keep identifying the corresponding patches over time. The object (i.e., the 3D model) itself, on the other hand, may stay similar, and therefore an inverse identification based on the object is possible.

According to an embodiment, the 3D location of the initial face used for the average surface normal calculation of a first frame is stored. For a temporally successive second frame, the face closest in 3D space to the previous initial face can be used as the new initial face. If desired, a pool of faces for averaging the surface normal can be created from this initial face in 2D from the UV maps as described in 1.2.

According to another embodiment, the pool of faces for averaging the surface normal can be created in 3D, similarly to the concepts described in 1.2.

According to another embodiment, the 3D geometric center of mass of the faces forming the largest histogram bin used for the average surface normal calculation (Section 1.2) of a first frame is stored. For a temporally successive second frame, the face closest in 3D space to the previous 3D center of mass can now be used as the new initial face. If desired, a pool of faces for averaging the surface normal can be created from this initial face in 2D from the UV maps as described in 1.2.

According to yet another embodiment, the 3D geometric center of mass of the faces forming the largest histogram bin used for the average surface normal calculation (Section 1.2) of a first frame is stored. For a temporally successive second frame, the UV map patch containing the face closest in 3D space to the previous 3D center of mass is used to calculate a new histogram of surface normals. The average surface normal of the largest histogram bin is selected.
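A minimal sketch of the common building block of these embodiments, i.e., locating in the current frame the face closest in 3D space to a reference point stored from the previous frame (the names are illustrative):

import numpy as np

def closest_face(face_centroids, reference_point):
    # face_centroids:  (num_faces, 3) array of face centroid coordinates of the current frame
    # reference_point: 3D location stored from the previous frame, e.g., the previous
    #                  initial face location or the center of mass of the largest histogram bin
    distances = np.linalg.norm(face_centroids - reference_point, axis=1)
    return int(np.argmin(distances))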

Encoder

2.1 Aligning 3D models to most relevant surface normal

According to an embodiment, before the V3C patch creation and projection, the volumetric visual object is rotated to align the identified most relevant surface normal parallel to the surface normal of one projection plane.
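A minimal sketch of one way to compute such a rotation, as a unit quaternion that rotates the identified most relevant surface normal n onto a projection plane normal t (a standard half-angle construction; not necessarily the construction used in a particular encoder implementation):

import numpy as np

def rotation_quaternion_aligning(n, t):
    # Returns (qx, qy, qz, qw) rotating unit vector n onto unit vector t.
    n = n / np.linalg.norm(n)
    t = t / np.linalg.norm(t)
    c = float(np.dot(n, t))
    if c < -0.999999:
        # n and t are opposite: rotate 180 degrees around any axis orthogonal to n.
        axis = np.cross(n, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-6:
            axis = np.cross(n, [0.0, 1.0, 0.0])
        axis = axis / np.linalg.norm(axis)
        return np.array([axis[0], axis[1], axis[2], 0.0])
    axis = np.cross(n, t)
    q = np.array([axis[0], axis[1], axis[2], 1.0 + c])
    return q / np.linalg.norm(q)

The resulting rotation can then be applied to all vertex positions of the volumetric visual object before patch creation.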

Signalling

3.1 Aligning 3D models to most relevant surface normal

According to an embodiment, the rotation of a volumetric visual object that is carried out before the V3C patch creation and projection is signaled in or along the bitstream by the encoder, to align the identified most relevant surface normal parallel to the surface normal of one projection plane.

In the following, the general atlas frame parameter set RBSP syntax is given.

afps_rotation_present_flag equal to 1 indicates that rotation parameters for the model to align with the most relevant surface normal are present. afps_rotation_present_flag equal to 0 indicates that rotation parameters for the model are not present. When afps_rotation_present_flag is not present, it shall be inferred to be equal to 0.

afps_model_rotation_qx specifies the x component, qX, for the rotation of the model to align with the most relevant surface normal, using the quaternion representation. The value of afps_model_rotation_qx shall be in the range of -2^14 to 2^14, inclusive. When afps_model_rotation_qx is not present, its value shall be inferred to be equal to 0. The value of qX is computed as follows:

qX = afps_model_rotation_qx ÷ 2^14

afps_model_rotation_qy specifies the y component, qY, for the rotation of the model to align with the most relevant surface normal, using the quaternion representation. The value of afps_model_rotation_qy shall be in the range of -2^14 to 2^14, inclusive. When afps_model_rotation_qy is not present, its value shall be inferred to be equal to 0. The value of qY is computed as follows:

qY = afps_model_rotation_qy ÷ 2^14

afps_model_rotation_qz specifies the z component, qZ, for the rotation of the model to align with the most relevant surface normal, using the quaternion representation. The value of afps_model_rotation_qz shall be in the range of -2^14 to 2^14, inclusive. When afps_model_rotation_qz is not present, its value shall be inferred to be equal to 0. The value of qZ is computed as follows:

qZ = afps_model_rotation_qz ÷ 2^14

The fourth component, qW, for the rotation of the model using the quaternion representation is calculated as follows:

qW = Sqrt( 1 − ( qX^2 + qY^2 + qZ^2 ) )

qW is always positive. If a negative qW is desired, all three syntax elements, afps_model_rotation_qx, afps_model_rotation_qy, and afps_model_rotation_qz, may be signaled with an opposite sign, which is equivalent.
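A minimal sketch of how a decoder might reconstruct the quaternion from the signaled fixed-point syntax elements (the argument names follow the semantics above; the bitstream parsing itself is omitted):

import math

def decode_model_rotation(afps_model_rotation_qx,
                          afps_model_rotation_qy,
                          afps_model_rotation_qz):
    # Each component is divided by 2^14; qW is derived as the positive root.
    qx = afps_model_rotation_qx / 2**14
    qy = afps_model_rotation_qy / 2**14
    qz = afps_model_rotation_qz / 2**14
    qw = math.sqrt(max(0.0, 1.0 - (qx * qx + qy * qy + qz * qz)))
    return qx, qy, qz, qw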

A unit quaternion can be represented as a rotation matrix R as follows:
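Assuming the standard conversion from a unit quaternion (qX, qY, qZ, qW) to a rotation matrix, R is given by:

R = \begin{pmatrix}
1 - 2(qY^2 + qZ^2) & 2(qX\,qY - qZ\,qW) & 2(qX\,qZ + qY\,qW) \\
2(qX\,qY + qZ\,qW) & 1 - 2(qX^2 + qZ^2) & 2(qY\,qZ - qX\,qW) \\
2(qX\,qZ - qY\,qW) & 2(qY\,qZ + qX\,qW) & 1 - 2(qX^2 + qY^2)
\end{pmatrix}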

Decoder

4.1 Reverting model alignment to most relevant surface normal

According to an embodiment, a decoder receives a V3C bitstream containing information about model alignment with a most relevant surface normal. After 3D reconstruction, the decoder applies the signaled model rotation before sending the model to the renderer.

According to another embodiment, the decoder does not apply the rotation itself, but instead sends the relevant information to the renderer alongside the 3D model.

Figure 11a is a flowchart illustrating a method according to an embodiment. The method generally comprises receiving 1105 a volumetric visual object being defined with a mesh representing a surface and a set of interconnected parts; deriving 1110 a plurality of parts of the mesh which projection planes are aligned to; identifying 1115 the most relevant part from said plurality of parts for projection alignment; identifying 1120 most relevant surface normal of said most relevant part; updating 1125 an object rotation for every frame by aligning the identified most relevant surface normal in parallel to a surface normal of a projection plane; and generating 1130 a bitstream containing information on object's alignment with a most relevant surface normal to be signaled to a decoder. Each of the steps can be implemented by a respective module of a computer system.

An apparatus according to an embodiment comprises means for receiving a volumetric visual object being defined with a mesh representing a surface and a set of interconnected parts; means for deriving a plurality of parts of the mesh which projection planes are aligned to; means for identifying the most relevant part from said plurality of parts for projection alignment; means for identifying most relevant surface normal of said most relevant part; means for updating an object rotation for every frame by aligning the identified most relevant surface normal in parallel to a surface normal of a projection plane; and means for generating a bitstream containing information on object's alignment with a most relevant surface normal to be signaled to a decoder. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 11a according to various embodiments.

Figure 11b is a flowchart illustrating a method according to another embodiment. The method generally comprises receiving 1140 a bitstream containing information on object's alignment with a most relevant surface normal, said most relevant surface normal having been identified from a most relevant part of a mesh for projection alignment; reconstructing 1145 a volumetric visual object according to received information on object's alignment; determining 1150 an object rotation by aligning the most relevant surface normal in parallel to a surface normal of a projection plane; and signalling 1155 the volumetric visual object to a renderer to be rendered with an applied rotation. Each of the steps can be implemented by a respective module of a computer system.

An apparatus according to an embodiment comprises means for receiving a bitstream containing information on object's alignment with a most relevant surface normal, said most relevant surface normal having been identified from a most relevant part of a mesh for projection alignment; means for reconstructing a volumetric visual object according to received information on object's alignment; means for determining an object rotation by aligning the most relevant surface normal in parallel to a surface normal of a projection plane; and means for signalling the volumetric visual object to a renderer to be rendered with an applied rotation. The means comprises at least one processor, and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 11b according to various embodiments.

An example of an apparatus is disclosed with reference to Figure 12. Figure 12 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder. The electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device. The electronic device 50 may be also comprised at a local or a remote server or a graphics processing unit of a computer. The device may be also comprised as part of a head-mounted display device. The apparatus 50 may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. The camera 42 may be a multi-lens camera system having at least two camera sensors. The camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.

The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.

The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and a UICC reader, for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network. The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system, or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es). The apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection.

The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of various embodiments.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims. It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications, which may be made without departing from the scope of the present disclosure as defined in the appended claims.