

Title:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND VIDEO DECODING
Document Type and Number:
WIPO Patent Application WO/2023/073283
Kind Code:
A1
Abstract:
The embodiments relate to a method comprising receiving a bitstream representing coded volumetric video, the bitstream being a composition of a number of video sub-bitstreams and atlas sub-bitstreams; encapsulating the video sub-bitstream to a Real-time Transfer Protocol (RTP) packet; encapsulating one or more components of the atlas sub-bitstream to one or more Real-time Transfer Protocol (RTP) packets as Visual Volumetric Video-based Coding (V3C) Network Abstraction Layer (NAL) units; sending the RTP packets over one or more RTP sessions to a client; and signalling, per RTP session, information on the one or more atlas sub-bitstreams. The embodiments also relate to technical equipment for implementing the method.

Inventors:
ILOLA LAURI ALEKSI (FI)
KONDRAD LUKASZ (DE)
AKSU EMRE BARIS (FI)
Application Number:
PCT/FI2022/050696
Publication Date:
May 04, 2023
Filing Date:
October 20, 2022
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H04N13/161; H04L65/65; H04L65/70; H04N21/234; H04N21/6437
Foreign References:
US20210021664A12021-01-21
US20210218999A12021-07-15
US20140294064A12014-10-02
Other References:
"Draft text of ISO/IEC FDIS 23090-10 Carriage of Visual Volumetric Video-based Coding Data", 133. MPEG MEETING; 20210111 - 20210115; ONLINE; (MOTION PICTURE EXPERT GROUP OR ISO/IEC JTC1/SC29/WG11), 25 March 2021 (2021-03-25), XP030293704
"Text of ISO/IEC DIS 23090-5 Visual Volumetric Video-based Coding and Video-based Point Cloud Compression 2nd Edition", 135. MPEG MEETING; 20210712 - 20210716; ONLINE; (MOTION PICTURE EXPERT GROUP OR ISO/IEC JTC1/SC29/WG11), 23 July 2021 (2021-07-23), XP030296511
ILOLA L., KONDRAD L.: "RTP Payload Format for Visual Volumetric Video-based Coding (V3C)", IETF DATATRACKER, pages 1 - 36, XP093068033, Retrieved from the Internet [retrieved on 20230727]
Attorney, Agent or Firm:
BERGGREN OY (FI)

Claims:

1. An apparatus comprising:

- means for receiving a bitstream representing coded volumetric video, the bitstream being a composition of a number of video sub-bitstreams and atlas sub-bitstreams;

- means for encapsulating the video sub-bitstream to a Real-time Transfer Protocol (RTP) packet;

- means for encapsulating one or more components of the atlas sub-bitstream to one or more Real-time Transfer Protocol (RTP) packets as Visual Volumetric Video-based Coding (V3C) Network Abstraction Layer (NAL) units;

- means for sending the RTP packets over one or more RTP sessions to a client; and

- means for signalling per RTP session information on the one or more atlas sub-bitstreams.

2. The apparatus according to claim 1, wherein a component of an atlas sub-bitstream is an atlas frame or an atlas tile.

3. The apparatus according to claim 1 or 2, wherein a component of the atlas sub-bitstream is encapsulated to one or more RTP packets.

4. The apparatus according to claim 1 or 2, wherein a component of the atlas sub-bitstream is encapsulated to a single RTP packet.

5. The apparatus according to claim 1 or 2, wherein one or more components of the atlas sub-bitstream belonging to a same access unit are encapsulated in a single RTP packet.

6. The apparatus according to claim 1 or 2, wherein one or more components of the atlas sub-bitstream belonging to different access units are encapsulated in a single RTP packet.

7. The apparatus according to any of the claims 1 to 6, further comprising means for signalling in a session specific file format an attribute indicating a presence of tile identifiers in RTP packets.

8. The apparatus according to any of the claims 1 to 7, further comprising means for signalling in a session specific file format an attribute indicating a presence of a 16-bit tile identifier in an RTP payload unit.

9. The apparatus according to any of the claims 1 to 8, further comprising means for signalling in a session specific file format an attribute indicating one or more tile identifiers.

10. The apparatus according to any of the claims 1 to 9, further comprising means for signalling in a session specific file format an attribute indicating partial access related information.

11. A method, comprising:

- receiving a bitstream representing coded volumetric video, the bitstream being a composition of a number of video sub-bitstreams and atlas sub-bitstreams;

- encapsulating the video sub-bitstream to a Real-time Transfer Protocol (RTP) packet;

- encapsulating one or more components of the atlas sub-bitstream to one or more Real-time Transfer Protocol (RTP) packets as Visual Volumetric Video-based Coding (V3C) Network Abstraction Layer (NAL) units;

- sending the RTP packets over one or more RTP sessions to a client; and

- signalling, per RTP session, information on the one or more atlas sub-bitstreams.

12. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive a bitstream representing coded volumetric video, the bitstream being a composition of a number of video sub-bitstreams and atlas sub-bitstreams;

- encapsulate the video sub-bitstream to a Real-time Transfer Protocol (RTP) packet;

- encapsulate one or more components of the atlas sub-bitstream to one or more Real-time Transfer Protocol (RTP) packets as Visual Volumetric Video-based Coding (V3C) Network Abstraction Layer (NAL) units;

- send the RTP packets over one or more RTP sessions to a client; and

- signal, per RTP session, information on the one or more atlas sub-bitstreams.

13. The apparatus according to claim 12, wherein a component of an atlas sub-bitstream is an atlas frame or an atlas tile.

14. The apparatus according to claim 12 or 13, wherein a component of the atlas sub-bitstream is encapsulated to one or more RTP packets.

15. The apparatus according to claim 12 or 13, wherein a component of the atlas sub-bitstream is encapsulated to a single RTP packet.

16. The apparatus according to claim 12 or 13, wherein one or more components of the atlas sub-bitstream belonging to a same access unit are encapsulated in a single RTP packet.

17. The apparatus according to claim 12 or 13, wherein one or more components of the atlas sub-bitstream belonging to different access units are encapsulated in a single RTP packet.

18. The apparatus according to any of the claims 12 to 17, further comprising computer program code configured to cause the apparatus to signal in a session specific file format an attribute indicating a presence of tile identifiers in RTP packets.

19. The apparatus according to any of the claims 12 to 18, further comprising computer program code configured to cause the apparatus to signal in a session specific file format an attribute indicating a presence of a 16-bit tile identifier in an RTP payload unit.

20. The apparatus according to any of the claims 12 to 19, further comprising computer program code configured to cause the apparatus to signal in a session specific file format an attribute indicating one or more tile identifiers.

21. The apparatus according to any of the claims 12 to 20, further comprising computer program code configured to cause the apparatus to signal in a session specific file format an attribute indicating partial access related information.

22. A computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:

- receive a bitstream representing coded volumetric video, the bitstream being a composition of a number of video sub-bitstreams and atlas sub-bitstreams;

- encapsulate the video sub-bitstream to a Real-time Transfer Protocol (RTP) packet;

- encapsulate one or more components of the atlas sub-bitstream to one or more Real-time Transfer Protocol (RTP) packets as Visual Volumetric Video-based Coding (V3C) Network Abstraction Layer (NAL) units;

- send the RTP packets over one or more RTP sessions to a client; and

- signal, per RTP session, information on the one or more atlas sub-bitstreams.

Description:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND VIDEO DECODING

Technical Field

The present solution generally relates to coding of volumetric video.

Background

Volumetric video data represents a three-dimensional (3D) scene or object, and can be used as input for AR (Augmented Reality), VR (Virtual Reality), and MR (Mixed Reality) applications. Such data describes geometry (shape, size, position in 3D space) and respective attributes (e.g., color, opacity, reflectance, etc.), and any possible temporal transformations of the geometry and attributes at given time instances (like frames in 2D video). Volumetric video can be generated from 3D models, also referred to as volumetric visual objects, i.e., CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g., multi-camera, laser scan, a combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Examples of representation formats for volumetric data comprise triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e., “frames” in 2D video, or by other means, e.g., the position of an object as a function of time.

Summary

The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.

Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.

According to a first aspect, there is provided an apparatus comprising means for receiving a bitstream representing coded volumetric video, the bitstream being a composition of a number of video sub-bitstreams and atlas sub-bitstreams; means for encapsulating the video sub-bitstream to a Real-time Transfer Protocol (RTP) packet; means for encapsulating one or more components of the atlas sub-bitstream to one or more Real-time Transfer Protocol (RTP) packets as Visual Volumetric Video-based Coding (V3C) Network Abstraction Layer (NAL) units; means for sending the RTP packets over one or more RTP sessions to a client; and means for signalling, per RTP session, information on the one or more atlas sub-bitstreams.

According to a second aspect, there is provided a method, comprising receiving a bitstream representing coded volumetric video, the bitstream being a composition of a number of video sub-bitstreams and atlas sub-bitstreams; encapsulating the video sub-bitstream to a Real-time Transfer Protocol (RTP) packet; encapsulating one or more components of the atlas sub-bitstream to one or more Real-time Transfer Protocol (RTP) packets as Visual Volumetric Video-based Coding (V3C) Network Abstraction Layer (NAL) units; sending the RTP packets over one or more RTP sessions to a client; and signalling, per RTP session, information on the one or more atlas sub-bitstreams.

According to a third aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive a bitstream representing coded volumetric video, the bitstream being a composition of a number of video sub-bitstreams and atlas sub-bitstreams; encapsulate the video sub-bitstream to a Real-time Transfer Protocol (RTP) packet; encapsulate one or more components of the atlas sub-bitstream to one or more Real-time Transfer Protocol (RTP) packets as Visual Volumetric Video-based Coding (V3C) Network Abstraction Layer (NAL) units; send the RTP packets over one or more RTP sessions to a client; and signal, per RTP session, information on the one or more atlas sub-bitstreams.

According to a fourth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: receive a bitstream representing coded volumetric video, the bitstream being a composition of a number of video sub-bitstreams and atlas sub-bitstreams; encapsulate the video sub-bitstream to a Real-time Transfer Protocol (RTP) packet; encapsulate one or more components of the atlas sub-bitstream to one or more Real-time Transfer Protocol (RTP) packets as Visual Volumetric Video-based Coding (V3C) Network Abstraction Layer (NAL) units; send the RTP packets over one or more RTP sessions to a client; and signal, per RTP session, information on the one or more atlas sub-bitstreams.

According to an embodiment, a component of an atlas sub-bitstream is an atlas frame or an atlas tile.

According to an embodiment, a component of the atlas sub-bitstream is encapsulated to one or more RTP packets.

According to an embodiment, a component of the atlas sub-bitstream is encapsulated to a single RTP packet.

According to an embodiment, one or more components of the atlas sub-bitstream belonging to the same access unit are encapsulated in a single RTP packet.

According to an embodiment, one or more components of the atlas sub-bitstream belonging to different access units are encapsulated in a single RTP packet.

According to an embodiment, an attribute indicating a presence of tile identifiers in RTP packets is signaled in a session specific file format.

According to an embodiment, an attribute indicating a presence of a 16-bit tile identifier in an RTP payload unit is signaled in a session specific file format.

According to an embodiment, an attribute indicating one or more tile identifiers is signaled in a session specific file format.

According to an embodiment, an attribute indicating partial access related information is signaled in a session specific file format.

According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.

Description of the Drawings

In the following, various embodiments will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows an example of a compression process of a volumetric video;

Fig. 2 shows an example of a de-compression process of a volumetric video;

Fig. 3 shows an example of a V3C bitstream originated from ISO/IEC 23090-5;

Fig. 4 shows an example of an atlas frame partitioned into seven tiles;

Fig. 5 shows an example of an extension header;

Fig. 6 shows an example of RTP streaming architecture with no network processing units;

Fig. 7 shows an example of RTP streaming architecture containing SPLIT NPU;

Fig. 8 shows an example of RTP streaming architecture with MERGE NPU;

Fig. 9 shows an example of RTP streaming architecture with FILTER NPU;

Fig. 10 shows an example of a structure of RTP payload header;

Fig. 11 shows an example of RTP payload - single NAL unit packet;

Fig. 12 shows an example of RTP payload - single time aggregation packet;

Fig. 13 shows another example of RTP payload - single time aggregation packet;

Fig. 14 shows an example of RTP payload - multi-time aggregation packet;

Fig. 15 shows another example of RTP payload - multi-time aggregation packet;

Fig. 16 shows an example of RTP payload - fragmentation unit;

Fig. 17 shows an example of a fragmentation unit header;

Fig. 18 shows an example of a RTP header extension for V3C video components;

Fig. 19 is a flowchart illustrating a method according to an embodiment; and

Fig. 20 shows an apparatus according to an embodiment.

Description of Example Embodiments

The following description and drawings are illustrative and are not to be construed as unnecessarily limiting. The specific details are provided for a thorough understanding of the disclosure. However, in certain instances, well-known or conventional details are not described in order to avoid obscuring the description. References to one embodiment or an embodiment in the present disclosure can be, but are not necessarily, references to the same embodiment, and such references mean at least one of the embodiments.

Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure.

Figure 1 illustrates an overview of an example of a compression process of a volumetric video. Such process may be applied for example in MPEG Point Cloud Coding (PCC). The process starts with an input point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105.

The patch generation 102 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. For patch generation, the normal at every point can be estimated. An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:

- (1.0, 0.0, 0.0),

- (0.0, 1.0, 0.0),

- (0.0, 0.0, 1.0),

- (-1.0, 0.0, 0.0),

- (0.0, -1.0, 0.0), and

- (0.0, 0.0, -1.0)

More precisely, each point may be associated with the plane that has the closest normal (i.e., maximizes the dot product of the point normal and the plane normal).

The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The final step may comprise extracting patches by applying a connected component extraction procedure.
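As an illustration of this initial clustering step, the following minimal Python sketch (an assumption for illustration: the point normals are unit length, and the function names are not from the standard) selects, for a given point normal, the oriented plane whose normal maximizes the dot product:

# Initial clustering: associate a point with the oriented plane whose
# normal maximizes the dot product with the point normal.
PLANE_NORMALS = [
    (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
    (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, -1.0),
]

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def initial_cluster_index(point_normal):
    # Pick the plane with the closest normal (largest dot product).
    return max(range(len(PLANE_NORMALS)),
               key=lambda i: dot(point_normal, PLANE_NORMALS[i]))

# Example: a normal pointing mostly along -y maps to plane index 4.
assert initial_cluster_index((0.1, -0.9, 0.2)) == 4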

Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to packing process 103, to geometry image generation 104 and to texture image generation 105. The packing process 103 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g., 16x16) block of the grid is associated with a unique patch. It should be noticed that T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder.

The simple packing strategy used iteratively tries to insert patches into a WxH grid. W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid may be temporarily doubled, and the search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
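The following Python sketch illustrates the exhaustive raster-scan insertion and the temporary doubling of H described above, under simplifying assumptions (patch sizes are given in whole grid cells, and no patch rotation is considered); the function name is illustrative:

# Simple packing: insert each patch at the first raster-scan position
# that does not overlap an already-used cell; double H when nothing fits.
def pack(patches, W, H):
    used = [[False] * W for _ in range(H)]
    placements = []
    for w, h in patches:
        pos = None
        while pos is None:
            for v in range(H - h + 1):
                for u in range(W - w + 1):
                    if all(not used[v + j][u + i]
                           for j in range(h) for i in range(w)):
                        pos = (u, v)
                        break
                if pos:
                    break
            if pos is None:
                # No free location: temporarily double the grid height.
                used.extend([False] * W for _ in range(H))
                H *= 2
        u, v = pos
        for j in range(h):
            for i in range(w):
                used[v + j][u + i] = True
        placements.append(pos)
    # Finally H would be clipped to the last used row; omitted here.
    return placements, H

print(pack([(2, 2), (3, 1), (4, 4)], 4, 2))  # later patches force H to grow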

The geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images respectively. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called the near layer, stores the point of H(u, v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:

• Geometry: WxH YUV420-8bit,

• Texture: WxH YUV420-8bit.

It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.

The geometry images and the texture images may be provided to image padding 107. The image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images. The occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 103.

The padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g., 16x16) pixels is compressed independently. If the block is empty (i.e., unoccupied, i.e., all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e., occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e., edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors.
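As an illustration of the edge-block rule, the following minimal Python sketch iteratively fills empty pixels with the average of their already-filled neighbours (a simplified in-place variant; the function names and the tiny example are illustrative only):

# Edge-block padding: iteratively fill empty pixels with the average
# value of their occupied 4-neighbours until no empty pixel remains.
def pad_edge_block(block, occupied):
    T = len(block)  # block and occupied are TxT lists
    while True:
        filled_any = False
        for y in range(T):
            for x in range(T):
                if occupied[y][x]:
                    continue
                nbrs = [block[j][i]
                        for j, i in ((y - 1, x), (y + 1, x),
                                     (y, x - 1), (y, x + 1))
                        if 0 <= j < T and 0 <= i < T and occupied[j][i]]
                if nbrs:
                    block[y][x] = sum(nbrs) // len(nbrs)
                    occupied[y][x] = True
                    filled_any = True
        if not filled_any:
            return block

blk = [[10, 0], [0, 0]]
occ = [[True, False], [False, False]]
print(pad_edge_block(blk, occ))  # propagates the value 10 into empty pixels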

The padded geometry images and padded texture images may be provided for video compression 108. The generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters. The video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102. The smoothed geometry may be provided to texture image generation 105 to adapt the texture images.

The patch may be associated with auxiliary information being encoded/decoded for each patch as metadata. The auxiliary information may comprise the index of the projection plane, the 2D bounding box, and the 3D location of the patch.

For example, the following metadata may be encoded/decoded for every patch:

- index of the projection plane:

o Index 0 for the planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0)

o Index 1 for the planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0)

o Index 2 for the planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0)

- 2D bounding box (u0, v0, u1, v1)

- 3D location (x0, y0, z0) of the patch represented in terms of depth δ0, tangential shift s0 and bitangential shift r0. According to the chosen projection planes, (δ0, s0, r0) may be calculated as follows:

o Index 0: δ0 = x0, s0 = z0 and r0 = y0

o Index 1: δ0 = y0, s0 = z0 and r0 = x0

o Index 2: δ0 = z0, s0 = x0 and r0 = y0

Also, mapping information providing for each TxT block its associated patch index may be encoded as follows:

- For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.

- The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.

- Let I be the index of the patch to which the current TxT block belongs, and let J be the position of I in L. Instead of explicitly coding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.

The occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. One cell of the 2D grid produces a pixel during the image generation process.

The occupancy map compression 110 leverages the auxiliary information described in the previous section, in order to detect the empty TxT blocks (i.e., blocks with patch index 0). The remaining blocks may be encoded as follows: The occupancy map can be encoded with a precision of B0xB0 blocks, where B0 is a configurable parameter. In order to achieve lossless encoding, B0 may be set to 1. In practice, B0=2 or B0=4 results in visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map.

The compression process may comprise one or more of the following example operations:

• Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block. A value 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1, it is said to be full; otherwise, it is an empty sub-block.

• If all the sub-blocks of a TxT block are full (i.e., have value 1), the block is said to be full. Otherwise, the block is said to be non-full.

• A binary information may be encoded for each TxT block to indicate whether it is full or not.

• If the block is non-full, extra information indicating the location of the full/empty sub-blocks may be encoded as follows (a sketch follows this list):

o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.

o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.

o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy:

■ The binary value of the initial sub-block is encoded.

■ Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder.

■ The number of detected runs is encoded.

■ The length of each run, except for the last one, is also encoded.
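The following minimal Python sketch illustrates the run-length strategy above for sub-block occupancy bits given in traversal order; the entropy coding of the individual values is omitted, and the function name is illustrative:

# Run-length coding of sub-block occupancy bits along a traversal order:
# the initial value, the number of runs, and all run lengths except the
# last are encoded (the decoder infers the last one from the block size).
def encode_runs(bits):
    runs = []
    run_len = 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            run_len += 1
        else:
            runs.append(run_len)
            run_len = 1
    runs.append(run_len)
    return bits[0], len(runs), runs[:-1]

print(encode_runs([1, 1, 0, 0, 0, 1]))  # -> (1, 3, [2, 3])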

Figure 2 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 201 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 202. In addition, the de-multiplexer 201 transmits the compressed occupancy map to occupancy map decompression 203. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 204. Decompressed geometry video from the video decompression 202 is delivered to geometry reconstruction 205, as are the decompressed occupancy map and decompressed auxiliary patch information. The point cloud geometry reconstruction 205 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.

The reconstructed geometry image may be provided for smoothing 206, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 207, which also receives a decompressed texture video from video decompression 202. The texture reconstruction 207 outputs a reconstructed point cloud. The texture values for the texture reconstruction are directly read from the texture images.

The point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v) and let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:

δ(u, v) = δ0 + g(u, v)
s(u, v) = s0 - u0 + u
r(u, v) = r0 - v0 + v

where g(u, v) is the luma component of the geometry image.
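As a worked example of these formulas, the following Python sketch (with illustrative variable names, writing d0 for δ0) reconstructs one point:

# delta(u, v) = delta0 + g(u, v); s(u, v) = s0 - u0 + u; r(u, v) = r0 - v0 + v
def reconstruct_point(u, v, g_uv, d0, s0, r0, u0, v0):
    depth = d0 + g_uv            # depth from patch offset plus luma sample
    tangential = s0 - u0 + u     # tangential shift
    bitangential = r0 - v0 + v   # bi-tangential shift
    return depth, tangential, bitangential

# Patch at (d0, s0, r0) = (5, 2, 3), bounding box origin (u0, v0) = (10, 20),
# geometry luma g(12, 25) = 7: the point reconstructs to (12, 4, 8).
print(reconstruct_point(12, 25, 7, 5, 2, 3, 10, 20))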

For the texture reconstruction, the texture values can be directly read from the texture images. The result of the decoding process is a 3D point cloud reconstruction.

There are alternatives to capture and represent a volumetric frame. The format used to capture and represent the volumetric frame depends on the process to be performed on it, and the target application using the volumetric frame. As a first example, a volumetric frame can be represented as a point cloud. A point cloud is a set of unstructured points in 3D space, where each point is characterized by its position in a 3D coordinate system (e.g., Euclidean), and some corresponding attributes (e.g., color information provided as an RGBA value, or normal vectors). As a second example, a volumetric frame can be represented as images, with or without depth, captured from multiple viewpoints in 3D space. In other words, the volumetric video can be represented by one or more view frames (where a view is a projection of a volumetric scene onto a plane (the camera plane) using a real or virtual camera with known/computed extrinsics and intrinsics). Each view may be represented by a number of components (e.g., geometry, color, transparency, and occupancy picture), which may be part of the geometry picture or represented separately. As a third example, a volumetric frame can be represented as a mesh. A mesh is a collection of points, called vertices, and connectivity information between vertices, called edges. Vertices along with edges form faces. The combination of vertices, edges and faces can uniquely approximate shapes of objects.

Depending on the capture, a volumetric frame can provide viewers the ability to navigate a scene with six degrees of freedom, i.e., both translational and rotational movement of their viewing pose (which includes yaw, pitch, and roll). The data to be coded for a volumetric frame can also be significant, as a volumetric frame can contain a large number of objects, and the positioning and movement of these objects in the scene can result in many dis-occluded regions. Furthermore, the interaction of the light and materials in objects and surfaces in a volumetric frame can generate complex light fields that can produce texture variations for even a slight change of pose.

A sequence of volumetric frames is a volumetric video. Due to the large amount of information, storage and transmission of a volumetric video requires compression. A way to compress a volumetric frame can be to project the 3D geometry and related attributes into a collection of 2D images along with additional associated metadata. The projected 2D images can then be coded using 2D video and image coding technologies, for example ISO/IEC 14496-10 (H.264/AVC) and ISO/IEC 23008-2 (H.265/HEVC). The metadata can be coded with technologies specified in specifications such as ISO/IEC 23090-5. The coded images and the associated metadata can be stored or transmitted to a client that can decode and render the 3D volumetric frame.

In the following, a short reference of ISO/IEC 23090-5 Visual Volumetric Video-based Coding (V3C) and Video-based Point Cloud Compression (V-PCC) 2nd Edition is given. ISO/IEC 23090-5 specifies the syntax, semantics, and process for coding volumetric video. The specified syntax is designed to be generic, so that it can be reused for a variety of applications. Point clouds, immersive video with depth, and mesh representations can all use the ISO/IEC 23090-5 standard with extensions that deal with the specific nature of the final representation. The purpose of the specification is to define how to decode and interpret the associated data (for example atlas data in ISO/IEC 23090-5), which tells a renderer how to interpret 2D frames to reconstruct a volumetric frame.

Two applications of V3C (ISO/IEC 23090-5) have been defined: V-PCC (ISO/IEC 23090-5) and MIV (ISO/IEC 23090-12). MIV and V-PCC use a number of V3C syntax elements with slightly modified semantics. An example of how the generic syntax element can be differently interpreted by the application is pdu_projection_id.

In the case of V-PCC, the syntax element pdu_projection_id specifies the index of the projection plane for the patch. There can be 6 or 18 projection planes in V-PCC, and they are implicit, i.e., pre-determined. In the case of MIV, pdu_projection_id corresponds to a view ID, i.e., identifies which view the patch originated from. View IDs and their related information are explicitly provided in the MIV view parameters list and may be tailored for each content.

The MPEG 3DG (ISO SC29 WG7) group has started work on a third application of V3C: mesh compression. It is envisaged that mesh coding will reuse the V3C syntax as much as possible and may also slightly modify the semantics.

To differentiate between applications of a V3C bitstream and to allow a client to properly interpret the decoded data, V3C uses the ptl_profile_toolset_idc parameter.

A V3C bitstream is a sequence of bits that forms the representation of coded volumetric frames and the associated data, making one or more coded V3C sequences (CVS). A CVS is a sequence of bits, identified and separated by appropriate delimiters, that is required to start with a VPS, includes a V3C unit, and contains one or more V3C units with an atlas sub-bitstream or a video sub-bitstream. This is illustrated in Figure 3. Video sub-bitstreams and atlas sub-bitstreams can be referred to as V3C sub-bitstreams. A V3C unit header, in conjunction with VPS information, identifies which V3C sub-bitstream a V3C unit contains and how to interpret it. An example of this is shown herein below:

A V3C bitstream can be stored according to Annex C of ISO/IEC 23090-5, which specifies the syntax and semantics of a sample stream format to be used by applications that deliver some or all of the V3C unit stream as an ordered stream of bytes or bits within which the locations of V3C unit boundaries need to be identifiable from patterns in the data.

To enable parallelization, random access, as well as a variety of other functionalities, an atlas frame can be divided into one or more rectangular partitions that are referred to as tiles. Tiles are not allowed to overlap. An atlas frame may contain regions that are not associated with a tile. Figure 4 illustrates an example tile partitioning of an atlas frame, where the atlas frame is divided into 16 tile partitions and seven tiles.

An atlas frame refers to a single access unit of an atlas sub-bitstream and may contain multiple atlas tiles and frame dependent parameters. One atlas frame can consist of multiple V3C NAL units, and a single V3C NAL unit contains one atlas tile. Therefore, an atlas tile is a portion of an atlas frame identified by a tile ID. An atlas tile cannot contain frame dependent parameters. Each atlas frame can be divided into multiple atlas tiles. In addition, an atlas frame contains other information, such as parameters, that cannot be stored in atlas tiles.

ISO/IEC 23090-5 does not define how patches should be generated or arranged in tiles and leaves that to the encoder implementation to decide. Considering spatial partial access, it might make sense to store patches that correspond to the same 3D space region in the same tiles. This would allow an application consuming a V3C bitstream to cull irrelevant tiles altogether from the rendering process. V3C supports this through the definition of volumetric annotation SEI messages, which describe, among other things, dimensional parameters for tiles. V3C also provides a possibility to signal the constraints that were applied during tile creation through the VUI parameter structure, and syntax elements such as vui_fixed_atlas_tile_structure_flag, vui_fixed_video_tile_structure_flag, and vui_constrained_tiles_across_v3c_components_idc.

In the V3C high level syntax, a tile is identified by a tile ID, which is represented by the ath_id syntax element in the atlas_tile_header() structure.

NAL units in ISO/IEC 23090-5 are defined as described below:

NumBytesInNalUnit specifies the size of the NAL unit in bytes. This value is required for decoding of the NAL unit. Some form of demarcation of NAL unit boundaries is necessary to enable inference of NumBytesInNalUnit.

rbsp_byte[i] is the i-th byte of an RBSP. An RBSP is specified as an ordered sequence of bytes as follows:

The RBSP contains a string of data bits (SODB) as follows:

- If the SODB is empty (i.e., zero bits in length), the RBSP is also empty.

- Otherwise, the RBSP contains the SODB as follows:

1) The first byte of the RBSP contains the first (most significant, left-most) eight bits of the SODB; the next byte of the RBSP contains the next eight bits of the SODB, etc., until fewer than eight bits of the SODB remain.

2) The rbsp_trailing_bits() syntax structure is present after the SODB as follows:

a. The first (most significant, left-most) bits of the final RBSP byte contain the remaining bits of the SODB (if any).

b. The next bit consists of a single bit equal to 1 (i.e., rbsp_stop_one_bit).

c. When the rbsp_stop_one_bit is not the last bit of a byte-aligned byte, one or more bits equal to 0 (i.e., instances of rbsp_alignment_zero_bit) are present to result in byte alignment.

Syntax structures having these RBSP properties are denoted in the syntax table using an “_rbsp” suffix. These structures are carried within NAL units as the content of the rbsp_byte[i] data bytes.

The NAL unit header contains the fields presented below:

nal_forbidden_zero_bit shall be equal to 0.

nal_unit_type specifies the type of the RBSP data structure contained in the NAL unit as specified in Table 4 in ISO/IEC 23090-5.

nal_layer_id specifies the identifier of the layer to which an ACL NAL unit belongs or the identifier of a layer to which a non-ACL NAL unit applies. The value of nal_layer_id shall be in the range of 0 to 62, inclusive. The value of 63 may be specified in the future by ISO/IEC.

The value of nal_layer_id shall be the same for all ACL NAL units of a coded atlas frame. The value of nal_layer_id of a coded atlas frame is the value of the nal_layer_id of the ACL NAL units of the coded atlas frame. nal_temporal_id_plus1 minus 1 specifies a temporal identifier for the NAL unit. The value of nal_temporal_id_plus1 shall not be equal to 0.

The Real-time Transfer Protocol (RTP) is intended for end-to-end, real-time transfer of streaming media and provides facilities for jitter compensation and detection of packet loss and out-of-order delivery. RTP allows data transfer to multiple destinations through IP multicast or to a specific destination through IP unicast. The majority of RTP implementations are built on the User Datagram Protocol (UDP). Other transport protocols may also be utilized. RTP is used together with other protocols such as H.323 and the Real Time Streaming Protocol (RTSP).

The RTP specification describes two protocols: RTP and RTCP. RTP is used for the transfer of multimedia data, and RTCP is used to periodically send control information and QoS parameters.

RTP sessions may be initiated between client and server using a signalling protocol, such as H.323, the Session Initiation Protocol (SIP), or RTSP. These protocols may use the Session Description Protocol (RFC 8866) to specify the parameters for the sessions.

RTP is designed to carry a multitude of multimedia formats, which permits the development of new formats without revising the RTP standard. To this end, the information required by a specific application of the protocol is not included in the generic RTP header. For a class of applications (e.g., audio, video), an RTP profile may be defined. For a media format (e.g., a specific video coding format), an associated RTP payload format may be defined. Every instantiation of RTP in a particular application may require a profile and a payload format specification.

The profile defines the codecs used to encode the payload data and their mapping to payload format codes in the protocol field Payload Type (PT) of the RTP header.

For example, the RTP profile for audio and video conferences with minimal control is defined in RFC 3551. The profile defines a set of static payload type assignments, and a dynamic mechanism for mapping between a payload format and a PT value using the Session Description Protocol (SDP). The latter mechanism is used for newer video codecs such as the RTP payload format for H.264 video defined in RFC 6184 or the RTP Payload Format for High Efficiency Video Coding (HEVC) defined in RFC 7798.

An RTP session is established for each multimedia stream. Audio and video streams may use separate RTP sessions, enabling a receiver to selectively receive components of a particular stream. The RTP specification recommends an even port number for RTP, and the use of the next odd port number for the associated RTCP session. A single port can be used for RTP and RTCP in applications that multiplex the protocols.

RTP packets are created at the application layer and handed to the transport layer for delivery. Each unit of RTP media data created by an application begins with the RTP packet header.

The RTP header has a minimum size of 12 bytes. After the header, optional header extensions may be present. This is followed by the RTP payload, the format of which is determined by the particular class of application. The fields in the header are as follows:

• Version: (2 bits) Indicates the version of the protocol.

• P (Padding): (1 bit) Used to indicate if there are extra padding bytes at the end of the RTP packet.

• X (Extension): (1 bit) Indicates the presence of an extension header between the header and payload data. The extension header is application or profile specific.

• CC (CSRC count): (4 bits) Contains the number of CSRC identifiers that follow the SSRC.

• M (Marker): (1 bit) Signalling used at the application level in a profile-specific manner. If it is set, it means that the current data has some special relevance for the application.

• PT (Payload type): (7 bits) Indicates the format of the payload and thus determines its interpretation by the application.

• Sequence number: (16 bits) The sequence number is incremented for each RTP data packet sent and is to be used by the receiver to detect packet loss and to accommodate out-of-order delivery.

• Timestamp: (32 bits) Used by the receiver to play back the received samples at the appropriate time and interval. When several media streams are present, the timestamps may be independent in each stream. The granularity of the timing is application specific. For example, a video stream may use a 90 kHz clock. The clock granularity is one of the details that is specified in the RTP profile for an application.

• SSRC: (32 bits) Synchronization source identifier uniquely identifies the source of the stream. The synchronization sources within the same RTP session will be unique.

• CSRC: (32 bits each) Contributing source IDs enumerate contributing sources to a stream which has been generated from multiple sources.

• Header extension: (optional, presence indicated by Extension field) The first 32-bit word contains a profile-specific identifier (16 bits) and a length specifier (16 bits) that indicates the length of the extension in 32-bit units, excluding the 32 bits of the extension header. The extension header data is shown in Figure 5.
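As an illustration of the fixed 12-byte header layout above, the following minimal Python sketch unpacks the fields; it is not a complete RTP implementation, and the function name is illustrative:

import struct

# Unpack the fixed 12-byte RTP header described above.
def parse_rtp_header(packet):
    first, pt_byte, seq, ts, ssrc = struct.unpack('!BBHII', packet[:12])
    return {
        'version': first >> 6,
        'padding': (first >> 5) & 0x01,
        'extension': (first >> 4) & 0x01,
        'csrc_count': first & 0x0F,
        'marker': pt_byte >> 7,
        'payload_type': pt_byte & 0x7F,
        'sequence_number': seq,
        'timestamp': ts,
        'ssrc': ssrc,
    }

# Example: version 2, marker set, payload type 99, 90 kHz timestamp units.
hdr = struct.pack('!BBHII', 0x80, 0x80 | 99, 1234, 90000, 0xDEADBEEF)
print(parse_rtp_header(hdr))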

The RTP payload contains the payload data associated with a payload header. An optional RTP payload header and other fields may exist in the RTP payload, enabling different packetization strategies for NAL unit-based media pipelines. Commonly, at least the following packetization schemes exist (a sketch follows this list):

• Single NAL unit packet, which contains only one NAL unit in each RTP packet;

• Aggregation NAL packet, which contains multiple NAL units in each RTP packet;

• Fragmentation unit, which allows one NAL unit to be stored in multiple RTP packets.
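The following Python sketch illustrates how a sender might choose among these three schemes under an MTU budget; the per-scheme payload header syntax is codec specific and omitted here, and all names are illustrative:

# Packetization sketch: NAL units larger than the MTU become fragments,
# consecutive small NAL units are aggregated, medium ones travel alone.
def packetize(nal_units, mtu):
    payloads = []
    batch = []
    def flush():
        if batch:
            kind = 'aggregation' if len(batch) > 1 else 'single'
            payloads.append((kind, b''.join(batch)))
            batch.clear()
    for nal in nal_units:
        if len(nal) > mtu:
            flush()
            for off in range(0, len(nal), mtu):
                payloads.append(('fragment', nal[off:off + mtu]))
        elif sum(map(len, batch)) + len(nal) <= mtu:
            batch.append(nal)
        else:
            flush()
            batch.append(nal)
    flush()
    return payloads

for kind, data in packetize([b'a' * 10, b'b' * 10, b'c' * 100], mtu=40):
    print(kind, len(data))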

In this disclosure, the Session Description Protocol (SDP) is used as an example of a session specific file format. SDP is a format for describing multimedia communication sessions for the purposes of announcement and invitation. Its predominant use is in support of conversational and streaming media applications. SDP does not deliver any media streams itself, but is used between endpoints for negotiation of network metrics, media types, and other associated properties. The set of properties and parameters is called a session profile. SDP is extensible for the support of new media types and formats. The Session Description Protocol describes a session as a group of fields in a text-based format, one field per line. The form of each field is as follows:

<character>=<value><CR><LF>

where <character> is a single case-sensitive character and <value> is structured text in a format that depends on the character. Values may be UTF-8 encoded. Whitespace is not allowed immediately to either side of the equal sign.

Session descriptions consist of three sections: session, timing, and media descriptions. Each description may contain multiple timing and media descriptions. Names are only unique within the associated syntactic construct.

Fields appear in the order shown below; optional fields are marked with an asterisk:

v=  (protocol version number, currently only 0)
o=  (originator and session identifier: username, id, version number, network address)
s=  (session name: mandatory with at least one UTF-8-encoded character)
i=* (session title or short information)
u=* (URI of description)
e=* (zero or more email addresses with optional name of contacts)
p=* (zero or more phone numbers with optional name of contacts)
c=* (connection information - not required if included in all media)
b=* (zero or more bandwidth information lines)

One or more time descriptions ("t=" and "r=" lines; see below)

z=* (time zone adjustments)
k=* (encryption key)
a=* (zero or more session attribute lines)

Zero or more media descriptions (each one starting with an "m=" line; see below)

Time description (mandatory):

t=  (time the session is active)
r=* (zero or more repeat times)

Media description (optional):

m=  (media name and transport address)
i=* (media title or information field)
c=* (connection information - optional if included at session level)
b=* (zero or more bandwidth information lines)
k=* (encryption key)
a=* (zero or more media attribute lines - overriding the session attribute lines)

Below is a sample session description from RFC 4566. This session is originated by the user “jdoe” at IPv4 address 10.47.16.5. Its name is “SDP Seminar” and extended session information (“A Seminar on the session description protocol”) is included along with a link for additional information and an email address to contact the responsible party, Jane Doe. This session is specified to last two hours using NTP timestamps, with a connection address (which indicates the address clients should connect to or - when a multicast address is provided, as it is here - subscribe to) specified as IPv4 224.2.17.12 with a TTL of 127. Recipients of this session description are instructed to only receive media. Two media descriptions are provided, both using RTP Audio Video Profile. The first is an audio stream on port 49170 using RTP/AVP payload type 0 (defined by RFC 3551 as PCMU), and the second is a video stream on port 51372 using RTP/AVP payload type 99 (defined as “dynamic”). Finally, an attribute is included which maps RTP/AVP payload type 99 to format h263-1998 with a 90 kHz clock rate. RTCP ports for the audio and video streams of 49171 and 51373, respectively, are implied.

v=0
o=jdoe 2890844526 2890842807 IN IP4 10.47.16.5
s=SDP Seminar
i=A Seminar on the session description protocol
u=http://www.example.com/seminars/sdp.pdf
e=j.doe@example.com (Jane Doe)
c=IN IP4 224.2.17.12/127
t=2873397496 2873404696
a=recvonly
m=audio 49170 RTP/AVP 0
m=video 51372 RTP/AVP 99
a=rtpmap:99 h263-1998/90000

SDP uses attributes to extend the core protocol. Attributes can appear within the Session or Media sections and are scoped accordingly as session-level or media-level. New attributes can be added to the standard through registration with IANA. A media description may contain any number of “a=” lines (attribute fields) that are media description specific. Session-level attributes convey additional information that applies to the session as a whole rather than to individual media descriptions.

Attributes are either properties or values:

a=<attribute-name>
a=<attribute-name>:<attribute-value>

Examples of attributes defined in RFC8866 are “rtpmap” and “fmtp”.

The “rtpmap” attribute maps from an RTP payload type number (as used in an "m=" line) to an encoding name denoting the payload format to be used. It also provides information on the clock rate and encoding parameters. Up to one "a=rtpmap:" attribute can be defined for each media format specified, for example:

m=audio 49230 RTP/AVP 96 97 98
a=rtpmap:96 L8/8000
a=rtpmap:97 L16/8000
a=rtpmap:98 L16/11025/2

In the example above, the media types are “audio/L8” and “audio/L16”. Parameters added to an "a=rtpmap:" attribute may only be those required for a session directory to make the choice of appropriate media to participate in a session. Codec-specific parameters may be added in other attributes, for example, "fmtp".

The "fmtp" attribute allows parameters that are specific to a particular format to be conveyed in a way that SDP does not have to understand them. The format must be one of the formats specified for the media. Format-specific parameters, semicolon separated, may be any set of parameters required to be conveyed by SDP and given unchanged to the media tool that will use this format. At most one instance of this attribute is allowed for each format. An example is:

a=fmtp:96 profile-level-id=42e016; max-mbps=108000; max-fs=3600

For example, RFC7798 defines the following: sprop-vps, sprop-sps, sprop-pps, profile-space, profile-id, tier-flag, level-id, interop-constraints, profile-compatibility-indicator, sprop-sub-layer-id, recv-sub-layer-id, max-recv-level-id, tx-mode, max-lsr, max-lps, max-cpb, max-dpb, max-br, max-tr, max-tc, max-fps, sprop-max-don-diff, sprop-depack-buf-nalus, sprop-depack-buf-bytes, depack-buf-cap, sprop-segmentation-id, sprop-spatial-segmentation-idc, dec-parallel-cap, and include-dph.

The “group” and “mid” attributes defined in RFC 5888 allow "m" lines in SDP to be grouped for different purposes. Examples include lip synchronization, or receiving a media flow consisting of several media streams on different transport addresses.

For example, in a given session description, each "m" line is identified by a token, which is carried in a "mid" attribute below the "m" line. The session description carries session-level "group" attributes that group different "m" lines (identified by their tokens) using different group semantics. The semantics of a group describe the purpose for which the "m" lines are grouped. In the example below, the "group" line indicates that the "m" lines identified by tokens 1 and 2 (the audio and the video "m" lines, respectively) are grouped for the purpose of lip synchronization (LS).

v=0
o=Laura 289083124 289083124 IN IP4 one.example.com
c=IN IP4 192.0.2.1
t=0 0
a=group:LS 1 2
m=audio 30000 RTP/AVP 0
a=mid:1
m=video 30002 RTP/AVP 31
a=mid:2

RFC5888 defines two semantics for grouping: Lip Synchronization (LS), as used in the example above, and Flow Identification (FID). RFC5583 defines another grouping type, Decoding Dependency (DDP). RFC8843 defines another grouping type, BUNDLE, which among others is utilized when multiple types of media are sent in a single RTP session as described in RFC8860.

The "depend" attribute defined in RFC5583 allows signalling of two types of decoding dependencies: layered and multi-description.

The following dependency-type values are defined in RFC5583:

• lay: Layered decoding dependency identifies the described media stream as one or more Media Partitions of a layered Media Bitstream. When "lay" is used, all media streams required for decoding the Operation Point should be identified by identification-tag and fmt-dependency following the "lay" string.

• mdc: Multi-descriptive decoding dependency signals that the described media stream is part of a set of an MDC Media Bitstream. By definition, at least N-out-of-M media streams of the group need to be available to form an Operation Point. The values of N and M depend on the properties of the Media Bitstream and are not signaled within this context. When "mdc" is used, all required media streams for the Operation Point should be identified by identification-tag and fmt-dependency following the "mdc" string.

The example below shows a session description with three media descriptions, all of type video and with layered decoding dependency ("lay"). Each of the media descriptions includes two possible media format descriptions with different encoding parameters such as "packetization-mode" (not shown in the example) for the media subtypes "H264" and "H264-SVC" given by the "a=rtpmap:" line.

v=0
o=svcsrv 289083124 289083124 IN IP4 host.example.com
s=LAYERED VIDEO SIGNALING Seminar
t=0 0
c=IN IP4 192.0.2.1/127
a=group:DDP L1 L2 L3
m=video 40000 RTP/AVP 96 97
b=AS:90
a=framerate:15
a=rtpmap:96 H264/90000
a=rtpmap:97 H264/90000
a=mid:L1
m=video 40002 RTP/AVP 98 99
b=AS:64
a=framerate:15
a=rtpmap:98 H264-SVC/90000
a=rtpmap:99 H264-SVC/90000
a=mid:L2
a=depend:98 lay L1:96,97; 99 lay L1:97
m=video 40004 RTP/AVP 100 101
b=AS:128
a=framerate:30
a=rtpmap:100 H264-SVC/90000
a=rtpmap:101 H264-SVC/90000
a=mid:L3
a=depend:100 lay L1:96,97; 101 lay L1:97 L2:99

As defined in RFC3550 and RFC3551, RTP was designed to support multimedia sessions, containing multiple types of media sent simultaneously, by using multiple transport-layer flows, i.e., RTP sessions. This approach, however, is not always beneficial and can:

• increase delay to establish a complete session

• increase state and resource consumption in the middleboxes; and

• increase the risk that a subset of the transport-layer flows will fail to be established.

Therefore, in some cases using fewer RTP sessions can reduce the risk of communication failure and can lead to improved reliability and performance. It might seem appropriate for RTP-based applications to send all their RTP streams bundled into one RTP session, running over a single transport-layer flow. However, this was initially prohibited by the RTP specifications RFC3550 and RFC3551, because the design of RTP makes certain assumptions that can be incompatible with sending multiple media types in a single RTP session.

RFC8860 updates RFC3550 and RFC3551 to allow sending an RTP session containing RTP streams with media from multiple media types such as audio, video, and text.

From a signalling perspective, it shall be:

• ensured that any participant in the RTP session is aware that this is an RTP session with multiple media types;

• ensured that the payload types in use in the RTP session are using unique values, with no overlap between the media types;

• ensured that RTP session-level parameters - for example, the RTCP RR and RS bandwidth modifiers [RFC3556], the RTP/AVPF trr-int parameter [RFC4585], transport protocol, RTCP extensions in use, and any security parameters - are consistent across the session; and

• ensured that RTP and RTCP functions that can be bound to a particular media type are reused where possible, rather than configuring multiple code points for the same thing.

When using SDP signalling, the BUNDLE extension RFC8843 is used to signal RTP sessions containing multiple media types.

The RTP and RTCP packets are then demultiplexed into the different RTP streams based on their SSRC, while the RTP payload type is used to select the correct media-decoding pathway for each RTP stream. If not enough payload type values are available, the urn:ietf:params:rtp-hdrext:sdes:mid RTP header extension from RFC7941 can be used to associate RTP streams multiplexed on the same transport flow with their respective SDP media description, by providing a media description identifier that matches the value of the SDP a=mid attribute defined in RFC5888.
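For illustration, a minimal SDP offer using the BUNDLE grouping could look as follows; the addresses, ports, identification tags, and payload type numbers are placeholders, not values mandated by the cited specifications:

v=0
o=server 2890844526 2890844526 IN IP4 host.example.com
s=BUNDLE EXAMPLE
c=IN IP4 192.0.2.1
t=0 0
a=group:BUNDLE a1 v1
m=audio 10000 RTP/AVP 0
a=mid:a1
a=extmap:1 urn:ietf:params:rtp-hdrext:sdes:mid
m=video 10000 RTP/AVP 96
a=rtpmap:96 H264/90000
a=mid:v1
a=extmap:1 urn:ietf:params:rtp-hdrext:sdes:mid

Both media descriptions share the same transport-layer flow (port 10000), and the sdes:mid header extension lets the receiver map each RTP stream back to its a=mid value.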

SDP can be utilized in a scenario where two entities (i.e., a server and a client) negotiate a common understanding and setup of a multimedia session between them. In such a scenario one entity (e.g., the server) offers the other a description of the desired session (or possible options of a session) from its perspective, and the other participant answers with the desired session from its perspective. This offer/answer model is described in RFC3264 and can be used by other protocols, for example the Session Initiation Protocol (SIP) RFC3261.

A V3C bitstream is a composition of a number of video sub-bitstreams and atlas sub-bitstreams that can be identified by V3C unit headers. Video sub-bitstreams can be coded with well-known video coding standards such as AVC, HEVC, and VVC, which are NAL unit based and have well-defined RTP payload formats (RFC 6184, RFC 7798, and an internet draft, respectively).

The atlas sub-bitstream does not have an RTP payload format defined. The present embodiments define key concepts related to a V3C atlas payload format, thus enabling real-time streaming of V3C content. Furthermore, the special nature of the V3C atlas data is considered and enablers for partial real-time delivery are defined. Partial delivery allows delivering only the part of the V3C atlas data that is needed by the client, thus reducing overall bitrate requirements.

Thinning of an RTP stream can be performed by network level processing units (NPU) based on client feedback containing the tile IDs (e.g., via WebRTC). An NPU may filter the RTP packets based on the provided tile IDs. The tile ID of a NAL unit is provided by the ath_id syntax element in the atlas_tile_header() structure. The ath_id syntax element has a variable descriptor and is preceded by an Exp-Golomb-coded syntax element. This may require an NPU to have an Exp-Golomb parser as well as to store the parameter sets of the bitstream to be able to parse the ath_id syntax element appropriately. This is a burden on the NPU, which should be able to easily check the tile ID of a given RTP packet. The present embodiments propose V3C atlas data related RTP payload format signalling along with the required SDP extensions. The present embodiments are related to packing atlas frames and/or atlas tiles into RTP packets.
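To illustrate the parsing burden described above, the following is a minimal sketch, in Python, of the unsigned Exp-Golomb (ue(v)) decoding that an NPU would need before it can even reach the ath_id syntax element; the class and method names are illustrative only:

class BitReader:
    """Minimal MSB-first bit reader over a byte buffer (illustrative)."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current bit position

    def read_bit(self) -> int:
        byte = self.data[self.pos // 8]
        bit = (byte >> (7 - self.pos % 8)) & 1
        self.pos += 1
        return bit

    def read_bits(self, n: int) -> int:
        value = 0
        for _ in range(n):
            value = (value << 1) | self.read_bit()
        return value

    def read_ue(self) -> int:
        """ue(v): count leading zero bits, then read that many more bits."""
        leading_zeros = 0
        while self.read_bit() == 0:
            leading_zeros += 1
        return (1 << leading_zeros) - 1 + self.read_bits(leading_zeros)

Because the NPU would additionally have to hold the active parameter sets to know the length of the variable-length ath_id field, exposing the tile identifier directly in the RTP payload, as proposed below, removes this processing from the network path.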

V3C atlas sub-bitstream data, consisting of NAL units, is stored in RTP payloads. A two-byte RTP payload header is used to store the NAL unit header information along with other conditional information, such as a decoding order number indicator and a tile identifier. SDP is extended with new parameters for the media level v3cmap attribute to provide the tiling related information required to achieve partial delivery of V3C content over an RTP stream.

Different alternatives exist for encapsulating V3C NAL units of the atlas sub-bitstream in RTP payloads.

• In an embodiment, a single NAL unit is stored in a single RTP payload, and the NAL unit header is stored in the RTP payload header unmodified. Optional decoding order number and tile-id fields may follow the RTP payload header.

• In another embodiment, two or more NAL units belonging to the same access unit are stored in a single RTP packet, henceforth known as a single time aggregation packet (STAP). A STAP contains an RTP payload header with NAL unit type (NUT) equal to 56, or another value in the unspecified range of V3C NAL unit types in ISO/IEC 23090-5. The RTP payload header may be followed by a 16-bit tile identifier, which means that all NAL units in the aggregation units belong to the indicated tile. A STAP may contain two or more aggregation units. Each aggregation unit may contain a 16-bit decoding order number and a 16-bit tile identifier field, followed by a mandatory 16-bit NAL unit size field. The rest of each aggregation unit stores the unmodified NAL unit (including the header).

• In another embodiment, two or more NAL units belonging to different access units are stored in a single RTP packet, henceforth known as a multi-time aggregation packet (MTAP). An MTAP contains an RTP payload header with NAL unit type (NUT) equal to 57, or another value in the unspecified range of V3C NAL unit types in ISO/IEC 23090-5. The RTP payload header may be followed by a 16-bit base decoding order number field and a 16-bit tile identifier field. An MTAP may contain two or more aggregation units. Each aggregation unit contains a 16-bit timestamp offset field, followed by a conditional 8-bit decoding order number difference field and a conditional 16-bit tile identifier field. The conditional fields are followed by a mandatory NAL unit size field. The rest of each aggregation unit stores the unmodified NAL unit (including the header).

• In another embodiment, a single NAL unit may be stored in two or more RTP packets, henceforth known as fragmentation units (FU). Each FU contains an RTP payload header with NAL unit type (NUT) equal to 58, or another value in the unspecified range of V3C NAL unit types in ISO/IEC 23090-5. The RTP payload header is followed by an 8-bit FU header, which contains starting and ending indicators for the fragmentation unit along with the original NAL unit type of the fragmented NAL unit. Additional 16-bit decoding order number and 16-bit tile identifier fields may be present in an FU. The decoding order number and tile identifier fields may only be present in the FU where the starting indicator is set to true. The rest of the fragmentation unit contains a slice of the fragmented NAL unit.
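As a compact summary of the alternatives above, the following Python sketch (with illustrative names) maps the example NUT values to the payload structures; as noted, other values from the unspecified range of ISO/IEC 23090-5 may be chosen instead:

# Example NUT values for the payload structures described above.
NUT_STAP = 56  # single time aggregation packet
NUT_MTAP = 57  # multi-time aggregation packet
NUT_FU = 58    # fragmentation unit

def payload_structure(nut: int) -> str:
    """Classify an RTP payload by the NUT field of its payload header."""
    if nut == NUT_STAP:
        return "STAP"
    if nut == NUT_MTAP:
        return "MTAP"
    if nut == NUT_FU:
        return "FU"
    # Any NUT defined in ISO/IEC 23090-5, Table 4, indicates a single NAL unit packet.
    return "single NAL unit packet"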

The following updates to a session specific file format, such as SDP, are expected to indicate V3C streaming support.

• According to an embodiment, an attribute specific to V3C content may be extended with a new parameter indicating the level of exposing tile identifiers in RTP packets, e.g., tile-id-pres. The value of tile-id-pres determines whether tile identifiers are present once per RTP packet, whether they are present in aggregation units, or whether they are not present in RTP packets at all.

• According to another embodiment, an attribute specific to V3C content may be extended with a new parameter indicating if a 16-bit tile identifier field is present in the RTP payload header, e.g., pl-tile-id-pres. An attribute specific to V3C content may be extended with a new parameter indicating if a 16-bit tile identifier is present in the aggregation units of the RTP payload, e.g., au-tile-id-pres.

• An attribute specific to V3C content may be extended with a new parameter indicating the tile identifier or tile identifiers of the media stream, e.g., tile-id.

• An attribute specific to V3C content may be extended with a new parameter indicating if the media stream contains only partial access related information, e.g., partial-access.

In the following, the present embodiments are discussed under three subsections. The first subsection discusses, in general, how partial V3C content may be streamed using different types of streaming architectures. The second subsection describes how V3C atlas data may be stored in RTP payloads that enable different kinds of streaming architectures for partial delivery. The third subsection introduces additions to a session specific file format, such as SDP, that allow the client and server to negotiate how the partial V3C content is delivered.

Streaming architectures

V3C streaming may be done in different transmission modes depending on the network and client capabilities. Figure 6 illustrates an architecture which does not use any network level processing units (NPU) and either interleaves all atlas tiles in the same RTP stream as separate packets, stores multiple atlas tiles in a single RTP packet in a single RTP stream, or streams each atlas tile as a separate packet in a separate RTP stream.

In this example, the client device 620 may subscribe to individual RTP streams based on the tile identifiers, thus only streaming partial V3C content from the RTP server 610. When all atlas tiles are interleaved in a single RTP stream, and no network level processing units are available, partial delivery of V3C atlas data is not possible. Signalling of tile interleaving or separate RTP streams is done in a session specific file format, such as SDP.

Figure 7 illustrates another example where a streaming architecture contains a network processing unit 730 that performs a split operation on a single RTP stream, generating additional RTP streams. V3C atlas RTP streams may be split based on tile identifiers, which may be exposed in RTP packets as described in the RTP payload definitions below. Two types of splitting operations may be performed. For example, if more than one tile is stored in a single RTP packet, the tiles may be separated into different RTP packets. As another example, if different tiles are stored as separate packets of the same RTP stream, the packets may be split into separate RTP streams.

A client consuming an RTP stream that has been split may subscribe to RTP streams that only contain the tile identifiers that are actually needed. Streaming architectures may also contain network processing units which perform merging operations. Merging operations may exist on two levels: the separate RTP streams may be merged into a single RTP stream, as shown in Figure 8, so that different tiles are packed into separate RTP packets; or different tiles may be packed even into the same RTP packet, by using the functionality presented in this specification.

Depending on the chosen merge level (packet or stream level), a client consuming the merged RTP streams is able to filter packets upon reception either on the packet level or on the aggregation unit level, based on the tile-id.

Streaming architectures may also contain network processing units which perform filtering operations on RTP streams, an example of which is shown in Figure 9. Filtering operations require establishing a feedback channel, e.g., via WebRTC, to indicate to the processing unit which tile-ids should be filtered out or kept. Filtering may be performed on two levels: first, individual RTP packets in a single stream may be filtered based on the tile identifiers; second, aggregation units may be filtered based on tile identifiers.
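Such a filtering NPU could be sketched as follows, assuming the tile identifier has already been read from each packet as described in the RTP payload definitions below; the input layout is illustrative, not a defined interface:

def filter_rtp_packets(packets, wanted_tile_ids):
    """Keep only packets whose exposed tile-id is in the requested set.

    `packets` is an iterable of (tile_id, packet_bytes) pairs where the
    tile-id has already been extracted from the RTP payload. Aggregation
    unit level filtering would work the same way, applied per unit.
    """
    for tile_id, packet in packets:
        if tile_id in wanted_tile_ids:
            yield packet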

A client consuming a filtered stream only receives RTP packets with the tiles that have been requested.

Combinations of different types of streaming architectures are possible, combining splitting, merging, and filtering operations in the network in order to deliver data to the client as efficiently as possible while taking into account client device capabilities.

RTP Payload definitions

In this disclosure, four types of RTP packets are defined: single NAL unit packet, Single Time Aggregation Packet (STAP), Multi-Time Aggregation Packet (MTAP) and Fragmentation Unit (FU).

The first two bytes of the payload of an RTP packet are referred to as the RTP payload header. The payload header consists of the same fields (F, NUT, NLI, and TID) as the NAL unit header defined in ISO/IEC 23090-5, irrespective of the type of the payload structure. For convenience, an example of the structure of the RTP payload header is shown in Figure 10.
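For illustration, the two-byte payload header can be unpacked as follows, assuming the field widths of the ISO/IEC 23090-5 NAL unit header (1-bit F, 6-bit NUT, 6-bit NLI, 3-bit TID); the function name is illustrative:

def parse_payload_header(header: bytes):
    """Split the two-byte RTP payload header into F, NUT, NLI, and TID."""
    value = int.from_bytes(header[:2], "big")
    f = (value >> 15) & 0x1      # forbidden zero bit
    nut = (value >> 9) & 0x3F    # NAL unit type
    nli = (value >> 3) & 0x3F    # NAL layer id
    tid = value & 0x7            # temporal id
    return f, nut, nli, tid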

A single NAL unit packet contains exactly one NAL unit and consists of an RTP payload header and the following conditional fields: a 16-bit DONL and a 16-bit tile-id. The rest of the payload data contains the NAL unit payload data (excluding the NAL unit header). A single NAL unit packet may contain V3C NAL units of the types defined in ISO/IEC 23090-5, Table 4. An example of the structure of the single NAL unit packet is shown in Figure 11.

A NAL unit stream composed by de-packetizing single NAL unit packets in RTP sequence number order conforms to the NAL unit decoding order. The payload header may be an exact copy of the NAL unit header of the contained NAL unit.

The DONL field, when present, specifies the value of the 16-bit decoding order number of the contained NAL unit. If sprop-max-don-diff is greater than 0 for any of the RTP streams, the DONL field is present, and the variable DONL for the contained NAL unit is derived as equal to the value of the DONL field. Otherwise (sprop-max-don-diff is equal to 0 for all the RTP streams), the DONL field is not present.

The tile-id field, when present, specifies the 16-bit tile identifier for the NAL unit, as signalled in the V3C atlas tile header. If pl-tile-id-pres is equal to 1, the tile-id field is present. Otherwise, if pl-tile-id-pres is not present in the media description in SDP or if pl-tile-id-pres is equal to 0, the tile-id field is not present.
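The composition of a single NAL unit packet payload can then be sketched as follows; the function is illustrative and simply mirrors the field order described above:

def build_single_nal_packet(nal_unit: bytes,
                            donl: int | None = None,
                            tile_id: int | None = None) -> bytes:
    """Sketch of a single NAL unit packet payload (illustrative)."""
    payload = bytearray(nal_unit[:2])    # payload header = copy of the NAL unit header
    if donl is not None:                 # present when sprop-max-don-diff > 0
        payload += donl.to_bytes(2, "big")
    if tile_id is not None:              # present when pl-tile-id-pres == 1
        payload += tile_id.to_bytes(2, "big")
    payload += nal_unit[2:]              # NAL unit payload, header excluded
    return bytes(payload)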

A single time aggregation packet (STAP) may be used to combine NAL units that belong to the same access unit. Similarly to the single NAL unit packet, the first two bytes of the STAP shall contain the RTP payload header. The NAL unit type (NUT) of the NAL unit header contained in the RTP payload header shall be equal to 56, or any other value which falls in the unspecified range of the NAL unit types in ISO/IEC 23090-5. The STAP may also contain a tile-id field. The STAP contains two or more aggregation units. An example of the structure of the STAP is shown in Figure 12. The fields in the payload header are set as follows. The F bit may be equal to 0 if the F bit of each aggregated NAL unit is equal to zero; otherwise, it may be equal to 1. The Type field is equal to 56, or another value in the unspecified range of V3C NAL unit types in ISO/IEC 23090-5. The value of NLI is equal to the lowest value of NLI of all the aggregated NAL units. The value of TID is the lowest value of TID of all the aggregated NAL units.

All ACL NAL units in a single time aggregation packet have the same TID value since they belong to the same access unit. However, the packet may contain non-ACL NAL units for which the TID value in the NAL unit header may be different than the TID value of the ACL NAL units in the same AP.

The tile-id field, when present, specifies the 16-bit tile identifier for all NAL units in the STAP. If tile-id-pres is equal to 1, the tile-id field is present. Otherwise, the tile-id field is not present. A STAP may carry at least two aggregation units (AU) and can carry as many aggregation units as necessary; however, the total amount of data in an AP may fit into an IP packet, and the size should be chosen so that the resulting IP packet is smaller than the MTU size, so as to avoid IP layer fragmentation. The structure of the aggregation unit depends on the presence of the decoding order number, the sequence order of the aggregation unit in the aggregation payload, and the presence of the tile-id field. An example of an aggregation unit for STAP is shown in Figure 13.

If sprop-max-don-diff is greater than 0 for any of the RTP streams, an aggregation unit begins with the DOND/DONL field. The first aggregation unit in the aggregation payload contains a DONL field, which specifies the 16-bit value of the decoding order number of the aggregated NAL unit. The variable DON for the aggregated NAL unit is derived as equal to the value of the DONL field. All subsequent aggregation units in the aggregation payload shall contain an (8-bit) DOND field, which specifies the difference between the decoding order number values of the current aggregated NAL unit and the preceding aggregated NAL unit in the same AP. The variable DON for the aggregated NAL unit is derived as equal to the DON of the preceding aggregated NAL unit in the same AP plus the value of the DOND field plus 1, modulo 65536.
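Expressed as code, the derivation above is simply the following (a minimal sketch; the function name is illustrative):

def derive_stap_don(prev_don: int, dond: int) -> int:
    # DON of the current aggregated NAL unit from the preceding one
    # in the same aggregation packet, modulo the 16-bit DON space.
    return (prev_don + dond + 1) % 65536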

When sprop-max-don-diff is equal to 0 for all the RTP streams, the DOND/DONL fields are not present in an aggregation unit. The aggregation units may be stored in the aggregation packet so that the decoding order of the contained NAL units is preserved. This means that the first aggregation unit in the aggregation packet should contain the NAL unit that should be decoded first.

If tile-id-pres is equal to 2 and the NAL unit header type of the NAL unit stored in the aggregation unit is in the range 0 to 35, inclusive, the tile-id field is present in the aggregation unit after the conditional DOND/DONL field. Otherwise, the tile-id field is not present in the aggregation unit.

The conditional fields of the aggregation unit are followed by a 16-bit NALU size field, which provides the size of the NAL unit (in bytes) in the aggregation unit. The remainder of the data in the aggregation unit should contain the NAL unit (including the unmodified NAL unit header).

In another embodiment, a value of NUT other than 56 could be assigned to the STAP, wherein the STAP with the other NUT value contains the tile-id value in the RTP packet. In this scenario the tile-id-pres attribute, when present in SDP, is equal to 1.

A multi-time aggregation packet (MTAP) enables packing NAL units from different access units in a single RTP packet. This means that a single RTP packet can contain NAL units belonging to different temporal instances. The first two bytes of the MTAP shall contain the RTP payload header, where the NAL unit type (NUT) shall be equal to 57, which falls in the unspecified range of the NAL unit types in ISO/IEC 23090-5. The MTAP may also contain conditional DONB and tile-id fields. An MTAP may contain two or more aggregation units. Figure 14 illustrates an example of an MTAP.

The fields in the payload header may be set as follows. The F bit is equal to 0 if the F bit of each aggregated NAL unit is equal to zero; otherwise, it is equal to 1. The Type field is equal to 57. The value of NLI is equal to the lowest value of NLI of all the aggregated NAL units. The value of TID is the lowest value of TID of all the aggregated NAL units.

If sprop-max-don-diff is greater than 0 for any of the RTP streams, the RTP payload header may be followed by a 16-bit field containing the base decoding order number (DONB). DONB contains the value of DON for the first NAL unit in the NAL unit decoding order among the NAL units of the MTAP. The first NAL unit in the NAL unit decoding order is not necessarily the first NAL unit in the order in which the NAL units are encapsulated in an MTAP.

When sprop-max-don-diff is equal to 0 for all the RTP streams, the MTAP does not contain a DONB field. Instead, aggregation units may be stored in the MTAP so that the decoding order of the NAL units is preserved. This means that the first aggregation unit in the aggregation packet should contain the NAL unit that should be decoded first.

The tile-id field, when present, specifies the 16-bit tile identifier for all NAL units in the MTAP. If tile-id-pres is equal to 1, the tile-id field is present after the conditional DONB field. Otherwise, the tile-id field is not present. An MTAP may carry at least two aggregation units (AU). The structure of the aggregation unit depends both on the presence of the decoding order number and the tile-id field. Figure 15 illustrates an example of an aggregation unit for MTAP.

Each aggregation unit should begin with a 16-bit timestamp offset (TS offset) field, which contains the difference, in 90 kHz clock ticks, relative to the RTP header timestamp. The RTP header timestamp may be set equal to that of the earliest access unit in the MTAP.

If the MTAP contains a base NAL decoding order number (DONB), the timestamp offset field may be followed by an 8-bit field containing the decoding order number difference (DOND). The DOND field specifies the difference between the decoding order number value of the aggregation unit and the base decoding order number of the MTAP. The variable DON for the aggregated NAL unit is derived as equal to DONB plus the value of the DOND field plus 1, modulo 65536.
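For an MTAP aggregation unit, the receiver-side derivations described above can be sketched as follows (illustrative names only):

def derive_mtap_values(rtp_timestamp: int, ts_offset: int,
                       donb: int, dond: int) -> tuple[int, int]:
    # The timestamp offset is in 90 kHz clock ticks relative to the RTP
    # header timestamp; DON is derived from the base DON (DONB).
    timestamp = rtp_timestamp + ts_offset
    don = (donb + dond + 1) % 65536
    return timestamp, don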

If tile-id-pres is equal to 2, the tile-id field is present in the aggregation unit after the conditional DOND field. Otherwise, tile-id field is not present in the aggregation unit.

The conditional fields are followed by a 16-bit NALU size field, which provides the size of the NAL unit (in bytes) in the aggregation unit. The remainder of the data in the aggregation unit should contain the NAL unit (including the unmodified NAL unit header).

In another embodiment, a value of NUT other than 57 could be assigned to the MTAP, wherein the MTAP with the other NUT value contains the tile-id value in the RTP packet. In this scenario the tile-id-pres attribute, when present in SDP, may be equal to 1.

Fragmentation units (FUs) are introduced to enable fragmenting a single NAL unit into multiple RTP packets, possibly without co-operation or knowledge of the encoder. A fragment of a NAL unit consists of an integer number of consecutive octets of that NAL unit. Fragments of the same NAL unit may be sent in consecutive order with ascending RTP sequence numbers (with no other RTP packets within the same RTP stream being sent between the first and last fragment).

When a NAL unit is fragmented and conveyed within FUs, it is referred to as a fragmented NAL unit. Aggregation packets for STAP or MTAP are not fragmented. FUs are not nested; i.e., an FU should not contain a subset of another FU. The RTP header timestamp of an RTP packet carrying an FU is set to the NALU-time of the fragmented NAL unit.

An FU consists of an RTP payload header, an 8-bit FU header, a conditional 16-bit DONL field, a conditional tile-id field, and an FU payload. An example of the structure of an FU is illustrated in Figure 16.

The fields in the RTP payload header are set as follows. NUT may be equal to 58. The rest of the fields may be as in the fragmented NAL unit.

The FU header consists of an S bit, an E bit, and a 6-bit FUT field. An example of the structure of FU header is illustrated in Figure 17.

S (1 bit): When set to 1, the S bit indicates the start of a fragmented NAL unit, i.e., the first byte of the FU payload is also the first byte of the payload of the fragmented NAL unit. When the FU payload is not the start of the fragmented NAL unit payload, the S bit may be set to 0.

E (1 bit): When set to 1, the E bit indicates the end of a fragmented NAL unit, i.e., the last byte of the payload is also the last byte of the fragmented NAL unit. When the FU payload is not the last fragment of a fragmented NAL unit, the E bit may be set to 0.

FUT (6 bits): The field FUT may be equal to the field nal_unit_type of the fragmented NAL unit.

The DONL field, when present, specifies the value of the 16-bit decoding order number of the fragmented NAL unit. If sprop-max-don-diff is greater than 0 for any of the RTP streams, and the S bit is equal to 1 , the DONL field is present in the FU, and the variable DON for the fragmented NAL unit is derived as equal to the value of the DONL field. Otherwise (sprop-max-don-diff is equal to 0 for all the RTP streams, or the S bit is equal to 0), the DONL field is not present in the FU.

The tile-id field, when present, specifies the 16-bit tile identifier for the fragmented NAL unit. If tile-id-pres is equal to 1 and the S bit is equal to 1, the tile-id field is present after the conditional DONL field. Otherwise, the tile-id field is not present.

The NAL unit header of the fragmented NAL unit is not included as such in the FU payload, but rather the information of the NAL unit header of the fragmented NAL unit is conveyed in F, NLI, and TID fields of the RTP payload headers of the FUs and the FUT field of the FU header. An FU payload may not be empty.
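A fragmentation sketch along these lines is shown below in Python; it omits the conditional DONL and tile-id fields, assumes the two-byte NAL unit header layout of ISO/IEC 23090-5, and uses illustrative names:

def fragment_nal_unit(nal_unit: bytes, max_payload: int,
                      fu_nut: int = 58) -> list[bytes]:
    """Split one NAL unit into FU payloads (illustrative sketch)."""
    value = int.from_bytes(nal_unit[:2], "big")
    original_nut = (value >> 9) & 0x3F
    # Payload header copies the NAL unit header with NUT replaced by the FU type.
    header = ((value & ~(0x3F << 9)) | (fu_nut << 9)).to_bytes(2, "big")
    body = nal_unit[2:]  # the NAL unit header is not repeated in the FU payload
    chunks = [body[i:i + max_payload] for i in range(0, len(body), max_payload)]
    fus = []
    for i, chunk in enumerate(chunks):
        s = 1 if i == 0 else 0                 # start of the fragmented NAL unit
        e = 1 if i == len(chunks) - 1 else 0   # end of the fragmented NAL unit
        fu_header = (s << 7) | (e << 6) | original_nut  # S, E, 6-bit FUT
        fus.append(header + bytes([fu_header]) + chunk)
    return fus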

In another embodiment, a value of NUT other than 58 could be assigned to the FU, wherein the FU with the other NUT value contains the tile-id value in the RTP packet. In this scenario the tile-id-pres attribute, when present in SDP, is equal to 1.

In the following, an example of additional SDP attributes and parameters is shown:

a=v3cmap:<format> tile-id=<value>; tile-id-pres=<value>; partial-access=<value>

The presence of the tile-id parameter indicates that the media stream associated with the parameter only contains the tile-id as indicated by the value of the parameter. If the tile-id parameter is not present, an RTP stream may contain multiple tiles of unknown identifiers. The value of the parameter may be an array of tile identifiers, which implies that all the tiles with identifiers stored in the array are stored in the media stream.

In one embodiment an optional parameter tile-id-pres may be present in the v3cmap attribute. The value of the parameter indicates the level on which the tile id shall be signalled in the RTP payload. A value equal to 0 indicates that the tile-id field is not present in the RTP payload. A value equal to 1 means that the tile-id field is only present on the payload level, indicating that all NAL units in the payload share the same tile identifier. A value equal to 2 means that the tile-id field is present on the aggregation unit level, indicating that NAL units with different tile identifiers may be present in the RTP payload. If the tile-id-pres parameter is not present in the attribute, tile-id fields shall not be present in the RTP payload.
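For example, a media description using these parameters could look as follows; the media type name, port, and payload type number are placeholders, not values defined by this disclosure:

m=application 40008 RTP/AVP 102
a=rtpmap:102 v3c/90000
a=v3cmap:102 tile-id=1,2,3; tile-id-pres=2

Here the stream carries tiles 1, 2, and 3, and tile-id fields appear on the aggregation unit level.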

In another embodiment an optional parameter pl-tile-id-pres may exist in the v3cmap attribute, which indicates if the 16-bit tile-id field shall be present on the RTP payload level, essentially signalling that all NAL units in the payload are associated with the same tile identifier. Another optional parameter au-tile-id-pres may exist, which indicates if the 16-bit tile-id field shall be present on the aggregation unit level, essentially indicating that the RTP payload may contain NAL units with different tile identifiers. If pl-tile-id-pres indicates the presence of the tile identifier on the RTP payload level, the au-tile-id-pres parameter shall not be present in the attribute.

An optional parameter partial-access may exist in the v3cmap attribute, which indicates that the V3C media stream contains only partial access related information. This could mean that the media stream contains SEI messages consisting of volumetric annotation SEI messages as defined in ISO/IEC 23090-5, or dynamic spatial region information with the 'dyvm' sample entry type as defined in ISO/IEC 23090-10.

In one embodiment an RTP header extension is defined. A new identifier is defined to indicate that an RTP stream contains the header extension as well as to describe how the header extension should be parsed. The new identifier may be used with an extmap attribute on the media level of the SDP:

urn:ietf:params:rtp-hdrext:v3c:tile-id

The 8-bit ID is the local identifier, and the length field is as defined in RFC 5285. The 2 bytes of the RTP header extension, an example of which is shown in Figure 18, contain a tile-id value indicating that this RTP video packet contains a video tile that corresponds to V3C tiles with tile ID equal to tile-id.
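In SDP, enabling this header extension for a media stream could then look as follows, where the local identifier 1 is an arbitrary value negotiated as defined in RFC 5285:

a=extmap:1 urn:ietf:params:rtp-hdrext:v3c:tile-id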

In the previous sections, embodiments for partitioned RTP streaming have been discussed. According to an embodiment, atlas frames can be packed in RTP packets as follows:

- an atlas frame can be packed as single NAL unit packets, which requires multiple RTP packets to store a single atlas frame; one NAL unit per RTP packet;

- an atlas frame can be packed as a single time aggregation packet, whereupon one RTP packet stores two or more NAL units associated with an atlas frame (ideally storing one atlas frame in one packet), thus storing multiple atlas tiles and parameters;

- an atlas frame can be packed as multi-time aggregation packets, whereupon one RTP packet can store NAL units from two or more atlas frames;

- an atlas frame can be packed as fragmentation packets, whereupon two or more RTP packets can store one NAL unit of one atlas frame. This is useful when NAL units are too large to fit in a single RTP packet.
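A simple packetization decision along the lines of the list above could be sketched as follows; the per-unit overhead and the decision thresholds are illustrative assumptions, not values mandated by this disclosure:

def choose_packing(nal_sizes: list[int], mtu_payload: int) -> str:
    """Pick a packing mode for the NAL units of one atlas frame (sketch)."""
    if any(size > mtu_payload for size in nal_sizes):
        return "fragmentation units (FU)"
    # ~4 bytes of assumed per-unit overhead (size field plus conditional fields)
    if sum(size + 4 for size in nal_sizes) <= mtu_payload:
        return "single time aggregation packet (STAP)"
    return "single NAL unit packets"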

According to an embodiment, atlas tiles can be separated from atlas frames to improve partial delivery and pack atlas tiles in RTP packets:

- an atlas tile can be packed as a single NAL unit packet, whereupon there is one RTP packet per atlas tile;

- an atlas tile can be packed as a single time aggregation packet, whereupon one RTP packet can contain two or more atlas tiles;

- an atlas tile can be packed as a multi-time aggregation packet, whereupon one RTP packet can contain one or more atlas tiles from two or more atlas frames;

- an atlas tile can be packed as a fragmentation packet, whereupon two or more RTP packets store one atlas tile.

The method according to an embodiment is shown in Figure 19. The method generally comprises receiving 1905 a bitstream representing coded volumetric video, the bitstream being a composition of a number of video sub-bitstreams and atlas sub-bitstreams; encapsulating 1910 the video sub-bitstream to a Real-time Transfer Protocol (RTP) packet; encapsulating 1915 one or more components of the atlas sub-bitstream to one or more Real-time Transfer Protocol (RTP) packets as Visual Volumetric Video-based Coding (V3C) Network Abstraction Layer (NAL) units; sending 1920 the RTP packets over one or more RTP sessions to a client; and signalling 1925, per RTP session, information on the one or more atlas sub-bitstreams. Each of the steps can be implemented by a respective module of a computer system.

An apparatus according to an embodiment comprises means for receiving a bitstream representing coded volumetric video, the bitstream being a composition of a number of video sub-bitstreams and atlas sub-bitstreams; means for encapsulating the video sub-bitstream to a Real-time Transfer Protocol (RTP) packet; means for encapsulating one or more components of the atlas sub-bitstream to one or more Real-time Transfer Protocol (RTP) packets as Visual Volumetric Video-based Coding (V3C) Network Abstraction Layer (NAL) units; means for sending the RTP packets over one or more RTP sessions to a client; and means for signalling, per RTP session, information on the one or more atlas sub-bitstreams. The means comprise at least one processor and a memory including a computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 19 according to various embodiments.

The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving, and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of various embodiments.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with one another. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.