


Title:
A METHOD AND AN APPARATUS FOR VOLUMETRIC VIDEO ENCODING AND DECODING
Document Type and Number:
WIPO Patent Application WO/2019/211519
Kind Code:
A1
Abstract:
A method and technical equipment for encoding, wherein the method comprises processing a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks; assigning and streaming metadata for any block in the first frame of said group of frames; determining which blocks in any subsequent frame need update in said group of frames when compared to at least one previous frame; indicating in a bitstream if a block needs updates; and streaming metadata update for blocks in any subsequent frame only when the block has been indicated to need an update. The embodiments also concern a method and an apparatus for decoding.

Inventors:
ROIMELA KIMMO (FI)
PESONEN MIKA (FI)
PYSTYNEN JOHANNES (FI)
Application Number:
PCT/FI2019/050334
Publication Date:
November 07, 2019
Filing Date:
April 25, 2019
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H04N13/178; G06T15/08; H04N13/268; H04N19/46; H04N19/597; H04N19/70; H04N21/235; H04N21/435
Foreign References:
US20170374362A12017-12-28
Other References:
MAMMOU, K: "PCC Test Model Category 2 v1", 121. MPEG MEETING; 22-1-2018 - 26-1-2018; GWANGJU; (MOTION PICTURE EXPERT GROUP OR ISO/IEC JTC1/SC29/WG11), no. N17348, 16 April 2018 (2018-04-16), XP030023995
KÄMPE, V. ET AL.: "Exploiting coherence in time-varying voxel data", PROCEEDINGS OF THE 20TH ACM SIGGRAPH SYMPOSIUM ON INTERACTIVE 3D GRAPHICS AND GAMES (I3D '16), 27 February 2016 (2016-02-27), pages 15 - 21, XP058079601, DOI: 10.1145/2856400.2856413
PARK, J. ET AL.: "Internal Cell Skip Method for Occupancy Map Coding in TMC2", ISO/IEC JTC1/SC29/WG11, no. M42465, 14 April 2018 (2018-04-14), San Diego, XP030070804
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (FI)
Claims:

1. An encoding method, comprising:

- processing a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks;

- assigning and streaming metadata for any block in the first frame of said group of frames;

- determining which blocks in any subsequent frame need metadata update in said group of frames when compared to at least one previous frame;

- indicating in a bitstream if a block needs metadata update; and

- streaming metadata update for blocks in any subsequent frame only when the block has been indicated to need a metadata update.

2. An encoding method according to claim 1, wherein the metadata comprises one or both of the following: patch indexing, pixel occupancy.

3. A decoding method, comprising:

- receiving at least one bitstream including a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks;

- decoding metadata for any block in a first frame of the group of frames;

- decoding from the at least one bitstream an indication if a block in any subsequent frame needs metadata update;

- decoding metadata update for blocks in any subsequent frame of the group of frames, when the block has been indicated to need a metadata update; otherwise using the metadata in the first frame for any subsequent frames; and

- generating a volumetric video stream from the first and subsequent frames.

4. A decoding method according to claim 3, further comprising storing which blocks do not need updates within the group of frames.

5. An apparatus comprising at least

- means for processing a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks;

- means for assigning and streaming metadata for any block in the first frame of said group of frames;

- means for determining which blocks in any subsequent frame need metadata update in said group of frames when compared to at least one previous frame;

- means for indicating in a bitstream if a block needs metadata update; and

- means for streaming metadata update for blocks in any subsequent frame only when the block has been indicated to need a metadata update.

6. An apparatus according to claim 5, wherein the metadata comprises one or both of the following: patch indexing, pixel occupancy.

7. The apparatus according to claim 5 or 6, further comprising at least one processor, memory including computer program code, the memory and the computer program code.

8. An apparatus comprising at least

- means for receiving at least one bitstream including a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks;

- means for decoding metadata for any block in a first frame of the group of frames;

- means for decoding from the at least one bitstream an indication if a block in any subsequent frame needs metadata update;

- means for decoding metadata update for blocks in any subsequent frame of the group of frames, when the block has been indicated to need a metadata update; otherwise using the metadata in the first frame for any subsequent frames; and

- means for generating a volumetric video stream from the first and subsequent frames.

9. The apparatus according to claim 8, further comprising means for storing which blocks do not need updates within the group of frames.

10. The apparatus according to claim 8 or 9, further comprising at least one processor, memory including computer program code, the memory and the computer program code.

11. A computer program product, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:

- process a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks;

- assign and stream metadata for any block in the first frame of said group of frames;

- determine which blocks in any subsequent frame need metadata update in said group of frames when compared to at least one previous frame;

- indicate in a bitstream if a block needs metadata update; and

- stream metadata update for blocks in any subsequent frame only when the block has been indicated to need a metadata update.

12. A computer program product according to claim 11, being embodied on a non-transitory computer readable medium.

13. A computer program product, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:

- receive at least one bitstream including a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks;

- decode metadata for any block in a first frame of the group of frames;

- decode from the at least one bitstream an indication if a block in any subsequent frame needs metadata update;

- decode metadata update for blocks in any subsequent frame of the group of frames, when the block has been indicated to need a metadata update; otherwise using the metadata in the first frame for any subsequent frames; and

- generate a volumetric video stream from the first and subsequent frames.

14. A computer program product according to claim 13, being embodied on a non-transitory computer readable medium.

15. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- to process a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks;

- to assign and stream metadata for any block in the first frame of said group of frames;

- to determine which blocks in any subsequent frame need metadata update in said group of frames when compared to at least one previous frame;

- to indicate in a bitstream if a block needs metadata update; and

- to stream metadata update for blocks in any subsequent frame only when the block has been indicated to need a metadata update.

16. An apparatus according to claim 15, wherein the metadata comprises one or both of the following: patch indexing, pixel occupancy.

17. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- to receive at least one bitstream including a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks;

- to decode metadata for any block in a first frame of the group of frames;

- to decode from the at least one bitstream an indication if a block in any subsequent frame needs metadata update;

- to decode metadata update for blocks in any subsequent frame of the group of frames, when the block has been indicated to need a metadata update; otherwise to use the metadata in the first frame for any subsequent frames; and

- to generate a volumetric video stream from the first and subsequent frames.

18. The apparatus according to claim 17, further comprising computer program code configured to cause the apparatus to store which blocks do not need updates within the group of frames.

Description:
A METHOD AND AN APPARATUS FOR VOLUMETRIC VIDEO ENCODING AND DECODING

Technical Field

The present solution generally relates to volumetric video coding. In particular, the solution relates to compression of block allocation data.

Background

Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with relatively narrow field of view, and displayed as a rectangular scene on flat displays. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed).

More recently, new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being "immersed" into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.

For volumetric video, a scene may be captured using one or more 3D (three-dimensional) cameras. The cameras are in different positions and orientations within a scene. One issue to consider is that compared to 2D (two-dimensional) video content, volumetric 3D video content has much more data, so viewing it requires a lot of bandwidth (whether or not it is transferred from a storage location to a viewing device): disk I/O, network traffic, memory bandwidth, and GPU (Graphics Processing Unit) upload. Capturing volumetric content also produces a lot of data, particularly when multiple capture devices are used in parallel.

Summary

Now there has been invented a method and technical equipment implementing the method, for providing an improvement for volumetric video coding. Various aspects of the invention include a method, an apparatus, and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.

According to a first aspect, there is provided a method for encoding, the method comprising processing a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks; assigning and streaming metadata for any block in the first frame of said group of frames; determining which blocks in any subsequent frame need metadata update in said group of frames when compared to at least one previous frame; indicating in a bitstream if a block needs metadata update; and streaming metadata update for blocks in any subsequent frame only when the block has been indicated to need a metadata update.

According to a second aspect, there is provided a method for decoding, the method comprising receiving at least one bitstream including a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks; decoding metadata for any block in a first frame of the group of frames; decoding from the at least one bitstream an indication if a block in any subsequent frame needs metadata update; decoding metadata update for blocks in any subsequent frame of the group of frames, when the block has been indicated to need a metadata update; otherwise using the metadata in the first frame for any subsequent frames; and generating a volumetric video stream from the first and subsequent frames.

According to a third aspect, there is provided an apparatus comprising at least means for processing a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks; means for assigning and streaming metadata for any block in the first frame of said group of frames; means for determining which blocks in any subsequent frame need metadata update in said group of frames when compared to at least one previous frame; means for indicating in a bitstream if a block needs metadata update; and means for streaming metadata update for blocks in any subsequent frame only when the block has been indicated to need a metadata update.

According to a fourth aspect, there is provided an apparatus comprising at least means for receiving at least one bitstream including a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks; means for decoding metadata for any block in a first frame of the group of frames; means for decoding from the at least one bitstream an indication if a block in any subsequent frame needs metadata update; means for decoding metadata update for blocks in any subsequent frame of the group of frames, when the block has been indicated to need a metadata update; otherwise using the metadata in the first frame for any subsequent frames; and means for generating a volumetric video stream from the first and subsequent frames.

According to a fifth aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: process a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks; assign and stream metadata for any block in the first frame of said group of frames; determine which blocks in any subsequent frame need metadata update in said group of frames when compared to at least one previous frame; indicate in a bitstream if a block needs metadata update; and stream metadata update for blocks in any subsequent frame only when the block has been indicated to need a metadata update.

According to a sixth aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive at least one bitstream including a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks; decode metadata for any block in a first frame of the group of frames; decode from the at least one bitstream an indication if a block in any subsequent frame needs metadata update; decode metadata update for blocks in any subsequent frame of the group of frames, when the block has been indicated to need a metadata update; otherwise use the metadata in the first frame for any subsequent frames; and generate a volumetric video stream from the first and subsequent frames.

According to a seventh aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to process a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks; assign and stream metadata for any block in the first frame of said group of frames; determine which blocks in any subsequent frame need metadata update in said group of frames when compared to at least one previous frame; indicate in a bitstream if a block needs metadata update; and stream metadata update for blocks in any subsequent frame only when the block has been indicated to need a metadata update.

According to an eighth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive at least one bitstream including a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks; decode metadata for any block in a first frame of the group of frames; decode from the at least one bitstream an indication if a block in any subsequent frame needs metadata update; decode metadata update for blocks in any subsequent frame of the group of frames, when the block has been indicated to need a metadata update; otherwise use the metadata in the first frame for any subsequent frames; and generate a volumetric video stream from the first and subsequent frames.

According to an embodiment, the metadata comprises one or both of the following: patch indexing, pixel occupancy.

According to an embodiment, the apparatus further comprises means for storing which blocks do not need updates within the group of frames.

According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.

Description of the Drawings

In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows an example of a compression process;

Fig. 2 shows an example of a decompression process;

Fig. 3 shows an example of different block types;

Fig. 4 shows per-block signaling according to a first embodiment;

Fig. 5 shows per-block signaling according to a second embodiment;

Fig. 6 is a flowchart of an encoding method according to an embodiment;

Fig. 7 is a flowchart of a decoding method according to an embodiment;

Fig. 8 shows an example of an apparatus according to an embodiment; and

Fig. 9 shows an example of a layout of an apparatus according to an embodiment.

Description of Example Embodiments

In the following, several embodiments will be described in the context of volumetric video. In particular, the present embodiments relate to lossless compression of block allocation data and the signaling of temporally coherent occupancy blocks. The present embodiments are applicable e.g. in the MPEG-I Point Cloud Compression (PCC).

Volumetric video may be captured using one or more 3D cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.

Volumetric video enables the viewer to move in six degrees of freedom: in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane. Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a 2D plane. However, only a relatively small part of the volume changes over time. Therefore, it may be possible to reduce the total amount of data by only coding information about an initial state and changes which may occur between frames. Volumetric video can be rendered from synthetic 3D animations, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR, for example.

Volumetric video data represents a three-dimensional scene or object and can be used as an input for augmented reality (AR), virtual reality (VR) and mixed reality (MR) applications. Such data describes geometry (shape, size, position in 3D-space) and respective attributes (e.g. colour, opacity, reflectance, ...), plus any possible temporal changes of the geometry and attributes at given time instances (e.g. frames in 2D video). Volumetric video is either generated from 3D models, i.e. CGI, or captured from real-world scenes using a variety of capture solutions, e.g. a multi-camera, a laser scan, a combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Examples of representation formats for such volumetric data are triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e. "frames" in 2D video, or other means, e.g. position of an object as a function of time.

Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real-world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is to code it as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.

In 3D point clouds, each point of each 3D surface is described as a 3D point with colour and/or other attribute information such as surface normal or material reflectance. A point cloud is a set of data points in a coordinate system, for example a three-dimensional coordinate system defined by X, Y, and Z coordinates. The points may represent an external surface of an object in the scene space, e.g. in a three-dimensional space.

In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression becomes essential. Standard volumetric video representation formats, such as point clouds, meshes, or voxels, suffer from poor temporal compression performance. Identifying correspondences for motion-compensation in 3D-space is an ill-defined problem, as both geometry and respective attributes may change. For example, temporally successive "frames" do not necessarily have the same number of meshes, points or voxels. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview with depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.

Figure 1 illustrates an overview of an example of a compression process. Such process may be applied for example in MPEG Point Cloud Coding (PCC). The process starts with an input point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105.

The patch generation 102 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. For patch generation, the normal at every point can be estimated. An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:

- (1.0, 0.0, 0.0),

- (0.0, 1.0, 0.0),

- (0.0, 0.0, 1.0),

- (-1.0, 0.0, 0.0),

- (0.0, -1.0, 0.0), and

- (0.0, 0.0, -1.0).

More precisely, each point may be associated with the plane that has the closest normal (i.e. maximizes the dot product of the point normal and the plane normal).

The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The final step may comprise extracting patches by applying a connected component extraction procedure.
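For illustration, the initial clustering step may be sketched as follows. This is a simplified sketch, not the reference implementation; it assumes per-point normals have already been estimated, and the function and array names are illustrative only.

```python
import numpy as np

# The six axis-aligned projection plane normals listed above.
PLANE_NORMALS = np.array([
    [ 1.0,  0.0,  0.0],
    [ 0.0,  1.0,  0.0],
    [ 0.0,  0.0,  1.0],
    [-1.0,  0.0,  0.0],
    [ 0.0, -1.0,  0.0],
    [ 0.0,  0.0, -1.0],
])

def initial_clustering(point_normals: np.ndarray) -> np.ndarray:
    """Associate each point with the plane whose normal is closest to the
    point normal, i.e. the plane maximizing the dot product.
    point_normals: Nx3 array of unit normals; returns N cluster indices (0..5)."""
    scores = point_normals @ PLANE_NORMALS.T  # N x 6 dot products
    return np.argmax(scores, axis=1)
```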

Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to packing process 103, to geometry image generation 104 and to texture image generation 105. The packing process 103 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g. 16x16) block of the grid is associated with a unique patch. It should be noticed that T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder.

The geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called the near layer, stores the point of H(u, v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+D], where D is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:

• Geometry: WxH YUV420-8bit,

• Texture: WxH YUV420-8bit,

It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
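The two-layer (near/far) projection described above can be illustrated with the following sketch. It is a simplified illustration under the assumption that patch points are already available as projected (u, v, depth) samples; the names are hypothetical and the surface thickness D is passed in as a parameter.

```python
from collections import defaultdict

def project_patch_to_layers(points, surface_thickness_d):
    """points: iterable of (u, v, depth) samples of one patch.
    Returns per-pixel depths for the near and far layers."""
    per_pixel = defaultdict(list)
    for u, v, depth in points:
        per_pixel[(u, v)].append(depth)

    near_layer, far_layer = {}, {}
    for pixel, depths in per_pixel.items():
        d0 = min(depths)  # near layer: lowest depth D0
        in_range = [d for d in depths if d <= d0 + surface_thickness_d]
        near_layer[pixel] = d0
        far_layer[pixel] = max(in_range)  # far layer: highest depth in [D0, D0+D]
    return near_layer, far_layer
```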

The geometry images and the texture images may be provided to image padding 107. The image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images. The occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process.

The padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g. 16x16) pixels is compressed independently. If the block is empty (i.e. unoccupied, i.e. all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e. occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e. edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbors. The padded geometry images and padded texture images may be provided for video compression 108. The generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters. The video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102. The smoothed geometry may be provided to texture image generation 105 to adapt the texture images.
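A minimal sketch of the simple padding strategy described above is given below. It is only an illustration of the stated block-based strategy (empty blocks copy from the previous block in raster order, edge blocks are filled iteratively from occupied neighbours); it is not the reference code, and the helper names are assumptions.

```python
import numpy as np

def pad_image(image: np.ndarray, occupancy: np.ndarray, t: int = 16) -> np.ndarray:
    """image: HxW samples; occupancy: HxW boolean mask of non-empty pixels."""
    out = image.copy()
    h, w = image.shape
    for by in range(0, h, t):
        for bx in range(0, w, t):
            occ = occupancy[by:by + t, bx:bx + t]
            blk = out[by:by + t, bx:bx + t]
            if not occ.any():
                # Empty block: copy the last column (or row) of the previous block.
                if bx >= t:
                    blk[:, :] = out[by:by + t, bx - 1:bx]
                elif by >= t:
                    blk[:, :] = out[by - 1:by, bx:bx + t]
            elif not occ.all():
                # Edge block: iteratively fill empty pixels with the average
                # of their already-filled neighbours.
                filled = occ.copy()
                while not filled.all():
                    for y in range(blk.shape[0]):
                        for x in range(blk.shape[1]):
                            if filled[y, x]:
                                continue
                            nbrs = [blk[yy, xx]
                                    for yy, xx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                                    if 0 <= yy < blk.shape[0] and 0 <= xx < blk.shape[1]
                                    and filled[yy, xx]]
                            if nbrs:
                                blk[y, x] = sum(nbrs) / len(nbrs)
                                filled[y, x] = True
            # Full block: nothing is done.
    return out
```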

Each patch may be associated with auxiliary information that is encoded/decoded as metadata. The auxiliary information may comprise the index of the projection plane, the 2D bounding box, and the 3D location of the patch represented in terms of depth δ0, tangential shift s0 and bitangential shift r0.

The occupancy map compression 110 leverages the auxiliary information described in the previous section in order to detect the empty TxT blocks (i.e. blocks with patch index 0). The remaining blocks may be encoded as follows: The occupancy map can be encoded with a precision of B0xB0 blocks, where B0 is a configurable parameter. In order to achieve lossless encoding, B0 may be set to 1. In practice, B0=2 or B0=4 gives visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map.

The compression process may comprise one or more of the following example operations:

• Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block. A value 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1, it is said to be full; otherwise it is an empty sub-block.

• If all the sub-blocks of a TxT block are full (i.e., have value 1), the block is said to be full. Otherwise, the block is said to be non-full.

• Binary information may be encoded for each TxT block to indicate whether it is full or not.

• If the block is non-full, extra information indicating the location of the full/empty sub-blocks may be encoded as follows:

o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.

o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.

o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy (a sketch is given after this list).

The binary value of the initial sub-block is encoded.

Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder.

The number of detected runs is encoded.

The length of each run, except the last one, is also encoded.
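The run-length coding of sub-block occupancy values described in the list above may be sketched as follows. The traversal-order generation and the subsequent arithmetic coding stage are omitted, and the function and field names are illustrative only.

```python
def encode_subblock_runs(subblock_bits, traversal_order):
    """subblock_bits: 2D list of 0/1 values, one per B0xB0 sub-block of a TxT block;
    traversal_order: sequence of (row, col) indices chosen by the encoder.
    Returns the symbols that would be entropy coded."""
    values = [subblock_bits[r][c] for r, c in traversal_order]

    # Detect continuous runs of 0s and 1s along the chosen traversal order.
    runs = []
    run_length = 1
    for prev, cur in zip(values, values[1:]):
        if cur == prev:
            run_length += 1
        else:
            runs.append(run_length)
            run_length = 1
    runs.append(run_length)

    return {
        "initial_value": values[0],   # binary value of the initial sub-block
        "num_runs": len(runs),        # number of detected runs
        "run_lengths": runs[:-1],     # lengths of all runs except the last one
    }
```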

A multiplexer 112 may receive a compressed geometry video and a compressed texture video from the video compression 108, a compressed occupancy map from occupancy map compression 110 and, optionally, compressed auxiliary patch information from auxiliary patch-info compression 111. The multiplexer 112 uses the received data to produce a compressed bitstream.

Figure 2 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 201 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 202. In addition, the de-multiplexer 201 transmits the compressed occupancy map to occupancy map decompression 203. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 204. Decompressed geometry video from the video decompression 202 is delivered to geometry reconstruction 205, as are the decompressed occupancy map and decompressed auxiliary patch information. The point cloud geometry reconstruction 205 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.

The reconstructed geometry image may be provided for smoothing 206, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 207, which also receives a decompressed texture video from video decompression 202. The texture reconstruction 207 outputs a reconstructed point cloud. The texture values for the texture reconstruction are directly read from the texture images.

Coding of occupancy information can be performed with the geometry image. A specific depth value, e.g. 0, or a specific depth value range may be reserved to indicate that a pixel is inpainted and not present in the source material. The specific depth value or the specific depth value range may be pre-defined, for example in a standard, or the specific depth value or the specific depth value range may be encoded into or along the bitstream and/or may be decoded from or along the bitstream. This way of multiplexing the occupancy information in the depth sample array creates sharp edges into the images, which may be subject to additional bitrate as well as compression artefacts around the sharp edges.
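For illustration, multiplexing occupancy information into the depth sample array with a reserved value could look roughly like the sketch below. The reserved value 0 is only the example mentioned above; in practice the reserved value or range would be pre-defined or signaled in or along the bitstream, and real depth values would have to avoid it.

```python
import numpy as np

RESERVED_DEPTH = 0  # example reserved value meaning "inpainted / not present in the source"

def embed_occupancy_in_depth(depth: np.ndarray, occupancy: np.ndarray) -> np.ndarray:
    """Mark unoccupied pixels with the reserved depth value before encoding."""
    geometry = depth.copy()
    geometry[~occupancy] = RESERVED_DEPTH
    return geometry

def recover_occupancy(decoded_depth: np.ndarray) -> np.ndarray:
    """Decoder side: a pixel is considered valid only if its depth is not the reserved value."""
    return decoded_depth != RESERVED_DEPTH
```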

One way to compress a time-varying volumetric scene/object is to project 3D surfaces onto some number of pre-defined 2D planes. Regular 2D video compression algorithms can then be used to compress various aspects of the projected surfaces. For example, a time-varying 3D point cloud with spatial and texture coordinates can be mapped into a sequence of at least two sets of planes, where one of the two sets carries the texture data and the other carries the distance of the mapped 3D surface points from the projection planes.

For accurate 2D to 3D reconstruction at the receiving side, the decoder must be aware which 2D points are "valid" and which points stem from interpolation/padding. This requires the transmission of additional data. The additional data may be encapsulated in the geometry image as a pre-defined depth value (e.g. 0) or a pre-defined range of depth values. This will increase the coding efficiency only on the texture image, since the geometry image is not blurred/padded. Furthermore, encoding artefacts at the object boundaries of the geometry image may create severe artefacts, which require post-processing and may not be concealable.

When using video compression, a set of frames can be compressed to get better compression results. The current MPEG-I PCC has lossless block allocation and occupancy map data that cannot be compressed with video codecs. The occupancy map for a block of a frame is a binary lossless image mask for depth values. The current MPEG-I PCC implementation includes compression of texture (color), geometry (depth) and occupancy map. Occupancy includes patch data, block-to-patch IDs and edge blocks. In the current version of MPEG-I PCC, the occupancy map and its data are streamed and updated for every frame.

The relative size of the occupancy map data is significant and is expected to increase drastically in the future when the patch allocation scheme is made temporally coherent, enabling bigger GOP (Group of Pictures) sizes and correspondingly higher compression in the texture and geometry streams. The current MPEG-I PCC does not consider the temporal coherency of occupancy map. In the current MPEG-I PCC implementation, each edge (i.e. partially occupied) block (e.g. 16x16 pixels) has a corresponding occupancy map block that is coded first with RLE compression (different scanning orders are tested) and then adaptive arithmetic coding is used for the best scanning order.

The present embodiments are targeted to improve the compression of the lossless occupancy map and the block-to-patch mapping in the MPEG-I PCC stream. This may be achieved by updating the first frame as a whole (a complete frame with all patch/occupancy metadata without temporal coherence) and enabling the rest of the frames that have been tagged as coherent to omit the patch/occupancy metadata. This may be implemented by indicating in a bitstream which blocks of a group of frames (GoF) are temporally coherent and which are changing. The temporal coherency is based on the metadata of a block. At first, it is checked for a group of frames which blocks do not change, i.e. which blocks do not need updates, i.e. which are coherent. The coherency is based on metadata coherency, and thus the pixel content of the blocks may still change. The checking is based on the occupancy of a block and may be done for all occupancy elements indicated in the metadata of the block: block-to-patch IDs, full blocks, empty blocks (patch ID zero) and edge blocks. Any block metadata in the beginning (i.e. in the first frame) of the GoF is signaled, but only metadata of those blocks that need updates (i.e. blocks that change) is signaled for the rest of the frames (separate updates for block-to-patch and full/empty/edge). This means that, per frame, metadata is streamed and updated only for the blocks that have been explicitly signaled to receive new content.

There are two kinds of data that can be associated as metadata with blocks. A block may have metadata comprising a block-to-patch index, which defines which patch a given block belongs to. In addition, each non-empty block (i.e. a block that is allocated to a patch) has occupancy data as metadata. The occupancy data can be a single bit for full blocks or a single bit followed by an occupancy map for edge blocks. A coherency can be determined for one or both of these data types: block-to-patch mapping and/or occupancy. Each block may therefore be coherent with respect to one or both of these data types. It is appreciated that a block may also change between edge and interior status during a group of frames, so the type of occupancy data may also change between the single "full" bit and a complete occupancy map on a frame basis.
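The two kinds of per-block metadata described above can be summarized with a small data structure. This is a sketch only; the field names and the helper for evaluating coherency per data type are assumptions, not part of the MPEG-I PCC specification.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BlockMetadata:
    patch_index: int                            # block-to-patch ID; 0 means an empty block
    full: Optional[bool] = None                 # for non-empty blocks: True if fully occupied
    occupancy_map: Optional[List[int]] = None   # per-pixel mask, present only for edge blocks

    def is_coherent_with(self, other: "BlockMetadata",
                         check_patch: bool = True,
                         check_occupancy: bool = True) -> bool:
        """Coherency may be evaluated per data type: block-to-patch mapping,
        occupancy, or both."""
        ok = True
        if check_patch:
            ok = ok and self.patch_index == other.patch_index
        if check_occupancy:
            ok = ok and (self.full, self.occupancy_map) == (other.full, other.occupancy_map)
        return ok
```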

Thus, the present embodiments propose signaling of temporally coherent block-to-patch allocation and occupancy map edge blocks. Fig. 3 shows an example of three types of occupancy blocks appearing in a patch atlas 301. "Full occupancy block" refers to an occupancy block that is fully occupied with points and may be signaled with a patch ID and a '1' bit. "Empty occupancy block" refers to an occupancy block that is completely empty, having no points. This kind of block may be signaled with a patch ID of zero. "Edge occupancy block" refers to an occupancy block that has points partially covering the block. An edge occupancy block may be signaled with a patch ID followed by a '0' bit and a coded occupancy block.

In the current version of MPEG-I PCC, the empty blocks are signaled with a patch ID of zero, while full blocks are signaled with a patch ID and a single bit. Edge blocks additionally contain RLE-compressed points for a block (16x16).

According to the present embodiments, temporally coherent blocks are signaled for a group of frames. The number of frames contained in the group of frames may or may not be the same as the size of a group of pictures in the underlying video codec. Various embodiments assume that the patch atlas generation for the color and geometry streams is coherent for this set of group of frames, i.e., that the patches are allocated statically and do not move within the atlas during the group of frames. The content inside each patch may still move.

When considering the coherent patch atlas, it is possible to evaluate the blocks that are changing within the group of frames. The evaluation may be based on the occupancy map of a block between frames. Such changing blocks are signaled according to present embodiments. In the following, two embodiments for representing the signals in the PCC bitstream indicating the changing blocks are discussed. The embodiments pertain to a group of frames.

The first embodiment combines block mapping and occupancy signaling. For a group of frames, one bit per occupancy block is signaled to indicate whether the occupancy block is temporally coherent. For example, for a 60x60 block atlas this amounts to a total of 3600 bits, which may be compressed using e.g. arithmetic coding. For the first frame, a block-to-patch mapping is performed, and occupancy data is determined for all blocks as in the current PCC codec. For the subsequent frames, a block-to-patch mapping is performed, and the occupancy data is determined only for blocks that were not signaled as temporally coherent, i.e. for blocks that change temporally. Fig. 4 illustrates the per-block signaling according to the first embodiment. Occupancy data consists of one "full" bit followed by a coded occupancy mask only in the case that the full bit was not set.
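A sketch of the first embodiment's per-block coherence flags for a group of frames is given below, reusing the hypothetical BlockMetadata structure from the earlier sketch. The entropy coding of the flags and of the metadata itself is omitted.

```python
def coherence_flags(gof_metadata):
    """gof_metadata: list of frames, each a list of BlockMetadata (one per block).
    Returns one bit per block: 1 if both its patch mapping and occupancy stay
    unchanged over the whole group of frames, else 0."""
    first = gof_metadata[0]
    flags = []
    for block_idx, ref in enumerate(first):
        coherent = all(frame[block_idx].is_coherent_with(ref)
                       for frame in gof_metadata[1:])
        flags.append(1 if coherent else 0)
    return flags

def blocks_to_stream(gof_metadata, flags):
    """Metadata for all blocks of the first frame; for subsequent frames only
    the blocks whose flag is 0, i.e. blocks signaled as needing an update."""
    updates = [list(gof_metadata[0])]
    for frame in gof_metadata[1:]:
        updates.append([blk for blk, f in zip(frame, flags) if f == 0])
    return updates
```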

In the first embodiment, the block-to-patch mapping and occupancy mapping may change independently. Thus, a block coherency signal may be added to the occupancy signal. For example, the signaling can comprise the following:

• 1 - both block allocation and occupancy are constant during group of frames (temporally coherent)

• 01 - block allocation is constant during group of frames, but occupancy changes

• 00 - both block allocation and occupancy change during group of frames (temporally non-coherent)

Fig. 5 shows an example of per-block signaling for the second embodiment. In the second embodiment, if the occupancy is coherent, 1 bit per occupancy block is signaled for a group of frames. If the occupancy is not coherent, the signal is followed by another 1-bit signal indicating block allocation coherence. For the first frame, block-to-patch mapping is performed, and occupancy data is determined for all blocks as in the current PCC codec. For the subsequent frames, block-to-patch mapping is performed for a block that was signaled as not coherent. For the subsequent frames, occupancy data is determined for blocks that have a non-zero block ID and were signaled as not coherent. Since the patch ID is necessarily more than one bit, the second embodiment enables greater savings than the first embodiment.
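The per-block signal of the second embodiment could be formed roughly as in the sketch below, again using the hypothetical BlockMetadata helper. The bit patterns follow the list above; the case where occupancy is coherent but block allocation is not is not covered by the listed patterns, so the sketch conservatively falls back to signaling a full update for it.

```python
def second_embodiment_signal(block_per_frame):
    """block_per_frame: the same block's BlockMetadata in every frame of the GoF.
    Returns '1', '01' or '00' according to the coherence of occupancy and
    block allocation."""
    ref = block_per_frame[0]
    occupancy_coherent = all(b.is_coherent_with(ref, check_patch=False)
                             for b in block_per_frame[1:])
    allocation_coherent = all(b.is_coherent_with(ref, check_occupancy=False)
                              for b in block_per_frame[1:])
    if occupancy_coherent and allocation_coherent:
        return "1"    # both block allocation and occupancy constant
    if allocation_coherent:
        return "01"   # allocation constant, occupancy changes
    return "00"       # treated as fully changing (full update signaled)
```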

The signaling according to the first and second embodiments can reduce the size of the occupancy data for a group of frames. Various embodiments prepare for the bigger groups of frames used in MPEG-I PCC work. With bigger groups of frames, the relative size of the occupancy data in the stream increases, and the present embodiments try to solve this.

It is to be noticed that the bit allocations in the embodiments are examples based on the assumption that there can be more coherent than non-coherent blocks. The meaning of the bits in the second embodiment, for example, can be changed based on data statistics so that the one-bit signal is allocated to the most common case. In a third embodiment, an additional signal per GoF is used to signal the meaning of different per-block bit patterns for a given GoF.

Figure 6 is a flowchart illustrating an encoding method according to an embodiment. According to this example, the method comprises processing 601 a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks; assigning and streaming 602 metadata for any block in the first frame of said group of frames; determining 603 which blocks in any subsequent frame need metadata update in said group of frames when compared to at least one previous frame; indicating 604 in a bitstream if a block needs metadata update; and streaming 605 metadata update for blocks in any subsequent frame only when the block has been indicated to need a metadata update.

An apparatus according to an embodiment comprises means for processing a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks; means for assigning and streaming metadata for any block in the first frame of said group of frames; means for determining which blocks in any subsequent frame need metadata update in said group of frames when compared to at least one previous frame; means for indicating in a bitstream if a block needs metadata update; and means for streaming metadata update for blocks in any subsequent frame only when the block has been indicated to need a metadata update. The means comprises a processor, a memory, and a computer program code residing in the memory, wherein the processor may further comprise processor circuitry.

Figure 7 is a flowchart illustrating a decoding method according to an embodiment. According to this example, the method comprises receiving 701 at least one bitstream including a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks; decoding 702 metadata for any block in a first frame of the group of frames; decoding 703 from a bitstream an indication if a block in any subsequent frame needs metadata update; decoding 704 metadata update for blocks in any subsequent frame of the group of frames, when the block has been indicated to need a metadata update; otherwise using the metadata in the first frame for any subsequent frames; and generating 705 a volumetric video stream from the first and subsequent frames.
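On the decoder side, the flow of Figure 7 can be sketched as follows. Bitstream parsing is abstracted behind hypothetical reader calls (read_coherence_flag, read_block_metadata); this is an illustration of the decoding order, not an actual decoder implementation.

```python
def decode_gof_block_metadata(reader, num_blocks, num_frames):
    """reader is assumed to expose read_coherence_flag() and read_block_metadata();
    returns per-frame block metadata reconstructed for the group of frames."""
    flags = [reader.read_coherence_flag() for _ in range(num_blocks)]

    # First frame: metadata is decoded for every block.
    first_frame = [reader.read_block_metadata() for _ in range(num_blocks)]
    frames = [first_frame]

    # Subsequent frames: decode updates only for blocks signaled as non-coherent,
    # otherwise reuse the metadata of the first frame.
    for _ in range(num_frames - 1):
        frame = []
        for block_idx in range(num_blocks):
            if flags[block_idx] == 0:
                frame.append(reader.read_block_metadata())
            else:
                frame.append(first_frame[block_idx])
        frames.append(frame)
    return frames
```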

An apparatus according to an embodiment comprises means for receiving at least one bitstream including a group of frames of a volumetric video stream, wherein the group of frames comprises pixels arranged into blocks; means for decoding metadata for any block in a first frame of the group of frames; means for decoding from the at least one bitstream an indication if a block in any subsequent frame needs metadata update; means for decoding metadata update for blocks in any subsequent frame of the group of frames, when the block has been indicated to need a metadata update; otherwise using the metadata in the first frame for any subsequent frames; and means for generating a volumetric video stream from the first and subsequent frames. The means comprises a processor, a memory, and a computer program code residing in the memory, wherein the processor may further comprise processor circuitry.

An apparatus according to an embodiment is disclosed with reference to Figures 8 and 9. Figure 8 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder. Figure 9 shows a layout of an apparatus according to an embodiment. The electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device. The electronic device 50 may be also comprised at a local or a remote server or a graphics processing unit of a computer. The device may be also comprised as part of a head-mounted display device. The apparatus 50 may comprise a housing 30 for incorporating and protecting the device. The apparatus 50 may further comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.

The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. The camera 42 may be a multi-lens camera system having at least two camera sensors. The camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.

The apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices. According to an embodiment, the apparatus may further comprise any suitable short-range communication solution such as for example a Bluetooth wireless connection or a USB (Universal Serial Bus)/firewire wired connection. The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller. The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.

The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es). The apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection. Such wired interface may be configured to operate according to one or more digital display interface standards, such as for example High-Definition Multimedia Interface (HDMI), Mobile High-definition Link (MHL), or Digital Visual Interface (DVI).

The various embodiments may provide advantages. For example, the embodiments provide better occupancy map compression for temporally coherent frames. In addition, when the group of frames increases for video streams, the occupancy map size increases relatively in the whole stream and this temporally coherent signaling reduces its size. Further, various embodiments enable the decoder to selectively decompress only the non-temporally-coherent blocks.

The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims. It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications, which may be made without departing from the scope of the present disclosure as defined in the appended claims.