Title:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR ENCODING AND DECODING DIGITAL VOLUMETRIC VIDEO
Document Type and Number:
WIPO Patent Application WO/2019/185983
Kind Code:
A1
Abstract:
A solution comprises determining an occupancy map having a first resolution, grouping pixels of the occupancy map into non-overlapping blocks; and generating a down-sampled occupancy map for the occupancy map by generating a representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block. When the down-sampled occupancy map (601, 602) is received, the representative values in the down-sampled occupancy map are upsampled (611, 612), wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining at least one representative value adjacent to the current representative value.

Inventors:
AFLAKI, Payman (Yrttikatu 11 B, Tampere, 33710, FI)
SCHWARZ, Sebastian (Vähäjärvenkatu 4, Tampere, 33900, FI)
Application Number:
FI2019/050219
Publication Date:
October 03, 2019
Filing Date:
March 14, 2019
Assignee:
NOKIA TECHNOLOGIES OY (Karakaari 7, Espoo, 02610, FI)
International Classes:
H04N19/59; G06T3/40; H04N1/411; H04N1/417; H04N13/161; H04N19/33; H04N19/593; H04N19/597
Foreign References:
EP0890921A2, 1999-01-13
US20050084016A1, 2005-04-21
US6510246B1, 2003-01-21
US20190087979A1, 2019-03-21
Other References:
3DG: "PCC Test Model Category 2 v0", 120. MPEG MEETING; (MOTION PICTURE EXPERT GROUP OR ISO/IEC JTC1/SC29/WG11) N17248, 15 December 2017 (2017-12-15), Macau, XP030023909, Retrieved from the Internet [retrieved on 20181016]
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (Ari Aarnio, IPR Department, Karakaari 7, Espoo, 02610, FI)
Claims:

1. A method, comprising:

- determining an occupancy map having a first resolution;

- grouping pixels of the occupancy map into non-overlapping blocks; and

- generating a down-sampled occupancy map for the occupancy map by generating a representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block.

2. A method comprising:

- receiving a down-sampled occupancy map, wherein at least one representative value of the down-sampled occupancy map is indicative of pixel occupancy of a block of pixels in an occupancy map; and

- up-sampling representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining at least one representative value adjacent to the current representative value.

3. An apparatus comprising:

- means for determining an occupancy map having a first resolution;

- means for grouping pixels of the occupancy map into non-overlapping blocks; and

- means for generating a down-sampled occupancy map for the occupancy map by generating one representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where the generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block.

4. The apparatus of claim 3, wherein the down-sampled occupancy map has a second resolution, and wherein the representative value comprises a binary value indicating whether a pixel in the down-sampled occupancy map is occupied or unoccupied.

5. The apparatus of claim 3 or 4, wherein examining pixel occupancy comprises determining a number of occupied pixels and/or locations of occupied pixels in the current block and the at least one adjacent block.

6. The apparatus according to claim 5, wherein if the number of occupied pixels in the current block is greater than the number of unoccupied pixels, the representative down-sampled pixel is determined to be occupied;

if the number of occupied pixels in the current block is smaller than the number of unoccupied pixels, the representative down-sampled pixel is determined to be unoccupied; and/or

if the number of occupied pixels in the current block equals the number of unoccupied pixels, the representative down-sampled pixel is determined according to one or more adjacent pixels of the at least one adjacent block in the occupancy map.

7. The apparatus according to any of claims 3 to 6, further comprising:

means for up-sampling representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining pixel occupancy associated with the current representative value and pixel occupancy associated with at least one representative value in the down-sampled occupancy map adjacent to the current representative value.

8. The apparatus according to claim 7, further comprising means to generate up-sampling guidance information to be signaled in a Chroma channel, the guidance information being based on an analysis of the occupancy map and the up-sampled occupancy map.

9. An apparatus comprising:

- means for receiving a down-sampled occupancy map, wherein at least one representative value of the down-sampled occupancy map is indicative of pixel occupancy of a block of pixels in an occupancy map; and

- means for up-sampling representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining at least one representative value adjacent to the current representative value.

10. The apparatus of claim 9, wherein the up-sampling comprises:

- means for determining two occupied and diagonally adjacent blocks in the down-sampled occupancy map;

- means for determining if the two other blocks adjacent to the occupied and diagonally adjacent blocks are un-occupied, and in response to determining that the two other blocks adjacent to the occupied and diagonally adjacent blocks are un-occupied;

- means for up-sampling the current representative values of each occupied and diagonally adjacent blocks in the down-sampled occupancy map to respective up-sampled blocks of pixels in the up-sampled occupancy map,

- means for determining two pixels adjacent to both up-sampled blocks of pixels in the up-sampled occupancy map; and

- means for assigning the two determined pixels to be occupied in the up-sampled occupancy map.

11. The apparatus of claim 10, wherein one or more pixels adjacent to either the two determined pixels or up-sampled blocks of pixels are assigned to be occupied in the up-sampled occupancy map.

12. The apparatus of claim 9, wherein up-sampling comprises:

- means for determining two occupied and diagonally aligned blocks in the down-sampled occupancy map;

- means for determining if the diagonal separation between said two occupied and diagonally aligned blocks is one un-occupied block in the down-sampled occupancy map, and in response to determining that the diagonal separation between said two occupied and diagonally aligned blocks is one un-occupied block in the down-sampled occupancy map;

- means for up-sampling the current representative values of each occupied and diagonally aligned blocks in the down-sampled occupancy map to respective up-sampled blocks of pixels in the up-sampled occupancy map;

- means for determining the least number of pixels which diagonally connect the two up-sampled blocks of pixels in the up-sampled occupancy map; and

- means for assigning the determined pixels to be occupied in the up-sampled occupancy map.

13. The apparatus of any of claims 9 to 12, wherein the means comprises

at least one processor; and at least one memory including computer program code, the at least one memory and computer program code configured to, with the at least one processor, cause the performance of the apparatus.

14. A computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:

- determine an occupancy map having a first resolution;

- group pixels of the occupancy map into non-overlapping blocks; and

- generate a down-sampled occupancy map for the occupancy map by generating a representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where the generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block.

15. A computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:

- receive a down-sampled occupancy map, wherein a representative value of the down-sampled occupancy map is indicative of pixel occupancy of a block of pixels in an occupancy map; and

- up-sample representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining pixel occupancy of at least one representative value adjacent to the current representative value.

Description:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR ENCODING AND DECODING DIGITAL VOLUMETRIC VIDEO

Technical Field

The present solution generally relates to virtual reality. In particular, the solution relates to encoding and decoding of digital volumetric video.

Background

Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with a relatively narrow field of view, and displayed as a rectangular scene on flat displays. Such content is referred to as “flat content”, “flat image”, or “flat video” in this application. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed).

More recently, new image and video capture devices have become available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” in the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.

For volumetric video, a scene may be captured using one or more 3D (three-dimensional) cameras. The cameras are in different positions and orientations within a scene. One issue to take into account is that compared to 2D (two-dimensional) video content, volumetric 3D video content has much more data, so viewing it requires lots of bandwidth (with or without transferring it from a storage location to a viewing device): disk I/O, network traffic, memory bandwidth, GPU (Graphics Processing Unit) upload. Capturing volumetric content also produces a lot of data, particularly when there are multiple capture devices used in parallel.

Summary

Now there has been invented an improved method, and technical equipment implementing the method, for down-sampling an occupancy map prior to encoding and up-sampling the occupancy map after reconstruction. Various aspects of the invention include a method, an apparatus, and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.

According to a first aspect, there is provided a method comprising determining an occupancy map having a first resolution; grouping pixels of the occupancy map into non-overlapping blocks; and generating a down-sampled occupancy map for the occupancy map by generating a representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block.

According to a second aspect, there is provided a method comprising receiving a down-sampled occupancy map, wherein at least one representative value of the down-sampled occupancy map is indicative of pixel occupancy of a block of pixels in an occupancy map; and up-sampling representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining at least one representative value adjacent to the current representative value.

According to a third aspect, there is provided an apparatus comprising means for determining an occupancy map having a first resolution; means for grouping pixels of the occupancy map into non-overlapping blocks; and means for generating a down-sampled occupancy map for the occupancy map by generating one representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where the generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block.

According to an embodiment, the down-sampled occupancy map has a second resolution, and wherein the representative value comprises a binary value indicating whether a pixel in the down-sampled occupancy map is occupied or unoccupied. According to an embodiment, examining pixel occupancy comprises determining a number of occupied pixels and/or locations of occupied pixels in the current block and the at least one adjacent block.

According to an embodiment, if the number of occupied pixels in the current block is greater than the number of unoccupied pixels, the representative down-sampled pixel is determined to be occupied; if the number of occupied pixels in the current block is smaller than the number of unoccupied pixels, the representative down-sampled pixel is determined to be unoccupied; and/or if the number of occupied pixels in the current block equals the number of unoccupied pixels, the representative down-sampled pixel is determined according to one or more adjacent pixels of the at least one adjacent block in the occupancy map.
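The majority rule of this embodiment can be illustrated with a minimal Python sketch. The function name `downsample_occupancy`, the block size parameter `b` and, in particular, the tie-break rule (a majority vote over the one-pixel ring of pixels surrounding the block) are assumptions made for illustration only; the embodiment merely states that a tie is resolved according to adjacent pixels of adjacent blocks.

```python
def downsample_occupancy(om, b=2):
    """Down-sample a binary occupancy map (list of rows of 0/1) by b x b blocks."""
    h, w = len(om), len(om[0])
    out = []
    for by in range(0, h, b):
        row = []
        for bx in range(0, w, b):
            # Pixels of the current block (clipped at the image border).
            block = [om[y][x]
                     for y in range(by, min(by + b, h))
                     for x in range(bx, min(bx + b, w))]
            occupied = sum(block)
            unoccupied = len(block) - occupied
            if occupied != unoccupied:
                row.append(1 if occupied > unoccupied else 0)
                continue
            # Tie: examine the one-pixel ring around the block, i.e. pixels
            # that belong to adjacent blocks (assumed tie-break rule).
            ring = [om[y][x]
                    for y in range(max(by - 1, 0), min(by + b + 1, h))
                    for x in range(max(bx - 1, 0), min(bx + b + 1, w))
                    if not (by <= y < by + b and bx <= x < bx + b)]
            row.append(1 if ring and 2 * sum(ring) >= len(ring) else 0)
        out.append(row)
    return out


om = [[1, 1, 0, 0],
      [1, 0, 0, 0],
      [0, 1, 1, 1],
      [0, 1, 1, 1]]
dom = downsample_occupancy(om)  # the lower-left block is a tie
```

In this example the lower-left 2x2 block holds two occupied and two unoccupied pixels, so its representative value is decided by the surrounding pixels rather than by the block alone.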

According to an embodiment, the apparatus further comprises means for up-sampling representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining pixel occupancy associated with the current representative value and pixel occupancy associated with at least one representative value in the down-sampled occupancy map adjacent to the current representative value.

According to an embodiment, the apparatus further comprises means to generate up-sampling guidance information to be signaled in a Chroma channel, the guidance information being based on an analysis of the occupancy map and the up-sampled occupancy map.

According to a fourth aspect, there is provided an apparatus comprising means for receiving a down-sampled occupancy map, wherein at least one representative value of the down-sampled occupancy map is indicative of pixel occupancy of a block of pixels in an occupancy map; and means for up-sampling representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining at least one representative value adjacent to the current representative value.

According to an embodiment, the up-sampling comprises means for determining two occupied and diagonally adjacent blocks in the down-sampled occupancy map; means for determining if the two other blocks adjacent to the occupied and diagonally adjacent blocks are un-occupied, and in response to determining that the two other blocks adjacent to the occupied and diagonally adjacent blocks are un-occupied; means for up-sampling the current representative values of each occupied and diagonally adjacent blocks in the down-sampled occupancy map to respective up-sampled blocks of pixels in the up-sampled occupancy map; means for determining two pixels adjacent to both up-sampled blocks of pixels in the up-sampled occupancy map; and means for assigning the two determined pixels to be occupied in the up-sampled occupancy map.

According to an embodiment, one or more pixels adjacent to either the two determined pixels or up-sampled blocks of pixels are assigned to be occupied in the up-sampled occupancy map.
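As an illustration of the diagonal-adjacency embodiment above, the following Python sketch first replicates each representative value into a b x b block and then, wherever two occupied blocks touch only at a corner while the other two blocks around that corner are unoccupied, marks the two pixels adjacent to both up-sampled blocks as occupied. The function name `upsample_occupancy` and the use of plain replication as the base up-sampling are assumptions for illustration.

```python
def upsample_occupancy(dom, b=2):
    """Up-sample a down-sampled binary occupancy map by a factor of b."""
    h, w = len(dom), len(dom[0])
    # Plain replication: every representative value becomes a b x b block.
    up = [[dom[y // b][x // b] for x in range(w * b)] for y in range(h * b)]
    for r in range(h - 1):
        for c in range(w - 1):
            tl, tr = dom[r][c], dom[r][c + 1]
            bl, br = dom[r + 1][c], dom[r + 1][c + 1]
            main_diag = tl and br and not tr and not bl
            anti_diag = tr and bl and not tl and not br
            if main_diag or anti_diag:
                # Four pixels surround the shared corner. Two of them already
                # belong to the occupied blocks; the other two are adjacent to
                # both blocks and are newly marked as occupied.
                y0, x0 = (r + 1) * b, (c + 1) * b
                for y in (y0 - 1, y0):
                    for x in (x0 - 1, x0):
                        up[y][x] = 1
    return up


up = upsample_occupancy([[1, 0],
                         [0, 1]])
```

With replication alone the two up-sampled blocks would meet only at a single corner point; the two added pixels give them an edge-connected bridge.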

According to an embodiment, the up-sampling comprises means for determining two occupied and diagonally aligned blocks in the down-sampled occupancy map; means for determining if the diagonal separation between said two occupied and diagonally aligned blocks is one un-occupied block in the down-sampled occupancy map, and in response to determining that the diagonal separation between said two occupied and diagonally aligned blocks is one un-occupied block in the down-sampled occupancy map; means for up-sampling the current representative values of each occupied and diagonally aligned blocks in the down-sampled occupancy map to respective up-sampled blocks of pixels in the up-sampled occupancy map; means for determining the least number of pixels which diagonally connect the two up-sampled blocks of pixels in the up-sampled occupancy map; and means for assigning the determined pixels to be occupied in the up-sampled occupancy map.

According to an embodiment, the means comprises at least one processor; and at least one memory including computer program code, the at least one memory and computer program code configured to, with the at least one processor, cause the performance of the apparatus.

According to a fifth aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to determine an occupancy map having a first resolution; group pixels of the occupancy map into non-overlapping blocks; and generate a down-sampled occupancy map for the occupancy map by generating a representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where the generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block.

According to a sixth aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive a down-sampled occupancy map, wherein a representative value of the down-sampled occupancy map is indicative of pixel occupancy of a block of pixels in an occupancy map; and up-sample representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining pixel occupancy of at least one representative value adjacent to the current representative value.

Description of the Drawings

In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows an example of a compression process;

Fig. 2 shows an example of a decompression process;

Fig. 3 shows an example of 3D to 2D projection patches;

Fig. 4 shows a table illustrating possible modes for down-sampling a 2x2 block of pixels;

Fig. 5 shows an example of pixel locations around respective block of 2x2;

Fig. 6 shows an example of up-sampling representative down-sampled pixels from a down-sampled occupancy map to create the up-sampled occupancy map;

Fig. 7 shows an example of adjacent pixels to the recently created dashed pixels;

Fig. 8 shows another example of adjacent pixels to the recently created dashed pixels;

Fig. 9 shows another example of adjacent pixels to the recently created dashed pixels;

Fig. 10 is a flowchart illustrating a method according to an embodiment;

Fig. 11 is a flowchart illustrating a method according to another embodiment;

Fig. 12 shows an example of an apparatus according to an embodiment, and

Fig. 13 shows an example of a layout of an apparatus according to an embodiment.

Description of Example Embodiments

In the following, several embodiments of the invention will be described in the context of volumetric video.

Volumetric video data represents a three-dimensional scene or object, and can be used as an input for augmented reality (AR), virtual reality (VR) and mixed reality (MR) applications. Such data describes geometry (shape, size, position in 3D-space) and respective attributes (e.g. colour, opacity, reflectance, ...), plus any possible temporal changes of the geometry and attributes at given time instances (e.g. frames in 2D video). Volumetric video is either generated from 3D models, i.e. CGI, or captured from real-world scenes using a variety of capture solutions, e.g. a multi-camera, a laser scan, a combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Typical representation formats for such volumetric data are triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” in 2D video, or other means, e.g. position of an object as a function of time.

Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR, or MR applications, especially for providing 6DOF viewing capabilities.

Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real-world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is coding this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps.

In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression becomes essential. Standard volumetric video representation formats, such as point clouds, meshes and voxels, suffer from poor temporal compression performance. Identifying correspondences for motion compensation in 3D space is an ill-defined problem, as both geometry and respective attributes may change. For example, temporally successive “frames” do not necessarily have the same number of meshes, points or voxels. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview with depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.

Instead of the above-mentioned approach, a 3D scene, represented as meshes, points, and/or voxels, can be projected onto one or more geometries. These geometries are “unfolded” onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. Relevant projection geometry information is transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).

Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression. Thus, coding efficiency is increased greatly. Using geometry-projections instead of prior-art 2D-video based approaches, i.e. multiview with depth, provides better coverage of the scene (or object). Thus, 6DOF capabilities are improved. Using several geometries for individual objects improves the coverage of the scene further. Furthermore, standard video encoding hardware can be utilized for real-time compression/decompression of the projected planes. The projection and reverse projection steps are of low complexity.

Figure 1 illustrates an overview of an example of a compression process. Such process may be applied for example in MPEG Point Cloud Coding (PCC).

The process starts with an input point cloud frame 101 that is provided for patch generation 102, geometry image generation 104 and texture image generation 105. The patch generation 102 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. For patch generation, the normal at every point can be estimated. An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:

- (1.0, 0.0, 0.0),

- (0.0, 1.0, 0.0),

- (0.0, 0.0, 1.0),

- (-1.0, 0.0, 0.0),

- (0.0, -1.0, 0.0), and

- (0.0, 0.0, -1.0)

More precisely, each point may be associated with the plane that has the closest normal (i.e. maximizes the dot product of the point normal and the plane normal).
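The initial clustering step can be sketched in a few lines of Python. The names `PLANE_NORMALS` and `closest_plane` are hypothetical, and the point normal is assumed to have been estimated already.

```python
# The six oriented planes, in the order listed above.
PLANE_NORMALS = [
    (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
    (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, -1.0),
]


def closest_plane(normal):
    """Return the index of the plane whose normal maximizes the dot product."""
    dots = [sum(n * p for n, p in zip(normal, plane)) for plane in PLANE_NORMALS]
    return max(range(len(PLANE_NORMALS)), key=dots.__getitem__)


cluster = closest_plane((0.1, 0.9, 0.2))  # mostly along +y, so index 1
```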

The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbors. The final step may comprise extracting patches by applying a connected component extraction procedure.

Patch info determined at patch generation 102 for the input point cloud frame 101 is delivered to packing process 103, to geometry image generation 104 and to texture image generation 105. The packing process 103 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g. 16x16) block of the grid is associated with a unique patch. It should be noticed that T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder.

PCC may use a simple packing strategy that iteratively tries to insert patches into a WxH grid. W and H are user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid is temporarily doubled, and the search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.

The geometry image generation 104 and the texture image generation 105 are configured to generate geometry images and texture images. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called the near layer, stores the point of H(u, v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+D], where D is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:

• Geometry: WxH YUV420-8bit,

• Texture: WxH YUV420-8bit.

It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
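The selection of the near and far layers for a single pixel, as described above, can be sketched as follows. The function name `near_far_layers` and the representation of the projected points as a plain list of depth values are assumptions for illustration.

```python
def near_far_layers(depths, thickness):
    """Pick the near- and far-layer depths among points projected to one pixel.

    depths    -- depths of all points of the patch that map to pixel (u, v)
    thickness -- the user-defined surface thickness
    """
    d0 = min(depths)  # near layer: lowest depth
    # Far layer: highest depth that still lies within [d0, d0 + thickness].
    far = max(d for d in depths if d <= d0 + thickness)
    return d0, far


layers = near_far_layers([7, 3, 5, 12], thickness=4)  # 12 lies outside [3, 7]
```

When only one point projects to the pixel, both layers hold the same depth.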

The geometry images and the texture images may be provided to image padding 107. The image padding 107 may also receive as an input an occupancy map (OM) 106 to be used with the geometry images and texture images. The occupancy map 106 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image in which occupied pixels and non-occupied pixels are distinguished from each other. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the down-sampled occupancy map (DOM) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process.

The padding process 107 aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g. 16x16) pixels is compressed independently. If the block is empty (i.e. unoccupied, i.e. all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e. occupied, i.e., no empty pixels), nothing is done. If the block has both empty and filled pixels, then the empty pixels are iteratively filled with the average value of their non-empty neighbors. The padded geometry images and padded texture images may be provided for video compression 108. The generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters. The video compression 108 also generates reconstructed geometry images to be provided for smoothing 109, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 102. The smoothed geometry may be provided to texture image generation 105 to adapt the texture images.
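The mixed-block case of the simple padding strategy described above (empty pixels iteratively filled with the average of their non-empty neighbors) can be sketched as below. The function name `pad_block`, the use of the 4-neighborhood and the restriction of neighbors to the same block are assumptions for illustration; the text does not specify them.

```python
def pad_block(values, occupied):
    """Fill the empty pixels of a block that has both empty and filled pixels."""
    n_rows, n_cols = len(values), len(values[0])
    vals = [[float(v) for v in row] for row in values]
    filled = [[bool(o) for o in row] for row in occupied]
    if not any(any(row) for row in filled):
        raise ValueError("empty block: copy from the previous block instead")
    while not all(all(row) for row in filled):
        updates = {}
        for y in range(n_rows):
            for x in range(n_cols):
                if filled[y][x]:
                    continue
                # Average over the already filled 4-neighbors, if any.
                nb = [vals[j][i]
                      for j, i in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                      if 0 <= j < n_rows and 0 <= i < n_cols and filled[j][i]]
                if nb:
                    updates[(y, x)] = sum(nb) / len(nb)
        # Apply all updates of this pass at once so the result does not
        # depend on the scan order within a pass.
        for (y, x), v in updates.items():
            vals[y][x] = v
            filled[y][x] = True
    return vals


padded = pad_block([[10, 0], [0, 20]], [[1, 0], [0, 1]])
```

Each pass fills at least the empty pixels bordering the filled region, so the loop terminates once the fill front has swept across the block.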

The patch may be associated with auxiliary information that is encoded/decoded for each patch as metadata. The auxiliary information may comprise the index of the projection plane, the 2D bounding box, and the 3D location of the patch represented in terms of depth δ0, tangential shift s0 and bitangential shift r0. According to the chosen projection plane, (δ0, s0, r0) may be computed as follows:

o Index 0: δ0 = x0, s0 = z0 and r0 = y0

o Index 1: δ0 = y0, s0 = z0 and r0 = x0

o Index 2: δ0 = z0, s0 = x0 and r0 = y0
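The three cases above amount to a small lookup from the projection plane index to the depth, tangential shift and bitangential shift of the patch. A minimal sketch, in which the function name `patch_location` and the tuple layout are assumptions made for illustration:

```python
def patch_location(index, x0, y0, z0):
    """Return (depth, tangential shift, bitangential shift) for a patch.

    index: projection plane index (0, 1 or 2).
    (x0, y0, z0): 3D location of the patch.
    """
    table = {
        0: (x0, z0, y0),  # depth along x
        1: (y0, z0, x0),  # depth along y
        2: (z0, x0, y0),  # depth along z
    }
    if index not in table:
        raise ValueError("projection plane index must be 0, 1 or 2")
    return table[index]


d0, s0, r0 = patch_location(1, x0=1, y0=2, z0=3)
```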

Also, mapping information providing, for each TxT block, its associated patch index may be encoded, for example, as follows:

• For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.

• The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.

• Let I be the index of the patch to which the current TxT block belongs, and let J be the position of I in L. Instead of explicitly encoding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.
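The candidate list construction can be illustrated with a short sketch. The arithmetic coder itself is omitted, and several details are assumptions made only for this example: bounding boxes are given in block units as inclusive (u0, v0, u1, v1) tuples, real patches are numbered from 1 in encoding order, and the special index 0 is appended at the end of every list.

```python
def candidate_patches(block, bounding_boxes):
    """Build the ordered candidate list L for one TxT block.

    block: (bu, bv) block coordinates.
    bounding_boxes: list of (u0, v0, u1, v1) in block units, inclusive,
    in the order used to encode them; patch k has index k + 1.
    """
    bu, bv = block
    candidates = [
        k + 1
        for k, (u0, v0, u1, v1) in enumerate(bounding_boxes)
        if u0 <= bu <= u1 and v0 <= bv <= v1
    ]
    candidates.append(0)  # empty space is treated as a patch with index 0
    return candidates


def position_in_list(patch_index, candidates):
    """Return J, the position of patch index I in the candidate list L."""
    return candidates.index(patch_index)


boxes = [(0, 0, 1, 1), (1, 1, 2, 2)]
L = candidate_patches((1, 1), boxes)
J = position_in_list(2, L)
```

Because most blocks are covered by only a few bounding boxes, J takes values from a much smaller range than I, which is what makes it cheaper to code.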

The occupancy map compression 110 leverages the auxiliary information described in the previous section in order to detect the empty TxT blocks (i.e. blocks with patch index 0). The remaining blocks may be encoded as follows: The occupancy map can be encoded with a precision of B0xB0 blocks, where B0 is a configurable parameter. In order to achieve lossless encoding, B0 may be set to 1. In practice, B0=2 or B0=4 gives visually acceptable results while significantly reducing the number of bits required to encode the occupancy map. The compression process may comprise one or more of the following example operations:

• Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block. A value of 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1 it is said to be full, otherwise it is an empty sub-block.

• If all the sub-blocks of a TxT block are full (i.e., have value 1), the block is said to be full. Otherwise, the block is said to be non-full.

• A binary information may be encoded for each TxT block to indicate whether it is full or not.

• If the block is non-full, an extra information indicating the location of the full/empty sub-blocks may be encoded as follows:

o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from top right or top left corner

o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.

o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy.

The binary value of the initial sub-block is encoded.

Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder.

The number of detected runs is encoded.

The length of each run, except for the last one, is also encoded.
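The operations listed above can be sketched for a single TxT block as follows. This is an illustration only: the dictionary returned by `encode_block`, the restriction to two traversal orders (0 = horizontal, 1 = vertical) and the rule of choosing the order with the fewest runs are assumptions of this sketch, and the entropy coding of the resulting symbols is left out.

```python
def subblock_flags(block, b0):
    """Return a grid of 0/1 flags, one per B0xB0 sub-block (1 = full)."""
    n = len(block) // b0
    return [
        [
            int(any(block[y][x]
                    for y in range(by * b0, (by + 1) * b0)
                    for x in range(bx * b0, (bx + 1) * b0)))
            for bx in range(n)
        ]
        for by in range(n)
    ]


def traverse(flags, order):
    """Flatten the flags horizontally (order 0) or vertically (order 1)."""
    n = len(flags)
    if order == 0:
        return [flags[y][x] for y in range(n) for x in range(n)]
    return [flags[y][x] for x in range(n) for y in range(n)]


def run_lengths(seq):
    """Return the lengths of consecutive runs of equal values."""
    runs = []
    previous = None
    for value in seq:
        if runs and value == previous:
            runs[-1] += 1
        else:
            runs.append(1)
        previous = value
    return runs


def encode_block(block, b0):
    """Describe one TxT block by the symbols listed in the text."""
    flags = subblock_flags(block, b0)
    if all(all(row) for row in flags):
        return {"full": True}
    # Assumed encoder choice: the traversal order producing the fewest runs.
    order = min((0, 1), key=lambda o: len(run_lengths(traverse(flags, o))))
    seq = traverse(flags, order)
    runs = run_lengths(seq)
    return {
        "full": False,
        "order": order,        # signalled traversal index
        "first": seq[0],       # binary value of the initial sub-block
        "num_runs": len(runs),
        "lengths": runs[:-1],  # the last run length is implied
    }


symbols = encode_block([[1, 1, 0, 0],
                        [1, 1, 0, 0],
                        [1, 0, 0, 0],
                        [0, 0, 0, 0]], b0=2)
```

The last run length can be omitted because the decoder knows the total number of sub-blocks in a TxT block.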

A multiplexer 112 may receive a compressed geometry video and a compressed texture video from the video compression 108, a compressed occupancy map from occupancy map compression 110 and optionally a compressed auxiliary patch information from auxiliary patch-info compression 111. The multiplexer 112 uses the received data to produce a compressed bitstream.

Figure 2 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 201 receives a compressed bitstream and, after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 202. In addition, the de-multiplexer 201 transmits the compressed occupancy map to occupancy map decompression 203. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 204. Decompressed geometry video from the video decompression 202 is delivered to geometry reconstruction 205, as are the decompressed occupancy map and decompressed auxiliary patch information. The point cloud geometry reconstruction 205 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images. For example, let P be the point associated with the pixel (u, v), let (δ0, s0, r0) be the 3D location of the patch to which it belongs and let (u0, v0, u1, v1) be its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:

δ(u, v) = δ0 + g(u, v)

s(u, v) = s0 - u0 + u

r(u, v) = r0 - v0 + v

where g(u, v) is the luma component of the geometry image.
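As a worked illustration of the three equations above, the following sketch reconstructs the depth, tangential shift and bi-tangential shift for one pixel and maps them back to (x, y, z) by inverting the per-index assignment given for the auxiliary patch information. The `Patch` container and the function name are hypothetical and exist only for this example.

```python
from typing import NamedTuple


class Patch(NamedTuple):
    """Auxiliary patch information (field names are assumptions)."""
    index: int  # projection plane index 0, 1 or 2
    d0: int     # depth of the patch
    s0: int     # tangential shift
    r0: int     # bitangential shift
    u0: int     # left edge of the 2D bounding box
    v0: int     # top edge of the 2D bounding box


def reconstruct_point(u, v, g, patch):
    """Return (x, y, z) for pixel (u, v) with geometry luma value g."""
    depth = patch.d0 + g
    s = patch.s0 - patch.u0 + u
    r = patch.r0 - patch.v0 + v
    if patch.index == 0:   # depth = x, s = z, r = y
        return (depth, r, s)
    if patch.index == 1:   # depth = y, s = z, r = x
        return (r, depth, s)
    if patch.index == 2:   # depth = z, s = x, r = y
        return (s, r, depth)
    raise ValueError("projection plane index must be 0, 1 or 2")


point = reconstruct_point(u=4, v=6, g=1,
                          patch=Patch(index=0, d0=10, s0=5, r0=7, u0=2, v0=3))
```

Here the depth is 10 + 1 = 11, the tangential shift is 5 - 2 + 4 = 7 and the bi-tangential shift is 7 - 3 + 6 = 10.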

The reconstructed geometry image may be provided for smoothing 206, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbors. The smoothed geometry may be transmitted to texture reconstruction 207, which also receives a decompressed texture video from video decompression 202. The texture reconstruction 207 outputs a reconstructed point cloud. The texture values for the texture reconstruction are directly read from the texture images.

Coding of occupancy information can be performed with the geometry image. A specific depth value, e.g. 0, or a specific depth value range may be reserved to indicate that a pixel is inpainted and not present in the source material. The specific depth value or the specific depth value range may be pre-defined, for example in a standard, or the specific depth value or the specific depth value range may be encoded into or along the bitstream and/or may be decoded from or along the bitstream. This way of multiplexing the occupancy information in the depth sample array creates sharp edges into the images, which may be subject to additional bitrate as well as compression artefacts around the sharp edges.

One way to compress a time-varying volumetric scene/object is to project 3D surfaces onto some number of pre-defined 2D planes. Regular 2D video compression algorithms can then be used to compress various aspects of the projected surfaces. For example, a time-varying 3D point cloud with spatial and texture coordinates can be mapped into a sequence of at least two sets of planes, where one of the two sets carries the texture data and the other carries the distance of the mapped 3D surface points from the projection planes.

Typically, 2D projections of 3D data will have an arbitrary shape, i.e. no block-alignment. Figure 3 illustrates the outcome of such projection step, from 3D projection to 2D projection. To increase coding efficiency, the boundaries of the projections are blurred/padded to avoid high frequency content.

For accurate 2D to 3D reconstruction at the receiving side, the decoder must be aware which 2D points are “valid” and which points stem from interpolation/padding. This requires the transmission of additional data. The additional data may be encapsulated in the geometry image as a pre-defined depth value (e.g. 0) or a pre-defined range of depth values. This will increase the coding efficiency only on the texture image, since the geometry image is not blurred/padded. Furthermore, encoding artefacts at the object boundaries of the geometry image may create severe artefacts, which require post-processing and may not be concealable.

The additional data may also be sent separately as an “occupancy map”. Such an occupancy map is costly to transmit. To minimize the cost, it may only be transmitted every 4 frames (every I-frame in an IPPP coding structure). Still, it may require 8-18% of the overall bit rate budget. In addition, 3D resampling and additional motion information are required to align the I-frame occupancy map to the 3 P-frames without a transmitted occupancy map. The coding and decoding of the occupancy map information and the 3D motion information also require significant computational, memory, and memory access resources. Moreover, the occupancy map information uses a codec different from the video codec used for texture and geometry images. Consequently, it is unlikely that such a dedicated occupancy map codec would be hardware-accelerated.

A projection-based compression of a volumetric video data can comprise presenting and encoding different parts of the same object as 2D projections.

In MPEG, a projection-based approach was established as a test model for standardization of dynamic point cloud compression. In such a test model, projected texture and geometry data is accompanied by a binary “occupancy map” to establish which projected points are valid or not. This occupancy map can be interpreted as an individual encoding of the “inpainting mask”. The texture and depth images can be blurred/inpainted to increase coding efficiency. The occupancy map is used to signal whether a 2D pixel should be reconstructed to 3D space at the decoder side. This individual encoding brings several drawbacks, as discussed above. As transmitting the occupancy map is costly, it is only transmitted every four frames (IPPP coding structure). Thus, 3D resampling and additional motion information are required to align the I-frame occupancy map to the three P-frames without a transmitted occupancy map.

In the present solution, a new algorithm to down-sample and up-sample the occupancy map (OM) is disclosed. The algorithm is adaptive and non-linear: it takes into account the structure of the OM and tries to preserve the OM shape while reducing the bitrate required to encode it. The down-sampling step aims to remove pixels where possible, and the up-sampling step aims to re-introduce them. The bitrate reduction results from the reduced number of pixels to be encoded in the down-sampled OM.

In the following description of the present embodiments, the occupancy maps are referred to as follows:

- original occupancy map (OOM): the occupancy map that should be down-sampled and has the original resolution, i.e. per-pixel occupancy information is available

- down-sampled occupancy map (DOM): with ratio ½, ¼, etc., i.e. occupancy information is only available for a group of n pixels, where n is defined based on the down-sampling factor. For example, if the down-sampling ratio is ½ x ½ then n is equal to 4.

- up-sampled occupancy map (UOM): an occupancy map created by up-sampling the DOM to the resolution of OOM, recreating per-pixel occupancy information. In this case, if the down-sampling ratio is ½ x ½ then for each pixel in the DOM 4 pixels in the UOM are created.

The down-sampling takes into account the pixel values of the adjacent blocks and is not necessarily limited to the pixel values of each block separately. It means that pixel values from different blocks may be used simultaneously in the down-sampling process. This proposal is shown on a block of TxT pixels where T=2. Similar idea may also be applied to blocks which are larger, e.g. T=4, or T=8. The idea is to shrink the occupancy map whenever possible to reduce the number of processed pixels and hence, reduce the bitrate, memory buffer size, memory access operations, and other related factors.

It should be noted that the blocks may have a rectangular shape where a block size of T1xT2 is considered. Moreover, the size and/or shape of the block may vary in the process. For example, in one part of the OOM the blocks may be square with size TxT, in another part of the OOM the blocks may be rectangular with size T1xT2, and in another part of the OOM the blocks may be rectangular with size T3xT4.

The present solution provides steps of down-sampling and up-sampling which are described in the following:

Down-sampling:

In this step, an adaptive non-linear down-sampling is proposed. The pixels of the occupancy map are grouped into non-overlapping blocks having a size TxT, where T=2 pixels or any other number of pixels. Therefore, several different modes may be present as shown in the Table of Figure 4. The left column 401 presents the mode assigned number. The middle column 402 shows which pixels are occupied and which pixels are unoccupied (i.e. empty) in the OOM. The right column 403 shows how the block of 2x2 is presented in the DOM by showing how the representative down-sampled pixel will be in the DOM. In the Table of Figure 4, the solid black squares 413 refer to occupied pixels while the white squares 412 refer to unoccupied (i.e. empty) pixels.

For modes 1 to 5, where one (or the minority) of the pixels is occupied, the representative down-sampled pixel (RDP) in the DOM will be an unoccupied pixel. For modes 6 to 10, where all or the majority of the pixels are occupied, the RDP will be an occupied pixel. For modes 11 and 12, the RDP will be unoccupied unless both of the RDPs along the diagonal direction of the occupied pixels in the block of TxT (in the OOM) are occupied in the DOM. If the adjacent RDPs are not known yet, this block of TxT in the OOM will be marked to be assessed later. This is aligned with the general approach, i.e. shrinking the OM in the down-sampling process to reduce the number of occupied RDPs, targeting better compression of the OM. For modes 13 to 16, the RDP may be occupied or unoccupied according to some criteria. For example, determining pixel occupancy for a pixel of a DOM may comprise determining a number of occupied pixels and/or locations of occupied pixels in a first block of the OOM and one or more pixels of at least one block adjacent to the first block. Embodiments of the decision-making process for mode 13 are described below. The rest of the modes (14-16) will be processed similarly.
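The mode-dependent rule for 2x2 blocks can be sketched as a two-pass procedure. Since the Table of Figure 4 is not reproduced here, the grouping used below is inferred from the text and should be read as an assumption: zero or one occupied pixel corresponds to modes 1 to 5, three or four to modes 6 to 10, two diagonal pixels to modes 11 and 12, and two side-by-side pixels to modes 13 to 16. In this sketch, blocks of the last group are left as `None` for a separate decision, and a diagonal block whose neighbouring RDPs are still undecided after the first pass is treated as unoccupied, which is a simplification of the "assess later" marking.

```python
def classify_block(tl, tr, bl, br):
    """Classify a 2x2 block given its four pixels (0/1)."""
    count = tl + tr + bl + br
    if count <= 1:
        return "unoccupied"   # minority occupied
    if count >= 3:
        return "occupied"     # majority occupied
    if (tl and br) or (tr and bl):
        return "diagonal"     # two diagonally placed pixels
    return "pair"             # two side-by-side pixels


def downsample(oom):
    """Return a DOM with 0, 1, or None (undecided 'pair' blocks)."""
    h, w = len(oom) // 2, len(oom[0]) // 2
    dom = [[None] * w for _ in range(h)]
    diagonal_blocks = []
    for by in range(h):
        for bx in range(w):
            tl, tr = oom[2 * by][2 * bx], oom[2 * by][2 * bx + 1]
            bl, br = oom[2 * by + 1][2 * bx], oom[2 * by + 1][2 * bx + 1]
            kind = classify_block(tl, tr, bl, br)
            if kind == "unoccupied":
                dom[by][bx] = 0
            elif kind == "occupied":
                dom[by][bx] = 1
            elif kind == "diagonal":
                diagonal_blocks.append((by, bx, bool(tl)))
    first_pass = [row[:] for row in dom]  # decisions known after pass one
    for by, bx, main_diagonal in diagonal_blocks:
        if main_diagonal:
            neighbours = ((by - 1, bx - 1), (by + 1, bx + 1))
        else:
            neighbours = ((by - 1, bx + 1), (by + 1, bx - 1))
        dom[by][bx] = int(all(
            0 <= y < h and 0 <= x < w and first_pass[y][x] == 1
            for y, x in neighbours
        ))
    return dom


oom = [[1, 1, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 1, 1]]
dom = downsample(oom)
```

In the example the central block contains two diagonal pixels and both RDPs along that diagonal are occupied, so its RDP is kept as occupied.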

Mode 13:

1. If adjacent pixels in the OOM are not available (around edges in the OM) they may be determined to be unoccupied in the following process.

2. RDP may be determined to be unoccupied if both left and right RDPs are unoccupied. (The RDPs along the direction of the two occupied pixels) (RDPs representing the blocks containing pixels e and f in Figure 5.)

3. RDP may be determined to be unoccupied if at least two adjacent pixels (in the OOM) to either of the occupied pixels in Mode 13 are unoccupied (two of a, b and e are unoccupied, or two of c, d and f in Figure 5 are unoccupied). In another, alternative embodiment, the condition can be that any two of the adjacent pixels (two of a, b, c, d, e, f in Figure 5) in the OOM are unoccupied.

4. RDP may be determined to be occupied if all top, left and right RDPs are occupied.

5. RDP may be determined to be occupied if e, b, c, and f are occupied and at least one of a or d is also occupied.

6. RDP may be determined to be unoccupied otherwise.
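Steps 1 to 6 can be read as an ordered list of tests in which the first matching rule decides the RDP. The sketch below follows that reading, which is itself an assumption (the description notes that the order may change). The layout of pixels a to f is defined by Figure 5, which is not reproduced here, so the function simply takes their occupancy as boolean arguments; unavailable pixels or RDPs are passed as `False` in line with step 1. The function name is hypothetical.

```python
def mode13_rdp(a, b, c, d, e, f, left_rdp, right_rdp, top_rdp):
    """Decide the RDP of a mode 13 block; True means occupied.

    a, b, e: pixels adjacent to the first occupied pixel of the block.
    c, d, f: pixels adjacent to the second occupied pixel of the block.
    left_rdp, right_rdp, top_rdp: occupancy of the neighbouring RDPs.
    """
    # Step 2: both RDPs along the direction of the occupied pair are empty.
    if not left_rdp and not right_rdp:
        return False
    # Step 3: at least two empty neighbours around either occupied pixel.
    if sum(1 for p in (a, b, e) if not p) >= 2:
        return False
    if sum(1 for p in (c, d, f) if not p) >= 2:
        return False
    # Step 4: top, left and right RDPs are all occupied.
    if top_rdp and left_rdp and right_rdp:
        return True
    # Step 5: e, b, c, f occupied and at least one of a or d occupied.
    if e and b and c and f and (a or d):
        return True
    # Step 6: unoccupied otherwise.
    return False


decision = mode13_rdp(a=True, b=True, c=True, d=False, e=True, f=True,
                      left_rdp=True, right_rdp=True, top_rdp=False)
```

Making the rules stricter or looser at steps 2 to 5 changes how strongly the occupancy map shrinks.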

In another embodiment, the order of steps (1 to 6, as mentioned above) introduced for down-sampling process of mode 13 may change.

In another embodiment, different similar criteria for steps 2, 3, 4, and/or 5 may be defined. The criteria may target less shrinking or more shrinking of the OM.

In another embodiment, when the down-sampling is applied on the OM, the DOM may be up-sampled at the encoder side according to the algorithm which is to be used at the decoder side to create the UOM. The texture and geometry images may then be aligned with the UOM to make sure that no extra information is encoded and transmitted to the decoder side. Considering that the current down-sampling algorithm favours shrinking the OM, the number of occupied pixels in the texture and geometry images should also decrease and hence, fewer occupied pixels need to be encoded.

In another embodiment, the UOM created at the encoder side is compared with the OOM. If any border misalignment larger than a specific threshold th (e.g. th = 4) is noticed, the down-sampling process for all blocks involved in that misalignment is performed again so that the misalignment between the UOM and the OOM becomes equal to or smaller than th. This may be achieved by using another down-sampling algorithm targeting less shrinking.

The down-sampled OM may be encoded and the respective bitstream may be transmitted to the decoder side. In this case, the DOM will be decoded and up-sampled to a UOM.

Up-sampling:

In the step of up-sampling, an adaptive non-linear up-sampling method may be applied.

Considering that the down-sampling process favours shrinking the occupancy map, the up-sampling process tries to compensate for this. The process may comprise one or more of the following operations: If any two diagonally adjacent and occupied RDPs are to be up-sampled (current RDPs), and the other two RDPs which are adjacent to both of the current RDPs are unoccupied, then one pixel on each side which is adjacent to both up-sampled RDPs will be considered occupied in the UOM. This is shown in Figure 6, where a block of 3x3 in the DOM is up-sampled to a 6x6 block in the UOM. In the UOM, the blocks of dark black squares 611 are the direct up-sampling of RDPs 601 from the DOM, while the dashed pixels 613 in the UOM are added targeting a better and smoother visual representation. Similarly, unoccupied RDPs 602 in the DOM are up-sampled to blocks of white squares 612.

For example, it may be determined that there are two occupied and diagonally adjacent blocks in the down-sampled occupancy map. It may be also determined whether the two other blocks adjacent to the occupied and diagonally adjacent blocks are un-occupied, and in response, current representative values of each occupied and diagonally adjacent blocks in the down-sampled occupancy map may be up-sampled to respective up-sampled blocks of pixels in the up-sampled occupancy map. It may be further determined that there are two pixels adjacent to both up-sampled blocks of pixels in the up-sampled occupancy map. The two determined pixels may then be assigned to be occupied in the up-sampled occupancy map.
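A minimal sketch of this up-sampling operation for a ½ x ½ ratio is given below. Each RDP is first replicated into a 2x2 block, and then every 2x2 neighbourhood of the DOM is checked for two occupied, diagonally adjacent RDPs whose other two neighbours are unoccupied; in that case the two pixels touching both up-sampled blocks are set. The function name and the list-of-rows representation are assumptions of the sketch.

```python
def upsample(dom):
    """Up-sample a binary DOM by 2 in each direction with diagonal filling."""
    h, w = len(dom), len(dom[0])
    uom = [[0] * (2 * w) for _ in range(2 * h)]
    # Direct up-sampling: every occupied RDP becomes a 2x2 occupied block.
    for i in range(h):
        for j in range(w):
            if dom[i][j]:
                for dy in (0, 1):
                    for dx in (0, 1):
                        uom[2 * i + dy][2 * j + dx] = 1
    # Add the two pixels adjacent to both blocks of a diagonal pair.
    for i in range(h - 1):
        for j in range(w - 1):
            tl, tr = dom[i][j], dom[i][j + 1]
            bl, br = dom[i + 1][j], dom[i + 1][j + 1]
            if tl and br and not tr and not bl:
                uom[2 * i + 1][2 * j + 2] = 1
                uom[2 * i + 2][2 * j + 1] = 1
            elif tr and bl and not tl and not br:
                uom[2 * i + 1][2 * j + 1] = 1
                uom[2 * i + 2][2 * j + 2] = 1
    return uom


uom = upsample([[1, 0],
                [0, 1]])
```

For the 2x2 example, the two added pixels join the corners of the up-sampled blocks so that the diagonal structure is no longer connected only through a single corner.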

In a further embodiment, one or more pixels adjacent to either the two determined pixels or up-sampled blocks of pixels may be assigned to be occupied in the up-sampled occupancy map.

According to another embodiment, it may be determined that there are two occupied and diagonally aligned blocks in the down-sampled occupancy map. It may be further determined whether the diagonal separation between said two occupied and diagonally aligned blocks is one un-occupied block in the down-sampled occupancy map, and in response, the current representative values of each occupied and diagonally aligned blocks in the down-sampled occupancy map may be up-sampled to respective up-sampled blocks of pixels in the up-sampled occupancy map. It may be further determined that there are at least a predetermined number of pixels which diagonally connect the two up-sampled blocks of pixels in the up-sampled occupancy map. These pixels may be assigned to be occupied in the up-sampled occupancy map. Diagonally connected pixels may be defined as diagonally aligned pixels that have no un-occupied pixels, such as pixel 911, between the two diagonally connected pixels.

Considering that new occupied pixels are introduced in this process, respective texture and geometry pixels should be introduced too. The respective geometry pixel values may be decided to have an average value of the geometry values from adjacent pixels. Considering Figure 7, there are five pixels (m, n, o, p, q) adjacent to any added dashed pixel, and any combination of those may be used for the estimation of the value of the dashed pixel, e.g. only the geometry value of pixel o may be considered, or the values of n and two further pixels, or the geometry values of all five pixels m, n, o, p, q. A similar process may be applied for the texture pixel values. Alternatively, for the texture pixel values, a weighted average based on the geometry values may be used. In this case, the pixels that are closer in 3D space may have a heavier weight in the weighted average calculation than the pixels that are farther away in 3D space. Such weights will be used in the value calculation of the dashed pixel in the UOM.
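The estimation of values for a newly introduced pixel can be illustrated as follows. The plain average follows the description directly; the weighting formula in `estimate_texture`, which uses the inverse of one plus the depth difference as a stand-in for closeness in 3D space, is purely an assumption of this sketch, since the description does not fix a particular weight function.

```python
def estimate_geometry(neighbour_depths):
    """Average the geometry values of the chosen adjacent pixels."""
    return sum(neighbour_depths) / len(neighbour_depths)


def estimate_texture(neighbour_depths, neighbour_textures, target_depth):
    """Weighted average of texture values; closer depths weigh more.

    The weight 1 / (1 + |depth difference|) is a hypothetical choice.
    """
    weights = [1.0 / (1.0 + abs(d - target_depth)) for d in neighbour_depths]
    total = sum(weights)
    return sum(wt * t for wt, t in zip(weights, neighbour_textures)) / total


depth = estimate_geometry([10, 12, 14])
texture_equal = estimate_texture([12, 12], [100, 200], depth)
texture_skewed = estimate_texture([12, 22], [100, 200], depth)
```

With equal depths the texture estimate is the plain mean, while in the second call the neighbour that is far away in depth contributes much less.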

In another embodiment, more than one pixel (e.g. three pixels) adjacent to the occupied pixels in the UOM may be added as dashed pixels. This is shown in Figure 8, where the pixels 812, 814 adjacent to the dashed pixel 813 are also considered as occupied in the up-sampling process.

In another embodiment, if two diagonally aligned pixels 910 in the DOM are diagonally separated by one pixel 911, then in the UOM, the diagonally aligned single pixels 921 between the up-sampled pixels may be determined to be occupied too. This is shown in Figure 9.
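The embodiment with one separating pixel can be sketched for a ½ x ½ ratio as below. Figure 9 is not reproduced here, so the exact pixels that are filled are an assumption: the sketch sets the two pixels lying on the diagonal line between the facing corners of the two up-sampled blocks. The function name is hypothetical.

```python
def upsample_with_bridges(dom):
    """Replicate each RDP into 2x2 pixels and bridge separated diagonals."""
    h, w = len(dom), len(dom[0])
    uom = [[0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            if dom[i][j]:
                for dy in (0, 1):
                    for dx in (0, 1):
                        uom[2 * i + dy][2 * j + dx] = 1
    for i in range(h - 2):
        for j in range(w - 2):
            if dom[i + 1][j + 1]:
                continue  # the separating RDP must be unoccupied
            # Occupied RDPs at top-left and bottom-right of the 3x3 window.
            if dom[i][j] and dom[i + 2][j + 2]:
                uom[2 * i + 2][2 * j + 2] = 1
                uom[2 * i + 3][2 * j + 3] = 1
            # Occupied RDPs at top-right and bottom-left of the 3x3 window.
            if dom[i][j + 2] and dom[i + 2][j]:
                uom[2 * i + 2][2 * j + 3] = 1
                uom[2 * i + 3][2 * j + 2] = 1
    return uom


bridged = upsample_with_bridges([[1, 0, 0],
                                 [0, 0, 0],
                                 [0, 0, 1]])
```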

If the occupancy map resolution is decreased, the respective resolution of the geometry and texture images will be decreased too. If the occupancy map resolution is increased, the respective geometry and texture image resolution will be interpolated based on the adjacent pixels. The proposed non-linear up-sampling does not have to be used in conjunction with the proposed down-sampling approach. Any other linear, or non-linear down-sampling approach may also benefit from this solution.

In another embodiment, an encoder performs the OM down-sampling and the selected OM up-sampling, thus creating the UOM as it would be recreated at the decoder. Any differences between OOM and UOM can now be signaled to the decoder for a perfect OM reconstruction. Such information can be signaled in run-length coding. With the expected large number of zeros for this approach, the bitrate overhead should be rather small. Other ways of signaling this information can be envisioned, e.g. signaling it in an unused chroma channel of the geometry picture. Rate-distortion-optimization (RDO) at the encoder can be utilized to determine if this additional information shall be signaled or not.
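One way to picture this signalling is to take the pixel-wise difference between the OOM and the encoder-side UOM and to run-length code it in raster order. The sketch below makes several assumptions for illustration: the difference is an XOR, the run list always starts with a (possibly zero-length) run of zeros, and the entropy coding of the run lengths is omitted.

```python
def residual_runs(oom, uom):
    """Run lengths of the XOR difference, alternating zeros and ones."""
    diff = [a ^ b for row_o, row_u in zip(oom, uom) for a, b in zip(row_o, row_u)]
    runs = []
    current, length = 0, 0
    for bit in diff:
        if bit == current:
            length += 1
        else:
            runs.append(length)
            current, length = bit, 1
    runs.append(length)
    return runs


def apply_residual(uom, runs):
    """Flip the pixels of the UOM covered by runs of ones."""
    width = len(uom[0])
    flat = [p for row in uom for p in row]
    pos, bit = 0, 0
    for length in runs:
        if bit:
            for k in range(pos, pos + length):
                flat[k] ^= 1
        pos += length
        bit ^= 1
    return [flat[r * width:(r + 1) * width] for r in range(len(uom))]


oom = [[1, 1], [0, 1]]
uom = [[1, 0], [0, 1]]
runs = residual_runs(oom, uom)
restored = apply_residual(uom, runs)
```

When the UOM is already close to the OOM, the difference consists mostly of long zero runs, which is why the overhead is expected to be small.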

According to yet another embodiment, one of the unused chroma channels of the geometry picture is used to signal up-sampling guidance information to the decoder. This information is based on the analysis of the OOM and the reconstructed UOM during the encoding. As a chroma channel of the geometry picture has the same resolution as the DOM, each chroma channel pixel can give simple up-sampling instructions, i.e. whether modes 11-16 shall be used for up-sampling from an unoccupied or occupied DOM pixel. This approach may allow for more accurate up-sampling at the cost of a slight bitrate increase. The requirements on buffer size may remain the same. Again, RDO may be used to determine the optimal signaling structure.

Figure 10 is a flowchart illustrating a method according to an embodiment. A method comprises determining 1011 an occupancy map having a first resolution; grouping 1012 pixels of the occupancy map into non-overlapping blocks; and generating 1013 a down-sampled occupancy map for the occupancy map by generating a representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block.

An apparatus according to an embodiment comprises means for determining an occupancy map having a first resolution; means for grouping pixels of the occupancy map into non-overlapping blocks; and means for generating a down-sampled occupancy map for the occupancy map by generating a representative value in the down-sampled occupancy map for at least one of the non-overlapping blocks, where generating the representative value for a current block comprises examining pixel occupancy in the current block and at least one block adjacent to the current block. The means comprises a processor, a memory, and a computer program code residing in the memory. The processor may further comprise a processing circuit.

Figure 11 is a flowchart illustrating a method according to another embodiment. A method comprises receiving 1111 a down-sampled occupancy map, wherein at least one representative value of the down-sampled occupancy map is indicative of pixel occupancy of a block of pixels in an occupancy map; and up-sampling 1112 representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining at least one representative value adjacent to the current representative value.

An apparatus according to an embodiment comprises means for receiving a down-sampled occupancy map, wherein at least one representative value of the down-sampled occupancy map is indicative of pixel occupancy of a block of pixels in an occupancy map; and means for up-sampling representative values in the down-sampled occupancy map, wherein up-sampling of a current representative value in the down-sampled occupancy map comprises examining at least one representative value adjacent to the current representative value. The means comprises a processor, a memory, and a computer program code residing in the memory. The processor may further comprise a processing circuit.

An apparatus according to an embodiment is disclosed with reference to Figures 12 and 13. Figure 12 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder. Figure 13 shows a layout of an apparatus according to an embodiment. The electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device. The electronic device 50 may be also comprised at a local or a remote server or a graphics processing unit of a computer. The device may be also comprised as part of a head-mounted display device. The apparatus 50 may comprise a housing 30 for incorporating and protecting the device. The apparatus 50 may further comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.

The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. The camera 42 may be a multi-lens camera system having at least two camera sensors. The camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.

The apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices. According to an embodiment, the apparatus may further comprise any suitable short-range communication solution such as for example a Bluetooth wireless connection or a USB (Universal Serial Bus)/firewire wired connection. The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.

The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.

The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es). The apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection. Such wired interface may be configured to operate according to one or more digital display interface standards, such as for example High-Definition Multimedia Interface (HDMI), Mobile High-definition Link (MHL), or Digital Visual Interface (DVI).

The various embodiments of the present solution may provide advantages. For example, the whole down-/up-sampling process reduces the bitrate that needs to be transmitted, due to the reduced number of pixels to be encoded in the OM. In addition, the present embodiments may decrease the compression complexity, memory usage and buffer allocation. The up-sampling process of the OM is a normative process in the (de)coding process, as any inter-prediction between geometry and texture images requires clarifying which pixels are valid in the geometry map and 2D grid. Such information is obtained from the up-sampled OM and hence, the up-sampling process on the OM is considered to be normative.

The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.