Title:
METHODS AND APPARATUS FOR DEPTH ENCODING AND DECODING
Document Type and Number:
WIPO Patent Application WO/2020/072842
Kind Code:
A1
Abstract:
The present application relates to a method of encoding a floating-point depth value of a pixel of a picture according to a bit-depth. The pixel is obtained by projecting a point of a 3D scene onto said picture. If the point is a non-contour pixel, then, the quantization function is defined on a first range depending on a first value lower than the bit-depth. Otherwise, the quantization function is defined on a second range depending on a second value greater than the bit-depth; and the value is encoded in association with the first value and an offset value associated with the picture.

Inventors:
DORE RENAUD (FR)
FLEUREAU JULIEN (FR)
THUDOR FRANCK (FR)
Application Number:
PCT/US2019/054597
Publication Date:
April 09, 2020
Filing Date:
October 04, 2019
Assignee:
INTERDIGITAL VC HOLDINGS INC (US)
International Classes:
H04N19/124; H04N19/14; H04N19/182; H04N19/597; H04N19/98
Domestic Patent References:
WO2019199714A1 (2019-10-17)
WO2018099571A1 (2018-06-07)
WO1999037096A1 (1999-07-22)
Foreign References:
EP2424242A1 (2012-02-29)
EP2410747A1 (2012-01-25)
EP2076048A2 (2009-07-01)
Other References:
KANG JINMI ET AL: "High-performance depth map coding for 3D-AVC", SIGNAL, IMAGE AND VIDEO PROCESSING, SPRINGER LONDON, LONDON, vol. 10, no. 6, 7 December 2015 (2015-12-07), pages 1017 - 1024, XP036021242, ISSN: 1863-1703, [retrieved on 20151207], DOI: 10.1007/S11760-015-0853-6
Attorney, Agent or Firm:
DORINI, Brian J. et al. (US)
Claims:
CLAIMS

1. A method of encoding a floating-point depth value of a pixel of a picture according to a bit-depth, said pixel being obtained by projecting a point of a 3D scene onto said picture, the method comprising:

- if said pixel is a non-contour pixel, then, encoding said floating-point depth value by using a quantization function defined on a first range depending on a first value lower than said bit-depth;

- otherwise, encoding said floating-point depth value by using:

• said quantization function defined on a second range depending on a second value greater than said bit-depth;

• said first value; and

• an offset value associated with the picture, said offset value being determined according to said first value and an interval of values comprising every floating-point depth value of pixels of the picture quantized by said quantization function defined on said second range.

2. A device for encoding a floating-point depth value of a pixel of a picture according to a bit-depth, said pixel being obtained by projecting a point of a 3D scene onto said picture, the device comprising a processor configured for:

- testing if said pixel is a non-contour pixel, and if so, encoding said floating-point depth value by using a quantization function defined on a first range depending on a first value lower than said bit-depth;

- otherwise, encoding said floating-point depth value by using:

• said quantization function defined on a second range depending on a second value greater than said bit-depth;

· said first value; and

• an offset value associated with the picture, said offset value being determined according to said first value and an interval of values comprising every floating-point depth value of pixels of the picture quantized by said quantization function defined on said second range.

3. A method of decoding an integer depth value of a pixel of a picture, the integer depth value being encoded according to a bit-depth, the picture being associated with an offset value, the method comprising:

- if said integer depth value belongs to a first range depending on a first value lower than said bit-depth, decoding said integer depth value of the pixel by using a de-quantization function defined on said first range;

- otherwise, decoding said integer depth value of the pixel by using:

• said de-quantization function defined on a second range depending on a second value greater than said bit-depth;

· said first value; and

• said offset value.

4. A device for decoding an integer depth value of a pixel of a picture, the integer depth value being encoded according to a bit-depth, the picture being associated with an offset value, the device comprising a processor configured for:

- testing if said integer depth value belongs to a first range depending on a first value lower than said bit-depth, and if so, decoding said integer depth value of the pixel by using a de-quantization function defined on said first range;

- otherwise, decoding said integer depth value of the pixel by using:

• said de-quantization function defined on a second range depending on a second value greater than said bit-depth;

• said first value; and

• said offset value.

Description:
METHODS AND APPARATUS FOR DEPTH ENCODING AND DECODING

1. Technical field

The present disclosure relates to the domain of three-dimensional (3D) scenes and volumetric video content. The present disclosure is also understood in the context of the encoding and/or the formatting and/or the decoding of data representative of the depth of a 3D scene, for example for the rendering of volumetric content on end-user devices such as mobile devices or Head-Mounted Displays.

2. Background

This section is intended to introduce the reader to various aspects of art, which may be related to various aspects of the present disclosure that are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present invention. Accordingly, these statements are to be read in this light, and not as admissions of prior art.

Recently there has been a growth of available large field-of-view content (up to 360°). Such content is potentially not fully visible by a user watching the content on immersive display devices such as Head Mounted Displays, smart glasses, PC screens, tablets, smartphones and the like. That means that at a given moment, a user may only be viewing a part of the content. However, a user can typically navigate within the content by various means such as head movement, mouse movement, touch screen, voice and the like. It is typically desirable to encode and decode this content.

Immersive video, also called 360° flat video, allows the user to watch all around himself through rotations of his head around a still point of view. Rotations only allow a 3 Degrees of Freedom (3DoF) experience. Even if 3DoF video is sufficient for a first omnidirectional video experience, for example using a Head-Mounted Display device (HMD), 3DoF video may quickly become frustrating for the viewer who would expect more freedom, for example by experiencing parallax. In addition, 3DoF may also induce dizziness because a user never only rotates his head but also translates it in three directions, translations which are not reproduced in 3DoF video experiences.

A large field-of-view content may be, among others, a three-dimensional computer graphic imagery scene (3D CGI scene), a point cloud or an immersive video. Many terms might be used to designate such immersive videos: Virtual Reality (VR), 360, panoramic, 4π steradians, immersive, omnidirectional or large field of view for example.

Volumetric video (also known as 6 Degrees of Freedom (6DoF) video) is an alternative to 3DoF video. When watching a 6DoF video, in addition to rotations, the user can also translate his head, and even his body, within the watched content and experience parallax and even volumes. Such videos considerably increase the feeling of immersion and the perception of the scene depth and prevent dizziness by providing consistent visual feedback during head translations. The content is created by means of dedicated sensors allowing the simultaneous recording of color and depth of the scene of interest. The use of a rig of color cameras combined with photogrammetry techniques is a common way to perform such a recording.

While 3DoF videos comprise a sequence of images resulting from the un-mapping of texture images (e.g. spherical images encoded according to latitude/longitude projection mapping or equirectangular projection mapping), 6DoF video frames embed information from several points of view. They can be viewed as a temporal series of point clouds resulting from a three-dimensional capture. Two kinds of volumetric videos may be considered depending on the viewing conditions. A first one (i.e. complete 6DoF) allows a complete free navigation within the video content whereas a second one (aka. 3DoF+) restricts the user viewing space to a limited volume called the viewing bounding box, allowing limited translation of the head and parallax experience. This second context is a valuable trade-off between free navigation and passive viewing conditions of a seated audience member.

3DoF videos may be encoded in a stream as a sequence of rectangular color images generated according to a chosen projection mapping (e.g. cubical projection mapping, pyramidal projection mapping or equirectangular projection mapping). This encoding has the advantage of making use of standard image and video processing standards. 3DoF+ and 6DoF videos require additional data to encode the depth of colored points of point clouds. The kind of rendering (i.e. 3DoF or volumetric rendering) for a volumetric scene is not known a priori when encoding the scene in a stream. To date, streams are encoded for one kind of rendering or the other. There is a lack of a stream, and associated methods and devices, that can carry data representative of a volumetric scene that can be encoded at once and decoded either as a 3DoF video or as a volumetric video (3DoF+ or 6DoF).

Beyond the specific case of volumetric video, the encoding, formatting and decoding of depth information of a 3D scene or a volumetric content may be an issue, especially when the range of depth values to be encoded is large and the bit depth available for the encoding does not provide a sufficient number of encoding values. For a 3DoF+ content, points and surfaces of the 3D scene which may be seen at grazing incidence from a point of the viewing bounding box require high quality encoding as visual artifacts often occur in these parts of the 3D scene when displayed after decoding.

3. Summary

The following presents a simplified summary of the present principles to provide a basic understanding of some aspects of the present principles. This summary is not an extensive overview of the present principles. It is not intended to identify key or critical elements of the present principles. The following summary merely presents some aspects of the present principles in a simplified form as a prelude to the more detailed description provided below.

The present principles relate to a method of encoding a floating-point depth value of a pixel of a picture according to a bit-depth. The pixel is obtained by projecting a point of a 3D scene onto said picture. The method comprises:

- if the pixel is a non-contour pixel, then, encoding the floating-point depth value by using a quantization function defined on a first range depending on a first value lower than said bit-depth;

- otherwise, encoding the floating-point depth value by using:

• the same quantization function but defined on a second range depending on a second value greater than said bit-depth;

• the first value; and

• an offset value associated with the picture, said offset value being determined according to said first value and an interval of values comprising every floating-point depth value of pixels of the picture quantized by said quantization function defined on said second range.

The present principles also relate to a device implementing this method.

The present principles also relate to a method of decoding an integer depth value of a pixel of a picture. The integer depth value is encoded according to a bit-depth known by the decoder. The picture is associated with an offset value. The method comprises:

- if said integer depth value belongs to a first range depending on a first value lower than said bit-depth, decoding said integer depth value of the pixel by using a de-quantization function defined on this first range;

- otherwise, decoding said integer depth value of the pixel by using the three data:

• the same de-quantization function, but defined on a second range depending on a second value greater than said bit-depth;

• the first value; and

• the offset value.

The present principles also relate to a device implementing this method.
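As a purely illustrative sketch of the branching summarized above, the following Python code uses a simple linear quantizer and one possible way of applying the per-picture offset; the function names, the linear quantizer and the offset handling are assumptions made for this example and do not reproduce the normative scheme of the present principles.

```python
def quantize(z, n_bits, z_min, z_max):
    """Linear quantizer over [z_min, z_max] onto n_bits (illustrative only)."""
    n = (1 << n_bits) - 1
    t = min(max((z - z_min) / (z_max - z_min), 0.0), 1.0)
    return int(round(t * n))

def dequantize(code, n_bits, z_min, z_max):
    """Inverse of quantize()."""
    n = (1 << n_bits) - 1
    return z_min + (code / n) * (z_max - z_min)

def encode_depth(z, is_contour, first_value, second_value, offset, z_min, z_max):
    """Encode a floating-point depth into an integer code (hypothetical sketch)."""
    if not is_contour:
        # Non-contour pixel: first range, defined by first_value (< bit depth).
        return quantize(z, first_value, z_min, z_max)
    # Contour pixel: second range, defined by second_value (> bit depth),
    # shifted by the per-picture offset so the code fits the bit depth
    # (a real system would choose the offset so codes do not collide).
    return quantize(z, second_value, z_min, z_max) - offset

def decode_depth(code, first_value, second_value, offset, z_min, z_max):
    """Decode an integer code back into a floating-point depth (sketch)."""
    if code < (1 << first_value):
        # Code belongs to the first range: it was a non-contour pixel.
        return dequantize(code, first_value, z_min, z_max)
    # Otherwise the code was produced with the second range and the offset.
    return dequantize(code + offset, second_value, z_min, z_max)

# Round trip for a non-contour pixel with first_value = 9 and second_value = 12.
code = encode_depth(12.0, False, 9, 12, 3000, 0.0, 50.0)
print(code, decode_depth(code, 9, 12, 3000, 0.0, 50.0))
```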

4. Brief Description of Drawings

The present disclosure will be better understood, and other specific features and advantages will emerge upon reading the following description, the description making reference to the annexed drawings wherein:

- Figure 1 shows an image representing a three-dimensional (3D) scene 10 comprising a surface representation of several objects, according to a non-limiting embodiment of the present principles;

- Figure 2 shows a three-dimensional (3D) model of an object and points of a point cloud corresponding to the 3D model, according to a non-limiting embodiment of the present principles;

- Figure 3 shows an example of a picture comprising the texture information of the points of the 3D scene, according to a non-limiting embodiment of the present principles;

- Figure 4 shows an example of a picture 40 comprising the depth information of the points of the 3D scene 10, according to a non-limiting embodiment of the present principles;

- Figure 5 shows an example of a picture 50 comprising the depth information of the points of the 3D scene 10, for example in the case wherein the 3D scene is acquired from a single point of view, according to a non-limiting embodiment of the present principles;

- Figure 6 shows a non-limitative example of the encoding, transmission and decoding of data representative of the depth of a 3D scene in a format that may be, at the same time, compatible for 3DoF and 3DoF+ rendering, according to a non-limiting embodiment of the present principles;

- Figure 7A shows a first example of a quantization function used for quantizing the data representative of depth stored in the first pixels of pictures of figures 4 and 5, according to a non-limiting embodiment of the present principles;

- Figure 7B shows a second example of a quantization function used for quantizing the data representative of depth stored in the first pixels of pictures of figures 4 and 5, according to a non-limiting embodiment of the present principles;

- Figure 7C shows a third example of a quantization function 73 used for quantizing the data representative of depth stored in the first pixels of pictures of figures 4 and 5, according to a non-limiting embodiment of the present principles;

- Figure 8 shows the concept of visual acuity of human vision, according to a non-restrictive embodiment of the present principles;

- Figure 9 shows an example of a process to determine quantization parameters from the picture of figure 4 or 5, according to a non-restrictive embodiment of the present principles;

- Figure 10 shows an example of an image containing information representative of the quantization parameters associated with the picture of figure 4 or 5, according to a non-restrictive embodiment of the present principles;

- Figure 11 shows an example of a table mapping identifiers with the quantization parameters associated with the picture of figure 4 or 5, according to a non-restrictive embodiment of the present principles;

- Figure 12 shows an example of a process to determine a plurality of quantization parameters from a single block of pixels of the picture of figure 4 or 5, according to a non-restrictive embodiment of the present principles;

- Figure 13 shows examples of methods to encode quantized depth values of the 3D scene of figure 1, according to two non-restrictive embodiments of the present principles;

- Figure 14 shows an example of the syntax of a bitstream carrying the information and data representative of the depth of the 3D scene of figure 1, according to a non-restrictive embodiment of the present principles;

- Figure 15 shows an example architecture of a device which may be configured to implement a method or process described in relation with figures 9, 12, 13, 16 and/or 21, according to a non-restrictive embodiment of the present principles;

- Figure 16 illustrates a contour band as a part of the scene surface which may be seen at grazing incidence from at least one point of view of a viewing bounding box, according to a non-restrictive embodiment of the present principles;

- Figure 17 illustrates the contour band on a 3D view, where some peripheral parts of the character’s surface are contour band, according to a non-restrictive embodiment of the present principles;

- Figure 18 shows a hybrid block which partially covers the character’s face, this face part being also composed of contour and non-contour zones, according to a non-restrictive embodiment of the present principles;

- Figure 19 illustrates a combination of depth ranges of figure 18, according to a non-restrictive embodiment of the present principles;

- Figure 20 illustrates a method of encoding a floating-point depth value of a pixel of a picture according to a bit-depth, according to a non-restrictive embodiment of the present principles;

- Figure 21 illustrates a method of decoding an integer depth value of a pixel of a picture, according to a non-restrictive embodiment of the present principles.

5. Detailed description

The present principles will be described more fully hereinafter with reference to the accompanying figures, in which examples of the present principles are shown. The present principles may, however, be embodied in many alternate forms and should not be construed as limited to the examples set forth herein. Accordingly, while the present principles are susceptible to various modifications and alternative forms, specific examples thereof are shown by way of examples in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit the present principles to the particular forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present principles as defined by the claims.

The terminology used herein is for the purpose of describing particular examples only and is not intended to be limiting of the present principles. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises", "comprising," "includes" and/or "including" when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Moreover, when an element is referred to as being "responsive" or "connected" to another element, it can be directly responsive or connected to the other element, or intervening elements may be present. In contrast, when an element is referred to as being "directly responsive" or "directly connected" to another element, there are no intervening elements present. As used herein the term "and/or" includes any and all combinations of one or more of the associated listed items and may be abbreviated as "/".

It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element without departing from the teachings of the present principles.

Although some of the diagrams include arrows on communication paths to show a primary direction of communication, it is to be understood that communication may occur in the opposite direction to the depicted arrows.

Some examples are described with regard to block diagrams and operational flowcharts in which each block represents a circuit element, module, or portion of code which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in other implementations, the function(s) noted in the blocks may occur out of the order noted. For example, two blocks shown in succession may, in fact, be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending on the functionality involved.

Reference herein to “in accordance with an example” or “in an example” means that a particular feature, structure, or characteristic described in connection with the example can be included in at least one implementation of the present principles. The appearances of the phrase “in accordance with an example” or “in an example” in various places in the specification are not necessarily all referring to the same example, nor are separate or alternative examples necessarily mutually exclusive of other examples.

Reference numerals appearing in the claims are by way of illustration only and shall have no limiting effect on the scope of the claims. While not explicitly described, the present examples and variants may be employed in any combination or sub-combination.

Points of the 3D scene to be rendered are projected onto surfaces according to a projection mapping (e.g. cubical projection mapping, pyramidal projection mapping or equirectangular projection mapping). Projections of points of the 3D scene onto surfaces are used to generate pictures that may be images (i.e. a full view of the 3D scene as seen from a point of view) or patch images (i.e. pieces, fragments of the 3D scene as seen from a point of view). Preparing a scene for a 3DoF+ rendering includes projecting every point visible from any point of view comprised in the viewing bounding box onto at least one picture. So, every point visible from any point of view comprised in the viewing bounding box may be decoded and de-projected to re-build the 3D scene for rendering and displaying.

A projected point is a pixel in a picture, that is, data comprising coordinates (x,y) in said picture and a depth value and, optionally, a color value (otherwise the 3D scene will be displayed in mono-color, typically gray). Coordinates (x,y) are a couple of integers locating the pixel within the 2D space of the rectangular picture as well-known in the state of the art. A depth value is a floating-point value representative of the distance in the 3D space of the 3D scene between the view point of the projection surface and the projected point. For example, a depth value is a distance comprised between 0 and 50 meters or between 0 and 100 meters or even bigger depth ranges. The color value may be expressed in various color spaces, for example RGB (Red, Green and Blue) or YUV (Y being the luma component and UV two chrominance components).

To be encoded in a picture (in preparation for compression, transport, decompression and decoding according to standard image and video processing standards), floating-point depth values have to be quantized into integer values. The quantization process is performed according to a quantization function (e.g. 1/z) and a range of quantized values (e.g. [0, 4095] or [0, 1023] or [512, 1023]). In standard images, the quantization range is a power of 2, so it can be encoded in a given number of bits in the bitstream. For this reason, the size of the quantization range is also called “bit depth”. For example, the bit depth is set to 10 bits in HEVC coding technology, so the quantization range is [0, 1023]. The quality of depth encoding depends on the size of the quantization range and the width of the depth range.
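As an illustration of this quantization step, the following Python sketch quantizes a floating-point depth with a 1/z-based function onto a configurable integer range; the mapping direction (near depths to high codes) and the rounding are assumptions made for the example, not a normative definition.

```python
def quantize_inverse_depth(z, z_min, z_max, q_range=(0, 1023)):
    """Quantize depth z with a 1/z-based function onto an integer range
    (a sketch of the generic scheme described above)."""
    q_lo, q_hi = q_range
    # Affine transform of 1/z mapping z_min -> q_hi and z_max -> q_lo.
    t = (1.0 / z - 1.0 / z_max) / (1.0 / z_min - 1.0 / z_max)
    return q_lo + int(round(t * (q_hi - q_lo)))

def dequantize_inverse_depth(q, z_min, z_max, q_range=(0, 1023)):
    """Inverse mapping: integer code back to a floating-point depth."""
    q_lo, q_hi = q_range
    t = (q - q_lo) / (q_hi - q_lo)
    return 1.0 / (t * (1.0 / z_min - 1.0 / z_max) + 1.0 / z_max)

# 10-bit example over [0.1 m, 50 m]; [512, 1023] would be another valid range.
q = quantize_inverse_depth(2.0, 0.1, 50.0)
print(q, dequantize_inverse_depth(q, 0.1, 50.0))
```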

At the decoding, a pixel of a picture is used to de-project a point in the 3D space of the 3D scene to render. The view point and view direction associated with the picture and the coordinates (x,y) of a pixel are used to determine a direction in the 3D space from the point of view. The depth value is used to place a point along this direction. The integer depth value stored in the pixel is de-quantized to retrieve a floating-point value representative of the distance in the 3D space of the 3D scene between the view point of the projection surface and the projected point. The color value is used to provide the generated point with a color.
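The following Python sketch illustrates this de-projection for one pixel of an equirectangular depth picture; the equirectangular convention and the linear de-quantization used here are assumptions chosen for the example.

```python
import math

def deproject_equirectangular(x, y, q, width, height, z_min, z_max, bit_depth=10):
    """Place a 3D point for pixel (x, y) of an equirectangular depth picture.
    The integer code q is de-quantized linearly here for illustration; the
    actual de-quantization function is the one signalled with the stream."""
    # Pixel -> spherical direction (longitude/latitude).
    lon = (x + 0.5) / width * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (y + 0.5) / height * math.pi
    # De-quantize the depth code back to a radial distance.
    z = z_min + q / (2**bit_depth - 1) * (z_max - z_min)
    # Direction vector scaled by the distance, relative to the view point.
    return (z * math.cos(lat) * math.cos(lon),
            z * math.sin(lat),
            z * math.cos(lat) * math.sin(lon))

print(deproject_equirectangular(2048, 1080, 512, 4096, 2160, 0.0, 50.0))
```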

Figure 1 shows an image representing a three-dimensional (3D) scene 10 comprising a surface representation of several objects. The scene may have been acquired using any suitable technology. For example, it may have been created using computer graphics interface (CGI) tools. It may have been acquired with color and depth image acquisition devices. In such a case, it is possible that one or more parts of the objects that are not visible from the acquisition devices (e.g. cameras) may not be represented in the scene as described in relation to figure 1. The example scene illustrated in figure 1 comprises characters and objects in a room. The 3D scene 10 is represented according to a determined point of view in figure 1. This point of view may for example be part of a space of view from which a user may observe the 3D scene. According to a variant, the content of the 3D scene (depth and/or texture information) that is available corresponds only to the elements of the scene (e.g. points) that are visible from the determined point of view of figure 1.

Figure 2 shows a three-dimensional (3D) model of an object 20 and points of a point cloud 21 corresponding to the 3D model 20. The 3D model 20 and the point cloud 21 may for example correspond to a possible 3D representation of an object of the 3D scene 10, for example the head of a character. The model 20 may be a 3D mesh representation and points of the point cloud 21 may be the vertices of the mesh. Points of the point cloud 21 may also be points spread on the surface of faces of the mesh. The model 20 may also be represented as a splatted version of the point cloud 21, the surface of the model 20 being created by splatting the points of the point cloud 21. The model 20 may be represented by many different representations such as voxels or splines. Figure 2 illustrates the fact that a point cloud may be defined with a surface representation of a 3D object and that a surface representation of a 3D object may be generated from a point cloud. As used herein, projecting points of a 3D object (by extension, points of a 3D scene) onto an image is equivalent to projecting any representation of this 3D object to create an object.

A point cloud may be seen as a vector-based structure, wherein each point has its coordinates (e.g. three-dimensional coordinates XYZ, or a depth/distance from a given viewpoint) and one or more attributes, also called components. An example of component is the color component that may be expressed in various color spaces, for example RGB (Red, Green and Blue) or YUV (Y being the luma component and UV two chrominance components). The point cloud is a representation of the object as seen from a given viewpoint, or a range of viewpoints. The point cloud may be obtained by many ways, e.g.:

- from a capture of a real object shot by a rig of cameras, optionally complemented by a depth active sensing device;

- from a capture of a virtual/synthetic object shot by a rig of virtual cameras in a modelling tool;

- from a mix of both real and virtual objects.

The volumetric parts of the 3D scene may for example be represented with one or several point clouds such as the point cloud 21.

Figure 3 shows an example of a picture 30 comprising the texture information (e.g. RGB data or YUV data) of the points of the 3D scene 10, according to a non-limiting embodiment of the present principles.

The picture 30 comprises a first part 301 comprising the texture information of the elements (points) of the 3D scene that are visible from a first viewpoint and one or more second parts 302. The texture information of the first part 301 may for example be obtained according to an equirectangular projection mapping, an equirectangular projection mapping being an example of spherical projection mapping. In the example of figure 3, the second parts are arranged at the left and right borders of the first part 301 but the second parts may be arranged differently. The second parts 302 comprise texture information of parts of the 3D scene that are complementary to the part visible from the first viewpoint. The second parts may be obtained by removing from the 3D scene the points that are visible from the first viewpoint (the texture of which is stored in the first part) and by projecting the remaining points according to the same first viewpoint. The latter process may be reiterated to obtain, at each iteration, the hidden parts of the 3D scene. According to a variant, the second parts may be obtained by removing from the 3D scene the points that are visible from the first viewpoint (the texture of which is stored in the first part) and by projecting the remaining points according to a viewpoint different from the first viewpoint, for example from one or more second viewpoints of a space of view centered onto the first viewpoint.

The first part 301 may be seen as a first large texture patch (corresponding to a first part of the 3D scene) and the second parts 302 comprise smaller texture patches (corresponding to second parts of the 3D scene that are complementary to the first part).

Figure 4 shows an example of a picture 40 comprising the depth information of the points of the 3D scene 10, according to a non-limiting embodiment of the present principles. The picture 40 may be seen as the depth picture corresponding to the texture picture 30.

The picture 40 comprises a first part 401 comprising the depth information of the elements (points) of the 3D scene that are visible from the first viewpoint and one or more second parts 402. The picture 40 may be obtained in a same way as the picture 30 but contains the depth information associated with the points of the 3D scene instead of the texture information as in the picture 30.

The first part 401 may be seen as a first large depth patch (corresponding to a first part of the 3D scene) and the second parts 402 comprise smaller depth patches (corresponding to second parts of the 3D scene that are complementary to the first part).

For 3DoF rendering of the 3D scene, only one point of view, for example the first viewpoint, is considered. The user may rotate his head in three degrees of freedom around the first point of view to watch various parts of the 3D scene, but the user cannot move the first viewpoint. Points of the scene to be encoded are points which are visible from this first viewpoint, and only the texture information needs to be encoded / decoded for the 3DoF rendering. There is no need to encode points of the scene that are not visible from this first viewpoint as the user cannot access them by moving the first viewpoint.

With regard to 6DoF rendering, the user may move the viewpoint everywhere in the scene. In this case, it is valuable to encode every point (depth and texture) of the scene in the bitstream as every point is potentially accessible by a user who can move his/her point of view. At the encoding stage, there is no means to know, a priori, from which point of view the user will observe the 3D scene 10.

With regard to 3DoF+ rendering, the user may move the point of view within a limited space around a point of view, for example around the first viewpoint. For example, the user may move his point of view within a determined space of view centered on the first viewpoint. This enables the user to experience parallax. Data representative of the part of the scene visible from any point of the space of view is to be encoded into the stream, including the data representative of the 3D scene visible according to the first viewpoint (i.e. the first parts 301 and 401). The size and shape of the space of view may for example be decided and determined at the encoding step and encoded in the bitstream. The decoder may obtain this information from the bitstream and the renderer limits the space of view to the space determined by the obtained information. According to another example, the renderer determines the space of view according to hardware constraints, for example in relation to capabilities of the sensor(s) that detects the movements of the user. In such a case, if, at the encoding phase, a point visible from a point within the space of view of the renderer has not been encoded in the bitstream, this point will not be rendered. According to a further example, data (e.g. texture and/or geometry) representative of every point of the 3D scene is encoded in the stream without considering the rendering space of view. To optimize the size of the stream, only a subset of the points of the scene may be encoded, for instance the subset of points that may be seen according to a rendering space of view.

Figure 5 shows an example of a picture 50 comprising the depth information of the points of the 3D scene 10, for example in the case wherein the 3D scene is acquired from a single point of view, according to a non-limiting embodiment of the present principles. The picture 50 corresponds to an array of first pixels, each first pixel comprising data representative of depth. The picture 50 may also be called a depth map. The data corresponds for example to a floating-point value indicating, for each first pixel, the radial distance z to the viewpoint of the picture 50 (or to the center of projection when the picture 50 is obtained by projection). The depth data may be obtained with one or more depth sensors or may be known a priori, for example for CGI parts of the scene. The depth range [zmin, zmax] contained in picture 50, i.e. the range of depth values comprised between the minimal depth value zmin and the maximal depth value zmax of the scene, may be large, for example from 0 to 100 meters. The depth data is represented with a shade of grey in figure 5, the darker the pixel (or point), the closer to the viewpoint.

The picture 50 may be part of a group of temporally successive pictures of the scene, called GOP (Group of Pictures). A GOP may for example comprise pictures of different types, for example an I picture (i.e. intra coded picture), a P picture (i.e. predictive coded picture) and B pictures (i.e. bipredictive coded pictures). There is a coding relationship between pictures. For example, a P picture may be coded by referring to an I picture, a B picture may be coded by using references to I and P pictures. The GOP may be part of an intra period, i.e. a sequence of pictures comprised between two I pictures, the first I picture belonging to said intra period and indicating the beginning of the intra period while the second (temporally speaking) I picture does not belong to said intra period but to the following intra period.

An I picture is a picture that is coded independently of all other pictures. Each intra period begins (in decoding order) with this type of picture.

A P picture comprises motion-compensated difference information relative to previously decoded pictures. In compression standards such as MPEG-1, H.262/MPEG-2, each P picture can only reference one picture, and that picture must precede the P picture in display order as well as in decoding order and must be an I or P picture. These constraints do not apply in more recent standards such as H.264/MPEG-4 AVC and HEVC.

A B picture comprises motion-compensated difference information relative to previously decoded pictures. In standards such as MPEG-1 and H.262/MPEG-2, each B picture can only reference two pictures, the one which precedes the B picture in display order and the one which follows, and all referenced pictures must be I or P pictures. These constraints do not apply in more recent standards such as H.264/MPEG-4 AVC and HEVC.

Pictures 30 and 40 may also be each part of a GOP, like picture 50.

Figure 6 shows a non-limitative example of the encoding, transmission and decoding of data representative of the depth of a 3D scene in a format that may be, at the same time, compatible for 3DoF and 3DoF+ rendering.

A picture of a 3D scene 60 (or a sequence of pictures of the 3D scene) is encoded in a stream 62 by an encoder 61. The stream 62 comprises a first element of syntax carrying data representative of a 3D scene for a 3DoF rendering (data of the first part of the picture 30 for example) and at least a second element of syntax carrying data representative of the 3D scene for 3DoF+ rendering (e.g. data of the second parts of the picture 30 and picture 40).

The encoder 61 is for example compliant with an encoder such as:

- JPEG, specification ISO/CEI 10918-1 UIT-T Recommendation T.81, https://www.itu.int/rec/T-REC-T.81/en;

- AVC, also named MPEG-4 AVC or h264, specified in both UIT-T H.264 and ISO/CEI MPEG-4 Part 10 (ISO/CEI 14496-10), http://www.itu.int/rec/T-REC-H.264/en;

- HEVC (its specification is found at the ITU website, T recommendation, H series, h265, http://www.itu.int/rec/T-REC-H.265-201612-I/en);

- 3D-HEVC (an extension of HEVC whose specification is found at the ITU website, T recommendation, H series, h265, http://www.itu.int/rec/T-REC-H.265-201612-I/en annex G and I);

- VP9 developed by Google; or

- AV1 (AOMedia Video 1) developed by Alliance for Open Media.

A decoder 63 obtains the stream 62 from a source. For example, the source belongs to a set comprising:

- a local memory, e.g. a video memory or a RAM (or Random-Access Memory), a flash memory, a ROM (or Read Only Memory), a hard disk;

- a storage interface, e.g. an interface with a mass storage, a RAM, a flash memory, a ROM, an optical disc or a magnetic support;

- a communication interface, e.g. a wireline interface (for example a bus interface, a wide area network interface, a local area network interface) or a wireless interface (such as an IEEE 802.11 interface or a Bluetooth® interface); and

- a user interface such as a Graphical User Interface enabling a user to input data.

The decoder 63 decodes the first element of syntax of the stream 62 for 3DoF rendering 64. For 3DoF+ rendering 65, the decoder decodes both the first element of syntax and the second element of syntax of the stream 62.

The decoder 63 is compliant with the encoder 61, for example compliant with a decoder such as:

- JPEG;

- AVC;

- HEVC;

- 3D-HEVC (an extension of HEVC);

- VP9; or

- AV1.

Figures 7A, 7B and 7C show examples of quantization functions that may be used for quantizing the depth data of pictures 40 and/or 50, according to a non-limiting embodiment of the present principles.

Figure 7A shows a first example of a quantization function 71 used for quantizing the data representative of depth stored in the first pixels of pictures 40 and/or 50. The abscissa axis represents the depth (expressed with floating-point values, between zmin = 0 and zmax = 50 meters) and the ordinate axis represents the quantized depth values (between 0 and 65536). The quantization function 71 q1(z) is an affine transform of the depth z, for example:

q1(z) = 2^D · (z − zmin) / (zmax − zmin)

wherein D represents the encoding bit depth, D being for example equal to 16 in the example of figure 7A to obtain enough quantized depth values enabling a small quantization step, which is needed to represent the depth with a good quality, especially for objects close to the viewpoint. According to another example, D may be set to 32.

With such a quantization function 71, the quantization error is the same whatever the depth z. For example, for a 10-bit encoding (1024 values available for coding the quantized depth over the whole range of depth of the 3D scene, for example 50 meters), the error is 5 cm, which may generate visible artefacts, especially for foreground objects. For a 16-bit encoding, the error is 0.8 mm. The quantization error may be: e_q1 = (zmax − zmin) / 2^D.

Figure 7B shows a second example of a quantization function 72 used for quantizing the data representative of depth stored in the first pixels of pictures 40 and/or 50. The abscissa axis represents the depth (expressed with floating-point values, between zmin = 0.1 and zmax = 50 meters) and the ordinate axis represents the quantized depth values (between 0 and 65536). The quantization function 72 q2(z) is an affine transform of the inverse of the depth 1/z, for example:

q2(z) = 2^D · zmin · (zmax − z) / (z · (zmax − zmin))

wherein D represents the encoding bit depth, D being for example equal to 16 in the example of figure 7B to obtain enough quantized depth values enabling a small quantization step, which is needed to represent the quantized depth with a good quality, especially for objects far away from the viewpoint. According to another example, D may be set to 32. With such a quantization function 72, the quantization error is minimal for low values of depth but very high for higher values of depth. For example, for a 10-bit encoding (1024 values available for coding the quantized depth over the whole range of depth of the 3D scene, for example 50 meters), the error is 24 meters at zmax, which may generate visible artefacts for background objects. For a 16-bit encoding, the error is 38 cm. The quantization error may be: e_q2(z) = z² · (zmax − zmin) / (2^D · zmin · zmax).
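The orders of magnitude quoted for figures 7A and 7B can be checked numerically; the small Python sketch below simply evaluates the two error formulas given above.

```python
def linear_error(z_min, z_max, bit_depth):
    """Quantization step of the affine quantizer q1 (constant over the range)."""
    return (z_max - z_min) / 2**bit_depth

def inverse_error(z, z_min, z_max, bit_depth):
    """Quantization step of the inverse-depth quantizer q2 at depth z."""
    return z**2 * (z_max - z_min) / (2**bit_depth * z_min * z_max)

# Reproduce the orders of magnitude quoted above (z_max = 50 m).
print(linear_error(0.0, 50.0, 10))         # ~0.049 m  (about 5 cm)
print(linear_error(0.0, 50.0, 16))         # ~0.0008 m (about 0.8 mm)
print(inverse_error(50.0, 0.1, 50.0, 10))  # ~24 m at z_max
print(inverse_error(50.0, 0.1, 50.0, 16))  # ~0.38 m at z_max
```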

The quantization functions 71 and 72 are not perceptually consistent, i.e. they do not account for the visual acuity of human vision. Figure 7C shows a third example of a quantization function 73 used for quantizing the data representative of depth stored in the first pixels of pictures 40 and/or 50, the quantization function 73 being perceptually consistent.

To explain the perceptual consistency of a quantization function, let’s introduce the quantization error function:

e_qx(z) = z_x(q_x(z) + 1) − z_x(q_x(z))

where z_x denotes the reciprocal (de-quantization) function associated with the quantization function q_x. This latter quantity represents the amount of depth variation that occurs when a quantization delta equal to 1 happens, at a given depth z, and for a given quantization function q_x. The error function basically helps understanding how a depth value could vary when it is obtained from an erroneous quantized input depth (typically due to compression artifacts).

Let’s also introduce the concept of visual acuity of human vision, illustrated in Figure 8. This latter represents the minimal angle 82 γ from which a human eye 81 can distinguish 2 distinct points 85, 86 in the 3D space. From this angle 82, one can compute the minimal perceptible distance d(z) 84 between 2 points 85, 86 that can be distinguished by a human eye at a certain distance z 83:

d(z) = 2z · tan(γ/2)

A parameter α representative of the perceptual consistency may be defined with: α = 2 tan(γ/2), so that d(z) = α · z. A perceptually consistent quantization scheme q_x should ensure that the quantization error e_qx(z) is constant with regard to the minimal perceptible distance, whatever the considered depth. In other words, one should have:

e_qx(z) / (α · z) = constant

which is the case neither for q1 nor q2. In contrast, the quantization function 73 defined with q_α(z) is perceptually consistent:

q_α(z) = log(z / zmin) / log(1 + α)

This quantization function 73 implies the following reciprocal recursive sequence z_{i+1} = (1 + α) · z_i (with z_0 = zmin) and the associated reciprocal function z_α(q) = zmin · (1 + α)^q. Moreover, it is straightforward that

e_qα(z) = z_α(q_α(z) + 1) − z_α(q_α(z)) = α · z_α(q_α(z)) ≈ α · z

which ensures that the associated quantization error is perceptually consistent, whatever the depth z (the error is linear with regard to the depth).
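The following Python sketch implements q_α and its reciprocal z_α as defined above and verifies that the quantization step grows linearly with the depth; the value chosen for γ (about one arc-minute) and the flooring to an integer code are assumptions made for the example.

```python
import math

def q_alpha(z, z_min, alpha):
    """Index of z in the geometric sequence z_i = z_min * (1 + alpha)**i."""
    return int(math.floor(math.log(z / z_min) / math.log(1.0 + alpha)))

def z_alpha(q, z_min, alpha):
    """Reciprocal (de-quantization) function z_alpha(q) = z_min * (1 + alpha)**q."""
    return z_min * (1.0 + alpha) ** q

# Assumed value: gamma of about one arc-minute of visual acuity.
gamma = math.radians(1.0 / 60.0)
alpha = 2.0 * math.tan(gamma / 2.0)

# The quantization step z_alpha(q + 1) - z_alpha(q) equals alpha * z_alpha(q),
# i.e. it grows linearly with the depth, as stated above.
for z in (0.5, 5.0, 50.0):
    q = q_alpha(z, 0.1, alpha)
    step = z_alpha(q + 1, 0.1, alpha) - z_alpha(q, 0.1, alpha)
    print(z, q, round(step / z_alpha(q, 0.1, alpha), 6))  # last column ~ alpha

# Number of codes needed for [0.1 m, 50 m]: far more than 2**10 or 2**12.
print(q_alpha(50.0, 0.1, alpha) + 1)
```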

Nevertheless, the quantization function 73, like the quantization functions 71 and 72, does not fit the encoding bit-depth (typically 8, 10 or 12 bits) constraint imposed by legacy video encoders such as an HEVC encoder. As it appears on figure 7C, more than 5000 quantized depth values are needed for a depth range comprised between zmin = 0 and zmax = 50 meters, that is to say more than 2^12 values.

Figure 9 shows an example of a process to determine quantization parameters from the picture 40 or 50, according to a non-restrictive embodiment of the present principles.

In a first operation 90, the picture 40 (or 50) comprising the data representative of depth (a depth value being associated with (or stored in) each first pixel of the picture 40, 50) is divided into a plurality of blocks of first pixels, for example into blocks of 8x8 or 16x16 first pixels, forming an array of blocks of first pixels. The picture 40 (or 50) may optionally be part of a group of pictures 901. A same quantization function, such as the quantization function 71, 72 or 73, is applied to the picture 40, i.e. to all blocks of the picture 40. When the picture 40 is part of a GOP 901, the same quantization function is applied to each picture of the GOP, each picture of the GOP being divided in a same way, in a same plurality of blocks of first pixels. The process of figure 9 will be described by referring to one block when the process is performed on a single picture 40 (or 50), for example the upper right block of the picture, the same process being applied to each block in a same manner. When the picture 40 is part of the GOP 901, the process of figure 9 will be described by referring to one spatio-temporal block 902, which represents the set union of all upper right blocks of each picture of the GOP 901. The block 902 encompasses all quantized depth values of all upper right blocks of each picture of the GOP 901. Each block of first pixels of the picture 40 may be represented by the row index “i” and the column index “j” the block belongs to. The blocks of first pixels represented by the block 902 have the same row and column indices in their respective pictures.

In a second operation 91, the quantization function 73 is applied to the depth data of the upper right block of the picture 40 (or to each upper right block of all pictures of the GOP 901). The depth data is represented with a range of depth values 912 with a minimal depth value zmin and a maximal depth value zmax, which correspond to the limits of the range 912. Applying the quantization function 73 to the range 912 enables a range of quantized depth values 911 to be obtained.

In a third operation 92, the range of quantized values 911 is analyzed to determine a unique per-block quantizer, this quantizer corresponding to a quantization parameter representative of the range 911 of quantized depth values. To reach that aim, a set of candidate quantization parameters is determined for the range 911. The set of candidate quantization parameters comprises a plurality of quantized depth values that may be used as a reference to represent the range of quantized depth values 911. A candidate quantization parameter q^(i,j) is a special value of the quantization scale such that the range [q^(i,j), q^(i,j) + N − 1] covers the range 911 of quantized depth values previously identified for the considered block of first pixels, N corresponding to the number of encoding values allowed by the bit-depth of the encoder used to encode the depth data of the picture 40 or 50 (or of the pictures of the GOP 901). As an example, the bit-depth is 10 bits and N = 2^10 = 1024. For example, consider that the range 911 of quantized depth values is [3990, 4700], with 3990 and 4700 as its lower and upper bounds. Considering a value for N 921 equal to 1024, the set 922 of candidate quantization parameters corresponds to the range of values [3676, 4699], with 3676 = 4700 − N. The same process is reiterated for each block of first pixels of the picture 40 (or of the GOP 901) to obtain a first set of candidate quantization parameters for each block of the picture 40 (or of the GOP 901).
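As an illustration, the per-block candidate set can be computed as below; the inclusive/exclusive conventions at the interval bounds are assumptions of this sketch, which is why the printed interval may differ by one unit from the example above.

```python
def candidate_quantizers(q_block_min, q_block_max, bit_depth=10):
    """Candidate quantization parameters for one block (illustrative sketch).

    A candidate q is a value of the quantization scale such that the window
    [q, q + N - 1] covers the block's range of quantized depth values
    [q_block_min, q_block_max], with N = 2**bit_depth.
    """
    n = 2 ** bit_depth
    lo = q_block_max - n + 1   # smallest q whose window still reaches q_block_max
    hi = q_block_min           # largest q whose window still starts at q_block_min
    if lo > hi:
        return range(0)        # the block's range is wider than N codes
    return range(lo, hi + 1)

# Block range [3990, 4700] of the example above, with N = 1024 codes:
cands = candidate_quantizers(3990, 4700)
print(cands.start, cands.stop - 1)   # about [3677, 3990] with this convention
```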

In a fourth operation 93, a second set of quantization parameters that corresponds to a subset of the first set of candidate quantization parameters is determined. To reach that aim, the minimal number of candidate quantization parameters that may be used to represent all blocks of first pixels, i.e. all ranges of quantized depth values, is determined within the first set of candidate quantization parameters. This second set may for example be obtained by applying a greedy algorithm to get an optimally sized second set for the whole picture 40 (or for the whole GOP 901). A greedy algorithm iteratively makes the locally optimal choice at each stage. According to a variant, the second set may be obtained with a genetic algorithm, with an evolutionary algorithm or with a particle swarm algorithm. The obtained second set of quantization parameters may for example be stored in an image 100 described in relation to figure 10.

Figure 10 shows an example of an image 100 containing information representative of the quantization parameters associated with the picture 40 or 50 (or with the GOP comprising the picture 40 or 50), according to a non-restrictive embodiment of the present principles.
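A minimal sketch of the greedy strategy mentioned above could look as follows; the data layout (one set of candidate values per block) is an assumption made for the example.

```python
def greedy_quantizer_selection(block_candidates):
    """Greedy selection of a small set of quantization parameters.

    block_candidates: list of sets, one per block, each containing the
    candidate quantization parameters acceptable for that block.
    Returns a list of parameters such that every block has at least one
    of its candidates selected (a sketch of the greedy strategy only).
    """
    uncovered = set(range(len(block_candidates)))
    selected = []
    while uncovered:
        # Count, for each candidate value, how many uncovered blocks it serves.
        votes = {}
        for b in uncovered:
            for q in block_candidates[b]:
                votes[q] = votes.get(q, 0) + 1
        best = max(votes, key=votes.get)   # locally optimal choice
        selected.append(best)
        uncovered = {b for b in uncovered if best not in block_candidates[b]}
    return selected

# Toy example with three blocks whose candidate ranges partially overlap.
blocks = [set(range(100, 200)), set(range(150, 260)), set(range(400, 500))]
print(greedy_quantizer_selection(blocks))   # e.g. one value in [150, 199], one in [400, 499]
```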

The image 100 corresponds to a matrix of second pixels arranged in rows and columns. The number of columns of the image 100 corresponds to the number of first pixels in a row of the picture 40 divided by the block size and the number of rows corresponds to the number of first pixels in a column of the picture 40 divided by the block size. For example, if the size of the picture 40 is 4096 x 2160 first pixels and the size of a block is 8 x 8 first pixels, the number of columns of the image 100 is 512 and the number of rows is 270. The number of second pixels is therefore 512 x 270. Each second pixel of the image 100 is associated with a corresponding block of first pixels of the picture 40 or 50 (or of the GOP 901). For example, the upper left second pixel 100_00 is associated with the upper left block of first pixels of the picture 40 or 50, the index 00 of the reference number 100_00 of this upper left second pixel corresponding to the row and column indices this upper left second pixel belongs to. The second pixels may be identified with the reference numbers 100_00 to 100_mn, their indices corresponding to the indices (row and column in the array of blocks of first pixels of the picture 40 or 50) of the blocks of first pixels the second pixels are associated with. Each second pixel may receive the quantization parameter determined in operation 93 to represent the range of quantized depth values of the block of first pixels said second pixel is associated with. As several second pixels receive the same quantization parameter (as at least spatially adjacent blocks of first pixels share common (identical) quantization parameters of the second set), the compression efficiency of the image 100 is high when encoding this image 100.

To reduce the bit rate when transmitting the encoded image 100, identifiers may be mapped to the quantization parameters of the second set and these identifiers are stored in the second pixels instead of the quantization parameters. It has been observed that a few dozen quantization parameters may be sufficient to represent all ranges of quantized depth values comprised in the blocks of first pixels of the picture 40 or 50. According to this specific embodiment, the image 100 may contain one identifier per second pixel, the image enabling the mapping of the identifiers with the blocks of first pixels, as one second pixel of the image 100 is associated with (or corresponds to) one block of first pixels of the picture 40 or 50. The mapping between the identifiers and the quantization parameters may for example be stored in a LUT (Look-Up Table) such as the table 110 described in relation to figure 11.

Figure 11 shows an example of a table 110 mapping identifiers with the quantization parameters associated with the picture 40 or 50, according to a non-restrictive embodiment of the present principles.

The table 110 comprises a list of identifiers ‘Id’ mapping to values of the quantization parameters of the second set, one identifier mapping to one quantization parameter in a first part 111 of the table and one identifier mapping to a plurality of quantization parameters in a second part 112 of the table. The identifiers ‘Id’ may for example be coded on 8 bits, the Id taking the integer values 0 to 255. According to a variant, to reduce the bit rate when transmitting the table 110, the values of the plurality of quantization parameters associated with one identifier in the second part 112 are replaced with the identifiers of the first part 111 these quantization parameter values map to. For example, the identifier 128 maps to the identifiers 1, 7, 0 and 0 of the first part, meaning that the quantization parameters identified with the Id 128 are the values 1200 and 5010, 1200 being identified with the identifier ‘1’ and 5010 being identified with the identifier ‘7’ in the first part 111. The second part 112 refers to the first part 111 of the list or table of identifiers 110. According to this variant, the values the identifiers map to may for example be encoded on 32 bits.

Figure 12 shows an example of a process to determine a plurality of quantization parameters for a single block of pixels of the picture 40 or 50, according to a non-restrictive embodiment of the present principles.

This process is similar to the one described in relation to figure 9 but applies to blocks of first pixels comprising a plurality of ranges of quantized depth values, which are not contiguous on the quantization scale. This case appears when the content of the scene covered by a block of first pixels comprises objects or parts of objects of the scene located at different depths. When a GOP is considered, this case may also appear when a motion occurs during the GOP. In such a case, one single quantization parameter is not sufficient to represent the whole set of quantized depth values of a considered block of first pixels.

The first operation 120 corresponds to the first operation 90 of figure 9, i.e. the division of the picture 40 (or of the pictures of the GOP) into a plurality of blocks of first pixels. According to the example of figure 12, a specific block of first pixels 1204 is considered. The block 1204 comprises several areas 1201, 1202 and 1203, each area corresponding to a part of an object of the 3D scene 10, each part being at a different depth with depth gaps between each part of object 1201 to 1203.

The second operation 121 corresponds to the second operation 91 of figure 9, i.e. the applying of a same quantization function (e.g. the quantization function 71, 72 or 73) to the picture 40 or to the GOP of pictures, specifically to the block 1204 in the example of figure 12. Three ranges of depth 1211, 1212 and 1213 are obtained, each corresponding to the depth values/data of the areas 1201, 1202 and 1203, respectively, of the block 1204. These 3 ranges are not contiguous, meaning that a gap of distance exists between these ranges, i.e. there is a gap between the upper limit of the range 1211 and the lower limit of the range 1212 and there is a further gap between the upper limit of the range 1212 and the lower limit of the range 1213. Applying the quantization function to these ranges of depth enables three corresponding ranges of quantized depth values to be obtained, one for each range of depth values 1211, 1212 and 1213, with a gap between pairs of ranges of quantized depth values.

The third operation 122 corresponds to the third operation 92 and the fourth operation 93 of figure 9, i.e. the determining of quantization parameters. The same process is applied but with only a part of the number N of encoding values allowed by the bit-depth of the encoder for each range of quantized depth values. For example, the number of encoding values used for determining the set of candidate quantization parameters is N divided by the number of ranges of depth values detected in the block, i.e. 3 in the example of figure 12. A combination of quantizers {q_k^(i,j)}, 0 ≤ k < M^(i,j), is associated with each block of first pixels. In that case, the definition of a quantization parameter (that may also be called a quantizer) is slightly modified in comparison to figure 9. It becomes a value of the quantization scale such that the range [q_k^(i,j), q_k^(i,j) + W_k^(i,j) − 1] covers the range of associated quantized depth values. This latter range may for example be called a depth mode and W_k^(i,j) the length of the mode. The dynamic of the encoder N = 2^D may be harmoniously shared between each mode and the dynamic per quantizer is reduced each time a new quantizer is required, i.e. each time a new range of depth values is detected in a block. In the example of figure 12, 3 ranges 1221, 1222 and 1223 of candidate quantization parameters may be determined, one for each range of quantized depth values. The first range 1221 may be equal to [979, 1319], the second range 1222 may be equal to [2809, 3149] and the third range 1223 may be equal to [4359, 4699]. A second set of quantization parameters that corresponds to a subset of the first set of candidate quantization parameters may then be determined in a same way as in the fourth operation 93 described in relation to figure 9. An example number of quantization parameters (including the case of mono-modal blocks comprising a single depth range and multi-modal blocks comprising several depth ranges) is, in the example of the 3D scene 10, close to 40 (i.e. lower than 128, corresponding to the first part 111 of the table 110), whereas the number of different combinations of quantization parameters involved in multi-modal blocks (i.e. blocks with several depth ranges) is for example close to 30 (also lower than 128, corresponding to the second part 112 of the table 110). Moreover, for multi-modal blocks, the number of involved quantization parameters M^(i,j) is rarely over 3 for each block.

A compact way to store and reference the different modes of each block (i.e. mono-modal and multi-modal) may be achieved by considering the quantization table 110 of figure 11 with 256 elements (i.e. identifiers), each element being coded on 32 bits. Depending on its position in the table, each element may be interpreted differently (depending on whether the element belongs to the first part 111 or to the second part 112). If it belongs to the first half 111 of the table (positions 1 to 127, 0 being reserved for empty blocks), the associated value is interpreted as a 32-bit integer whose value is a quantization parameter value. If it belongs to the second half 112 of the table (positions 128 to 255), then the 32 bits are interpreted as a combination of up to four 8-bit integers, each integer being comprised between 0 and 127 and pointing to the corresponding element of the first part of the table. This second part 112 of the table thus implicitly encodes a combination of up to 4 quantizers, and the corresponding table positions are referenced by the quantization map (i.e. the image 100) to describe multi-modal blocks. The quantization map 100 only contains values comprised between 0 and 255 (the quantization table size) and can therefore be encoded with 8 bits per map element (i.e. per second pixel).
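A minimal sketch of this table interpretation, and of the mode counting described just below, could be as follows (the helper names and the numerical values are assumptions for illustration only):

```python
def table_entry_to_quantizers(position, entry, table):
    """Interpret a 32-bit element of the 256-entry quantization table.
    Positions 1..127 hold a quantization parameter directly; positions
    128..255 pack up to four 8-bit indices pointing back to the first
    half of the table (a zero index meaning 'unused')."""
    if 1 <= position <= 127:
        return [entry]                          # mono-modal block: the value itself
    indices = [(entry >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    indices = [i for i in indices if i != 0]    # non-zero bytes = number of modes
    return [table[i] for i in indices]

# Hypothetical table: positions 1..3 hold quantizers, position 128 combines 1 and 3
table = {1: 979, 2: 2809, 3: 4359}
packed = (1 << 24) | (3 << 16)                  # two modes: indices 1 and 3
print(table_entry_to_quantizers(128, packed, table))   # [979, 4359]
```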

For multi-modal blocks, the number of quantizers involved in the block may be inferred by counting the number of non-zero values among the four 8-bit integers embedded in the 32 bits of the corresponding entry of the quantization table 110. Depending on this latter number, the length of each associated mode (mono-modal and multi-modal) may be straightforwardly deduced.

Figure 13 shows examples of methods to encode quantized depth values of the picture 40 or 50, according to non-restrictive embodiments of the present principles.

In a first operation 130, depth ranges of a block are quantized using a quantization function such as the quantization function 71, 72 or 73, as in the operation 120 of figure 12. The obtained ranges of quantized depth values may be encoded according to two different methods, a first method A with operations 131 and 132 and a second method B with operations 133 and 134.

Regarding the method A, quantization parameters are determined in operation 131 in a same way as described with regard to operation 122 of figure 12. An image 100 and associated table 110 are generated as explained hereinabove.

In operation 132, the quantization of the depth values of the picture 40 may be addressed. To do so, the quantization parameters $\{q_k^{(i,j)}\}_{0 \le k < M(i,j)}$ for the block located at (i, j) are determined. This second set of quantization parameters is implicitly described by the image 100 and the associated table 110.

For a given depth z to quantize, let $q_{k_z}^{(i,j)} \in \{q_k^{(i,j)}\}_{0 \le k < M(i,j)}$ denote the quantization parameter such that $q_{k_z}^{(i,j)} \le q_a(z) < q_{k_z}^{(i,j)} + W^{(i,j)}$. Then the quantized depth may be expressed by $\tilde{q}(z) = q_a(z) - q_{k_z}^{(i,j)} + k_z\,W^{(i,j)}$.

As one can observe in the part of figure 13 corresponding to the operation 132, this latter quantization function corresponds to a “by parts” concatenation of the quantization function $q_a$ used for quantizing each depth range, considering the quantization parameter of each depth range. However, a simple concatenation as in operation 132 may generate some issues, even though the encoding of the depth values makes it possible to use the full encoding dynamic N for each block. Indeed, if “within” each part 1322 and 1323 the properties of $q_a$ guarantee a good robustness to video coding artifacts, this is not the case at the limits of each concatenated part. For example, at the level of the upper limit 1321 of the part 1322, which corresponds to the lower limit of the part 1323, coding artifacts may make a quantized value switch from one depth mode 1322 to another 1323 (or inversely), with possible unwanted visual impact at the decoding side. For example, an error of 1 in the decoded quantized depth value at the lower limit of the part 1323 may induce the use of the part 1322 of the reciprocal of the quantization function instead of the part 1323, or inversely.

The method B may be implemented to solve the issue of the method A. In operation 133, quantization parameters are determined. Instead of sharing the number of encoding values N between all ranges (N divided by the number of ranges, as described in operation 122 of figure 12), a part of the encoding values is reserved, this part being for example called DMZ. This part is reserved in the sense that the values of this part cannot be used for encoding the quantized depth values according to the associated quantization parameters. For example, 6.25 % of the number N of encoding values is reserved, i.e. 64 values for N = 1024. Such a modification impacts the way the depth mode length is computed. This latter, so far expressed as $W^{(i,j)} = \frac{N}{M(i,j)}$, has now to be computed as $W^{(i,j)} = \frac{N - (M(i,j) + 1)\,DMZ}{M(i,j)}$. On one hand, the introduction of this safety zone somewhat reduces the number of encoding values allocated per depth range (i.e. per quantization parameter), but on the other hand, it guarantees a good robustness at the limit or frontier between the encoded depth modes. With this modification, the per-block quantization function is slightly adapted and is finally expressed as $\tilde{q}(z) = q_a(z) - q_{k_z}^{(i,j)} + Q_{k_z}^{(i,j)}$, with $Q_{k_z}^{(i,j)} = (k_z + 1)\,DMZ + k_z\,W^{(i,j)}$.
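As a sketch only, assuming one reserved DMZ zone below each depth mode and one at the top of the encoder range (the function and parameter names below are not part of the application):

```python
def quantize_block_depth(qa_z, quantizers, N=1024, DMZ=64):
    """Per-block quantization with a reserved 'DMZ' zone between depth
    modes (method B).  qa_z is the value q_a(z) of the global
    quantization function for the depth z; quantizers is the ordered
    list of quantization parameters q_k of the block."""
    M = len(quantizers)
    W = (N - (M + 1) * DMZ) // M          # per-mode length once the DMZ is reserved
    # find the mode k_z whose window [q_k, q_k + W - 1] contains qa_z
    for k, qk in enumerate(quantizers):
        if qk <= qa_z < qk + W:
            Qk = (k + 1) * DMZ + k * W    # offset of mode k in the encoded range
            return qa_z - qk + Qk         # encoded value, kept outside any DMZ
    raise ValueError("qa_z is not covered by any depth mode of this block")

# Two-mode block, 10-bit encoder: W = (1024 - 3*64) // 2 = 416
print(quantize_block_depth(1000, [979, 4359]))   # 64 + (1000 - 979) = 85
print(quantize_block_depth(4400, [979, 4359]))   # 128 + 416 + (4400 - 4359) = 585
```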

The result of this per-block quantization function is illustrated in operation 134. The ordinate axis shows the quantized depth values encoded according to the quantization parameter associated with each depth mode (or with each depth range). As clearly appears from the operation 134, some parts DMZ of the encoding values are not used for the encoding of the quantized depth values, these parts being located at the limits of the depth modes (depth ranges). When decoding the encoded quantized depth values, if a decoded value falls within a DMZ part, this value is simply discarded, which avoids the generation of artefacts.

Figure 14 shows a non-limiting example of an embodiment of the syntax of a stream carrying the data representative of the depth of the 3D scene when the data are transmitted over a packet-based transmission protocol. Figure 14 shows an example structure 14 of a video stream. The structure consists of a container which organizes the stream in independent elements of syntax. The structure may comprise a header part 141 which is a set of data common to every syntax element of the stream. For example, the header part comprises metadata about the syntax elements, describing the nature and the role of each of them. The header part may also comprise the coordinates of the viewpoint used for the encoding of the picture 40, 50 and information about the size and the resolution of the picture. The structure comprises a payload comprising a first element of syntax 142 and at least one second element of syntax 143. The first syntax element 142 comprises data representative of the quantization parameters, for example the image 100 and optionally the table 110.

The one or more second syntax elements 143 comprise geometry information, i.e. depth information. The one or more second syntax elements 143 comprise for example quantized depth values that are encoded according to the quantization parameters.

According to a variant, one or more additional second syntax elements 143 comprise the data representative of texture of the picture 30.

According to a further optional variant, the stream further comprises at least one of the following parameters, for example under the form of metadata:

- the DMZ value;

- the parameters required to compute the quantization function, for example a and zmin;

- the number of encoding values N allowed by the encoder bit-depth.

The above parameters, or at least a part of them (e.g. the DMZ value or the parameters of the quantization function), may be transmitted once per GOP.

According to a variant, these parameters are stored at the decoder and are not transmitted.
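By way of illustration only, the elements of syntax of figure 14 and the optional metadata could be gathered in structures such as the following sketch (all field names are assumptions, not a normative syntax):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DepthStreamMetadata:
    """Optional per-GOP parameters carried as metadata (or stored at the
    decoder); names and defaults are illustrative only."""
    dmz: int = 64                    # reserved separation zone
    alpha: float = 0.0               # parameter 'a' of the quantization function
    z_min: float = 0.0               # minimal depth zmin
    n_encoding_values: int = 1024    # N allowed by the encoder bit-depth

@dataclass
class DepthStream:
    header: dict                          # viewpoint coordinates, picture size/resolution
    quantization_map: bytes               # first element of syntax: image 100
    quantization_table: List[int]         # optional table 110
    depth_payloads: List[bytes]           # second elements of syntax: encoded quantized depth
    texture_payloads: List[bytes] = field(default_factory=list)  # variant: texture of picture 30
    metadata: DepthStreamMetadata = field(default_factory=DepthStreamMetadata)
```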

For illustration purposes, in the context of the ISOBMFF file format standard, texture patches, geometry patches and the metadata would typically be referenced in ISOBMFF tracks in a box of type moov, with the texture data and geometry data themselves embedded in media-data boxes of type mdat.

On the decoding side, the set of metadata described hereinabove is retrieved and used to dequantize each block of the received depth atlases. More precisely, for each block (i, j), the set of required quantization parameters (also called quantifiers) is deduced from the quantization table 110 and the quantization map, as well as the associated mode length $W^{(i,j)} = \frac{N - (M(i,j) + 1)\,DMZ}{M(i,j)}$ with $N = 2^D$. Let q be the quantized depth to be dequantized, and let $k_z$, $0 \le k_z < M(i,j)$, be such that $Q_{k_z}^{(i,j)} \le q < Q_{k_z}^{(i,j)} + W^{(i,j)}$ (with $Q_{k_z}^{(i,j)} = (k_z + 1)\,DMZ + k_z\,W^{(i,j)}$ as defined before). Let $q_{k_z}^{(i,j)}$ be the associated quantizer; then the depth value z associated with a first pixel (i, j) of the picture 40 or 50 may be dequantized by $z = z_a\left(q - Q_{k_z}^{(i,j)} + q_{k_z}^{(i,j)}\right)$, $z_a$ being the reciprocal of the quantization function $q_a$.

The latter example corresponds to the method implementing a DMZ. For the method without DMZ, the same formulas apply with DMZ = 0.
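A corresponding decoder-side sketch, under the same assumptions as the quantization sketch above and with a toy reciprocal function standing in for $z_a$ (not the actual function of the application), could be:

```python
def dequantize_block_depth(q, quantizers, z_a, N=1024, DMZ=64):
    """Decoder-side counterpart: recover the depth z from an encoded
    value q using the quantizers of the block and the reciprocal z_a of
    the quantization function.  Values falling inside a DMZ interval
    are discarded (None).  For the method without DMZ, pass DMZ=0."""
    M = len(quantizers)
    W = (N - (M + 1) * DMZ) // M
    for k, qk in enumerate(quantizers):
        Qk = (k + 1) * DMZ + k * W
        if Qk <= q < Qk + W:
            return z_a(q - Qk + qk)   # back to the global quantized scale, then to depth
    return None                        # q lies in a reserved DMZ zone: discard

# Toy reciprocal function for illustration only
z_a = lambda v: 1.0 / (1e-3 * v + 1e-6)
print(dequantize_block_depth(85, [979, 4359], z_a))    # decodes the value produced above
print(dequantize_block_depth(1000, [979, 4359], z_a))  # None: inside the final DMZ
```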

Figure 15 shows an example architecture of a device 15 which may be configured to implement a method described in relation with figures 9, 12, 13, 19 and/or 20. The device 15 may be configured to be an encoder 61 or a decoder 63 of figure 6.

The device 15 comprises the following elements that are linked together by a data and address bus 151:

- a microprocessor 152 (or CPU), which is, for example, a DSP (or Digital Signal Processor);

- a ROM (or Read Only Memory) 153;

- a RAM (or Random Access Memory) 154;

- a storage interface 155;

- an I/O interface 156 for reception of data to transmit, from an application; and

- a power supply, e.g. a battery.

In accordance with an example, the power supply is external to the device. In each of the mentioned memories, the word « register » used in the specification may correspond to an area of small capacity (some bits) or to a very large area (e.g. a whole program or a large amount of received or decoded data). The ROM 153 comprises at least a program and parameters. The ROM 153 may store algorithms and instructions to perform techniques in accordance with the present principles. When switched on, the CPU 152 uploads the program to the RAM and executes the corresponding instructions.

The RAM 154 comprises, in a register, the program executed by the CPU 152 and uploaded after switch-on of the device 15, input data in a register, intermediate data in different states of the method in a register, and other variables used for the execution of the method in a register.

The implementations described herein may be implemented in, for example, a method or a process, an apparatus, a computer program product, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method or a device), the implementation of features discussed may also be implemented in other forms (for example a program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate communication of information between end-users.

In accordance with an example of encoding or an encoder 61 of figure 6, the depth data of the three-dimensional scene is obtained from a source. For example, the source belongs to a set comprising:

- a local memory (153 or 154), e.g. a video memory or a RAM (or Random Access Memory), a flash memory, a ROM (or Read Only Memory), a hard disk;

- a storage interface (155), e.g. an interface with a mass storage, a RAM, a flash memory, a ROM, an optical disc or a magnetic support;

- a communication interface (156), e.g. a wireline interface (for example a bus interface, a wide area network interface, a local area network interface) or a wireless interface (such as an IEEE 802.11 interface or a Bluetooth® interface); and

- a user interface such as a Graphical User Interface enabling a user to input data.

In accordance with examples of the decoding or decoder(s) 63 of figure 6, the stream is sent to a destination; specifically, the destination belongs to a set comprising:

- a local memory (153 or 154), e.g. a video memory or a RAM, a flash memory, a hard disk;

- a storage interface (155), e.g. an interface with a mass storage, a RAM, a flash memory, a ROM, an optical disc or a magnetic support; and

- a communication interface (156), e.g. a wireline interface (for example a bus interface (e.g. USB (or Universal Serial Bus)), a wide area network interface, a local area network interface, a HDMI (High Definition Multimedia Interface) interface) or a wireless interface (such as an IEEE 802.11 interface, a WiFi® or a Bluetooth® interface).

In accordance with examples of encoding or encoder, a bitstream comprising data representative of the depth of the 3D scene is sent to a destination. As an example, the bitstream is stored in a local or remote memory, e.g. a video memory or a RAM, a hard disk. In a variant, the bitstream is sent to a storage interface, e.g. an interface with a mass storage, a flash memory, a ROM, an optical disc or a magnetic support, and/or transmitted over a communication interface, e.g. an interface to a point-to-point link, a communication bus, a point-to-multipoint link or a broadcast network.

In accordance with examples of decoding or decoder or renderer 63 of figure 6, the bitstream is obtained from a source. Exemplarily, the bitstream is read from a local memory, e.g. a video memory, a RAM, a ROM, a flash memory or a hard disk. In a variant, the bitstream is received from a storage interface, e.g. an interface with a mass storage, a RAM, a ROM, a flash memory, an optical disc or a magnetic support, and/or received from a communication interface, e.g. an interface to a point-to-point link, a bus, a point-to-multipoint link or a broadcast network.

In accordance with examples, the device 15 is configured to implement a method described in relation with figures 9, 12, 13, 19 and/or 20, and belongs to a set comprising:

- a mobile device;

- a communication device;

- a game device;

- a tablet (or tablet computer);

- a laptop;

- a still picture camera;

- a video camera;

- an encoding chip;

- a server (e.g. a broadcast server, a video-on-demand server or a web server).

In pictures generated for encoding a 3D scene for 3DoF+ rendering and displaying, it is possible to distinguish between two categories of pixels. Some pixels result from the projection of points belonging to surfaces of objects that cannot be seen with a grazing incidence from anywhere in the viewing bounding box. That is, for every view point comprised in the viewing bounding box and for every view direction from this view point, the angle between the view direction and the normal of the surface of the object at this point is not close to 90°. These pixels are categorized as Non-Contour pixels (NC pixels). Other pixels result from the projection of points belonging to surfaces of objects that can be seen with a grazing incidence from at least one view point of the viewing bounding box. These pixels are categorized as Contour pixels (C pixels). A Boolean flag is associated with every pixel of a picture, the flag being set to 0 (i.e. false) for NC pixels and to 1 (i.e. true) for C pixels.
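A minimal sketch of such a categorization, assuming sampled view points of the viewing bounding box and known surface normals (the angular margin and the function names are assumptions for illustration only), could be:

```python
import math

def is_contour_point(normal, point, viewpoints, grazing_margin_deg=15.0):
    """Flag a scene point as a contour (C) point if, from at least one
    sampled view point of the viewing bounding box, the viewing ray hits
    its surface at close-to-grazing incidence, i.e. the angle between
    the ray and the surface normal is close to 90 degrees.  The margin
    is an assumption chosen for illustration."""
    nx, ny, nz = normal
    nn = math.sqrt(nx * nx + ny * ny + nz * nz)
    for vx, vy, vz in viewpoints:
        rx, ry, rz = point[0] - vx, point[1] - vy, point[2] - vz
        rn = math.sqrt(rx * rx + ry * ry + rz * rz)
        cos_angle = abs(nx * rx + ny * ry + nz * rz) / (nn * rn)
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
        if angle > 90.0 - grazing_margin_deg:   # nearly tangent viewing ray
            return True                         # C pixel (flag = 1)
    return False                                # NC pixel (flag = 0)

# A point on a vertical wall seen almost edge-on from one view point
print(is_contour_point((1, 0, 0), (0, 0, 5), [(0.2, 0, 0), (0, 0, -1)]))  # True
```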

Several methods may be considered for categorizing pixels of a picture as NC or C pixels. It is possible to identify contours, for example by sampling the viewing bounding box, lighting the 3D scene from these sampled view points, and determining the angle of each scene surface with the light rays. It is also possible to integrate all the zones of high depth gradient from all original capturing views, the contour zones being the merge of all of them. It is also possible to integrate all the zones of high depth gradient from all virtual views, remembering that a multiplicity of projections is a typical way to convey 3DoF+ contents. According to the present principles, the floating-point depth values of non-contour pixels are encoded by applying the value to a quantization function defined on a range depending on a first value lower than the bit-depth. For example, the bit-depth may be 10 and the chosen quantization function is 1/z. The possible values to encode the depth belong to the interval [0, 1023]. The first value is for example 9 (<10). In other words, the floating-point depth value of a non-contour pixel will be encoded on 512 values (range [0, 511]), implying a lower precision. However, this loss of precision is not a drawback as non-contour imprecision does not lead to visual artefacts. The contour pixels are quantized using the same quantization function but now defined on a range depending on a second value greater than the bit-depth. For example, the second value is set to 12 (>10). In other words, the floating-point depth value of a contour pixel will be quantized on 4096 values (range [0, 4095]), implying a greater precision. However, a 12-bit value is too big to be stored in a 10-bit value. The description in relation to figures 9, 12 and 13 presents methods to determine an offset value making it possible to store a high precision value within a reduced bit-depth, by associating the offset value with the picture (or the block) to encode.

According to the present principles, the offset value D is determined according to the interval I which comprises every high precision quantized value of the contour pixels, and to the first value V1 lower than the bit-depth. D is determined in such a way that I is included in the interval $[D, D + 2^{V1} - 1]$. So, given a high precision quantized value q on V2 bits, the value $(q - D + 2^{V1})$ may be encoded on the bit-depth and is always greater than or equal to $2^{V1}$.

At the decoding, if the integer depth value of a pixel is lower than $2^{V1}$, then this pixel is a non-contour pixel. Its floating-point depth value is decoded by de-quantizing the integer depth value with the same quantization function defined on the first range $[0, 2^{V1}]$. Otherwise, if the integer depth value d of a pixel is greater than or equal to $2^{V1}$, then this pixel is a contour pixel. Its floating-point depth value is decoded by de-quantizing $(d + D - 2^{V1})$ with the same quantization function but defined on the second range $[0, 2^{V2}]$. D, V1 and V2 are associated with the encoded picture or block, for example in a table and/or in matrices as described in Appendix A. Reserved indices for hybrid encoding may be defined in these metadata.
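As a purely numerical illustration (the interval I = [2600, 3050] and the value q below are invented for the example), with the bit-depth equal to 10, V1 = 9 and V2 = 12:

```python
BIT_DEPTH, V1, V2 = 10, 9, 12
# Suppose the 12-bit quantized contour values of the picture lie in I = [2600, 3050].
# D is chosen so that I is included in [D, D + 2**V1 - 1] = [D, D + 511], e.g. D = 2600.
D = 2600
q = 2900                                  # a high-precision quantized contour value
coded = q - D + 2 ** V1                   # 812: fits the 10-bit range, and >= 512
decoded = coded + D - 2 ** V1             # 2900: recovered before de-quantization on [0, 2**V2]
print(coded, decoded, coded >= 2 ** V1)   # 812 2900 True
```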

The insertion of a reserved interval in the bit-depth range, called DMZ in the present application, may also be performed with the present principles. The intervals described above are then slightly smaller.

According to the present principles, a picture to encode is an entire image or a block of an image. The offset value is determined according to the high precision quantized values of contour pixels of the picture or of every picture of a Group of Pictures (GoP).

Depth Modal Analysis, as described in relation to figures 9, 12 and 13, explains how to define a depth coding mode for each sub-block of the image, in order to cope, for example, with only 10-bit HEVC coding technology being available. It may be adapted to any other bit-depth. Unfortunately, the related depth bit rate may be too high compared to a simple depth coding solution in 1/z, which exhibits reasonable performance in a number of cases.

Depth Modal Analysis describes a per-GoP and per-block (e.g. 8x8) quantization strategy. In each block, a mode for coding the depth is defined, which can be split into a token and a value in a quantization table, and the whole set of tokens on the full frame constitutes a quantization matrix which is coded as an image of reduced size. An analysis of the depth of each block allows to sort the blocks into 1) mono-mode, 2) bi-mode, 3) tri-mode. All the contour zones are from category 2 or 3.

The present application relates to coding the non-contour zone in an economic way and with a reduced bitrate penalty, because artefacts do not have a perceptible impact on it. The contour zone will benefit from a more expensive coding, and 3 different means to improve the coding quality are explained in this application. The intention is to reduce visible artefacts on the object silhouette or “contour”, especially when they are observed dynamically with a grazing incidence. The specificity of the 3DoF+ use case is that the scene may be viewed from any point within the viewing bounding box. The bounding box is the compact and convex volume of displacement of the human head, and therefore of the human eyes, possibly looking in all directions.

This particular setup identifies 2 types of scene surfaces, the non-contour zone or band, and the contour zone or band, which is complementary to the first.

Figure 16 illustrates in 2D that the contour band 161 (in bold black) is the part of the scene surface which may be seen at grazing incidence from at least one point of view of a viewing bounding box 160, and requires a very high quality, artefact-less depth value. In this drawing, the dashed lines represent some grazing incidence rays.

Figure 17 illustrates the contour band 171 on a 3D view, where some peripheral parts of the character's surface belong to the contour band.

It is possible to identify those contours, for example by sampling the bounding box 170, lighting the scene from each sampled point, and computing the angle of each scene surface with the light rays. Or it is possible to integrate all the zones of high depth gradient from all original capturing views, the contour zones being the merge of all of them. Or it is possible to integrate all the zones of high depth gradient from all virtual views, remembering that a multiplicity of projections is a typical way to convey 3DoF+ contents.

A possible way to convey 3DoF+ (or“volumetric”) VR content - for example defined by a multiplicity of Video + Depth input shot from a camera rig or by a point cloud - is to use a multiplicity of virtual cameras to operate iterative projections of this visual stuff. One meaningful configuration is to use a plain view from a central position to project the most important part of the visual stuff and complement with other projections from virtual cameras at different positions, and pack the related patches.

As a result, the video stream is composed of

- A main plain view

- A pack of residual views

An overlapping factor is typically applied between those mappings, so that no artefacts appear at the patch seams. The contour zones are almost exclusively conveyed through these residues, including from the central position considering this overlapping factor.

The present application discloses different 3DoF+ coding parameters, whether the picture (image, patch or block) to encode comprises contour zones or not.

For the contour zones, for which artefacts could be detected at once by the viewer, the depth coding should be of very high quality and will therefore have a penalty in terms of bitrate. However, for the non-contour zones, the coding can be of lower quality, thus reducing the global streaming bitrate.

Those 3DoF+ coding parameters are threefold and can be applied exclusively or inclusively:

- increased resolution of the patches by re-projecting the same points onto a bigger picture (i.e. with a better definition, that is a greater number of pixels in width and/or in height);

- a map of differentiated Quantification Parameters (QP), with low QPs (= high quality) for contours and vice-versa; and/or

- increased cost of depth coding.

Some pictures comprise only contour (respectively only non-contour) pixels. Increasing (respectively decreasing) the resolution and/or decreasing (respectively increasing) the QP are suitable methods for these kinds of pictures.

For hybrid pictures, comprising at the same time contour and non-contour pixels, the present application discloses a method for increasing the quality of the encoding of contour pixels according to the chosen bit-depth by decreasing the quality of the encoding of non-contour pixels.

Figure 18 illustrates the notion of hybrid picture where a character stands in front of a remote background.

Hybrid pictures may be considered over a sequence of frames, as explained in Appendix A: all the depth values in floating point are accumulated over the total number Ngop of frames of the current Group of Pictures (GoP) (typically Ngop = 32) and quantized. Then the quantized depth population qi is analyzed within each spatio-temporal block of size Ngop x 8 x 8. This is illustrated in figure 9, where a spatio-temporal block is shown built from the upper right corner block of a sequence of pictures. Figure 18 shows a hybrid block which partially covers the character's face, and this face part is also composed of contour and non-contour zones. As a result, there are 3 classes of depth pixels which will be coded in depth in the curves on the right (note that number examples are put on the curves and are reused in the next figure):

1. The background depth will be coded in a cheap way according to a quantization function defined on a range determined according to a first value lower than the bit-depth, e.g. on a 9-bit only inv(z) function.

2. The non-contour face pixel depth will also be coded in a cheap way with the same quantization function defined on a range determined according to the same first value lower than the bit-depth, e.g. on 9 bits only, with the same inv(z) function.

3. The contour face pixel depth will be coded in an expensive way according to the same quantization function but defined on a range determined according to a second value greater than the bit-depth, e.g. on a 12-bit inv(z) function.

Then, a mix of those 2 curves (the same quantization function defined on two different ranges) will be done in order to align with the available coding bit-depth, for example 10 bits, 8 bits or 12 bits. One simple way of combining, illustrated in figure 19, is for example to split a 10-bit [0, 1023] range in 2 parts, one part being devoted to the 9-bit inv(z) and the other part being devoted to the coding of the local depth. The latter coding requires:

- the transmission of a table explicating the offset used for this local coding with high precision, each different offset corresponding to a“mode”. Each picture, image or block is associated with the offset value used to shift the quantized value of contour pixels in the second range to the first range;

- The map of those“modes” for the whole image and the whole GoP; and

- A separation zone between the two (low and high precision) ranges, in order to overcome compression depth noise, labelled as “reserved” in figure 19 and named “DMZ” in figure 13.

The benefit of the present principles is to put information - and therefore the bitrate penalty - where it matters the most, typically in only 10% of the images.

Figure 20 illustrates a method 200 of encoding a floating-point depth value of a pixel of a picture according to a bit-depth. The pixel is obtained by projecting a point of a 3D scene onto said picture. At a step 201, the floating-point depth value of a point projected on a pixel of a picture is obtained. This floating-point value is to be quantized to be encoded. A test is performed to determine whether this pixel is a non-contour pixel according to the present principles described above in the present application. If so, step 202 is executed; otherwise, step 204 is performed. At a step 202, knowing that the pixel is a non-contour pixel, the floating-point depth value is applied to a quantization function defined on a range of $2^v$ values, v being an integer lower than the bit-depth used by the encoder to encode depth. At a step 203, the integer result is encoded in the stream using the codec of the encoder.

At a step 204, knowing that the pixel is a contour pixel according to the present principles, the depth has to be encoded with a higher precision. At this step, the floating-point value is quantized using the same quantization function as at step 202, but defined on a range of $2^w$ values, with w being an integer greater than the bit-depth used by the encoder. At a step 205, an offset is determined and associated with the picture as described above in the present application. At a step 206, the quantized integer is encoded in the data stream in association with the value v of step 202 and the determined offset.
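A minimal sketch of this encoding method, with a toy 1/z quantization function standing in for the actual one (all numerical parameters and names below are assumptions for illustration), could be:

```python
def quantize_inv_z(z, n_values, z_min=0.3, z_max=50.0):
    """Toy 1/z quantization of a floating-point depth on n_values levels.
    z_min, z_max and the exact formula are assumptions for illustration,
    not the quantization function of the application."""
    t = (1.0 / z - 1.0 / z_max) / (1.0 / z_min - 1.0 / z_max)
    return min(n_values - 1, max(0, round(t * (n_values - 1))))

def encode_picture_depths(depths, contour_flags, v=9, w=12):
    """Sketch of method 200 over a picture: non-contour depths are
    quantized on 2**v values (steps 202-203); contour depths are
    quantized on 2**w values, then shifted by the offset D determined
    from their interval so that they fit the bit-depth (steps 204-206)."""
    high = [quantize_inv_z(z, 2 ** w) for z, c in zip(depths, contour_flags) if c]
    D = min(high) if high else 0                      # step 205: offset of the picture
    assert not high or max(high) <= D + 2 ** v - 1, "contour interval too wide"
    coded = []
    for z, c in zip(depths, contour_flags):
        if c:
            coded.append(quantize_inv_z(z, 2 ** w) - D + 2 ** v)
        else:
            coded.append(quantize_inv_z(z, 2 ** v))
    return coded, D

coded, D = encode_picture_depths([10.0, 2.0, 2.05], [False, True, True])
print(coded, D)   # [12, 527, 512] with D = 578
```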

Figure 21 illustrates a method 210 of decoding an integer depth value of a pixel of a picture. The integer depth value is encoded according to a bit-depth. The picture is associated with an offset value. At a step 211, an integer value representative of the depth of a point projected onto the pixel is obtained. An offset associated with the picture and the bit-depth used by the encoder are also obtained. These values may be predetermined and/or encoded in the data stream. At a step 212, a test is performed to check whether the integer value belongs to a first range depending on a first value v lower than said bit-depth. The first value v may be obtained from the stream in association with the picture or predetermined. If the integer value is lower than $2^v$, then the pixel is a non-contour pixel and its depth has been encoded on a decreased range. In this case, a step 213 is executed; otherwise, the pixel is a contour pixel and a step 215 is performed.

At step 213, the integer value is applied to a de-quantization function defined on a range depending on the first value v, for instance on $2^v$ values. At a step 214, the floating-point result is used as the depth value for the given pixel.

At a step 215, the floating-point depth value is decoded by de-quantizing the integer value plus the offset minus $2^v$ with the same quantization function but defined on the second range $[0, 2^w]$, where w is an integer greater than the bit-depth. The value w may be obtained from the data stream in association with the picture or predetermined. At a step 216, the floating-point result is used as the depth value for the given pixel.
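The corresponding decoding sketch, reusing the toy quantization parameters and the illustrative offset produced by the encoding sketch above (and therefore carrying the same assumptions), could be:

```python
def dequantize_inv_z(q, n_values, z_min=0.3, z_max=50.0):
    """Reciprocal of the toy 1/z quantization used in the encoding sketch."""
    t = q / (n_values - 1)
    return 1.0 / (t * (1.0 / z_min - 1.0 / z_max) + 1.0 / z_max)

def decode_depth(coded, D, v=9, w=12):
    """Sketch of method 210: values below 2**v are non-contour depths,
    de-quantized on the first range (steps 212-214); the other values are
    contour depths, de-quantized on the second range after removing the
    shift by the offset D associated with the picture (steps 215-216)."""
    if coded < 2 ** v:
        return dequantize_inv_z(coded, 2 ** v)
    return dequantize_inv_z(coded + D - 2 ** v, 2 ** w)

# Values produced by the encoding sketch above (D = 578)
print(round(decode_depth(12, 578), 2))    # about 10.2: coarse non-contour precision
print(round(decode_depth(527, 578), 3))   # about 2.001: fine contour precision
```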

The implementations described herein may be implemented in, for example, a method or a process, an apparatus, a computer program product, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method or a device), the implementation of features discussed may also be implemented in other forms (for example a program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, Smartphones, tablets, computers, mobile phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate communication of information between end-users.

Implementations of the various processes and features described herein may be embodied in a variety of different equipment or applications, particularly, for example, equipment or applications associated with data encoding, data decoding, view generation, texture processing, and other processing of images and related texture information and/or depth information. Examples of such equipment include an encoder, a decoder, a post-processor processing output from a decoder, a pre-processor providing input to an encoder, a video coder, a video decoder, a video codec, a web server, a set-top box, a laptop, a personal computer, a cell phone, a PDA, and other communication devices. As should be clear, the equipment may be mobile and even installed in a mobile vehicle.

Additionally, the methods may be implemented by instructions being performed by a processor, and such instructions (and/or data values produced by an implementation) may be stored on a processor-readable medium such as, for example, an integrated circuit, a software carrier or other storage device such as, for example, a hard disk, a compact diskette (“CD”), an optical disc (such as, for example, a DVD, often referred to as a digital versatile disc or a digital video disc), a random access memory (“RAM”), or a read-only memory (“ROM”). The instructions may form an application program tangibly embodied on a processor-readable medium. Instructions may be, for example, in hardware, firmware, software, or a combination. Instructions may be found in, for example, an operating system, a separate application, or a combination of the two. A processor may be characterized, therefore, as, for example, both a device configured to carry out a process and a device that includes a processor-readable medium (such as a storage device) having instructions for carrying out a process. Further, a processor-readable medium may store, in addition to or in lieu of instructions, data values produced by an implementation.

As will be evident to one of skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry as data the rules for writing or reading the syntax of a described embodiment, or to carry as data the actual syntax-values written by a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.

A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made. For example, elements of different implementations may be combined, supplemented, modified, or removed to produce other implementations. Additionally, one of ordinary skill will understand that other structures and processes may be substituted for those disclosed and the resulting implementations will perform at least substantially the same function(s), in at least substantially the same way(s), to achieve at least substantially the same result(s) as the implementations disclosed. Accordingly, these and other implementations are contemplated by this application.