Title:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND VIDEO DECODING
Document Type and Number:
WIPO Patent Application WO/2022/219230
Kind Code:
A1
Abstract:
The embodiments relate to a method for encoding, comprising receiving inputs relating to three-dimensional content (1110); generating one or more two-dimensional patches from the inputs (1120); determining parameters relating to distribution of reflected light ray in a direction with respect to incident light defining a relation between incoming and outgoing radiances at a three-dimensional point of a surface (1130); determining surface normal direction of a patch (1140); incorporating the determined parameters and information on the determined surface normal direction to metadata (1150); and associating metadata with a coded bitstream (1160). The embodiments also relate to a method for decoding, and to apparatuses for carrying out the methods.

Inventors:
RONDAO ALFACE PATRICE (BE)
NAIK DEEPA (FI)
MALAMAL VADAKITAL VINOD KUMAR (FI)
Application Number:
PCT/FI2022/050084
Publication Date:
October 20, 2022
Filing Date:
February 11, 2022
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H04N13/15; G06T7/40; G06T7/90; G06T15/50; H04N13/161; H04N13/178; H04N19/597
Foreign References:
US20200351484A12020-11-05
Other References:
SALAHIEH, B. ET AL.: "Test Model 8 for MPEG Immersive Video, w20002", 133RD MEETING OF THE MPEG. ISO/IEC JTC 1/SC 29/WG 4, 30 January 2021 (2021-01-30), XP030293031, Retrieved from the Internet [retrieved on 20220515]
RONDAO-ALFACE, P. ET AL.: "Multiple Texture Patches Per Geometry Patch, m55977", 133RD MEETING OF THE MPEG. ISO/IEC JTC1/SC29/WG 04, 12 January 2021 (2021-01-12), XP030290853, Retrieved from the Internet [retrieved on 20220515]
HAN, C. ET AL.: "Frequency Domain Normal Map Filtering. In: SIGGRAPH 2007", ACM, 29 July 2007 (2007-07-29), XP058336086, Retrieved from the Internet [retrieved on 20220515], DOI: 10.1145/1275808.1276412
NAIK, D. ET AL.: "Surface Lightfield Support in Video-based Point Cloud Coding", 2020 IEEE 22ND INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP, 16 December 2020 (2020-12-16), pages 1 - 6, XP055982435, Retrieved from the Internet [retrieved on 20220515], DOI: 10.1109/MMSP48831.2020.9287115
ANONYMOUS: "MPEG-I VIDEO CODING SUBGROUP Information technology-Coded Representation of Immersive Media - Part 12: Immersive Video, w20001", 133RD MEETING OF THE MPEG. ISO/IEC JTC 1/SC 29/WG 04, 30 January 2021 (2021-01-30), Retrieved from the Internet [retrieved on 20220515]
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (FI)
Claims:

1. An apparatus for encoding, comprising:

- means for receiving inputs relating to three-dimensional content;

- means for generating one or more two-dimensional patches from the inputs;

- means for determining parameters relating to distribution of reflected light ray in a direction with respect to incident light defining a relation between incoming and outgoing radiances at a three-dimensional point of a surface;

- means for determining surface normal direction of a patch;

- means for incorporating the determined parameters and information on the determined surface normal direction to metadata; and

- means for associating metadata with a coded bitstream.

2. The apparatus according to claim 1, wherein the inputs comprise at least two input frames from different views.

3. The apparatus according to claim 2, wherein the inputs further comprise depth information and camera extrinsic and/or intrinsic parameters.

4. The apparatus according to claim 2, wherein the inputs comprise surface light field or view dependent point cloud.

5. The apparatus according to any of the claims 1 to 4, wherein parameters and the surface normal direction are determined at a three-dimensional point of a non-diffuse patch.

6. The apparatus according to any of the claims 1 to 5, further comprising means for encoding information on the surface normal direction to a patch description unit (PDU).

7. The apparatus according to any of the claims 1 to 6, wherein parameters relating to distribution of reflected light are determined by von Mises-Fisher (vMF) distribution.

8. The apparatus according to claim 7, further comprising determining the parameters either per point or per patch.

9. An apparatus for decoding, comprising:

- means for receiving an encoded bitstream;

- means for determining a viewing direction;

- means for determining parameters relating to distribution of reflected light ray for a three-dimensional point of a surface;

- means for decoding surface normal direction from a patch direction unit;

- means for computing a distribution for each viewing direction based on the determined viewing direction and the determined parameters; and

- means for using the computed distribution and the surface normal direction to render a patch.

10. The apparatus according to claim 9, wherein parameters relating to distribution of reflected light are determined by von Mises-Fisher (vMF) distribution.

11. A method for encoding, comprising:

- receiving inputs relating to three-dimensional content;

- generating one or more two-dimensional patches from the inputs;

- determining parameters relating to distribution of reflected light ray in a direction with respect to incident light defining a relation between incoming and outgoing radiances at a three-dimensional point of a surface;

- determining surface normal direction of a patch;

- incorporating the determined parameters and information on the determined surface normal direction to metadata; and

- associating metadata with a coded bitstream.

12. A method for decoding, comprising:

- receiving an encoded bitstream;

- determining a viewing direction;

- determining parameters relating to distribution of reflected light ray for a three-dimensional point of a surface;

- decoding surface normal direction from a patch direction unit;

- for each viewing direction, computing a distribution based on the determined viewing direction and the determined parameters; and

- using the computed distribution and the surface normal direction to render a patch.

13. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive inputs relating to three-dimensional content;

- generate one or more two-dimensional patches from the inputs;

- determine parameters relating to distribution of reflected light ray in a direction with respect to incident light defining a relation between incoming and outgoing radiances at a three-dimensional point of a surface;

- determine surface normal direction of a patch;

- incorporate the determined parameters and information on the determined surface normal direction to metadata; and

- associate metadata with a coded bitstream.

14. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:

- receive an encoded bitstream;

- determine a viewing direction;

- determine parameters relating to distribution of reflected light ray for a three-dimensional point of a surface;

- decode surface normal direction from a patch direction unit;

- for each viewing direction, compute a distribution based on the determined viewing direction and the determined parameters; and

- use the computed distribution and the surface normal direction to render a patch.

Description:
A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND VIDEO DECODING

The project leading to this application has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 783162. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Netherlands, Czech Republic, Finland, Spain, Italy.

Technical Field

The present solution generally relates to encoding and decoding of digital volumetric video.

Background

Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with relatively narrow field of view and displayed as a rectangular scene on flat displays. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed).

More recently, new image and video capture devices are available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future.

For volumetric video, a scene may be captured using one or more cameras. The cameras are in different positions and orientations within a scene. One issue to consider is that compared to 2D (two-dimensional) video content, volumetric 3D video content has much more data, so viewing it requires lots of bandwidth (with or without transferring it from a storage location to a viewing device): disk I/O, network traffic, memory bandwidth, GPU (Graphics Processing Unit) upload. Capturing volumetric content also produces a lot of data, particularly when there are multiple capture devices used in parallel.

Summary

The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.

Various aspects include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments are disclosed in the dependent claims.

According to a first aspect, there is provided a method for encoding comprising receiving inputs relating to three-dimensional content; generating one or more two-dimensional patches from the inputs; determining parameters relating to distribution of reflected light ray in a direction with respect to incident light defining a relation between incoming and outgoing radiances at a three- dimensional point of a surface; determining surface normal direction of a patch; incorporating the determined parameters and information on the determined surface normal direction to metadata; and associating metadata with a coded bitstream.

According to a second aspect, there is provided a method for decoding comprising receiving an encoded bitstream; determining a viewing direction; determining parameters relating to distribution of reflected light ray for a three- dimensional point of a surface; decoding surface normal direction from a patch direction unit; for each viewing direction, computing a distribution based on the determined viewing direction and the determined parameters; and using the computed distribution and the surface normal direction to render a patch.

According to a third aspect, there is provided an apparatus for encoding comprising means for receiving inputs relating to three-dimensional content; means for generating one or more two-dimensional patches from the inputs; means for determining parameters relating to distribution of reflected light ray in a direction with respect to incident light defining a relation between incoming and outgoing radiances at a three-dimensional point of a surface; means for determining surface normal direction of a patch; means for incorporating the determined parameters and information on the determined surface normal direction to metadata; and means for associating metadata with a coded bitstream.

According to a fourth aspect, there is provided an apparatus for decoding comprising means for receiving an encoded bitstream; means for determining a viewing direction; means for determining parameters relating to distribution of reflected light ray for a three-dimensional point of a surface; means for decoding surface normal direction from a patch direction unit; means for computing a distribution for each viewing direction based on the determined viewing direction and the determined parameters; and means for using the computed distribution and the surface normal direction to render a patch.

According to a fifth aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive inputs relating to three-dimensional content; generate one or more two-dimensional patches from the inputs; determine parameters relating to distribution of reflected light ray in a direction with respect to incident light defining a relation between incoming and outgoing radiances at a three-dimensional point of a surface; determine surface normal direction of a patch; incorporate the determined parameters and information on the determined surface normal direction to metadata; and associate metadata with a coded bitstream.

According to a sixth aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive an encoded bitstream; determine a viewing direction; determine parameters relating to distribution of reflected light ray for a three-dimensional point of a surface; decode surface normal direction from a patch direction unit; for each viewing direction, compute a distribution based on the determined viewing direction and the determined parameters; and use the computed distribution and the surface normal direction to render a patch.

According to a seventh aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive inputs relating to three-dimensional content; generate one or more two-dimensional patches from the inputs; determine parameters relating to distribution of reflected light ray in a direction with respect to incident light defining a relation between incoming and outgoing radiances at a three-dimensional point of a surface; determine surface normal direction of a patch; incorporate the determined parameters and information on the determined surface normal direction to metadata; and associate metadata with a coded bitstream.

According to an eighth aspect, there is provided a computer program product comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive an encoded bitstream; determine a viewing direction; determine parameters relating to distribution of reflected light ray for a three-dimensional point of a surface; decode surface normal direction from a patch direction unit; for each viewing direction, compute a distribution based on the determined viewing direction and the determined parameters; and use the computed distribution and the surface normal direction to render a patch.

According to an embodiment, the inputs comprise at least two input frames from different views.

According to an embodiment, the inputs further comprise depth information and camera extrinsic and/or intrinsic parameters.

According to an embodiment, the inputs comprise surface light field or view dependent point cloud.

According to an embodiment, parameters and the surface normal direction are determined at a three-dimensional point of a non-diffuse patch.

According to an embodiment, information on the surface normal direction is encoded to a patch description unit (PDU).

According to an embodiment, parameters relating to distribution of reflected light are determined by von Mises-Fisher (vMF) distribution.

According to an embodiment, the parameters are determined either per point or per patch.

According to an embodiment, the computer program product is embodied on a non-transitory computer readable medium.

Description of the Drawings

In the following, various embodiments will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows an encoding process according to an embodiment;

Fig. 2 shows a decoding process according to an embodiment;

Fig. 3 shows an example of a compression process of a volumetric video;

Fig. 4 shows an example of a de-compression process of a volumetric video;

Fig. 5 shows an example of an encoder of the MIV extension of V3C;

Fig. 6 shows a detailed example of the encoding process of Figure 5;

Fig. 7 shows an example of a renderer of the MIV extension of V3C;

Fig. 8 shows examples of vMF distributions with different concentration values;

Fig. 9 shows an example of patch normal direction and viewing camera directions that sample vMF distributions;

Fig. 10 shows an example of patch normal direction and viewing camera direction that sample vMF distributions;

Fig. 11 is a flowchart illustrating a method according to an embodiment; and

Fig. 12 shows an apparatus according to an embodiment.

Description of Example Embodiments

The following description and drawings are illustrative and are not to be construed as unnecessarily limiting. The specific details are provided for a thorough understanding of the disclosure. However, in certain instances, well-known or conventional details are not described in order to avoid obscuring the description. References to one or an embodiment in the present disclosure can be, but not necessarily are, references to the same embodiment, and such references mean at least one of the embodiments. Reference in this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosure.

A video codec comprises an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that can uncompress the compressed video representation back into a viewable form. An encoder may discard some information in the original video sequence in order to represent the video in a more compact form (i.e. at a lower bitrate).

Figure 1 illustrates an encoding process of an image as an example. Figure 1 shows an image to be encoded (In); a predicted representation of an image block (P’n); a prediction error signal (Dn); a reconstructed prediction error signal (D’n); a preliminary reconstructed image (I’n); a final reconstructed image (R’n); a transform (T) and inverse transform (T⁻¹); a quantization (Q) and inverse quantization (Q⁻¹); entropy encoding (E); a reference frame memory (RFM); inter prediction (Pinter); intra prediction (Pintra); mode selection (MS); and filtering (F).

An example of a decoding process is illustrated in Figure 2. Figure 2 illustrates a predicted representation of an image block (P’n); a reconstructed prediction error signal (D’n); a preliminary reconstructed image (I’n); a final reconstructed image (R’n); an inverse transform (T⁻¹); an inverse quantization (Q⁻¹); an entropy decoding (E⁻¹); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).

Volumetric video refers to visual content that may have been captured using one or more cameras. When multiple cameras are in use, the captured footage is synchronized so that the cameras provide different viewpoints to the same world. In contrast to traditional 2D/3D video, volumetric video describes a 3D model of the world where the viewer is free to move and observe different parts of the world.

Volumetric video enables the viewer to move in six degrees of freedom (6DOF): in contrast to common 360° video, where the user has from 2 to 3 degrees of freedom (yaw, pitch, and possibly roll), a volumetric video represents a 3D volume of space rather than a flat image plane. Volumetric video frames contain a large amount of data because they model the contents of a 3D volume instead of just a two-dimensional (2D) plane. Volumetric content, like traditional 2D video content, contains a significant amount of redundancy, both spatially and temporally. Therefore, volumetric content, like traditional 2D video content, can use predictive and entropy coding techniques to reduce the amount of data. Volumetric video can be rendered from synthetic scenes created using 3D content creation software, reconstructed from multi-view video using 3D reconstruction techniques such as structure from motion, or captured with a combination of cameras and depth sensors such as LiDAR (Light Detection and Ranging), for example.

Volumetric video data represents a three-dimensional scene or object and can be used as input for AR (Augmented Reality), VR (Virtual Reality) and MR (Mixed Reality) applications. Such data describes geometry (shape, size, position in 3D space) and respective attributes (e.g. colour, opacity, reflectance, ...), plus any possible temporal changes of the geometry and attributes at given time instances (like frames in 2D video). Volumetric video is either generated from 3D models, i.e. CGI (Computer Generated Imagery), or captured from real-world scenes using a variety of capture solutions, e.g. multi-camera, laser scan, a combination of video and dedicated depth sensors, and more. Also, a combination of CGI and real-world data is possible. Representation formats for such volumetric data comprise, for example, triangle meshes, point clouds, or voxels. Temporal information about the scene can be included in the form of individual capture instances, i.e. “frames” as in 2D video, or by other means, e.g. the position of an object as a function of time.

Because volumetric video describes a 3D scene (or object), such data can be viewed from any viewpoint. Therefore, volumetric video is an important format for any AR, VR, or MR applications, especially for providing 6DOF viewing capabilities.

Increasing computational resources and advances in 3D data acquisition devices have enabled reconstruction of highly detailed volumetric video representations of natural scenes. Infrared, lasers, time-of-flight and structured light are all examples of devices that can be used to construct 3D video data. Representation of the 3D data depends on how the 3D data is used. Dense voxel arrays have been used to represent volumetric medical data. In 3D graphics, polygonal meshes are extensively used. Point clouds, on the other hand, are well suited for applications such as capturing real-world 3D scenes where the topology is not necessarily a 2D manifold. Another way to represent 3D data is to code this 3D data as a set of texture and depth maps, as is the case in multi-view plus depth. Closely related to the techniques used in multi-view plus depth is the use of elevation maps and multi-level surface maps. In dense point clouds or voxel arrays, the reconstructed 3D scene may contain tens or even hundreds of millions of points. If such representations are to be stored or interchanged between entities, then efficient compression becomes essential.

Standard volumetric video representation formats, such as point clouds, meshes and voxels, suffer from poor temporal compression performance. Identifying correspondences for motion compensation in 3D space is an ill-defined problem, as both geometry and the respective attributes may change. For example, temporally successive “frames” do not necessarily have the same number of meshes, points or voxels. Therefore, compression of dynamic 3D scenes is inefficient. 2D-video based approaches for compressing volumetric data, i.e. multiview and depth, have much better compression efficiency, but rarely cover the full scene. Therefore, they provide only limited 6DOF capabilities.

Instead of the above-mentioned approach, a 3D scene, represented as meshes, points, and/or voxels, can be projected onto one, or more, geometries. These geometries are “unfolded” onto 2D planes (two planes per geometry: one for texture, one for depth), which are then encoded using standard 2D video compression technologies. Relevant projection geometry information is transmitted alongside the encoded video files to the decoder. The decoder decodes the video and performs the inverse projection to regenerate the 3D scene in any desired representation format (not necessarily the starting format).

Projecting volumetric models onto 2D planes allows for using standard 2D video coding tools with highly efficient temporal compression. Thus, coding efficiency is increased greatly. Using geometry projections instead of prior-art 2D-video based approaches, i.e. multiview and depth, provides better coverage of the scene (or object). Thus, 6DOF capabilities are improved. Using several geometries for individual objects improves the coverage of the scene further. Furthermore, standard video encoding hardware can be utilized for real-time compression/decompression of the projected planes. The projection and reverse projection steps are of low complexity.

Figure 3 illustrates an overview of an example of a compression process of a volumetric video. Such process may be applied for example in MPEG Point Cloud Coding (PCC). The process starts with an input point cloud frame 301 that is provided for patch generation 302, geometry image generation 304 and texture image generation 305.

The patch generation 302 process aims at decomposing the point cloud into a minimum number of patches with smooth boundaries, while also minimizing the reconstruction error. For patch generation, the normal at every point can be estimated. An initial clustering of the point cloud can then be obtained by associating each point with one of the following six oriented planes, defined by their normals:

- (1.0, 0.0, 0.0),

- (0.0, 1.0, 0.0),

- (0.0, 0.0, 1.0),

- (-1.0, 0.0, 0.0),

- (0.0, -1.0, 0.0), and

- (0.0, 0.0, -1.0)

More precisely, each point may be associated with the plane that has the closest normal (i.e. maximizes the dot product of the point normal and the plane normal).

The initial clustering may then be refined by iteratively updating the cluster index associated with each point based on its normal and the cluster indices of its nearest neighbours. The final step may comprise extracting patches by applying a connected component extraction procedure.
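
As an illustration of the initial clustering step above, the following sketch assigns each point to the projection plane whose normal maximizes the dot product with the point normal. It is a minimal sketch assuming per-point unit normals are already estimated; the function and variable names are illustrative and not taken from any reference implementation.

```python
import numpy as np

# The six oriented projection planes used for the initial clustering.
PLANE_NORMALS = np.array([
    [ 1.0,  0.0,  0.0],
    [ 0.0,  1.0,  0.0],
    [ 0.0,  0.0,  1.0],
    [-1.0,  0.0,  0.0],
    [ 0.0, -1.0,  0.0],
    [ 0.0,  0.0, -1.0],
])

def initial_clustering(point_normals):
    """Assign each point to the plane whose normal maximizes the dot product
    with the point normal. point_normals: (N, 3) array of unit normals.
    Returns an (N,) array of cluster indices in [0, 5]."""
    scores = point_normals @ PLANE_NORMALS.T   # (N, 6) dot products
    return np.argmax(scores, axis=1)

# Example: a normal pointing mostly along -y maps to plane index 4,
# a normal along +z maps to plane index 2.
normals = np.array([[0.1, -0.95, 0.05], [0.0, 0.0, 1.0]])
normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
print(initial_clustering(normals))   # -> [4 2]
```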

Patch info determined at patch generation 302 for the input point cloud frame 301 is delivered to packing process 303, to geometry image generation 304 and to texture image generation 305. The packing process 303 aims at mapping the extracted patches onto a 2D plane, while trying to minimize the unused space, and guaranteeing that every TxT (e.g. 16x16) block of the grid is associated with a unique patch. It should be noticed that T may be a user-defined parameter. Parameter T may be encoded in the bitstream and sent to the decoder. The simple packing strategy used iteratively tries to insert patches into a WxH grid. W and H may be user-defined parameters, which correspond to the resolution of the geometry/texture images that will be encoded. The patch location is determined through an exhaustive search that is performed in raster scan order. The first location that can guarantee an overlapping-free insertion of the patch is selected and the grid cells covered by the patch are marked as used. If no empty space in the current resolution image can fit a patch, then the height H of the grid may be temporarily doubled, and the search is applied again. At the end of the process, H is clipped so as to fit the used grid cells.
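
The following is a hedged sketch of this simple packing strategy, assuming patch footprints are already expressed in TxT block units; the names are illustrative and the refinement details of the reference software are omitted.

```python
import numpy as np

def pack_patches(patch_sizes, w_blocks, h_blocks):
    """Place patches in raster-scan order into a w x h block grid.

    patch_sizes: list of (h, w) patch footprints, in TxT block units.
    Returns (positions, occ): positions[i] = (row, col) of patch i, and occ is
    the final block-occupancy grid, clipped in height to the used rows."""
    occ = np.zeros((h_blocks, w_blocks), dtype=bool)
    positions = []
    for ph, pw in patch_sizes:
        placed = False
        while not placed:
            # Exhaustive raster-scan search for the first overlap-free location.
            for r in range(occ.shape[0] - ph + 1):
                for c in range(occ.shape[1] - pw + 1):
                    if not occ[r:r + ph, c:c + pw].any():
                        occ[r:r + ph, c:c + pw] = True
                        positions.append((r, c))
                        placed = True
                        break
                if placed:
                    break
            if not placed:
                # No room at the current resolution: temporarily double the height.
                occ = np.vstack([occ, np.zeros_like(occ)])
    # Clip the height to the last used row.
    used = np.where(occ.any(axis=1))[0]
    if used.size:
        occ = occ[:used[-1] + 1]
    return positions, occ

positions, grid = pack_patches([(2, 3), (4, 2), (3, 3)], w_blocks=4, h_blocks=4)
print(positions)   # block positions of the patches in raster-scan insertion order
```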

The geometry image generation 304 and the texture image generation 305 are configured to generate geometry images and texture images respectively. The image generation process may exploit the 3D to 2D mapping computed during the packing process to store the geometry and texture of the point cloud as images. In order to better handle the case of multiple points being projected to the same pixel, each patch may be projected onto two images, referred to as layers. For example, let H(u, v) be the set of points of the current patch that get projected to the same pixel (u, v). The first layer, also called the near layer, stores the point of H(u, v) with the lowest depth D0. The second layer, referred to as the far layer, captures the point of H(u, v) with the highest depth within the interval [D0, D0+Δ], where Δ is a user-defined parameter that describes the surface thickness. The generated videos may have the following characteristics:

• Geometry: WxH YUV420-8bit,

• Texture: WxH YUV420-8bit,

It is to be noticed that the geometry video is monochromatic. In addition, the texture generation procedure exploits the reconstructed/smoothed geometry in order to compute the colors to be associated with the re-sampled points.
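
The two-layer projection described above can be illustrated with the following sketch, which groups the projected points of a patch per pixel and keeps the minimum depth for the near layer and the maximum depth within the surface-thickness interval for the far layer. All names are illustrative.

```python
from collections import defaultdict

def build_layers(projected_points, surface_thickness):
    """projected_points: iterable of ((u, v), depth) pairs for one patch.
    Returns (near, far): dicts mapping (u, v) -> depth for the two layers."""
    buckets = defaultdict(list)
    for (u, v), depth in projected_points:
        buckets[(u, v)].append(depth)

    near, far = {}, {}
    for pixel, depths in buckets.items():
        d0 = min(depths)                                   # near layer: lowest depth D0
        in_range = [d for d in depths if d <= d0 + surface_thickness]
        near[pixel] = d0
        far[pixel] = max(in_range)                         # far layer: highest depth in [D0, D0+delta]
    return near, far

near, far = build_layers([((3, 4), 10.0), ((3, 4), 11.5), ((3, 4), 25.0)], surface_thickness=4.0)
print(near[(3, 4)], far[(3, 4)])    # -> 10.0 11.5 (25.0 falls outside the interval)
```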

The geometry images and the texture images may be provided to image padding 307. The image padding 307 may also receive as an input an occupancy map (OM) 306 to be used with the geometry images and texture images. The occupancy map 306 may comprise a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. In other words, the occupancy map (OM) may be a binary image of binary values where the occupied pixels and non-occupied pixels are distinguished and depicted respectively. The occupancy map may alternatively comprise a non-binary image allowing additional information to be stored in it. Therefore, the representative values of the DOM (Deep Occupancy Map) may comprise binary values or other values, for example integer values. It should be noticed that one cell of the 2D grid may produce a pixel during the image generation process. Such an occupancy map may be derived from the packing process 303.

The padding process 307, to which the present embodiments are related, aims at filling the empty space between patches in order to generate a piecewise smooth image suited for video compression. For example, in a simple padding strategy, each block of TxT (e.g. 16x16) pixels is compressed independently. If the block is empty (i.e. unoccupied, i.e. all its pixels belong to empty space), then the pixels of the block are filled by copying either the last row or column of the previous TxT block in raster order. If the block is full (i.e. occupied, i.e. no empty pixels), nothing is done. If the block has both empty and filled pixels (i.e. an edge block), then the empty pixels are iteratively filled with the average value of their non-empty neighbours.
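
A minimal sketch of this simple padding strategy follows, assuming a single-channel image and a boolean occupancy mask of the same size. For brevity, an empty block is filled here from the last pixel of the previous block rather than copying a full row or column; the edge-block fill follows the iterative neighbour averaging described above.

```python
import numpy as np

def pad_blocks(img, occupied, T=16):
    """img: (H, W) single-channel image; occupied: (H, W) boolean mask
    (True = occupied pixel). Returns a padded copy of img."""
    img = img.astype(np.float32).copy()
    H, W = img.shape
    prev = None
    for by in range(0, H, T):
        for bx in range(0, W, T):
            sl = (slice(by, min(by + T, H)), slice(bx, min(bx + T, W)))
            occ = occupied[sl]
            if not occ.any():
                # Empty block: fill from the previous block in raster order
                # (simplified here to its last pixel value).
                if prev is not None:
                    img[sl] = img[prev][-1, -1]
            elif not occ.all():
                # Edge block: iteratively fill empty pixels with the average
                # of their non-empty 4-neighbours.
                blk = img[sl]
                mask = occ.copy()
                while not mask.all():
                    for y, x in zip(*np.where(~mask)):
                        nbrs = [blk[yy, xx]
                                for yy, xx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                                if 0 <= yy < blk.shape[0] and 0 <= xx < blk.shape[1] and mask[yy, xx]]
                        if nbrs:
                            blk[y, x] = float(np.mean(nbrs))
                            mask[y, x] = True
            prev = sl
    return img
```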

The padded geometry images and padded texture images may be provided for video compression 308. The generated images/layers may be stored as video frames and compressed using for example the HM16.16 video codec according to the HM configurations provided as parameters. The video compression 308 also generates reconstructed geometry images to be provided for smoothing 309, wherein a smoothed geometry is determined based on the reconstructed geometry images and patch info from the patch generation 302. The smoothed geometry may be provided to texture image generation 305 to adapt the texture images.

The patch may be associated with auxiliary information being encoded/decoded for each patch as metadata. The auxiliary information may comprise the index of the projection plane, the 2D bounding box, and the 3D location of the patch.

For example, the following metadata may be encoded/decoded for every patch:

- index of the projection plane

o Index 0 for the planes (1.0, 0.0, 0.0) and (-1.0, 0.0, 0.0)

o Index 1 for the planes (0.0, 1.0, 0.0) and (0.0, -1.0, 0.0)

o Index 2 for the planes (0.0, 0.0, 1.0) and (0.0, 0.0, -1.0)

- 2D bounding box (u0, v0, u1, v1)

- 3D location (x0, y0, z0) of the patch, represented in terms of depth δ0, tangential shift s0 and bitangential shift r0. According to the chosen projection planes, (δ0, s0, r0) may be calculated as follows:

o Index 0: δ0 = x0, s0 = z0, r0 = y0

o Index 1: δ0 = y0, s0 = z0, r0 = x0

o Index 2: δ0 = z0, s0 = x0, r0 = y0

Also, mapping information providing for each TxT block its associated patch index may be encoded as follows: - For each TxT block, let L be the ordered list of the indexes of the patches such that their 2D bounding box contains that block. The order in the list is the same as the order used to encode the 2D bounding boxes. L is called the list of candidate patches.

- The empty space between patches is considered as a patch and is assigned the special index 0, which is added to the candidate patches list of all the blocks.

- Let I be the index of the patch to which the current TxT block belongs, and let J be the position of I in L. Instead of explicitly coding the index I, its position J is arithmetically encoded, which leads to better compression efficiency.

The occupancy map consists of a binary map that indicates for each cell of the grid whether it belongs to the empty space or to the point cloud. One cell of the 2D grid produces a pixel during the image generation process.

The occupancy map compression 310 leverages the auxiliary information described in the previous section in order to detect the empty TxT blocks (i.e. blocks with patch index 0). The remaining blocks may be encoded as follows: The occupancy map can be encoded with a precision of B0xB0 blocks, where B0 is a configurable parameter. In order to achieve lossless encoding, B0 may be set to 1. In practice, B0=2 or B0=4 results in visually acceptable results, while significantly reducing the number of bits required to encode the occupancy map.

The compression process may comprise one or more of the following example operations:

• Binary values may be associated with B0xB0 sub-blocks belonging to the same TxT block. A value 1 is associated with a sub-block if it contains at least one non-padded pixel, and 0 otherwise. If a sub-block has a value of 1, it is said to be full; otherwise, it is an empty sub-block.

• If all the sub-blocks of a TxT block are full (i.e., have value 1), the block is said to be full. Otherwise, the block is said to be non-full.

• Binary information may be encoded for each TxT block to indicate whether it is full or not.

• If the block is non-full, extra information indicating the location of the full/empty sub-blocks may be encoded as follows (a code sketch follows after this list):

o Different traversal orders may be defined for the sub-blocks, for example horizontally, vertically, or diagonally starting from the top right or top left corner.

o The encoder chooses one of the traversal orders and may explicitly signal its index in the bitstream.

o The binary values associated with the sub-blocks may be encoded by using a run-length encoding strategy:

- The binary value of the initial sub-block is encoded.

- Continuous runs of 0s and 1s are detected, while following the traversal order selected by the encoder.

- The number of detected runs is encoded.

- The length of each run, except for the last one, is also encoded.
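
A hedged sketch of the sub-block occupancy coding described in the list above: each TxT block is split into B0xB0 sub-blocks, each sub-block is flagged full or empty, the sub-blocks are visited in a chosen traversal order, and the runs are detected. Names and the traversal options shown are illustrative; the actual arithmetic coding of the resulting values is omitted.

```python
import numpy as np

def encode_subblock_runs(block_occupancy, B0=4, traversal="horizontal"):
    """block_occupancy: (T, T) boolean array for one non-full TxT block
    (T assumed divisible by B0). Returns (first_value, num_runs, run_lengths)
    with the length of the last run omitted, mirroring the description above."""
    T = block_occupancy.shape[0]
    n = T // B0
    # A sub-block is full (1) if it contains at least one occupied pixel.
    sub = np.array([[block_occupancy[i*B0:(i+1)*B0, j*B0:(j+1)*B0].any()
                     for j in range(n)] for i in range(n)], dtype=np.uint8)
    order = sub.flatten() if traversal == "horizontal" else sub.T.flatten()

    first = int(order[0])
    runs = []
    current, length = order[0], 0
    for value in order:
        if value == current:
            length += 1
        else:
            runs.append(length)
            current, length = value, 1
    runs.append(length)
    return first, len(runs), runs[:-1]
```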

Figure 4 illustrates an overview of a de-compression process for MPEG Point Cloud Coding (PCC). A de-multiplexer 401 receives a compressed bitstream, and after de-multiplexing, provides compressed texture video and compressed geometry video to video decompression 402. In addition, the de-multiplexer 401 transmits the compressed occupancy map to occupancy map decompression 403. It may also transmit compressed auxiliary patch information to auxiliary patch-info decompression 404. Decompressed geometry video from the video decompression 402 is delivered to geometry reconstruction 405, as are the decompressed occupancy map and decompressed auxiliary patch information. The point cloud geometry reconstruction 405 process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels may be computed by leveraging the auxiliary patch information and the geometry images.

The reconstructed geometry image may be provided for smoothing 406, which aims at alleviating potential discontinuities that may arise at the patch boundaries due to compression artifacts. The implemented approach moves boundary points to the centroid of their nearest neighbours. The smoothed geometry may be transmitted to texture reconstruction 407, which also receives a decompressed texture video from video decompression 402. The texture reconstruction 407 outputs a reconstructed point cloud. The texture values for the texture reconstruction are directly read from the texture images.

The point cloud geometry reconstruction process exploits the occupancy map information in order to detect the non-empty pixels in the geometry/texture images/layers. The 3D positions of the points associated with those pixels are computed by leveraging the auxiliary patch information and the geometry images. More precisely, let P be the point associated with the pixel (u, v), let (δ0, s0, r0) be the 3D location of the patch to which it belongs and (u0, v0, u1, v1) its 2D bounding box. P can be expressed in terms of depth δ(u, v), tangential shift s(u, v) and bi-tangential shift r(u, v) as follows:

δ(u, v) = δ0 + g(u, v)

s(u, v) = s0 - u0 + u

r(u, v) = r0 - v0 + v

where g(u, v) is the luma component of the geometry image.

For the texture reconstruction, the texture values can be directly read from the texture images. The result of the decoding process is a 3D point cloud reconstruction.
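
The reconstruction equations above can be illustrated with the following sketch, which maps every occupied pixel of a patch back to (depth, tangential, bitangential) coordinates; converting these to x/y/z depends on the patch projection plane index and is omitted here. All names are illustrative.

```python
import numpy as np

def reconstruct_patch_points(g, occupancy, d0, s0, r0, u0, v0):
    """g: (H, W) luma of the decoded geometry image over the patch area.
    occupancy: (H, W) boolean map of occupied pixels.
    (d0, s0, r0): 3D location of the patch; (u0, v0): bounding-box origin.
    Returns an (N, 3) array of (depth, tangential, bitangential) coordinates."""
    vs, us = np.nonzero(occupancy)
    depth = d0 + g[vs, us].astype(np.float64)   # delta(u, v) = delta0 + g(u, v)
    tang = s0 - u0 + us                         # s(u, v) = s0 - u0 + u
    bitan = r0 - v0 + vs                        # r(u, v) = r0 - v0 + v
    return np.stack([depth, tang, bitan], axis=1)
```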

Visual volumetric video-based Coding (V3C) relates to a core part shared between ISO/IEC 23090-5 (formerly V-PCC (Video-based Point Cloud Compression)) and ISO/IEC 23090-12 (formerly MIV (MPEG Immersive Video)). V3C will not be issued as a separate document, but as part of ISO/IEC 23090-5 (expected to include clauses 1-8 of the current V-PCC text). ISO/IEC 23090-12 will refer to this common part. ISO/IEC 23090-5 will be renamed to V3C PCC, and ISO/IEC 23090-12 renamed to V3C MIV. MIV relates to the compression of immersive video content, also known as volumetric video, in which a real or virtual 3D scene is captured by multiple real or virtual cameras. MIV enables storage and distribution of immersive video content over existing and future networks, for playback with 6 degrees of freedom (6DoF) of view position and orientation within a limited viewing space and with different fields of view depending on the capture setup.

Figure 5 shows an example of an encoding process in the MIV extension of V3C. The encoding process comprises preparation of source material, per-group encoding, bitstream formatting and video encoding. The source material comprises source views 500, including view parameters, a geometry component, attribute components, and optionally also an entity map. The source material is processed by geometry quality assessment 505; splitting the source into groups 510; synthesizing an inpainted background 515; and view labelling 520. In per-group encoding, the groups are encoded 525, the details of which, according to an example, are given in Figure 6. Bitstream formatting 530 is performed on the parameter set, the view parameters list and the atlas data. Video sub-bitstream encoding 550 is based on raw geometry, attribute and occupancy video data, which, after encoding, are packed 535 and multiplexed 540 with the formatted bitstream 530.

Figure 6 shows a detailed example of the encoding process of Figure 5 (element 525). The process comprises automatic parameter selection 605 and optionally separation into entity layers 610. After these, the encoding comprises pixel pruning 615 and aggregating pruning masks 620. For creating atlas data, the clusters are split 630; patches are packed 635; the patch attribute average value is modified 640; and color correction 645 is optionally performed. To create raw geometry, attribute and occupancy video data, video data is generated 650, geometry is quantized 655 and scaled 660. Optionally, also occupancy is scaled 665.

Figure 7 shows an example of a renderer of the MIV extension of V3C. The rendering is based on a decoded access unit 700. In block to patch filtering, the entity filtering 705 is optionally performed, followed by patch culling 710. The reconstruction process comprises occupancy reconstruction 715 and optional attribute average value restoration 725, which together with the output of patch culling are reconstructed as a pruned view 720. The geometry processes comprise optional geometry scaling 730, optional depth value decoding 735 and optional depth estimation 740. For view synthesis, reconstructed pruned views are unprojected to the global coordinate system 745 and reprojected and merged into a viewport 750. Finally, for viewport filtering, inpainting 755 and view space handling 760 are performed.

The present embodiments are related generally to volumetric video coding and more specifically to MPEG-I Immersive Video (MIV). From a set of geometry (depth) and attribute information (texture, occupancy, normals, transparency, etc.) captured or estimated from a set of sparse input cameras, the goal of MIV is to project regions of the 3D scene into 2D patches, organize these patches into atlases, and provide the required metadata along with coded video bitstreams which would enable a client to synthesize dense views in a so-called “viewing volume”. A viewer can observe the surrounding scene from any position and angle within the viewing volume. The fact that the viewer can freely move and rotate his/her head and observe the correctly rendered scene at any time in a fluid manner is referred to as a “6-DOF” (six degrees of freedom) experience.

Thus, the aim of the present embodiments is to provide view-dependent patches for 6-DOF (Degrees-of-Freedom) experiences supporting specular content.

MIV provides satisfactory results for multiview and depth scenes when the content is diffuse. MIV is based on Depth Image-based Rendering (DIBR), which can address the challenge of correctly rendering motion parallax based on the viewer position and viewing direction within a viewing volume. For non-Lambertian (or non-diffuse) content, the appearance of a 3D surface patch varies with the observer's viewpoint. For example, glossy reflections and the positions of specular highlights on the patch texture depend on the viewer's viewing direction. These specular features are particularly challenging for DIBR as the reflections are baked into the input reference views, and additional modelling would be required to correctly synthesize these reflections in new virtual views. DIBR approaches typically blend these baked-in reflections, causing multiple reflections in virtual views that are not physically correct and significantly degrade the perceived quality.

MIV (MPEG-I Immersive Video) currently has no way to signal view-dependent content, nor per-patch normal directions, that would allow the decoder and renderer to correctly render the patch in an arbitrary viewpoint orientation in a 6-DoF scenario.

Two view synthesis approaches have been designed to better handle specularity on static content. These approaches are not exactly DIBR but they are related to it, e.g. 1) lumigraph and 2) light field rendering, which both use a representation geometry (e.g., a proxy geometry instead of depth maps as in DIBR) and input camera textures.

The solution based on lumigraph (1) uses a proxy geometry instead of depth maps, and encodes viewmaps for each proxy geometry facet. These viewmaps are used to derive blending weights that favour the cameras that are the most aligned with the virtual view. The issues are related to the memory footprint of these viewmaps and to problems of continuity. For rotations, when the blending weights change from one favoured camera view to another, very noticeable transitions are observed, especially if the number of input cameras is low. The approach is also not scalable with the number of input cameras in terms of compression, streaming or rendering.

More recently, a solution related to light field (2) based rendering can represent specular effects. This is done by creating a very dense set of input views (thousands of views compared to 15 to 25 in MIV) that must be arranged on a sphere, and a proxy mesh is generated for all these views. Because the inter-camera distances are small and the viewing volume inside the sphere of cameras is small as well, specular features can be rendered with high fidelity by blending views with a disk-based approach that always favours the camera that is the most aligned with the virtual view to synthesize. The issue with this approach is the amount of input views, which is two orders of magnitude larger than for MIV, and which severely limits compression and streaming experiences. Using less dense input cameras causes blending artifacts with multiple noticeable reflections.

In a different area, in the context of real-time rendering of synthetic content with ray tracing, the von Mises-Fisher (vMF) distribution was exploited to approximate the surface Bidirectional Reflectance Distribution Function (BRDF) in order to accelerate the rendering of specular highlights and inter-object reflections. In these works, the vMF representation is demonstrated to provide approximations with faster processing when the light sources are known, as well as the geometry of the scene and the surface BRDFs.

The present embodiments propose adding a level of modelling. By evaluating the Bidirectional Reflectance Distribution Functions (BRDFs) of the points or of the patches within MIV, a non-linear blending of input views can be produced that leads to physically correct reflections. This implies storing patch-level normal directions and patch-level BRDF representation parameters.

The present embodiments can be applied to extend MIV to view dependent content as follows:

At the encoder:

- evaluate parameters, for example a BRDF, defining the relation between the incoming and outgoing radiances at a given point P on one or more surface patches. The parameters can be estimated e.g. by using the von Mises-Fisher distribution;

- encode the vMF parameters or other parameters relating to the BRDF for each non-diffuse patch;

- additionally, for each non-diffuse patch, encode the surface normal direction of the patch to the patch description unit (PDU). This surface normal direction can be estimated in many ways, for example as the average of the patch vertex or face normals;

- incorporate the vMF parameters or other parameters relating to the BRDF into metadata.

At the decoder:

- receiving an encoded bitstream;

- determining a viewing direction;

- determining vMF parameters or other parameters relating to the BRDF from metadata associated with the encoded bitstream;

- for each viewing direction computing a distribution based on the viewing direction and the determined vMF parameters or other parameters relating to BRDF;

- determine surface normal direction from a patch description unit;

- using the computed distribution as a weight, together with the determined surface normal direction, to render a specular patch.

In the previous embodiment, a BRDF is used as an example. The BRDF describes the distribution of reflected light rays in a particular direction with respect to the incident light. In the present embodiments, the light sources and BRDFs are unknown in the general context. Therefore, the estimation of a BRDF approximation may be carried out by fitting, for example, a vMF distribution, i.e. by addressing an inverse problem. According to an embodiment, the glossy BRDF is treated at a surface point as a directional distribution modelled with a spherical vMF distribution. The vMF is an isotropic distribution for directional data and statistics, and a generalization of the von Mises distribution to higher dimensions.

Compared to previous works in view synthesis, the present embodiments have the benefit of including more advanced modelling, which enables performing the non-linear view blending that is necessary to handle specular content. Indeed, approaches from the state of the art use linear blending of views where reflections are baked in, resulting in multiple reflections that are not physically correct. The use of a BRDF representation enables adapting the light intensity from input views in a physically correct manner that also leads to better visual quality.

The present embodiments are discussed in a more detailed manner in the following. A “point” used in this disclosure refers to a three-dimensional (3D) point in a space (as in point clouds), as well as to its projection in camera views (a pixel in multiview and depth).

In order to allow a client to synthesize reflections from sparse input views in a given viewing space, a BRDF approximation may be encoded at patch level with a vMF distribution, as well as the surface normal direction of the patch. Based on these metadata, the client can perform a per-pixel blending with non-linearly estimated weights. Various embodiments for MIV and for video point cloud coding in V3C (Visual Volumetric Video-based Coding) are proposed.

Usage of the vMF has the following advantages. For example, the vMF is a good fit for spherical data. The light distribution can be modelled using a vMF. The vMF parameters can be estimated with simple non-biased estimators. In addition, the BRDF of a point on a surface can be modelled using a vMF or a mixture of vMFs.

The probability density function of the vMF distribution for a random p-dimensional unit vector x is given by

f_p(x; μ, κ) = C_p(κ) exp(κ μ^T x) (Equation 1)

where κ >= 0 and ||μ|| = 1. The normalization constant C_p(κ) for p = 3 is given by

C_3(κ) = κ / (4π sinh(κ))

The mean direction (μ) and the concentration (κ) are estimated as follows, where the x_i are the un-normalized directions of the observations (in Equation 2, x_i and μ are vectors):

μ = Σ_i x_i / ||Σ_i x_i|| (Equation 2)

The approximation of the concentration parameter is given by (in Equation 3, κ and R̄ are scalars, R̄ being the mean resultant length ||Σ_i x_i|| / n):

κ ≈ R̄ (3 - R̄²) / (1 - R̄²) (Equation 3)

A higher value of the concentration (κ) indicates that the point is more specular, and a lower value of the concentration (κ) indicates that the point belongs to a diffuse surface, as shown in Figure 8, which gives examples of vMF distributions with different concentration (κ) values, from diffuse to increasingly specular. The arrows 801 in each example illustrate the mean direction (μ).
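
A minimal sketch of Equations 1 to 3, assuming the observations are direction vectors (for example, per-camera viewing directions associated with the observed luma, as discussed later); the directions are normalized internally so that the mean resultant length lies in [0, 1]. Function names are illustrative.

```python
import numpy as np

def estimate_vmf(directions):
    """directions: (n, 3) array of observation directions (Equations 2 and 3).
    Returns (mu, kappa): the mean direction (unit vector) and concentration."""
    x = np.asarray(directions, dtype=np.float64)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)     # normalize observations
    s = x.sum(axis=0)
    mu = s / np.linalg.norm(s)                           # Equation 2
    r_bar = np.linalg.norm(s) / len(x)                   # mean resultant length
    kappa = r_bar * (3.0 - r_bar**2) / (1.0 - r_bar**2)  # Equation 3
    return mu, kappa

def vmf_pdf(x, mu, kappa):
    """Equation 1 for p = 3: C_3(kappa) * exp(kappa * mu . x)."""
    c3 = kappa / (4.0 * np.pi * np.sinh(kappa))
    return c3 * np.exp(kappa * float(np.dot(mu, x)))

# Tightly clustered directions give a high concentration (specular-like point).
dirs = np.array([[0.0, 0.1, 0.99], [0.05, 0.0, 1.0], [0.0, -0.05, 0.98]])
mu, kappa = estimate_vmf(dirs)
print(mu, kappa)
```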

In the context of view synthesis in MIV, vMF parameter evaluation can be performed either per point (or projected point as a pixel) or per patch, based on the 3D location of the point and its colours observed from the multiple input camera texture and depth values. In each camera view, the colour of the point is the observation of how the light in the scene has been reflected by that point in the viewing direction of the camera.

Depending on the scene lighting, and on the type of material to which it belongs, a point or patch BRDF can be represented with a

- single lobe (i.e., one vMF representation) or with

- multiple lobes per patch/per point (a mixture of several vMF representations).

Multiple lobes can be due to multiple reflections, such as inter-reflections in a scene. In the multiple lobes scenario, additional merging of the lobes may be necessary to simplify the representation and its estimation.

The parameter evaluation, either per point or per patch, for a single lobe or multiple lobes, can be performed

• based on light sources when available (i.e., for ray-tracing scenarios for synthetic content); or

• based on the observed light and color intensity from multiple cameras (i.e., solving an inverse problem from real captured data); or

• based on methods that use an estimation of one or several of the following: light sources, scene geometry (when depth is not available), surface material, etc. (in the more general case).

The vMF distribution, and the BRDF in general, can be defined in a point or patch local coordinate system, where the z-axis is the point or patch normal direction, and the x-axis and y-axis are defined and chosen on the point or patch tangential plane. The vMF parameters can also be stored in the world coordinates of the scene, but processing the representation requires knowledge of the point or patch normal direction in any case.

In the following, several embodiments for fitting the vMF or mixture of vMFs distribution to a point or to a patch are described. In addition, it is described how to evaluate the most relevant colors as reference for blending and to use the vMF parameters to perform high-quality blending for specular surfaces.

Per patch parameter estimation: Patch parameter estimation is described below with reference to MIV and to point cloud coding, which are alternatives to each other.

A) MIV

According to an embodiment, an MIV encoder receives multiple camera views with depth, as well as camera extrinsic and intrinsic information. The intrinsic and extrinsic parameters may comprise one or more of the following: a focal length, an image sensor format, a principal point, an orientation of the camera, i.e., a position of the camera center and the direction of the camera.

The encoder is configured to compare the received views to each other in order to classify points into diffuse or specular. This may be done by comparing each pair of views and identifying points which are unprojected to the same 3D location up to some threshold and which exhibit different colours and light intensities in the various input cameras where they are visible. These points are then classified as specular. If these points share the same colour or light intensity in all views up to a threshold, they are classified as diffuse. For each point, a normal direction is also estimated and stored per view. Points that are classified as specular are then aggregated into masks and merged into specular patches for each view.
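
A hedged sketch of this classification, reduced to a single point that has already been matched across views; the luma threshold is an illustrative placeholder, not a value taken from the embodiments.

```python
import numpy as np

def classify_point(lumas_per_view, luma_threshold=8.0):
    """lumas_per_view: 1-D array of the point's luma value as observed from
    each input camera in which it is visible (after unprojection to the same
    3D location up to a threshold). Returns 'specular' or 'diffuse'."""
    lumas = np.asarray(lumas_per_view, dtype=np.float32)
    # Different colour/light intensity across views -> specular point.
    if lumas.max() - lumas.min() > luma_threshold:
        return "specular"
    return "diffuse"

print(classify_point([120, 122, 119]))   # -> diffuse
print(classify_point([120, 180, 140]))   # -> specular
```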

Point normal directions are aggregated per mask and per patch as well. vMF parameters can now be estimated on each specular patch using Equation 2 and Equation 3 by considering the luma intensity of the patch in each camera direction with respect to the patch normal.

The MIV encoder signals in the pdu_miv_extension both the patch normal direction and the vMF parameters inside the PDU (patch data unit).

B) Point Cloud Coding

According to another embodiment, an encoder receives a surface light field (SLF) / view-dependent point cloud, that is, a point cloud with more than a single colour attribute n (n>1), representing different viewing angles. Different activity regions are analyzed at the encoder, for example “medium”, “high” and “low”. Comparing luma values is one of the options, where the encoder may compare the luma values Y of a 3D point p(X, Y, Z) observed from all the available cameras C0, C1, ..., Cn. Based on this analysis, each point in a point cloud is classified into one of three different activity areas (low, medium and high).

The V-PCC encoder adds an extra constraint so that each patch is created from points belonging to the same activity region and no patch is created from points belonging to different activity regions, i.e., every patch is of either high, medium or low SLF activity. vMF parameters are evaluated on the segmented patches as defined in Equation 2 and Equation 3.
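
The activity classification can be sketched as follows, assuming the per-point luma values observed from cameras C0...Cn are available; the thresholds are illustrative placeholders.

```python
import numpy as np

def slf_activity(luma_per_camera, low_thr=4.0, high_thr=16.0):
    """Classify a point into 'low', 'medium' or 'high' SLF activity from the
    spread of its luma values Y across the available cameras C0..Cn."""
    lumas = np.asarray(luma_per_camera, dtype=np.float32)
    spread = float(lumas.max() - lumas.min())
    if spread < low_thr:
        return "low"
    if spread < high_thr:
        return "medium"
    return "high"

# Patches are then grown only from points that share the same activity label.
print(slf_activity([100, 101, 102]))   # -> low
print(slf_activity([100, 130, 160]))   # -> high
```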

Per patch colour selection: Any number of colours can be selected per patch. For simplicity, in this solution only two colours are evaluated. Colour selection can be

o the maximum colour cMax and minimum colour cMin per patch, or

o other colour selection criteria, such as criteria based on reconstruction and error minimization.

Packing and Encoding colours: Selected colours are packed as a video channel and encoded.

Signalling the parameters:

1. vMF parameters signalling:

A patch data unit syntax for signalling vMF parameters, or other parameters relating to BRDF, is given in the following:

Here, vde_flag is a flag needed to specify that view-dependent patches are enabled. If vde_flag is larger than zero, then the following is signalled: pdu_patchAttCount represents the number of attributes per patch, e.g. 3 in the case of one vMF lobe and multiples of 3 based on the number of vMFs signalled per patch. pdu_AttId(i) lists the indices of the attributes and pdu_AttValue(i) the values corresponding to each index.

According to an embodiment, the vMF median direction is signalled as mu_X, mu_Y and mu_Z coordinates, and pdu_patchAttCount is set to three in the case of a single lobe. According to another embodiment, the vMF concentration is also signalled as a float value kappa; in that case pdu_patchAttCount is set to four and the pdu_AttId(i) are set to mu_X, mu_Y, mu_Z and kappa.
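
The normative syntax table is not reproduced above; purely as an illustration of how the described fields relate to each other, assembling the signalled values might look as in the sketch below (the dictionary layout is an assumption, not the bitstream syntax):

    def build_vde_patch_metadata(mu, kappa=None):
        # mu: vMF direction as (mu_X, mu_Y, mu_Z); kappa: optional concentration
        att_ids = ["mu_X", "mu_Y", "mu_Z"]
        att_values = [float(mu[0]), float(mu[1]), float(mu[2])]
        if kappa is not None:                   # second embodiment: also signal kappa
            att_ids.append("kappa")
            att_values.append(float(kappa))
        return {
            "vde_flag": 1,                      # view-dependent patches enabled
            "pdu_patchAttCount": len(att_ids),  # 3 for one lobe, 4 with kappa
            "pdu_AttId": att_ids,
            "pdu_AttValue": att_values,
        }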

Signalling the Patch Normals: V3C enables the encoding of surface normals as attributes, but only one normal direction per patch is needed. Signalling one normal per patch, instead of one normal per patch pixel in the atlas, leads to a significant bitrate reduction and avoids the use of an additional decoder for normals. In one embodiment, the patch normal direction is added in the patch data unit MIV extension. vde_flag is a flag that specifies whether view-dependent patches are enabled. If vde_flag is larger than zero, then three coordinates are encoded in pdu_normal[tileID][p][c] for c in [0, 3), corresponding to the OMAF (Omnidirectional MediA Format) x, y and z coordinates respectively. The following may also be signalled if vde_flag is larger than zero: pdu_patchAttCount, representing the number of attributes per patch, e.g. 3 in the case of one vMF lobe and a multiple of 3 depending on the number of vMF lobes signalled per patch; pdu_AttId(i), listing the indices of the attributes; and pdu_AttValue(i), listing the values corresponding to each index.

Patch normal signalling is represented in the following syntax:

According to another embodiment, the BRDF parameters mu_X, mu_Y and mu_Z can be encoded differentially with respect to the patch normal direction.
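
One simple way to realise such differential encoding is a component-wise residual against the signalled patch normal, as in the sketch below (this particular residual form is an assumption for illustration):

    import numpy as np

    def encode_mu_residual(mu, patch_normal):
        # Encode the vMF direction as a residual against the patch normal.
        return np.asarray(mu, dtype=np.float64) - np.asarray(patch_normal, dtype=np.float64)

    def decode_mu_residual(residual, patch_normal):
        # Add the residual back to the patch normal and re-normalise to a unit vector.
        mu = np.asarray(patch_normal, dtype=np.float64) + np.asarray(residual, dtype=np.float64)
        return mu / np.linalg.norm(mu)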

Rendering Process for specular (i.e. non-diffuse) patches

According to an embodiment, in an extension of MIV, only the texture processing is modified in the rendering process of a specular patch whose normal direction and BRDF parameters are signalled as described above. For a given virtual view in which the specular patch is visible, a ray is cast from the virtual camera towards the patch centre. This direction is used to sample the patch vMF distribution obtained from the signalled parameters. The resulting value at that direction is used to weight the luma of the closest input camera texture of that specular patch. Alternatively, the resulting value at that direction is used to weight the luma values of some or all of the available colours in all available specular patches that represent the same surface.

The resulting yuv colour for a pixel p in a virtual view in viewing direction θ with respect to the point normal n_p, as illustrated in Figure 9, is given by Equation 4, where θ' is the viewing direction of the input view that is most aligned with θ, and where yuv_p(θ') is the colour of the corresponding patch whose view_id camera extrinsics (obtained from AtlasPatchProjectionID[patchIdx]) are most aligned with the virtual view direction.

Alternatively, yuv_p(θ) is obtained as the weighted sum of the k most aligned views, with weights derived from the ratio between the vMF distribution intensity at the virtual view direction and the vMF distribution intensities at those k most aligned orientations, respectively. Other strategies for deriving weights from the vMF value, or from a mixture of vMF values, are possible, for example taking into account only the respective views of the minimum and maximum luma intensities that are available in the bitstream (i.e. when several patches from which pixel p can be reconstructed are encoded).
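
A sketch of this weighting strategy is given below, using a numerically stable form of the vMF density; the selection of the k most aligned views and the normalisation of the weights are illustrative choices, not the normative procedure:

    import numpy as np

    def vmf_density(direction, mu, kappa):
        # vMF density on the unit sphere, rewritten to avoid overflow for large kappa:
        # kappa / (4*pi*sinh(kappa)) * exp(kappa * mu.d)
        #   = kappa * exp(kappa * (mu.d - 1)) / (2*pi*(1 - exp(-2*kappa)))
        d = float(np.dot(direction, mu))
        return kappa * np.exp(kappa * (d - 1.0)) / (2.0 * np.pi * (1.0 - np.exp(-2.0 * kappa)))

    def render_pixel(theta, view_dirs, view_colours, mu, kappa, k=2):
        # Blend the k input views most aligned with the virtual direction theta,
        # weighted by the ratio of vMF intensities at theta and at each input view.
        align = np.array([np.dot(theta, v) for v in view_dirs])
        idx = np.argsort(align)[-k:]
        f_theta = vmf_density(theta, mu, kappa)
        w = np.array([f_theta / vmf_density(view_dirs[i], mu, kappa) for i in idx])
        w /= w.sum()
        return sum(wi * np.asarray(view_colours[i], dtype=np.float64) for wi, i in zip(w, idx))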

According to another embodiment, in the case of highly specular surfaces, a further extension is needed. An environment map can be provided in the bitstream as a patch or a set of patches, as illustrated in Figure 10. Environment maps can be parameterized as a cube map, a sphere map, etc., which can be represented by patches. They can also be estimated from an MIV scene by computing a background representation with patches.

In the case of a glossy material, environment maps are typically filtered to approximate the convolution with the BRDF lobes. This may not be more effective than the cosine-derived weights described above. When the patch material behaves like a mirror, blending cannot work and reflections need to be remapped. This can be done thanks to the patch normal direction, by casting rays from the position of the view v towards the environment map following the reflection angle with respect to the normal.

The signalled vMF distribution of a specular patch can also be used to filter the environment map texture that is reflected on the patch in a virtual view when the vMF concentration is not maximal (glossy reflection).

Figure 10 illustrates that, in the case of a highly glossy or specular patch, the appearance of the patch changes drastically between viewing directions. Thus, it is necessary to estimate reflections by casting rays and by estimating intersections with an estimated environment map. The vMF distribution may then be used to filter (blur) the reflected colours from the environment map.
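
For the mirror-like case, the reflected lookup direction can be computed from the patch normal as in this sketch (the environment-map sampling and the vMF-based blurring are omitted):

    import numpy as np

    def reflect(view_dir, normal):
        # view_dir: unit vector from the virtual camera towards the patch centre
        # normal:   unit patch normal signalled in the patch data unit
        d = np.asarray(view_dir, dtype=np.float64)
        n = np.asarray(normal, dtype=np.float64)
        return d - 2.0 * np.dot(d, n) * n   # direction used to sample the environment map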

Reconstruction:

According to another embodiment, in the case of a V-PCC or V3C point cloud decoder, the decoder receives a V-PCC bitstream with two attribute video streams, as well as the necessary patch data unit information and the vMF parameters required for reconstruction. The reconstruction may be performed on a per-patch basis.

The distribution for the camera direction to be reconstructed is evaluated using the vMF parameters as defined in Equation 1. This distribution is used as a weight and, together with the decoded colours (cMax and cMin), it may then be used to reconstruct the colour as shown in Equation 5 below.

Interpolation of the colour using cMax and cMin, based on the vMF density f:

c = cMax * f + cMin * (1 - f)    (Equation 5)
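
Equation 1 is not reproduced in this excerpt; the sketch below assumes the vMF term is normalised so that f equals 1 when the viewing direction coincides with the vMF direction, and then applies Equation 5:

    import numpy as np

    def reconstruct_colour(view_dir, mu, kappa, c_max, c_min):
        # f peaks at 1 when view_dir == mu and decays towards 0 for opposite directions
        f = np.exp(kappa * (float(np.dot(view_dir, mu)) - 1.0))
        return f * np.asarray(c_max, dtype=np.float64) + (1.0 - f) * np.asarray(c_min, dtype=np.float64)

    # Example: c = cMax * f + cMin * (1 - f) for a view 30 degrees away from mu
    mu = np.array([0.0, 0.0, 1.0])
    view = np.array([np.sin(np.radians(30)), 0.0, np.cos(np.radians(30))])
    print(reconstruct_colour(view, mu, kappa=8.0, c_max=[235, 128, 128], c_min=[16, 128, 128]))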

It should be noted that other approaches to derive blending weights for the interpolated colour from the vMF density f are possible in the rendering process.

The method according to an embodiment is shown in Figure 11. The method generally comprises receiving inputs relating to three-dimensional content 1110; generating one or more two-dimensional patches from the inputs 1120; determining parameters relating to distribution of reflected light ray in a direction with respect to incident light defining a relation between incoming and outgoing radiances at a three-dimensional point of a surface 1130; determining surface normal direction of a patch 1140; incorporating the determined parameters and information on the determined surface normal direction to metadata 1150; and associating metadata with a coded bitstream 1160. Each of the steps can be implemented by a respective module of a computer system.

An apparatus according to an embodiment comprises means for receiving inputs relating to three-dimensional content; means for generating one or more two-dimensional patches from the inputs; means for determining parameters relating to distribution of reflected light ray in a direction with respect to incident light defining a relation between incoming and outgoing radiances at a three-dimensional point of a surface; means for determining surface normal direction of a patch; means for incorporating the determined parameters and information on the determined surface normal direction to metadata; and means for associating metadata with a coded bitstream. The means comprise at least one processor and a memory including computer program code, wherein the processor may further comprise processor circuitry. The memory and the computer program code are configured to, with the at least one processor, cause the apparatus to perform the method of Figure 11 according to various embodiments.

An apparatus according to an embodiment is illustrated in Figure 12. The apparatus is a user equipment for the purposes of the present embodiments. The apparatus 90 comprises a main processing unit 91, a memory 92, a user interface 94 and a communication interface 93. The apparatus according to an embodiment, shown in Figure 12, may also comprise a camera module 95. Alternatively, the apparatus may be configured to receive image and/or video data from an external camera device over a communication network. The memory 92 stores data, including computer program code, in the apparatus 90. The computer program code is configured to implement the method according to various embodiments by means of various computer modules. The camera module 95 or the communication interface 93 receives data, in the form of images or a video stream, to be processed by the processor 91. The communication interface 93 forwards the processed data, i.e. the image file, for example to a display of another device, such as a virtual reality headset. When the apparatus 90 is a video source comprising the camera module 95, user inputs may be received from the user interface.

An example of a device for content consumption, i.e. an apparatus according to another embodiment, is a virtual reality headset, such as a head-mounted display (HMD) for stereo viewing. The head-mounted display may comprise two screen sections or two screens for displaying the left and right eye images. The displays are close to the eyes, and therefore lenses are used to make the images easily viewable and to spread the images so that they cover as much of the eyes' field of view as possible. The device is attached to the head of the user so that it stays in place even when the user turns his head. The device may have an orientation detecting module for determining the head movements and the direction of the head. The head-mounted display is able to show omnidirectional content (i.e., 3DoF/6DoF content) of the recorded/streamed image file to a user.

The various embodiments can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the method. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of various embodiments.

A computer program product according to an embodiment can be embodied on a non-transitory computer readable medium. According to another embodiment, the computer program product can be downloaded over a network in a data packet. If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.