
Title:
STEREOSCOPIC HIGH DYNAMIC RANGE VIDEO
Document Type and Number:
WIPO Patent Application WO/2023/215108
Kind Code:
A1
Abstract:
Methods and systems for stereoscopic 3D video are described. Input HDR stereoscopic views in a first codeword representation are merged by a first merging function into an input merged view in order to optimize a reshaping operation, which generates a reshaped merged view in a second codeword representation and associated composer metadata. The reshaped merged view may be split and re-merged by a second frame packing function to optimize video encoding efficiency of an output coded bitstream based on the reshaped merged view. In a decoder, after extracting the reshaped merged view from the coded bitstream, a composer function applies the composer metadata to the decoded reshaped merged view to generate an output merged view in the first codeword representation, from which output HDR stereoscopic views are generated.

Inventors:
HUSAK WALTER J (US)
YIN PENG (US)
SU GUAN-MING (US)
ATKINS ROBIN (US)
Application Number:
PCT/US2023/019111
Publication Date:
November 09, 2023
Filing Date:
April 19, 2023
Assignee:
DOLBY LABORATORIES LICENSING CORP (US)
International Classes:
H04N19/30; H04N19/597; H04N19/59; H04N19/85
Domestic Patent References:
WO2014041355A1 (2014-03-20)
WO2021216607A1 (2021-10-28)
Foreign References:
EP3176749A2 (2017-06-07)
US10863182B2 (2020-12-08)
US10419762B2 (2019-09-17)
US10032262B2 (2018-07-24)
US10701375B2 (2020-06-30)
US10264287B2 (2019-04-16)
US11277627B2 (2022-03-15)
US20220046245A1 (2022-02-10)
US202217630901A
Other References:
OLIVIER Y ET AL: "HDR CE2-related: some experiments on ETM with dual grading input", vol. JCTVC-W0089, no. JCTVC-W0089, 15 February 2016 (2016-02-15), pages 1 - 15, XP030117867, Retrieved from the Internet
XIU (INTERDIGITAL) X ET AL: "Description of SDR, HDR and 360° video coding technology proposal by InterDigital Communications and Dolby Laboratories", no. m42388, 11 April 2018 (2018-04-11), XP030261492, Retrieved from the Internet [retrieved on 20180411]
BORDES P ET AL: "Description of SDR, HDR and 360° video coding technology proposal by Qualcomm and Technicolor - medium complexity version", 10. JVET MEETING; 10-4-2018 - 20-4-2018; SAN DIEGO; (THE JOINT VIDEO EXPLORATION TEAM OF ISO/IEC JTC1/SC29/WG11 AND ITU-T SG.16); URL: HTTP://PHENIX.INT-EVRY.FR/JVET/, no. JVET-J0022-v3, 12 April 2018 (2018-04-12), XP030151186
"High efficiency video coding", ITU-T REC. H.265, August 2021 (2021-08-01)
Y. YE: "Recent trends and challenges in 360-degree video compression", 2018, ICME
Attorney, Agent or Firm:
KONSTANTINIDES, Konstantinos et al. (US)
Claims:
CLAIMS

What is claimed is:

1. A method for encoding stereoscopic high-dynamic range (HDR) video, the method comprising: receiving a first view and a second view of a scene in a first codeword representation; applying a merging function to merge the first view and the second view into an input merged view in the first codeword representation; applying a reshaping process to the input merged view to generate a reshaped merged view in a second codeword representation and composer metadata, wherein the composer metadata allow a composer function operating on the reshaped merged view to generate an output approximating the input merged view in the first codeword representation; applying a split function to the reshaped merged view to generate a first reshaped view and a second reshaped view in the second codeword representation; encoding the first reshaped view and the second reshaped view to generate a coded bitstream; and combining the coded bitstream and the composer metadata to generate a coded output.

2. The method of claim 1, wherein the input merged view comprises a side-by-side merging of the first view and the second view or a top-and-bottom merging of the first view and the second view.

3. The method of claim 2, wherein side-by-side merging is applied when the first view and the second view are in a letterbox format and top-and-bottom merging is applied when the first view and the second view are in a pillar format.

4. The method of claim 1, wherein encoding the first reshaped view and the second reshaped view comprises: frame packing the first reshaped view and the second reshaped view in a single frame according to a frame-packing format and compressing the single frame using a video coder and supplemental enhancement information (SEI) messaging indicating the frame-packing format.

5. The method of claim 1 wherein encoding the first reshaped view and the second reshaped view comprises employing temporal interleaving.

6. The method of claim 1 wherein encoding the first reshaped view and the second reshaped view comprises employing scalable video coding, multiview coding, or 3D coding.

7. The method of claim 6, wherein encoding with scalable video coding comprises: setting for the second reshaped view a TemporalId equal to maxTemporalId; and setting for the first reshaped view the TemporalId smaller than the maxTemporalId but larger than or equal to 0.

8. The method of claim 1, wherein encoding the first reshaped view and the second reshaped view comprises merging the first reshaped view and the second reshaped view in a merged reshaped view with a merge format different than the merge format in the input merged view.

9. The method of claim 1, wherein the reshaping process further comprises: receiving a first SDR view and a second SDR view of the scene; applying the merging function to merge the first SDR view and the second SDR view into an SDR merged view; and applying the reshaping process using both the SDR merged view and the input merged view to generate the reshaped merged view in the second codeword representation and the composer metadata.

10. A method for decoding stereoscopic high-dynamic range (HDR) video, the method comprising: receiving a bitstream comprising reshaped coded data in a second codeword representation and composer metadata; demultiplexing the bitstream to extract the reshaped coded data and the composer metadata; decoding the reshaped coded data to generate a reshaped merged view in the second codeword representation; applying a composer function to the reshaped merged view to generate based on the composer metadata an output merged view in a first codeword representation; and generating based on the output merged view an output first view and an output second view in the first codeword representation.

11. The method of claim 10, wherein generating the output first view and the output second view comprises: receiving a supplemental enhancement information (SEI) message indicating a packing format; and generating the output first view and the output second view based on the SEI message.

12. The method of claim 10, wherein generating the output first view and the output second view comprises: receiving syntax elements in the bitstream indicating a time-interleaved coding format; and generating the output first view and the output second view based on time interleaving.

13. The method of claim 12, wherein if a TemporalId is equal to maxTemporalId, then extracting the output second view; and if the TemporalId is smaller than the maxTemporalId but larger than or equal to 0, then extracting the output first view.

14. An apparatus comprising a processor and configured to perform any one of the methods recited in claims 1-13.

15. A non-transitory computer-readable storage medium having stored thereon computer-executable instructions for executing a method with one or more processors in accordance with any one of claims 1-13.

Description:
STEREOSCOPIC HIGH DYNAMIC RANGE VIDEO

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. Provisional Application No. 63/338,781 filed on May 5, 2022, which is incorporated by reference in its entirety.

TECHNOLOGY

[0002] The present invention relates generally to images. More particularly, an embodiment of the present invention relates to techniques for the stereoscopic transmission of high dynamic range video.

BACKGROUND

[0003] As used herein, the term 'dynamic range' (DR) may relate to a capability of the human visual system (HVS) to perceive a range of intensity (e.g., luminance, luma) in an image, e.g., from darkest grays (blacks) to brightest whites (highlights). In this sense, DR relates to a 'scene-referred' intensity. DR may also relate to the ability of a display device to adequately or approximately render an intensity range of a particular breadth. In this sense, DR relates to a 'display-referred' intensity. Unless a particular sense is explicitly specified to have particular significance at any point in the description herein, it should be inferred that the term may be used in either sense, e.g., interchangeably.

[0004] As used herein, the term high dynamic range (HDR) relates to a DR breadth that spans the some 14-15 orders of magnitude of the human visual system (HVS). In practice, the DR over which a human may simultaneously perceive an extensive breadth in intensity range may be somewhat truncated, in relation to HDR. As used herein, the terms enhanced dynamic range (EDR) or visual dynamic range (VDR) may individually or interchangeably relate to the DR that is perceivable within a scene or image by a human visual system (HVS) that includes eye movements, allowing for some light adaptation changes across the scene or image.

[0005] In practice, images comprise one or more color components (e.g., luma Y and chroma Cb and Cr) wherein each color component is represented by a precision of n-bits per pixel (e.g., n = 8). For example, using gamma luminance coding, images where n ≤ 8 (e.g., color 24-bit JPEG images) are considered images of standard dynamic range, while images where n > 10 may be considered images of enhanced dynamic range. EDR and HDR images may also be stored and distributed using high-precision (e.g., 16-bit) floating-point formats, such as the OpenEXR file format developed by Industrial Light and Magic.

[0006] Most consumer desktop displays currently support luminance of 200 to 300 cd/m² or nits. Most consumer HDTVs range from 300 to 500 nits with new models reaching 1000 nits (cd/m²). Such conventional displays thus typify a lower dynamic range (LDR), also referred to as a standard dynamic range (SDR), in relation to HDR or EDR. As the availability of HDR content grows due to advances in both capture equipment (e.g., cameras) and HDR displays (e.g., the PRM-4200 professional reference monitor from Dolby Laboratories), HDR content may be color graded and displayed on HDR displays that support higher dynamic ranges (e.g., from 1,000 nits to 5,000 nits or more). In general, without limitation, the methods of the present disclosure relate to any dynamic range higher than SDR.

[0007] As used herein, the term “display management” refers to processes that are performed on a receiver to render a picture for a target display. For example, and without limitation, such processes may include tone-mapping, gamut-mapping, color management, frame-rate conversion, and the like.

[0008] The creation and playback of high dynamic range (HDR) content is now becoming widespread as HDR technology offers more realistic and lifelike images than earlier formats. It is expected that stereo HDR in combination with volumetric video will provide a more immersive experience. To improve existing coding schemes, as appreciated by the inventors here, improved techniques for the transmission and display of stereoscopic HDR video are developed.

[0009] The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, issues identified with respect to one or more approaches should not be assumed to have been recognized in any prior art on the basis of this section, unless otherwise indicated.

BRIEF DESCRIPTION OF THE DRAWINGS

[00010] An embodiment of the present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:

[00011] FIG. 1A depicts an encoding process for stereoscopic HDR video according to a first example embodiment of the present invention;

[00012] FIG. 1B depicts a decoding process for stereoscopic HDR video according to a first example embodiment of the present invention;

[00013] FIG. 2A and FIG. 2B depict encoding and decoding processes for stereoscopic HDR video according to second example embodiments of the present invention;

[00014] FIG. 2C depicts a decoding process for stereoscopic HDR video according to a third example embodiment of the present invention; and

[00015] FIG. 3 depicts merging left and right stereoscopic views before reshaping according to an example embodiment of the present invention.

DESCRIPTION OF EXAMPLE EMBODIMENTS

[00016] Methods for stereoscopic HDR video coding and decoding are described herein. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are not described in exhaustive detail, in order to avoid unnecessarily occluding, obscuring, or obfuscating the present invention.

SUMMARY

[00017] Example embodiments described herein relate to methods for a stereoscopic HDR video pipeline. In an embodiment, a processor receives a first view and a second view of a scene in a first codeword representation; applies a merging function to merge the first view and the second view into an input merged view in the first codeword representation; applies a reshaping process to the input merged view to generate a reshaped merged view in a second codeword representation and composer metadata, wherein the composer metadata allow a composer function operating on the reshaped merged view to generate an output approximating the input merged view in the first codeword representation; applies a split function to the reshaped merged view to generate a first reshaped view and a second reshaped view in the second codeword representation; encodes the first reshaped view and the second reshaped view to generate a coded bitstream; and combines the coded bitstream and the composer metadata to generate a coded output.

[00018] In a second embodiment, a processor receives a bitstream comprising reshaped coded data in a second codeword representation and composer metadata; demultiplexes the bitstream to extract the reshaped coded data and the composer metadata; decodes the reshaped coded data to generate a reshaped merged view in the second codeword representation; applies a composer function to the reshaped merged view to generate, based on the composer metadata, an output merged view in a first codeword representation; and generates, based on the output merged view, an output first view and an output second view in the first codeword representation.

STEREO HDR VIDEO ARCHITECTURE

[00019] FIG. 1A depicts an example embodiment of an encoding pipeline for stereoscopic video. As used herein, the term “metadata” relates to any auxiliary information that is transmitted as part of a coded bitstream or sequence and assists a decoder to render a decoded image. Such metadata may include, but are not limited to, color space or gamut information, reference display parameters, and auxiliary signal parameters, as those described herein.

[00020] As used herein, the term “forward reshaping” denotes a process of sample-to-sample or codeword-to-codeword mapping of a digital image from its original bit depth and original codeword distribution or representation (e.g., gamma, PQ, HLG, and the like) to an image of the same or different bit depth and a different codeword distribution or representation. Reshaping allows for improved compressibility or improved image quality at a fixed bit rate. For example, without limitation, reshaping may be applied to 10-bit or 12-bit PQ-coded HDR video to improve coding efficiency in a 10-bit video coding architecture. In a receiver, after decompressing the received signal (which may or may not be reshaped), the receiver may apply an “inverse (or backward) reshaping function” to restore the signal to its original codeword distribution and/or to achieve a higher dynamic range.
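To make the forward/backward reshaping idea concrete, the following Python sketch builds a toy forward LUT from 12-bit to 10-bit codewords and an approximate backward LUT of the kind the composer metadata would convey. The fixed power curve, function names, and dimensions are illustrative assumptions; a real reshaper derives the curve from content statistics as described later.

import numpy as np

def build_forward_lut(num_in_codewords=4096, num_out_codewords=1024, gamma=0.8):
    """Toy forward-reshaping LUT: maps 12-bit codewords to 10-bit codewords.

    A real reshaper derives this curve from content statistics; here a fixed
    power curve stands in for that analysis (assumption, for illustration only).
    """
    x = np.linspace(0.0, 1.0, num_in_codewords)
    y = np.power(x, gamma)  # content-independent placeholder curve
    return np.round(y * (num_out_codewords - 1)).astype(np.uint16)

def build_backward_lut(forward_lut, num_out_codewords=1024):
    """Approximate inverse of the forward LUT (what composer metadata conveys)."""
    backward = np.zeros(num_out_codewords, dtype=np.uint16)
    for out_cw in range(num_out_codewords):
        # Map each reshaped codeword back to the mean of the input codewords
        # that forward-mapped to it (empty bins fall back to the previous entry).
        members = np.nonzero(forward_lut == out_cw)[0]
        backward[out_cw] = members.mean() if members.size else backward[max(out_cw - 1, 0)]
    return backward

# Forward reshape a 12-bit HDR frame, then reconstruct an approximation of it.
hdr_frame = np.random.randint(0, 4096, size=(1080, 1920), dtype=np.uint16)  # synthetic frame
fwd = build_forward_lut()
reshaped = fwd[hdr_frame]      # 10-bit reshaped frame sent to the codec
bwd = build_backward_lut(fwd)
reconstructed = bwd[reshaped]  # decoder-side composition back to the 12-bit domain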

[00021] As depicted in FIG. 1A, the source content includes a left view, a right view, and metadata (102) to assist in proper display management (DM). For example, such metadata may include “L1 metadata,” denoting minimum, medium, and maximum luminance values related to an input frame or image. L1 metadata may be computed by converting RGB data to a luma-chroma format (e.g., YCbCr) and then computing the min, mid (average), and max values in the Y plane, or they can be computed directly in the RGB space. For example, in an embodiment, L1Min denotes the minimum of the PQ-encoded min(RGB) values of the image, while taking into consideration an active area (e.g., by excluding gray or black bars, letterbox bars, and the like), where min(RGB) denotes the minimum of the color component values {R, G, B} of a pixel. The values of L1Mid and L1Max may be computed in the same fashion by replacing the min() function with the average() and max() functions. For example, L1Mid denotes the average of the PQ-encoded max(RGB) values of the image, and L1Max denotes the maximum of the PQ-encoded max(RGB) values of the image. In some embodiments, L1 metadata may be normalized to be in [0, 1].
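A minimal sketch of the L1 metadata computation described above, assuming the input is already PQ-encoded RGB normalized to [0, 1]; the function name, the mask argument, and the exact normalization are illustrative assumptions.

import numpy as np

def l1_metadata(rgb_pq, active_mask=None):
    """Sketch of L1 metadata from a PQ-encoded RGB frame normalized to [0, 1].

    rgb_pq:      (H, W, 3) array of PQ-encoded R, G, B values in [0, 1]
    active_mask: optional (H, W) boolean mask excluding letterbox/pillar-box bars
    Returns (L1Min, L1Mid, L1Max). Exact definitions vary by system; this follows
    the description above: min of per-pixel min(RGB), mean and max of max(RGB).
    """
    per_pixel_min = rgb_pq.min(axis=2)
    per_pixel_max = rgb_pq.max(axis=2)
    if active_mask is not None:
        per_pixel_min = per_pixel_min[active_mask]
        per_pixel_max = per_pixel_max[active_mask]
    l1_min = float(per_pixel_min.min())
    l1_mid = float(per_pixel_max.mean())
    l1_max = float(per_pixel_max.max())
    return l1_min, l1_mid, l1_max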

[00022] Given the two views, a merger unit (105) merges the left view and the right view. The merging can be done horizontally or vertically; the details (depending on detection of letterbox and pillar boxes) are explained later. The merger packs the left-view video (with dimension HxW) and the right-view video (with dimension HxW) together into a single frame. This merging is performed to optimize the efficiency of the reshaper (110). The packed image is passed through the reshaper (110) to generate the reshaped video and corresponding composer metadata (112). The reshaped video is passed through a video codec (125) (e.g., HEVC, VVC, and the like) for compression. A muxer (130) multiplexes the coded video bitstream (127) with the composer metadata (112) and the DM metadata (102) to generate an output bitstream (132).
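The merger (105) and the later splitter (115) can be pictured as simple array concatenations and slices; the function names and interfaces below are illustrative only.

import numpy as np

def merge_views(left, right, mode="SbS"):
    """Merge two HxW views into a single frame for joint reshaping.

    mode="SbS" gives an Hx2W side-by-side frame; mode="TaB" gives a 2HxW
    top-and-bottom frame. Function name and interface are illustrative only.
    """
    if left.shape != right.shape:
        raise ValueError("left and right views must have identical dimensions")
    if mode == "SbS":
        return np.concatenate([left, right], axis=1)  # stack horizontally
    if mode == "TaB":
        return np.concatenate([left, right], axis=0)  # stack vertically
    raise ValueError(f"unknown merge mode: {mode}")

def split_views(merged, mode="SbS"):
    """Inverse operation: recover the left and right views from the merged frame."""
    if mode == "SbS":
        w = merged.shape[1] // 2
        return merged[:, :w], merged[:, w:]
    if mode == "TaB":
        h = merged.shape[0] // 2
        return merged[:h, :], merged[h:, :]
    raise ValueError(f"unknown merge mode: {mode}")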

[00023] The output of the reshaper (110) is the merged reshaped signal. In certain embodiments, in the spatial splitter (115), it may be split back into left and right views and go through another frame packing process (120), this time to optimize video encoding (125). For example, depending on the required encoding format, frame packing options include the following (a sketch of these options follows the list):

• Full resolution of each view in a side-by-side packing format with a coded frame resolution of Hx2W.

• Down-sampling of each view horizontally, with dimension Hx(W/2), and packing them side by side (SbS) with final image resolution HxW.

• Down-sampling of each view vertically, with dimension (H/2)xW, and packing them top-and-bottom (TaB) with final image resolution HxW.
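A sketch of the three packing options listed above, using plain decimation in place of a proper down-sampling filter (an assumption); the function name and option strings are illustrative.

import numpy as np

def pack_for_encoding(left, right, option="half_res_sbs"):
    """Illustrative frame packing of two reshaped HxW views for the video encoder.

    Options mirror the list above: full-resolution side-by-side (Hx2W), half
    horizontal resolution side-by-side (HxW), or half vertical resolution
    top-and-bottom (HxW). Simple decimation stands in for a proper
    down-sampling filter (assumption).
    """
    if option == "full_res_sbs":
        return np.concatenate([left, right], axis=1)                   # H x 2W
    if option == "half_res_sbs":
        return np.concatenate([left[:, ::2], right[:, ::2]], axis=1)   # H x W
    if option == "half_res_tab":
        return np.concatenate([left[::2, :], right[::2, :]], axis=0)   # H x W
    raise ValueError(f"unknown packing option: {option}")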

[00024] In an embodiment, the splitter unit (115) and the frame packing unit (120) may be removed from the pipeline if the output format of the spatial merger unit 105 matches the frame packing format to be used for video encoding.

[00025] Note that if half resolution (either horizontal down-sampling or vertical down-sampling) is chosen, one can also perform the resolution reformatting before the left/right view merger to reduce the computation of reshaping by half. Note also that the merger (105) and frame packing (120) can have different formats. For example, the left/right view merger can merge views using TaB to facilitate the reshaping, but the frame packing (120) can split the left and right views and re-merge them as SbS to improve coding efficiency or address other concerns.

[00026] The composer coefficients (112) are generated from a single, merged frame, to ensure the same reshaping function is applied to both views.

[00027] FIG. 1B depicts an example embodiment of the decoding process. The decoder side performs the opposite operations. After demultiplexing (135), a suitable video decoder (140) (matching the encoder 125) decodes the bitstream to obtain the reshaped signal. Then, a composer (145) applies the composer metadata (112) to the decoded signal to reconstruct the HDR signal. Finally, a splitter (150) recreates the two views. Display metadata (102) may be passed directly to the display management process (not shown).

[00028] FIG. 2A depicts an alternative embodiment of the encoding process by utilizing a scalable video encoder (125B) (e.g., using HEVC temporal scalability). At the encoder side, for a given left/right pair, one will spatially merge them into a single frame and perform the reshaping. The composer coefficients will be generated from this single merged frame. Then, in block 115, one will split this merged frame to left and right reshaped images. The left view will be encoded as the first temporal sub-layer and the right view will be encoded as the second temporal sublayer.

[00029] At the decoder side, as depicted in FIG. 2B, after demultiplexing (135), the scalable video decoder (140B) will generate two temporal sub-layers which will be passed to the composer (145) to generate the reconstructed HDR images. Then, in unit 150B, one will temporally split the left and right views for subsequent post-processing and display.

[00030] Other decoder implementations are also feasible, such as directing the output of the scalable video decoder (140B) to two buffers, each one processed by its own composer. Thus, one has two composers (145L, 145R) operating at half speed. This alternative is depicted in FIG. 2C.

[00031] In an alternative embodiment, one may use a multi-view encoder (such as the Multiview HEVC encoder). In such a scenario, in FIG. 2A, the scalable video encoder (125B) may be replaced with a multi-view video encoder. Similarly, for the decoder, in FIG. 2C, the scalable video decoder (140B) may be replaced with a multi-view video decoder.

The Stereo Reshaper

[00032] As described earlier, the input to the reshaper is a merged representation of the left and right views. One straightforward way to perform reshaping and obtain the composer coefficients is to run a reshaper on each view individually. However, since the reshaping operation is content dependent, this approach will generate different composer coefficients for each view. To ensure both views share the same forward and backward reshaping functions, it is better for the reshaper to take the merged left and right views as a single input image. By doing so, the statistics of both views are considered together and the reshaping functions will be the same for both views. A variety of compatibility scenarios are examined first.

Non-Backwards Compatibility (NBC)

[00033] In the context of this disclosure, backwards compatibility refers to whether a decoder needs to be able to view an SDR version of the incoming content - that is, the incoming input is visible even if the decoder can’t apply reverse or backward reshaping. In a non-backwards compatible system, the decoder is required to apply the reshaping metadata (112) to generate a viewable HDR output.

[00034] In an embodiment, an NBC codec core takes a single image in and computes block-based statistics (e.g., see Refs. [1-2]) to determine the required number of codewords in each luminance range. Using that information, one can build a forward look-up table (LUT) to reshape the input HDR signal to a lower bit depth in the reshaped domain. The reshaper will also output composer coefficients, which allow a decoder to construct the reverse-reshaping LUT to reconstruct the HDR signal. To re-use this single-view functionality for stereo signals, one can merge the left view and the right view together and send the merged image to a single-view NBC codec to output the reshaped stereo video and corresponding metadata. Though both side-by-side merge and top-and-bottom merge are feasible, note that a typical system may also have letterbox and pillar-box detectors to exclude those non-texture dark areas from the reshaping operations. To make the existing detector work without modification, if the video has a letterbox (top and bottom black bars), it is preferred to use a side-by-side merge format. If the video has a pillar box (left and right black bars), then a top-and-bottom merge format is preferred. Thus, at the output of merger 105 the output image will be either the two views positioned side by side or the two views positioned top-to-bottom.
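One way to picture the forward-LUT construction just described: given a per-luminance-bin codeword budget (which Refs. [1-2] derive from block-based statistics, not reproduced here), the forward curve is essentially the normalized cumulative sum of that allocation. The sketch below is illustrative only; the function name and bin layout are assumptions.

import numpy as np

def forward_lut_from_bin_allocation(codewords_per_bin, in_bit_depth=12, out_bit_depth=10):
    """Build a forward reshaping LUT from a per-luminance-bin codeword budget.

    codewords_per_bin: how many output codewords each input-luminance bin should
    receive (e.g., derived from block-based noise/banding statistics, not shown).
    The LUT is the normalized cumulative sum of the allocation - bins that need
    more codewords get a steeper mapping. Interface is illustrative only.
    """
    alloc = np.asarray(codewords_per_bin, dtype=np.float64)
    cdf = np.concatenate([[0.0], np.cumsum(alloc)])
    cdf /= cdf[-1]  # normalize the cumulative allocation to [0, 1]
    num_in = 1 << in_bit_depth
    bin_edges = np.linspace(0, num_in, len(alloc) + 1)
    x = np.arange(num_in)
    # Piecewise-linear interpolation of the cumulative curve over all input codewords.
    lut = np.interp(x, bin_edges, cdf) * ((1 << out_bit_depth) - 1)
    return np.round(lut).astype(np.uint16)

# Example: give mid-tones more codewords than shadows and highlights.
lut = forward_lut_from_bin_allocation([16, 64, 128, 64, 16])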

[00035] After reshaping, in unit 115, the reshaped merged video can be split again into reshaped left and right views, depending on the choice of the next-stage video encoder (125).

Single Layer Backwards Compatibility (SLBC)

[00036] When coding HDR video in SLBC format, the top design priority is to preserve the fidelity of the HDR content. Thus, while an SDR version is created, the reshaping is optimized to preserve the original HDR view as much as possible. Depending on the availability of SDR video versions during the reshaping process, there are two types of SLBC codecs. These will be discussed separately.

Reference SDR is available

[00037] When both reference SDR and HDR images for the same view are available, a single-view SLBC reshaping algorithm may take one HDR frame and one SDR frame as inputs and generate the reshaped SDR and the corresponding composer coefficients. If a device does not have HDR playback, it can still play the SDR base layer. Alternatively, an HDR decoder can apply the composer metadata to reconstruct the HDR signal.

[00038] The single-view SLBC algorithm performs CDF matching by building a mapping curve to match the histogram between HDR and SDR versions of the input images (e.g., see Refs. [3-4]). The algorithm also builds a dynamic 3D mapping table (d3DMT) by scanning the chroma color components to solve for proper composer-related metadata (Ref. [5]). As depicted in FIG. 3, one can merge the left view HDR and right view HDR into a single HDR image. In addition, one also merges the left view SDR and right view SDR into a single SDR image. Then, one can re-use the forward reshaping to output the reshaped SDR signal. The corresponding composer coefficients are also outputted.

[00039] Similar to the NBC case, the SLBC pipeline may have a letterbox/pillar-box detector. So, depending on the box type, a control signal (302) may be used to control whether merging is performed in SbS or TaB format. As before, the reshaped merged SDR signal (307) can be split into left and right views after the reshaper, depending on the choice of the next-stage video codec.
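Referring back to the CDF-matching step of paragraph [00038], the sketch below builds a luma mapping curve from the merged HDR frame to the merged SDR reference by matching cumulative histograms. Smoothing, monotonicity enforcement, and the chroma d3DMT handling of Ref. [5] are omitted; the names and bin count are assumptions.

import numpy as np

def cdf_matching_curve(hdr_luma, sdr_luma, num_bins=1024):
    """Sketch of CDF matching: a luma mapping curve from the merged HDR frame to
    the merged SDR reference, built by matching their cumulative histograms.

    hdr_luma, sdr_luma: luma planes normalized to [0, 1] (the merged stereo frames).
    Returns a LUT of num_bins entries mapping normalized HDR luma to SDR luma.
    Real SLBC reshaping adds smoothing, monotonicity and chroma (d3DMT) handling,
    all omitted here.
    """
    hdr_hist, edges = np.histogram(hdr_luma, bins=num_bins, range=(0.0, 1.0))
    sdr_hist, _ = np.histogram(sdr_luma, bins=num_bins, range=(0.0, 1.0))
    hdr_cdf = np.cumsum(hdr_hist) / hdr_hist.sum()
    sdr_cdf = np.cumsum(sdr_hist) / sdr_hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    # For each HDR bin, find the SDR level whose CDF value matches the HDR CDF value.
    return np.interp(hdr_cdf, sdr_cdf, centers)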

Reference SDR is not available

[00040] When the reference SDR is not available, the reshaper operates only on the HDR input to generate the reshaped SDR output and corresponding composer coefficients. Similar to the NBC case, one can merge the left view HDR and right view HDR as input and pass the merged HDR picture to a non-reference SLBC codec (e.g., see Refs. [6-7]) to output the merged reshaped SDR and composer coefficients. As before, the reshaped merged SDR image can be split into left and right views after the reshaper, depending on the choice of the next-stage video codec.

Single Layer inverse Display mapping (SLiDM)

[00041] In certain HDR profiles (as in Dolby Vision Profile 8.4), the original content is in SDR format. While a decoder can reconstruct an HDR version, the reshaping is optimized to preserve the original SDR content. In such formats an encoder may apply an inverse mapping technique (say, inverse tone mapping or inverse display management) to generate an HDR signal (Ref. [4]). If both SDR and HDR versions are available (Ref. [4]), then the merging of the left and right views is identical to the process described for the SLBC codec (see FIG. 3). If there is no HDR signal available (Ref. [8]) and static mapping from SDR to HDR is chosen, the available SDR signal is transmitted as is; thus there is no need to include units 105 and 115. Reshaping coefficients are simply multiplexed together with the coded stereoscopic signal.

[00042] For example, in Dolby Vision Profile 8.4, which converts HLG to PQ, this mode operates as an EOTF conversion, so one does not need access to the video content. The composer coefficients are static for the entire video sequence. In other profiles, one can create the HDR version from SDR using dynamic composer coefficients. The coefficients are created according to the content and its features. In this case, one will need to merge the left and right views and perform content analysis to generate proper metadata (112).
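As an illustration of such a static HLG-to-PQ conversion, the sketch below converts a single HLG-encoded luma channel to a PQ codeword using the BT.2100 transfer functions, assuming a 1000-nit nominal peak. A full converter operates on RGB with the luminance-dependent OOTF; this scalar version is a simplification for illustration only.

import numpy as np

# Constants from ITU-R BT.2100 (HLG and PQ); 1000-nit nominal peak assumed.
HLG_A, HLG_B = 0.17883277, 1.0 - 4 * 0.17883277
HLG_C = 0.5 - HLG_A * np.log(4 * HLG_A)
PQ_M1, PQ_M2 = 0.1593017578125, 78.84375
PQ_C1, PQ_C2, PQ_C3 = 0.8359375, 18.8515625, 18.6875

def hlg_to_pq_luma(e_hlg, peak_nits=1000.0):
    """Simplified HLG-to-PQ codeword conversion for a single luma channel.

    e_hlg: HLG-encoded signal in [0, 1]. Applies the HLG inverse OETF, a simple
    OOTF with gamma 1.2 (valid for a 1000-nit display), and the PQ inverse EOTF.
    """
    e_hlg = np.asarray(e_hlg, dtype=np.float64)
    # HLG inverse OETF: encoded signal -> normalized scene light in [0, 1]
    scene = np.where(e_hlg <= 0.5,
                     (e_hlg ** 2) / 3.0,
                     (np.exp((e_hlg - HLG_C) / HLG_A) + HLG_B) / 12.0)
    # OOTF: scene light -> display light in nits (gamma = 1.2 at 1000 nits)
    display_nits = peak_nits * scene ** 1.2
    # PQ inverse EOTF: luminance (normalized to 10000 nits) -> PQ codeword
    y = np.clip(display_nits / 10000.0, 0.0, 1.0)
    return ((PQ_C1 + PQ_C2 * y ** PQ_M1) / (1.0 + PQ_C3 * y ** PQ_M1)) ** PQ_M2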

Stereo Coding Details

[00043] As discussed earlier, a video encoder 125 may be one of a single-layer encoder, a scalable encoder, or a multi-view encoder. These options are discussed in this section in more detail. Without loss of generality, additional details will be described using HEVC (Ref. [9]) as an example; however, the same ideas can be applied to other codecs, such as AVC, AV1, VVC, and the like. For this discussion, it is assumed that the left view is the base view and the right view is an enhancement view. HEVC specifies three methods to support stereo 3D encoding.

[00044] The first method is to use a frame packing arrangement SEI message for single-layer HEVC. It supports side-by-side (SbS), top-and-bottom (TaB), and temporal interleaving (TI); see Table D.8 of Ref. [9] for the syntax definition of the syntax parameter frame_packing_arrangement_type.

[00045] For SbS, the spatial resolution of the original video is reduced by half horizontally. For TaB, the spatial resolution of the original video is reduced by half vertically. At a decoder, up-conversion processing is performed to recover the decoded video to the original resolution.

[00046] When using temporal interleaving, the frames preserve the original resolution; however, the video frame rate is doubled to enable stereo 3D by interleaving the two views. At the decoder, a left view is extracted from the even-numbered frames and the right view is extracted from the odd-numbered frames, or vice versa. In practice, for 8k x 2k x 96Hz content, such an approach would require using HEVC Main10, Level 6.2.

Temporal Scalability

[00047] The second option is to use temporal sub-layers to support stereoscopic video coding in a single-layer HEVC. For example, in such a scenario the left view is coded with even frames and the right view is coded with odd frames. Such a solution also supports backward compatibility for the single-view case, i.e., the decoder can decode only the left view.

[00048] In an embodiment, suppose the greatest value of TemporalId is set equal to maxTemporalId (maxTemporalId > 0); then the right view is coded with the TemporalId equal to maxTemporalId and the left view is coded with TemporalIds smaller than maxTemporalId.

[00049] To be specific, in the sequence parameter set (SPS), the syntax element sps_max_sub_layers_minus1 is signalled. sps_max_sub_layers_minus1 plus 1 specifies the maximum number of temporal sub-layers that may be present in each coded video sequence (CVS) referring to the SPS. In an embodiment, one can set maxTemporalId equal to the value of sps_max_sub_layers_minus1.
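A small sketch of the temporal-interleaving layout just described, with the left view on even coded-frame positions at a low TemporalId and the right view on odd positions at maxTemporalId; the function and tuple layout are illustrative assumptions.

def assign_temporal_ids(num_stereo_frames, max_temporal_id=1):
    """Sketch of the temporal-interleaving layout described above.

    Each stereo pair becomes two coded frames: the left view at an even
    position with TemporalId < maxTemporalId (here simply 0), and the right
    view at the following odd position with TemporalId == maxTemporalId.
    Returns a list of (coded_frame_index, view, temporal_id) tuples.
    """
    schedule = []
    for pair in range(num_stereo_frames):
        schedule.append((2 * pair, "left", 0))                     # base, extractable alone
        schedule.append((2 * pair + 1, "right", max_temporal_id))  # enhancement sub-layer
    return schedule

# A legacy 2D decoder keeps only frames with TemporalId < maxTemporalId (left view).
layout = assign_temporal_ids(3)
base_layer = [f for f in layout if f[2] < 1]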

[00050] In the network abstraction layer (NAL) unit header, TemporalId is signalled by the syntax element nuh_temporal_id_plus1, defined as: nuh_temporal_id_plus1 minus 1 specifies a temporal identifier for the NAL unit.

The value of nuh_temporal_id_plus1 is set equal to TemporalId + 1.

[00051] Hence, for the right view, TemporalId is set equal to maxTemporalId. For the left view, TemporalId is set smaller than maxTemporalId but larger than or equal to 0. With this setting, backward compatibility is supported by extracting a bitstream with TemporalId smaller than maxTemporalId.

[00052] To code the temporally interleaved video more efficiently, the following encoder settings are proposed:

1) For inter-view prediction, if the left view decoded picture is used for predicting the right view picture, the left view reference picture should be marked as a long-term reference picture. In particular, to align with the MV-HEVC design, only the left view decoded picture with the same time instance should be used as the long-term reference picture for the right view.

2) QP adaptation: for conventional temporal scalability, in general a higher QP is used for the highest temporal sub-layer. In this application, since the highest temporal sub-layer is used to code the right view, the QP rule should be adjusted to obtain the best stereoscopic quality.

[00053] The main reason behind 1) is the picture order count (POC) setting in the proposed embodiment.

1) POC: By definition, at the same time instance, the left view and right view should have the same POC. In the proposed solution, one needs to assign the POC based on temporal interleaving, so the POC relation between the left view and the right view no longer has a real temporal meaning.

2) POC impact on coding: in HEVC, in merge or advanced motion vector prediction (AMVP) mode, a list of candidates is created from the motion information of spatial or temporal neighbour prediction blocks. In this process, motion vectors (MVs) from neighbour blocks may be temporally scaled using POC. Since the POC no longer reflects true temporal order, to avoid hurting coding efficiency one needs to mark the inter-view reference picture as a long-term reference picture, which disables scaling of MVs associated with long-term reference pictures.

[00054] As an example, in HDR coding, to associate metadata with the left and right views, one needs to check nuh_temporal_id_plus1 to differentiate between the left and right views and use POC order and POC difference to associate the left and right views that share the same time instance.

[00055] For example, if POC1 is assigned to a left view, one needs to look for the associated right view (see the sketch after the list below). The right-view POC2 should have the following properties:

1) POC2 is larger than POC1 and has the smallest POC gap with POC1;

2) nuh_temporal_id_plus1 is equal to maxTemporalId + 1.
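The sketch referenced above pairs a left-view picture at POC1 with its right-view picture using the two properties just listed; the picture-list representation is an illustrative assumption.

def find_right_view(poc1, pictures, max_temporal_id):
    """Find the right-view picture associated with a left-view picture at POC1,
    following the two properties above. `pictures` is an illustrative list of
    (poc, nuh_temporal_id_plus1) tuples for decoded pictures.
    """
    candidates = [(poc, tid) for poc, tid in pictures
                  if poc > poc1 and tid == max_temporal_id + 1]
    if not candidates:
        return None
    # Smallest POC gap with POC1 identifies the same time instance.
    return min(candidates, key=lambda p: p[0] - poc1)

# Example: left views at even POCs (tid_plus1 = 1), right views at odd POCs (tid_plus1 = 2).
pics = [(0, 1), (1, 2), (2, 1), (3, 2)]
assert find_right_view(2, pics, max_temporal_id=1) == (3, 2)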

[00056] In the case of supporting temporal scalability, the proposed solution needs to be adjusted to allow additional temporal sub-layer indication within the left and right views. In this case, it might be much simpler to use the frame packing arrangement SEI message or the MV-HEVC solution, so as not to confuse the intent of the temporal sub-layers.

[00057] Such a temporal scalability case can arise as follows: for example, most of a film is created at 24 fps, but a few scenes at 96 fps. Not all decoders can support 96 fps, so one may want a compatible 24 fps base plus extra frames at 96 fps just for capable decoders. One could set a 24 fps base frame rate and use temporal scalability just for the scenes at 96 fps.

In practice, for 4k x 2k x 192 fps content, such an implementation requires HEVC Main10, Level 6.1.

MV-HEVC

[00058] HEVC has two extensions to support 3D video: MV-HEVC and 3D-HEVC. The multi-view extension, MV-HEVC, allows efficient coding of multiple camera views and associated auxiliary pictures by reusing single-layer decoders without changing the block-level processing modules. It allows inter-view prediction through high-level syntax (HLS) changes only. The 3D video extension, 3D-HEVC, targets a coded representation of both multiple views and associated depth maps. It involves changes to low-level coding tool modules. MV-HEVC currently only supports the Multiview Main profile. In practice, to support HDR, it requires defining a Multiview Main 10 profile, or a Stereo Main 10 profile for stereoscopic video.

Support of 360 degree video (3DoF)

[00059] The above solutions can also be extended to support 360 degree video. The standardized compression scheme for a 360 degree video is to first project it to a 2D plane and then apply a compression scheme for a 2D video. A typical delivery workflow is as follows (Ref. [10]):

- A multi-camera array captures video, then image stitching is applied to obtain spherical video

- Spherical video is “unfolded” to the 2D plane, e.g., using a projection such as the equirectangular projection (ERP) or a cube map, and the like (see the sketch after this list)

- Video encoding using HEVC, AVC, VVC, and the like, followed by packaging and delivery

- At the receiver, receive the 2D video, then unpackage and decode it

- Project the 2D plane back to the sphere given a specific viewpoint

- Render on a display
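As referenced in the ERP item above, the receiver-side re-projection step can be pictured by mapping an equirectangular pixel back to a unit direction on the sphere; the coordinate convention below is an assumption for illustration only.

import math

def erp_pixel_to_direction(u, v, width, height):
    """Map an equirectangular (ERP) pixel to a unit direction on the sphere.

    u, v: pixel coordinates in the 2D frame; width, height: ERP frame size.
    Convention (an assumption for this sketch): longitude spans [-pi, pi]
    across the width, latitude spans [pi/2, -pi/2] down the height.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)

# Centre of the frame looks (approximately) straight ahead along +z.
print(erp_pixel_to_direction(1920, 960, 3840, 1920))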

Therefore, from a coding and delivery point of view, a 2D 360-degree video is no different from 2D plane video. For stereoscopic delivery of 360-degree HDR video, the above solutions can be applied.

References

Each of these references is incorporated by reference in its entirety.

1. US 10,419,762, “Content-adaptive perceptual quantizer for high dynamic range images.”

2. US 10,032,262, “Block-based content-adaptive reshaping for high dynamic range images.”

3. US 10,701,375, “Encoding and decoding reversible production-quality single-layer video signals.”

4. US 10,264,287, “Inverse Luma/Chroma mappings with histogram transfer and approximation.”

5. US 11,277,627, “High-fidelity full reference and high-efficiency reduced reference encoding in end-to-end single-layer backward compatible encoding pipeline.”

6. US Patent Application Publication 2022/0046245 A1, “Interpolation of reshaping functions.”

7. WIPO PCT Patent Application Publication WO 2021/216607, “Reshaping functions for HDR imaging with continuity and reversibility constraints.”

8. US Patent Application Ser. No. 17/630,901, “Electro-optical transfer function conversion and signal legalization,” filed on 27 Jan 2022, G-M. Su, et al.

9. ITU-T Rec. H.265, “High efficiency video coding,” ITU, version 08/2021.

10. Y. Ye, “Recent trends and challenges in 360-degree video compression,” ICME 2018, Hot3D talk.

EXAMPLE COMPUTER SYSTEM IMPLEMENTATION

[00060] Embodiments of the present invention may be implemented with a computer system, systems configured in electronic circuitry and components, an integrated circuit (IC) device such as a microcontroller, a field programmable gate array (FPGA), or another configurable or programmable logic device (PLD), a discrete time or digital signal processor (DSP), an application specific IC (ASIC), and/or apparatus that includes one or more of such systems, devices or components. The computer and/or IC may perform, control, or execute instructions related to image transformations, such as those described herein. The computer and/or IC may compute any of a variety of parameters or values that relate to stereoscopic HDR video processes described herein. The image and video embodiments may be implemented in hardware, software, firmware and various combinations thereof.

[00061] Certain implementations of the invention comprise computer processors which execute software instructions which cause the processors to perform a method of the invention. For example, one or more processors in a display, an encoder, a set top box, a transcoder or the like may implement methods related to stereoscopic HDR video processes as described above by executing software instructions in a program memory accessible to the processors. The invention may also be provided in the form of a program product. The program product may comprise any tangible and non-transitory medium which carries a set of computer-readable signals comprising instructions which, when executed by a data processor, cause the data processor to execute a method of the invention. Program products according to the invention may be in any of a wide variety of tangible forms. The program product may comprise, for example, physical media such as magnetic data storage media including floppy diskettes, hard disk drives, optical data storage media including CD ROMs, DVDs, electronic data storage media including ROMs, flash RAM, or the like. The computer-readable signals on the program product may optionally be compressed or encrypted.

[00062] Where a component (e.g. a software module, processor, assembly, device, circuit, etc.) is referred to above, unless otherwise indicated, reference to that component (including a reference to a "means") should be interpreted as including as equivalents of that component any component which performs the function of the described component (e.g., that is functionally equivalent), including components which are not structurally equivalent to the disclosed structure which performs the function in the illustrated example embodiments of the invention.

EQUIVALENTS, EXTENSIONS, ALTERNATIVES AND MISCELLANEOUS

[00063] Example embodiments that relate to stereoscopic HDR video processes are thus described. In the foregoing specification, embodiments of the present invention have been described with reference to numerous specific details that may vary from implementation to implementation. Thus, the sole and exclusive indicator of what is the invention, and what is intended by the applicants to be the invention, is the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction. Any definitions expressly set forth herein for terms contained in such claims shall govern the meaning of such terms as used in the claims. Hence, no limitation, element, property, feature, advantage or attribute that is not expressly recited in a claim should limit the scope of such claim in any way. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.