Title:
VIRTUAL REALITY VIEWPOINT VIEWPORT CENTER POINT CORRESPONDENCE SIGNALING
Document Type and Number:
WIPO Patent Application WO/2020/068935
Kind Code:
A1
Abstract:
A mechanism for virtual reality (VR) video coding is disclosed. The mechanism includes receiving a bitstream including at least a portion of a coded virtual reality (VR) video filmed from a plurality of viewpoints and including a correspondence between viewport centers for the viewpoints. The portion of the VR video at a center point of a source viewport at a source viewpoint is decoded and forwarded toward a display. The source viewpoint is switched to a destination viewpoint. A destination viewport is determined at the destination viewpoint based on the source viewport and the correspondence between viewport centers for the viewpoints. The portion of the VR video at a center point of the destination viewport at the destination viewpoint is decoded and forwarded to the display.

Inventors:
WANG YE-KUI (US)
FAN YUQUN (CN)
DI PEIYUN (CN)
Application Number:
PCT/US2019/052894
Publication Date:
April 02, 2020
Filing Date:
September 25, 2019
Assignee:
FUTUREWEI TECHNOLOGIES INC (US)
International Classes:
H04N13/117; H04N13/194; H04N13/282; H04N21/236; H04N21/2365
Domestic Patent References:
WO2018021067A12018-02-01
Foreign References:
US20180063505A12018-03-01
US20170359624A12017-12-14
US20180041715A12018-02-08
US20170310723A12017-10-26
US20170269685A12017-09-21
Attorney, Agent or Firm:
DIETRICH, William H. et al. (US)
Claims:
CLAIMS

What is claimed is:

1. A method implemented in a decoder, the method comprising:

processing a virtual reality (VR) video stream, wherein the VR video stream comprises a plurality of viewpoints included in a viewpoint set, wherein each of the viewpoints corresponds to one particular omnidirectional video camera for capturing an omnidirectional video at a particular location, and wherein the VR video stream contains viewport center information indicative of a plurality of viewport centers for the viewpoints in the viewpoint set;

presenting a first viewport of a first viewpoint of the viewpoint set to a user;

switching from the first viewpoint to a second viewpoint of the viewpoint set; and

determining a second viewport of the second viewpoint based on the information indicative of the plurality of viewport centers for the viewpoints in the viewpoint set.

2. The method of claim 1, wherein the viewport center information indicates the second viewport and the first viewport include corresponding viewport centers.

3. The method of any of claims 1-2, wherein the viewport center information is coded as a pair in a viewport center point correspondence (vcpc) sample entry.

4. The method of any of claims 1-2, wherein the viewport center information is coded as a set containing a plurality of viewport centers in a viewport center point correspondence (vcpc) sample entry.

5. The method of any of claims 1-4, wherein the viewport center information is coded in a timed metadata track related to the plurality of viewpoints.

6. The method of any of claims 1-5, wherein the viewport center information is coded in a sphere region structure.

7. A method implemented in an encoder, the method comprising:

receiving, at a processor of the encoder, a virtual reality (VR) video signal filmed from a plurality of viewpoints;

determining, by the processor, a correspondence between viewport centers for the viewpoints;

encoding, by the processor, the correspondence between the viewport centers for the viewpoints in a bitstream; and

transmitting, by a transmitter of the encoder, the bitstream containing the correspondence between the viewport centers for the viewpoints to support viewpoint transitions when displaying the VR video signal.

8. The method of claim 7, wherein the correspondence between the viewport centers is coded as a pair in a viewport center point correspondence (vcpc) sample entry.

9. The method of claim 7, wherein the correspondence between the viewport centers is coded as a set containing a plurality of viewport centers in a viewport center point correspondence (vcpc) sample entry.

10. The method of any of claims 7-9, wherein the correspondence between the viewport centers is coded in a timed metadata track related to the plurality of viewpoints.

11. The method of any of claims 7-10, wherein the correspondence between the viewport centers is coded in a sphere region structure.

12. The method of any of claims 7-11, wherein the correspondence between the viewport centers indicates a correspondence between a center point of a source viewport at a source viewpoint and a center point of a destination viewport at a destination viewpoint to maintain a consistent object view upon viewpoint switching.

13. A method implemented in a decoder, the method comprising:

receiving, by a receiver of the decoder, a bitstream including at least a portion of a coded virtual reality (VR) video filmed from a plurality of viewpoints and including a correspondence between viewport centers for the viewpoints;

decoding, by a processor of the decoder, the portion of the VR video at a center point of a source viewport at a source viewpoint;

forwarding, by the processor, the portion of the VR video at the source viewport toward a display;

determining, by the processor, to switch from the source viewpoint to a destination viewpoint;

determining, by the processor, a destination viewport at the destination viewpoint based on the source viewport and the correspondence between viewport centers for the viewpoints;

decoding, by the processor, the portion of the VR video at a center point of the destination viewport at the destination viewpoint; and

forwarding, by the processor, the portion of the VR video at the destination viewport toward the display.

14. The method of claim 13, wherein the correspondence between the viewport centers is coded as a pair in a viewport center point correspondence (vcpc) sample entry.

15. The method of claim 13, wherein the correspondence between the viewport centers is coded as a set containing a plurality of viewport centers in a viewport center point correspondence (vcpc) sample entry.

16. The method of any of claims 13-15, wherein the correspondence between the viewport centers is coded in a timed metadata track related to the plurality of viewpoints.

17. The method of any of claims 13-16, wherein the correspondence between the viewport centers is coded in a sphere region structure.

18. A video coding device comprising:

a processor, a receiver coupled to the processor, and a transmitter coupled to the processor, the processor, receiver, and transmitter configured to perform the method of any of claims 1-17.

19. A non-transitory computer readable medium comprising a computer program product for use by a video coding device, the computer program product comprising computer executable instructions stored on the non-transitory computer readable medium such that when executed by a processor cause the video coding device to perform the method of any of claims 1-17.

20. An encoder comprising:

a receiving means for receiving a virtual reality (VR) video signal filmed from a plurality of viewpoints;

a correspondence determination means for determining a correspondence between viewport centers for the viewpoints;

an encoding means for encoding the correspondence between the viewport centers for the viewpoints in a bitstream; and

a transmitting means for transmitting the bitstream containing the correspondence between the viewport centers for the viewpoints to support viewpoint transitions when displaying the VR video signal.

21. The encoder of claim 20, wherein the encoder is further configured to perform the method of any of claims 7-12.

22. A decoder comprising:

a receiving means for receiving a bitstream including at least a portion of a coded virtual reality (VR) video filmed from a plurality of viewpoints and including a correspondence between viewport centers for the viewpoints;

a decoding means for:

decoding the portion of the VR video at a center point of a source viewport at a source viewpoint, and

decoding the portion of the VR video at a center point of a destination viewport at a destination viewpoint;

a determining means for:

determining to switch from the source viewpoint to the destination viewpoint, and

determining the destination viewport at the destination viewpoint based on the source viewport and the correspondence between viewport centers for the viewpoints; and

a forwarding means for:

forwarding the portion of the VR video at the source viewport toward a display, and

forwarding the portion of the VR video at the destination viewport toward the display.

23. The decoder of claim 22, wherein the decoder is further configured to perform the method of any of claims 1-6 and 13-17.

Description:
Virtual Reality Viewpoint Viewport Center Point Correspondence Signaling

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This patent application claims the benefit of U.S. Provisional Patent Application No. 62/737,651, filed September 27, 2018 by Ye-Kui Wang, et al., and titled “Virtual Reality (VR) Viewpoint Center Point Signaling,” which is hereby incorporated by reference.

TECHNICAL FIELD

[0002] The present disclosure is generally related to virtual reality (VR), also referred to as omnidirectional media, immersive media, and 360 degree video, and is specifically related to mechanisms for signaling viewport center point correspondences between VR video viewpoints.

BACKGROUND

[0003] Virtual reality (VR) is the ability to be virtually present in a non-physical world created by the rendering of natural and/or synthetic images and sounds correlated by the movements of the immersed user, allowing the user to interact with that world. With the recent progress made in rendering devices, such as head mounted displays (HMD), and VR video (often also referred to as 360 degree video or omnidirectional video) creation, a significant quality of experience can be offered. VR applications include gaming, training, education, sports video, online shopping, adult entertainment, and so on.

SUMMARY

[0004] In an embodiment, the disclosure includes a method implemented in a decoder. The method comprises processing a virtual reality (VR) video stream, wherein the VR video stream comprises a plurality of viewpoints included in a viewpoint set, wherein each of the viewpoints corresponds to one particular omnidirectional video camera for capturing an omnidirectional video at a particular location, and wherein the VR video stream contains information indicative of a plurality of viewport centers for the viewpoints in the viewpoint set. The method further comprises presenting a first viewport of a first viewpoint of the viewpoint set to a user. The method further comprises switching from the first viewpoint to a second viewpoint of the viewpoint set. The method further comprises determining a second viewport of the second viewpoint based on the information indicative of a plurality of viewport centers for the viewpoints in the viewpoint set. In some VR systems viewpoints receive a default viewport. The effect is that a user looking at a first object at a first viewpoint via a first viewport can switch to a second viewpoint. However, when the switch is made, the user has to manually reorient from the default viewport to a second viewport in order to find the object being watched. The present disclosure includes a mechanism to signal correspondences between viewport centers of related viewpoints. In this way, a user viewing an object at a first viewpoint can automatically be reoriented to that object upon switching to a second viewpoint based on the viewport center correspondences between the viewpoints. Accordingly, the mechanisms support increased functionality at the decoder. Further, some systems indicate correspondences between viewports according to spatial regions. However, viewpoint center points can be encoded using less data than encoding spatial regions. As such, the present disclosure supports increased coding efficiency. Hence, the disclosed mechanisms provide for reduced memory usage at the encoder and the decoder, as well as the reduced network resource usage to communicate such data.
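
The following is a minimal decoder-side sketch, in Python, of the viewport determination described above. The class and function names, the (azimuth, elevation) representation of a viewport center, and the exact-match lookup are illustrative assumptions, not syntax defined by this disclosure or by any file format.

```python
# Illustrative sketch only: names and the lookup strategy are assumptions.
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# A viewport center expressed as (azimuth, elevation) in degrees (assumed convention).
Center = Tuple[float, float]

@dataclass
class CenterCorrespondence:
    """One signaled correspondence: for each viewpoint in a set, the center point
    of the viewport that looks at the same location in the scene."""
    centers: Dict[int, Center]  # viewpoint_id -> corresponding viewport center

def destination_center(correspondences: List[CenterCorrespondence],
                       source_viewpoint: int, source_center: Center,
                       destination_viewpoint: int) -> Optional[Center]:
    """Return the viewport center to present after switching viewpoints.

    Picks the correspondence whose entry for the source viewpoint matches the
    viewport currently being watched, then reads the entry for the destination
    viewpoint. Returns None when no correspondence applies, in which case a
    player might fall back to a default viewing orientation.
    """
    for corr in correspondences:
        if corr.centers.get(source_viewpoint) == source_center:
            return corr.centers.get(destination_viewpoint)
    return None

# Example: viewpoints 1 and 2 both have a signaled viewport centered on the same object.
corrs = [CenterCorrespondence(centers={1: (30.0, 0.0), 2: (-45.0, 10.0)})]
print(destination_center(corrs, source_viewpoint=1, source_center=(30.0, 0.0),
                         destination_viewpoint=2))  # (-45.0, 10.0)
```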

[0005] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the information indicative of the plurality of viewport centers indicates the second viewport and the first viewport include corresponding viewport centers.

[0006] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the information indicative of the plurality of viewport centers is coded as a pair in a viewport center point correspondence (vcpc) sample entry.

[0007] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the information indicative of the plurality of viewport centers is coded as a set containing a plurality of viewport centers in a vcpc sample entry.

[0008] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the information indicative of the plurality of viewport centers is coded in a timed metadata track related to the plurality of viewpoints.

[0009] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the information indicative of the plurality of viewport centers is coded in a sphere region structure.

[0010] In an embodiment, the disclosure includes a method implemented in an encoder. The method comprises receiving, at a processor of the encoder, a VR video signal filmed from a plurality of viewpoints. The method further comprises determining, by the processor, a correspondence between viewport centers for the viewpoints. The method further comprises encoding, by the processor, the correspondence between the viewport centers for the viewpoints in a bitstream. The method further comprises transmitting, by a transmitter of the encoder, the bitstream containing the correspondence between the viewport centers for the viewpoints to support viewpoint transitions when displaying the VR video signal. In some VR systems viewpoints receive a default viewport. The effect is that a user looking at a first object at a first viewpoint via a first viewport can switch to a second viewpoint. However, when the switch is made, the user has to manually reorient from the default viewport to a second viewport in order to find the object being watched. The present disclosure includes a mechanism to signal correspondences between viewport centers of related viewpoints. In this way, a user viewing an object at a first viewpoint can automatically be reoriented to that object upon switching to a second viewpoint based on the viewport center correspondences between the viewpoints. Accordingly, the mechanisms support increased functionality at the decoder. Further, some systems indicate correspondences between viewports according to spatial regions. However, viewpoint center points can be encoded using less data than encoding spatial regions. As such, the present disclosure supports increased coding efficiency. Hence, the disclosed mechanisms provide for reduced memory usage at the encoder and the decoder, as well as the reduced network resource usage to communicate such data.

[0011] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the correspondence between the viewport centers is coded as a pair in a vcpc sample entry.

[0012] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the correspondence between the viewport centers is coded as a set containing a plurality of viewport centers in a vcpc sample entry.

[0013] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the correspondence between the viewport centers is coded in a timed metadata track related to the plurality of viewpoints.

[0014] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the correspondence between the viewport centers is coded in a sphere region structure.

[0015] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the correspondence between the viewport centers indicates a correspondence between a center point of a source viewport at a source viewpoint and a center point of a destination viewport at a destination viewpoint to maintain a consistent object view upon viewpoint switching.

[0016] In an embodiment, the disclosure includes a method implemented in a decoder. The method comprises receiving, by a receiver of the decoder, a bitstream including at least a portion of a coded VR video filmed from a plurality of viewpoints and including a correspondence between viewport centers for the viewpoints. The method further comprises decoding, by a processor of the decoder, the portion of the VR video at a center point of a source viewport at a source viewpoint. The method further comprises forwarding, by the processor, the portion of the VR video at the source viewport toward a display. The method further comprises determining, by the processor, to switch from the source viewpoint to a destination viewpoint. The method further comprises determining, by the processor, a destination viewport at the destination viewpoint based on the source viewport and the correspondence between viewport centers for the viewpoints. The method further comprises decoding, by the processor, the portion of the VR video at a center point of the destination viewport at the destination viewpoint. The method further comprises forwarding, by the processor, the portion of the VR video at the destination viewport toward the display. In some VR systems viewpoints receive a default viewport. The effect is that a user looking at a first object at a first viewpoint via a first viewport can switch to a second viewpoint. However, when the switch is made, the user has to manually reorient from the default viewport to a second viewport in order to find the object being watched. The present disclosure includes a mechanism to signal correspondences between viewport centers of related viewpoints. In this way, a user viewing an object at a first viewpoint can automatically be reoriented to that object upon switching to a second viewpoint based on the viewport center correspondences between the viewpoints. Accordingly, the mechanisms support increased functionality at the decoder. Further, some systems indicate correspondences between viewports according to spatial regions. However, viewpoint center points can be encoded using less data than encoding spatial regions. As such, the present disclosure supports increased coding efficiency. Hence, the disclosed mechanisms provide for reduced memory usage at the encoder and the decoder, as well as the reduced network resource usage to communicate such data.

[0017] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the correspondence between the viewport centers is coded as a pair in a vcpc sample entry.

[0018] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the correspondence between the viewport centers is coded as a set containing a plurality of viewport centers in a vcpc sample entry.

[0019] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the correspondence between the viewport centers is coded in a timed metadata track related to the plurality of viewpoints.

[0020] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the correspondence between the viewport centers is coded in a sphere region structure.

[0021] In an embodiment, the disclosure includes a video coding device comprising a processor, a receiver coupled to the processor, and a transmitter coupled to the processor, the processor, receiver, and transmitter configured to perform the method of any of the preceding aspects.

[0022] In an embodiment, the disclosure includes a non-transitory computer readable medium comprising a computer program product for use by a video coding device, the computer program product comprising computer executable instructions stored on the non-transitory computer readable medium such that when executed by a processor cause the video coding device to perform the method of any of the preceding aspects.

[0023] In an embodiment, the disclosure includes an encoder comprising a receiving means for receiving a VR video signal filmed from a plurality of viewpoints. The encoder further comprises a correspondence determination means for determining a correspondence between viewport centers for the viewpoints. The encoder further comprises an encoding means for encoding the correspondence between the viewport centers for the viewpoints in a bitstream. The encoder further comprises a transmitting means for transmitting the bitstream containing the correspondence between the viewport centers for the viewpoints to support viewpoint transitions when displaying the VR video signal.

[0024] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the encoder is further configured to perform the method of any of the preceding aspects.

[0025] In an embodiment, the disclosure includes a decoder comprising a receiving means for receiving a bitstream including at least a portion of a coded VR video filmed from a plurality of viewpoints and including a correspondence between viewport centers for the viewpoints. The decoder further comprises a decoding means for decoding the portion of the VR video at a center point of a source viewport at a source viewpoint, and decoding the portion of the VR video at a center point of a destination viewport at a destination viewpoint. The decoder further comprises a determining means for determining to switch from the source viewpoint to the destination viewpoint, and determining the destination viewport at the destination viewpoint based on the source viewport and the correspondence between viewport centers for the viewpoints. The decoder further comprises a forwarding means for forwarding the portion of the VR video at the source viewport toward a display, and forwarding the portion of the VR video at the destination viewport toward the display.

[0026] Optionally, in any of the preceding aspects, another implementation of the aspect provides, wherein the decoder is further configured to perform the method of any of the preceding aspects.

[0027] For the purpose of clarity, any one of the foregoing embodiments may be combined with any one or more of the other foregoing embodiments to create a new embodiment within the scope of the present disclosure.

[0028] These and other features will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0029] For a more complete understanding of this disclosure, reference is now made to the following brief description, taken in connection with the accompanying drawings and detailed description, wherein like reference numerals represent like parts.

[0030] FIG. 1 is a schematic diagram of an example system for VR based video coding.

[0031] FIG. 2 is a flowchart of an example method of coding a VR picture bitstream.

[0032] FIG. 3 is a flowchart of an example method of coding a video signal.

[0033] FIG. 4 is a schematic diagram of an example coding and decoding (codec) system for video coding.

[0034] FIG. 5 is a schematic diagram illustrating an example video encoder.

[0035] FIG. 6 is a schematic diagram illustrating an example video decoder.

[0036] FIG. 7 is a schematic diagram illustrating an example system for capturing VR video from multiple viewpoints.

[0037] FIG. 8 is a schematic diagram of a pair of viewpoints with corresponding viewport center points.

[0038] FIG. 9 is a schematic diagram of a set of viewpoints with corresponding viewport center points.

[0039] FIG. 10 is a schematic diagram of an example VR video file for multiple viewpoints.

[0040] FIG. 11 is an embodiment of a method of displaying a VR video at a decoder based on a viewport center point correspondence between multiple viewpoints.

[0041] FIG. 12 is another embodiment of a method of displaying a VR video at a decoder based on a viewport center point correspondence between multiple viewpoints.

[0042] FIG. 13 is an embodiment of a method of signaling a viewport center point correspondence between multiple viewpoints in a VR video from an encoder.

[0043] FIG. 14 is a schematic diagram of an example video coding device.

[0044] FIG. 15 is a schematic diagram of an embodiment of a system for signaling a viewport center point correspondence between multiple viewpoints in a VR video.

DETAILED DESCRIPTION

[0045] It should be understood at the outset that although an illustrative implementation of one or more embodiments is provided below, the disclosed systems and/or methods may be implemented using any number of techniques, whether currently known or in existence. The disclosure should in no way be limited to the illustrative implementations, drawings, and techniques illustrated below, including the exemplary designs and implementations illustrated and described herein, but may be modified within the scope of the appended claims along with their full scope of equivalents.

[0046] Video coding standards include International Telecommunication Union Telecommunication Standardization Sector (ITU-T) document H.261, International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Motion Picture Experts Group (MPEG)-1 Part 2, ITU-T H.262 or ISO/IEC MPEG-2 Part 2, ITU-T H.263, ISO/IEC MPEG-4 Part 2, Advanced Video Coding (AVC), also known as ITU-T H.264 or ISO/IEC MPEG-4 Part 10, and High Efficiency Video Coding (HEVC), also known as ITU-T H.265 or MPEG-H Part 2. AVC includes extensions such as Scalable Video Coding (SVC), Multiview Video Coding (MVC) and Multiview Video Coding plus Depth (MVC+D), and three dimensional (3D) AVC (3D-AVC). HEVC includes extensions such as Scalable HEVC (SHVC), Multiview HEVC (MV-HEVC), and 3D HEVC (3D-HEVC).

[0047] File format standards include the ISO base media file format (ISOBMFF) (ISO/IEC 14496-12, hereinafter “ISO/IEC 14496-12”) and other file format standards derived from ISOBMFF, including MPEG-4 file format (ISO/IEC 14496-14), 3rd Generation Partnership Project (3GPP) file format (3GPP TS 26.244), and AVC file format (ISO/IEC 14496-15, hereinafter “ISO/IEC 14496-15”). Thus, ISO/IEC 14496-12 specifies the ISO base media file format. Other documents extend the ISO base media file format for specific applications. For instance, ISO/IEC 14496-15 describes the carriage of Network Abstraction Layer (NAL) unit structured video in the ISO base media file format. H.264/AVC and HEVC, as well as their extensions, are examples of NAL unit structured video. ISO/IEC 14496-15 includes sections describing the carriage of H.264/AVC NAL units. Additionally, section 8 of ISO/IEC 14496-15 describes the carriage of HEVC NAL units. Thus, section 8 of ISO/IEC 14496-15 is said to describe the HEVC file format.

[0048] ISOBMFF is used as the basis for many codec encapsulation formats, such as the AVC File Format, as well as for many multimedia container formats, such as the MPEG-4 File Format, the 3GPP File Format, and the DVB File Format. In addition to continuous media, such as audio and video, static media, such as images, as well as metadata, can be stored in a file conforming to ISOBMFF. Files structured according to ISOBMFF may be used for many purposes, including local media file playback, progressive downloading of a remote file, segments for Dynamic Adaptive Streaming over Hyper Text Transfer Protocol (HTTP) (DASH), containers for content to be streamed and corresponding packetization instructions, and recording of received real-time media streams. Thus, although designed for storage, ISOBMFF can be employed for streaming, e.g., for progressive download or DASH. For streaming purposes, the movie fragments defined in ISOBMFF can be used.

[0049] Such file formats and streaming mechanisms can be employed to encode, signal, decode, and display a VR video. In some cases, a VR video can be recorded from multiple viewpoints. As used herein, a viewpoint is the position of a camera used to capture video. For example, multiple cameras can be positioned at multiple locations to record a scene, an event, etc. In a VR context, such cameras may include a camera array and/or fisheye camera(s) capable of capturing wide angle video. For example, a VR camera mechanism can capture a sphere of video, or sub-portions thereof. Only a portion of the sphere may be displayed to a user. Upon viewing, a user can control a viewing orientation from the viewpoint. This allows the user to react to the filmed environment as if the user were present at the viewpoint at the time of filming. When multiple viewpoints are employed, the user may be allowed to switch between the viewpoints. This allows the user to virtually move around the scene. As an example, a VR video can be taken of a basketball game from multiple viewpoints on, around, and/or above the court. In this case, a user may be allowed to view the game from a viewpoint of choice and at an orientation/angle of choice from the selected viewpoint.

[0050] A default viewing orientation/angle can be employed for each viewpoint. Accordingly, when a user switches to a viewpoint, the decoder can employ the default angle to orient the user until the user can select the desired viewing orientation. This implementation has certain drawbacks. For example, a user may wish to pay attention to a particular object in a scene, such as a basketball or a particular player in a basketball game. When default viewing orientations are employed, the user’s viewing angle is reset to the default value each time the user switches between viewpoints. Accordingly, a user viewing a basketball at a first viewpoint would be reoriented to a default angle upon switching to a second viewpoint. This would likely result in losing sight of the basketball. The user would then likely have to search for the current location of the basketball from the new viewpoint. The result is that default viewing orientations may create discontinuities in a user’s viewing experience and create a poor viewing experience in some cases.

[0051] Disclosed herein are mechanisms to encode viewport center point (also referred to herein as viewport centers) correspondences between VR viewpoints. For example, video data related to the viewpoints may be included in tracks of a video file. A timed metadata track that contains data relevant to multiple viewpoints can also be included in the video file. Viewport center point correspondences between the viewpoints may be included in the timed metadata track and/or in the tracks containing video data related to the associated viewpoints. Such information can indicate correspondences between viewpoint pairs and/or viewpoint sets. Specifically, such information can denote that a first viewport at a first viewpoint orients toward the same location in the VR space as a corresponding second viewport at a second viewpoint. The correspondence is indicated by including pairs and/or sets of center points of the corresponding viewports at the associated viewpoints. Signaling viewport center point correspondences may be employed as an alternative to viewpoint spatial region correspondences. Specifically, viewport center point correspondences can be encoded using fewer bits and associated actions can be computed more simply than when denoting similar information using viewpoint spatial region correspondences (e.g., described in terms of viewport boundaries and/or viewpoint angles). Using such information, a user can switch between viewpoints. The decoder can check the relevant viewpoint and/or metadata track(s) to determine a correspondence between a center point of a source viewport at a source viewpoint and a center point of a destination viewport at a destination viewpoint. Accordingly, the decoder can automatically orient the user toward a destination viewport at the destination viewpoint that corresponds to the source viewport selected by the user at the source viewpoint. As a specific example, a user watching a basketball at a source viewpoint can be automatically oriented toward the basketball upon switching to the destination viewpoint. This allows the decoder to provide a consistent view to a user upon switching between viewpoints. As a specific example, a viewport center points correspondence (Vcpc) sample function (VcpcSample()) and/or a Vcpc sample entry function (VcpcSampleEntry()) can be employed to describe the viewport center correspondences between viewpoints. For example, the VcpcSample() and/or VcpcSampleEntry() can be included in a sphere region structure (SphereRegionStruct()) object and/or in a sphere region sample (SphereRegionSample()) object, which can then be included in a timed metadata track and/or in corresponding video tracks. Such data can be included at the beginning of the relevant track to indicate initial correspondences between the viewpoints. In the event a viewpoint moves (e.g., a mobile camera), viewport center point correspondence information can be updated at a corresponding temporal location in the relevant track to denote such changes over time.
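
To make the shape of such a signaled correspondence concrete, the following Python sketch serializes and parses a hypothetical "vcpc" metadata sample containing a set of corresponding viewport centers, one per viewpoint. The field names, the byte layout, and the 2^-16 degree angle units are assumptions for illustration; the normative syntax would be defined by the file format (e.g., an OMAF sphere region structure), not by this sketch.

```python
# Hypothetical serialization of a viewport-center-correspondence sample.
import struct
from typing import List, Tuple

# (viewpoint_id, center_azimuth, center_elevation); angles in units of 2^-16 degrees (assumed).
CenterEntry = Tuple[int, int, int]

def write_vcpc_sample(entries: List[CenterEntry]) -> bytes:
    """Pack a set of corresponding viewport centers, one entry per viewpoint."""
    payload = struct.pack(">B", len(entries))  # number of viewpoints in the correspondence set
    for viewpoint_id, azimuth, elevation in entries:
        payload += struct.pack(">Hii", viewpoint_id, azimuth, elevation)
    return payload

def read_vcpc_sample(data: bytes) -> List[CenterEntry]:
    """Inverse of write_vcpc_sample, as a decoder-side parser sketch."""
    (count,) = struct.unpack_from(">B", data, 0)
    entries, offset = [], 1
    for _ in range(count):
        viewpoint_id, azimuth, elevation = struct.unpack_from(">Hii", data, offset)
        entries.append((viewpoint_id, azimuth, elevation))
        offset += 10  # 2-byte id + two 4-byte signed angles
    return entries

sample = write_vcpc_sample([(1, 30 << 16, 0), (2, -45 << 16, 10 << 16)])
assert read_vcpc_sample(sample) == [(1, 30 << 16, 0), (2, -45 << 16, 10 << 16)]
```

Placing such a sample at the start of the metadata track would convey the initial correspondence, and later samples in the same track could carry updated centers when a viewpoint moves, as described above.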

[0052] FIG. 1 is a schematic diagram of an example system 100 for VR based video coding. System 100 includes a multi-directional camera 101, a VR coding device 104 including an encoder 103, a decoder 107, and a rendering device 109. The multi-directional camera 101 comprises an array of camera devices. Each camera device is pointed at a different angle so that the multi-directional camera 101 can take multiple directional video streams of the surrounding environment from a plurality of angles. For example, multi-directional camera 101 can take video of the environment as a sphere with the multi-directional camera 101 at the center of the sphere. As used herein, sphere and spherical video refers to both a geometrical sphere and sub-portions of a geometrical sphere, such as spherical caps, spherical domes, spherical segments, etc. For example, a multi-directional camera 101 may take one hundred and eighty degree video to cover half of the environment so that a production crew can remain behind the multi-directional camera 101. A multi-directional camera 101 can also take video in three hundred sixty degrees (or any sub-portion thereof). However, a portion of the floor under the multi-directional camera 101 may be omitted, which results in video of less than a perfect sphere. Hence, the term sphere, as used herein, is a general term used for clarity of discussion and should not be considered limiting from a geometrical standpoint. It should be noted that in some examples a multi-directional camera 101 may include a camera that includes one or more fisheye lenses (e.g., instead of an array of cameras).

[0053] Video from the multi-directional camera 101 is forwarded to the VR coding device 104. A VR coding device 104 may be a computing system including specialized VR coding software. The VR coding device 104 may include an encoder 103 (a.k.a., a video encoder). In some examples, the encoder 103 can also be included in a separate computer system from the VR coding device 104. The VR coding device 104 is configured to convert the multiple directional video streams into a single multiple directional video stream including the entire recorded area from all relevant angles. This conversion may be referred to as image stitching. For example, frames from each video stream that are captured at the same time can be stitched together to create a single spherical image. A spherical video stream can then be created from the spherical images. For clarity of discussion, it should be noted that the terms frame, picture, and image may be used interchangeably herein unless specifically noted.

[0054] The spherical video stream can then be forwarded to the encoder 103 for compression. An encoder 103 is a device and/or program capable of converting information from one format to another for purposes of standardization, speed, and/or compression. Standardized encoders 103 are configured to encode rectangular and/or square images. Accordingly, the encoder 103 is configured to map each spherical image from the spherical video stream into a plurality of rectangular sub-pictures. The sub-pictures can then be placed in separate sub-picture video streams. As such, each sub-picture video stream displays a stream of images over time as recorded from a sub-portion of the spherical video stream. The encoder 103 can then encode each sub-picture video stream to compress the video stream to a manageable file size. The encoding process is discussed in more detail below. In general, the encoder 103 partitions each frame from each sub-picture video stream into pixel blocks, compresses the pixel blocks by inter-prediction and/or intra-prediction to create coding blocks including prediction blocks and residual blocks, applies transforms to the residual blocks for further compression, and applies various filters to the blocks. The compressed blocks as well as corresponding syntax are stored in bitstream(s), for example in ISOBMFF and/or in omnidirectional media format (OMAF).

[0055] The VR coding device 104 may store the encoded bitstream(s) in memory, locally, and/or on a server, for communication to a decoder 107 on demand. The data can be forwarded via a network 105, which may include the Internet, a mobile telecommunications network (e.g., a long term evolution (LTE) based data network), or other data communication system.

[0056] The decoder 107 (a.k.a., a video decoder) is a device at a user’s location that is configured to reverse the coding process to reconstruct the sub-picture video streams from the encoded bitstream(s). The decoder 107 also merges the sub-picture video streams to reconstruct the spherical video stream. The spherical video stream, or sub-portions thereof, can then be forwarded to the rendering device 109. The rendering device 109 is a device configured to display the spherical video stream to the user. For example, the rendering device 109 may include an HMD that attaches to the user’s head and covers the user’s eyes. The rendering device 109 may include a screen for each eye, cameras, motion sensors, speakers, etc. and may communicate with the decoder 107 via wireless and/or wired connections. The rendering device 109 may display a sub-portion of the spherical video stream to the user. The sub-portion shown is based on a field of view (FOV) and/or viewport of the rendering device 109. A FOV is the observable area of the recorded environment that is displayed to a user by the rendering device 109. The FOV can be described as a conical projection between a user’s eye and extending into the virtual environment. A viewport is a two dimensional plane upon which a three dimensional environment is projected. Accordingly, a viewport describes the area of a portion of the virtual environment displayed on a screen or screens of a rendering device, while a FOV describes the portion of the virtual environment seen by the user. Hence, viewport and FOV may be used interchangeably in many cases, but may include different technical details. For example, a FOV can be described in terms of pixels, coordinates, and/or bounds while a viewport can be described in terms of angles. The rendering device 109 may change the position of the FOV/viewport based on user head movement by employing the motion tracking sensors. This allows the user to see different portions of the spherical video stream depending on head movement. Further, the rendering device 109 may offset the FOV for each eye based on the user’s interpupillary distance (IPD) to create the impression of a three dimensional space. In other cases, the rendering device 109 may be a computer screen or television screen that changes a FOV/viewport based on user input.

[0057] FIG. 2 is a flowchart of an example method 200 of coding a VR picture bitstream as a plurality of sub-picture bitstreams, for example by employing the components of system 100. At step 201, a multi-directional camera set, such as multi-directional camera 101, is used to capture multiple directional video streams. The multiple directional video streams include views of an environment at various angles. For example, the multiple directional video streams may capture video from three hundred sixty degrees, one hundred eighty degrees, two hundred forty degrees, etc. around the camera in the horizontal plane. The multiple directional video streams may also capture video from three hundred sixty degrees, one hundred eighty degrees, two hundred forty degrees, etc. around the camera in the vertical plane. The result is to create video that includes information sufficient to cover a spherical area around the camera over some period of time.

[0058] At step 203, the multiple directional video streams are synchronized in the time domain. Specifically, each directional video stream includes a series of images taken at a corresponding angle. The multiple directional video streams are synchronized by ensuring frames from each directional video stream that were captured at the same time domain position are processed together. The frames from the directional video streams can then be stitched together in the space domain to create a spherical video stream. Hence, each frame of the spherical video stream contains data taken from the frames of all the directional video streams that occur at a common temporal position. It should be noted that a fisheye lens may capture a single video stream at a wide angle. Hence, when a fisheye lens is employed, a single multi-directional stream may be captured at step 201, which may allow step 203 to be omitted in some cases.
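
A minimal Python sketch of the time-domain synchronization in step 203 is shown below: frames from each directional stream are grouped by capture timestamp so that one spherical frame can be stitched per time instant. The Frame type and the use of exact timestamp equality are simplifying assumptions.

```python
# Sketch of step 203: group frames across streams by capture time before stitching.
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Frame:
    camera_id: int      # which directional camera captured this frame
    timestamp: float    # capture time in seconds
    pixels: bytes       # raw image payload (placeholder)

def group_by_time(streams: List[List[Frame]]) -> Dict[float, List[Frame]]:
    """Collect, for every capture time, the frames from all directional streams.
    Each group is what a stitcher would combine into one spherical frame."""
    groups: Dict[float, List[Frame]] = defaultdict(list)
    for stream in streams:
        for frame in stream:
            groups[frame.timestamp].append(frame)
    return dict(groups)

streams = [[Frame(0, 0.0, b""), Frame(0, 0.04, b"")],
           [Frame(1, 0.0, b""), Frame(1, 0.04, b"")]]
print(sorted(group_by_time(streams)))  # [0.0, 0.04]: one stitch group per instant
```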

[0059] At step 205, the spherical video stream is mapped into rectangular sub-picture video streams. This process may also be referred to as projecting the spherical video stream into rectangular sub-picture video streams. As noted above, encoders and decoders are generally designed to encode rectangular and/or square frames. Accordingly, mapping the spherical video stream into rectangular sub-picture video streams creates video streams that can be encoded and decoded by non-VR specific encoders and decoders, respectively. It should be noted that steps 203 and 205 are specific to VR video processing, and hence may be performed by specialized VR hardware, software, or combinations thereof.

[0060] At step 207, the rectangular sub-picture video streams can be forwarded to an encoder, such as encoder 103. The encoder then encodes the sub-picture video streams as sub-picture bitstreams in a corresponding media file format. Specifically, each sub-picture video stream can be treated by the encoder as a video signal. The encoder can encode each frame of each sub-picture video stream via inter-prediction, intra-prediction, etc. Such encoding and corresponding decoding as well as encoders and decoders are discussed in detail with respect to the FIGS below. Regarding file format, the sub-picture video streams can be stored in ISOBMFF. For example, the sub-picture video streams are captured at a specified resolution. The sub-picture video streams can then be downsampled to various lower resolutions for encoding. Each resolution can be referred to as a representation. Lower quality representations lose image clarity while reducing file size. Accordingly, lower quality representations can be transmitted to a user using fewer network resources (e.g., time, bandwidth, etc.) than higher quality representations with an attendant loss of visual quality. Each representation can be stored in a corresponding set of tracks. Hence, tracks can be sent to a user, where the tracks include the sub-picture bitstreams at various resolutions (e.g., visual quality).

[0061] At step 209, the sub-picture bitstreams can be sent to the decoder as tracks. In some examples, all sub-picture bitstreams are transmitted at the same quality by transmitting tracks from the same representation. In other cases, the tracks containing sub-picture bitstreams with data in the user’s FOV may be sent at higher resolutions by selecting higher quality representations. Tracks containing sub-picture bitstreams with areas outside the user’s FOV can be sent at progressively lower resolutions by selecting lower quality representations. This may be referred to as viewport dependent coding. The tracks may include relatively short video segments (e.g., about three seconds), and hence the representations selected for particular areas of the video can change over time based on changes in FOV. This allows quality to change as the user’s FOV changes.
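
A small Python sketch of viewport-dependent selection in step 209 follows: tracks whose sub-pictures overlap the user's FOV are requested at a high-quality representation, the rest at a lower one. The azimuth-only overlap test and the two-level quality ladder are simplifying assumptions.

```python
# Sketch of step 209: pick a representation per sub-picture track based on FOV overlap.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class SubPictureTrack:
    track_id: int
    azimuth_range: Tuple[float, float]  # horizontal extent covered by the sub-picture, in degrees

def select_representations(tracks: List[SubPictureTrack],
                           fov_azimuth: Tuple[float, float]) -> Dict[int, str]:
    """Map each track to a representation label based on whether it overlaps the FOV."""
    lo, hi = fov_azimuth
    choices = {}
    for t in tracks:
        a, b = t.azimuth_range
        overlaps = a < hi and b > lo   # simple interval intersection test
        choices[t.track_id] = "high" if overlaps else "low"
    return choices

tracks = [SubPictureTrack(1, (0, 90)), SubPictureTrack(2, (90, 180)),
          SubPictureTrack(3, (180, 270)), SubPictureTrack(4, (270, 360))]
print(select_representations(tracks, fov_azimuth=(60, 120)))
# {1: 'high', 2: 'high', 3: 'low', 4: 'low'}
```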

[0062] At step 211, a decoder, such as decoder 107, receives the tracks containing the sub-picture bitstreams. The decoder can then decode the sub-picture bitstreams into sub-picture video streams for display. The decoding process involves the reverse of the encoding process (e.g., using inter-prediction and intra-prediction), and is discussed in more detail with respect to the FIGS below.

[0063] At step 213, the decoder can merge the sub-picture video streams into the spherical video stream for presentation on a rendering device. For example, the decoder can employ a so-called lightweight merging algorithm that selects frames from each sub-picture video stream that occur at the same presentation time and merges them together based on the position and/or angle associated with the corresponding sub-picture video stream. The decoder may also employ filters to smooth edges between the sub-picture video streams, remove artifacts, etc. The decoder can then forward the spherical video stream to a rendering device, such as rendering device 109.
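
The lightweight merge in step 213 can be sketched in Python as pasting decoded sub-pictures that share a presentation time into one canvas according to the region each covers. NumPy, the rectangular packed layout, and the omission of seam filtering and sphere projection are assumptions made to keep the example short.

```python
# Sketch of step 213: merge co-timed sub-pictures into a single frame by position.
import numpy as np
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DecodedSubPicture:
    region: Tuple[int, int]  # (x, y) offset of this sub-picture within the full frame
    samples: np.ndarray      # decoded luma samples for this sub-picture

def merge_frame(sub_pictures: List[DecodedSubPicture], width: int, height: int) -> np.ndarray:
    """Place all sub-pictures of one presentation time into a single frame."""
    frame = np.zeros((height, width), dtype=np.uint8)
    for sp in sub_pictures:
        x, y = sp.region
        h, w = sp.samples.shape
        frame[y:y + h, x:x + w] = sp.samples
    return frame

tiles = [DecodedSubPicture((0, 0), np.full((2, 2), 10, np.uint8)),
         DecodedSubPicture((2, 0), np.full((2, 2), 20, np.uint8))]
print(merge_frame(tiles, width=4, height=2))
```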

[0064] At step 215, the rendering device renders a viewport of the spherical video stream for presentation to the user. As mentioned above, areas of the spherical video stream outside of the FOV at each point in time are not rendered. As such, the user can select and view a sub-portion of the virtual environment as recorded, and hence can experience the virtual environment as if present at the time of recording.

[0065] FIG. 3 is a flowchart of an example method 300 of coding a video signal. For example, method 300 may receive a plurality of sub-picture video streams from step 205 of method 200. Method 300 treats each sub-picture video stream as a video signal input. Method 300 applies steps 301-317 to each sub-picture video stream in order to implement steps 207-211 of method 200. Hence, the output video signal from method 300 includes the decoded sub-picture video streams, which can be merged and displayed according to steps 213 and 215 of method 200. As such, method 300 can be implemented on a system 100.

[0066] Method 300 encodes a video signal, for example including sub-picture video streams, at an encoder. The encoding process compresses the video signal by employing various mechanisms to reduce the video file size. A smaller file size allows the compressed video file to be transmitted toward a user, while reducing associated bandwidth overhead. The decoder then decodes the compressed video file to reconstruct the original video signal for display to an end user. The decoding process generally mirrors the encoding process to allow the decoder to consistently reconstruct the video signal.

[0067] At step 301, the video signal is input into the encoder. For example, the video signal may be an uncompressed video file stored in memory. As another example, the video file may be captured by a video capture device, such as a video camera, and encoded to support live streaming of the video. The video file may include both an audio component and a video component. The video component contains a series of image frames that, when viewed in a sequence, gives the visual impression of motion. The frames contain pixels that are expressed in terms of light, referred to herein as luma components (or luma samples), and color, which is referred to as chroma components (or color samples). It should be noted that a frame may also be referred to as a picture, a sub-frame as a sub-picture, etc.

[0068] At step 303, the video signal is partitioned into blocks. Partitioning includes subdividing the pixels in each frame into square and/or rectangular blocks for compression. For example, in HEVC (also known as H.265 and MPEG-H Part 2) the frame can first be divided into coding tree units (CTUs), which are blocks of a predefined size (e.g., sixty four pixels by sixty four pixels). The CTUs contain both luma and chroma samples. Coding trees may be employed to divide the CTUs into blocks and then recursively subdivide the blocks until configurations are achieved that support further encoding. For example, luma components of a frame may be subdivided until the individual blocks contain relatively homogenous lighting values. Further, chroma components of a frame may be subdivided until the individual blocks contain relatively homogenous color values. Accordingly, partitioning mechanisms vary depending on the content of the video frames.

[0069] At step 305, various compression mechanisms are employed to compress the image blocks partitioned at step 303. For example, inter-prediction and/or intra-prediction may be employed. Inter-prediction is designed to take advantage of the fact that objects in a common scene tend to appear in successive frames. Accordingly, a block depicting an object in a reference frame need not be repeatedly described in adjacent frames. Specifically, an object, such as a table, may remain in a constant position over multiple frames. Hence, the table is described once and adjacent frames can refer back to the reference frame. Pattern matching mechanisms may be employed to match objects over multiple frames. Further, moving objects may be represented across multiple frames, for example due to object movement or camera movement. As a particular example, a video may show an automobile that moves across the screen over multiple frames. Motion vectors can be employed to describe such movement, or lack thereof. A motion vector is a two-dimensional vector that provides an offset from the coordinates of an object in a frame to the coordinates of the object in a reference frame. As such, inter-prediction can encode an image block in a current frame as a set of motion vectors indicating an offset from a corresponding block in a reference frame.
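
The following Python toy illustrates the inter-prediction idea above: a block in the current frame is represented by a motion vector pointing into a reference frame plus a residual. NumPy, the 4x4 block size, and the integer-only motion vector are assumptions used only to keep the example short.

```python
# Toy inter-prediction: prediction = reference block at a motion-vector offset.
import numpy as np

def predict_block(reference: np.ndarray, x: int, y: int,
                  mv: tuple, size: int = 4) -> np.ndarray:
    """Fetch the prediction for the block whose top-left corner is (x, y),
    using motion vector mv = (dx, dy) pointing into the reference frame."""
    dx, dy = mv
    return reference[y + dy:y + dy + size, x + dx:x + dx + size]

reference = np.arange(64, dtype=np.int16).reshape(8, 8)  # stand-in reference frame
# The content of the current block at (x=3, y=2) is found one sample to the
# left in the reference frame, so the motion vector is (-1, 0).
current_block = reference[2:6, 2:6]
prediction = predict_block(reference, x=3, y=2, mv=(-1, 0))
residual = current_block - prediction   # all zeros here: a perfect match
assert np.array_equal(prediction + residual, current_block)
```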

[0070] Intra-prediction encodes blocks in a common frame. Intra-prediction takes advantage of the fact that luma and chroma components tend to cluster in a frame. For example, a patch of green in a portion of a tree tends to be positioned adjacent to similar patches of green. Intra-prediction employs multiple directional prediction modes (e.g., thirty-three in HEVC), a planar mode, and a direct current (DC) mode. The directional modes indicate that a current block is similar/the same as samples of a neighbor block in a corresponding direction. Planar mode indicates that a series of blocks along a row/column (e.g., a plane) can be interpolated based on neighbor blocks at the edges of the row. Planar mode, in effect, indicates a smooth transition of light/color across a row/column by employing a relatively constant slope in changing values. DC mode is employed for boundary smoothing and indicates that a block is similar/the same as an average value associated with samples of all the neighbor blocks associated with the angular directions of the directional prediction modes. Accordingly, intra-prediction blocks can represent image blocks as various relational prediction mode values instead of the actual values. Further, inter-prediction blocks can represent image blocks as motion vector values instead of the actual values. In either case, the prediction blocks may not exactly represent the image blocks in some cases. Any differences are stored in residual blocks. Transforms may be applied to the residual blocks to further compress the file.
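
As a companion to the description of DC mode above, the Python toy below predicts a block as the average of its already-reconstructed top and left neighbor samples. NumPy, the 4x4 block size, and the simple rounding are assumptions for brevity and do not reproduce the exact HEVC derivation.

```python
# Toy DC-mode intra-prediction: fill the block with the mean of neighboring samples.
import numpy as np

def dc_predict(top_neighbors: np.ndarray, left_neighbors: np.ndarray,
               size: int = 4) -> np.ndarray:
    """Fill a size x size block with the mean of the neighboring samples."""
    dc_value = int(round(np.concatenate([top_neighbors, left_neighbors]).mean()))
    return np.full((size, size), dc_value, dtype=np.int16)

top = np.array([100, 102, 101, 99], dtype=np.int16)
left = np.array([98, 100, 103, 101], dtype=np.int16)
print(dc_predict(top, left))  # every sample predicted as roughly 100
```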

[0071] At step 307, various filtering techniques may be applied. In HEVC, the filters are applied according to an in-loop filtering scheme. The block based prediction discussed above may result in the creation of blocky images at the decoder. Further, the block based prediction scheme may encode a block and then reconstruct the encoded block for later use as a reference block. The in-loop filtering scheme iteratively applies noise suppression filters, de-blocking filters, adaptive loop filters, and sample adaptive offset (SAO) filters to the blocks/frames. These filters mitigate such blocking artifacts so that the encoded file can be accurately reconstructed. Further, these filters mitigate artifacts in the reconstructed reference blocks so that artifacts are less likely to create additional artifacts in subsequent blocks that are encoded based on the reconstructed reference blocks.

[0072] Once the video signal has been partitioned, compressed, and filtered, the resulting data is encoded in a bitstream at step 309. The bitstream includes the data discussed above as well as any signaling data (e.g., syntax) desired to support proper video signal reconstruction at the decoder. For example, such data may include partition data, prediction data, residual blocks, and various flags providing coding instructions to the decoder. The bitstream may be stored in memory for transmission toward a decoder upon request, for example as a track and/or track fragment in ISOBMFF. The bitstream may also be broadcast and/or multicast toward a plurality of decoders. The creation of the bitstream is an iterative process. Accordingly, steps 301, 303, 305, 307, and 309 may occur continuously and/or simultaneously over many frames and blocks. The order shown is presented for clarity and ease of discussion, and is not intended to limit the video coding process to a particular order.

[0073] The decoder receives the bitstream and begins the decoding process at step 311. For example, the decoder can employ an entropy decoding scheme to convert the bitstream into corresponding syntax and video data. The decoder employs the syntax data from the bitstream to determine the partitions for the frames at step 311. The partitioning should match the results of block partitioning at step 303. Entropy encoding/decoding, which may be employed in step 311, is now described. The encoder makes many choices during the compression process, such as selecting block partitioning schemes from several possible choices based on the spatial positioning of values in the input image(s). Signaling the exact choices may employ a large number of bins. As used herein, a bin is a binary value that is treated as a variable (e.g., a bit value that may vary depending on context). Entropy coding allows the encoder to discard any options that are clearly not viable for a particular case, leaving a set of allowable options. Each allowable option is then assigned a code word. The length of the code word is based on the number of allowable options (e.g., one bin for two options, two bins for three to four options, etc.). The encoder then encodes the code word for the selected option. This scheme reduces the size of the code words as the code words are as big as desired to uniquely indicate a selection from a small sub-set of allowable options as opposed to uniquely indicating the selection from a potentially large set of all possible options. The decoder then decodes the selection by determining the set of allowable options in a similar manner to the encoder. By determining the set of allowable options, the decoder can read the code word and determine the selection made by the encoder.
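
The benefit of restricting signaling to the allowable options, as described above, can be illustrated numerically. In the Python sketch below, the specific option counts are arbitrary assumptions; the point is only that the code word length grows with the number of allowable options rather than the total number of options.

```python
# Numerical sketch: bins needed to signal a choice among the allowable options.
from math import ceil, log2

def bins_needed(num_allowable_options: int) -> int:
    """Fixed-length bins needed to distinguish the allowable options."""
    return max(1, ceil(log2(num_allowable_options)))

total_options = 35        # e.g., every option that exists in the specification
allowable_options = 3     # options the context has not already ruled out
print(bins_needed(total_options))      # 6 bins if every option had to be signaled
print(bins_needed(allowable_options))  # 2 bins once impossible options are discarded
```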

[0074] At step 313, the decoder performs block decoding. Specifically, the decoder employs reverse transforms to generate residual blocks. Then the decoder employs the residual blocks and corresponding prediction blocks to reconstruct the image blocks according to the partitioning. The prediction blocks may include both intra-prediction blocks and inter-prediction blocks as generated at the encoder at step 305. The reconstructed image blocks are then positioned into frames of a reconstructed video signal according to the partitioning data determined at step 311. Syntax for step 313 may also be signaled in the bitstream via entropy coding as discussed above.

[0075] At step 315, filtering is performed on the frames of the reconstructed video signal in a manner similar to step 307 at the encoder. For example, noise suppression filters, de-blocking filters, adaptive loop filters, and SAO filters may be applied to the frames to remove blocking artifacts. Once the frames are filtered, the video signal can be forwarded for merging at step 317 and then output to a display, such as an HMD, for viewing by an end user.

[0076] FIG. 4 is a schematic diagram of an example coding and decoding (codec) system 400 for video coding. Specifically, codec system 400 provides functionality to support encoding and decoding sub-picture video streams according to methods 200 and 300. Further, codec system 400 can be employed to implement an encoder 103 and/or a decoder 107 of system 100.

[0077] Codec system 400 is generalized to depict components employed in both an encoder and a decoder. Codec system 400 receives and partitions frames from a video signal (e.g., including a sub-picture video stream) as discussed with respect to steps 301 and 303 in operating method 300, which results in a partitioned video signal 401. Codec system 400 then compresses the partitioned video signal 401 into a coded bitstream when acting as an encoder as discussed with respect to steps 305, 307, and 309 in method 300. When acting as a decoder, codec system 400 generates an output video signal from the bitstream as discussed with respect to steps 311, 313, 315, and 317 in operating method 300. The codec system 400 includes a general coder control component 411, a transform scaling and quantization component 413, an intra-picture estimation component 415, an intra-picture prediction component 417, a motion compensation component 419, a motion estimation component 421, a scaling and inverse transform component 429, a filter control analysis component 427, an in-loop filters component 425, a decoded picture buffer component 423, and a header formatting and context adaptive binary arithmetic coding (CABAC) component 431. Such components are coupled as shown. In FIG. 4, black lines indicate movement of data to be encoded/decoded while dashed lines indicate movement of control data that controls the operation of other components. The components of codec system 400 may all be present in the encoder. The decoder may include a subset of the components of codec system 400. For example, the decoder may include the intra-picture prediction component 417, the motion compensation component 419, the scaling and inverse transform component 429, the in-loop filters component 425, and the decoded picture buffer component 423. These components are now described.

[0078] The partitioned video signal 401 is a captured video sequence that has been partitioned into blocks of pixels by a coding tree. A coding tree employs various split modes to subdivide a block of pixels into smaller blocks of pixels. These blocks can then be further subdivided into smaller blocks. The blocks may be referred to as nodes on the coding tree. Larger parent nodes are split into smaller child nodes. The number of times a node is subdivided is referred to as the depth of the node/coding tree. The divided blocks can be included in coding units (CUs) in some cases. For example, a CU can be a sub-portion of a CTU that contains a luma block, red difference chroma (Cr) block(s), and blue difference chroma (Cb) block(s) along with corresponding syntax instructions for the CU. The split modes may include a binary tree (BT), triple tree (TT), and a quad tree (QT) employed to partition a node into two, three, or four child nodes, respectively, of varying shapes depending on the split modes employed. The partitioned video signal 401 is forwarded to the general coder control component 411, the transform scaling and quantization component 413, the intra-picture estimation component 415, the filter control analysis component 427, and the motion estimation component 421 for compression.

[0079] The general coder control component 411 is configured to make decisions related to coding of the images of the video sequence into the bitstream according to application constraints. For example, the general coder control component 411 manages optimization of bitrate/bitstream size versus reconstruction quality. Such decisions may be made based on storage space/bandwidth availability and image resolution requests. The general coder control component 411 also manages buffer utilization in light of transmission speed to mitigate buffer underrun and overrun issues. To manage these issues, the general coder control component 411 manages partitioning, prediction, and filtering by the other components. For example, the general coder control component 411 may dynamically increase compression complexity to increase resolution and increase bandwidth usage or decrease compression complexity to decrease resolution and bandwidth usage. Hence, the general coder control component 411 controls the other components of codec system 400 to balance video signal reconstruction quality with bitrate concerns. The general coder control component 411 creates control data, which controls the operation of the other components. The control data is also forwarded to the header formatting and CABAC component 431 to be encoded in the bitstream to signal parameters for decoding at the decoder.

[0080] The partitioned video signal 401 is also sent to the motion estimation component 421 and the motion compensation component 419 for inter-prediction. A frame or slice of the partitioned video signal 401 may be divided into multiple video blocks. Motion estimation component 421 and the motion compensation component 419 perform inter-predictive coding of the received video block relative to one or more blocks in one or more reference frames to provide temporal prediction. Codec system 400 may perform multiple coding passes, e.g., to select an appropriate coding mode for each block of video data.

[0081] Motion estimation component 421 and motion compensation component 419 may be highly integrated, but are illustrated separately for conceptual purposes. Motion estimation, performed by motion estimation component 421, is the process of generating motion vectors, which estimate motion for video blocks. A motion vector, for example, may indicate the displacement of a coded object relative to a predictive block. A predictive block is a block that is found to closely match the block to be coded, in terms of pixel difference. A predictive block may also be referred to as a reference block. Such pixel difference may be determined by sum of absolute difference (SAD), sum of square difference (SSD), or other difference metrics. HEVC employs several coded objects including a CTU, coding tree blocks (CTBs), and CUs. For example, a CTU can be divided into CTBs, which can then be divided into CBs for inclusion in CUs. A CU can be encoded as a prediction unit (PU) containing prediction data and/or a transform unit (TU) containing transformed residual data for the CU. The motion estimation component 421 generates motion vectors, PUs, and TUs by using a rate-distortion analysis as part of a rate distortion optimization process. For example, the motion estimation component 421 may determine multiple reference blocks, multiple motion vectors, etc. for a current block/frame, and may select the reference blocks, motion vectors, etc. having the best rate-distortion characteristics. The best rate-distortion characteristics balance both quality of video reconstruction (e.g., amount of data loss by compression) with coding efficiency (e.g., size of the final encoding).
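The rate-distortion trade-off described above is commonly expressed as a Lagrangian cost, cost = distortion + lambda * rate, and the candidate with the lowest cost is selected. The C++ sketch below illustrates that selection; the candidate structure and the lambda parameter are illustrative assumptions, not taken from the disclosure.

#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one motion-estimation candidate: a motion
// vector plus the distortion (e.g., SAD) and estimated bit cost it produces.
struct MotionCandidate {
    int mvX = 0;
    int mvY = 0;
    double distortion = 0.0; // e.g., SAD or SSD against the current block
    double rateBits = 0.0;   // estimated bits needed to signal this choice
};

// Sketch: select the candidate minimizing distortion + lambda * rate.
std::size_t selectBestCandidate(const std::vector<MotionCandidate>& candidates,
                                double lambda) {
    assert(!candidates.empty());
    std::size_t best = 0;
    double bestCost = candidates[0].distortion + lambda * candidates[0].rateBits;
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        double cost = candidates[i].distortion + lambda * candidates[i].rateBits;
        if (cost < bestCost) {
            bestCost = cost;
            best = i;
        }
    }
    return best;
}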

[0082] In some examples, codec system 400 may calculate values for sub-integer pixel positions of reference pictures stored in decoded picture buffer component 423. For example, video codec system 400 may interpolate values of one-quarter pixel positions, one-eighth pixel positions, or other fractional pixel positions of the reference picture. Therefore, motion estimation component 421 may perform a motion search relative to the full pixel positions and fractional pixel positions and output a motion vector with fractional pixel precision. The motion estimation component 421 calculates a motion vector for a PU of a video block in an inter-coded slice by comparing the position of the PU to the position of a predictive block of a reference picture. Motion estimation component 421 outputs the calculated motion vector as motion data to the header formatting and CABAC component 431 for encoding and motion to the motion compensation component 419.

[0083] Motion compensation, performed by motion compensation component 419, may involve fetching or generating the predictive block based on the motion vector determined by motion estimation component 421. Again, motion estimation component 421 and motion compensation component 419 may be functionally integrated, in some examples. Upon receiving the motion vector for the PU of the current video block, motion compensation component 419 may locate the predictive block to which the motion vector points. A residual video block is then formed by subtracting pixel values of the predictive block from the pixel values of the current video block being coded, forming pixel difference values. In general, motion estimation component 421 performs motion estimation relative to luma components, and motion compensation component 419 uses motion vectors calculated based on the luma components for both chroma components and luma components. The predictive block and residual block are forwarded to transform scaling and quantization component 413.

[0084] The partitioned video signal 401 is also sent to intra-picture estimation component 415 and intra-picture prediction component 417. As with motion estimation component 421 and motion compensation component 419, intra-picture estimation component 415 and intra picture prediction component 417 may be highly integrated, but are illustrated separately for conceptual purposes. The intra-picture estimation component 415 and intra-picture prediction component 417 intra-predict a current block relative to blocks in a current frame, as an alternative to the inter-prediction performed by motion estimation component 421 and motion compensation component 419 between frames, as described above. In particular, the intra picture estimation component 415 determines an intra-prediction mode to use to encode a current block. In some examples, intra-picture estimation component 415 selects an appropriate intra-prediction mode to encode a current block from multiple tested intra prediction modes. The selected intra-prediction modes are then forwarded to the header formatting and CABAC component 431 for encoding.

[0085] For example, the intra-picture estimation component 415 calculates rate-distortion values using a rate-distortion analysis for the various tested intra-prediction modes, and selects the intra-prediction mode having the best rate-distortion characteristics among the tested modes. Rate-distortion analysis generally determines an amount of distortion (or error) between an encoded block and an original unencoded block that was encoded to produce the encoded block, as well as a bitrate (e.g., a number of bits) used to produce the encoded block. The intra picture estimation component 415 calculates ratios from the distortions and rates for the various encoded blocks to determine which intra-prediction mode exhibits the best rate-distortion value for the block. In addition, intra-picture estimation component 415 may be configured to code depth blocks of a depth map using a depth modeling mode (DMM) based on rate-distortion optimization (RDO).

[0086] The intra-picture prediction component 417 may generate a residual block from the predictive block based on the selected intra-prediction modes determined by intra-picture estimation component 415 when implemented on an encoder or read the residual block from the bitstream when implemented on a decoder. The residual block includes the difference in values between the predictive block and the original block, represented as a matrix. The residual block is then forwarded to the transform scaling and quantization component 413. The intra-picture estimation component 415 and the intra-picture prediction component 417 may operate on both luma and chroma components.

[0087] The transform scaling and quantization component 413 is configured to further compress the residual block. The transform scaling and quantization component 413 applies a transform, such as a discrete cosine transform (DCT), a discrete sine transform (DST), or a conceptually similar transform, to the residual block, producing a video block comprising residual transform coefficient values. Wavelet transforms, integer transforms, sub-band transforms, or other types of transforms could also be used. The transform may convert the residual information from a pixel value domain to a transform domain, such as a frequency domain. The transform scaling and quantization component 413 is also configured to scale the transformed residual information, for example based on frequency. Such scaling involves applying a scale factor to the residual information so that different frequency information is quantized at different granularities, which may affect final visual quality of the reconstructed video. The transform scaling and quantization component 413 is also configured to quantize the transform coefficients to further reduce bitrate. The quantization process may reduce the bit depth associated with some or all of the coefficients. The degree of quantization may be modified by adjusting a quantization parameter. In some examples, the transform scaling and quantization component 413 may then perform a scan of the matrix including the quantized transform coefficients. The quantized transform coefficients are forwarded to the header formatting and CABAC component 431 to be encoded in the bitstream.
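A simplified C++ sketch of uniform quantization with a quantization-parameter-driven step size. Real codecs use integer arithmetic, scaling lists, and per-frequency factors, so the step-size formula below (a step that roughly doubles every six QP values) is only a modeling assumption for illustration.

#include <cmath>
#include <cstdint>
#include <vector>

// Sketch: quantize transform coefficients with a QP-dependent step size.
std::vector<int32_t> quantizeCoefficients(const std::vector<double>& coeffs, int qp) {
    const double step = std::pow(2.0, (qp - 4) / 6.0); // assumed step-size model
    std::vector<int32_t> levels;
    levels.reserve(coeffs.size());
    for (double c : coeffs) {
        levels.push_back(static_cast<int32_t>(std::lround(c / step)));
    }
    return levels;
}

// Inverse operation used on the decoder side (and in the encoder's
// reconstruction loop): scale the levels back to approximate coefficients.
std::vector<double> dequantizeCoefficients(const std::vector<int32_t>& levels, int qp) {
    const double step = std::pow(2.0, (qp - 4) / 6.0);
    std::vector<double> coeffs;
    coeffs.reserve(levels.size());
    for (int32_t l : levels) {
        coeffs.push_back(l * step);
    }
    return coeffs;
}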

[0088] The scaling and inverse transform component 429 applies a reverse operation of the transform scaling and quantization component 413 to support motion estimation. The scaling and inverse transform component 429 applies inverse scaling, transformation, and/or quantization to reconstruct the residual block in the pixel domain, e.g., for later use as a reference block which may become a predictive block for another current block. The motion estimation component 421 and/or motion compensation component 419 may calculate a reference block by adding the residual block back to a corresponding predictive block for use in motion estimation of a later block/frame. Filters are applied to the reconstructed reference blocks to mitigate artifacts created during scaling, quantization, and transform. Such artifacts could otherwise cause inaccurate prediction (and create additional artifacts) when subsequent blocks are predicted.

[0089] The filter control analysis component 427 and the in-loop filters component 425 apply the filters to the residual blocks and/or to reconstructed image blocks. For example, the transformed residual block from the scaling and inverse transform component 429 may be combined with a corresponding prediction block from intra-picture prediction component 417 and/or motion compensation component 419 to reconstruct the original image block. The filters may then be applied to the reconstructed image block. In some examples, the filters may instead be applied to the residual blocks. As with other components in FIG. 4, the filter control analysis component 427 and the in-loop filters component 425 are highly integrated and may be implemented together, but are depicted separately for conceptual purposes. Filters applied to the reconstructed reference blocks are applied to particular spatial regions and include multiple parameters to adjust how such filters are applied. The filter control analysis component 427 analyzes the reconstructed reference blocks to determine where such filters should be applied and sets corresponding parameters. Such data is forwarded to the header formatting and CABAC component 431 as filter control data for encoding. The in-loop filters component 425 applies such filters based on the filter control data. The filters may include a deblocking filter, a noise suppression filter, a SAO filter, and an adaptive loop filter. Such filters may be applied in the spatial/pixel domain (e.g., on a reconstructed pixel block) or in the frequency domain, depending on the example.

[0090] When operating as an encoder, the filtered reconstructed image block, residual block, and/or prediction block are stored in the decoded picture buffer component 423 for later use in motion estimation as discussed above. When operating as a decoder, the decoded picture buffer component 423 stores and forwards the reconstructed and filtered blocks toward a display as part of an output video signal. The decoded picture buffer component 423 may be any memory device capable of storing prediction blocks, residual blocks, and/or reconstructed image blocks.

[0091] The header formatting and CABAC component 431 receives the data from the various components of codec system 400 and encodes such data into a coded bitstream for transmission toward a decoder. Specifically, the header formatting and CABAC component 431 generates various headers to encode control data, such as general control data and filter control data. Further, prediction data, including intra-prediction and motion data, as well as residual data in the form of quantized transform coefficient data are all encoded in the bitstream. The final bitstream includes all information desired by the decoder to reconstruct the original partitioned video signal 401. Such information may also include intra-prediction mode index tables (also referred to as codeword mapping tables), definitions of encoding contexts for various blocks, indications of most probable intra-prediction modes, an indication of partition information, etc. Such data may be encoded by employing entropy coding. For example, the information may be encoded by employing context adaptive variable length coding (CAVLC), CABAC, syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding, or another entropy coding technique. Following the entropy coding, the coded bitstream may be transmitted to another device (e.g., a video decoder) or archived for later transmission or retrieval.

[0092] FIG. 5 is a block diagram illustrating an example video encoder 500. Video encoder 500 may be employed to implement the encoding functions of codec system 400 and/or implement steps 301, 303, 305, 307, and/or 309 of method 300. Further, encoder 500 may be employed to implement steps 205-209 of method 200 as well as encoder 103. Encoder 500 partitions an input video signal (e.g., a sub-picture video stream), resulting in a partitioned video signal 501, which is substantially similar to the partitioned video signal 401. The partitioned video signal 501 is then compressed and encoded into a bitstream by components of encoder 500.

[0093] Specifically, the partitioned video signal 501 is forwarded to an intra-picture prediction component 517 for intra-prediction. The intra-picture prediction component 517 may be substantially similar to intra-picture estimation component 415 and intra-picture prediction component 417. The partitioned video signal 501 is also forwarded to a motion compensation component 521 for inter-prediction based on reference blocks in a decoded picture buffer component 523. The motion compensation component 521 may be substantially similar to motion estimation component 421 and motion compensation component 419. The prediction blocks and residual blocks from the intra-picture prediction component 517 and the motion compensation component 521 are forwarded to a transform and quantization component 513 for transformation and quantization of the residual blocks. The transform and quantization component 513 may be substantially similar to the transform scaling and quantization component 413. The transformed and quantized residual blocks and the corresponding prediction blocks (along with associated control data) are forwarded to an entropy coding component 531 for coding into a bitstream. The entropy coding component 531 may be substantially similar to the header formatting and CABAC component 431.

[0094] The transformed and quantized residual blocks and/or the corresponding prediction blocks are also forwarded from the transform and quantization component 513 to an inverse transform and quantization component 529 for reconstruction into reference blocks for use by the motion compensation component 521. The inverse transform and quantization component 529 may be substantially similar to the scaling and inverse transform component 429. In-loop filters in an in-loop filters component 525 are also applied to the residual blocks and/or reconstructed reference blocks, depending on the example. The in-loop filters component 525 may be substantially similar to the filter control analysis component 427 and the in-loop filters component 425. The in-loop filters component 525 may include multiple filters as discussed with respect to in-loop filters component 425. The filtered blocks are then stored in a decoded picture buffer component 523 for use as reference blocks by the motion compensation component 521. The decoded picture buffer component 523 may be substantially similar to the decoded picture buffer component 423.

[0095] The encoder 500 may encode video into one or more tracks. As discussed in more detail below, VR video can be recorded from multiple viewpoints. Video from each viewpoint can then be encoded in a corresponding set of tracks. This allows the decoder to swap between tracks based on user input, which allows a user to swap between viewpoints as desired. A user may wish to continuously watch a particular object or location in the virtual environment when switching between viewpoints. In order to allow the user to maintain a consistent view, the encoder 500 can be configured to encode data indicating correspondences between viewport center points of related viewpoints. This allows the decoder to determine the correspondences and determine the FOV and/or viewport used by the user at a first viewpoint when a viewpoint switch is requested. The decoder can then determine a FOV/viewport at a second viewpoint that corresponds to the FOV/viewport used at the first viewpoint based on the correspondences encoded by the encoder 500. Accordingly, when the user switches between viewpoints, the decoder can display a FOV/viewport at the second viewpoint that points toward the same location previously viewed by the user at the first viewpoint. For example, such correspondences can be encoded in a timed metadata track and/or in corresponding video tracks. These concepts are discussed in greater detail below.

[0096] FIG. 6 is a block diagram illustrating an example video decoder 600. Video decoder 600 may be employed to implement the decoding functions of codec system 400 and/or implement steps 311, 313, 315, and/or 317 of operating method 300. Further, decoder 600 may be employed to implement steps 211-213 of method 200 as well as decoder 107. Decoder 600 receives a plurality of tracks containing picture bitstreams and/or sub-picture bitstreams, for example from an encoder 500, generates a reconstructed output video signal, for example by merging sub-picture video streams into a spherical video stream, and forwards the spherical video stream for display to a user via a rendering device.

[0097] The bitstreams are received by an entropy decoding component 633. The entropy decoding component 633 is configured to implement an entropy decoding scheme, such as CAVLC, CABAC, SBAC, PIPE coding, or other entropy coding techniques. For example, the entropy decoding component 633 may employ header information to provide a context to interpret additional data encoded as codewords in the bitstreams. The decoded information includes any desired information to decode the video signal, such as general control data, filter control data, partition information, motion data, prediction data, and quantized transform coefficients from residual blocks. The quantized transform coefficients are forwarded to an inverse transform and quantization component 629 for reconstruction into residual blocks. The inverse transform and quantization component 629 may be similar to inverse transform and quantization component 529.

[0098] The reconstructed residual blocks and/or prediction blocks are forwarded to intra picture prediction component 617 for reconstruction into image blocks based on intra prediction operations. The intra-picture prediction component 617 may be similar to intra picture estimation component 415 and intra-picture prediction component 417. Specifically, the intra-picture prediction component 617 employs prediction modes to locate a reference block in the frame and applies a residual block to the result to reconstruct intra-predicted image blocks. The reconstructed intra-predicted image blocks and/or the residual blocks and corresponding inter-prediction data are forwarded to a decoded picture buffer component 623 via an in-loop filters component 625, which may be substantially similar to decoded picture buffer component 423 and in-loop filters component 425, respectively. The in-loop filters component 625 filters the reconstructed image blocks, residual blocks, and/or prediction blocks, and such information is stored in the decoded picture buffer component 623. Reconstructed image blocks from decoded picture buffer component 623 are forwarded to a motion compensation component 621 for inter-prediction. The motion compensation component 621 may be substantially similar to motion estimation component 421 and/or motion compensation component 419. Specifically, the motion compensation component 621 employs motion vectors from a reference block to generate a prediction block and applies a residual block to the result to reconstruct an image block. The resulting reconstructed blocks may also be forwarded via the in-loop filters component 625 to the decoded picture buffer component 623. The decoded picture buffer component 623 continues to store additional reconstructed image blocks, which can be reconstructed into frames via the partition information. Such frames may also be placed in a sequence. The sequence is output toward a display as a reconstructed output video signal.

[0099] The decoder 600 may receive a set of tracks containing VR video recorded from multiple viewpoints. This allows the decoder 600 to swap between tracks based on user input, which allows a user to swap between viewpoints as desired. A user may wish to continuously watch a particular object or location in the virtual environment when switching between viewpoints. In order to allow the user to maintain a consistent view, the tracks may contain data indicating correspondences between viewport center points of related viewpoints. This allows the decoder 600 to determine the correspondences and determine the FOV and/or viewport used by the user at a first viewpoint when a viewpoint switch is requested. The decoder 600 can then determine a FOV/viewport at a second viewpoint that corresponds to the FOV/viewport used at the first viewpoint based on the correspondences encoded by the encoder. Accordingly, when the user switches between viewpoints, the decoder 600 can display a FOV/viewport at the second viewpoint that points toward the same location previously viewed by the user at the first viewpoint. For example, such correspondences can be encoded in a timed metadata track and/or in corresponding video tracks. These concepts are discussed in greater detail below.
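A minimal C++ sketch of how a decoder-side player might use the signalled correspondences when a viewpoint switch is requested. The data structures and function names are hypothetical stand-ins for the 'vcpc' metadata described later in this disclosure, assuming viewport centers expressed as azimuth/elevation pairs.

#include <cstdint>
#include <map>
#include <optional>
#include <utility>

// Hypothetical in-memory view of the signalled data: for a (source, destination)
// viewpoint pair, map a viewport center at the source viewpoint to the
// corresponding viewport center at the destination viewpoint.
struct ViewportCenter {
    int32_t azimuth = 0;
    int32_t elevation = 0;
    bool operator<(const ViewportCenter& o) const {
        return std::make_pair(azimuth, elevation) < std::make_pair(o.azimuth, o.elevation);
    }
};

using CenterCorrespondences =
    std::map<std::pair<uint32_t, uint32_t>,             // (source, destination) viewpoint IDs
             std::map<ViewportCenter, ViewportCenter>>; // source center -> destination center

// Sketch: when switching viewpoints, look up the destination viewport center
// that corresponds to the viewport center currently presented to the user.
std::optional<ViewportCenter> destinationCenter(const CenterCorrespondences& table,
                                                uint32_t sourceViewpoint,
                                                uint32_t destinationViewpoint,
                                                const ViewportCenter& sourceCenter) {
    auto pair = table.find({sourceViewpoint, destinationViewpoint});
    if (pair == table.end()) return std::nullopt;
    auto center = pair->second.find(sourceCenter);
    if (center == pair->second.end()) return std::nullopt;
    return center->second;
}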

[00100] FIG. 7 is a schematic diagram illustrating an example system 700 for capturing VR video from multiple viewpoints 702, 703, and 704. The three viewpoints 702, 703, and 704 are shown as an example; in other examples, fewer or more viewpoints may be provided. As shown, the system 700 is implemented to capture activity at a particular scene 701 (e.g., a stadium) using a plurality of cameras positioned at corresponding viewpoints 702, 703, and 704. The cameras may be similar to the multi-directional cameras 101 described above in connection with FIG. 1. In an embodiment, cameras in fixed positions may capture VR videos at viewpoint 702 and viewpoint 703, together with a camera that can continuously change positions along a rail 705 in order to capture VR videos from a variety of different positions denoted as viewpoint 704. By sliding along the rail 705, the camera is able to capture the VR video from different positions, and hence viewpoint 704 may change over time. In practical applications, the camera at viewpoint 704 may be mounted in other ways in order to be moveable in one or more directions.

[00101] The cameras may each record a sphere of video looking outward from the perspective of the corresponding viewpoint 702, 703, and 704. Hence, a viewpoint 702, 703, and 704 is the center of a sphere of video data as recorded from a specified location. For example, video (and audio) can be recorded from viewpoints 702, 703, and 704. The video for each viewpoint can then be stored in a set of corresponding tracks. For example, video from a viewpoint 702 can be downsampled and stored at various resolutions in tracks as part of an adaptation set for viewpoint 702. Adaptation sets for viewpoints 703 and 704 can also be stored in corresponding tracks. Hence, a decoder can receive user input and, based on the user input, select an adaptation set with corresponding tracks for display. This in turn allows a user to direct the decoder to switch between viewpoints 702, 703, and 704. The result is that the user can experience VR video from a first viewpoint (e.g., viewpoint 702) at a first time and then switch to experience VR video from a second viewpoint (e.g., viewpoint 703 or 704) at a second time.

[00102] One mechanism to enable such a viewpoint switch is to provide a default orientation for each viewpoint 702, 703, and 704. An orientation is a direction of view pointing outward from the center of a corresponding viewpoint 702, 703, and/or 704. An orientation may be described in terms of angle, coordinates, etc. A specified orientation may result in a corresponding FOV and viewport for viewing video from the viewpoint 702, 703, and/or 704.

[00103] A system for encoding VR video from multiple viewpoints 702, 703, and 704 can be implemented as follows. Tracks belonging to the same viewpoint may have the same value of track_group_id for track_group_type 'vipo'. The track_group_id of tracks from one viewpoint may differ from the track_group_id of tracks from any other viewpoint. By default, when this track grouping is not indicated for any track in a file, the file is considered as containing content for one viewpoint only. Example syntax is as follows:

aligned(8) class ViewpointGroupBox extends TrackGroupTypeBox('vipo') {

ViewpointPosStruct();

string viewpoint_label;

}

[00104] The semantics for this syntax is as follows. Tracks that have the same value of track_group_id within TrackGroupTypeBox with track_group_type equal to 'vipo' belong to the same viewpoint. The track_group_id within TrackGroupTypeBox with track_group_type equal to 'vipo' is therefore used as the identifier of the viewpoint. ViewpointPosStruct() is defined below. viewpoint_label is a null-terminated Unicode Transformation Format eight bit (UTF-8) string that provides a human readable text label for the viewpoint.

[00105] A Viewpoint Information Structure (ViewpointInfoStruct()) provides information of a viewpoint, including the position of the viewpoint and the yaw, pitch, and roll rotation angles of X, Y, and Z axes, respectively, of the global coordinate system of the viewpoint relative to the common reference coordinate system. The syntax is as follows.

aligned(8) ViewpointInfoStruct(gcs_rotation_flag) {

ViewpointPosStruct();

if (gcs_rotation_flag)

ViewpointGlobalCoordinateSysRotationStruct();

unsigned int(1) group_alignment_flag;

bit(7) reserved = 0;

if (group_alignment_flag)

ViewpointGroupStruct();

}

aligned(8) ViewpointPosStruct() {

signed int(32) viewpoint_pos_x;

signed int(32) viewpoint_pos_y;

signed int(32) viewpoint_pos_z;

unsigned int(1) viewpoint_gpspos_present_flag;

bit(31) reserved = 0;

if(viewpoint_gpspos_present_flag) {

signed int(32) viewpoint_gpspos_longitude;

signed int(32) viewpoint_gpspos_latitude;

signed int(32) viewpoint_gpspos_altitude;

}

}

aligned(8) class ViewpointGlobalCoordinateSysRotationStruct() {

signed int(32) viewpoint_gcs_yaw;

signed int(32) viewpoint_gcs_pitch;

signed int(32) viewpoint_gcs_roll;

}

aligned(8) class ViewpointGroupStruct() {

unsigned int(8) vwpt_group_id;

string vwpt_group_description;

}

[00106] The semantics for this syntax is as follows. The group_alignment_flag can be set equal to one to specify that the viewpoint belongs to a separate coordinate system (with its own origin) for the alignment of viewpoint groups and the ViewpointGroupStruct is present. The group_alignment_flag can be set equal to zero to specify that the viewpoint belongs to the common reference coordinate system. When two viewpoints have different values of vwpt_group_id, their position coordinates are not comparable because the viewpoints belong to different coordinate systems. viewpoint_pos_x, viewpoint_pos_y, and viewpoint_pos_z specify the position of the viewpoint (when the position of the viewpoint is static) or the initial position of the viewpoint (when the position of the viewpoint is dynamic) in units of 10^-1 millimeters, in three dimensional (3D) space with (0, 0, 0) as the center of the common reference coordinate system. If a viewpoint is associated with a timed metadata track with sample entry type 'dyvp', the position of the viewpoint is dynamic. Otherwise, the position of the viewpoint is static. In the former case, the dynamic position of the viewpoint is signalled in the associated timed metadata track with sample entry type 'dyvp'. The viewpoint_gpspos_present_flag may be set equal to one to indicate that viewpoint_gpspos_longitude, viewpoint_gpspos_latitude, and viewpoint_gpspos_altitude are present. The viewpoint_gpspos_present_flag can be set equal to zero to indicate that viewpoint_gpspos_longitude, viewpoint_gpspos_latitude, and viewpoint_gpspos_altitude are not present. viewpoint_gpspos_longitude can indicate the longitude of the geolocation of the viewpoint in units of 2^-23 degrees. viewpoint_gpspos_longitude shall be in the range of -180 * 2^23 to 180 * 2^23 - 1, inclusive. Positive values represent eastern longitude and negative values represent western longitude.

[00107] The viewpoint_gpspos_latitude indicates the latitude of the geolocation of the viewpoint in units of 2^-23 degrees. The viewpoint_gpspos_latitude may be in the range of -90 * 2^23 to 90 * 2^23 - 1, inclusive. A positive value represents northern latitude and a negative value represents southern latitude. The viewpoint_gpspos_altitude indicates the altitude of the geolocation of the viewpoint in units of millimeters above a World Geodetic System (WGS 84) reference ellipsoid. The viewpoint_gcs_yaw, viewpoint_gcs_pitch, and viewpoint_gcs_roll specify the yaw, pitch, and roll angles, respectively, of the rotation angles of X, Y, Z axes of the global coordinate system of the viewpoint relative to the common reference coordinate system, in units of 2^-16 degrees. The viewpoint_gcs_yaw may be in the range of -180 * 2^16 to 180 * 2^16 - 1, inclusive. The viewpoint_gcs_pitch may be in the range of -90 * 2^16 to 90 * 2^16, inclusive. The viewpoint_gcs_roll may be in the range of -180 * 2^16 to 180 * 2^16 - 1, inclusive. The vwpt_group_id indicates the identifier of a viewpoint group. All viewpoints in a viewpoint group share a common reference coordinate system. The vwpt_group_description is a null-terminated UTF-8 string which indicates the description of a viewpoint group. A null string is allowed. An OMAF player may be expected to start with the initial viewpoint timed metadata. Subsequently, if the user wishes to switch to a viewpoint group and the initial viewpoint information is not present, the OMAF player may switch to the viewpoint with the least value of the viewpoint identifier in the viewpoint group.
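The position and rotation fields above are carried as fixed-point integers. The C++ sketch below converts them to floating-point values under the scales stated in the semantics (10^-1 millimeters for positions, 2^-23 degrees for GPS longitude/latitude, 2^-16 degrees for the rotation angles). The structs are a hypothetical in-memory mirror of the boxes, not a parser for the binary format.

#include <cstdint>

// Hypothetical decoded fields, mirroring ViewpointPosStruct() and
// ViewpointGlobalCoordinateSysRotationStruct() above.
struct ViewpointRaw {
    int32_t posX, posY, posZ;          // units of 10^-1 millimeters
    int32_t gpsLongitude, gpsLatitude; // units of 2^-23 degrees
    int32_t gcsYaw, gcsPitch, gcsRoll; // units of 2^-16 degrees
};

struct ViewpointConverted {
    double posXMeters, posYMeters, posZMeters;
    double longitudeDeg, latitudeDeg;
    double yawDeg, pitchDeg, rollDeg;
};

// Sketch: apply the unit scales from the semantics to obtain physical values.
ViewpointConverted convert(const ViewpointRaw& raw) {
    ViewpointConverted out{};
    const double mmPerUnit = 0.1;            // 10^-1 millimeters per unit
    out.posXMeters = raw.posX * mmPerUnit / 1000.0;
    out.posYMeters = raw.posY * mmPerUnit / 1000.0;
    out.posZMeters = raw.posZ * mmPerUnit / 1000.0;
    const double gpsScale = 1.0 / (1 << 23); // 2^-23 degrees per unit
    out.longitudeDeg = raw.gpsLongitude * gpsScale;
    out.latitudeDeg = raw.gpsLatitude * gpsScale;
    const double gcsScale = 1.0 / (1 << 16); // 2^-16 degrees per unit
    out.yawDeg = raw.gcsYaw * gcsScale;
    out.pitchDeg = raw.gcsPitch * gcsScale;
    out.rollDeg = raw.gcsRoll * gcsScale;
    return out;
}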

[00108] A sample group may be employed for recommended viewports of multiple viewpoints. A timed metadata track having sample entry type 'rcvp' may contain zero or one SampleToGroupBox with grouping_type equal to 'vwpt'. This SampleToGroupBox represents the assignment of samples in this timed metadata (and consequently the corresponding samples in the media tracks) to viewpoints. When a SampleToGroupBox with grouping_type equal to 'vwpt' is present, an accompanying SampleGroupDescriptionBox with the same grouping_type may be present, and may contain the identifier (ID) of the particular viewpoint this group of samples belongs to. The sample group entry of grouping_type equal to 'vwpt', named ViewpointEntry, is defined as follows:

class ViewpointEntry() extends SampleGroupDescriptionEntry('vwpt') {

unsigned int(32) viewpoint_id;

}

Where viewpoint_id indicates the viewpoint identifier of the viewpoint this group of samples belongs to.

[00109] Timed metadata for viewpoints may include dynamic viewpoint information. Specifically, the dynamic viewpoint timed metadata track indicates the viewpoint parameters that are dynamically changing over time. An OMAF player may use the signalled information as follows when starting playback of one viewpoint after switching from another viewpoint. If there is a recommended viewing orientation explicitly signaled, the OMAF player may parse this information and follow the recommended viewing orientation. Otherwise, the OMAF player may keep the same viewing orientation as in the switching-from viewpoint just before the switching occurs.

[00110] The track sample entry type 'dyvp' can be used for dynamic viewpoints. The sample entry of this sample entry type is specified as follows:

class DynamicViewpointSampleEntry extends MetaDataSampleEntry('dyvp') {

ViewpointPosStruct();

unsigned int(1) dynamic_gcs_rotation_flag;

bit(7) reserved = 0;

if (!dynamic_gcs_rotation_flag)

ViewpointGlobalCoordinateSysRotationStruct();

}

[00111] The ViewpointPosStruct() is defined above but indicates the initial viewpoint position. The dynamic_gcs_rotation_flag can be set equal to zero to specify that the yaw, pitch, and roll rotation angles of X, Y, and Z axes, respectively, of the global coordinate system of the viewpoint relative to the common reference coordinate system remain unchanged in all samples referring to this sample entry. The dynamic_gcs_rotation_flag can be set equal to one to specify that the yaw, pitch, and roll rotation angles of X, Y, and Z axes, respectively, of the global coordinate system of the viewpoint relative to the common reference coordinate system are indicated in the samples. ViewpointGlobalCoordinateSysRotationStruct() is defined above but indicates the yaw, pitch, and roll rotation angles of X, Y, and Z axes, respectively, of the global coordinate system of the viewpoint relative to the common reference coordinate system for each sample referring to this sample entry.

[00112] The sample syntax of the sample entry type ('dyvp') is specified as follows:

aligned(8) DynamicViewpointSample() {

ViewpointInfoStruct(dynamic_gcs_rotation_flag);

}

The semantics of ViewpointInfoStruct() is specified above. The first sample should have a group_alignment_flag set equal to one. For subsequent samples, when the group information does not change, the ViewpointGroupStruct() can be absent. When the ViewpointGroupStruct() is absent in a sample, the structure is inferred to be identical to the ViewpointGroupStruct() of the previous sample in decoding order.
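A brief C++ sketch of the inference rule just described: when ViewpointGroupStruct() is absent from a sample, a player can carry forward the group information from the previous sample in decoding order. The sample representation here is a hypothetical simplification of the parsed track data.

#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical simplified view of the group information carried per sample.
struct ViewpointGroupInfo {
    uint8_t groupId = 0;
    std::string description;
};

struct DynamicViewpointSampleInfo {
    std::optional<ViewpointGroupInfo> group; // absent when unchanged
};

// Sketch: resolve the effective group information for every sample by
// inheriting from the previous sample in decoding order when absent.
std::vector<ViewpointGroupInfo> resolveGroups(
        const std::vector<DynamicViewpointSampleInfo>& samples) {
    std::vector<ViewpointGroupInfo> resolved;
    ViewpointGroupInfo current{}; // the first sample is expected to carry the group
    for (const auto& s : samples) {
        if (s.group) current = *s.group;
        resolved.push_back(current);
    }
    return resolved;
}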

[00113] The metadata indicates the initial viewpoint that should be used. In the absence of this information, the initial viewpoint may be inferred to be the viewpoint that has the least value of viewpoint identifier among all viewpoints in the file. The initial viewpoint timed metadata track, when present, should be indicated as being associated with all viewpoints in the file.

[00114] The track sample entry type 'invp' may be used. The sample entry of this sample entry type is specified as follows:

class InitialViewpointSampleEntry extends MetaDataSampleEntry('invp') {

unsigned int(32) id_of_initial_viewpoint;

}

The id_of_initial_viewpoint indicates the value of the viewpoint identifier of the initial viewpoint for the first sample to which this sample entry applies.

[00115] The sample syntax of this sample entry type ('invp') is specified as follows:

aligned(8) InitialViewpointSample() {

unsigned int(32) id_of_initial_viewpoint;

}

The id_of_initial_viewpoint indicates the value of the viewpoint identifier of the initial viewpoint for the sample.

[00116] OMAF includes the specification of the initial viewing orientation timed metadata. This metadata indicates initial viewing orientations that may be used when playing the associated media tracks or a single omnidirectional image stored as an image item. The track sample entry type 'invo' may be used. An example syntax is as follows:

[00117] class InitialViewingOrientationSample() extends SphereRegionSample() {

unsigned int(1) refresh_flag;

bit(7) reserved = 0;

}

[00118] aligned(8) SphereRegionSample() {

for (i = 0; i < num_regions; i++)

SphereRegionStruct(dynamic_range_flag)

}

[00119] aligned(8) SphereRegionStruct(range_included_flag) {

signed int(32) centre_azimuth;

signed int(32) centre_elevation;

signed int(32) centre_tilt;

if (range_included_flag) {

unsigned int(32) azimuth_range;

unsigned int(32) elevation_range;

}

unsigned int(1) interpolate;

bit(7) reserved = 0;

}

[00120] The default orientation approach as described in the implementation above may cause a user to view a specified default FOV and viewport upon switching to a new viewpoint 702, 703, and/or 704. However, this may result in a negative user experience in some cases. For example, a user may wish to continuously view an object in the scene 701, such as a basketball, a particular player, a goal, etc. Such a consistency may not be possible using default orientations. For example, a user watching the ball at viewpoint 702 may wish to switch to viewpoint 704 to get a closer look. However, the default orientation at viewpoint 704 may be toward the goal. In such a case, the user loses the ball upon switching and is forced to find the ball again.

[00121] In the present disclosure, the encoder can store viewport center point correspondences between viewpoints 702, 703, and/or 704. The decoder can determine the viewport viewed by the user at viewpoint 702 upon switching to viewpoint 704 (or viewpoint 703 in other examples). The decoder can then use the viewport center point correspondences between viewpoint 702 and viewpoint 704 to determine a viewport at viewpoint 704 that matches the viewport at viewpoint 702. The decoder can then employ the determined viewport at viewpoint 704 after making the switch. In this manner, the user is automatically oriented to the same location in the scene 701 after the switch between viewpoints 702, 703, and/or 704 as was viewed before the switch. For example, if the user is watching the ball at viewpoint 702, the user is automatically oriented to view the ball from viewpoint 704 upon switching. The viewport center point correspondences are discussed in greater detail below.

[00122] FIG. 8 is a schematic diagram 800 of a pair of viewpoints 810 and 820 with corresponding viewport center points 813a and 823a. A viewport correspondence, as used herein, is an indication that two or more viewports 813 and 823 are spatially related such that viewing the viewports 813 and 823 from a related viewpoint 810 and 820, respectively, provides a view of the same object 830. The correspondences support switching in the present disclosure. Further, the correspondences between viewports 813 and 823 are denoted by the center points 813a and 823a of the viewports 813 and 823, respectively. Hence, the correspondences shown in schematic diagram 800 can be used by an encoder 103, an encoder 500, a decoder 107, a decoder 600, and/or a codec system 400. Further, the correspondences shown in schematic diagram 800 can describe relationships between viewpoints 702, 703, and/or 704. In addition, the correspondences shown in schematic diagram 800 can be encoded in a bitstream and used to support selection of tracks to decode and display, and hence can be used as part of methods 200 and 300.

[00123] As shown in diagram 800, correspondences can be stored as viewpoint 810 and 820 pairs. Viewpoints 810 and 820 each include a sphere 814 and 824, respectively, of video content in associated tracks. Specifically, a user viewing video from a viewpoint 810 and 820 has access to a sphere 814 and 824, respectively, of video content. The video content is depicted to the user by projecting a portion of the video content from the sphere 814 and 824, depending on the user's viewpoint 810 and 820, onto a viewport based on the current FOV 811 and 821, respectively, of the user. A FOV, such as FOV 811 and 821, is an angular orientation measured from the center of the associated viewpoint 810 and 820, respectively. The angular orientation of an FOV 811 and 821 can be used to determine an area of the location 831 that a user can view from a corresponding viewpoint 810 and 820. Each FOV corresponds to a viewport. A viewport is a two dimensional planar shape with a width and a height that covers a viewable portion of the viewpoint sphere. A portion of the viewpoint sphere that can be viewed by an FOV is projected onto a corresponding viewport. As an example, a user wearing an HMD may point their head in a particular direction. The angular orientation of the HMD relative to the viewpoint sphere of video content in such an example is the FOV. Meanwhile, the screen displaying content to the user is the viewport. To ensure a proper user experience, the portion of the viewpoint sphere of video content displayed on the viewport should match the angular orientation of the HMD. Hence, the viewport should match the FOV for proper operation of a VR system. The viewpoints 810 and 820 can have many viewports covering different portions of the spheres 814 and 824, respectively. Such viewports can be as varied as the number of FOVs available for the corresponding viewpoint. Many of the viewports of the viewpoints 810 and 820 are unrelated as they may allow the user a view of the same location 831, but not the same object 830. However, certain viewports, such as viewports 813 and 823, correspond. This is because FOVs 811 and 821 project the same object 830 onto viewports 813 and 823.

[00124] As shown, corresponding viewports 813 and 823 may be employed to allow a user to view the same object 830 at the location 831 from a different perspective. Further, the center points 813a and 823a of each viewport 813 and 823, respectively, can be employed to indicate such correspondence. Specifically, a user can view the object 830 via an initial viewport 813 at an initial viewpoint 810. The user can then decide to switch to a destination viewpoint 820. The decoder can determine that the initial viewpoint 810 and the destination viewpoint 820 have corresponding viewports. Further, the decoder can employ stored viewport center point data to determine that the viewport center point 813a of the initial viewport 813 corresponds with the viewport center point 823a of the viewport 823. Accordingly, the decoder can default to the viewport 823 containing the indicated viewport center point 823a. In this way, the decoder can perform a switch from viewpoint 810 to viewpoint 820 while orienting the user toward the same object 830 in order to maintain viewing consistency when switching between viewpoints 810 and 820.

[00125] It should be noted that groups of viewpoints 810 and 820 that have a view of the same object 830 may be referred to as a viewpoint set. Further, a center point 813a of viewport 813 (or a center point 823a of viewport 823) may also be referred to as a viewport center in some cases. As such, data indicative of viewport centers for viewpoints 810 and 820 in a viewpoint set can be encoded in a VR video stream. This allows a decoder to correctly determine corresponding viewports 813 and 823 based on the viewport centers when switching between viewpoints 810 and 820 in the viewpoint set.

[00126] For example, the embodiment described in FIG. 8 can be implemented as discussed below. In this example, the viewing orientation information (e.g., FOV related information) can be signaled to the client/decoder in a timed metadata track. Viewpoint pairs and corresponding center points of viewport pairs are signaled in order to indicate an expected viewing orientation after switching from one viewpoint to another viewpoint. In this embodiment, the viewport center points correspondence information is signalled as a particular sample entry type, such as 'vcpc', in the timed metadata tracks. The sample entry may be specified as follows:

class VcpcSampleEntry() extends SphereRegionSampleEntry('vcpc') {

unsigned int(32) num_viewpoint_pairs;

}

class VcpcSample(){

for (i = 0; i < num_viewpoint_pairs; i++) {

unsigned int(32) viewpoint_id[2];

unsigned int(16) num_corresponding_viewport_centres;

for (j = 0; j < num_corresponding_viewport_centres; j++) {

SphereRegionStruct(0)[2];

}

}

}

[00127] The semantics for such an implementation is as follows. num_viewpoint_pairs indicates the number of viewpoint pairs for which viewport center points correspondence is signalled in the samples to which this sample entry applies. viewpoint_id[2] indicates the two viewpoint identifiers (IDs) of the viewpoint pair. num_corresponding_viewport_centres indicates the number of corresponding viewport center points signalled in this sample for the i-th viewpoint pair. SphereRegionStruct(0)[k] for k equal to 0 or 1 specifies the viewport center point corresponding to the viewpoint indicated by viewpoint_id[k]. In this embodiment, a viewport center points correspondence timed metadata track with sample entry type of 'vcpc' may have no track reference of type 'cdsc', and in this case the correspondence may apply to the entire file. Content providers can perform scene or object matching among video streams representing different viewpoints frame by frame, and choose a representative point of the scene or object, e.g., the center point of an object, as the corresponding viewport center point to be indicated by the VCPC timed metadata track. When a viewpoint switching occurs, the client checks whether the user's field of view in the switch-from viewpoint covers a corresponding viewport center point that is indicated by the time-aligned sample of the VCPC timed metadata track. If yes, just after the switching, the client may render to the user the viewport in the switching-to viewpoint for which the corresponding center point is indicated by the time-aligned sample of the VCPC timed metadata track. When the user's field of view covers more than one indicated viewport center point, one of those that are the closest to the center of the user's field of view may be chosen. If both recommended viewport metadata information for the switch-to viewpoint and the VCPC timed metadata track are available, then, since neither type of information imposes mandatory OMAF player behaviour, the OMAF player may choose to follow either one or neither.
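A C++ sketch of the client-side decision just described: check whether the current field of view in the switch-from viewpoint covers any signalled viewport center point and, if more than one is covered, pick the one closest to the FOV center. Angular distance and coverage are simplified here (plain differences in azimuth/elevation without wraparound handling), and all structures are hypothetical.

#include <cmath>
#include <optional>
#include <vector>

// Hypothetical sphere point and field-of-view descriptions, in degrees.
struct SpherePoint { double azimuth = 0.0; double elevation = 0.0; };

struct FieldOfView {
    SpherePoint center;
    double azimuthRange = 90.0;   // total horizontal extent
    double elevationRange = 90.0; // total vertical extent
};

// One signalled correspondence: a center point in the switch-from viewpoint
// and the matching center point in the switch-to viewpoint.
struct CenterCorrespondence { SpherePoint from; SpherePoint to; };

static bool covers(const FieldOfView& fov, const SpherePoint& p) {
    return std::fabs(p.azimuth - fov.center.azimuth) <= fov.azimuthRange / 2.0 &&
           std::fabs(p.elevation - fov.center.elevation) <= fov.elevationRange / 2.0;
}

// Sketch: among the covered center points, return the destination center of
// the one closest to the center of the user's field of view, if any.
std::optional<SpherePoint> chooseSwitchToCenter(
        const FieldOfView& fov,
        const std::vector<CenterCorrespondence>& correspondences) {
    std::optional<SpherePoint> best;
    double bestDistance = 0.0;
    for (const auto& c : correspondences) {
        if (!covers(fov, c.from)) continue;
        double da = c.from.azimuth - fov.center.azimuth;
        double de = c.from.elevation - fov.center.elevation;
        double distance = std::sqrt(da * da + de * de);
        if (!best || distance < bestDistance) {
            best = c.to;
            bestDistance = distance;
        }
    }
    return best;
}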

[00128] FIG. 9 is a schematic diagram 900 of a set of viewpoints 910, 920, 940, and 950 with corresponding viewport center points 913a, 923a, 943a, and 953a. The correspondences support switching in the present disclosure. Hence, the correspondences shown in schematic diagram 900 can be used by an encoder 103, an encoder 500, a decoder 107, a decoder 600, and/or a codec system 400. Further, the correspondences shown in schematic diagram 900 can describe relationships between viewpoints 702, 703, and/or 704. In addition, the correspondences shown in schematic diagram 900 can be encoded in a bitstream and used to support selection of tracks to decode and display, and hence can be used as part of methods 200 and 300.

[00129] Diagram 900 is substantially similar to diagram 800, but allows for sets including any number of viewpoints 910, 920, 940, and 950 instead of using pairs of viewpoints 810 and 820. Specifically, diagram 900 includes viewpoints 910, 920, 940, and 950, spheres 914, 924, 944, and 954, FOVs 911, 921, 941, and 951, and viewports 913, 923, 943, and 953, which are substantially similar to viewpoints 810 and 820, spheres 814 and 824, FOVs 811 and 821, and viewports 813 and 823. As shown, FOVs 911, 921, 941, and 951 are all oriented toward the same object 930 from different perspectives. As such, an encoder may encode a set of correspondences between viewport center point 913a at viewpoint 910, viewport center point 923a at viewpoint 920, viewport center point 943a at viewpoint 940, and viewport center point 953a at viewpoint 950 during VR video creation. A decoder can then use such information to maintain viewing consistency when switching between viewpoints 910, 920, 940, and 950.

[00130] In another example implementation, the viewport correspondence is signalled between a group of two or more viewports as follows. The viewport center points correspondence timed metadata tracks that have a particular sample entry type, such as 'vcpc', can be used. The sample entry can be specified as follows:

class VcpcSampleEntry() extends SphereRegionSampleEntry('vcpc') {

unsigned int(32) num_viewpoint_sets;

}

[00131] An example syntax can be defined as follows:

class VcpcSample(){

for (i = 0; i < num_viewpoint_sets; i++) {

unsigned int(8) num_viewpoints_in_this_set[i];

for (j = 0; j < num_viewpoints_in_this_set[i]; j++)

unsigned int(32) viewpoint_id[i][j];

unsigned int(16) num_corresponding_viewport_centres_in_this_set[i];

for (k = 0; k < num_corresponding_viewport_centres_in_this_set[i]; k++)

for (j = 0; j < num_viewpoints_in_this_set[i]; j++)

SphereRegionStruct(0)[i][k][j];

}

}

[00132] The semantics for this syntax is as follows. The num_viewpoint_sets indicates the number of viewpoint sets for which viewport center points correspondence is signalled in the samples to which this sample entry applies. The num_viewpoints_in_this_set[i] indicates the number of viewpoints in the i-th viewpoint set. The viewpoint_id[i][j] indicates the viewpoint ID of the j-th viewpoint in the i-th viewpoint set. The num_corresponding_viewport_centres_in_this_set[i] indicates the number of corresponding viewport center points signalled in this sample for the i-th viewpoint set. The SphereRegionStruct(0)[i][k][j] specifies the k-th corresponding viewport center point of the j-th viewpoint in the i-th viewpoint set. For any particular value of k in the range of zero to num_corresponding_viewport_centres_in_this_set[i] - 1, inclusive, the sphere points indicated by SphereRegionStruct(0)[i][k][j] for j ranging from zero to num_viewpoints_in_this_set[i] - 1, inclusive, are viewport center points that correspond to each other for the viewpoints in the i-th viewpoint set.

[00133] In this embodiment, a viewport center points correspondence in the timed metadata track with a sample entry type of 'vcpc' may have no track reference of type 'cdsc'. In this case, the correspondence applies to the entire file. Content providers can perform scene or object matching among video streams representing different viewpoints frame by frame, and choose a representative point of the scene or object, such as the center point of an object, as the corresponding viewport center point to be indicated by the VCPC timed metadata track. When a viewpoint switching occurs, the client/decoder checks whether the user's field of view in the switch-from viewpoint covers a corresponding viewport center point that is indicated by the time-aligned sample of the VCPC timed metadata track. If yes, just after the switching, the client may render to the user the viewport in the switching-to viewpoint for which the corresponding center point is indicated by the time-aligned sample of the VCPC timed metadata track. When the user's field of view covers more than one indicated viewport center point, one of those that are the closest to the center of the user's field of view should be chosen. If both recommended viewport metadata information for the switch-to viewpoint and the VCPC timed metadata track are available, then, since neither type of information imposes mandatory OMAF player behavior, the OMAF player may choose to follow either one or neither.
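For clarity, the set-based sample can be pictured as nested arrays: each viewpoint set lists its viewpoint IDs and, for each group of corresponding centers, one center per viewpoint. The C++ sketch below shows a hypothetical in-memory form and how the k-th correspondence for a given viewpoint would be looked up; it does not parse the binary box format.

#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct SpherePointDeg { double azimuth = 0.0; double elevation = 0.0; };

// Hypothetical in-memory form of one viewpoint set in a VcpcSample:
// viewpointIds[j] gives the j-th viewpoint, and centers[k][j] gives the
// k-th corresponding viewport center point for that viewpoint.
struct ViewpointSetCorrespondence {
    std::vector<uint32_t> viewpointIds;
    std::vector<std::vector<SpherePointDeg>> centers;
};

// Sketch: fetch the k-th corresponding center for a particular viewpoint ID.
std::optional<SpherePointDeg> centerForViewpoint(const ViewpointSetCorrespondence& set,
                                                 std::size_t k, uint32_t viewpointId) {
    if (k >= set.centers.size()) return std::nullopt;
    for (std::size_t j = 0; j < set.viewpointIds.size(); ++j) {
        if (set.viewpointIds[j] == viewpointId && j < set.centers[k].size()) {
            return set.centers[k][j];
        }
    }
    return std::nullopt;
}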

[00134] FIG. 10 is a schematic diagram of an example VR video file 1000 for multiple viewpoints. For example, VR video file 1000 may be employed to contain correspondences between spatial regions of viewpoints as discussed with respect to diagrams 800 and 900. Further, the VR video file 1000 can be encoded and/or decoded by an encoder 103, an encoder 500, a decoder 107, a decoder 600, and/or a codec system 400. In addition, the VR video file 1000 can describe VR video from multiple viewpoints, such as viewpoints 702, 703, and/or 704. Also, the VR video file 1000 can contain encoded VR video, and hence can be generated by an encoder and read by a decoder to support video display as part of methods 200 and 300.

[00135] The VR video file 1000 can contain sets of tracks for corresponding viewpoints. For example, the VR video file 1000 can contain a set of viewpoint A tracks 1010, a set of viewpoint B tracks 1020, a set of viewpoint C tracks 1040, and a set of viewpoint D tracks 1050. As a specific example, such tracks can contain video data as captured from viewpoint 910, viewpoint 920, viewpoint 940, and viewpoint 950, respectively. For example, in a DASH context, VR video recorded at a viewpoint is stored in a corresponding adaptation set. The adaptation set is downsampled to various lower resolutions. Then a track is generated for each resolution of the adaptation set. In such a case, the set of viewpoint A tracks 1010, set of viewpoint B tracks 1020, set of viewpoint C tracks 1040, and set of viewpoint D tracks 1050 contain the tracks associated with the adaptation set for the corresponding viewpoints. The relevant tracks can then be forwarded to the decoder/client depending on the viewpoint selected by the user and the desired resolution based on the availability of network resources.
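
As a hypothetical sketch of the track selection just described, the following Python fragment chooses, for a selected viewpoint, the highest-bitrate track that fits the measured network throughput, falling back to the lowest-bitrate track when nothing fits. The dictionary fields and the selection policy are assumptions for this sketch and do not reflect any particular DASH client implementation.

def select_track(tracks, viewpoint_id, available_bandwidth_bps):
    # tracks: list of dicts such as {'viewpoint': 'A', 'bitrate': 8_000_000, 'uri': '...'}.
    # Pick the highest-bitrate track of the chosen viewpoint that fits the
    # measured throughput, or the lowest-bitrate one if nothing fits.
    candidates = [t for t in tracks if t['viewpoint'] == viewpoint_id]
    if not candidates:
        raise ValueError('no tracks for viewpoint ' + str(viewpoint_id))
    fitting = [t for t in candidates if t['bitrate'] <= available_bandwidth_bps]
    if fitting:
        return max(fitting, key=lambda t: t['bitrate'])
    return min(candidates, key=lambda t: t['bitrate'])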

[00136] The VR video file 1000 can also contain a timed metadata track 1060. The timed metadata track 1060 contains metadata relevant to all of the viewpoints and hence to all of the tracks 1010, 1020, 1040, and 1050. As such, viewport center point correspondences between viewpoints can be stored in the timed metadata track 1060, for example as one or more VcpcSample objects/functions. For example, correspondences between each of the relevant viewport center points for the viewpoints can be stored toward the beginning of the timed metadata track 1060. Such information may be global in nature and can be used for the entire VR video file 1000. In the event that the viewport center point correspondences change, for example due to viewpoint motion, viewpoints turning on/off, etc., such changes can be coded into the timed metadata track 1060 at the temporal location in the VR video file 1000 where such changes occur. Accordingly, the timed metadata track 1060 can be employed to contain the viewport center point correspondences between the viewpoints over the entire length of the VR video file 1000. The viewport center point correspondences in the timed metadata track 1060 can then be used by a decoder when displaying VR video as contained in tracks 1010, 1020, 1040, and 1050.
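
The lookup of the time-aligned correspondence sample can be illustrated with the following sketch, which assumes the 'vcpc' track has already been parsed into a sorted list of (start time, correspondence data) pairs; that in-memory form and the function name are hypothetical.

import bisect

def time_aligned_correspondence(samples, presentation_time):
    # samples: list of (start_time, correspondence_data) pairs sorted by
    # start_time, a hypothetical in-memory view of the 'vcpc' timed metadata
    # track. A sample applies from its start time until the next sample, so a
    # sample near the beginning covers the whole file unless a later sample
    # (e.g. after viewpoint motion or a viewpoint turning on/off) replaces it.
    start_times = [t for t, _ in samples]
    index = bisect.bisect_right(start_times, presentation_time) - 1
    return samples[index][1] if index >= 0 else None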

[00137] FIG. 11 is an embodiment of a method 1100 of displaying a VR video at a decoder based on a viewport center point correspondence between multiple viewpoints, as discussed with respect to diagrams 800 and 900 and as applied to viewpoints such as viewpoints 702, 703, and/or 704. Hence, method 1100 may be employed by a decoder 107, a decoder 600, and/or a codec system 400. Method 1100 can also be employed to support viewpoint switching when displaying a VR video file, such as VR video file 1000, and hence can be employed to improve methods 200 and 300.

[00138] Method 1100 operates on a client including a decoder. At step 1101, the decoder begins processing a VR video stream. As described above, the VR video stream comprises a plurality of viewpoints, and the viewpoints are included in at least one viewpoint set. As used herein, a viewpoint set is a group of viewpoints that contain one or more viewport center point correspondences as described with respect to diagram 800 and/or 900. Each of the viewpoints corresponds to one particular omnidirectional video camera used for capturing an omnidirectional video at a particular location. Further, the VR video stream contains information indicative of a plurality of viewport centers for the viewpoints in the viewpoint set. For example, the information indicative of the plurality of viewport center points (e.g., viewport centers) may indicate that a second viewport and a first viewport have corresponding viewport centers associated with a second viewpoint and a first viewpoint, respectively.

[00139] At step 1103, the decoder presents a first viewport of a first viewpoint of the viewpoint set to a user, for example by forwarding the first viewport of the first viewpoint toward a display. At step 1105, the decoder switches from the first viewpoint to a second viewpoint of the viewpoint set. For example, the decoder may determine to make such a switch upon receiving a command from the user.

[00140] At step 1107, the decoder determines a second viewport of the second viewpoint based on the information indicative of a plurality of viewport centers for the viewpoints in the viewpoint set as obtained and processed at step 1101. The decoder can then forward video data associated with the second viewport toward the display. It should be noted that the first/second viewpoint and the first/second viewport may also be referred to as a source/destination viewpoint and viewport, a switched from/switched to viewpoint and viewport, an initial/final viewpoint and viewport, initial/switched viewpoint and viewport, etc. As described above, the information indicative of the plurality of viewport centers used to determine the second viewport based on the first viewport can be coded as a pair in a vcpc sample entry in some examples. In other examples, the information indicative of the plurality of viewport centers can also be coded as a set containing a plurality of viewport centers in one or more vcpc sample entries. Further, the information indicative of the plurality of viewport centers can be coded in a timed metadata track related to the plurality of viewpoints. For example, the information indicative of the plurality of viewport centers can be coded in a sphere region structure.
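
A minimal sketch of step 1107, under the assumption that the vcpc samples have been parsed into the dictionary form used in the earlier parsing sketch, might look as follows; the closeness test uses a crude squared difference rather than a true spherical distance, and the function name is hypothetical.

def determine_second_viewport(vcpc_sets, first_viewpoint, second_viewpoint, shown_centre):
    # vcpc_sets: hypothetical parsed form of the vcpc samples, e.g.
    # [{'viewpoint_ids': [...], 'centres': [[(az, el, tilt), ...], ...]}, ...].
    # Returns the centre point to use for the second viewport, or None when no
    # correspondence is signalled for this pair of viewpoints.
    best_centre, best_score = None, None
    for entry in vcpc_sets:
        ids = entry['viewpoint_ids']
        if first_viewpoint not in ids or second_viewpoint not in ids:
            continue
        j_from, j_to = ids.index(first_viewpoint), ids.index(second_viewpoint)
        for row in entry['centres']:
            src = row[j_from]
            # crude closeness test; a production player would use spherical distance
            score = (src[0] - shown_centre[0]) ** 2 + (src[1] - shown_centre[1]) ** 2
            if best_score is None or score < best_score:
                best_score, best_centre = score, row[j_to]
    return best_centre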

[00141] FIG. 12 is another embodiment of a method 1200 of displaying a VR video at a decoder based on a viewport center point correspondence between multiple viewpoints, as discussed with respect to diagrams 800 and 900 and as applied to viewpoints such as viewpoints 702, 703, and/or 704. Hence, method 1200 may be employed by a decoder 107, a decoder 600, and/or a codec system 400. Method 1200 can also be employed to support viewpoint switching when displaying a VR video file, such as VR video file 1000, and hence can be employed to improve methods 200 and 300.

[00142] At step 1201, a client operating a decoder receives a bitstream including at least a portion of a coded VR video filmed from a plurality of viewpoints. Further, the bitstream includes a correspondence between viewport centers for the viewpoints. For example, the correspondence between the viewport centers can be coded as a pair in a vcpc sample entry. As another example, the correspondence between the viewport centers can be coded as a set containing a plurality of viewport centers in one or more vcpc sample entries. The correspondence between the viewport centers may be coded in a timed metadata track related to the plurality of viewpoints. Further, the correspondence between the viewport centers can be coded in a sphere region structure, for example in the timed metadata track and containing the vcpc sample entries.

[00143] At step 1203, the decoder decodes the portion of the VR video at a center point of a source viewport at a source viewpoint. The decoder also forwards the portion of the VR video at the source viewport toward a display.

[00144] At step 1205, the decoder determines to switch from the source viewpoint to a destination viewpoint, for example in response to user input. Accordingly, the decoder determines a destination viewport at the destination viewpoint based on the source viewport and the correspondence between viewport centers for the viewpoints.

[00145] At step 1207, the decoder decodes the portion of the VR video at a center point of the destination viewport at the destination viewpoint. The decoder can then forward the portion of the VR video at the destination viewport toward the display.

[00146] FIG. 13 is an embodiment of a method 1300 of signaling a viewport center point correspondence between multiple viewpoints in a VR video from an encoder, as discussed with respect to diagrams 800 and 900 and as applied to viewpoints such as viewpoints 702, 703, and/or 704. Hence, method 1300 may be employed by an encoder 103, an encoder 500, and/or a codec system 400. Method 1300 can also be employed to support generation of a VR video file, such as VR video file 1000, and hence can be employed to improve methods 200 and 300.

[00147] Method 1300 operates on an encoder, for example on a computer system configured to encode VR video. At step 1301, the encoder receives a VR video signal filmed from a plurality of viewpoints. At step 1303, the encoder determines a correspondence between viewport centers for the viewpoints. Data to support the determination of correspondences may be input by the user, received from the cameras, determined based on global positioning system (GPS) data, etc.
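
As one hypothetical way to derive such a correspondence from positional data, the following sketch aims every viewpoint's viewport centre at a common object of interest; the coordinate frame, the viewpoint positions, and the object position are invented purely for illustration.

import math

def viewport_centre_towards(viewpoint_xyz, object_xyz):
    # Direction from a viewpoint position to a common object of interest,
    # returned as (azimuth, elevation) in degrees. Positions are hypothetical
    # local Cartesian coordinates, e.g. derived from GPS or survey data.
    dx, dy, dz = (o - v for o, v in zip(object_xyz, viewpoint_xyz))
    azimuth = math.degrees(math.atan2(dy, dx))
    elevation = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return azimuth, elevation

# One corresponding viewport centre per viewpoint, all aimed at the same
# object of interest (compare object 930 in diagram 900); the numbers below
# are example values only.
viewpoint_positions = {1: (0.0, 0.0, 1.5), 2: (10.0, 0.0, 1.5), 3: (10.0, 10.0, 1.5)}
object_position = (5.0, 5.0, 2.0)
correspondence = {vid: viewport_centre_towards(pos, object_position)
                  for vid, pos in viewpoint_positions.items()}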

[00148] At step 1305, the encoder encodes the correspondence between the viewport centers for the viewpoints in a bitstream. For example, the correspondence between the viewport centers can be coded as a pair in a vcpc sample entry. As another example, the correspondence between the viewport centers can be coded as a set containing a plurality of viewport centers in one or more vcpc sample entries. The correspondence between the viewport centers may be coded in a timed metadata track related to the plurality of viewpoints. Further, the correspondence between the viewport centers can be coded in a sphere region structure, for example in the timed metadata track containing the vcpc sample entries.
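
A companion sketch to the earlier parsing example serializes a set of correspondences into the VcpcSample() byte layout; as before, encoding each sphere point as three signed 32-bit values is an assumption made only for this sketch, and num_viewpoint_sets is carried in the sample entry rather than written here.

import struct

def build_vcpc_sample(viewpoint_sets):
    # viewpoint_sets: [{'viewpoint_ids': [...], 'centres': [[(az, el, tilt), ...], ...]}, ...].
    # Writes the VcpcSample() body; num_viewpoint_sets belongs to the sample
    # entry and is therefore not written here.
    out = bytearray()
    for entry in viewpoint_sets:
        ids, centres = entry['viewpoint_ids'], entry['centres']
        out += struct.pack('>B', len(ids))            # num_viewpoints_in_this_set
        for vid in ids:
            out += struct.pack('>I', vid)             # viewpoint_id
        out += struct.pack('>H', len(centres))        # num_corresponding_viewport_centres_in_this_set
        for row in centres:
            for az, el, tilt in row:                  # one sphere point per viewpoint
                out += struct.pack('>iii', az, el, tilt)
    return bytes(out)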

[00149] At step 1307, the encoder can transmit the bitstream containing the correspondence between the viewport centers for the viewpoints to support viewpoint transitions when displaying the VR video signal. For example, the encoder can transmit the bitstream toward a client with a decoder. As another example, the encoder can transmit the bitstream toward a server, which can store the bitstream for further transmissions to client(s). Regardless of the example, the correspondence between the viewport centers indicates a correspondence between a center point of a source viewport at a source viewpoint and a center point of a destination viewport at a destination viewpoint to maintain a consistent object view upon viewpoint switching at the client/decoder.

[00150] FIG. 14 is a schematic diagram of an example video coding device 1400 according to an embodiment of the disclosure. The coding device 1400 is suitable for implementing the methods and processes disclosed herein. The coding device 1400 comprises downstream ports 1410 and transceiver units (Tx/Rx) 1420 for transmitting and receiving data to and from a downstream direction; a processor 1430, logic unit, or central processing unit (CPU) to process the data; upstream ports 1450 coupled to Tx/Rx 1420 for transmitting and receiving the data to and from an upstream direction; and a memory 1460 for storing the data. The coding device 1400 may also comprise optical-to-electrical (OE) components and/or electrical-to-optical (EO) components coupled to the downstream ports 1410, the Tx/Rx units 1420, and the upstream ports 1450 for egress or ingress of optical or electrical signals.

[00151] The processor 1430 is implemented by hardware and software. The processor 1430 may be implemented as one or more CPU chips, cores (e.g., as a multi-core processor), field-programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), and digital signal processors (DSPs). The processor 1430 is in communication with the downstream ports 1410, transceiver units 1420, upstream ports 1450, and memory 1460. The processor 1430 comprises a coding module 1470. The coding module 1470 implements the disclosed embodiments described above. For example, the coding module 1470 may implement an encoder 103, an encoder 500, a decoder 107, a decoder 600, and/or a codec system 400, depending on the example. Further, the coding module 1470 may implement method 200, method 300, method 1100, method 1200, and/or method 1300, depending on the example. For example, coding module 1470 may generate or decode a VR video file 1000. For example, the coding module 1470 can encode or decode VR video based on a timed metadata track 1060 that contains correspondences between viewport centers of different viewpoints, such as viewpoints 702, 703, 704, 810, 820, 910, 920, 940, and/or 950 as discussed with respect to diagrams 800 and 900. For example, when acting as an encoder, coding module 1470 can determine and encode correspondences between viewport center points for viewpoints in pairs and/or sets, for example in a SphereRegionStruct object in a timed metadata track. When acting as a decoder, the coding module 1470 can determine such correspondences and use them when switching between viewpoints to provide the user with a consistent view of a location/object. Specifically, the coding module 1470 can determine a source viewport at a source viewpoint and determine a destination viewport at a destination viewpoint based on the correspondences between the center points of such viewports. The inclusion of the coding module 1470 therefore provides a substantial improvement to the functionality of the coding device 1400 and effects a transformation of the coding device 1400 to a different state. Alternatively, the coding module 1470 is implemented as instructions stored in the memory 1460 and executed by the processor 1430.

[00152] The video coding device 1400 may also include input and/or output (I/O) devices 1480 for communicating data to and from a user. The I/O devices 1480 may include output devices such as a display for displaying video data, speakers for outputting audio data, etc. The I/O devices 1480 may also include input devices, such as a keyboard, mouse, trackball, etc., and/or corresponding interfaces for interacting with such output devices.

[00153] The memory 1460 comprises one or more disks, tape drives, and solid-state drives and may be used as an overflow data storage device, to store programs when such programs are selected for execution, and to store instructions and data that are read during program execution. The memory 1460 may be volatile and/or non-volatile and may be read-only memory (ROM), random-access memory (RAM), ternary content-addressable memory (TCAM), and/or static random-access memory (SRAM).

[00154] FIG. 15 is a schematic diagram of an embodiment of a system 1500 for signaling a viewport center point correspondence between multiple viewpoints in a VR video. The system 1500 is suitable for implementing the methods and processes disclosed herein and may, for example, implement method 200, method 300, method 1100, method 1200, and/or method 1300, depending on the example. The system 1500 includes a video encoder 1502. The encoder 1502 comprises a receiver 1501 for receiving a VR video signal filmed from a plurality of viewpoints. The encoder 1502 also comprises a correspondence determination module 1503 for determining a correspondence between viewport centers for the viewpoints. The encoder 1502 also comprises an encoding module 1505 for encoding the correspondence between the viewport centers for the viewpoints in a bitstream. The encoder 1502 also comprises a transmitter 1507 for transmitting the bitstream containing the correspondence between the viewport centers for the viewpoints to support viewpoint transitions when displaying the VR video signal. The encoder 1502 is further configured to perform other encoding related mechanisms as discussed herein.

[00155] The system 1500 also includes a video decoder 1510. The decoder 1510 comprises a receiver 1511 for receiving a bitstream including at least a portion of a coded VR video filmed from a plurality of viewpoints and including a correspondence between viewport centers for the viewpoints. The decoder 1510 also comprises a decoding module 1513 for decoding the portion of the VR video at a center point of a source viewport at a source viewpoint, and decoding the portion of the VR video at a center point of a destination viewport at a destination viewpoint. The decoder 1510 also comprises a determining module 1515 for determining to switch from the source viewpoint to the destination viewpoint, and determining the destination viewport at the destination viewpoint based on the source viewport and the correspondence between viewport centers for the viewpoints. The decoder 1510 also comprises a forwarding module 1517 for forwarding the portion of the VR video at the source viewport toward a display, and forwarding the portion of the VR video at the destination viewport toward the display. The decoder 1510 is further configured to perform other decoding, display, and/or viewpoint switching related mechanisms as discussed herein.

[00156] While several embodiments have been provided in the present disclosure, it may be understood that the disclosed systems and methods might be embodied in many other specific forms without departing from the spirit or scope of the present disclosure. The present examples are to be considered as illustrative and not restrictive, and the intention is not to be limited to the details given herein. For example, the various elements or components may be combined or integrated in another system or certain features may be omitted, or not implemented.

[00157] In addition, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, components, techniques, or methods without departing from the scope of the present disclosure. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and may be made without departing from the spirit and scope disclosed herein.