


Title:
A METHOD AND AN APPARATUS FOR MOTION COMPENSATION
Document Type and Number:
WIPO Patent Application WO/2019/158812
Kind Code:
A1
Abstract:
The invention relates to technical equipment and a method for determining at least one motion vector for at least one sample in a block in a current frame (1820), the motion vector having a motion vector component for each direction of the block, and determining a motion compensation filter for a direction, wherein the motion compensation filter corresponds to a motion vector component in said direction, has filter coefficients and is used for determining a predicted sample corresponding to a sample in the block (1830). The motion compensation filter is modified by adding original values of filter coefficients of the samples outside the tile to filter coefficients of the samples within the tile; and changing original values of the filter coefficients of the samples outside the tile to a change value equal to zero (1870).

Inventors:
AMINLOU ALIREZA (FI)
ZARE ALIREZA (FI)
Application Number:
PCT/FI2019/050095
Publication Date:
August 22, 2019
Filing Date:
February 08, 2019
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H04N19/51; G06T7/223; H04N19/43; H04N19/523; H04N19/55; H04N19/80
Foreign References:
US20150245059A12015-08-27
US20130101016A12013-04-25
US20170085913A12017-03-23
Other References:
CONCOLATO, C. ET AL.: "Adaptive Streaming of HEVC Tiled Videos Using MPEG-DASH", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, vol. 28, no. 8, 28 March 2017 (2017-03-28), pages 1981 - 1992, XP055633280, ISSN: 1051-8215, Retrieved from the Internet [retrieved on 20190418]
3 July 2017 (2017-07-03), XP055633290, Retrieved from the Internet [retrieved on 20190418]
WANG, Y. ET AL., VIEWPORT DEPENDENT PROCESSING IN VR: PARTIAL VIDEO DECODING, 25 May 2016 (2016-05-25), Retrieved from the Internet [retrieved on 20190418]
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (FI)
Claims:

1. A method, comprising:

- obtaining a reference frame being partitioned into at least one tile;

- determining at least one motion vector for at least one sample in a block in a current frame, the motion vector having a motion vector component for each direction of the block;

- for each direction of the block:

o determining a motion compensation filter for a direction, wherein the motion compensation filter corresponds to a motion vector component in said direction, said motion compensation filter having filter coefficients and being used for determining a predicted sample corresponding to a sample in the block;

o determining a tile in the reference frame corresponding to the sample in the block;

o determining samples in the reference frame needed to be used with the motion compensation filter to generate the predicted sample;

o determining samples going outside the tile in said direction; and

o modifying the motion compensation filter by

• adding original values of filter coefficients of the samples outside the tile to filter coefficients of the samples within the tile; and

• changing original values of the filter coefficients of the samples outside the tile to a change value equal to zero.

2. An apparatus comprising:

- means for obtaining a reference frame being partitioned into at least one tile;

- means for determining at least one motion vector for at least one sample in a block in a current frame, the motion vector having a motion vector component for each direction of the block;

- and means for implementing the following for each direction of the block:

o determining a motion compensation filter for a direction, wherein the motion compensation filter corresponds to a motion vector component in said direction, said motion compensation filter having filter coefficients and being used for determining a predicted sample corresponding to a sample in the block;

o determining a tile in the reference frame corresponding to the sample in the block;

o determining samples in the reference frame needed to be used with the motion compensation filter to generate the predicted sample;

o determining samples going outside the tile in said direction; and

o modifying the motion compensation filter by

• adding original values of filter coefficients of the samples outside the tile to filter coefficients of the samples within the tile; and

• changing original values of the filter coefficients of the samples outside the tile to a change value equal to zero.

3. The apparatus according to claim 2, wherein the adding the original values of filter coefficients of the samples outside the tile to filter coefficients of the samples within the tile comprises adding the original values of the filter coefficients to a filter coefficient of the first sample inside the tile and next to the samples outside the tile.

4. The apparatus according to claim 2 or 3, wherein the determining the tile in the reference frame corresponding to the sample comprises determining the location of the predicted sample in the reference frame according to the location of the sample in the current frame and the value of the motion vector, and determining the corresponding tile in which the predicted sample is located.

5. The apparatus according to claim 2 or 3, wherein the determining the tile in the reference frame corresponding to the sample comprises determining the location of a collocated sample in the reference frame and determining the corresponding tile in which the predicted sample is located.

6. The apparatus according to claim 4 or 5, wherein determining the corresponding tile comprises determining the tile in which the majority of the predicted samples of the block in the reference frame is located.

7. The apparatus according to any of the claims 2 to 6, wherein determining samples going outside the tile comprises determining a number of the samples going outside the tile as the maximum of the number of the samples going outside the tile for each sample in each direction-based row of the block.

8. The apparatus according to any of the claims 2 to 6, wherein determining samples going outside the tile comprises determining a number of the samples going outside the tile as the maximum of the number of the samples going outside of the tile for each sample in all direction-based rows in the block.

9. The apparatus according to any of the claims 2 to 6, wherein determining samples going outside the tile comprises determining a number of the samples going outside as the maximum of the number of the samples going outside for each sample in each subblock of the block, wherein a subblock is a block within the block.

10. The apparatus according to any of the claims 2 to 9, determining a motion vector for all samples in the block using the motion vector of the block.

11. The apparatus according to any of the claims 2 to 9, determining a motion vector for all samples in each subblock of the block using motion information of the block.

12. The apparatus according to any of the claims 2 to 9, determining a motion vector for a sample comprises determining a motion vector for each sample in the block using motion information of the block.

13. The apparatus according to any of the claims 2 to 12, further comprising indicating with a flag whether the modification of motion compensation filter is applied.

14. The apparatus according to claim 13, comprising at least one flag for at least one tile boundary of at least one tile.

15. The apparatus according to any of the claims 2 to 14, wherein the block has at least horizontal and vertical directions.

16. The apparatus according to any of the claims 2 to 15, further comprising using different flags for different block sizes.

Description:
A METHOD AND AN APPARATUS FOR MOTION COMPENSATION

Technical Field

The present solution generally relates to a method, an apparatus and a computer program product for modifying a motion compensation filter.

Background

A video coding system may comprise an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. The encoder may discard some information in the original video sequence in order to represent the video in a more compact form, for example, to enable the storage/transmission of the video information at a lower bitrate than otherwise might be needed.

Recently, the development of various multimedia streaming applications, especially 360-degree video or virtual reality (VR) applications, has advanced rapidly. In viewport-adaptive streaming, the aim is to reduce the bitrate, e.g. such that the primary viewport (i.e., the current viewing orientation) is transmitted at the best quality/resolution, while the remainder of the 360-degree video is transmitted at a lower quality/resolution. When the viewing orientation changes, e.g. when the user turns his/her head when viewing the content with a head-mounted display, another version of the content needs to be streamed, matching the new viewing orientation.

There are several alternatives to deliver the viewport-dependent omnidirectional video. It can be delivered, for example, as equal-resolution High Efficiency Video Coding (HEVC) bitstreams with motion-constrained tile sets (MCTSs). Thus, several HEVC bitstreams of the same omnidirectional source content are encoded at the same resolution but different qualities and bitrates using motion-constrained tile sets. Another option to deliver the viewport-dependent omnidirectional video is to carry out HEVC Scalable Extension (SHVC) region-of-interest scalability encoding. Therein, the base layer is coded conventionally and region-of-interest (ROI) enhancement layers are encoded with the SHVC Scalable Main profile. However, limited support of the available decoding hardware for inter-layer prediction, such as the SHVC extension of HEVC, restricts the usability of the SHVC ROI encoding.

Various technologies for providing three-dimensional (3D) video content are currently investigated and developed. Especially, intense studies have been focused on various multiview applications wherein a viewer is able to see only one pair of stereo video from a specific viewpoint and another pair of stereo video from a different viewpoint. One of the most feasible approaches for such multiview applications has turned out to be one wherein only a limited number of input views, e.g. a mono or a stereo video plus some supplementary data, is provided to the decoder side and all required views are then rendered (i.e. synthesized) locally by the decoder to be displayed on a display.

Summary

Now there has been invented an improved method and technical equipment implementing the method, for video encoding/decoding. Various aspects of the invention include a method, an apparatus, and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.

According to a first aspect, there is provided a method comprising obtaining a reference frame being partitioned into at least one tile; determining at least one motion vector for at least one sample in a block in a current frame, the motion vector having a motion vector component for each direction of the block; and for each direction of the block: determining a motion compensation filter for a direction, wherein the motion compensation filter corresponds to a motion vector component in said direction, said motion compensation filter having filter coefficients and being used for determining a predicted sample corresponding to a sample in the block; determining a tile in the reference frame corresponding to the sample in the block; determining samples in the reference frame needed to be used with the motion compensation filter to generate the predicted sample; determining samples going outside the tile in said direction; and modifying the motion compensation filter by adding original values of filter coefficients of the samples outside the tile to filter coefficients of the samples within the tile; and changing original values of the filter coefficients of the samples outside the tile to a change value equal to zero.

According to a second aspect, there is provided an apparatus comprising means for obtaining a reference frame being partitioned into at least one tile; means for determining at least one motion vector for at least one sample in a block in a current frame, the motion vector having a motion vector component for each direction of the block; and means for implementing the following for each direction of the block: determining a motion compensation filter for a direction, wherein the motion compensation filter corresponds to a motion vector component in said direction, said motion compensation filter having filter coefficients and being used for determining a predicted sample corresponding to a sample in the block; determining a tile in the reference frame corresponding to the sample in the block; determining samples in the reference frame needed to be used with the motion compensation filter to generate the predicted sample; determining samples going outside the tile in said direction; and modifying the motion compensation filter by adding original values of filter coefficients of the samples outside the tile to filter coefficients of the samples within the tile; and changing original values of the filter coefficients of the samples outside the tile to a change value equal to zero.

According to a third aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to implement a method according to embodiments.

According to an embodiment, the adding the original values of filter coefficients of the samples outside the tile to filter coefficients of the samples within the tile comprises adding the original value of the filter coefficient to a filter coefficient of the first sample inside the tile and next to the samples outside the tile.
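As an illustration of this embodiment, the following sketch folds the coefficients of the filter taps that read samples outside the tile onto the nearest in-tile tap and sets the out-of-tile coefficients to zero. The function and variable names, the example tap positions and coefficient values, and the tile boundary are illustrative assumptions and are not taken from the specification.

```python
def modify_mc_filter(coeffs, tap_positions, tile_min, tile_max):
    """Fold coefficients of out-of-tile taps onto the nearest in-tile tap.

    coeffs        : filter coefficients, one per tap
    tap_positions : integer sample position (in one direction) read by each tap
    tile_min/max  : inclusive tile boundaries in that direction
    """
    inside = [tile_min <= p <= tile_max for p in tap_positions]
    if not any(inside):
        raise ValueError("no filter tap falls inside the tile")
    modified = list(coeffs)
    for i, p in enumerate(tap_positions):
        if inside[i]:
            continue
        # index of the first sample inside the tile, next to the outside samples
        j = min((k for k in range(len(coeffs)) if inside[k]),
                key=lambda k: abs(tap_positions[k] - p))
        modified[j] += coeffs[i]   # add the original out-of-tile coefficient value
        modified[i] = 0            # change the out-of-tile coefficient to zero
    return modified

# Example: an 8-tap filter whose two left-most taps would read samples to the
# left of a tile starting at x = 0.
taps = [-1, 4, -11, 40, 40, -11, 4, -1]    # example coefficients only
positions = [-2, -1, 0, 1, 2, 3, 4, 5]     # sample x-coordinates read by the taps
print(modify_mc_filter(taps, positions, tile_min=0, tile_max=63))
# -> [0, 0, -8, 40, 40, -11, 4, -1]
```

Because the out-of-tile coefficients are moved rather than discarded, the sum of the coefficients (the filter gain) is unchanged, so no renormalization is needed.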

According to an embodiment, the determining the tile in the reference frame corresponding to the sample comprises determining the location of the predicted sample in the reference frame according to the location of the sample in the current frame and the value of the motion vector, and determining the corresponding tile in which the predicted sample is located.

According to an embodiment, the determining the tile in the reference frame corresponding to the sample comprises determining the location of a collocated sample in the reference frame and determining the corresponding tile in which the predicted sample is located.

According to an embodiment, determining the corresponding tile comprises determining the tile in which the majority of the predicted samples of the block in the reference frame is located.

According to an embodiment, determining samples going outside the tile comprises determining a number of the samples going outside the tile as the maximum of the number of the samples going outside the tile for each sample in each direction-based row of the block.

According to an embodiment, determining samples going outside the tile comprises determining a number of the samples going outside the tile as the maximum of the number of the samples going outside of the tile for each sample in all direction-based rows in the block.

According to an embodiment, determining samples going outside the tile comprises determining a number of the samples going outside as the maximum of the number of the samples going outside for each sample in each subblock of the block, wherein a subblock is a block within the block.

According to an embodiment, determining a motion vector for all samples in the block using the motion vector of the block.

According to an embodiment, determining a motion vector for all samples in each subblock of the block using motion information of the block.

According to an embodiment, determining a motion vector for a sample comprises determining a motion vector for each sample in the block using motion information of the block.

According to an embodiment, further comprising indicating with a flag whether a modification of motion compensation filter is applied.

According to an embodiment, there is a flag for each tile boundary of each tile.

According to an embodiment, there is a flag for all boundaries of a tile in the reference frame.

According to an embodiment, there is a flag for all boundaries of all tiles in the reference frame.

According to an embodiment, the block has at least horizontal and vertical directions.

According to an embodiment, there are different flags for different block sizes.

According to an embodiment, said means of the apparatus comprises at least one processor, a memory and a computer code stored in said memory.

Description of the Drawings

In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows an encoding process according to an embodiment;

Fig. 2 shows a decoding process according to an embodiment;

Figs. 3a and 3b show examples of motion vector candidate positions;

Fig. 4 shows an example of filtering for fractional motion compensation in one direction;

Fig. 5 shows an example of fractional motion compensation in two directions;

Fig. 6 shows an example of four parameter affine motion compensation;

Fig. 7 shows an example of an alternative method for four parameter affine motion compensation;

Fig. 8 shows an example of fractional motion compensation near tile boundaries;

Fig. 9 shows an example of merging a coded tile rectangle sequence;

Fig. 10 shows an example of encoding SHVC bitstreams;

Fig. 11 shows an example of constrained inter-layer prediction;

Fig. 12 shows an example of spatially packed constrained inter-layer prediction;

Fig. 13 shows an example of packed constrained interlayer prediction of two bitstreams;

Fig. 14 shows an example of switching from enhanced quality tiles at the first non-IRAP switching point in packed constrained interlayer prediction;

Fig. 15 shows an example of a possible file arrangement and a respective arrangement of Representations for streaming in packed constrained interlayer prediction;

Fig. 16 shows an example according to an embodiment of the present invention;

Fig. 17 shows an example according to another embodiment of the present invention;

Fig. 18 is a flowchart illustrating a method according to an embodiment; and

Fig. 19 shows an apparatus according to an embodiment.

Description of Example Embodiments

In the following, several embodiments of the invention will be described in the context of one video coding arrangement. It is to be noted, however, that the invention is not limited to this particular arrangement. For example, the invention may be applicable to video coding systems like streaming systems, DVD players, digital television receivers, personal video recorders, systems and computer programs on personal computers, handheld computers and communication devices, as well as network elements such as transcoders and cloud computing arrangements where video data is handled.

In the following, several embodiments are described using the convention of referring to (de)coding, which indicates that the embodiments may apply to decoding and/or encoding.

The Advanced Video Coding standard (which may be abbreviated AVC or H.264/AVC) was developed by the Joint Video Team (JVT) of the Video Coding Experts Group (VCEG) of the Telecommunications Standardization Sector of International Telecommunication Union (ITU-T) and the Moving Picture Experts Group (MPEG) of International Organisation for Standardization (ISO) / International Electrotechnical Commission (IEC). The H.264/AVC standard is published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10, also known as MPEG-4 Part 10 Advanced Video Coding (AVC). There have been multiple versions of the H.264/AVC standard, each integrating new extensions or features to the specification. These extensions include Scalable Video Coding (SVC) and Multiview Video Coding (MVC).

The High Efficiency Video Coding standard (which may be abbreviated HEVC or H.265/HEVC) was developed by the Joint Collaborative Team - Video Coding (JCT-VC) of VCEG and MPEG. The standard is published by both parent standardization organizations, and it is referred to as ITU-T Recommendation H.265 and ISO/IEC International Standard 23008-2, also known as MPEG-H Part 2 High Efficiency Video Coding (HEVC). Extensions to H.265/HEVC include scalable, multiview, three-dimensional, and fidelity range extensions, which may be referred to as SHVC, MV-HEVC, 3D-HEVC, and REXT, respectively. The references in this description to H.265/HEVC, SHVC, MV-HEVC, 3D-HEVC and REXT that have been made for the purpose of understanding definitions, structures or concepts of these standard specifications are to be understood to be references to the latest versions of these standards that were available before the date of this application, unless otherwise indicated.

Some key definitions, bitstream and coding structures, and concepts of H.264/AVC and HEVC and some of their extensions are described in this section as an example of a video encoder, decoder, encoding method, decoding method, and a bitstream structure, wherein the embodiments may be implemented. Some of the key definitions, bitstream and coding structures, and concepts of H.264/AVC are the same as in HEVC standard - hence, they are described below jointly. The aspects of the invention are not limited to H.264/AVC or HEVC or their extensions, but rather the description is given for one possible basis on top of which the invention may be partly or fully realized. In the description of existing standards as well as in the description of example embodiments, a syntax element may be defined as an element of data represented in the bitstream. A syntax structure may be defined as zero or more syntax elements present together in the bitstream in a specified order.

Similarly to many earlier video coding standards, the bitstream syntax and semantics as well as the decoding process for error-free bitstreams are specified in H.264/AVC and HEVC. The encoding process is not specified, but encoders must generate conforming bitstreams. Bitstream and decoder conformance can be verified with the Hypothetical Reference Decoder (HRD). The standards contain coding tools that help in coping with transmission errors and losses, but the use of the tools in encoding is optional and no decoding process has been specified for erroneous bitstreams. The elementary unit for the input to an H.264/AVC or HEVC encoder and the output of an H.264/AVC or HEVC decoder, respectively, is a picture. A picture given as an input to an encoder may also be referred to as a source picture, and a picture decoded by a decoder may be referred to as a decoded picture.

The source and decoded pictures may each be comprised of one or more sample arrays, such as one of the following sets of sample arrays, wherein each of the samples represents one color component:

- Luma (Y) only (monochrome).

- Luma and two chroma (YCbCr or YCgCo).

- Green, Blue and Red (GBR, also known as RGB).

- Arrays representing other unspecified monochrome or tri-stimulus color samplings (for example, YZX, also known as XYZ).

In the following, these arrays may be referred to as luma (or L or Y) and chroma, where the two chroma arrays may be referred to as Cb and Cr, regardless of the actual color representation method in use. The actual color representation method in use may be indicated e.g. in a coded bitstream e.g. using the Video Usability Information (VUI) syntax of H.264/AVC and/or HEVC. A component may be defined as an array or a single sample from one of the three sample arrays (luma and two chroma) or the array or a single sample of the array that compose a picture in monochrome format.

In H.264/AVC and HEVC, a picture may either be a frame or a field. A frame comprises a matrix of luma samples and possibly the corresponding chroma samples. A field is a set of alternate sample rows of a frame. Fields may be used as encoder input for example when the source signal is interlaced. Chroma sample arrays may be absent (and hence monochrome sampling may be in use) or may be subsampled when compared to luma sample arrays. Some chroma formats may be summarized as follows:

- In monochrome sampling there is only one sample array, which may be nominally considered the luma array.

- In 4:2:0 sampling, each of the two chroma arrays has half the height and half the width of the luma array.

- In 4:2:2 sampling, each of the two chroma arrays has the same height and half the width of the luma array.

- In 4:4:4 sampling when no separate color planes are in use, each of the two chroma arrays has the same height and width as the luma array.

In H.264/AVC and HEVC, it is possible to code sample arrays as separate color planes into the bitstream and respectively decode separately coded color planes from the bitstream. When separate color planes are in use, each one of them is separately processed (by the encoder and/or the decoder) as a picture with monochrome sampling.

When chroma subsampling is in use (e.g. 4:2:0 or 4:2:2 chroma sampling), the location of chroma samples with respect to luma samples may be determined in the encoder side (e.g. as pre-processing step or as part of encoding). The chroma sample positions with respect to luma sample positions may be pre-defined for example in a coding standard, such as H.264/AVC or HEVC, or may be indicated in the bitstream for example as part of VUI of H.264/AVC or HEVC.

Generally, the source video sequence(s) provided as input for encoding may either represent interlaced source content or progressive source content. Fields of opposite parity have been captured at different times for interlaced source content. Progressive source content contains captured frames. An encoder may encode fields of interlaced source content in two ways: a pair of interlaced fields may be coded into a coded frame or a field may be coded as a coded field. Likewise, an encoder may encode frames of progressive source content in two ways: a frame of progressive source content may be coded into a coded frame or a pair of coded fields. A field pair or a complementary field pair may be defined as two fields next to each other in decoding and/or output order, having opposite parity (i.e. one being a top field and another being a bottom field) and neither belonging to any other complementary field pair. Some video coding standards or schemes allow mixing of coded frames and coded fields in the same coded video sequence. Moreover, predicting a coded field from a field in a coded frame and/or predicting a coded frame for a complementary field pair (coded as fields) may be enabled in encoding and/or decoding.

A partitioning may be defined as a division of a set into subsets such that each element of the set is in exactly one of the subsets. A picture partitioning may be defined as a division of a picture into smaller non-overlapping units. A block partitioning may be defined as a division of a block into smaller non-overlapping units, such as subblocks. In some cases, the term block partitioning may be considered to cover multiple levels of partitioning, for example partitioning of a picture into slices, and partitioning of each slice into smaller units, such as macroblocks of H.264/AVC. It is noted that the same unit, such as a picture, may have more than one partitioning. For example, a coding unit of HEVC may be partitioned into prediction units and separately by another quadtree into transform units.

Many hybrid video codecs, including H.264/AVC and HEVC, encode information in two phases. In the first phase, predictive coding is applied for example as so-called sample prediction and/or so-called syntax prediction.

In the sample prediction, pixel or sample values in a certain picture area or "block" are predicted. These pixel or sample values can be predicted, for example, using one or more of the following ways:

- Motion compensation mechanisms (which may also be referred to as temporal prediction or motion-compensated temporal prediction or motion-compensated prediction or MCP), which involve finding and indicating an area in one of the previously encoded video frames that corresponds closely to the block being coded.

- Intra prediction, where pixel or sample values can be predicted by spatial mechanisms which involve finding and indicating a spatial region relationship.

In the syntax prediction, which may also be referred to as parameter prediction, syntax elements and/or syntax element values and/or variables derived from syntax elements are predicted from syntax elements (de)coded earlier and/or variables derived earlier. Non-limiting examples of syntax prediction are provided below:

- In motion vector prediction (MVP), motion vectors (MVs) e.g. for inter and/or inter-view prediction may be coded differentially with respect to a block-specific motion vector. In many video codecs, the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks (a minimal illustrative sketch follows this list). Another way to create motion vector predictions, sometimes referred to as advanced motion vector prediction (AMVP), is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference frames and signalling the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of a previously coded/decoded picture can be predicted. The reference index is typically predicted from adjacent blocks and/or co-located blocks in temporal reference frame. Differential coding of motion vectors may be disabled across slice boundaries.

- The block partitioning, e.g. from CTUs to CUs and down to PUs, may be predicted.

- In filter parameter prediction, the filtering parameters e.g. for sample adaptive offset may be predicted.

Prediction approaches using image information from a previously coded image can also be called inter prediction methods, which may also be referred to as temporal prediction and motion compensation. Prediction approaches using image information within the same image can also be called intra prediction methods.
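As a minimal, non-normative illustration of the median-based motion vector prediction mentioned in the list above, a predictor can be formed component-wise from the motion vectors of, for example, the left, above and above-right neighbouring blocks; the function name and the choice of neighbours are assumptions for this sketch, not the derivation of any particular standard.

```python
def median_mv_predictor(mv_left, mv_above, mv_above_right):
    """Component-wise median of three neighbouring motion vectors."""
    def median3(a, b, c):
        return sorted((a, b, c))[1]
    return (median3(mv_left[0], mv_above[0], mv_above_right[0]),
            median3(mv_left[1], mv_above[1], mv_above_right[1]))

# The predictor is subtracted from the block's motion vector; only the
# difference (and, in AMVP-style schemes, a candidate index) is coded.
mv = (5, -2)
pred = median_mv_predictor((4, -1), (6, -3), (2, 0))   # -> (4, -1)
mvd = (mv[0] - pred[0], mv[1] - pred[1])               # -> (1, -1)
```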

The second phase is one of coding the error between the predicted block of pixels or samples and the original block of pixels or samples. This may be accomplished by transforming the difference in pixel or sample values using a specified transform. This transform may be e.g. a Discrete Cosine Transform (DCT) or a variant thereof. After transforming the difference, the transformed difference is quantized and entropy coded.

By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel or sample representation (i.e. the visual quality of the picture) and the size of the resulting encoded video representation (i.e. the file size or transmission bit rate).
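This trade-off can be illustrated with a small sketch that applies a naive floating-point 1-D DCT to an example residual and quantizes the coefficients with two different step sizes; real codecs use integer transforms, rate-distortion optimized quantization and entropy coding, so the numbers below are purely illustrative.

```python
import math

def dct_1d(x):
    """Naive 1-D DCT-II (illustration only)."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n))
            for k in range(n)]

def quantize(coeffs, step):
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    return [level * step for level in levels]

residual = [4, 3, -2, -1, 0, 1, 2, 3]          # example prediction error row
coeffs = dct_1d(residual)
for step in (1, 8):                             # finer vs. coarser quantization
    levels = quantize(coeffs, step)
    print(step, levels, dequantize(levels, step))
# A larger step gives smaller, more compressible levels at the cost of a larger
# reconstruction error after dequantization and the inverse transform.
```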

The decoder reconstructs the output video by applying a prediction mechanism similar to that used by the encoder in order to form a predicted representation of the pixel or sample blocks (using the motion or spatial information created by the encoder and included in the compressed representation of the image) and prediction error decoding (the inverse operation of the prediction error coding to recover the quantized error signal in the spatial domain).

After applying pixel or sample prediction and error decoding processes the decoder combines the prediction and the prediction error signals (the pixel or sample values) to form the output video frame.

An example of an encoding process is illustrated in Figure 1. Figure 1 illustrates an image to be encoded (In); a predicted representation of an image block (P'n); a prediction error signal (Dn); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); a transform (T) and inverse transform (T⁻¹); a quantization (Q) and inverse quantization (Q⁻¹); entropy encoding (E); a reference frame memory (RFM); inter prediction (Pinter); intra prediction (Pintra); mode selection (MS) and filtering (F). An example of a decoding process is illustrated in Figure 2. Figure 2 illustrates a predicted representation of an image block (P'n); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); an inverse transform (T⁻¹); an inverse quantization (Q⁻¹); an entropy decoding (E⁻¹); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).

The decoder (and encoder) may also apply additional filtering processes in order to improve the quality of the output video before passing it for display and/or storing as a prediction reference for the forthcoming pictures in the video sequence.

In many video codecs, including H.264/AVC and HEVC, motion information is indicated by motion vectors associated with each motion compensated image block, and an index which refers to one of the reference frames in the RFM. Each of these motion vectors represents the displacement of the image block in the picture to be coded (in the encoder) or decoded (at the decoder) and the prediction source block in one of the previously coded or decoded images (or pictures), named the reference frame. H.264/AVC and HEVC, as many other video compression standards, divide a picture into a mesh of rectangles, for each of which a similar block in one of the reference frames is indicated for inter prediction. The location of the prediction block is coded as a motion vector that indicates the position of the prediction block relative to the block being coded.

H.264/AVC and HEVC include a concept of picture order count (POC). A value of POC is derived for each picture and is non-decreasing with increasing picture position in output order. POC therefore indicates the output order of pictures. POC may be used in the decoding process for example for implicit scaling of motion vectors in the temporal direct mode of bi-predictive slices, for implicitly derived weights in weighted prediction, and for reference frame list initialization. Furthermore, POC may be used in the verification of output order conformance.

Inter prediction process may comprise one or more of the following features:

- The accuracy of motion vector representation. For example, motion vectors may be of quarter-pixel accuracy, and sample values in fractional-pixel positions may be obtained using a finite impulse response (FIR) filter.

- Block partitioning for inter prediction. Many coding standards, including H.264/AVC and HEVC, allow selection of the size and shape of the block for which a motion vector is applied for motion-compensated prediction in the encoder, and indicating the selected size and shape in the bitstream so that decoders can reproduce the motion-compensated prediction done in the encoder.

- Number of reference frames for inter prediction. The sources of inter prediction are previously decoded pictures. Many coding standards, including H.264/AVC and HEVC, enable storage of multiple reference frames for inter prediction and selection of the used reference frame on a block basis. For example, reference frames may be selected on macroblock or macroblock partition basis in H.264/AVC and on PU or CU basis in HEVC. Many coding standards, such as H.264/AVC and HEVC, include syntax structures in the bitstream that enable decoders to create one or more reference frame lists. Examples of modes for implementing a block partitioning comprise affine motion compensation, overlapped block motion compensation (OBMC) and merge mode. A reference frame index to a reference frame list may be used to indicate which one of the multiple reference frames is used for inter prediction for a particular block. A reference frame index may be coded by an encoder into the bitstream in some inter coding modes or it may be derived (by an encoder and a decoder) for example using neighbouring blocks in some other inter coding modes, for example merge mode, OBMC, etc.

- Motion vector prediction. In order to represent motion vectors efficiently in bitstreams, motion vectors may be coded differentially with respect to a block-specific predicted motion vector. In many video codecs, the predicted motion vectors are created in a predefined way, for example, by calculating the median of the encoded or decoded motion vectors of the adjacent blocks. Another way to create motion vector predictions, sometimes referred to as advanced motion vector prediction (AMVP), is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference frames and signalling the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of previously coded/decoded picture can be predicted. The reference index may be predicted, e.g. from adjacent blocks and/or co-located blocks in temporal reference frame. Differential coding of motion vectors may be disabled across slice boundaries.

- Merge mode. High efficiency video codecs may employ an additional motion information coding/decoding mechanism, often called merging/merge mode, where all the motion field information, which includes motion vector and corresponding reference frame index for each available reference frame list, is predicted and used without any modification/correction. Similarly, predicting the motion field information may be carried out using the motion field information of adjacent blocks and/or co-located blocks in temporal reference frames, and the used motion field information is signaled among a motion field candidate list filled with motion field information of available adjacent/co-located blocks.

- Multi-hypothesis motion-compensated prediction. H.264/AVC and HEVC enable the use of a single prediction block in P slices (herein referred to as uni-predictive slices) or a linear combination of two motion-compensated prediction blocks for bi-predictive slices, which are also referred to as B slices, or a weighted prediction of uni-predictive and bi-predictive blocks where weights may be signaled at slice or frame level. Individual blocks in B slices may be bi-predicted, uni-predicted, or intra-predicted, and individual blocks in P slices may be uni-predicted or intra-predicted. The reference frames for a bi-predictive picture may not be limited to be the subsequent picture and the previous picture in output order, but rather any reference frames may be used. In many coding standards, such as H.264/AVC and HEVC, one reference frame list, referred to as reference frame list 0, is constructed for P slices, and two reference frame lists, list 0 and list 1, are constructed for B slices. For B slices, prediction in the forward direction may refer to prediction from a reference frame in reference frame list 0, and prediction in the backward direction may refer to prediction from a reference frame in reference frame list 1, even though the reference frames for prediction may have any decoding or output order relation to each other or to the current picture.

- Weighted prediction. Many coding standards use a prediction weight of 1 for prediction blocks of inter (P) pictures and 0.5 for each prediction block of a B picture (resulting in averaging). H.264/AVC allows weighted prediction for both P and B slices. In implicit weighted prediction, the weights are proportional to picture order counts (POC), while in explicit weighted prediction, prediction weights are explicitly indicated.

In the HEVC standard, the motion vector prediction process may involve spatially adjacent motion vectors and/or motion vectors from other pictures (temporal, inter-layer, or inter-view reference frames). The motion vector candidate positions are as shown in Figure 3, where black dots 301 indicate sample positions directly adjacent to block X, defining positions of possible MVPs. Figure 3a illustrates spatial MVP positions, and Figure 3b illustrates temporal MVP positions, where Y is the collocated block of X in a reference frame (it does not necessarily match with a PB in this reference frame). Positions C0 and C1 are candidates for the TMVP (temporal motion vector predictor).

In H.264/AVC and HEVC a picture partitioning may be defined as a division of a set into subsets such that each element of the set is in exactly one of the subsets.

In H.264/AVC, a macroblock is a 16x16 block of luma samples and the corresponding blocks of chroma samples. For example, in the 4:2:0 sampling pattern, a macroblock contains one 8x8 block of chroma samples per each chroma component. In H.264/AVC, a picture is partitioned to one or more slice groups, and a slice group contains one or more slices. In H.264/AVC, a slice consists of an integer number of macroblocks ordered consecutively in the raster scan within a particular slice group.

When describing the operation of HEVC encoding and/or decoding, the following terms may be used. A coding block may be defined as an NxN block of samples for some value of N such that the division of a coding tree block into coding blocks is a partitioning. A coding tree block (CTB) may be defined as an NxN block of samples for some value of N such that the division of a component into coding tree blocks is a partitioning. A coding tree unit (CTU) may be defined as a coding tree block of luma samples, two corresponding coding tree blocks of chroma samples of a picture that has three sample arrays, or a coding tree block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples. A coding unit (CU) may be defined as a coding block of luma samples, two corresponding coding blocks of chroma samples of a picture that has three sample arrays, or a coding block of samples of a monochrome picture or a picture that is coded using three separate color planes and syntax structures used to code the samples.

In some video codecs, such as the HEVC codec, video pictures are divided into coding units (CU) covering the area of the picture. A CU consists of one or more prediction units (PU) defining the prediction process for the samples within the CU and one or more transform units (TU) defining the prediction error coding process for the samples in said CU. A CU may consist of a square block of samples with a size selectable from a predefined set of possible CU sizes. A CU with the maximum allowed size may be named as LCU (largest coding unit) or coding tree unit (CTU) and the video picture is divided into non-overlapping LCUs. An LCU can be further split into a combination of smaller CUs, e.g. by recursively splitting the LCU and resultant CUs. Each resulting CU may have at least one PU and at least one TU associated with it. Each PU and TU can be further split into smaller PUs and TUs in order to increase granularity of the prediction and prediction error coding processes, respectively. Each PU has prediction information associated with it defining what kind of a prediction is to be applied for the samples within that PU (e.g. motion vector information for inter predicted PUs and intra prediction directionality information for intra predicted PUs).

Each TU can be associated with information describing the prediction error decoding process for the samples within said TU (including e.g. DCT coefficient information). It may be signaled at CU level whether prediction error coding is applied or not for each CU. In the case there is no prediction error residual associated with the CU, it can be considered that there are no TUs for said CU. The division of the image into CUs, and division of CUs into PUs and TUs, may be signaled in the bitstream allowing the decoder to reproduce the intended structure of these units.

In the HEVC standard, a picture can be partitioned into tiles, which are rectangular and contain an integer number of CTUs. In the HEVC standard, the partitioning to tiles forms a grid that may be characterized by a list of tile column widths (in CTUs) and a list of tile row heights (in CTUs). Tiles are ordered in the bitstream consecutively in the raster scan order of the tile grid. A tile may contain an integer number of slices.
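As an illustration of this tile grid characterization (a sketch with assumed parameter names, not code from the HEVC specification), the tile containing a given CTU can be located from the lists of tile column widths and tile row heights:

```python
from bisect import bisect_right
from itertools import accumulate

def ctu_to_tile(ctu_x, ctu_y, tile_col_widths, tile_row_heights):
    """Return (tile_col, tile_row) of the tile containing CTU (ctu_x, ctu_y).

    tile_col_widths and tile_row_heights are given in CTUs, as in the tile grid
    described above.
    """
    col_bounds = list(accumulate(tile_col_widths))    # cumulative right edges
    row_bounds = list(accumulate(tile_row_heights))   # cumulative bottom edges
    return bisect_right(col_bounds, ctu_x), bisect_right(row_bounds, ctu_y)

# Example: a 3x2 tile grid with columns of 4, 4 and 2 CTUs and rows of 3 and 3 CTUs.
print(ctu_to_tile(5, 4, [4, 4, 2], [3, 3]))   # -> (1, 1)
```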

In the HEVC, a slice consists of an integer number of CTUs. The CTUs are scanned in the raster scan order of CTUs within tiles or within a picture, if tiles are not in use. A slice may contain an integer number of tiles or a slice can be contained in a tile. Within a CTU, the CUs have a specific scan order.

In HEVC, a slice is defined to be an integer number of coding tree units contained in one independent slice segment and all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any) within the same access unit. In HEVC, a slice segment is defined to be an integer number of coding tree units ordered consecutively in the tile scan and contained in a single NAL (Network Abstraction Layer) unit. The division of each picture into slice segments is a partitioning. In HEVC, an independent slice segment is defined to be a slice segment for which the values of the syntax elements of the slice segment header are not inferred from the values for a preceding slice segment, and a dependent slice segment is defined to be a slice segment for which the values of some syntax elements of the slice segment header are inferred from the values for the preceding independent slice segment in decoding order. In HEVC, a slice header is defined to be the slice segment header of the independent slice segment that is the current slice segment or is the independent slice segment that precedes a current dependent slice segment, and a slice segment header is defined to be part of a coded slice segment containing the data elements pertaining to the first or all coding tree units represented in the slice segment. The CUs are scanned in the raster scan order of LCUs within tiles or within a picture, if tiles are not in use. Within an LCU, the CUs have a specific scan order.

The filtering for fractional motion compensation may be done by a finite impulse response filter as shown in Figure 4, illustrating an embodiment in one dimension, where the G's are the filter coefficients (in this example the number of filter taps is 8).

Fractional motion compensation may be performed in two consecutive steps, called horizontal and vertical filtering. An example of this is shown in Figure 5, where the fractional motion vector is (0.5, 0.5).

1) Horizontal filtering: The input samples 501 (samples of the reference frame at integer pixel positions, shown in square shape) are filtered in the horizontal direction according to the horizontal component of the fractional motion vector, and intermediate samples 502 (shown in triangle shape) are generated as output. The filtering in this example is 8-tap, so for each intermediate sample 502 (i.e., triangle), four samples on the left and right sides of that sample are needed.

2) Vertical filtering: Next the intermediate samples 502 (i.e., triangles) are filtered in the vertical direction according to the vertical component of the fractional motion vector, and the final samples 503 (shown in circle shape) are generated. The vertical filtering in this example is 8-tap, so for each final sample 503 (i.e., circle), four samples on the top and bottom sides of that sample are needed.

According to this process, to generate a predicted sample with a fractional motion vector, T×T integer samples are needed from the reference frame. Also, according to this process, as shown in Figure 5, to generate an N×M block with a fractional motion vector, (N+T−1)×(M+T−1) integer samples are needed, where T is the size of the filter tap (for example T=8 for luma block prediction in HEVC).
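The two-step filtering and the (N+T−1)×(M+T−1) sample requirement can be illustrated with the following sketch; the 8-tap coefficients are only an example and the normalization and rounding of an actual codec are omitted.

```python
def interpolate_block(ref, x0, y0, width, height, h_filter, v_filter):
    """Separable fractional-sample interpolation as described above.

    ref       : 2-D list of integer reference samples, indexed as ref[y][x]
    (x0, y0)  : integer position in ref of the top-left sample to be predicted
    h_filter, v_filter : filter coefficients (tap count T = len(filter))
    The computation touches (width + T - 1) x (height + T - 1) integer samples.
    """
    t = len(h_filter)
    ext = t // 2 - 1   # integer samples fetched to the left/above each position
    # 1) horizontal filtering -> (height + T - 1) rows of intermediate samples
    inter = [[sum(h_filter[k] * ref[y0 - ext + j][x0 - ext + i + k] for k in range(t))
              for i in range(width)]
             for j in range(height + t - 1)]
    # 2) vertical filtering of the intermediate samples -> final predicted samples
    return [[sum(v_filter[k] * inter[j + k][i] for k in range(t))
             for i in range(width)]
            for j in range(height)]

# Example: a 4x4 block and an 8-tap filter touch an 11x11 area of integer samples.
ref = [[(x + y) % 256 for x in range(32)] for y in range(32)]
halfpel = [-1, 4, -11, 40, 40, -11, 4, -1]    # example 8-tap coefficients only
pred = interpolate_block(ref, 12, 12, 4, 4, halfpel, halfpel)
```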

Motion compensation can be done using several complex motion models which need more motion information, for example a six-parameter affine model, a motion model with four motion vectors for the four corners of the block, or elastic motion compensation, which represents the motion information of the samples in the block based on cosine functions. There are other motion models, including perspective or polynomial models.

Affine motion compensation, for example, can be done according to various ways. The original and most flexible affine motion compensation, which supports any linear deformation including zooming, rotation and shearing, needs six motion parameters for the block. However, limited affine motion compensation, supporting only zooming and rotation (and their combination), needs two motion vectors for the top-left and top-right corners of the block. Then the motion vectors for all the samples inside the block are calculated based on these two motion vectors. In this case, as the (fractional) motion vector for each sample inside the block can be different, each sample should be calculated separately. This requires a huge amount of calculation for a block. This case is shown in Figure 6. An alternative method is to divide the block into smaller subblocks (e.g. 4x4 blocks) and calculate a motion vector for each subblock, for example according to the center location of the subblock. In this case, each subblock has a (fractional) motion vector, and the number of calculations for filtering is reduced significantly. A more optimal way is to select the size of the subblock based on those two motion vectors. Particularly, when the difference of the two motion vectors is small, larger subblocks can be used, and when the difference of the motion vectors is large, smaller subblocks should be used. This case is shown in Figure 7. The abovementioned techniques can be used with other complex motion compensation models.
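The subblock-based alternative can be sketched as follows, assuming the common four-parameter affine model in which the top-left and top-right corner motion vectors define a zoom/rotation field; the function name and the fixed subblock size are assumptions for illustration.

```python
def affine_subblock_mvs(mv0, mv1, block_w, block_h, sub=4):
    """Per-subblock motion vectors from the two corner motion vectors.

    mv0, mv1 : motion vectors of the top-left and top-right corners of the block
    Each sub x sub subblock gets the motion vector evaluated at its centre.
    """
    ax = (mv1[0] - mv0[0]) / block_w   # horizontal gradient of the motion field
    ay = (mv1[1] - mv0[1]) / block_w   # vertical gradient of the motion field
    mvs = {}
    for sy in range(0, block_h, sub):
        for sx in range(0, block_w, sub):
            cx, cy = sx + sub / 2, sy + sub / 2        # subblock centre
            mvs[(sx, sy)] = (mv0[0] + ax * cx - ay * cy,
                             mv0[1] + ay * cx + ax * cy)
    return mvs

# Example: a 16x16 block whose corner motion vectors imply a slight rotation.
mvs = affine_subblock_mvs(mv0=(2.0, 0.0), mv1=(2.0, 1.0), block_w=16, block_h=16)
print(mvs[(0, 0)], mvs[(12, 12)])
```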

When the motion compensated block is close to the tile boundaries 800, the required integer samples in the reference frame may come from other tiles. In Figure 8, motion compensation of a block happens close to the top and left boundaries 800 of the current tile, so some of the samples (shown with reference number 801) are needed from other tiles.

A motion-constrained tile set (MCTS) is such that the inter prediction process is constrained in encoding such that no sample value outside the motion-constrained tile set, and no sample value at a fractional sample position that is derived using one or more sample values outside the motion-constrained tile set, is used for inter prediction of any sample within the motion-constrained tile set.
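The constraint can be illustrated with a sketch of a check that an encoder might perform before accepting a motion vector for a block inside a motion-constrained tile set; the helper is hypothetical and assumes a T-tap interpolation filter that needs T/2−1 extra integer samples to the left/top and T/2 to the right/bottom at fractional positions.

```python
import math

def mc_stays_within_tile_set(block_x, block_y, block_w, block_h, mv, tile_set, tap=8):
    """Check that all integer samples needed for (fractional) interpolation of the
    motion-compensated block lie inside the tile set.

    mv       : motion vector (mvx, mvy) in luma samples, possibly fractional
    tile_set : (x_min, y_min, x_max, y_max) of the tile set, inclusive
    """
    left_ext, right_ext = tap // 2 - 1, tap // 2
    frac_x, frac_y = mv[0] % 1 != 0, mv[1] % 1 != 0
    x0 = math.floor(block_x + mv[0]) - (left_ext if frac_x else 0)
    y0 = math.floor(block_y + mv[1]) - (left_ext if frac_y else 0)
    x1 = math.floor(block_x + block_w - 1 + mv[0]) + (right_ext if frac_x else 0)
    y1 = math.floor(block_y + block_h - 1 + mv[1]) + (right_ext if frac_y else 0)
    x_min, y_min, x_max, y_max = tile_set
    return x_min <= x0 and y_min <= y0 and x1 <= x_max and y1 <= y_max

# Example: tile set covering luma samples 64..127 in both directions.
print(mc_stays_within_tile_set(64, 64, 16, 16, (0.5, 0.0), (64, 64, 127, 127)))  # False
print(mc_stays_within_tile_set(64, 64, 16, 16, (4.0, 4.0), (64, 64, 127, 127)))  # True
```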

It is appreciated that sample locations used in inter prediction are saturated so that a location that would otherwise be outside the picture is saturated to point to the corresponding boundary sample of the picture. Hence, if a tile boundary is also a picture boundary, motion vectors may effectively cross that boundary or a motion vector may effectively cause fractional sample interpolation that would refer to a location outside that boundary, since the sample locations are saturated onto the boundary. In some applications, it may be desired not to let the motion vector cross the boundary of a tile which is located at the picture boundary.

The temporal motion-constrained tile sets SEI (supplemental enhancement information) message of HEVC can be used to indicate the presence of motion-constrained tile sets in the bitstream.

A recent trend in streaming in order to reduce the streaming bitrate of video (especially virtual reality (VR) content) is known as viewport dependent delivery and can be explained as follows: a subset of the video content (e.g. 360-degree) covering the primary viewport (i.e., the current view orientation) is transmitted at the best quality/resolution, while the remainder of the 360-degree video is transmitted at a lower quality/resolution. There are generally two approaches for viewport-adaptive streaming:

1) Viewport-specific encoding and streaming, which can be referred to as viewport-dependent encoding and streaming, or as asymmetric projection, or as packed VR (Virtual Reality) video. In the viewport-specific encoding and streaming, a 360-degree image content is packed into the same frame with an emphasis (e.g. greater spatial area) on the primary viewport. The packed VR frames are encoded into a single bitstream. For example, a front face of a cube map may be sampled with a higher resolution, compared to other cube faces and the cube faces may be mapped to the same packed VR frame, where the front cube face is sampled with twice the resolution compared to the other cube faces.

2) Tile-based encoding and streaming:

In the VR viewport video, 360-degree content is encoded and made available in a manner that enables selective streaming of viewports from different encodings. An approach of tile-based encoding and streaming, which may be referred to as tile rectangle-based encoding and streaming or sub-picture-based encoding and streaming, may be used with any video codec, even if tiles similar to those of HEVC were not available in the codec or even if motion-constrained tile sets or alike were not implemented at an encoder.

In another approach to realize the tile rectangle-based encoding, the source content is split into tile rectangle sequences (a.k.a. sub-picture sequences) before encoding. Each tile rectangle sequence covers a subset of the spatial area of the source content, such as full panorama content, which may e.g. be of equirectangular projection format. Each tile rectangle sequence is then encoded independently from each other as a single-layer bitstream. Several bitstreams may be encoded from the same tile rectangle sequence, e.g. for different bitrates. Each tile rectangle bitstream may be encapsulated in a file as its own track (or alike) and made available for streaming. At the receiver side, the tracks to be streamed may be selected based on the viewing orientation. The client may receive tracks covering the entire omnidirectional content. Better quality or higher resolution tracks may be received for the current viewport compared to the quality or resolution covering the remaining, currently non-visible viewports. In an example, each track may be decoded with a separate decoder instance.

In an example of the tile rectangle-based encoding and streaming, each cube face may be separately encoded and encapsulated in its own track (and Representation). More than one encoded bitstream for each cube face may be provided, e.g. each with different spatial resolution. Players can choose tracks (or Representations) to be decoded and played based on the current viewing orientation. High-resolution tracks (or Representations) may be selected for the cube faces used for rendering for the present viewing orientation, while the remaining cube faces may be obtained from their low-resolution tracks (or Representations).

In an approach of tile-based encoding and streaming, encoding is performed in a manner that the resulting bitstream comprises motion-constrained tile sets. Several bitstreams of the same source content are encoded using motion-constrained tile sets.

In an approach, one or more motion-constrained tile set sequences are extracted from a bitstream, and each extracted motion-constrained tile set sequence is stored as a tile set track (e.g. an HEVC tile track or a full-picture-compliant tile set track) in a file. A tile-based track (e.g. an HEVC tile base track or a full picture track comprising extractors to extract data from the tile set tracks) may be generated and stored in a file. The tile-based track represents the bitstream by implicitly collecting motion-constrained tile sets from the tile set tracks or by explicitly extracting (e.g. by HEVC extractors) motion-constrained tile sets from the tile set tracks. Tile set tracks and the tile-based track of each bitstream may be encapsulated in their own file, and the same track identifiers may be used in all files. At the receiver side the tile set tracks to be streamed may be selected based on the viewing orientation. The client may receive tile set tracks covering the entire omnidirectional content. Better quality or higher resolution tile set tracks may be received for the current viewport compared to the quality or resolution covering the remaining, currently non-visible viewports.

In an example, equirectangular panorama content is encoded using motion-constrained tile sets. More than one encoded bitstream may be provided, e.g. with different spatial resolution and/or picture quality. Each motion-constrained tile set is made available in its own track (and Representation). Players can choose tracks (or Representations) to be decoded and played based on the current viewing orientation. High-resolution or high-quality tracks (or Representations) may be selected for tile sets covering the present primary viewport, while the remaining area of the 360-degree content may be obtained from low-resolution or low-quality tracks (or Representations). It is also possible to combine the viewport-specific encoding and streaming (approach 1) and tile-based encoding and streaming (approach 2) discussed above.

It needs to be understood that tile-based encoding and streaming may be realized by splitting a source picture into tile rectangle sequences that are partly overlapping. Alternatively, or additionally, bitstreams with motion-constrained tile sets may be generated from the same source content with different tile grids or tile set grids. For example, if the 360-degree space is divided into a discrete set of viewports, the set of viewports being separated by a given distance (e.g., expressed in degrees), then the omnidirectional space can be imagined as a map of overlapping viewports, and the primary viewport can be switched discretely as the user changes his/her orientation while watching content with an HMD. When the overlap between viewports is reduced to zero, the viewports can be imagined as adjacent non-overlapping tiles within the 360-degree space.

As explained above, in the viewport-adaptive streaming the primary viewport (i.e., the current viewing orientation) is transmitted at the best quality/resolution, while the remainder of the 360-degree video is transmitted at a lower quality/resolution. When the viewing orientation changes, e.g. when the user turns his/her head when viewing the content with a head-mounted display, another version of the content needs to be streamed, matching the new viewing orientation. In general, the new version can be requested starting from a stream access point (SAP), which is typically aligned with (Sub)segments.

An example of merging of coded tile rectangle sequences is shown in Figure 9. In the solution of Figure 9, a source picture sequence 71 is split 72 into tile rectangle sequences 73 before encoding. Each tile rectangle sequence 73 is then encoded 74 independently. Two or more coded tile rectangle sequences 75 are merged 76 into a bitstream 77. The coded tile rectangle sequences 75 may have different characteristics, such as picture quality, so as to be used for viewport-dependent delivery. The coded tile rectangles 75 of a time instance are merged vertically into a coded picture of the bitstream 77. Each coded tile rectangle 75 in a coded picture forms a coded slice.

Vertical arrangement of the coded tile rectangles 75 into a coded picture brings at least the following benefits. For example, slices can be used as a unit to carry a coded tile rectangle and no tile support is needed in the codec, hence the approach is suitable e.g. for H.264/AVC. No transcoding is needed for the vertical arrangement, as opposed to horizontal arrangement where transcoding would be needed as coded tile rectangles would be interleaved in the raster scan order (i.e., the decoding order) of blocks (e.g. macroblocks in H.264/AVC or coding tree units in HEVC). In addition, motion vectors that require accessing sample locations horizontally outside the picture boundaries (in inter prediction) can be used in the encoding of tile rectangle sequences. Hence, the compression efficiency benefit that comes from allowing motion vectors over horizontal picture boundaries is maintained (unlike e.g. when using motion-constrained tile sets).

The merged bitstream 77 is full-picture compliant. For example, if tile rectangle sequences were coded with H.264/AVC, the merged bitstream is also compliant with H.264/AVC and can be decoded with a regular H.264/AVC decoder.

Another option to deliver the viewport-dependent omnidirectional video is to carry out SHVC region-of-interest scalability encoding. Therein, the base layer may be coded conventionally. Additionally, region-of-interest (ROI) enhancement layers may be encoded with the SHVC Scalable Main profile. For example, several layers can be coded for each tile position, each for a different bitrate or resolution. The ROI enhancement layers may be spatial or quality scalability layers. Several SHVC bitstreams can be encoded for significantly differing bitrates, since it can be assumed that bitrate adaptation can be handled to a great extent with enhancement layers only. This encoding approach is illustrated in Figure 10. In such an encoding approach the base layer is always received and decoded. Additionally, enhancement layers selected on the basis of the current viewing orientation are received and decoded. Stream access points (SAPs) for the enhancement layers are inter-layer predicted from the base layer and are hence more compact than similar SAPs realized with intra-coded pictures. Since the base layer is consistently received and decoded, the SAP interval for the base layer can be longer than that for the enhancement layers.

A further method is called constrained inter-layer prediction (CILP). An example of CILP is illustrated in Figure 11. The input picture sequence is encoded into two or more bitstreams, each representing the entire input picture sequence, i.e., the same input pictures, or a subset of the same input pictures potentially with a reduced picture rate, are encoded in the bitstreams. Certain input pictures are chosen to be encoded into two coded pictures in the same bitstream, the first referred to as a shared coded picture. A shared coded picture is either intra coded or uses only other shared coded pictures (or the respective reconstructed pictures) as prediction references. A shared coded picture in a first bitstream (of the encoded two or more bitstreams) is identical to the respective shared coded picture in a second bitstream (of the encoded two or more bitstreams), wherein "identical" may be defined to be an identical coded representation, potentially excluding certain high-level syntax structures, such as SEI messages, and/or an identical reconstructed picture. Any picture subsequent to a particular shared coded picture in decoding order is not predicted from any picture that precedes the particular shared coded picture and is not a shared coded picture.

A shared coded picture may be indicated to be a non-output picture. As a response to decoding a non-output picture indication, the decoder does not output the reconstructed shared coded picture.

The encoding method facilitates decoding a first bitstream up to a selected shared coded picture, exclusive, and decoding a second bitstream starting from the respective shared coded picture. No intra-coded picture is required to start the decoding of the second bitstream, and consequently compression efficiency is improved compared to a conventional approach.

The SHVC ROI approach significantly outperforms MCTS-based viewport-dependent delivery and enabling inter-layer prediction provides a significant compression gain compared to using no inter-layer prediction. However, the SHVC ROI approach has the disadvantage that inter-layer prediction is enabled only in codec extensions, such as the SHVC extension of HEVC. Such codec extensions might not be commonly supported in decoding, particularly when considering hardware decoder implementations.

CILP (Constrained Inter-Layer Prediction) enables the use of HEVC Main profile encoder and decoder, and hence has better compatibility with implementations than the SHVC ROI approach. Moreover, CILP takes advantage of relatively low intra picture frequency (similarly to the SHVC ROI approach). However, when compared to the SHVC ROI approach, CILP suffers from the use of MCTSs for the base-quality tiles. The streaming rate-distortion performance of CILP is close to that of SHVC-ROI in relatively coarse tile grids (up to 6x3). However, CILP has inferior streaming rate-distortion performance compared to SHVC-ROI when finer tile grids are used, presumably due to the use of MCTSs for the base quality.

SP-CILP aims at:

- Encoding with a single-layer encoder, such as an HEVC Main profile encoder. (Similarly to CILP.)

- Decoding with a single-layer decoder, such as an HEVC Main profile decoder. (Similarly to CILP.)

- Infrequent intra picture coding for the base quality. (Similarly to SHVC-ROI and CILP.)

- Avoiding the use of MCTSs for the base quality to improve the compression performance for the base quality coding. (Similarly to SHVC-ROI.)

- Achieving streaming rate-distortion compression gain similar to CILP and SHVC-ROI in coarse tile grids and close to SHVC-ROI for fine tile grids.

Encoding according to the invention is illustrated with Figure 12. A solid line 1210 indicates a picture boundary or such a tile boundary over which motion constraints identical or similar to those for MCTS apply. A dashed line 1220 indicates a tile boundary where motion constraints need not be applied.

The picture area comprises two parts:

- Constituent picture area - used to carry the base quality encoding

- Tile area - used to carry enhanced quality tiles

In order to enable prediction of enhanced quality tiles from the time-aligned base-quality constituent picture in a similar manner as in the SHVC ROI and CILP approaches, certain input pictures may be encoded as two coded pictures. In the first coded picture of these two coded pictures, the tile area may for example be blank (e.g. have a constant color). In the second coded picture of these two coded pictures, the tile area may be predicted from the base-quality constituent picture of the first coded picture. The constituent picture area of the second coded picture may be blank (e.g. constant color) or may be coded with reference to the first coded picture with zero motion and without prediction error (referred to as "skip coded" here).

In subsequent pictures, any conventional inter prediction hierarchy may be used. Motion constraints are applied so that the constituent picture area forms a MCTS, and the tile area comprises one or more MCTSs.

Several bitstreams are encoded, each with a different selection of enhanced quality tiles, but with the same base-quality constituent pictures. For example, when the 4x2 tile grid is used, and four tiles are selected to be coded at enhanced quality matching a viewing orientation, about 40 bitstreams may need to be coded for the different selections of enhanced quality tiles. The IRAP picture interval may be selected to be longer than the interval of coding an input picture as two coded pictures as described above. As an example, encoding of two bitstreams is illustrated in Figure 13 (b = blank tile, as described above; "B-slices" comprises at least one B or P slice and may additionally comprise any other slices, e.g. I-, P-, or B-slices). Coding an input picture as two coded pictures as described above forms a switching point that enables switching from one bitstream to another. Since the base-quality constituent picture is identical in the encoded bitstreams, the base-quality constituent picture at the switching point can be predicted from earlier picture(s). Continuing the example above, Figure 14 shows an example of switching from enhanced quality tiles 1, 2, 5, 6 to 3, 4, 7, 8 at the first non-IRAP switching point.

MCTSs (comprising enhanced quality tiles) may be encapsulated into a file as sub-picture tracks (e.g. 'hvc1' or 'hev1' tracks for HEVC). A sequence of the base-quality constituent pictures may be encapsulated into a file as a sub-picture track (e.g. an 'hvc1' or 'hev1' track for HEVC). One extractor track may be formed for each selection of enhanced quality tiles. The extractor track extracts the base-quality constituent pictures and the enhanced quality tiles from their respective tracks. Figure 15 illustrates a possible file arrangement and a respective arrangement of Representations for streaming. Reference number 2110 stands for the base-quality track/Representation, and reference number 2120 stands for the enhanced quality tile tracks/Representations, where there is one track/Representation for each pair of positions in the original picture and in the extractor track. Reference number 2130 is for the extractor tracks/Representations, where there is one track/Representation for each assignment of (a, b, c, d, ...).

Motion compensation (and IntraBlockCopy) for a fractional motion vector uses an n-tap interpolation filter, so it needs more samples from each side (left/right/top/bottom) of the signal in the reference frame. In the case of MCTS, this n-tap filter increases the chance that data from outside of the current tile (or tile set) is used in the motion compensation process. So, to guarantee the MCTS coding, at the encoder side these motion vectors should be avoided in motion estimation and merge candidate selection. As a result, either a suboptimal motion vector should be selected (which may increase residual values), or the block should be split into subblocks to have different motion vectors for different subblocks. Each of the above solutions results in degrading the RD performance. Using MCTS in HEVC, for example, imposes a 3.73% BD-rate loss when using a 4x2 tiling arrangement.
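The effect can be illustrated with a small Python sketch. The helper names, the quarter-sample motion vector convention and the 8-tap assumption below are illustrative choices, not part of any standard text; the sketch only shows which integer reference positions one predicted sample needs and whether that range crosses a tile boundary.

def reference_range(sample_x, mv_x_quarter, n_taps=8):
    # Integer reference positions needed to interpolate one predicted sample
    # for a motion vector given in quarter-sample units.
    int_mv = mv_x_quarter >> 2            # integer part of the MV
    frac = mv_x_quarter & 3               # fractional part (0..3)
    base = sample_x + int_mv
    if frac == 0:
        return base, base                 # integer MV: one reference sample suffices
    half = n_taps // 2
    return base - half + 1, base + half   # n_taps samples around the fractional position

def crosses_tile(sample_x, mv_x_quarter, tile_left, tile_right, n_taps=8):
    lo, hi = reference_range(sample_x, mv_x_quarter, n_taps)
    return lo < tile_left or hi > tile_right

# A sample at x=2 in a tile spanning columns 0..63, MV of -1.25 luma samples:
print(reference_range(2, -5))      # (-3, 4): positions -3..-1 lie in the other tile
print(crosses_tile(2, -5, 0, 63))  # True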

One way to solve this problem is to crop the video and extract each tile as a separate video and then code each video independently. In this case, each reference frame is padded (according to recent codecs like H.264/AVC or HEVC), and a block at the border of the frame can be predicted from data in the reference frame outside of the frame boundary. This improves the coding of the block, but the RD overhead of coding each tile as a separate video can be high, so in total the RD performance can be lower than MCTS coding of tiles. The other problem is that in this case, multiple encoders, media parsers and decoders are needed to decode the multiple tiles (cropped videos).

An invention discussed in this description relates to the above problem. The purpose of the present solution is to modify the motion compensation (and IntraBlockCopy) filter for a block for which the calculation of the corresponding predicted block needs samples of the reference frame outside of the current tile. Based on the location where the prediction block is generated, the MC filter is modified in a way that no sample value from other tiles is used to calculate the predicted samples. In a more detailed manner, according to the motion vector of the current block and the block location, the number of samples going outside of the tile is calculated. Then, based on the filter tap of the motion compensation filter, some of the filter coefficients may be changed to zero, and their original values may be added to other coefficients of that filter. This modification may be done for the horizontal and vertical directions separately, but similarly. For example, when doing the horizontal/vertical filtering, the number of samples in the reference frame that go outside of the current tile in the horizontal/vertical direction is considered. This modification guarantees that, for the prediction of this kind of blocks, no sample value is used from outside the current tile. The filter modification may be done at block level (as mentioned above) or may be done at row/column/sample level to achieve higher RD performance, but with a bit more complexity. Applying this modification may be enabled/disabled, for example by a high-level flag, for example for all tile boundaries in the whole video, all tile boundaries in each picture or reference frame, all tile boundaries in each reference frame for each picture, all tile boundaries for each tile/slice in the picture or in the reference frame, each tile/slice boundary of all tiles in the picture or reference frame, or each tile/slice boundary of each tile in the picture or reference frame, independently. For example, this flag may be enabled for the boundaries whose corresponding neighboring tile is subject to be changed. For the boundaries where there is no neighboring tile (e.g. the top/bottom/left/right boundary of a tile in the top/bottom/left/right row/column of the picture), this flag may be disabled. This filter modification may also be enabled or disabled for specific block sizes, by default or by signaling a high-level flag.
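A minimal Python sketch of the first step, counting how many filter taps would reference samples on the other side of a left (vertical) tile boundary for one predicted sample; the function name, the quarter-sample MV convention and the 8-tap filter are assumptions made for the example only.

def taps_outside_left(sample_x, mv_x_quarter, tile_left, n_taps=8):
    # Number of filter taps whose reference samples fall to the left of the tile
    # boundary when generating one predicted sample in the horizontal direction.
    int_mv = mv_x_quarter >> 2
    frac = mv_x_quarter & 3
    if frac == 0:
        return 0                           # integer MV: no interpolation in this sketch
    leftmost = sample_x + int_mv - (n_taps // 2 - 1)
    return min(n_taps, max(0, tile_left - leftmost))

# Samples deeper inside the tile need fewer, and eventually no, modified taps:
for x in range(5):
    print(x, taps_outside_left(x, 2, tile_left=0))   # prints 3, 2, 1, 0, 0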

The solution is based on conditional motion compensation filter modification and is targeted to a situation where at least one of the required samples is located in the other tile; examples of this situation are illustrated in Figure 16 and Figure 17 for the filtering in each dimension (i.e. horizontal or vertical filtering). In Figure 16 the squares (indicated by r_i) are samples of the reference frame, and the circles (indicated by p_i) are the predicted samples for a block with an associated motion vector. There are two tiles (Tile0, Tile1) in this example, and their common boundary is shown with a dashed line. Figure 16 shows the case where the motion vector is fractional. In the example of Figure 16, each predicted sample p_i is calculated using a motion compensation filter (an 8-tap filter in this figure, with filter coefficients from C_0 to C_7). In this example, the calculation of p_0 needs two samples (i.e. r_-2 and r_-1) from the other tile (i.e., Tile0).

The motion constraint goal is to avoid using any reference sample from other tiles. In the above example, r_-2 and r_-1 should not have an effect on the generation of the predicted sample p_0. To realize this, the main idea of this invention is eliminating the effect of these reference samples by changing the corresponding coefficients of the MC filter (i.e. C_0 and C_1 for the calculation of p_0) to zero. As a result, even though these reference samples r_-2, r_-1 are accessed during motion compensation, they will not have a contribution in the calculation of the predicted sample p_0.

The technique according to the present embodiments need not be restricted to the calculation of the predicted sample with fractional motion vectors. The calculation of a predicted sample with an integer motion vector may also be done using a filter, where the abovementioned filter modification can be applied as well. Such a situation may happen for example in pixel-wise affine motion compensation, where some of the samples may have integer motion vectors and some others may have fractional motion vectors.

When the corresponding coefficients (i.e. C_0 and C_1 in this example) are set to zero, their original values should be added to other coefficients to keep the sum of the coefficients of the new filter the same as that of the original one. There are several options to do this coefficient modification (a code sketch of the options is given after the list below).

• One simple way is to add the value of the changed coefficients to the coefficient of the first sample next to the samples of the changed coefficients (i.e., C'_2 = C_2 + (C_0 + C_1)). This is equivalent to assuming that the original values of the samples in the other tile are ignored, and the sample next to them inside the tile is used instead (i.e., r'_-2 = r'_-1 = r_0, where r'_i is the estimated value used instead of r_i).

• The general way to calculate the values of the new coefficients is to estimate the values of the reference samples in the other tile (r_-1 and r_-2 in the above example) from the reference samples inside the current tile, and then, based on that estimate, reflect the corresponding coefficients onto the other coefficients. For example, r_-1 and r_-2 can simply be estimated by extrapolating r_0 and r_1 using the equations below, and as a result C_2 and C_3 are changed as below:

o r'_-1 = a_0*r_0 + a_1*r_1, and

o r'_-2 = b_0*r_0 + b_1*r_1,

so that

o C'_2 = C_2 + a_0*C_1 + b_0*C_0, C'_3 = C_3 + a_1*C_1 + b_1*C_0, and C'_0 = C'_1 = 0.

• The reference samples in the other tile (r_-1 and r_-2 in the above example) can also be estimated using more samples from the reference samples inside the current tile. It should be noted that the samples in the other tile cannot go beyond the length of the motion compensation filter tap. This means that if all the samples go to the other tile, the coefficient modification described in this invention may not be applied. For the above example, r_-1 and r_-2 can maximally be estimated using r_0 to r_5 as below, and as a result the coefficients are changed as below:

o r'_-1 = a_0*r_0 + a_1*r_1 + ... + a_5*r_5,

o r'_-2 = b_0*r_0 + b_1*r_1 + ... + b_5*r_5, and

o C'_(i+2) = C_(i+2) + a_i*C_1 + b_i*C_0 for i = 0..5, with C'_0 = C'_1 = 0.
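Both options above can be written down compactly. The following Python sketch is only an illustration of the coefficient bookkeeping; the HEVC half-sample luma filter is used as the example coefficient set, and the extrapolation weights are arbitrary choices that each sum to one, which keeps the coefficient sum unchanged.

def fold_to_nearest(coeffs, k):
    # Simple option: zero the k outermost taps (samples in the other tile) and add
    # their sum to the first tap whose sample lies inside the tile.
    c = list(coeffs)
    if 0 < k < len(c):
        c[k] += sum(c[:k])
        for i in range(k):
            c[i] = 0
    return c

def fold_by_extrapolation(coeffs, weights):
    # General option: each unavailable sample is estimated as a weighted sum of
    # inside samples, and its coefficient is reflected onto the inside taps.
    k = len(weights)                      # number of taps outside the tile
    c = list(coeffs)
    for j, w in enumerate(weights):       # j = 0 is the outermost tap
        for i, wi in enumerate(w):
            c[k + i] += coeffs[j] * wi
        c[j] = 0
    return c

orig = [-1, 4, -11, 40, 40, -11, 4, -1]   # HEVC half-sample luma filter, sum = 64
print(fold_to_nearest(orig, 2))           # [0, 0, -8, 40, 40, -11, 4, -1], i.e. C'_2 = C_2 + (C_0 + C_1)
# Linear extrapolation: r'_-2 = 3*r_0 - 2*r_1, r'_-1 = 2*r_0 - r_1
print(fold_by_extrapolation(orig, [[3, -2], [2, -1]]))   # [0, 0, -6, 38, 40, -11, 4, -1]
# Both modified filters keep the coefficient sum at 64.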

The abovementioned modification may be needed in horizontal and/or vertical MC filtering. The modification of the filter, when applied in the horizontal/vertical direction, depends on the coordinate of the current block in the horizontal/vertical direction and on the horizontal/vertical component of the motion vector.

A first embodiment of the invention, shown also in Figure 16 and Figure 17, relates to a sample-based filter modification. As described in the previous section, the modification of the motion compensation (MC) filter can be done based on the location of the predicted sample with respect to the tile boundary (as well as the MV of that block). This means that for the samples that are close to the tile boundary (e.g. p_0) a higher number of motion compensation filter coefficients need to be modified. But as the predicted sample gets farther from the tile boundary (i.e., goes more inside the current tile), fewer samples are needed from the other tile, so fewer motion compensation coefficients need to be modified. For example, compared to p_0 in Figure 16, for the calculation of p_1 only one sample is needed from the other tile (i.e. r_-1). So only C_0 should be changed to zero, and its value should be reflected on the other coefficients. There are several options to do the change of coefficients.

• Option 1:

o r'_-1 = r_0,

o C'_1 = C_1 + C_0, and

o C'_0 = 0.

• Option 2:

o r'_-1 = a_0*r_0 + a_1*r_1,

o C'_1 = C_1 + a_0*C_0, C'_2 = C_2 + a_1*C_0, and

o C'_0 = 0.

In this example, the calculation of p_2 and p_3 does not need any sample value from the other tile, hence there is no need to change the MC filter.
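For the sample-based embodiment, the modification is evaluated once per predicted sample. Below is a hedged Python illustration, using the simple folding option and assuming the out-of-tile counts of the Figure 16/17 example (2 for p_0, 1 for p_1, 0 for p_2 and p_3); the names and coefficient set are illustrative only.

def modified_filter(coeffs, taps_outside):
    # One filter per predicted sample: zero the out-of-tile taps and fold their
    # sum into the first in-tile tap (the simple option described above).
    c = list(coeffs)
    if 0 < taps_outside < len(c):
        c[taps_outside] += sum(c[:taps_outside])
        for i in range(taps_outside):
            c[i] = 0
    return c

orig = [-1, 4, -11, 40, 40, -11, 4, -1]
for name, k in [("p0", 2), ("p1", 1), ("p2", 0), ("p3", 0)]:
    print(name, modified_filter(orig, k))
# p0 and p1 get individually modified filters; p2 and p3 keep the original one.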

A second embodiment relates to a block-based filter modification. For the sample-based implementation of this idea, it is needed to change the MC filter for each sample. Alternatively, the MC filter may be changed (independently for horizontal and vertical filtering) for the whole prediction block. For this, the MC filter may need to be modified according to the sample that needs the most modification in the MC filter (i.e., the sample that needs the largest number of samples from the other tile). In the above case, for example, the most modification is needed for p_0, so the coefficients for the whole block should be changed according to the modification needed for p_0. This will simplify the implementation of the MC function but will also reduce the RD performance gain.
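A sketch of the block-based variant under the same assumptions as above; one filter, derived from the worst-case sample, is shared by the whole block.

def block_filter(coeffs, taps_outside_per_sample):
    # One filter for the whole block, derived from the sample that needs the
    # largest number of out-of-tile taps (p0 in the example of Figure 16).
    worst = max(taps_outside_per_sample)
    c = list(coeffs)
    if 0 < worst < len(c):
        c[worst] += sum(c[:worst])
        for i in range(worst):
            c[i] = 0
    return c

orig = [-1, 4, -11, 40, 40, -11, 4, -1]
print(block_filter(orig, [2, 1, 0, 0]))   # [0, 0, -8, 40, 40, -11, 4, -1] used for all samples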

A third embodiment relates to a subblock (e.g. 4x4) based filter modification. In the block-based implementation of this idea, described above, the MC filter may be changed for the whole block. The prediction block size can be large (e.g. 8x8, 16x16, or even larger), and therefore, in order to improve the performance of the MC filter in terms of RD, the block can be split into smaller blocks using the tools defined in the video coding standard, such as a split tool. Signalling this split bit for a block needs some bits, which degrades RD performance. As another alternative, the MC for large blocks can be done by dividing a large block into small (e.g. 4x4) subblocks (by default, without signalling a split bit) and the filter modification may then be applied for each subblock independently. For example, for the subblocks that are close to the tile border, the MC filter can be modified based on the worst case in that subblock, while the other subblocks may not need changes in the MC filter. This approach limits the MC filter modification to only a limited number of subblocks, which results in improved RD performance.
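A sketch of the subblock variant; the 4x4 subblock size and the grid of per-sample out-of-tile counts are assumptions made for the example.

def modify_per_subblock(coeffs, taps_outside_grid, sub=4):
    # taps_outside_grid[y][x] is the number of out-of-tile taps for each sample
    # of a block; each sub x sub subblock gets one filter, chosen from its own
    # worst-case sample.
    h, w = len(taps_outside_grid), len(taps_outside_grid[0])
    filters = {}
    for by in range(0, h, sub):
        for bx in range(0, w, sub):
            worst = max(taps_outside_grid[y][x]
                        for y in range(by, min(by + sub, h))
                        for x in range(bx, min(bx + sub, w)))
            c = list(coeffs)
            if 0 < worst < len(c):
                c[worst] += sum(c[:worst])
                for i in range(worst):
                    c[i] = 0
            filters[(by, bx)] = c
    return filters

orig = [-1, 4, -11, 40, 40, -11, 4, -1]
# 8x8 block where only the leftmost columns need out-of-tile samples: only the
# left 4x4 subblocks get a modified filter, the right ones keep the original one.
grid = [[2, 1, 0, 0, 0, 0, 0, 0] for _ in range(8)]
f = modify_per_subblock(orig, grid)
print(f[(0, 0)])   # modified: [0, 0, -8, 40, 40, -11, 4, -1]
print(f[(0, 4)])   # unchanged original coefficients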

As mentioned before, the calculation of a prediction block with a fractional motion vector needs some extra intermediate samples to be calculated above and below the current block. With the proposed modifications in the MC filter, some of those intermediate samples which fall in the other tiles are not used. This means that those samples from the other tiles no longer need to be calculated. This will slightly reduce the complexity of the motion compensation function for those blocks.

Figure 18 is a flowchart illustrating a method according to an embodiment. A method comprises the following (a simplified code sketch of these steps is given after the list):

- obtaining a reference frame being partitioned into at least one tile 1810;

- determining at least one motion vector for at least one sample in a block in a current frame, the motion vector having a motion vector component for each direction of the block 1820;

- for each direction of the block:

o determining a motion compensation filter for a direction, wherein the motion compensation filter corresponds to a motion vector component in said direction, said motion compensation filter having filter coefficients and being used for determining a predicted sample corresponding to a sample in the block 1830;

o determining a tile in the reference frame corresponding to the sample in the block 1840;

o determining samples in the reference frame needed to be used with the motion compensation filter to generate the predicted sample 1850;

o determining samples going outside the tile in said direction 1860; and

o modifying 1870 the motion compensation filter by

• adding original values of filter coefficients of the samples outside the tile to filter coefficients of the samples within the tile; and

• changing original values of the filter coefficients of the samples outside the tile to a change value equal to zero.
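Taking the flowchart steps together, the following simplified, horizontal-only Python sketch illustrates the per-sample flow; the quarter-sample MV convention, the 8-tap HEVC half-sample coefficients, the helper names and the handling of only the left tile boundary are assumptions made for brevity, not limitations of the method.

def predict_sample(ref_row, x, mv_x_quarter, tile_left, tile_right, coeffs):
    # Steps 1830-1870 for one sample and the horizontal direction only: choose the
    # filter for the MV fraction, find the needed reference samples, detect those
    # left of the tile boundary and modify the filter accordingly.
    int_mv = mv_x_quarter >> 2
    frac = mv_x_quarter & 3
    if frac == 0:
        return ref_row[min(max(x + int_mv, tile_left), tile_right)]  # integer MV: direct copy
    n = len(coeffs)
    start = x + int_mv - (n // 2 - 1)            # leftmost needed reference position (1850)
    outside = max(0, tile_left - start)          # samples going outside the tile (1860)
    c = list(coeffs)
    if 0 < outside < n:                          # filter modification (1870)
        c[outside] += sum(c[:outside])
        for i in range(outside):
            c[i] = 0
    acc = 0
    for i in range(n):
        pos = min(max(start + i, tile_left), tile_right)  # clamp only to keep the demo in
        acc += c[i] * ref_row[pos]                        # range; zeroed taps add nothing
    return (acc + 32) >> 6                       # normalization for coefficients summing to 64

ref = [10, 20, 30, 40, 50, 60, 70, 80]           # one row of a tile covering columns 0..7
print(predict_sample(ref, 0, -2, 0, 7, [-1, 4, -11, 40, 40, -11, 4, -1]))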

It is to be noticed that in the above flowchart, the block that is going to be predicted is located in the current frame, i.e. the frame that is encoded/decoded at the moment. The actual prediction is made from the reference frame, as disclosed in the flowchart.

In addition, the tiling in the current frame (i.e. the frame being encoded/decoded) and in the reference frame are independent. The tiling can be the same in both frames, especially in tile-based viewport-adaptive streaming, or the tiling can be different, for example in parallel or partial decoding use cases. In the latter case, the reference frame may have tiling, and the current frame may have different tiling or no tiling, or vice versa.

In addition, the motion vector can be determined in a different way for each sample of the block. For example, the motion vector can be the same for all samples of a block. Alternatively, each sample of a block can have its own motion vector, especially in affine or elastic motion compensation. In this case, several motion information parameters are signaled for a block, and the motion vector for each sample is calculated based on its location in the block and the motion information of the block. Alternatively, each subblock (e.g. 4x4) of a block may have its own motion vector. For example, in block-based affine motion compensation, several motion vector parameters are signaled for a block, and the MVs for the subblocks (e.g. 4x4) are calculated based on the location of the subblock and the block MVs.
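As an illustration of the block-based affine case, the following sketch derives one MV per 4x4 subblock from two control-point MVs using a four-parameter affine model; this particular model and the parameter names are just one common example, not something mandated by the embodiments.

def affine_subblock_mvs(mv_top_left, mv_top_right, block_w, block_h, sub=4):
    # Derive a motion vector for the centre of each sub x sub subblock from the
    # control-point MVs at the top-left and top-right corners (4-parameter model).
    (v0x, v0y), (v1x, v1y) = mv_top_left, mv_top_right
    ax = (v1x - v0x) / block_w            # horizontal gradient of the MV field
    ay = (v1y - v0y) / block_w
    mvs = {}
    for y in range(0, block_h, sub):
        for x in range(0, block_w, sub):
            cx, cy = x + sub / 2, y + sub / 2
            mvs[(y, x)] = (v0x + ax * cx - ay * cy,
                           v0y + ay * cx + ax * cy)
    return mvs

# 16x8 block, top-left MV (1, 0), top-right MV (2, 0.5): the MVs vary across the
# subblocks, so the out-of-tile check and filter modification can also differ per subblock.
for pos, mv in sorted(affine_subblock_mvs((1, 0), (2, 0.5), 16, 8).items()):
    print(pos, mv)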

The filter can be modified by changing the filter coefficients. This can be done either pixel-wise, column- or row-wise, block-based, or subblock-wise. In the pixel-wise case, the coefficients for each sample in the block can be modified independently. In the column/row-wise case, the coefficients can be modified in the same way for all samples in one or more direction-based rows. A direction-based row means a horizontal row for the horizontal direction, and a vertical row (i.e. a column) for the vertical direction. In the block-based case, the coefficients of the filter can be modified in the same way for all samples in the block. In the subblock-wise case, the coefficients can be modified in the same way for each subblock (e.g. 4x4) of a block independently.

An apparatus according to an embodiment comprises means for obtaining a reference frame being partitioned into at least one tile; means for determining at least one motion vector for at least one sample in a block in a current frame, the motion vector having a motion vector component for each direction of the block; and means for implementing the following for each direction of the block:

o determining a motion compensation filter for a direction, wherein the motion compensation filter corresponds to a motion vector component in said direction, said motion compensation filter having filter coefficients and being used for determining a predicted sample corresponding to a sample in the block;

o determining a tile in the reference frame corresponding to the sample in the block;

o determining samples in the reference frame needed to be used with the motion compensation filter to generate the predicted sample;

o determining samples going outside the tile in said direction; and

o modifying the motion compensation filter by

• adding original values of filter coefficients of the samples outside the tile to filter coefficients of the samples within the tile; and

• changing original values of the filter coefficients of the samples outside the tile to a change value equal to zero.

An apparatus according to an embodiment is illustrated in Figure 19. An apparatus of this embodiment is a camera having multiple lenses and imaging sensors, but also other types of cameras may be used to capture wide view images and/or wide view video.

The terms wide view image and wide view video mean an image and a video, respectively, which comprise visual information having a relatively large viewing angle, larger than 100 degrees. Hence, a so-called 360 panorama image/video as well as images/videos captured by using a fish eye lens may also be called a wide view image/video in this specification. More generally, the wide view image/video may mean an image/video in which some kind of projection distortion may occur when a direction of view changes between successive images or frames of the video, so that a transform may be needed to find out co-located samples from a reference image or a reference frame. This will be described in more detail later in this specification.

The camera 2500 of Figure 19 comprises two or more camera units 2501 and is capable of capturing wide view images and/or wide view video. Each camera unit 2501 is located at a different location in the multi-camera system and may have a different orientation with respect to the other camera units 2501. As an example, the camera units 2501 may have an omnidirectional constellation so that the camera has a 360-degree viewing angle in 3D space. In other words, such a camera 2500 may be able to see each direction of a scene so that each spot of the scene around the camera 2500 can be viewed by at least one camera unit 2501.

The camera 2500 of Figure 19 may also comprise a processor 2504 for controlling the operations of the camera 2500. There may also be a memory 2506 for storing data and computer code to be executed by the processor 2504, and a transceiver 2508 for communicating with, for example, a communication network and/or other devices in a wireless and/or wired manner. The camera 2500 may further comprise a user interface (UI) 2510 for displaying information to the user, for generating audible signals and/or for receiving user input. However, the camera 2500 need not comprise each feature mentioned above or may comprise other features as well. For example, there may be electric and/or mechanical elements for adjusting and/or controlling optics of the camera units 2501 (not shown).

Figure 19 also illustrates some operational elements which may be implemented, for example, as computer code in the software of the processor, in hardware, or both. A focus control element 2514 may perform operations related to adjustment of the optical system of a camera unit or units to obtain focus meeting target specifications or some other predetermined criteria. An optics adjustment element 2516 may perform movements of the optical system or one or more parts of it according to instructions provided by the focus control element 2514. It should be noted here that the actual adjustment of the optical system need not be performed by the apparatus, but it may be performed manually, wherein the focus control element 2514 may provide information for the user interface 2510 to indicate to a user of the device how to adjust the optical system.

The various embodiments may provide advantages. For example, the block-based implementation of the present embodiments brings about a 1.05% bitrate reduction when a 4x2 tiling grid is used (a 4x2 tiling grid with MCTS imposes a 3.73% bitrate loss, so this method can compensate for about 28% of this bitrate loss). Sequence-wise results are presented below in the table, where the first column is the bitrate loss when there is 4x2 tiling coded with MCTS, with respect to coding the video without tiling (and no MCTS), and the second column is the bitrate gain the proposed MC filter brings with respect to the existing MC filter, when used in 4x2 tiling with MCTS. In an MCTS encoding process, the present embodiments impose fewer constraints on motion vector selection. The present embodiments improve the RD (rate-distortion) performance of MCTS coded video, which is useful in different applications like parallel and partial encoding/decoding as well as streaming scenarios like tile-based, ROI-SHVC, CILP, and SP-CILP.

It is also expected that the encoding and decoding times are reduced, since there is no need to split the block at the tile border into smaller subblocks. The changes are small and quite local. The present embodiments need only an easy preprocessing of the MC filter coefficients. This change is applied to a certain number of blocks which are at the tile border, so there is no significant extra computational complexity. There is no need to add a low level (CU level) syntax element, so no change is needed in parsing either.

The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.