

Title:
VIDEO CODEC SUPPORTING QUALITY-SCALABILITY
Document Type and Number:
WIPO Patent Application WO/2007/042063
Kind Code:
A1
Abstract:
A basic idea of the present application is that refinement layers showing an improved rate/distortion performance may be achieved by accompanying residual information refinement information with refined motion information. In particular, it has been recognized that for small maximum rates, coarse motion information involving fewer bits for representing the motion information in the bit-stream leads to good rate/distortion performance. However, with increasing maximum rates, merely refining the encoding of the texture information relative to the coarse motion information does not lead to optimal rate/distortion performance for representing the respective refinement layer. Rather, more bits should be spent on the motion information in order to refine the motion information, thereby achieving a better rate/distortion performance for higher maximum bit-rates. By this measure, a significantly improved coding efficiency is achieved.

Inventors:
SCHWARZ HEIKO (DE)
WIEGAND THOMAS (DE)
MARPE DETLEV (DE)
WINKEN MARTIN (DE)
Application Number:
PCT/EP2005/010972
Publication Date:
April 19, 2007
Filing Date:
October 12, 2005
Assignee:
FRAUNHOFER GES FORSCHUNG (DE)
SCHWARZ HEIKO (DE)
WIEGAND THOMAS (DE)
MARPE DETLEV (DE)
WINKEN MARTIN (DE)
International Classes:
H04N7/26
Foreign References:
US 6510177 B1 (2003-01-21)
Other References:
J. REICHEL, H. SCHWARZ, M. WIEN: "Joint Scalable Video Model JSVM-3", Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q6), 16th Meeting: Poznan, Poland, July 2005, no. JVT-P202, 29 July 2005, pages 1-34, XP002384686, retrieved from the Internet [retrieved on 2006-06-06]
SCHWARZ H ET AL: "SVC Core Experiment 2.1: Inter-layer prediction of motion and residual data", ISO/IEC JTC1/SC29/WG11 M11043, no. M11043, 23 July 2004, pages 1-6, XP002360488
LI Z G ET AL: "A Novel SNR Refinement Scheme for Scalable Video Coding", Image Processing, 2005. ICIP 2005. IEEE International Conference on, Genova, Italy, 11-14 Sept. 2005, Piscataway, NJ, USA, IEEE, 11 September 2005, pages 644-647, XP010851473, ISBN: 0-7803-9134-9
SCHWARZ H ET AL: "Scalable Extension of H.264/AVC", ISO/IEC JTC1/SC29/WG11 MPEG04/M10569/S03, March 2004, pages 1-39, XP002340402
SCHWARZ H ET AL: "Constrained Inter-Layer Prediction for Single-Loop Decoding in Spatial Scalability", Image Processing, 2005. ICIP 2005. IEEE International Conference on, Genova, Italy, 11-14 Sept. 2005, Piscataway, NJ, USA, IEEE, 11 September 2005, pages 870-873, XP010851192, ISBN: 0-7803-9134-9
LANGE R ET AL: "Simple AVC-based codecs with spatial scalability", Image Processing, 2004. ICIP '04. 2004 International Conference on, Singapore, 24-27 Oct. 2004, Piscataway, NJ, USA, IEEE, 24 October 2004, pages 2299-2302, XP010786245, ISBN: 0-7803-8554-3
Attorney, Agent or Firm:
SCHOPPE, Fritz et al. (Zimmermann Stöckeler & Zinkle, Postfach 246 Pullach bei München, DE)
Claims:
Claims

1. Video encoder comprising

prediction means (110a, 110b) for performing a motion- compensated prediction on a predetermined picture (104a, 108b) of a video input signal (108, 104) to obtain motion information (116a, 116b) of a first quality level and residual information (118a, 118b) corresponding thereto;

encoding means (112a, 112b) for encoding the residual information (118a, 118b) corresponding to the motion information (116a, 116b) of the first quality level with a first precision to obtain encoded residual information (120a, 120b) for the first quality level;

computing means (114a, 114b) for computing residual information refinement information for refining or replacing the encoded residual information (120a, 120b) of the first quality level with a second precision higher than the first precision to obtain refined encoded residual information, and motion information refinement information for refining or replacing the motion information of the first quality level to obtain refined motion information, such that the refined encoded residual information depends on the refined motion information, wherein the motion information refinement information along with the residual information refinement information forms a second quality level; and

constructing means (124) for constructing an encoded quality-scalable video data stream (126) encoding the video input signal by use of the motion information of the first quality level, the encoded residual information of the first quality level, the residual information refinement information, and the motion information refinement information.

2. Video encoder according to claim 1, wherein the computing means (114a, 114b) is adapted to compute the motion information refinement information and the residual information refinement information such that a predetermined distortion measure of the predetermined picture as derived by combining the refined motion information and the refined encoded residual information or derived by combining the refined motion information, the refined encoded residual information and the residual information of the first quality level, is lower than the predetermined distortion measure of the predetermined picture as derived by combining the motion information of the first quality level with the residual information of the first quality level being refined to the second quality level, at the same rates for encoding the refined motion information and the refined encoded residual information, on the one hand, and refinement information necessary for refining the residual information corresponding with the motion information of the first quality level to the second quality level, on the other hand.

3. Video encoder according to claim 1 or 2, wherein the computing means (114a, 114b) is designed such that the motion information refinement information represents a refinement of a partitioning of the motion information (m_i) of the first quality level with a definition of sub-motion vectors (406a-d) of sub-partitions (408a-d) resulting from the refinement of the partitioning.

4. Video encoder according to one of claims 1 to 3, wherein the computing means (114a, 114b) is designed such that the motion information refinement information represents a residual to the motion information of the first quality level as a predictor.

5. Video encoder according to one of claims 1 to 3, wherein the computing means (114a, 114b) is designed such that the refined motion information is derivable from the motion information refinement information without use of the motion information of the first quality level.

6. Video encoder according to one of the preceding claims, wherein the computing means (114a, 114b) is designed to decide, on a portion-by-portion basis, as to whether, for a predetermined portion of the predetermined picture, the motion information refinement information is computed along with the residual information refinement information or as to whether the residual information refinement information shall alternatively refine the encoded residual information of the first quality level to the second quality level relative to a predictor derivable from the motion information of the first quality level, and to insert a flag indicating the decision into the encoded scalable video data stream.

7. Video encoder according to one of the preceding claims, wherein the computing means (114a, 114b) is designed such that the motion information refinement information (m_(i+1)) represents an increase (410) in the number of motion hypotheses (400, 410) relative to the motion information (m_i) of the first quality level.

8. Video encoder according to one of the preceding claims, wherein the video input signal is a spatially decimated version (108) of a higher resolution video signal (104).

9. Video encoder according to one of the preceding claims, wherein the prediction means (110a) is designed to perform the motion-compensated prediction on the predetermined picture by pre-predicting the predetermined picture from a predictor derived from a spatially decimated version (108) of the video input signal (104) and then performing a motion-compensated prediction on a residual to the pre-prediction.

10. Video encoder according to one of the preceding claims, wherein the prediction means is designed for performing a motion-compensated filtering on the video input signal.

11. Video decoder for decoding an encoded quality-scalable video data stream (126) encoding a video signal, the encoded quality-scalable video data stream (126) comprising motion information of a first quality level, corresponding encoded residual information of the first quality level, residual information refinement information, and motion information refinement information, the motion information of the first quality level and the corresponding encoded residual information of the first quality level being the result of a performance of a motion-compensated prediction on a predetermined picture (104a, 108b) of the video input signal (108, 104) and an encoding of residual information (118a, 118b) corresponding to the motion information (116a, 116b) of the first quality level with a first precision, the decoder comprising:

parsing means (500) for parsing the encoded quality-scalable video data stream (126) to realize the motion information of the first quality level and the corresponding encoded residual information of the first quality level;

refinement means (506, 508) for refining or replacing (508) the encoded residual information (120a, 120b) of the first quality level from the first to a second precision higher than the first precision with the residual information refinement information to obtain refined encoded residual information, and refining or replacing (506) the motion information of the first quality level with the motion information refinement information to obtain refined motion information; and

combining means (504) for predicting the predetermined picture by means of the refined motion information to obtain predictive information and combining the predictive information with the refined encoded residual information to obtain the predetermined picture in a second quality level.

12. Video decoder according to claim 11, wherein the refinement means is designed to use the motion information refinement information as a refinement of a partitioning of the motion information (m_i) of the first quality level with a definition of sub-motion vectors (406a-d) of sub-partitions (408a-d) resulting from the refinement of the partitioning.

13. Video decoder according to claim 11 or 12, wherein the refinement means is designed to use the motion information refinement information as a residual to the motion information of the first quality level as a predictor.

14. Video decoder according to one of claims 11 to 13, wherein the refinement means is designed to derive the refined motion information from the motion information refinement information without use of the motion information of the first quality level.

15. Video decoder according to one of claims 11 to 14, wherein the encoded scalable video data stream contains a flag, wherein the refinement means is designed to skip the refinement or replacement of the motion information of the first quality level for a predetermined portion of the predetermined picture depending on the flag, wherein the combining means is designed to, in this case, use the motion information of the first quality level instead of the refined motion information to obtain the predictive information.

16. Video decoder according to one of claims 11 to 15, wherein the refinement means is designed to use the motion information refinement information (m_(i+1)) for increasing (410) the number of motion hypotheses (400, 410) relative to the motion information (m_i) of the first quality level.

17. Video decoder according to one of claims 11 to 16, wherein the obtained predetermined picture of the second quality level is a spatially decimated version of a corresponding picture of a higher resolution video signal.

18. Video decoder according to one of the claims 11 to 17, wherein the combining means (110a) is designed to combine the predictive information additionally with a predictor resulting from a pre-prediction of the predetermined picture from a reconstruction of a spatially decimated version of the video signal.

19. Video decoder according to one of the claims 11 to 18, further comprising checking means (502) for checking as to whether the residual information refinement information and the motion information refinement information is present in the encoded quality-scalable video data stream or as to whether a further refinement of the first quality level is desired, wherein the refinement means is designed to be active or inactive depending on the result of the check, and the combining means is designed to, in this case, use the motion information of the first quality level instead of the refined motion information to obtain the predictive information.

20. Video encoding method comprising the following steps:

performing a motion-compensated prediction on a predetermined picture (104a, 108b) of a video input signal (108, 104) to obtain motion information (116a, 116b) of a first quality level and residual information (118a, 118b) corresponding thereto;

encoding the residual information (118a, 118b) corresponding to the motion information (116a, 116b) of the first quality level with a first precision to obtain encoded residual information (120a, 120b) for the first quality level;

computing residual information refinement information for refining or replacing the encoded residual information (120a, 120b) of the first quality level with a second precision higher than the first precision to obtain refined encoded residual information, and motion information refinement information for refining or replacing the motion information of the first quality level to obtain refined motion information, such that the refined encoded residual information depends on the refined motion information, wherein the motion information refinement information along with the residual information refinement information forms a second quality level; and

constructing an encoded quality-scalable video data stream (126) encoding the video input signal by use of the motion information of the first quality level, the encoded residual information of the first quality level, the residual information refinement information, and the motion information refinement information.

21. Video decoding method for decoding an encoded quality-scalable video data stream (126) encoding a video signal, the encoded quality-scalable video data stream (126) comprising motion information of a first quality level, corresponding encoded residual information of the first quality level, residual information refinement information, and motion information refinement information, the motion information of the first quality level and the corresponding encoded residual information of the first quality level being the result of a performance of a motion-compensated prediction on a predetermined picture (104a, 108b) of the video input signal (108, 104) and an encoding of residual information (118a, 118b) corresponding to the motion information (116a, 116b) of the first quality level with a first precision, the method comprising the following steps:

parsing the encoded quality-scalable video data stream (126) to realize the motion information of the first quality level and the corresponding encoded residual information of the first quality level;

checking as to whether the residual information refinement information and the motion information refinement information is present in the encoded quality-scalable video data stream (126);

refining or replacing (508) the encoded residual information (120a, 120b) of the first quality level from the first to a second precision higher than the first precision with the residual information refinement information to obtain refined encoded residual information, and refining or replacing (506) the motion information of the first quality level with the motion information refinement information to obtain refined motion information; and

predicting the predetermined picture by means of the refined motion information to obtain predictive information and combining the predictive information with the refined encoded residual information to obtain the predetermined picture in a second quality level.

22. Encoded quality-scalable video data stream encoding a video input signal, comprising

motion information (116a, 116b) of a first quality level and encoded residual information (118a, 118b) for the first quality level corresponding thereto being the result of a performance of a motion-compensated prediction on a predetermined picture (104a, 108b) of the video input signal (108, 104) and an encoding of residual information (118a, 118b) corresponding to the motion information (116a, 116b) of the first quality level with a first precision;

residual information refinement information for refining or replacing the encoded residual information (120a, 120b) of the first quality level with a second precision higher than the first precision to enable obtainment of refined encoded residual information; and

motion information refinement information for refining or replacing the motion information of the first quality level to enable obtainment of refined motion information, such that predicting the predetermined picture by means of the refined motion information to obtain a predictive picture and combining the predictive picture with the refined encoded residual information yields the predetermined picture in a second quality level.


23. Computer program having a program code for performing, when running on a computer, a method according to claim 20 or 21.

Description:

Video Codec Supporting Quality-Scalability

Description

The present invention relates to a video codec supporting quality- or SNR-scalability.

A current project of the Joint Video Team (JVT) of the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG) is the development of a scalable extension of the state-of-the-art video coding standard H.264/MPEG4-AVC defined in ITU-T Rec. H.264 & ISO/IEC 14496-10 AVC, "Advanced Video Coding for Generic Audiovisual Services," version 3, 2005. The current working draft as described in J. Reichel, H. Schwarz, and M. Wien, eds., "Scalable Video Coding - Working Draft 3," Joint Video Team, Doc. JVT-P201, Poznan, Poland, July 2005, and J. Reichel, H. Schwarz, and M. Wien, eds., "Joint Scalable Video Model JSVM-3," Joint Video Team, Doc. JVT-P202, Poznan, Poland, July 2005, supports temporal, spatial, and SNR scalable coding of video sequences or any combination thereof.

H.264/MPEG4-AVC, as described in ITU-T Rec. H.264 & ISO/IEC 14496-10 AVC, "Advanced Video Coding for Generic Audiovisual Services," version 3, 2005, specifies a hybrid video codec in which macroblock prediction signals are either generated by motion-compensated prediction or by intra-prediction, and both predictions are followed by residual coding. H.264/MPEG4-AVC coding without the scalability extension is referred to as single-layer H.264/MPEG4-AVC coding. Rate-distortion performance comparable to single-layer H.264/MPEG4-AVC means that the same visual reproduction quality is typically achieved at no more than a 10% increase in bit-rate. Given the above, scalability is considered as a functionality for removal of parts of the bit-stream while achieving an R-D performance at any supported spatial, temporal, or SNR resolution that is comparable to single-layer H.264/MPEG4-AVC coding at that particular resolution.

The basic design of the scalable video coding (SVC) can be classified as a layered video codec. In each layer, the basic concepts of motion-compensated prediction and intra prediction are employed as in H.264/MPEG4-AVC. However, additional inter-layer prediction mechanisms have been integrated in order to exploit the redundancy between several spatial or SNR layers. SNR scalability is basically achieved by residual quantization, while for spatial scalability, a combination of motion-compensated prediction and oversampled pyramid decomposition is employed. The temporal scalability approach of H.264/MPEG4-AVC is maintained.

In general, the coder structure depends on the scalability space that is required by an application. For illustration, Fig. 12 shows a typical coder structure 900 with two spatial layers 902a, 902b. In each layer, an independent hierarchical motion-compensated prediction structure 904a, b with layer-specific motion parameters 906a, b is employed. The redundancy between consecutive layers 902a, b is exploited by inter-layer prediction concepts 908 that include prediction mechanisms for motion parameters 906a, b as well as texture data 910a, b. A base representation 912a, b of the input pictures 914a, b of each layer 902a, b is obtained by transform coding 916a, b similar to that of H.264/MPEG4-AVC; the corresponding NAL units (NAL - Network Abstraction Layer) contain motion information and texture data; the NAL units of the base representation of the lowest layer, i.e. 912a, are compatible with single-layer H.264/MPEG4-AVC. The reconstruction quality of the base representations can be improved by an additional coding 918a, b of so-called progressive refinement slices; the corresponding NAL units can be arbitrarily truncated in order to support fine granular quality scalability (FGS) or flexible bit-rate adaptation.

The resulting bit-streams output by the base layer coding 916a, b and the progressive SNR refinement texture coding 918a, b of the respective layers 902a, b are multiplexed by a multiplexer 920 in order to result in the scalable bit-stream 922. This bit-stream 922 is scalable in time, space and SNR quality.

One disadvantage of the above-described scalable extension of the video coding standard H.264/MPEG4-AVC is that the distortion/rate performance of the refinement layers defined by the base layer bit-stream 912a or 912b plus the respective refinement layer bit-streams output by blocks 918a, b up to a specific respective refinement layer is not optimal for all rate/distortion operating points.

Thus, it is an object of the present application to provide a video codec supporting quality scalability, enabling encoded quality-scalable video data streams having an improved coding efficiency.

This object is achieved by methods according to claim 20 or 21, a video encoder according to claim 1, a video decoder according to claim 19, and an encoded quality-scalable video data stream according to claim 22.

The basic idea underlying the present invention is that refinement layers showing an improved rate/distortion performance may be achieved by accompanying residual information refinement information with refined motion information. In particular, it has been recognized that for small maximum rates, coarse motion information involving fewer bits for representing the motion information in the bit-stream leads to good rate/distortion performance. However, with increasing maximum rates, merely refining the encoding of the texture information relative to the coarse motion information does not lead to optimal rate-distortion performance for representing the respective refinement layer. Rather, more bits should be spent on the motion information in order to refine the motion information, thereby achieving a better rate/distortion performance for higher maximum bit-rates. By this measure, a significantly improved coding efficiency is achieved.

In the following, preferred embodiments of the present application are described with reference to the Figs. In particular, it is shown in

Fig. 1 a block diagram of a video encoder according to an embodiment of the present invention;

Fig. 2 a schematic illustrating the hierarchical prediction structure that may be used in the encoder of Fig. 1;

Fig. 3 a schematic illustrating the function of the key pictures as re-synchronization points between encoder and decoder;

Fig. 4 a graph illustrating an example for the coding efficiency of SNR scalable coding strategies;

Fig. 5 a graph showing a comparison of the FGS concept of Fig. 1 with the FGS concept of Fig. 12;

Fig. 6 a graph showing a comparison of the FGS concept of Fig. 1 with the FGS concept of Fig. 12 for another scene;

Fig. 7 a schematic illustrating the SNR refinement coding according to Fig. 12;

Fig. 8 a schematic illustrating the refinement coding used in Fig. 1 according to an embodiment of the present application;

Fig. 9 a schematic illustrating another embodiment for the refinement coding used in Fig. 1;

Fig. 10a and 10b schematics showing embodiments for possible refinements of motion information;

Fig. 11 a flow chart illustrating the most pertinent steps performed in a decoder for decoding the bit-stream of the encoder of Fig. 1 according to an embodiment of the present application; and

Fig. 12 a conventional coder structure for scalable video coding.

Fig. 1 shows an embodiment of a video encoder of the present application. In particular, the video encoder of Fig. 1 supports two spatial layers. To this end, the encoder of Fig. 1, which is generally indicated by 100, comprises two layer portions or layers 102a and 102b, among which layer 102b is dedicated for generating that part of the desired scalable bit-stream concerning a coarser spatial resolution, while the other layer 102a is dedicated for supplementing the bit-stream output by layer 102b with information concerning a higher resolution representation of an input video signal 104. Therefore, the video signal 104 to be encoded by encoder 100 is input directly into layer 102a, whereas encoder 100 comprises a spatial decimator 106 for spatially decimating the video signal 104 before inputting the resulting spatially decimated video signal 108 into layer 102b.

The decimation performed in spatial decimator 106 comprises, for example, decimating the number of pixels of each picture 104a of the original video signal 104 by a factor of 4 by means of discarding every second pixel in column and row direction.
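For illustration, this factor-4 decimation can be sketched in a few lines of Python; the function name and test data below are illustrative assumptions, and a practical decimator would typically low-pass filter before discarding pixels:

```python
import numpy as np

def decimate_picture(picture: np.ndarray) -> np.ndarray:
    """Decimate a picture by a factor of 4 in pixel count by keeping
    every second pixel in row and column direction, as described for
    the spatial decimator 106 above (no anti-alias filtering here)."""
    return picture[::2, ::2]

# Example: a 16x16 block becomes an 8x8 block (4x fewer pixels).
block = np.arange(256, dtype=np.uint8).reshape(16, 16)
assert decimate_picture(block).shape == (8, 8)
```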

The low-resolution layer 102b comprises a motion-compensated prediction block 110b, a base layer coding block 112b and a refinement coding block 114b. The prediction block 110b performs a motion-compensated prediction on pictures 108a of the decimated video signal 108 in order to predict pictures 108a of the decimated video signal 108 from other reference pictures 108a of the decimated video signal 108. For example, the prediction block 110b generates, for a specific picture 108a, motion information that indicates as to how this picture may be predicted from other pictures of the video signal 108, i.e., from reference pictures. In particular, to this end, the motion information may comprise pairs of motion vectors and associated reference picture indices, each pair indicating, for example, how a specific part or macroblock of the current picture is predicted from an indexed reference picture by displacing the respective reference picture by the respective motion vector. Each macroblock may be assigned one or more pairs of motion vectors and reference picture indices. Moreover, some of the macroblocks of a picture may be intra predicted, i.e., predicted by use of the information of the current picture. As described in the following, the prediction block 110b may perform a hierarchical motion-compensated prediction on the decimated video signal 108.
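The structure of the motion information just described can be made concrete with a minimal sketch; the Python types below are illustrative assumptions, not the patent's own representation, and simply model a macroblock carrying one or more (motion vector, reference picture index) pairs or being intra-predicted:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MotionVector:
    dx: int  # horizontal displacement (e.g., in quarter-pel units)
    dy: int  # vertical displacement

@dataclass
class MacroblockMotion:
    # A macroblock carries one or more (motion vector, reference
    # picture index) pairs; intra macroblocks carry none and are
    # predicted from the current picture instead.
    intra: bool = False
    hypotheses: List[tuple] = field(default_factory=list)  # (MotionVector, ref_idx)

mb = MacroblockMotion()
mb.hypotheses.append((MotionVector(dx=4, dy=-2), 0))  # predict from reference 0
```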

The prediction block 110b outputs the motion information 116b as well as the prediction residuals or the video texture information 118b representing the differences between the predictors and the actual decimated pictures 108a. As will be described in more detail below, the determination of the motion information and texture information 116b and 118b in prediction block 110b is performed such that the resulting encoding of this information by means of the subsequent base layer coding 112b results in a base-representation bit-stream with, preferably, optimum rate/distortion performance. Moreover, as will be described in more detail below, the prediction block 110b also determines, in cooperation with the refinement coding block 114b, refined motion information along with corresponding refined residual texture information; however, this will also be described in more detail below.

As already described above, the base layer coding block 112b receives the first motion information 116b and texture information 118b from block 110b and encodes this information into a base-representation bit-stream 120b. The encoding performed by block 112b comprises a transformation and a quantization of the texture information 118b. In particular, the quantization used by block 112b is relatively coarse. Thus, in order to enable quality-up-scaling of bit-stream 120b, the refinement coding block 114b supplements the bit-stream 120b with additional bit-streams for various refinement layers containing information for refining the coarsely quantized transform coefficients representing the texture information in bit-stream 120b. In this regard, the refinement coding block 114b is not only able to refine the transform coefficients representing the texture information 118b relative to the base representation of bit-stream 120b or a lower refinement layer bit-stream output by coding block 114b, i.e., 122b. Rather, as will be described in more detail below, refinement coding block 114b, in cooperation with the prediction block 110b, is further able to decide that a specific refinement layer bit-stream 122b should be accompanied by refined motion information 116b. However, this functionality will be described later, after the description of Fig. 6. The refinement of the residual texture information relative to the base representation 120b or the formerly output lower refinement layer bit-stream 122b comprises, for example, the encoding of the current quantization error of the transform coefficients representing the texture information 118b with a finer quantization precision.

Both bit-streams 120b and 122b are multiplexed by a multiplexer 124 comprised by encoder 100 in order to insert both bit-streams into the final scalable bit-stream 126 representing the output of encoder 100.

Layer 102a operates substantially the same as layer 102b. Accordingly, layer 102a comprises a motion-compensated prediction block 110a, a base layer coding block 112a, and a refinement coding block 114a. In conformity with layer 102b, the prediction block 110a receives the video signal 104 and performs a motion-compensated prediction thereon in order to obtain motion information 116a and texture information 118a. The firstly output motion and texture information 116a and 118a are received by coding block 112a, which encodes this information to obtain the base representation bit-stream 120a. The refinement coding block 114a codes refinements of the quantization error manifesting itself in the base representation 120a by comparing the transform coefficients in bit-stream 120a with the actual transform coefficients resulting from the original texture information 118a and, accordingly, outputs refinement-layer bit-streams 122a for various refinement layers. Moreover, as mentioned above, for the various refinement layers, the refinement coding means 114a may use refined motion information, which is also generated by prediction block 110a.

The only difference between layers 102a and 102b is that layer 102a involves inter-layer prediction. That is, as will be described in more detail below, the prediction block 110a uses information derivable from layer 102b, such as residual texture information, motion information, or a reconstructed video signal as derived from one or more of the bit-streams 120b and 122b, in order to pre-predict the higher resolution pictures 104a of the video signal 104, thereafter performing the motion-compensated prediction on the pre-prediction residuals, as mentioned above with respect to prediction block 110b relative to the decimated video signal 108.

In order to illustrate the advantage of the present invention, the mode of operation of encoder 100 is described in more detail in the following, initially, however, neglecting the usage of refined motion information in the refinement coding process in blocks 114a and 114b.

In particular, the following detailed description of the encoder 100 represents an embodiment where this encoder is designed to represent a scalable extension of the video coding standard H.264/MPEG4-AVC. In this case, in each layer 102a, b a hierarchical prediction structure as illustrated in Fig. 2 is employed. The first picture 202 of a video sequence 204 is coded as IDR (intra) picture; so-called key pictures 202, 206 are coded in regular intervals. A key picture 202, 206 and all pictures that are temporally located between a key picture and the previous key picture are considered to build a group of pictures 208 (GOP). The key pictures 202, 206 are either intra-coded or inter-coded by using previous key pictures as reference for motion-compensated prediction. The remaining pictures of a GOP are hierarchically predicted as shown in Fig. 2. It is obvious that this hierarchical prediction structure provides temporal scalability; but it turned out that it also offers the possibility of efficiently integrating spatial and SNR scalability.
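A minimal sketch of this dyadic hierarchy may help; the following Python function is an illustrative assumption (not part of the cited working drafts) and assigns each picture of a GOP a temporal level such that dropping all pictures above a chosen level leaves a decodable, lower frame-rate sub-sequence:

```python
def temporal_levels(gop_size: int):
    """Assign each picture position in a GOP its temporal level under
    the dyadic hierarchical prediction structure of Fig. 2: the key
    picture closing the GOP is level 0, the centre picture level 1,
    and so on. Discarding all pictures above a level yields a valid
    lower frame-rate sub-sequence (temporal scalability)."""
    levels = {gop_size: 0}            # key picture closes the GOP
    step, level = gop_size // 2, 1
    while step >= 1:
        for pos in range(step, gop_size, 2 * step):
            levels[pos] = level
        step //= 2
        level += 1
    return levels

print(temporal_levels(8))  # {8: 0, 4: 1, 2: 2, 6: 2, 1: 3, 3: 3, 5: 3, 7: 3}
```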

The hierarchical picture coding can be extended to motion-compensated temporal filtering (MCTF). For that, motion-compensated update operations using the prediction residuals (dashed arrows in Fig. 2) are introduced in addition to the motion-compensated prediction (continuous arrows in Fig. 2). A detailed description of how H.264/MPEG4-AVC is extended towards MCTF can be found in J. Reichel, H. Schwarz, and M. Wien, eds., "Scalable Video Coding - Working Draft 3," Joint Video Team, Doc. JVT-P201, Poznan, Poland, July 2005, and J. Reichel, H. Schwarz, and M. Wien, eds., "Joint Scalable Video Model JSVM-3," Joint Video Team, Doc. JVT-P202, Poznan, Poland, July 2005.

Temporal scalability can thus be provided by using a hierarchical prediction structure similar to that depicted in Fig. 2. This can be achieved with single-layer H.264/MPEG4-AVC and does not require any changes of the standard. For spatial and SNR scalability, additional tools have to be added to single-layer H.264/MPEG4-AVC. All three scalability types can be combined in order to generate a bit-stream 126 that supports a large degree of combined scalability.

Next, the SNR scalability enabled by refinement coding blocks 114a and 114b is described. Still, the motion information refinement feature of the encoder is preliminarily neglected, in order to afterwards emphasise the advantage of the present invention.

For SNR scalability, coarse-grain scalability (CGS) and fine-granular scalability (FGS) may be distinguished. With CGS, only selected SNR scalability layers are supported, and the coding efficiency is optimised for coarse rate graduations such as a factor of 1.5-2 from one layer to the next. FGS enables the truncation of NAL units (bit packets) at any arbitrary (byte-aligned) point.

For coarse-grain SNR scalability, the same concepts as for spatial scalability (see Fig. 1) are used, but without the upsampling operations. Thus, the pictures of different coarse-grain SNR layers are coded independently with layer-specific motion information. However, in order to improve the coding efficiency of the enhancement layers in comparison to simulcast, additional inter-layer prediction mechanisms are incorporated in order to employ base layer information for an efficient enhancement layer coding. It was found that these inter-layer prediction techniques should be switchable, so that an encoder can freely choose which dependencies between CGS layers need to be exploited for an efficient coding.

The following three inter-layer prediction techniques turned out to provide coding gains and were included in the SVC design. In the following, we only describe the original concept based on simple dyadic spatial scalability. However, during the development of SVC it was shown that the tools can be generalized for arbitrary resolution factors.

Inter-layer motion prediction

In order to employ base layer motion data for an efficient coding of the enhancement layer motion, an additional macroblock mode that utilizes motion information of the lower resolution layer has been introduced. If this macroblock mode is selected, the macroblock partitioning is copied from the co-located macroblock of the corresponding base layer. For the macroblock partitions and sub-macroblock partitions, the same reference picture indices and the same motion vectors as for the corresponding macroblock partition or sub-macroblock partition of the base macroblock are used. Neither reference indices nor motion vector differences are transmitted. Additionally, the design of SVC includes the possibility to use a motion vector of the lower layer as motion vector predictor for the conventional motion-compensated macroblock modes.
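As a rough sketch, this inherited-motion mode could look as follows; the (dx, dy, ref_idx) tuple representation is an illustrative assumption, the vectors are copied unchanged in the SNR (CGS) case, and the factor-2 scaling shown for the dyadic spatial case is only the natural analogue, not spelled out in the paragraph above:

```python
def inherit_base_layer_motion(base_partitions, scale=1):
    """Inter-layer motion prediction sketch: the enhancement macroblock
    copies the partitioning of the co-located base layer macroblock
    and reuses its reference indices and motion vectors; no reference
    indices or motion vector differences are transmitted in this mode.
    scale=1 models SNR (CGS) layers of identical resolution; scale=2
    would be the natural analogue for dyadic spatial scalability."""
    return [(dx * scale, dy * scale, ref_idx)
            for (dx, dy, ref_idx) in base_partitions]

# One base layer partition with vector (3, -1) into reference picture 0:
print(inherit_base_layer_motion([(3, -1, 0)]))           # CGS: copied as-is
print(inherit_base_layer_motion([(3, -1, 0)], scale=2))  # dyadic spatial case
```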

Inter-layer residual prediction

In order to also incorporate the possibility of exploiting the residual information coded in the lower layer, an additional flag is transmitted for each inter-coded macroblock, which signals the application of residual signal prediction from the lower resolution layer. If the flag is true, the base layer residual signal is used as prediction for the residual signal of the current layer, so that only the corresponding difference signal is coded.

Inter-layer intra prediction

Furthermore, an additional intra macroblock mode, in which the intra prediction signal is formed by the reconstruction signal of the lower layer, is introduced. For this inter-layer intra prediction it is generally required that the lower layer is completely decoded, including the computationally complex operations of motion-compensated prediction and deblocking. However, this problem can be circumvented when the inter-layer intra prediction is restricted to those parts of the lower layer picture that are coded with intra macroblocks. This restriction enables single motion compensation loop decoding and is mandatory in the current SVC working drafts (J. Reichel, H. Schwarz, and M. Wien, eds., "Scalable Video Coding - Working Draft 3," Joint Video Team, Doc. JVT-P201, Poznan, Poland, July 2005, and J. Reichel, H. Schwarz, and M. Wien, eds., "Joint Scalable Video Model JSVM-3," Joint Video Team, Doc. JVT-P202, Poznan, Poland, July 2005).

In order to support fine granular SNR scalability, so-called progressive refinement (PR) slices (for details, see the above-mentioned working drafts) have been introduced. Each NAL unit for a PR slice represents a refinement signal that corresponds to a bisection of the quantization step size (QP increase of 6). These signals are represented/generated by blocks 114a, b in a way that only a single inverse transform has to be performed for each transform block at the decoder side. The progressive refinement NAL units can be truncated at an arbitrary point, so that the quality of the SNR base layer can be improved in a fine granular way. Therefore, the coding order of transform coefficient levels has been modified. Instead of scanning the transform coefficients macroblock by macroblock as it is done in "normal" slices, the transform coefficient blocks are scanned in several paths, and in each path only a few coding symbols for a transform coefficient block are coded. With the exception of the modified coding order, the CABAC entropy coding as specified in H.264/MPEG4-AVC is re-used.
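The "bisection of the quantization step size" per refinement can be made concrete with the H.264-style relation between QP and quantizer step size, where the step doubles every 6 QP values; the following minimal sketch (the 0.625 base step and the helper name are illustrative assumptions) shows how each refinement layer coded 6 QP values lower halves the step size:

```python
def quant_step(qp: int) -> float:
    """H.264-style quantizer step size: the step doubles every 6 QP
    values (0.625 approximates the step at QP 0). Hence coding a
    refinement at QP - 6 bisects the quantization step size, which is
    what each progressive refinement NAL unit corresponds to."""
    return 0.625 * 2.0 ** (qp / 6.0)

base_qp = 30
for layer in range(3):
    qp = base_qp - 6 * layer
    print(f"refinement layer {layer}: QP {qp}, step {quant_step(qp):.3f}")
```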

When FGS layers are truncated, the reference pictures that are used in encoder and decoder are different. In order to limit the corresponding drift, the key pictures are used as re-synchronisation points between encoder and decoder, as illustrated in Fig. 3. Fig. 3 demonstrates the amount of bit-stream data for five pictures 250a to 250e of a video sequence, the bit-stream data for each picture 250a to 250e in scalable data-stream 126 being divided into a SNR base layer part (lower half) and a SNR enhancement layer part (upper half). As can be seen, while for the other pictures the reference pictures 250a, 250e including PR slices are used for motion-compensated prediction (dashed arrows in Fig. 3), the motion-compensated prediction signal for key pictures 250a, e is generated by using only the base layer representation (lower half, without FGS enhancements) of the reference key pictures (line-dot arrow). Thus, it is ensured that the motion-compensated prediction signal for key pictures is identical at the encoder and decoder side. The non-key pictures 250b, c, d are predicted by using the highest quality reference that is available (SNR enhancement layer), since the reconstruction quality is highly dependent on the quality of the reference pictures 250a, e, and the drift is limited by the hierarchical prediction structure.

With CGS, only complete SNR layers can be discarded, and thus the reference pictures at the encoder and decoder side are always identical. Thus, for CGS coding, also the key pictures are predicted using the highest quality reference available (dashed arrow), since this generally improves the coding efficiency.

In Fig. 4, a comparison of the coding efficiency of coarse-grain and fine-grain SNR scalable coding is illustrated for an example sequence. The base layer has always been coded in compliance with H.264/MPEG4-AVC. Only the first picture of a sequence was intra-coded, and a GOP size of 16 pictures has been selected. No motion-compensated update steps have been employed. All encoder runs have been performed with a rate-distortion optimised encoder control as described in T. Wiegand et al., "Rate-Constrained Coder Control and Comparison of Video Coding Standards," IEEE Trans. CSVT, vol. 13, pp. 688-703, July 2003. The difference between the quantization parameters of the lowest and highest SNR layers was set to 12, which approximately corresponds to a factor of 4 in bit-rate. The dashed curve represents CGS runs with adaptive selection of the inter-layer prediction tools and quantization parameter differences of 6. A corresponding CGS run for which all inter-layer prediction tools have always been used is represented by the continuous curve connecting the triangles. For both runs, the same motion parameters optimised for the lowest rate point have been chosen in the base layer. A comparison of these two curves shows that the coding efficiency for the CGS enhancement layers can always be improved when the inter-layer prediction tools are adaptively selected, i.e. especially when the motion parameters for the enhancement layer are adaptively refined, as described further below.

The continuous curve connecting the circles represents an FGS coding run that is comparable with the CGS run of the triangle-fitted curve. In both runs, the QP difference between successive layers is equal to 6 and no motion refinements are used. But while with CGS only three bit-rates are provided, the rate for decoding the FGS bit-stream can be arbitrarily chosen inside the supported interval. The difference between the points of the green curve and the FGS layer end points of the red curve, which are marked by black circles, results from the fact that when using the FGS functionality the key pictures are only predicted from the base representation of previous key pictures, while in the CGS run the highest quality reference is always used for motion-compensated prediction.

After having described the functionality or operation of the encoder 100 of Fig. 1 without use of the refinement of motion information during the refinement coding process in blocks 114a and 114b, the following description focuses on this functionality and its advantages.

It is a well-known fact that for an efficient coding of video sequences in the rate-distortion sense, the trade-off between motion and texture rate has to be optimised for the target bit-rate. However, in the SNR scalable video coding being of interest here, the encoder decisions have to be optimised for a rate interval instead of a specific target rate. When the supported rate interval is large, e.g. when the rate increase between the lowest and highest rate point is greater than 100% of the lowest rate point, the usage of a single motion vector field generally results in poor coding efficiency. The coding efficiency can be improved when an adaptive refinement of motion data is possible. This has been shown with reference to Fig. 4 for coarse-grain SNR scalability. The coding efficiency at high bit-rates (about four times the base layer rate) can be improved by more than 1 dB compared to a simple scheme when an adaptive switching of the inter-layer prediction mechanisms is allowed. In the simple scheme as described hereinbefore, the motion data of all layers 122 are identical and the reconstructed base layer signal is always used as predictor for the signal of the current layer. The improved coding efficiency of the adaptive concept, which allows an adaptive switching of the inter-layer prediction techniques, is mainly a result of the possibility to refine the motion information of the base layer for the coding of the enhancement or refinement layers. Thus, the trade-off between motion and texture data can be optimised for each coarse-grain SNR layer.

The current concept of fine-grain scalability of Fig. 12 does not provide for an adaptive refinement of motion parameters. With the syntax of the so-called progressive refinement slices of the SVC design of Fig. 12, only refinements of the transform coefficient levels, and thus of the residual (texture) information, could be represented. Thus, the FGS concept is similar in spirit to the simple CGS concept, which does not allow an adaptive switching of the inter-layer prediction techniques on a block basis. The main differences between the simple CGS approach and the fine-grain scalable coding with progressive refinement slices are the following: (1) With the FGS concept, the transform coefficient levels are directly refined in the transform domain, so that only a single inverse transform is required at the decoder side for each block. For CGS, the current residual (or intra) signal is predicted by the reconstructed base layer residual (or intra) signal, and thus multiple inverse transforms for each block are required at the decoder side. (2) While in CGS the coding symbols of the refinement pictures are coded in a common macroblock-by-macroblock scan, the coding symbols for progressive refinement slices are coded via so-called cyclic block coding. The coding symbols (transform coefficient levels) are coded in several scans, where in each scan only a few coding symbols (e.g. only one significant transform coefficient level) are coded for each transform block. This modified coding order allows the truncation of the corresponding NAL units without generating disturbing coding artefacts.
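The cyclic block coding order described in point (2) can be sketched as follows; this Python fragment (the names and the one-symbol-per-cycle choice are illustrative assumptions) shows why a truncated NAL unit degrades all blocks roughly evenly instead of cutting off whole picture regions:

```python
def cyclic_scan(blocks, symbols_per_cycle=1):
    """Interleave the refinement symbols of several transform blocks:
    instead of emitting all symbols of one block before the next
    ('normal' macroblock-by-macroblock order), each cycle visits every
    block and emits only a few symbols, so truncating the resulting
    NAL unit leaves every block partially refined."""
    coded, remaining = [], [list(b) for b in blocks]
    while any(remaining):
        for idx, block in enumerate(remaining):
            take = block[:symbols_per_cycle]
            remaining[idx] = block[symbols_per_cycle:]
            coded.extend((idx, s) for s in take)
    return coded

# Three 4-symbol blocks interleaved one symbol per cycle:
print(cyclic_scan([[1, 2, 3, 4], [5, 6, 7, 8], [9, 8, 7, 6]]))
```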

Thus, combining the feature of CGS that motion parameters can be refined in each layer with the progressive refinement coding of the FGS concept results in an FGS coding scheme that allows the refinement of motion parameters, and that thus shows a significantly improved coding efficiency compared to the FGS concept of the SVC of Fig. 12.

Thus, referring back to Fig. 1, the new approach of adaptive motion information refinement for SNR scalable video coding gives the video encoder 100 the choice to select a better trade-off, in the rate-distortion (RD) sense, between the bit rate for coding of the residual 118 and the motion data 116. For each macroblock in a progressive refinement slice which corresponds to a base layer slice in bit-stream 122 that supports motion-compensated prediction (so-called P- and B-slices), block 114a, b decides between two possible coding modes:

Mode 1 - Using the same motion information as the SNR base layer

Using the same motion information as the SNR base layer, and thus transmitting only a refinement of the residual data, is the first option. Fig. 12 only supports this mode for FGS. This option or coding mode of coding block 114a, b is illustrated in Fig. 7. Fig. 7 illustrates the process of refinement coding block 114a, b for the case that the coding block 114a, 114b decides for each refinement layer to reuse the motion information of the base representation. To be more precise, Fig. 7 shows an exemplary part 300 of the video sequence input into prediction block 110a, b. For a specific picture 302 of the sequence 300, the motion information m_0 created by prediction block 110a, b is illustrated by arrow 304. As shown at 306, the information entering base layer coding block 112a, b comprises the base-layer motion information m_0 and the residual texture information residual_0. Residual information residual_0 in combination with the predictor for picture 302 derived from motion information m_0 represents the reconstruction of the base-representation information derivable from the encoder bit-stream 120a, b, wherein it is emphasized that "residual_0" shall indicate the residual information 118a, b after passing the base layer coding 112a, b and the corresponding quantization. Next, it can be seen at 308 that the refinement coding block 114a, b decides to reuse the motion information m_0 as input into the base layer coding 112b. Coding block 114a, b merely generates additional information residual_1 that refines the coarsely quantized residual information residual_0 to a higher precision. Thus, residual_1 forms the first refinement layer, and a combination of predictor (m_0), residual_0 and residual_1 represents the reconstruction of the first refinement layer. Similarly, as shown at 310, for a second refinement layer, the coding block 114a, b generates additional information residual_2 for finer quantizing the residual information as represented by residual_1 and residual_0. It is emphasized with respect to Fig. 7 that the quantization subdivision performed with increasing refinement layer number is directly performed in the transform domain, not in the spatial domain, so that at the decoder side, as mentioned above, merely one re-transformation is necessary. Further, in Fig. 7, m_0 and residual_0 are output by block 112a, b, while refinement coding block 114a, b consecutively outputs residual_1, residual_2, residual_3, ... Multiplexer 124 multiplexes them in the order m_0, residual_1, residual_2, residual_3, ... into bit-stream 126, for example. Further, in each block residual_i, the transform coefficients may be arranged such that a sudden truncation of the resulting multiplexed scalable bit-stream 126 results in optimum coding efficiency.
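A minimal numerical sketch of this Fig. 7 reconstruction rule, in Python with the inverse transform abstracted away (all names and values are illustrative assumptions):

```python
import numpy as np

def reconstruct_fig7(predictor_m0, residuals):
    """Fig. 7 rule (motion information m_0 reused for every layer):
    the decoder sums the base layer residual and all received
    refinement residuals; in the real codec this summation happens in
    the transform domain so that a single inverse transform per block
    suffices, while here everything stays in the pixel domain."""
    return predictor_m0 + sum(residuals)

pred = np.full((4, 4), 100.0)            # predictor from motion info m_0
r0 = np.full((4, 4), 8.0)                # coarsely quantized residual_0
r1 = np.full((4, 4), -2.0)               # refinement residual_1
print(reconstruct_fig7(pred, [r0]))      # base representation
print(reconstruct_fig7(pred, [r0, r1]))  # first refinement layer
```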

Mode 2 - Transmission of new motion information together with a new residual

The other option for block 114a, b is the transmission of new motion information together with a new residual. This is illustrated with respect to Fig. 8. Again, the base-layer bit-stream 120a, b is formed by encoding the motion information m_0 and a residual_0 as shown at 306. However, differing from Fig. 7, refinement coding block 114a, b then decides to use new motion information m_1 for the first refinement layer, as shown at 312. Different examples for refinement motion information are described below. Moreover, the decision to replace or refine the base-layer motion information m_0 with motion information m_1 (Fig. 8) is conducted by 114a, b on a macroblock basis, for example. Moreover, the decision may be performed such that the rate/distortion performance for encoding the reconstruction 314 of the first refinement layer is increased relative to Fig. 7 or even optimized. Conversely, the rate/distortion performance may be higher in case of reusing the motion information as shown in Fig. 7. In this case, the decision is negative for the option of Fig. 8 and positive for the option of Fig. 7.

Referring back to Fig. 8, it is noted that the generation of the new motion information m_1 is actually performed by the prediction block 110a, b. Concurrently, prediction block 110a, b generates corresponding residual texture information 118a, b. Same are then encoded by refinement coding block 114a, b along with the new motion information m_1 to yield the first refinement layer bit-stream 122a, b, with the encoding of the residual information resulting in residual_1, and with the quantization during encoding the residual being performed with a smaller quantization step size than used for quantizing or encoding residual_0 in base layer coding block 112. Similarly, as shown at 316, in building the second refinement layer, coding block 114 again uses new motion information m_2. It is emphasized that this does not have to be the case. Rather, the coding block 114 could reuse the motion information m_1, thereby deciding between the options of Fig. 7 and Fig. 8 on a refinement-layer basis.

Fig. 9 represents an alternative embodiment for the case of Fig. 8, i.e. for the case that the coding block 114 decides to transmit new motion information together with a new residual. However, in contrast to Fig. 8, where residual_i is equal to a finer quantized version of a difference between the actual picture 302 and the sum of the predictive picture derived from the motion information m_i plus the residual information of the quantization levels available so far, i.e. residual_0 plus ... plus residual_(i-1), in accordance with Fig. 9 the refinement coding block 114a, b disregards the residual information available so far and encodes a new residual to the new motion information m_1, m_2, ... by use of the quantization level associated with the current refinement layer. Therefore, as can be seen at 318, for the first refinement layer, coding block 114a, b combines the new motion information m_1 with new residual information residual_1, so that picture 302 of the first refinement layer is derivable from the first refinement layer bit-stream 122a, b independently of the base layer bit-stream 120a, b, i.e. without reuse of residual_0 in bit-stream 120a, b. Similarly, at 320, refinement coding block 114a, b forms the second refinement layer bit-stream 122a, b by combining new motion information m_2 with new residual information residual_2 being the result of encoding the residual between the actual picture 302 and the predictive picture obtained from the new motion information m_2.
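The contrast between the two variants can be summarized in a short sketch; assuming pixel-domain arrays and illustrative names, the Fig. 8 reconstruction stacks the new residual on top of the lower-layer residuals, while the Fig. 9 reconstruction uses only the new residual:

```python
import numpy as np

def reconstruct_fig8(pred_refined, lower_residuals, new_residual):
    # Fig. 8: the new residual refines the residuals of the lower
    # layers: predictor(m_i) + residual_0 + ... + residual_(i-1)
    #         + residual_i.
    return pred_refined + sum(lower_residuals) + new_residual

def reconstruct_fig9(pred_refined, new_residual):
    # Fig. 9: the layer is self-contained; lower-layer residuals are
    # disregarded: predictor(m_i) + residual_i.
    return pred_refined + new_residual

pred = np.full((4, 4), 100.0)  # predictor from refined motion info m_i
print(reconstruct_fig8(pred, [np.full((4, 4), 8.0)], np.full((4, 4), -2.0)))
print(reconstruct_fig9(pred, np.full((4, 4), 6.0)))
```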

First, with regard to the option of a coding mode of transmitting new motion information together with a new residual, the new residual data may or may not be predicted from the SNR subordinate layer. Similarly, the new motion information of a current SNR layer may or may not be coded independently of the motion information of the SNR subordinate layer. Thus, both the new motion and the new residual data can be predicted from the SNR subordinate layer to achieve a better RD performance.

It is possible to restrict the possible motion modes of the base-layer motion information and the refined motion information to motion modes supported by the video coding standard H.264/MPEG4-AVC, which means that, by subdivision of the macroblock into smaller blocks for motion-compensated prediction, up to 16 motion vectors for P-slices and up to 32 motion vectors for B-slices can be signalled.

One obvious way to make the above decision between the two options is to use a Lagrangian approach where a Lagrangian cost functional J = D + λ·R is minimized for a given λ. Here, D stands for the distortion between the original and the reconstructed (decoded) signal as illustrated on the right-hand side of the equals sign in Figs. 7 to 9, and R gives the bit rate needed for coding of the macroblock. If the cost for refining only the residual data is higher than the cost for one of the possible motion refinement modes, it is in the rate-distortion sense obviously better to transmit a new set of motion information for the specific macroblock. Consequently, using adaptive motion information refinement as used by the encoder of Fig. 1, it is possible to achieve a higher picture quality at the same bit rate.
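A minimal sketch of this decision rule, with illustrative distortion/rate numbers (the mode names and the value of λ are assumptions for demonstration only):

```python
def choose_refinement_mode(candidates, lam):
    """Lagrangian mode decision: for each candidate coding mode of a
    macroblock (mode 1: reuse motion info and refine only the
    residual; mode 2: transmit refined motion information as well),
    evaluate J = D + lam * R and keep the minimiser.
    'candidates' maps a mode name to its (distortion, rate) pair."""
    return min(candidates,
               key=lambda m: candidates[m][0] + lam * candidates[m][1])

modes = {
    "refine_residual_only": (40.0, 1200),        # D (e.g. SSD), R in bits
    "refine_motion_and_residual": (25.0, 1500),
}
print(choose_refinement_mode(modes, lam=0.02))   # refine_motion_and_residual
```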

Before describing the functionality of the refinement coding blocks 114a, b with regard to progressive refinement slices in further detail, different possibilities of how to refine motion information from one refinement layer to the next are described in the following. These possibilities may be mixed when refining motion information.

The first possibility is shown in Fig. 10a. As illustrated, the motion information m_i of a base layer or subordinate refinement layer may be refined by defining new motion information m_(i+1) for the current refinement layer which further partitions the current picture, thereby increasing the resolution with which motion vectors are defined for the current picture. In the example of Fig. 10a, the motion information of the subordinate layer comprises just one motion vector 400, with a corresponding reference index referencing picture 402, for a specific partition 404 of the current picture. The new motion information m_(i+1) comprises pairs of motion vectors 406a to 406d and corresponding reference indices, each associated with one of four sub-partitions 404a to 404d of current partition 404. In this regard, it is recalled that the actual motion information refinement information coded into the current refinement layer bit-stream by coding block 114 may either define the motion vectors 406a to 406d anew, independently of the motion vector 400, or may define motion vectors 406a to 406d merely by the differences of these vectors to vector 400, respectively.
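
The difference-based variant of Fig. 10a can be sketched as follows; the MotionVector type and the concrete delta values are illustrative assumptions, not the patent's syntax:

```python
from dataclasses import dataclass

@dataclass
class MotionVector:
    x: int
    y: int

def refine_by_subdivision(base_mv, deltas):
    """Fig. 10a: the single motion vector of a partition is replaced by one
    vector per sub-partition; the refinement layer transmits only the
    differences (deltas) relative to the base vector."""
    return [MotionVector(base_mv.x + d.x, base_mv.y + d.y) for d in deltas]

base = MotionVector(5, -2)  # corresponds to motion vector 400
deltas = [MotionVector(0, 0), MotionVector(1, 0),
          MotionVector(-1, 1), MotionVector(0, 2)]
sub_mvs = refine_by_subdivision(base, deltas)  # vectors 406a to 406d
```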

Another possibility to refine motion information is illustrated in Fig. 10b. In this case, the motion information is refined in that, relative to the former motion information m_i, an additional motion hypothesis is specified in the new motion information m_(i+1) of the current refinement layer. Again, as in the example of Fig. 10a, the motion information of the subordinate refinement layer or base layer is illustrated to contain merely one motion vector 400, along with the corresponding reference index to the reference picture 402, for one partition 404. Compared thereto, the new motion information m_(i+1) contains, for that partition 404, another motion vector 410 belonging to, and originating from, a picture 412 that is different from reference picture 402. Therefore, the number of motion hypotheses used for partition 404 is increased in the case of the new motion information. The new motion information m_(i+1) may incorporate information as to how partition 404 is to be predicted from both motion hypotheses, such as by averaging both predictions. However, conventions at encoder and decoder may be used in order to avoid the transmission of such information.
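
A toy model of the Fig. 10b combination, assuming the simple averaging convention mentioned above (flat prediction blocks stand in for real motion-compensated predictions):

```python
import numpy as np

def combine_hypotheses(pred_a, pred_b, weight=0.5):
    # Fixed 50/50 averaging by convention, so no weight is transmitted.
    return weight * pred_a + (1.0 - weight) * pred_b

pred_from_402 = np.full((8, 8), 100.0)  # hypothesis via vector 400, picture 402
pred_from_412 = np.full((8, 8), 110.0)  # added hypothesis via vector 410, picture 412
combined = combine_hypotheses(pred_from_402, pred_from_412)  # all 105.0
```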

In addition to the above-described possibilities of refining or adapting the motion information from one refinement layer to the next, the motion information may be refined by merely amending the motion vector. The reason that a "wrong" motion vector may lead to a better R/D performance for a low refinement layer or for the base layer is, for example, that the encoding of a larger motion vector requires more bits for representing the motion information.

Therefore, in such cases, the base layer motion information is likely to comprise motion vectors that are shorter than they should be. However, with increasing rates, the bits may, in a rate-distortion sense, be better spent on finer, more accurate motion information, so that for higher refinement layers the motion information is likely to change to indicate a longer motion vector.
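
As a concrete illustration of why longer motion vectors cost more rate, the following sketch computes the bit cost of one motion vector difference component under signed Exp-Golomb coding, as used for motion vector differences in H.264/MPEG4-AVC CAVLC; the sample magnitudes are arbitrary:

```python
import math

def se_v_bits(mvd):
    """Bit cost of one motion vector difference component under signed
    Exp-Golomb coding (se(v) in H.264/MPEG4-AVC CAVLC)."""
    code_num = 2 * abs(mvd) - (1 if mvd > 0 else 0)
    return 2 * int(math.log2(code_num + 1)) + 1

for mvd in (0, 1, 4, 16, 64):
    print(mvd, se_v_bits(mvd))  # 1, 3, 7, 11, 15 bits: cost grows with magnitude
```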

In the following, it is described which changes would have to be performed on the encoder of Fig. 12 as implemented in the above-mentioned working drafts JVT/P201 and JVT/P202 in order to correspond to the Fig. 1 embodiment. In particular, all the changes described below only refer to progressive refinement slices which correspond to slices that support motion-compensated prediction (i.e., P- and B-slices). For progressive refinement slices which correspond to intra-coded slices (I-slices), the coding as described in JVT/P201 and JVT/P202 is used. As described there, progressive refinement slices are coded using cyclic block coding, where all the macroblocks in a video frame and all the blocks in a macroblock are scanned in multiple cycles. This should be changed such that, in the first cycle, it is signalled for each macroblock whether motion information will be transmitted for this macroblock. If no motion information is transmitted, the coding of this macroblock continues as described in JVT/P201 and JVT/P202. In the other case, after the described signalling, the following information should be transmitted:

• the selected macroblock prediction mode (Direct, 16x16, 16x8, 8x16 or 8x8),

• in the case of 8x8 mode: the selected sub-macroblock prediction mode (8x8, 8x4, 4x8 or 4x4),

• the residual prediction flag, which signals that the residual signal is to be predicted from the subordinate layer,

• the motion prediction flag, which signals that the motion vector is to be predicted from the subordinate layer,

• the reference frame index,

• the motion vector(s) as a difference between the original motion vector to be used and a prediction thereof.

In the case of a bi-predictively coded slice (B-slice), the last three of these are coded once for each reference frame list. After this, the coding of the residual data follows, similar to the residual coding described in JVT/P201 and JVT/P202 for progressive refinement slices. The transmission of this information can be done similarly to the current JVT/P201 and JVT/P202 Working Draft, but the coding efficiency can possibly be improved by using a different syntax for transmitting this refinement information. The following possibilities, or any combination of these, are considered as preferred embodiments of the invention (see the parsing sketch after this list):

• A macroblock mode, in which the macroblock partitioning of the base macroblock is used, but reference picture indices and/or motion vector refinements are transmitted, possibly by using the base layer syntax elements for generating predictors.

• The transmission of a new macroblock mode, which specifies a new partitioning of the macroblock together with refinements for the reference indices and/or motion vectors. The transmission of the new macroblock mode can also be realized by refining the macroblock partitioning of the base layer.

• The specification of an additional motion hypothesis in addition to the motion hypothesis of the base layer. For example, when the base layer macroblock specifies a 16x16 mode with prediction from list 0, then in the enhancement layer an additional motion hypothesis could specify a 16x16 mode with prediction from list 1, and the final motion-compensated prediction signal is generated as the weighted average of both the 16x16 list 0 and the 16x16 list 1 predictions. The weighting factor could additionally be specified.
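
A rough sketch of the first-cycle signalling order listed further above might look as follows; the StubReader, the mode numbering and the dictionary layout are illustrative assumptions, not the working drafts' actual syntax or parsing process:

```python
MB_DIRECT, MB_16x16, MB_16x8, MB_8x16, MB_8x8 = range(5)

class StubReader:
    """Stand-in entropy decoder: flag/ue/se just pull pre-baked symbols."""
    def __init__(self, symbols):
        self._it = iter(symbols)
    def flag(self): return next(self._it)
    def ue(self):   return next(self._it)
    def se(self):   return next(self._it)

def parse_pr_macroblock(bs, is_b_slice=False):
    mb = {"motion_refinement": bs.flag()}
    if not mb["motion_refinement"]:
        return mb  # macroblock coded as in JVT/P201 and JVT/P202
    mb["mb_mode"] = bs.ue()
    if mb["mb_mode"] == MB_8x8:
        mb["sub_modes"] = [bs.ue() for _ in range(4)]  # 8x8, 8x4, 4x8 or 4x4
    mb["residual_pred_flag"] = bs.flag()
    # Motion prediction flag, reference index and motion vector difference
    # are coded once per reference frame list for B-slices.
    for lst in (("l0", "l1") if is_b_slice else ("l0",)):
        mb["motion_pred_flag_" + lst] = bs.flag()
        mb["ref_idx_" + lst] = bs.ue()
        mb["mvd_" + lst] = (bs.se(), bs.se())
    return mb

# Example: P-slice macroblock signalling a 16x16 motion refinement.
mb = parse_pr_macroblock(StubReader([1, MB_16x16, 0, 1, 3, 2, -1]))
```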

Note that, without transmitting any refinement of the motion information, it is still possible to achieve nearly the same rate-distortion characteristic as that of the current SVC WD, since the bit rate is increased only by the additional possibility to signal a motion refinement mode per macroblock, which can be realized very efficiently by a well-designed entropy coding. Consequently, the coding efficiency using motion information refinement cannot become noticeably worse than that of the current SVC described in JVT/P201 and JVT/P202.

The predictor for the motion vectors can be adaptively switched between a spatial motion vector predictor (similar to H.264/MPEG4-AVC) and a prediction from the corresponding FGS base layer. Additionally, when the adaptive motion information refinement is used, the prediction of the residual data from the corresponding FGS base layer can be modified or completely switched off. The prediction signal used could be obtained by weighting and filtering the corresponding base layer residual signal.
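
One possible reading of the weighted-and-filtered residual prediction is sketched below; the 3-tap kernel, the row-wise filtering and the default weight are assumptions for illustration only:

```python
import numpy as np

def predict_enh_residual(base_residual, weight=0.5):
    """Weighted and filtered base layer residual as predictor for the
    enhancement layer residual; weight 0.0 switches the prediction off."""
    if weight == 0.0:
        return np.zeros_like(base_residual)
    kernel = np.array([0.25, 0.5, 0.25])  # simple 3-tap smoothing, rows only
    filtered = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, base_residual)
    return weight * filtered

base_res = np.arange(16, dtype=float).reshape(4, 4)
pred = predict_enh_residual(base_res, weight=0.5)
```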

Furthermore, in a preferred embodiment for implementing Fig. 1, the transform block size (4x4 or 8x8 transform) can be selected for each FGS layer macroblock independently of the transform size used in the corresponding base layer macroblock.

The new adaptive motion information refinement has little influence on the complexity at the decoder side, since for motion-compensated prediction only different motion information is used; the motion-compensated prediction only has to be done for the highest FGS layer to be decoded. That is, when an FGS layer is present and is desired to be used, no motion-compensated prediction is required for the corresponding base layer macroblock.

For the sake of completeness, Fig. 11 shows a part of the steps performed at the decoder side for decoding the scalable bit-stream 126 generated by the encoder 100 of Fig. 1. In particular, the portion of the steps shown in Fig. 11 concentrates on the decoding of the base-layer or refinement-layer bit-streams 120 and 122, which are present in the scalable bit-stream 126 in multiplexed form. The process begins with a base-layer extraction in step 500, for example by means of parsing datastream 126. The result of this step is the base-layer motion information m_0 and the corresponding residual information residual_0 (see Fig. 8 and Fig. 9). At step 502, it is determined whether a further refinement is desired. For example, it is determined whether a terminal of interest supports a further refined representation of the video signal. If not, the base-layer bit-stream 120 extracted in step 500 is directly decoded in step 504. However, in case further refinement is desired and present, the process of Fig. 11 proceeds to step 506, where the current motion information, i.e. the base-layer motion information, is refined or replaced by the motion information contained in the next or first refinement layer bit-stream 122. "Refining" means that the motion information refinement information contained in the refinement layer bit-stream is not self-contained, but has to be combined with the current motion information m_0 in order to obtain the new motion information m_1. "Replacing" refers to the case where the motion information refinement information contained in the refinement layer bit-stream 122 is self-contained, i.e. the new motion information m_1 is derivable from this bit-stream without use of the current motion information m_0. Thus, in this case, m_0 is disregarded.

Similarly, in step 508, the current residual information, i.e. the transform coefficients representing the current residual (residual_0 in Fig. 8 and 9), is refined or replaced by use of the residual information refinement information contained in the refinement layer bit-stream 122. The refining of current residual information is illustrated in Fig. 8, where residual_1 represents the residual information refinement information contained in the refinement layer bit-stream used for refining the coarsely quantized current residual information residual_0. The case of replacement of the current residual information is shown in Fig. 9. There, new residual information residual_1 is derived from bit-stream 122 only.

After having refined or replaced the motion and residual information at the decoder side, the process loops back to step 502 in order to determine whether an even further refinement is desired or available in the bit-stream 126. If yes, steps 506 and 508 are repeated for the next refinement layer, with the current motion information being m_1 and the current encoded residual information being residual_1 + residual_0 in the case of Fig. 8, and residual_1 in the case of Fig. 9 (see above).

If no further refinement is desired or possible, the process proceeds to step 504, where the current motion information and current residual information are decoded by combining them, i.e. by combining the residual texture derived by decoding the current encoded residual information with the prediction picture derived from the current motion information. The output of step 504 is the video input signal at the desired refinement or quality level.
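
Putting steps 500 to 508 together, a minimal model of the Fig. 11 decoding loop could look like this; the dictionary-based layer representation and the self-contained flags are illustrative, and step 504's actual picture reconstruction is reduced to returning the combined information:

```python
import numpy as np

def decode_scalable_stream(base, refinements, levels):
    motion = base["motion"]       # m_0              (step 500)
    residual = base["residual"]   # residual_0       (step 500)
    for layer in refinements[:levels]:               # step 502 loop
        if layer["motion_self_contained"]:           # "replacing"
            motion = layer["motion"]
        else:                                        # "refining": add deltas
            motion = motion + layer["motion"]        # step 506
        if layer["residual_self_contained"]:         # Fig. 9 style
            residual = layer["residual"]
        else:                                        # Fig. 8 style
            residual = residual + layer["residual"]  # step 508
    return motion, residual  # step 504 would now reconstruct the picture

base = {"motion": np.array([2, 1]), "residual": np.array([4.0, 0.0])}
layer1 = {"motion": np.array([1, -1]), "motion_self_contained": False,
          "residual": np.array([0.5, 1.0]), "residual_self_contained": False}
motion, residual = decode_scalable_stream(base, [layer1], levels=1)
```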

A performance evaluation of the above-presented FGS coding scheme has been performed. The two diagrams in Fig. 5 and 6 show the rate-distortion performance of the invented FGS coding scheme in comparison to the current SVC WD. Two different test sequences, Soccer and Crew, were used at CIF resolution with 30 Hz. It can clearly be seen that the coding efficiency at high bit-rates is significantly improved with the invented FGS coding scheme.

Thus, the above description reveals an FGS coding scheme in which it is possible to use a different method for generating the motion-compensated prediction signal in an FGS enhancement layer macroblock than in the co-located base layer macroblock. Additionally, in the FGS coding scheme described, the different motion-compensated prediction may be specified by a refinement of the motion vectors of the base layer macroblock. Further, in the FGS coding scheme described, the different motion-compensated prediction may be specified by a refinement of the macroblock partitioning of the base layer macroblock or by a completely different macroblock partitioning, as well as corresponding reference indices and motion vectors. For transmission, the reference indices and/or motion vectors can be predicted by using the base layer information. Alternatively, the different motion-compensated prediction may be specified by an additional motion hypothesis, which consists of a macroblock partitioning (which can be identical to the partitioning of the base layer macroblock) and corresponding reference indices and motion vectors. The motion-compensated prediction signal for the FGS enhancement layer may then be generated by a weighted average of the motion-compensated prediction signals that are obtained by using the motion information of the base layer motion hypothesis and the enhancement layer motion hypothesis. Additionally, in this FGS coding scheme, a filtered and weighted (the weight can also be equal to zero) version of the base layer residual signal may be used as a prediction for the enhancement layer residual signal, and the transform block size used for coding the enhancement layer residual signal may be chosen independently of the transform block size used in the base layer.

In even other words, the above embodiments describe a concept for fine-granular SNR-scalable coding of video sequences with an adaptive refinement of motion/prediction information. Compared to concepts for fine-granular scalable coding that only refine the representation of transform coefficients, the coding efficiency can be significantly improved, especially when fine-granular scalability has to be supported for large bit-rate intervals. The modification of motion information is signalled on a macroblock basis in a fine-granular scalable layer. When the corresponding syntax element does not signal a modification of motion information, the transform coefficient levels for the corresponding macroblock are coded as in the FGS design of the scalable H.264/MPEG4-AVC extension as described in the above-mentioned working drafts. However, when the syntax element indicates a modification of motion parameters, the motion parameters are additionally included in the progressive refinement slice syntax and the coding of transform coefficient levels is changed. The decision whether a motion parameter refinement is coded, as well as the corresponding motion parameters, is determined by rate-distortion-optimized mode decision and motion estimation at the encoder side. The decoder complexity is only slightly increased compared to the current FGS concept as specified in the above-mentioned SVC Working Drafts.

Depending on the actual implementation, the inventive encoding scheme can be implemented in hardware or in software. Therefore, the present invention also relates to a computer program, which can be stored on a computer-readable medium such as a CD, a disk or any other data carrier. The present invention is, therefore, also a computer program having a program code which, when executed on a computer, performs the inventive method of encoding or binarizing, or the inventive method of decoding or recovering, described in connection with the above figures.

Furthermore, it is noted that all steps indicated in the flow diagrams could be implemented by respective means in the encoder and that the implementations may comprise subroutines running on a CPU, circuit parts of an ASIC or the like.