VIDEO ENCODING AND DECODING WITH REFERENCE FRAMES USING PREVIOUSLY ENCODED FRAME STATISTICS

Document Type and Number:

WIPO Patent Application WO/2016/083828

Kind Code:

Abstract:

In video coding or decoding, multiple reference frames are permitted for a frame in one or more reference lists. The number of reference frames to evaluate for each list is selected by, in a first phase, collecting statistics over a number of initial frames on the usage of reference frames for each list and, in a second phase determining, based on the statistics, how many reference frames should be evaluated for each frame and for each reference list.

Inventors:

NACCARI MATTEO (GB)
GABRIELLINI ANDREA (GB)
MRAK MARTA (GB)

Application Number:

PCT/GB2015/053626

Publication Date:

June 02, 2016

Filing Date:

November 27, 2015

Assignee:

BRITISH BROADCASTING CORP (GB)
NACCARI MATTEO (GB)
GABRIELLINI ANDREA (GB)
MRAK MARTA (GB)

International Classes:

H04N19/105; H04N19/31; H04N19/109; H04N19/119; H04N19/136; H04N19/137; H04N19/14; H04N19/176; H04N19/179

Domestic Patent References:

WO2014168561A1	2014-10-16

Foreign References:

US20070092147A1	2007-04-26
US8073048B2	2011-12-06
US20140092991A1	2014-04-03
US20140072031A1	2014-03-13

Other References:

MATTEO NACCARI ET AL: "HEVC coding optimisation for Ultra High Definition television services", 2015 PICTURE CODING SYMPOSIUM (PCS), 1 May 2015 (2015-05-01), pages 20 - 24, XP055245806, ISBN: 978-1-4799-7783-3, DOI: 10.1109/PCS.2015.7170039
NACCARI (BBC) M ET AL: "Coding with a unified reference picture list", 7. JCT-VC MEETING; 98. MPEG MEETING; 21-11-2011 - 30-11-2011; GENEVA; (JOINT COLLABORATIVE TEAM ON VIDEO CODING OF ISO/IEC JTC1/SC29/WG11 AND ITU-T SG.16 ); URL: HTTP://WFTP3.ITU.INT/AV-ARCH/JCTVC-SITE/,, no. JCTVC-G635, 9 November 2011 (2011-11-09), XP030110619
R. SJOBERG ET AL: "Overview of HEVC high-level syntax and reference picture management", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 1 January 2012 (2012-01-01), pages 1 - 1, XP055045360, ISSN: 1051-8215, DOI: 10.1109/TCSVT.2012.2223052

Attorney, Agent or Firm:

GARRATT, Peter et al. (The Shard, 32 London Bridge Street, London SE1 9SG, GB)

Claims:

1. A method of video coding or decoding in which coding blocks in a current frame are predicted from previously encoded frames, where the selection of frames from which the prediction is formed is decided upon some preliminary statistics collected over a limited amount of previously encoded frames.

2. A method according to claim 1, wherein reference frames are organised in reference lists with a number of entries greater than or equal to one.

3. A method according to claim 2, wherein only the first entry in each reference list is used for prediction in inter coding if it is determined from said preliminary statistics that this first entry is selected more often than a threshold.

4. A method according to claim 3, wherein said determination is made for each reference list.

5. A method according to claim 3 or claim 4, wherein said statistics are based on frames having the same temporal hierarchical level.

6. A method according to any one of the preceding claims, wherein the selection of whether to use the first entry of a given reference list without evaluation of any other entry in that reference list is based on the number of times the first entry is selected on a previously encoded frame.

7. A method according to claim 6, wherein the first entry in each reference list is selected for prediction of a coding block if this entry was used more than a predetermined percentage value of the total number of like coding blocks in said previously encoded frames.

8. A method according to claim 7, where recognition of a like coding block is based on one or more parameters selected from the group consisting of: temporal hierarchical level; block size; number of skip blocks; representative pixel value; average value of luminance intensity; and position in the SOP.

9. A method according to any one of the preceding claims, wherein said statistics are collected over a fixed number of previously encoded frames and said collection is repeated to refresh the statistics.

10. A method according to claim 9, wherein the collection of statistics is repeated at whichever is the earlier of a defined number of frames or a scene change.

11. A method of video coding or decoding in which multiple reference frames are permitted for a frame or part thereof, in one or more reference lists, wherein the number of reference frames to evaluate for each list is selected by, in a first phase, collecting statistics over a number of initial frames on the usage of reference frames for each list and, in a second phase determining, based on the statistics, how many reference frames should be evaluated for each frame or part thereof and for each reference list.

12. A method according to claim 11, where said determination is based on the number of frames or parts thereof using a reference frame compared to the total number of like frames or parts thereof.

13. A method according to claim 12, where recognition of a like frame or part thereof is based on one or more parameters selected from the group consisting of: frame coding type; block size; number of skip blocks; representative pixel value; average value of luminance intensity; and position in the SOP.

14. Video encoding apparatus configured to implement a method according to any one of the preceding claims.

15. Video decoding apparatus configured to implement a method according to any one of claims 1 to 13.

16. A computer program product containing instructions causing programmable apparatus to implement a method according to any one of claims 1 to 13.

Description:

BACKGROUND OF THE INVENTION

This invention relates to video coding and decoding. In one aspect, the invention seeks to reduce the complexity associated with advanced coding techniques such as the High Efficiency Video Coding (HEVC) standard whilst retaining quality and efficiency.

HEVC is the new video compression standard jointly defined by ITU-T and ISO/IEC. It promises great improvements over its predecessor, AVC, especially for large format content, such as UHD. The greater compression efficiency comes with a much greater complexity. Practical implementations of HEVC must devise strategies to reduce the complexity without compromising the compression efficiency. In an important aspect, this invention addresses one particular area of the coder and decoder, namely inter prediction or inter-frame prediction.

SUMMARY OF THE INVENTION

There is provided a method of video coding or decoding in which coding blocks in a current frame are predicted from previously encoded frames, where the selection of frames from which the prediction is performed is decided based upon preliminary statistics collected over a limited number of previously encoded frames. The reference frames may be accessed in different ways. In one arrangement, reference lists with a number of entries greater than or equal to one may be used.

Suitably, only the first entry in each reference list is used for prediction in inter coding if it is determined from said preliminary statistics that this first entry is selected more often than a threshold. That determination may be made for each reference list. The statistics may be based on frames having the same temporal hierarchical level.

The selection of whether to use the first entry of a given reference list without evaluation of any other entry in that reference list may be based on the number of times the first entry is selected on a previously encoded frame.

The first entry in each reference list may be selected for prediction of a coding block if this entry was used more than a pre-determined percentage value of the total number of like coding blocks in said previously encoded frames, where recognition of a like coding block may be based on one or more parameters selected from the group consisting of: temporal hierarchical level; block size; number of skip blocks; representative pixel value; average value of luminance intensity; and position in the SOP.

Statistics may be collected over a fixed number of previously encoded frames and the collection repeated to refresh the statistics, preferably at whichever is the earlier of a defined number of frames or a scene change.

There is also provided a method of video coding or decoding in which multiple reference frames are permitted for a frame or part thereof, in one or more reference lists, wherein the number of reference frames to evaluate for each list is selected by, in a first phase, collecting statistics over a number of initial frames on the usage of reference frames for each list and, in a second phase determining, based on the statistics, how many reference frames should be evaluated for each frame or part thereof and for each reference list.

The determination may be based on the number of frames or parts thereof using a reference frame compared to the total number of like frames or parts thereof, where recognition of a like frame or part thereof is based for example on one or more parameters selected from the group consisting of: frame coding type; block size; number of skip blocks; representative pixel value; average value of luminance intensity; and position in the SOP.

There is also provided video encoding and decoding apparatus configured to implement such a method, and a computer program product containing instructions causing programmable apparatus to implement such a method.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 illustrates a typical Structure of Pictures (SOP).

Figure 2 illustrates the organisation of references in lists (L0 and L1).

DESCRIPTION OF PREFERRED EMBODIMENTS

In HEVC, distinct reference frames are organised in two lists (L0 and L1) where L0 denotes the reference frames displayed before the current frame while L1 denotes the reference frames displayed after the current frame. Each reference list can contain one or more reference frames and a coding block in the current frame can be predicted from up to two reference frames coming from L0, L1 or both. Other codecs may not rely on two reference lists L0 and L1. However, their inter prediction still relies on previously coded frames, referenced in different ways. Given the different possibilities to select a reference frame for prediction in inter coding, the encoder evaluates all possible combinations and selects the one which optimises a given performance metric (hereafter denoted as coding cost).

Video frames are encoded according to a specific order which is associated with the Structure of Pictures (SOP). In broadcasting applications, a typical SOP provides a hierarchy between coded frames where the prediction for frames in the highest levels of the hierarchy depends on a smaller number of previously coded frames, and this dependency on reference frames is not as strong as the dependency on lower levels of hierarchy. Frames belonging to lower levels of the hierarchy depend on a larger number of previously coded frames, i.e. their prediction can select from more alternative references coming from L0, L1 or both.

The reference implementation of the codec standardised by HEVC performs an exhaustive search over all entries in L0 and L1. However, reference frames in L0 and L1 are generally arranged so that the first entry in each list is the one temporally closest to the current frame. The same arrangement may also be followed by codecs which do not make use of lists of reference frames. In one example, a codec can store the reference frames in a decoded picture buffer in which reference frames are sorted with respect to their temporal closeness to a given picture. Such an arrangement, whether for lists or for a picture buffer, increases the likelihood of finding the best predictor for each current block, so that the available computational effort can be spent more on the first reference in each list and less on the remaining entries. A practical encoder can take advantage of this arrangement and reduce the complexity associated with the prediction in inter coding.

As a general observation, video containing large image areas with a high amount of motion is generally best predicted by references which are temporally close to the current frame. Conversely, video characterised by low motion can use prediction coming from references which are not temporally close to the current frame. It is therefore understood that the selection of which reference frame to use for prediction in inter coding is content dependent.

The selected structure of pictures also influences the selection of the reference frames. In the typical SOP selected for compression in broadcasting applications, frames in high levels of the hierarchy will predict from references which are not temporally close to the current frame. For this reason, only the first entry in each reference list may be considered. Conversely, for frames located in the lowest levels of the hierarchy, the available reference frames are temporally closer and therefore it may be expected that entries in the reference list other than the first one can provide a good prediction for inter coding.

Figure 1 shows a possible arrangement for the pictures to be encoded; such an arrangement is referred to as a structure of pictures. There is also shown the coding order of the frames, which is of course different from the display order. The arrows in the figure show the direction of prediction, the tip of each arrow identifying the frame used as a reference for prediction. It is worth noting the temporal distance of the reference frames for each of the frames in a SOP; it is appropriate to group frames based on the temporal distance of the reference frames since the effectiveness of the prediction can vary greatly based on that distance. In our example we can identify 4 groups: POC8, POC4, POC2+POC6, POC1+POC3+POC5+POC7. A SOP defines a hierarchy with temporal levels among different frame groups. A temporal level in a hierarchy is defined by the number of reference frames in that SOP needed to decode a frame in that level. Given the example in Figure 1, frames belonging to group POC8 are in the highest temporal level of the hierarchy since only POC0 frames are needed for decoding. Frames belonging to POC1+POC3+POC5+POC7 are in the lowest temporal level of the hierarchy since their decoding depends on the availability of POC0, POC8, POC4 and POC2+POC6 frames.
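
The grouping of frames into temporal levels can be illustrated with a short sketch. It assumes a dyadic SOP of 8 frames like the example above; the function names and the convention of numbering the highest level as 0 are illustrative assumptions, not part of the described method.

```python
from typing import Dict, List


def temporal_level(poc: int, sop_size: int = 8) -> int:
    """Return the hierarchy level of a frame in a dyadic SOP.

    Assumption: level 0 is the highest level (POC8 in an 8-frame SOP) and
    the largest level number holds the odd POCs, which depend on the most
    previously coded frames.
    """
    if sop_size < 1 or sop_size & (sop_size - 1):
        raise ValueError("sop_size must be a power of two")
    pos = poc % sop_size
    if pos == 0:
        return 0  # frame on a SOP boundary (POC0, POC8, ...)
    level = sop_size.bit_length() - 1  # log2(sop_size): the lowest level
    while pos % 2 == 0:
        pos //= 2
        level -= 1
    return level


def group_by_level(sop_size: int = 8) -> Dict[int, List[int]]:
    """Group the POCs 1..sop_size of one SOP by temporal level."""
    groups: Dict[int, List[int]] = {}
    for poc in range(1, sop_size + 1):
        groups.setdefault(temporal_level(poc, sop_size), []).append(poc)
    return groups
```

For an 8-frame SOP this reproduces the four groups named above: POC8, POC4, POC2+POC6 and POC1+POC3+POC5+POC7.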

Figure 1 only shows the first reference frame for each frame in the SOP; HEVC allows for multiple reference frames to be used, thus e.g. POC2 could use both POC4 and POC8 as a reference, choosing the most effective one on a block basis. Figure 2 shows how the reference frames available for a frame of group POC4 are organised in reference lists L0 and L1. References in each list are organised so that the first entries are temporally closer to the current frame.
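
As a rough sketch of this ordering, the function below splits the available reference frames into a past list and a future list and sorts each by temporal closeness. It is a simplification made for illustration: the function name is hypothetical, and the actual HEVC list construction has further rules that are not modelled here.

```python
from typing import List, Tuple


def build_reference_lists(
    current_poc: int, available_pocs: List[int]
) -> Tuple[List[int], List[int]]:
    """Return (L0, L1) with the temporally closest reference first.

    Assumption: L0 holds only frames displayed before the current frame
    and L1 only frames displayed after it.
    """
    past = [p for p in available_pocs if p < current_poc]
    future = [p for p in available_pocs if p > current_poc]
    l0 = sorted(past, key=lambda p: current_poc - p)
    l1 = sorted(future, key=lambda p: p - current_poc)
    return l0, l1
```

With POC0, POC4 and POC8 available, a POC2 frame would get L0 = [0] and L1 = [4, 8], so POC4 is the first entry of L1 and POC8 the second.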

In order to decide which reference frame to use for prediction in each block, an encoder should evaluate all possible alternatives in the reference lists, including their combinations obtained as bilinear interpolation of the pixel values coming from the two different references. This evaluation can involve a large amount of computational complexity. Given the temporal consistency associated with camera-captured video content, it is generally expected that frames temporally closer to the current frame will more likely be selected for prediction than frames located far away from the current frame. It is also generally expected that video content such as sport content, where large image areas have a high amount of motion, will have most of the frames predicted from the first entries in the reference list, while sequences such as drama content will have predictions selected from different entries in the reference list. Based on these observations, the selection of which references to consider so that the encoder complexity is reduced is content dependent.
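
To give a feel for the cost of the exhaustive evaluation, the sketch below counts the candidates per block. It assumes that each combination pairs one entry of L0 with one entry of L1; the function name is hypothetical.

```python
def exhaustive_candidates(n_l0: int, n_l1: int) -> int:
    """Count reference candidates an exhaustive search evaluates per block.

    Assumption: every entry can be used alone, and every combination
    pairs one L0 entry with one L1 entry.
    """
    single_reference = n_l0 + n_l1
    combined = n_l0 * n_l1
    return single_reference + combined
```

Under these assumptions, two entries per list give 8 candidates per block, while restricting each list to its first entry gives 3.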

Besides the content dependency described, the selection of the reference also depends on the adopted SOP. For the SOP depicted in Figure 1, the frames belonging to the highest levels of the hierarchy (for example POC8 frames) will require the evaluation of fewer entries in the reference list than the frames belonging to the lowest level of the hierarchy.

The present invention defines a method to select the number of reference frames to evaluate for each inter-coded frame and for each reference list. The method defines two phases. During the first phase the method collects statistics on the usage of reference frames for a few initial frames. For each coding block the method records which reference frame, i.e. which entry in the reference list, was used for each available list. These statistics are grouped separately with respect to the temporal distance between each frame and the frame with POC0 depicted in Figure 1. This grouping is needed given the different statistics associated with different frame types. The statistics are also differentiated with respect to the reference list. As noted above, video coding standards such as H.264/AVC and H.265/HEVC use L0 for reference frames which are preferably temporally located before the current frame and L1 for references preferably temporally located after the current frame. Given the different characteristics of different types of content, the statistics on the usage of the entries for L0 and L1 also differ. During the first phase, the encoder checks all possible reference frames.
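
A minimal sketch of the first-phase bookkeeping is given below. The class and method names are hypothetical, and the choice of a counter keyed by reference list and temporal distance is an assumption made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict


class ReferenceUsageStats:
    """Per-list, per-temporal-distance counts of chosen reference entries."""

    def __init__(self) -> None:
        # (list name, temporal distance) -> Counter of reference entry index
        self.usage = defaultdict(Counter)
        # temporal distance -> number of coding blocks seen
        self.total_blocks = Counter()

    def record_block(self, temporal_distance: int, chosen: Dict[str, int]) -> None:
        """Record one coding block.

        `chosen` maps a list name ("L0" or "L1") to the index of the entry
        selected in that list; a list the block did not use is left out.
        """
        self.total_blocks[temporal_distance] += 1
        for list_name, ref_idx in chosen.items():
            self.usage[(list_name, temporal_distance)][ref_idx] += 1

    def first_entry_count(self, list_name: str, temporal_distance: int) -> int:
        """Number of blocks that selected entry 0 of the given list."""
        return self.usage[(list_name, temporal_distance)][0]
```

An encoder would call `record_block` once per coding block while it is still evaluating every reference, then read the counts back when the first phase ends.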

The number of frames used to collect the statistics during the first phase is a compromise between encoder complexity and accuracy of the collected statistics. Using a high number of frames would lead to more accurate data but the encoder complexity is not decreased. Conversely, using a low number of frames would reduce the encoder complexity but the collected statistics would not be accurate, resulting in an overall loss of compression efficiency. Using the first two SOPs can constitute a good compromise for the aforementioned trade-off.

Additionally, the collected statistics need to be refreshed either because the compressed content changed or because the compression performance penalty may become too large. To refresh the statistics a refresh period can be set based upon some preliminary observations on the application scenario. A change in the content can be detected with suitable methods for shot change detection.
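
The interplay between collection and refresh can be sketched as a small scheduler that restarts collection at whichever comes first, the end of the refresh period or a detected scene change. The class name, the parameter names and the "collect"/"apply" labels are assumptions, and scene change detection itself is left to the caller.

```python
class RefreshPolicy:
    """Decide, frame by frame, whether to collect or apply statistics."""

    def __init__(self, collect_frames: int, refresh_period: int) -> None:
        self.collect_frames = collect_frames  # frames spent in the first phase
        self.refresh_period = refresh_period  # frames between two refreshes
        self.frames_since_refresh = 0

    def next_frame(self, scene_change: bool = False) -> str:
        """Return "collect" or "apply" for the frame about to be encoded."""
        if scene_change or self.frames_since_refresh >= self.refresh_period:
            self.frames_since_refresh = 0  # restart the first phase
        if self.frames_since_refresh < self.collect_frames:
            phase = "collect"
        else:
            phase = "apply"
        self.frames_since_refresh += 1
        return phase
```

With `collect_frames=16` and an 8-frame SOP, the first phase would span the first two SOPs after each refresh, matching the compromise suggested above.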

During the second phase the method determines how many reference frames should be evaluated for each frame and for each reference list. The decision is based on the number of coding blocks using the first available reference frame compared to the total number of coding blocks. When the first reference frame is chosen for more than a pre-determined percentage value of the total number of coding blocks, only the first reference frame is used in subsequent inter frames of the same temporal distance and for the same list. An example of how the use of a single reference frame per list is determined is given here. Let $R_1[L0, d]$ be the number of times the first reference frame in L0 is selected for a frame with temporal distance $d$. In the same way $R_1[L1, d]$ can be defined. Let also $T[d]$ be the total number of coding blocks in a frame with temporal distance $d$ and $p$ the pre-determined percentage value. Considering L0 and a frame with temporal distance $d$ in the adopted SOP, only the first reference frame will be used in the subsequent frames if:

$$R_1[L0, d] \geq p \cdot T[d]$$

In the same way the condition for L1 can be defined.
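
The threshold test above can be written as a short function. This is a sketch only: the function names are hypothetical, `p` is expressed as a fraction between 0 and 1, and the value 0.8 in the usage note is an arbitrary illustration, not a value taken from the description.

```python
def use_only_first_reference(r1: int, total_blocks: int, p: float) -> bool:
    """Return True when the first entry was selected for at least p * T[d] blocks.

    r1           -- times the first entry of the list was selected
    total_blocks -- T[d], total number of coding blocks at temporal distance d
    p            -- pre-determined threshold as a fraction (assumed 0..1)
    """
    if total_blocks <= 0:
        return False  # no statistics yet: keep evaluating every entry
    return r1 >= p * total_blocks


def references_to_evaluate(r1: int, total_blocks: int, p: float, list_size: int) -> int:
    """Number of entries of one reference list to evaluate in the second phase."""
    return 1 if use_only_first_reference(r1, total_blocks, p) else list_size
```

For instance, with 850 of 1000 blocks choosing the first L0 entry and p = 0.8, only that entry would be evaluated in later frames of the same temporal distance; with 700 of 1000, the whole list would still be searched. The same function applies to L1 with its own count.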

The quantity T[d] can also be refined to be not the whole total but the total number of blocks which share similar features such as block size, number of skip blocks, same average value of luminance intensity and same position in the SOP. These features can also depend on the position of the current frame in the SOP. In fact, different levels of the hierarchy may benefit from using different statistics.
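
One way to picture this refinement is to keep separate counts per class of like blocks. The sketch below uses block size as the only distinguishing feature, purely as an assumption for illustration; the function name and the input tuple layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def per_class_decisions(
    blocks: Iterable[Tuple[int, int, int]], p: float
) -> Dict[Tuple[int, int], bool]:
    """Apply the threshold test separately to each class of like blocks.

    Each block is (temporal_distance, block_size, chosen_l0_index).
    A class is the pair (temporal_distance, block_size).
    """
    first_entry = Counter()
    total = Counter()
    for temporal_distance, block_size, ref_idx in blocks:
        key = (temporal_distance, block_size)
        total[key] += 1
        if ref_idx == 0:
            first_entry[key] += 1
    # True means: evaluate only the first entry for this class of blocks.
    return {key: first_entry[key] >= p * total[key] for key in total}
```

Other features named above, such as the number of skip blocks or the average luminance, could be added to the class key in the same way.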

It should be understood that the present invention has been described by way of example and that a wide variety of modifications are possible without departing from the scope of the claims. Thus it will be possible to decide the selection of frames from which a prediction is formed based upon a variety of statistics collected over a limited number of previously encoded frames, in ways other than those specifically mentioned. Whilst the example of HEVC has been taken, the invention may have application to other coding arrangements where multiple reference frames are permitted.
