Title:
LEARNED ADAPTIVE MOTION ESTIMATION FOR NEURAL VIDEO CODING
Document Type and Number:
WIPO Patent Application WO/2022/269441
Kind Code:
A1
Abstract:
Various embodiments provide a method, an apparatus, and a computer program product. An example method includes: receiving a bitstream; adapting a motion estimation operation, or an input or output of the motion estimation operation, based on one or more aspects, or values of the one or more aspects, derived from the bitstream, with respect to rate-distortion performance of the motion estimation operation; decoding an encoded output of the motion estimation operation or an encoded motion prediction error to predict an output of the motion estimation operation based on at least one previously decoded motion estimation output; performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

Inventors:
ZOU NANNAN (FI)
CRICRÌ FRANCESCO (FI)
ZHANG HONGLEI (FI)
REZAZADEGAN TAVAKOLI HAMED (FI)
Application Number:
PCT/IB2022/055670
Publication Date:
December 29, 2022
Filing Date:
June 17, 2022
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H04N19/105; H04N19/172; H04N19/537
Other References:
LU GUO ET AL: "An End-to-End Learning Framework for Video Compression", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE COMPUTER SOCIETY, USA, vol. 43, no. 10, 20 April 2020 (2020-04-20), pages 3292 - 3308, XP011875084, ISSN: 0162-8828, [retrieved on 20210901], DOI: 10.1109/TPAMI.2020.2988453
ZHANG HAOXIAN ET AL: "Multi-Frame Pyramid Refinement Network for Video Frame Interpolation", IEEE ACCESS, vol. 7, 11 September 2019 (2019-09-11), pages 130610 - 130621, XP011747003, DOI: 10.1109/ACCESS.2019.2940510
XI ZHANG ET AL: "Layered Optical Flow Estimation Using a Deep Neural Network with a Soft Mask", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 9 May 2018 (2018-05-09), XP080876506
HU ZHIHAO ET AL: "Improving Deep Video Compression by Resolution-Adaptive Flow Coding", 13 September 2020, ARXIV.ORG, PAGE(S) 193 - 209, XP047569497
Claims:
CLAIMS

What is claimed is:

1. An apparatus comprising: at least one processor; and at least one non-transitory memory including computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to: receive an input data stream, the input data stream comprising a target frame and one or more reference frames; adapt a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation with respect to rate-distortion performance of the motion estimation operation; wherein the motion estimation operation is adapted based on one or more aspects derived from the input data stream, or based on values of one or more aspects derived from the input data stream; and encode at least one of the following into a bitstream: the output of the motion estimation operation, the output of the motion estimation operation being a motion field or optical flow that represents motion of content in the target frame and/or the one or more reference frames; or a motion prediction error based on a prediction of the output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation.

2. The apparatus of claim 1, wherein the input data stream is video.

3. The apparatus of claim 1, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: estimate the motion between the one or more reference frames and the target frame to generate a predicted frame, using the adapted motion estimation operation.

4. The apparatus of claim 1, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: adapt the motion estimation operation or an input of the motion estimation operation based on resolution information of the input data stream.

5. The apparatus of claim 1, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: determine features from resolution information of the input data stream; and adapt the motion estimation operation or an input of the motion estimation operation using the features determined from the resolution information.

6. The apparatus of claim 1, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: adapt the output of the motion estimation operation based on resolution information of the input data stream.

7. The apparatus of claim 1, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: determine features from resolution information of the input data stream; and adapt the output of the motion estimation operation using the features determined from the resolution information.

8. The apparatus of claim 1, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: adapt the motion estimation operation or an input of the motion estimation operation based on an output of an analysis operation performed on the input data stream; or adapt the motion estimation operation or an input of the motion estimation operation based on information derived from the output of the analysis operation performed on the input data stream.

9. The apparatus of claim 8, wherein the analysis operation is object detection.

10. The apparatus of claim 1, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: adapt the output of the motion estimation operation based on an output of an analysis operation performed on the input data stream; or adapt the output of the motion estimation operation based on information derived from the output of the analysis operation performed on the input data stream.

11. The apparatus of claim 10, wherein the analysis operation is object detection.

12. The apparatus of claim 1, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: adapt the motion estimation operation or an input of the motion estimation operation based on temporal distance information of the input data stream.

13. The apparatus of claim 1, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: determine features from temporal distance information of the input data stream; and adapt the motion estimation operation or an input of the motion estimation operation using the features determined from the temporal distance information.

14. The apparatus of claim 1, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: adapt the output of the motion estimation operation based on temporal distance information of the input data stream.

15. The apparatus of claim 1, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: determine features from temporal distance information of the input data stream; and adapt the output of the motion estimation operation using the features determined from the temporal distance information.

16. The apparatus of claim 1, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: determine features from information of the input data stream using a neural network; wherein the information is at least one of: resolution information of the input data stream, information derived from an output of an analysis operation performed on the input data stream, or temporal distance information of the input data stream; and adapt the motion estimation operation or the input of the motion estimation operation or the output of the motion estimation operation using the neural network determined features.

17. The apparatus of claim 1, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: input the input data stream, and an original size of a frame of the input data stream into a learnable resizer; output from the learnable resizer a resized frame of the input data stream; and adapt the motion estimation operation using the resized frame of the input data stream.

18. An apparatus comprising: at least one processor; and at least one non-transitory memory including computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to: receive a bitstream comprising at least one of: an encoded output of a motion estimation operation, the encoded output of the motion estimation operation being a motion field or optical flow that represents motion of content in a target frame and/or one or more reference frames; or an encoded motion prediction error based on a prediction of the encoded output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation; wherein the motion estimation operation or an input of the motion estimation operation or the output of the motion estimation operation has been adapted based on one or more aspects derived from an input data stream, or based on values of one or more aspects derived from the input data stream, with respect to rate-distortion performance of the motion estimation operation; decode the encoded output of the motion estimation operation, or decode the encoded motion prediction error to predict the output of the motion estimation operation based on at least one previously decoded motion estimation output; perform a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and generate a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

19. The apparatus of claim 18, wherein the motion estimation operation or the input of the motion estimation operation or the output of the motion estimation operation has been adapted based on at least one of: resolution information of an input data stream; features determined from the resolution information; an output of an analysis operation performed on the input data stream; information derived from the output of the analysis operation performed on the input data stream; temporal distance information of the input data stream; or features determined from the temporal distance information.

20. An apparatus comprising: at least one processor; and at least one non-transitory memory including computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to: receive a bitstream; adapt a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation, based on one or more aspects derived from the bitstream, or based on values of one or more aspects derived from the bitstream, with respect to rate-distortion performance of the motion estimation operation; decode an encoded output of the motion estimation operation, or decode an encoded motion prediction error to predict an output of the motion estimation operation based on at least one previously decoded motion estimation output; perform a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and generate a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

21. The apparatus of claim 20, wherein the motion estimation operation or the input of the motion estimation operation or the output of the motion estimation operation is adapted based on at least one of: resolution information of the bitstream; features determined from the resolution information; an output of an analysis operation performed on the bitstream; information derived from the output of the analysis operation performed on the bitstream; temporal distance information of the bitstream; or features determined from the temporal distance information.

22. The apparatus of claim 20, wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: perform the motion estimation operation to output a motion field or optical flow.

23. An apparatus comprising: means for receiving an input data stream, the input data stream comprising a target frame and one or more reference frames; means for adapting a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation with respect to rate-distortion performance of the motion estimation operation; wherein the motion estimation operation is adapted based on one or more aspects derived from the input data stream, or based on values of one or more aspects derived from the input data stream; and means for encoding at least one of the following into a bitstream: the output of the motion estimation operation, the output of the motion estimation operation being a motion field or optical flow that represents motion of content in the target frame and/or the one or more reference frames; or a motion prediction error based on a prediction of the output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation.

24. An apparatus comprising: means for receiving a bitstream comprising at least one of: an encoded output of a motion estimation operation, the encoded output of the motion estimation operation being a motion field or optical flow that represents motion of content in a target frame and/or one or more reference frames; or an encoded motion prediction error based on a prediction of the encoded output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation; wherein the motion estimation operation or an input of the motion estimation operation or the output of the motion estimation operation has been adapted based on one or more aspects derived from an input data stream, or based on values of one or more aspects derived from the input data stream, with respect to rate-distortion performance of the motion estimation operation; means for decoding the encoded output of the motion estimation operation, or decoding the encoded motion prediction error to predict the output of the motion estimation operation based on at least one previously decoded motion estimation output; means for performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and means for generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

25. An apparatus comprising: means for receiving a bitstream; means for adapting a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation, based on one or more aspects derived from the bitstream, or based on values of one or more aspects derived from the bitstream, with respect to rate-distortion performance of the motion estimation operation; means for decoding an encoded output of the motion estimation operation, or decoding an encoded motion prediction error to predict an output of the motion estimation operation based on at least one previously decoded motion estimation output; means for performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and means for generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

26. A method comprising: receiving an input data stream, the input data stream comprising a target frame and one or more reference frames; adapting a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation with respect to rate-distortion performance of the motion estimation operation; wherein the motion estimation operation is adapted based on one or more aspects derived from the input data stream, or based on values of one or more aspects derived from the input data stream; and encoding at least one of the following into a bitstream: the output of the motion estimation operation, the output of the motion estimation operation being a motion field or optical flow that represents motion of content in the target frame and/or the one or more reference frames; or a motion prediction error based on a prediction of the output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation.

27. The method of claim 26, wherein the input data stream is video.

28. The method of claim 26 further comprising: estimating the motion between the one or more reference frames and the target frame to generate a predicted frame, using the adapted motion estimation operation.

29. The method of claim 26 further comprising: adapting the motion estimation operation or an input of the motion estimation operation based on resolution information of the input data stream.

30. The method of claim 26 further comprising: determining features from resolution information of the input data stream; and adapting the motion estimation operation or an input of the motion estimation operation using the features determined from the resolution information.

31. The method of claim 26 further comprising: adapting the output of the motion estimation operation based on resolution information of the input data stream.

32. The method of claim 26 further comprising: determining features from resolution information of the input data stream; and adapting the output of the motion estimation operation using the features determined from the resolution information.

33. The method of claim 26 further comprising: adapting the motion estimation operation or an input of the motion estimation operation based on an output of an analysis operation performed on the input data stream; or adapting the motion estimation operation or an input of the motion estimation operation based on information derived from the output of the analysis operation performed on the input data stream.

34. The method of claim 33, wherein the analysis operation is object detection.

35. The method of claim 26 further comprising: adapting the output of the motion estimation operation based on an output of an analysis operation performed on the input data stream; or adapting the output of the motion estimation operation based on information derived from the output of the analysis operation performed on the input data stream.

36. The method of claim 35, wherein the analysis operation is object detection.

37. The method of claim 26 further comprising: adapting the motion estimation operation or an input of the motion estimation operation based on temporal distance information of the input data stream.

38. The method of claim 26 further comprising: determining features from temporal distance information of the input data stream; and adapting the motion estimation operation or an input of the motion estimation operation using the features determined from the temporal distance information.

39. The method of claim 26 further comprising: adapting the output of the motion estimation operation based on temporal distance information of the input data stream.

40. The method of claim 26 further comprising: determining features from temporal distance information of the input data stream; and adapting the output of the motion estimation operation using the features determined from the temporal distance information.

41. The method of claim 26 further comprising: determining features from information of the input data stream using a neural network; wherein the information is at least one of: resolution information of the input data stream, information derived from an output of an analysis operation performed on the input data stream, or temporal distance information of the input data stream; and adapting the motion estimation operation or the input of the motion estimation operation or the output of the motion estimation operation using the neural network determined features.

42. The method of claim 26 further comprising: inputting the input data stream, and an original size of a frame of the input data stream into a learnable resizer; outputting from the learnable resizer a resized frame of the input data stream; and adapting the motion estimation operation using the resized frame of the input data stream.

43. A method comprising: receiving a bitstream comprising at least one of: an encoded output of a motion estimation operation, the encoded output of the motion estimation operation being a motion field or optical flow that represents motion of content in a target frame and/or one or more reference frames; or an encoded motion prediction error based on a prediction of the encoded output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation; wherein the motion estimation operation or an input of the motion estimation operation or the output of the motion estimation operation has been adapted based on one or more aspects derived from an input data stream, or based on values of one or more aspects derived from the input data stream, with respect to rate-distortion performance of the motion estimation operation; decoding the encoded output of the motion estimation operation, or decoding the encoded motion prediction error to predict the output of the motion estimation operation based on at least one previously decoded motion estimation output; performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

44. The method of claim 43, wherein the motion estimation operation or the input of the motion estimation operation or the output of the motion estimation operation has been adapted based on at least one of: resolution information of an input data stream; features determined from the resolution information; an output of an analysis operation performed on the input data stream; information derived from the output of the analysis operation performed on the input data stream; temporal distance information of the input data stream; or features determined from the temporal distance information.

45. A method comprising: receiving a bitstream; adapting a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation, based on one or more aspects derived from the bitstream, or based on values of one or more aspects derived from the bitstream, with respect to rate-distortion performance of the motion estimation operation; decoding an encoded output of the motion estimation operation, or decoding an encoded motion prediction error to predict an output of the motion estimation operation based on at least one previously decoded motion estimation output; performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

46. The method of claim 45, wherein the motion estimation operation or the input of the motion estimation operation or the output of the motion estimation operation is adapted based on at least one of: resolution information of the bitstream; features determined from the resolution information; an output of an analysis operation performed on the bitstream; information derived from the output of the analysis operation performed on the bitstream; temporal distance information of the bitstream; or features determined from the temporal distance information.

47. The method of claim 46, further comprising: performing the motion estimation operation to output a motion field or optical flow.

48. A non-transitory program storage device readable by an apparatus, tangibly embodying a program of instructions executable by the apparatus, wherein the program of instructions cause the apparatus to perform: receiving an input data stream, the input data stream comprising a target frame and one or more reference frames; adapting a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation with respect to rate-distortion performance of the motion estimation operation; wherein the motion estimation operation is adapted based on one or more aspects derived from the input data stream, or based on values of one or more aspects derived from the input data stream; and encoding at least one of the following into a bitstream: the output of the motion estimation operation, the output of the motion estimation operation being a motion field or optical flow that represents motion of content in the target frame and/or the one or more reference frames; or a motion prediction error based on a prediction of the output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation.

49. A non-transitory program storage device readable by an apparatus, tangibly embodying a program of instructions executable by the apparatus, wherein the program of instructions cause the apparatus to perform: receiving a bitstream comprising at least one of: an encoded output of a motion estimation operation, the encoded output of the motion estimation operation being a motion field or optical flow that represents motion of content in a target frame and/or one or more reference frames; or an encoded motion prediction error based on a prediction of the encoded output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation; wherein the motion estimation operation or an input of the motion estimation operation or the output of the motion estimation operation has been adapted based on one or more aspects derived from an input data stream, or based on values of one or more aspects derived from the input data stream, with respect to rate-distortion performance of the motion estimation operation; decoding the encoded output of the motion estimation operation, or decoding the encoded motion prediction error to predict the output of the motion estimation operation based on at least one previously decoded motion estimation output; performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

50. A non-transitory program storage device readable by an apparatus, tangibly embodying a program of instructions executable by the apparatus, wherein the program of instructions cause the apparatus to perform: receiving a bitstream; adapting a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation, based on one or more aspects derived from the bitstream, or based on values of one or more aspects derived from the bitstream, with respect to rate-distortion performance of the motion estimation operation; decoding an encoded output of the motion estimation operation, or decoding an encoded motion prediction error to predict an output of the motion estimation operation based on at least one previously decoded motion estimation output; performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

51. The non-transitory program storage device of any of claims 48 to 50, wherein the program of instructions further cause the apparatus to perform the methods as claimed in any of the claims 27 to 42, or 44, or 46, or 47.

Description:
LEARNED ADAPTIVE MOTION ESTIMATION FOR NEURAL VIDEO CODING

STATEMENT OF GOVERNMENT SUPPORT

[0001] The project leading to this application has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 876019. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Germany, Netherlands, Austria, Romania, France, Sweden, Cyprus, Greece, Lithuania, Portugal, Italy, Finland, Turkey.

TECHNICAL FIELD

[0002] The examples and non-limiting embodiments relate generally to multimedia transport and machine learning and, more particularly, to learned adaptive motion estimation for neural video coding.

BACKGROUND

[0003] It is known to perform data compression and decoding in a multimedia system.

BRIEF DESCRIPTION OF THE DRAWINGS

[0004] The foregoing aspects and other features are explained in the following description, taken in connection with the accompanying drawings, wherein:

[0005] FIG. 1 shows schematically an electronic device employing embodiments of the examples described herein.

[0006] FIG. 2 shows schematically a user equipment suitable for employing embodiments of the examples described herein.

[0007] FIG. 3 further shows schematically electronic devices employing embodiments of the examples described herein connected using wireless and wired network connections.

[0008] FIG. 4 shows schematically a block diagram of an encoder used for data compression on a general level.

[0009] FIG. 5 illustrates examples of how neural networks can function as components of a traditional codec’s pipeline.

[0010] FIG. 6 illustrates re-use of a traditional video coding pipeline, with most or all components replaced with neural networks.

[0011] FIG. 7 shows a neural auto-encoder architecture, where the Analysis Network is the Encoder NN and the Synthesis Network is the Decoder NN.

[0012] FIG. 8 shows a neural network-based end-to-end learned video coding system.

[0013] FIG. 9 is a general illustration of the pipeline of Video Coding for Machines.

[0014] FIG. 10 illustrates an example of a pipeline for an end-to-end learned approach for video coding for machines.

[0015] FIG. 11 illustrates an example of how the end-to-end learned system for video coding for machines may be trained.

[0016] FIG. 12 illustrates an example of how motion estimation and motion compensation may be used.

[0017] FIG. 13 shows an example where the motion estimation is performed at the decoder side.

[0018] FIG. 14 illustrates the Example implementation 1 of Embodiment 1 related to using resolution information as auxiliary input to motion estimation.

[0019] FIG. 15 illustrates the Example implementation 2 of Embodiment 1 related to using resolution information as auxiliary input to motion estimation.

[0020] FIG. 16 illustrates the Example implementation 1 of Embodiment 2 related to resolution-based scaling.

[0021] FIG. 17 illustrates the Example implementation 2 of Embodiment 2 related to resolution-based scaling.

[0022] FIG. 18 shows plots of the validation loss (smaller is better) for the baseline, Embodiment 1, and Embodiment 2.

[0023] FIG. 19 shows an example implementation where features extracted from the analysis results are stacked to one or more inputs of the motion estimation operation.

[0024] FIG. 20 shows an example implementation where analysis results or features extracted from the analysis results may be used to modify the output of the motion estimation operation.

[0025] FIG. 21 illustrates the Example implementation 1 of Embodiment 5 related to using temporal distance information as auxiliary input to motion estimation.

[0026] FIG. 22 illustrates the Example implementation 2 of Embodiment 5 related to using temporal distance information as auxiliary input to motion estimation.

[0027] FIG. 23 illustrates the Example implementation 1 of Embodiment 6 related to temporal distance-based scaling.

[0028] FIG. 24 illustrates the Example implementation 2 of Embodiment 6 related to temporal distance-based scaling.

[0029] FIG. 25 shows an architecture for learning the proper image size in conjunction with the original size and resized size.

[0030] FIG. 26 is an example apparatus configured to implement learned adaptive motion estimation for neural video coding, based on the examples described herein.

[0031] FIG. 27 is an example method to implement learned adaptive motion estimation for neural video coding, based on the examples described herein.

[0032] FIG. 28 is another example method to implement learned adaptive motion estimation for neural video coding, based on the examples described herein.

[0033] FIG. 29 is another example method to implement learned adaptive motion estimation for neural video coding, based on the examples described herein.

DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

[0034] Described herein is a set of methods for improving the performance of motion estimation with respect to a metric related to the task for which motion estimation is used. The proposed set of methods comprises adapting the motion estimation operation or its input or its output according to one or more aspects derived from the input video or according to the values of one or more aspects derived from an input data stream. The examples described herein consider the use case of using motion estimation within a video codec. Other use cases may be considered. For this specific use case, the metric which is improved by the proposed method is rate-distortion performance.
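
By way of illustration only, and not as part of the claimed subject matter, the following minimal PyTorch-style sketch shows one way such an adaptation could be realized: the original resolution of the input video is stacked onto the input of a toy motion estimation network as auxiliary planes. The network architecture, the channel layout, and the normalization constant are assumptions made for this example.

import torch
import torch.nn as nn

class ToyMotionEstimator(nn.Module):
    def __init__(self):
        super().__init__()
        # 3 + 3 colour channels for the target/reference pair plus 2 resolution planes
        self.net = nn.Sequential(
            nn.Conv2d(8, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, kernel_size=3, padding=1),  # 2-channel motion field (dx, dy)
        )

    def forward(self, target, reference, orig_w, orig_h, max_side=4096.0):
        b, _, h, w = target.shape
        # Auxiliary planes that encode the original resolution, broadcast spatially.
        w_plane = torch.full((b, 1, h, w), orig_w / max_side)
        h_plane = torch.full((b, 1, h, w), orig_h / max_side)
        x = torch.cat([target, reference, w_plane, h_plane], dim=1)
        return self.net(x)  # dense motion field / optical flow estimate

flow = ToyMotionEstimator()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64), 1920, 1080)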

[0035] The following describes in detail a suitable apparatus and possible mechanisms for a video/image encoding process according to embodiments. In this regard reference is first made to FIG. 1 and FIG. 2, where FIG. 1 shows an example block diagram of an apparatus 50. The apparatus may be an Internet of Things (IoT) apparatus configured to perform various functions, such as, for example, gathering information by one or more sensors, receiving or transmitting information, analyzing information gathered or received by the apparatus, or the like. The apparatus may comprise a video coding system, which may incorporate a codec. FIG. 2 shows a layout of an apparatus according to an example embodiment. The elements of FIG. 1 and FIG. 2 are explained next.

[0036] The apparatus 50 may, for example, be a mobile terminal or user equipment of a wireless communication system, a sensor device, a tag, or other lower power device. However, it would be appreciated that embodiments of the examples described herein may be implemented within any electronic device or apparatus which may process data by neural networks.

[0037] The apparatus 50 may comprise a housing 30 for incorporating and protecting the device. The apparatus 50 further may comprise a display 32 in the form of a liquid crystal display, a light emitting diode display, an organic light emitting diode display, and the like. In other embodiments of the examples described herein the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the examples described herein any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.

[0038] The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analog signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the examples described herein may be any one of: an earpiece 38, speaker, or an analog audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other embodiments of the examples described herein the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera capable of recording or capturing images and/or video. The apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices. In other embodiments the apparatus 50 may further comprise any suitable short range communication solution such as, for example, a Bluetooth® wireless connection or a USB/firewire wired connection.

[0039] The apparatus 50 may comprise a controller 56, a processor, or processor circuitry for controlling the apparatus 50. The controller 56 may be connected to a memory 58 which in embodiments of the examples described herein may store data in the form of image, video, and audio data and/or may also store instructions for implementation on the controller 56. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and/or decoding of audio, image and/or video data or assisting in coding and/or decoding carried out by the controller.

[0040] The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example, a UICC and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.

[0041] The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals, for example, for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and/or for receiving radio frequency signals from other apparatus(es).

[0042] The apparatus 50 may comprise a camera capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video image data for processing from another device prior to transmission and/or storage. The apparatus 50 may also receive either wirelessly or by a wired connection the image for coding/decoding. The structural elements of apparatus 50 described above represent examples of means for performing a corresponding function.

[0043] With respect to FIG. 3, an example of a system within which embodiments of the examples described herein can be utilized is shown. The system 10 comprises multiple communication devices which can communicate through one or more networks. The system 10 may comprise any combination of wired or wireless networks including, but not limited to a wireless cellular telephone network (such as a GSM, UMTS, CDMA, LTE, 4G, 5G network etc.), a wireless local area network (WLAN) such as defined by any of the IEEE 802.x standards, a Bluetooth® personal area network, an Ethernet local area network, a token ring local area network, a wide area network, and the Internet.

[0044] The system 10 may include both wired and wireless communication devices and/or apparatus 50 suitable for implementing embodiments of the examples described herein.

[0045] For example, the system shown in FIG. 3 shows a mobile telephone network 11 and a representation of the Internet 28. Connectivity to the Internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and similar communication pathways.

[0046] The example communication devices shown in the system 10 may include, but are not limited to, an electronic device or apparatus 50, a combination of a personal digital assistant (PDA) and a mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22. The apparatus 50 may be stationary or mobile when carried by an individual who is moving. The apparatus 50 may also be located in a mode of transport including, but not limited to, a car, a truck, a taxi, a bus, a train, a boat, an airplane, a bicycle, a motorcycle or any similar suitable mode of transport.

[0047] The embodiments may also be implemented in a set-top box; e.g., a digital TV receiver, which may/may not have a display or wireless capabilities, in tablets or (laptop) personal computers (PC), which have hardware and/or software to process neural network data, in various operating systems, and in chipsets, processors, DSPs and/or embedded systems offering hardware/software based coding.

[0048] Some or further apparatus may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24. The base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the Internet 28. The system may include additional communication devices and communication devices of various types.

[0049] The communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), global systems for mobile communications (GSM), universal mobile telecommunications system (UMTS), time divisional multiple access (TDMA), frequency division multiple access (FDMA), transmission control protocol-internet protocol (TCP-IP), short messaging service (SMS), multimedia messaging service (MMS), email, instant messaging service (IMS), Bluetooth®, IEEE 802.11, 3GPP Narrowband IoT and any similar wireless communication technology. A communications device involved in implementing various embodiments of the examples described herein may communicate using various media including, but not limited to, radio, infrared, laser, cable connections, and any suitable connection.

[0050] In telecommunications and data networks, a channel may refer either to a physical channel or to a logical channel. A physical channel may refer to a physical transmission medium such as a wire, whereas a logical channel may refer to a logical connection over a multiplexed medium, capable of conveying several logical channels. A channel may be used for conveying an information signal, for example, a bitstream, from one or several senders (or transmitters) to one or several receivers.

[0051] The embodiments may also be implemented in IoT devices. The IoT may be defined, for example, as an interconnection of uniquely identifiable embedded computing devices within the existing Internet infrastructure. The convergence of various technologies has and may enable many fields of embedded systems, such as wireless sensor networks, control systems, home/building automation, and the like, to be included in the IoT. In order to utilize the Internet, IoT devices are provided with an IP address as a unique identifier. IoT devices may be provided with a radio transmitter, such as a WLAN or Bluetooth® transmitter or a RFID tag. Alternatively, IoT devices may have access to an IP-based network via a wired network, such as an Ethernet-based network or a power-line connection (PLC).

[0052] An MPEG-2 transport stream (TS), specified in ISO/IEC 13818-1 or equivalently in ITU-T Recommendation H.222.0, is a format for carrying audio, video, and other media as well as program metadata or other metadata, in a multiplexed stream. A packet identifier (PID) is used to identify an elementary stream (a.k.a. packetized elementary stream) within the TS. Hence, a logical channel within an MPEG-2 TS may be considered to correspond to a specific PID value.

[0053] Available media file format standards include ISO base media file format (ISO/IEC 14496-12, which may be abbreviated ISOBMFF) and file format for NAL unit structured video (ISO/IEC 14496-15), which derives from the ISOBMFF.
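
By way of illustration of the PID field described in the transport stream paragraph above, the following Python sketch extracts the 13-bit PID from 188-byte TS packets and filters one elementary stream; the helper names are made up for this example.

def packet_pid(packet: bytes) -> int:
    # 188-byte packet, sync byte 0x47; the 13-bit PID spans parts of bytes 1 and 2
    assert len(packet) == 188 and packet[0] == 0x47
    return ((packet[1] & 0x1F) << 8) | packet[2]

def filter_by_pid(ts: bytes, pid: int) -> list:
    # Keep only the packets of one elementary stream, i.e. one logical channel
    return [ts[i:i + 188] for i in range(0, len(ts), 188)
            if packet_pid(ts[i:i + 188]) == pid]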

[0054] A video codec consists of an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. A video encoder and/or a video decoder may also be separate from each other, e.g., need not form a codec. Typically the encoder discards some information in the original video sequence in order to represent the video in a more compact form (e.g., at lower bitrate).

[0055] Typical hybrid video encoders, for example, many encoder implementations of ITU-T H.263 and H.264, encode the video information in two phases. Firstly, pixel values in a certain picture area (or “block”) are predicted, for example, by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner). Secondly, the prediction error, e.g., the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g., Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (picture quality) and the size of the resulting coded video representation (file size or transmission bitrate).
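
The two-phase structure described above can be sketched, purely for illustration, with a toy 8x8 block, a flat predictor, an orthonormal DCT, and an arbitrary quantization step; a coarser step reduces the coded size at the cost of reconstruction accuracy.

import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

C = dct_matrix(8)
original = np.random.randint(0, 256, (8, 8)).astype(float)
prediction = np.full((8, 8), original.mean())            # e.g. a flat spatial predictor
residual = original - prediction                         # phase 1: prediction error
coeffs = C @ residual @ C.T                              # phase 2: transform (DCT)
step = 16.0                                              # coarser step -> fewer bits, lower quality
quantized = np.round(coeffs / step)
reconstructed = prediction + C.T @ (quantized * step) @ C
print("mean abs reconstruction error:", np.abs(original - reconstructed).mean())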

[0056] In temporal prediction, the sources of prediction are previously decoded pictures (a.k.a. reference pictures, or reference frames). In intra block copy (IBC; a.k.a. intra-block-copy prediction and current picture referencing), prediction is applied similarly to temporal prediction but the reference picture is the current picture and only previously decoded samples can be referred to in the prediction process. Inter-layer or inter-view prediction may be applied similarly to temporal prediction, but the reference picture is a decoded picture from another scalable layer or from another view, respectively. In some cases, inter prediction may refer to temporal prediction only, while in other cases inter prediction may refer collectively to temporal prediction and any of intra block copy, inter-layer prediction, and inter-view prediction provided that they are performed with the same or similar process as temporal prediction. Inter prediction or temporal prediction may sometimes be referred to as motion compensation or motion-compensated prediction.

[0057] Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, reduces temporal redundancy. In inter prediction the sources of prediction are previously decoded pictures. Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in the spatial or transform domain, e.g., either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra coding, where no inter prediction is applied.
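
As a minimal illustration of motion-compensated (temporal) prediction as described above, the following Python sketch predicts a block of the current picture from a displaced block of a previously decoded reference picture; the block position and motion vector are arbitrary.

import numpy as np

ref = np.random.randint(0, 256, (64, 64))          # previously decoded reference picture
cur = np.random.randint(0, 256, (64, 64))          # current picture being coded
y, x, dy, dx, B = 24, 24, -2, 3, 8                 # block position, motion vector, block size
pred = ref[y + dy:y + dy + B, x + dx:x + dx + B]   # displaced block from the reference
residual = cur[y:y + B, x:x + B] - pred            # only this prediction error is coded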

[0058] One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently when they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
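
As an illustration of this kind of parameter prediction, the following sketch forms an H.264-style component-wise median predictor from neighbouring motion vectors and codes only the difference; the vector values are arbitrary.

import numpy as np

left, above, above_right = np.array([4, -1]), np.array([5, 0]), np.array([3, -2])
mv_pred = np.median(np.stack([left, above, above_right]), axis=0)   # component-wise median
mv_actual = np.array([5, -1])
mvd = mv_actual - mv_pred          # only this small difference needs to be entropy-coded
print("predictor:", mv_pred, "difference to code:", mvd)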

[0059] FIG. 4 shows a block diagram of a general structure of a video encoder. FIG. 4 presents an encoder for two layers, but it would be appreciated that the presented encoder could be similarly extended to encode more than two layers. FIG. 4 illustrates a video encoder comprising a first encoder section 500 for a base layer and a second encoder section 502 for an enhancement layer. Each of the first encoder section 500 and the second encoder section 502 may comprise similar elements for encoding incoming pictures. The encoder sections 500, 502 may comprise a pixel predictor 302, 402, prediction error encoder 303, 403 and prediction error decoder 304, 404. FIG. 4 also shows an embodiment of the pixel predictor 302, 402 as comprising an inter-predictor 306, 406 (P_inter), an intra-predictor 308, 408 (P_intra), a mode selector 310, 410, a filter 316, 416 (F), and a reference frame memory 318, 418 (RFM). The pixel predictor 302 of the first encoder section 500 receives base layer image(s) 300 (I_0,n) of a video stream to be encoded at both the inter-predictor 306 (which determines the difference between the image and a motion compensated reference frame) and the intra-predictor 308 (which determines a prediction for an image block based only on the already processed parts of the current frame or picture). The outputs of both the inter-predictor and the intra-predictor are passed to the mode selector 310. The intra-predictor 308 may have more than one intra-prediction mode. Hence, each mode may perform the intra-prediction and provide the predicted signal to the mode selector 310. The mode selector 310 also receives a copy of the base layer image 300. Correspondingly, the pixel predictor 402 of the second encoder section 502 receives enhancement layer image(s) 400 (I_1,n) of a video stream to be encoded at both the inter-predictor 406 (which determines the difference between the image and a motion compensated reference frame) and the intra-predictor 408 (which determines a prediction for an image block based only on the already processed parts of the current frame or picture). The outputs of both the inter-predictor and the intra-predictor are passed to the mode selector 410. The intra-predictor 408 may have more than one intra-prediction mode. Hence, each mode may perform the intra-prediction and provide the predicted signal to the mode selector 410. The mode selector 410 also receives a copy of the enhancement layer image 400.

[0060] Depending on which encoding mode is selected to encode the current block, the output of the inter-predictor 306, 406 or the output of one of the optional intra-predictor modes or the output of a surface encoder within the mode selector is passed to the output of the mode selector 310, 410. The output of the mode selector is passed to a first summing device 321, 421. The first summing device may subtract the output of the pixel predictor 302, 402 from the base layer image 300/enhancement layer image 400 to produce a first prediction error signal 320, 420 (D_n) which is input to the prediction error encoder 303, 403.

[0061] The pixel predictor 302, 402 further receives from a preliminary reconstructor 339, 439 the combination of the prediction representation of the image block 312, 412 (P'_n) and the output 338, 438 (D'_n) of the prediction error decoder 304, 404. The preliminary reconstructed image 314, 414 (I'_n) may be passed to the intra-predictor 308, 408 and to the filter 316, 416. The filter 316, 416 receiving the preliminary representation may filter the preliminary representation and output a final reconstructed image 340, 440 (R'_n) which may be saved in a reference frame memory 318, 418. The reference frame memory 318 may be connected to the inter-predictor 306 to be used as the reference image against which a future base layer image 300 is compared in inter-prediction operations. Subject to the base layer being selected and indicated to be the source for inter-layer sample prediction and/or inter-layer motion information prediction of the enhancement layer according to some embodiments, the reference frame memory 318 may also be connected to the inter-predictor 406 to be used as the reference image against which a future enhancement layer image 400 is compared in inter-prediction operations. Moreover, the reference frame memory 418 may be connected to the inter-predictor 406 to be used as the reference image against which a future enhancement layer image 400 is compared in inter-prediction operations.

[0062] Filtering parameters from the filter 316 of the first encoder section 500 may be provided to the second encoder section 502 subject to the base layer being selected and indicated to be the source for predicting the filtering parameters of the enhancement layer according to some embodiments.

[0063] The prediction error encoder 303, 403 comprises a transform unit 342, 442 (T) and a quantizer 344, 444 (Q). The transform unit 342, 442 transforms the first prediction error signal 320, 420 to a transform domain. The transform is, for example, the DCT transform. The quantizer 344, 444 quantizes the transform domain signal, e.g., the DCT coefficients, to form quantized coefficients.

[0064] The prediction error decoder 304, 404 receives the output from the prediction error encoder 303, 403 and performs the opposite processes of the prediction error encoder 303, 403 to produce a decoded prediction error signal 338, 438 which, when combined with the prediction representation of the image block 312, 412 at the second summing device 339, 439, produces the preliminary reconstructed image 314, 414. The prediction error decoder 304, 404 may be considered to comprise a dequantizer 346, 446 (Q⁻¹), which dequantizes the quantized coefficient values, e.g., DCT coefficients, to reconstruct the transform signal and an inverse transformation unit 348, 448 (T⁻¹), which performs the inverse transformation to the reconstructed transform signal wherein the output of the inverse transformation unit 348, 448 contains reconstructed block(s). The prediction error decoder may also comprise a block filter which may filter the reconstructed block(s) according to further decoded information and filter parameters.

[0065] The entropy encoder 330, 430 (E) receives the output of the prediction error encoder 303, 403 and may perform a suitable entropy encoding/variable length encoding on the signal to provide error detection and correction capability. The outputs of the entropy encoders 330, 430 may be inserted into a bitstream, e.g., by a multiplexer 508 (M).

[0066] Fundamentals of neural networks

[0067] A neural network (NN) is a computation graph including several layers of computation. Each layer consists of one or more units, where each unit performs an elementary computation. A unit is connected to one or more other units, and the connection may have a weight associated with it. The weight may be used for scaling the signal passing through the associated connection. Weights are learnable parameters, e.g., values which can be learned from training data. There may be other learnable parameters, such as those of batch-normalization layers.
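
For illustration, a minimal PyTorch sketch of such a computation graph is shown below, where each Linear layer holds learnable weights and biases; the layer sizes are arbitrary.

import torch

# A two-layer feed-forward computation graph; each Linear layer holds learnable
# weights (and biases), i.e. values that can be learned from training data.
net = torch.nn.Sequential(
    torch.nn.Linear(16, 8),   # 16 input units fully connected to 8 hidden units
    torch.nn.ReLU(),          # elementary non-linear computation applied per unit
    torch.nn.Linear(8, 4),
)
num_learnable = sum(p.numel() for p in net.parameters())
out = net(torch.rand(1, 16))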

[0068] Two of the most widely used architectures for neural networks are the feed-forward and recurrent architectures. Feed-forward neural networks are such that there is no feedback loop: each layer takes input from one or more of the preceding layers and provides its output as the input for one or more of the subsequent layers. Also, units inside a certain layer take input from units in one or more of the preceding layers, and provide output to one or more of the following layers.

[0069] Initial layers (those close to the input data) extract semantically low-level features such as edges and textures in images, and intermediate and final layers extract more high-level features. After the feature extraction layers there may be one or more layers performing a certain task, such as classification, semantic segmentation, object detection, denoising, style transfer, super-resolution, etc. In recurrent neural nets, there is a feedback loop, so that the network becomes stateful, e.g., it is able to memorize information or a state.

[0070] Neural networks are being utilized in an ever-increasing number of applications for many different types of device, such as mobile phones. Examples include image and video analysis and processing, social media data analysis, device usage data analysis, etc.

[0071] One of the properties of neural nets (and other machine learning tools) is that they are able to learn properties from input data, either in a supervised way or in an unsupervised way. Such learning is a result of a training algorithm, or of a meta-level neural network providing the training signal.

[0072] In general, the training algorithm consists of changing some properties of the neural network so that its output is as close as possible to a desired output. For example, in the case of classification of objects in images, the output of the neural network can be used to derive a class or category index which indicates the class or category that the object in the input image belongs to. Training usually happens by minimizing or decreasing the output's error, also referred to as the loss. Examples of losses are mean squared error, cross-entropy, etc. In recent deep learning techniques, training is an iterative process, where at each iteration the algorithm modifies the weights of the neural net to make a gradual improvement of the network's output, e.g., to gradually decrease the loss.

[0073] The examples described herein use the terms "model", "neural network", "neural net" and "network" interchangeably, and also the weights of neural networks are sometimes referred to as learnable parameters or simply as parameters.

[0074] Training a neural network is an optimization process, but the final goal is different from the typical goal of optimization. In optimization, the only goal is to minimize a function. In machine learning, the goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset. In other words, the goal is to use a limited training dataset in order to learn to generalize to previously unseen data, e.g., data which was not used for training the model. This is usually referred to as generalization. In practice, data is usually split into at least two sets, the training set and the validation set. The training set is used for training the network, e.g., to modify its learnable parameters in order to minimize the loss. The validation set is used for checking the performance of the network on data which was not used to minimize the loss, as an indication of the final performance of the model. In particular, the errors on the training set and on the validation set are monitored during the training process to understand the following: (1) whether the network is learning at all - in this case, the training set error should decrease, otherwise the model is in the regime of underfitting; and (2) whether the network is learning to generalize - in this case, the validation set error also needs to decrease and to be not too much higher than the training set error. When the training set error is low, but the validation set error is much higher than the training set error, or it does not decrease, or it even increases, the model is in the regime of overfitting. This means that the model has just memorized properties of the training set and performs well only on that set, but performs poorly on a set not used for training or tuning its parameters.

[0075] Lately, neural networks have been used for compressing and de-compressing data such as images, e.g., in an image codec. The most widely used architecture for realizing one component of an image codec is the auto-encoder, which is a neural network including two parts: a neural encoder and a neural decoder (in this description these are referred to simply as encoder and decoder, even though the description herein may refer to algorithms which are learned from data instead of being tuned by hand). The encoder takes as input an image and produces a code which requires fewer bits than the input image. This code may be obtained by applying a binarization or quantization process to the output of the encoder. The decoder takes in this code and reconstructs the image which was input to the encoder.

[0076] Such encoder and decoder are usually trained to minimize a combination of bitrate and distortion, where the distortion may be based on one or more of the following metrics: Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), or similar. These distortion metrics are meant to be correlated to the human visual perception quality, so that minimizing or maximizing one or more of these distortion metrics results in improving the visual quality of the decoded image as perceived by humans.

[0077] Fundamentals of video/image coding

[0078] A video codec consists of an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can decompress the compressed video representation back into a viewable form. Typically, the encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at a lower bitrate).

[0079] Typical hybrid video codecs, for example, ITU-T H.263 and H.264, encode the video information in two phases. Firstly, pixel values in a certain picture area (or "block") are predicted, for example, by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner). Secondly, the prediction error, e.g., the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g., Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (picture quality) and the size of the resulting coded video representation (file size or transmission bitrate).
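
The following sketch (Python with NumPy and SciPy; not part of any particular codec) illustrates the second phase under simplifying assumptions: the prediction error of a block is transformed with a 2-D DCT and the coefficients are quantized with a single illustrative step size qstep; entropy coding is omitted.

import numpy as np
from scipy.fft import dctn, idctn

def code_prediction_error(original_block, predicted_block, qstep=16.0):
    # Phase 2 of a hybrid codec: transform the prediction error with a DCT,
    # then quantize the coefficients.  The quantized coefficients would then
    # be entropy coded into the bitstream.
    residual = original_block.astype(np.float64) - predicted_block
    coeffs = dctn(residual, norm='ortho')
    return np.round(coeffs / qstep)

def reconstruct_block(predicted_block, quantized, qstep=16.0):
    # Decoder side: dequantize, inverse transform, and add back the prediction.
    residual_hat = idctn(quantized * qstep, norm='ortho')
    return predicted_block + residual_hat

A larger qstep coarsens the quantization, which lowers the bitrate at the cost of a larger reconstruction error; this is the picture quality versus file size balance mentioned above.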

[0080] Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, exploits temporal redundancy. In inter prediction the sources of prediction are previously decoded pictures.

[0081] Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in spatial or transform domain, e.g., either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra coding, where no inter prediction is applied.

[0082] One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently when they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.

[0083] The decoder reconstructs the output video by applying prediction means similar to the encoder to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (the inverse operation of the prediction error coding, recovering the quantized prediction error signal in the spatial pixel domain). After applying the prediction and prediction error decoding means, the decoder sums up the prediction and prediction error signals (pixel values) to form the output video frame. The decoder (and encoder) can also apply additional filtering means to improve the quality of the output video before passing it for display and/or storing it as a prediction reference for the forthcoming frames in the video sequence.

[0084] In typical video codecs the motion information is indicated with motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement of the image block in the picture to be coded (on the encoder side) or decoded (on the decoder side) and the prediction source block in one of the previously coded or decoded pictures. In order to represent motion vectors efficiently, those are typically coded differentially with respect to block-specific predicted motion vectors. In typical video codecs the predicted motion vectors are created in a predefined way, for example, by calculating the median of the encoded or decoded motion vectors of the adjacent blocks. Another way to create motion vector predictions is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signal the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of a previously coded/decoded picture can be predicted. The reference index is typically predicted from adjacent blocks and/or co-located blocks in the temporal reference picture. Moreover, typical high efficiency video codecs employ an additional motion information coding/decoding mechanism, often called merging/merge mode, where all the motion field information, which includes the motion vector and corresponding reference picture index for each available reference picture list, is predicted and used without any modification/correction. Similarly, predicting the motion field information is carried out using the motion field information of adjacent blocks and/or co-located blocks in temporal reference pictures, and the used motion field information is signaled among a candidate list filled with the motion field information of available adjacent/co-located blocks.
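
As a minimal illustration of the differential coding described above, the sketch below assumes a component-wise median predictor over the motion vectors of adjacent blocks; the neighbor motion vectors are illustrative values only.

import numpy as np

def predict_motion_vector(neighbor_mvs):
    # Predefined predictor: component-wise median of the motion vectors of
    # the adjacent (already coded/decoded) blocks.
    return np.median(np.array(neighbor_mvs), axis=0)

def encode_mv(mv, neighbor_mvs):
    # Only the difference relative to the motion vector predictor is coded.
    return np.asarray(mv) - predict_motion_vector(neighbor_mvs)

def decode_mv(mv_diff, neighbor_mvs):
    return predict_motion_vector(neighbor_mvs) + np.asarray(mv_diff)

neighbors = [(2, -1), (3, 0), (2, 1)]    # MVs of left, top, top-right blocks (illustrative)
diff = encode_mv((3, -1), neighbors)     # the difference that would be sent in the bitstream
assert np.allclose(decode_mv(diff, neighbors), (3, -1))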

[0085] In typical video codecs the prediction residual after motion compensation is first transformed with a transform kernel (like DCT) and then coded. The reason for this is that often there still exists some correlation among the residual, and the transform can in many cases help reduce this correlation and provide more efficient coding.

[0086] Typical video encoders utilize Lagrangian cost functions to find optimal coding modes, e.g., the desired macroblock mode and associated motion vectors. This kind of cost function uses a weighting factor λ to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area:

C = D + λR

[0087] where C is the Lagrangian cost to be minimized, D is the image distortion (e.g., Mean Squared Error) with the mode and motion vectors considered, λ is the weighting factor, and R is the number of bits needed to represent the required data to reconstruct the image block in the decoder (including the amount of data to represent the candidate motion vectors).
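
The mode decision implied by this cost function can be sketched as follows; the candidate modes and their distortion and rate figures are illustrative assumptions rather than values from any particular codec.

def lagrangian_mode_decision(candidates, lam):
    # Each candidate coding mode is described by its (exact or estimated)
    # distortion D and rate R; the encoder picks the mode minimizing C = D + lam * R.
    best_mode, best_cost = None, float('inf')
    for mode, (distortion, rate_bits) in candidates.items():
        cost = distortion + lam * rate_bits
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost

# Illustrative (MSE, bits) figures per candidate mode.
candidates = {'skip': (42.0, 2), 'inter_16x16': (12.5, 38), 'intra': (9.0, 95)}
print(lagrangian_mode_decision(candidates, lam=0.6))   # picks 'inter_16x16' for this lambda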

[0088] Video coding specifications may enable the use of supplemental enhancement information (SEI) messages or the like. Some video coding specifications include SEI NAL units, and some video coding specifications include both prefix SEI NAL units and suffix SEI NAL units, where the former type can start a picture unit or the like and the latter type can end a picture unit or the like. An SEI NAL unit includes one or more SEI messages, which are not required for the decoding of output pictures but may assist in related processes, such as picture output timing, post-processing of decoded pictures, rendering, error detection, error concealment, and resource reservation. Several SEI messages are specified in the H.264/AVC, H.265/HEVC, H.266/VVC, and H.274/VSEI standards, and the user data SEI messages enable organizations and companies to specify SEI messages for their own use. The standards may include the syntax and semantics for the specified SEI messages, but a process for handling the messages in the recipient might not be defined. Consequently, encoders may be required to follow the standard specifying an SEI message when they create SEI message(s), and decoders might not be required to process SEI messages for output order conformance. One of the reasons to include the syntax and semantics of SEI messages in standards is to allow different system specifications to interpret the supplemental information identically and hence interoperate. It is intended that system specifications can require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally the process for handling particular SEI messages in the recipient can be specified.

[0089] A design principle has been followed for SEI message specifications: the SEI messages are generally not extended in future amendments or versions of the standard.

[0090] Filters in video codecs

[0091] Conventional image and video codecs use a set of filters to enhance the visual quality of the predicted visual content; these filters can be applied either in-loop or out-of-loop, or both. In the case of in-loop filters, the filter applied on one block in the currently-encoded frame affects the encoding of another block in the same frame and/or in another frame which is predicted from the current frame. An in-loop filter can affect the bitrate and/or the visual quality. In fact, an enhanced block causes a smaller residual (difference between the original block and the predicted-and-filtered block), thus requiring fewer bits to be encoded. An out-of-loop filter is applied on a frame after it has been reconstructed; the filtered visual content is not used as a source for prediction, and thus it may only impact the visual quality of the frames that are output by the decoder.

[0092] Neural network based image/video coding

[0093] Recently, neural networks (NNs) have been used in the context of image and video compression, by following mainly two approaches.

[0094] In one approach, NNs are used to replace or are used as an addition to one or more of the components of a traditional codec such as VVC/H.266. Here, ‘traditional’ means those codecs whose components and parameters are typically not learned from data. Examples of components that may be implemented as neural networks are:

- Additional in-loop filter, for example, by having the NN as an additional in-loop filter with respect to the traditional loop filters;

- Single in-loop filter, for example, by having the NN replacing all traditional in-loop filters;

- Intra-frame prediction;

- Inter-frame prediction;

- Transform and/or inverse transform;

- Probability model for the arithmetic codec; and the like.

[0095] FIG. 5 illustrates examples of functioning of NNs as components of a traditional codec's pipeline, in accordance with an embodiment. In particular, FIG. 5 illustrates an encoder, which also includes a decoding loop. FIG. 5 is shown to include components described below:

A luma intra pred block or circuit 501. This block or circuit performs intra prediction in the luma domain, for example, by using already reconstructed data from the same frame. The operation of the luma intra pred block or circuit 501 may be performed by a deep neural network such as a convolutional auto-encoder.

A chroma intra pred block or circuit 522. This block or circuit performs intra prediction in the chroma domain, for example, by using already reconstructed data from the same frame. The chroma intra pred block or circuit 522 may perform cross-component prediction, for example, predicting chroma from luma. The operation of the chroma intra pred block or circuit 522 may be performed by a deep neural network such as a convolutional auto-encoder.

An intra pred block or circuit 503 and an inter-pred block or circuit 504. These blocks or circuits perform intra prediction and inter-prediction, respectively. The intra pred block or circuit 503 and the inter-pred block or circuit 504 may perform the prediction on all components, for example, luma and chroma. The operations of the intra pred block or circuit 503 and inter-pred block or circuit 504 may be performed by two or more deep neural networks such as convolutional auto-encoders.

A probability estimation block or circuit 505 for entropy coding. This block or circuit performs prediction of probability for the next symbol to encode or decode, which is then provided to the entropy coding module 512, such as the arithmetic coding module, to encode or decode the next symbol. The operation of the probability estimation block or circuit 505 may be performed by a neural network.

A transform and quantization (T/Q) block or circuit 506. These are actually two blocks or circuits. The transform and quantization block or circuit 506 may perform a transform of input data to a different domain, for example, the FFT transform would transform the data to the frequency domain. The transform and quantization block or circuit 506 may quantize its input values to a smaller set of possible values. In the decoding loop, there may be an inverse quantization block or circuit and an inverse transform block or circuit (Q⁻¹/T⁻¹) 513. One or both of the transform block or circuit and quantization block or circuit may be replaced by one or two or more neural networks. One or both of the inverse transform block or circuit and inverse quantization block or circuit 513 may be replaced by one or two or more neural networks.

An in-loop filter block or circuit 507. Operations of the in-loop filter block or circuit 507 are performed in the decoding loop, and it performs filtering on the output of the inverse transform block or circuit, or in general on the reconstructed data, in order to enhance the reconstructed data with respect to one or more predetermined quality metrics. This filter may affect both the quality of the decoded data and the bitrate of the bitstream output by the encoder. The operation of the in-loop filter block or circuit 507 may be performed by a neural network, such as a convolutional auto-encoder. In examples, the operation of the in-loop filter may be performed by multiple steps or filters, where the one or more steps may be performed by neural networks.

A post-processing filter block or circuit 528. The post-processing filter block or circuit 528 may be applied only at the decoder side, as it may not affect the encoding process. The post-processing filter block or circuit 528 filters the reconstructed data output by the in-loop filter block or circuit 507, in order to enhance the reconstructed data. The post-processing filter block or circuit 528 may be replaced by a neural network, such as a convolutional auto-encoder.

A resolution adaptation block or circuit 509: this block or circuit may downsample the input video frames, prior to encoding. Then, in the decoding loop, the reconstructed data may be upsampled, by the upsampling block or circuit 510, to the original resolution. The operation of the resolution adaptation block or circuit 509 may be performed by a neural network such as a convolutional auto-encoder.

An encoder control block or circuit 511. This block or circuit performs optimization of encoder's parameters, such as what transform to use, what quantization parameters (QP) to use, what intra-prediction mode (out of N intra-prediction modes) to use, and the like. The operation of the encoder control block or circuit 511 may be performed by a neural network, such as a classifier convolutional network, or such as a regression convolutional network.

An ME/MC block or circuit 514 performs motion estimation and/or motion compensation, which are two key operations to be performed when performing inter-frame prediction. ME/MC stands for motion estimation / motion compensation.

In another approach, commonly referred to as “end-to-end learned compression”, NNs are used as the main components of the image/video codecs. Some examples of the second approach include, but are not limited to following:

[0096] Option 1: re-use the video coding pipeline but replace most or all of the components with NNs. Referring to FIG. 6, it illustrates an example of a modified video coding pipeline based on a neural network, in accordance with an embodiment. An example of a neural network may include, but is not limited to, a compressed representation of a neural network. FIG. 6 is shown to include the following components:

A neural transform block or circuit 602: this block or circuit transforms the output of a summation/subtraction operation 603 to a new representation of that data, which may have lower entropy and thus be more compressible.

A quantization block or circuit 604: this block or circuit quantizes an input data 601 to a smaller set of possible values.

An inverse transform and inverse quantization blocks or circuits 606. These blocks or circuits perform the inverse or approximately inverse operation of the transform and the quantization, respectively.

An encoder parameter control block or circuit 608. This block or circuit may control and optimize some or all the parameters of the encoding process, such as parameters of one or more of the encoding blocks or circuits.

An entropy coding block or circuit 610. This block or circuit may perform lossless coding, for example, based on entropy. One popular entropy coding technique is arithmetic coding.

A neural intra-codec block or circuit 612. This block or circuit may be an image compression and decompression block or circuit, which may be used to encode and decode an intra frame. An encoder 614 may be an encoder block or circuit, such as the neural encoder part of an auto-encoder neural network. A decoder 616 may be a decoder block or circuit, such as the neural decoder part of an auto-encoder neural network. An intra-coding block or circuit 618 may be a block or circuit performing some intermediate steps between encoder and decoder, such as quantization, entropy encoding, entropy decoding, and/or inverse quantization.

A deep loop filter block or circuit 620. This block or circuit performs filtering of reconstructed data, in order to enhance it.

A decode picture buffer block or circuit 622. This block or circuit is a memory buffer, keeping decoded frames, for example, reconstructed frames 624 and enhanced reference frames 626, to be used for inter prediction.

An inter-prediction block or circuit 628. This block or circuit performs inter-frame prediction, for example, predicts from frames, for example, frames 632, which are temporally nearby. An ME/MC 630 performs motion estimation and/or motion compensation, which are two key operations to be performed when performing inter-frame prediction. ME/MC stands for motion estimation / motion compensation.

[0097] Option 2: re-design the whole pipeline, as follows. Option 2 is described in detail in FIG. 7.

- Encoder NN: performs a non-linear transform

- Quantization and lossless encoding of the encoder NN's output.

- Lossless decoding and dequantization.

- Decoder NN: performs a non-linear inverse transform.

[0098] FIG. 7 depicts encoder and decoder NNs being parts of a neural auto-encoder architecture, in accordance with an example. In FIG. 7, the Analysis Network 701 is the Encoder NN, and the Synthesis Network 702 is the Decoder NN; together they may be referred to as spatial correlation tools 703, or as a neural auto-encoder.

[0099] In Option 2, the input data 704 is analyzed by the Encoder NN, Analysis Network 701, which outputs a new representation of that input data. The new representation may be more compressible. This new representation may then be quantized, by a quantizer 705, to a discrete number of values. The quantized data may then be losslessly encoded, for example, by an arithmetic encoder 706, thus obtaining a bitstream 707. The example shown in FIG. 7 includes an arithmetic decoder 708 and an arithmetic encoder 706. The arithmetic encoder 706, or the arithmetic decoder 708, or the combination of the arithmetic encoder 706 and the arithmetic decoder 708 may be referred to as an arithmetic codec in some embodiments. On the decoding side, the bitstream is first losslessly decoded, for example, by using the arithmetic decoder 708. The losslessly decoded data is dequantized and then input to the Decoder NN, Synthesis Network 702. The output is the reconstructed or decoded data 709.

[00100] In case of lossy compression, the lossy steps may comprise the Encoder NN and/or the quantization.

[00101] In order to train the neural networks of this system, a training objective function, referred to as 'training loss', is typically utilized, which usually comprises one or more terms, or loss terms, or simply losses. Although Option 1 and FIG. 6 are considered here as an example for describing the training objective function, a similar training objective function may also be used for training the neural networks for the systems in FIG. 4 and FIG. 5. In one example, the training loss comprises a reconstruction loss term and a rate loss term. The reconstruction loss encourages the system to decode data that is similar to the input data, according to some similarity metric. Some examples of reconstruction losses are:

- a loss derived from mean squared error (MSE);

- a loss derived from multi-scale structural similarity (MS-SSIM), such as 1 minus MS-SSIM, or 1 - MS-SSIM;

- Losses derived from the use of a pretrained neural network. For example, error(f1, f2), where f1 and f2 are the features extracted by a pretrained neural network for the input (uncompressed) data and the decoded (reconstructed) data, respectively, and error() is an error or distance function, such as the L1 norm or the L2 norm; and

- Losses derived from the use of a neural network that is trained simultaneously with the end-to-end learned codec. For example, adversarial loss can be used, which is the loss provided by a discriminator neural network that is trained adversarially with respect to the codec, following the settings proposed in the context of generative adversarial networks (GANs) and their variants.

[00102] The rate loss encourages the system to compress the output of the encoding stage, such as the output of the arithmetic encoder. 'Compressing', for example, means reducing the number of bits output by the encoding stage.

[00103] When an entropy-based lossless encoder is used, such as the arithmetic encoder, the rate loss typically encourages the output of the Encoder NN to have low entropy. The rate loss may be computed on the output of the Encoder NN, or on the output of the quantization operation, or on the output of the probability model. Some examples of rate losses are the following:

- A differentiable estimate of the entropy;

- A sparsification loss, for example, a loss that encourages the output of the Encoder NN or the output of the quantization to have many zeros. Examples are the L0 norm, the L1 norm, and the L1 norm divided by the L2 norm; and

- A cross-entropy loss applied to the output of a probability model, where the probability model may be a NN used to estimate the probability of the next symbol to be encoded by the arithmetic encoder.

[00104] For training one or more neural networks that are part of a codec, such as one or more neural networks in FIG. 5 and/or FIG. 6, one or more of the reconstruction losses may be used, and one or more of the rate losses may be used. The loss terms may then be combined, for example, as a weighted sum to obtain the training objective function. Typically, the different loss terms are weighted using different weights, and these weights determine how the final system performs in terms of rate-distortion loss. For example, when more weight is given to one or more of the reconstruction losses with respect to the rate losses, the system may learn to compress less but to reconstruct with higher accuracy as measured by a metric that correlates with the reconstruction losses. These weights are usually considered to be hyper-parameters of the training session and may be set manually by the operator designing the training session, or automatically, for example, by grid search or by using additional neural networks.
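
A minimal sketch of such a weighted sum follows, assuming one MSE-based reconstruction loss and a scalar rate estimate (for example, a differentiable entropy estimate); the weights w_mse and w_rate are the hyper-parameters discussed above, and their values here are arbitrary. PyTorch is used for illustration.

import torch
import torch.nn.functional as F

def training_loss(x, x_hat, rate_estimate, w_mse=1.0, w_rate=0.01):
    # Weighted sum of a reconstruction loss (MSE between input and decoded data)
    # and a rate loss.  Increasing w_mse relative to w_rate favours reconstruction
    # accuracy over compression.
    reconstruction = F.mse_loss(x_hat, x)
    return w_mse * reconstruction + w_rate * rate_estimate

x = torch.rand(1, 3, 64, 64)         # original data (illustrative)
x_hat = torch.rand(1, 3, 64, 64)     # decoded data (illustrative)
rate = torch.tensor(0.8)             # e.g. an estimated bits-per-pixel value
loss = training_loss(x, x_hat, rate)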

[00105] For the sake of explanation, video is considered as data type in various embodiments. However, it would be understood that the embodiments are also applicable to other media items, for example, images and audio data.

[00106] It is to be understood that even in end-to-end learned approaches, there may be components which are not learned from data, such as the arithmetic codec.

[00107] Further information on neural network-based end-to-end learned video coding

[00108] As shown in FIG. 8, a typical neural network-based end-to-end learned video coding system 800 includes an encoder 804, a quantizer 806, a probability model 808, an entropy codec 810/814 (for example, arithmetic encoder/decoder), a dequantizer 816, and a decoder 818. The encoder 804 and decoder 818 are typically two neural networks, or mainly comprise neural network components. The probability model 808 may also comprise mainly neural network components. The quantizer 806, dequantizer 816 and entropy codec 810/814 are typically not based on neural network components, but they may potentially also comprise neural network components.

[00109] On the encoder side, the encoder component 804 takes a video 802 as input and converts the video from its original signal space into a latent representation that may comprise a more compressible representation of the input. In the case of an input image, the latent representation may be a 3-dimensional tensor, where two dimensions represent the vertical and horizontal spatial dimensions, and the third dimension represents the “channels” which include information at that specific location. When the input image 802 is a 128x128x3 RGB image (with horizontal size of 128 pixels, vertical size of 128 pixels, and 3 channels for the Red, Green, Blue color components), and when the encoder 804 downsamples the input tensor by 2 and expands the channel dimension to 32 channels, then the latent representation is a tensor of dimensions (or “shape”) 64x64x32 (e.g., with horizontal size of 64 elements, vertical size of 64 elements, and 32 channels). Note that the order of the different dimensions may differ depending on the convention which is used; in some cases, for the input image, the channel dimension may be the first dimension, so for the above example, the shape of the input tensor may be represented as 3x128x128, instead of 128x128x3. In the case of an input video (instead of just an input image), another dimension in the input tensor may be used to represent temporal information. The quantizer component 806 quantizes the latent representation into discrete values given a predefined set of quantization levels. The probability model 808 and arithmetic codec component 810/814 work together to perform lossless compression for the quantized latent representation and generate bitstreams 812 to be sent to the decoder side. Given a symbol to be encoded into the bitstream 812, the probability model 808 estimates the probability distribution of all possible values for that symbol based on a context that is constructed from available information at the current encoding/decoding state, such as the data that has already been encoded/decoded. Then, the arithmetic encoder 810 encodes the input symbols to the bitstream 812 using the estimated probability distributions.
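
The shape arithmetic of this example can be reproduced with the short sketch below (PyTorch, channel-first convention); the single convolutional layer merely stands in for the encoder 804, and its kernel size and padding are illustrative assumptions.

import torch
import torch.nn as nn

# A 128x128x3 RGB image in (N, C, H, W) layout, i.e. channel-first convention.
x = torch.rand(1, 3, 128, 128)

# One convolutional stage that downsamples by 2 and expands to 32 channels,
# giving a latent tensor of spatial size 64x64 with 32 channels.
encoder_stage = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=5, stride=2, padding=2)
latent = encoder_stage(x)
print(latent.shape)   # torch.Size([1, 32, 64, 64])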

[00110] On the decoder side, opposite operations are performed. The arithmetic decoder 814 and the probability model 808 first decode symbols from the bitstream 812 to recover the quantized latent representation. Then the dequantizer 816 reconstructs the latent representation in continuous values and passes it to the decoder 818 to recover the input video/image, as output video/image 820. Note that the probability model 808 in this system is shared between the encoding and decoding systems. In practice, this means that a copy of the probability model 808 is used at the encoder side, and another exact copy is used at the decoder side.

[00111] In this system, the encoder 804, probability model 808, and decoder 818 are normally based on deep neural networks. The system is trained in an end-to-end manner by minimizing the following rate-distortion loss function:

L = D + λR,

[00112] where D is the distortion loss term, R is the rate loss term, and λ is the weight that controls the balance between the two losses. The distortion loss term may be the mean square error (MSE), structural similarity (SSIM) or other metrics that evaluate the quality of the reconstructed video. Multiple distortion losses may be used and integrated into D, such as a weighted sum of MSE and SSIM. The rate loss term is normally the estimated entropy of the quantized latent representation, which indicates the number of bits necessary to represent the encoded symbols, for example, bits-per-pixel (bpp).

[00113] For lossless video/image compression, the system includes only the probability model 808 and arithmetic encoder/decoder 810/814. The system loss function includes only the rate loss, since the distortion loss is always zero (e.g., no loss of information).

[00114] Video Coding for Machines (VCM)

[00115] Reducing the distortion in image and video compression is often intended to increase human perceptual quality, as humans are considered to be the end users, e.g., consuming/watching the decoded image. Recently, with the advent of machine learning, especially deep learning, there is a rising number of machines (e.g., autonomous agents) that analyze data independently from humans and that may even take decisions based on the analysis results without human intervention. Examples of such analysis are object detection, scene classification, semantic segmentation, video event detection, anomaly detection, pedestrian tracking, etc. Example use cases and applications are self-driving cars, video surveillance cameras and public safety, smart sensor networks, smart TV and smart advertisement, person re-identification, smart traffic monitoring, drones, etc. This may raise the following question: when decoded data is consumed by machines, shouldn't the systems/models aim at a different quality metric, other than human perceptual quality, when considering media compression in inter-machine communications? Also, dedicated algorithms for compressing and decompressing data for machine consumption are likely to be different than those for compressing and decompressing data for human consumption. The set of tools and concepts for compressing and decompressing data for machine consumption is referred to here as video coding for machines.

[00116] It is likely that the receiver-side device has multiple "machines" or neural networks (NNs). These multiple machines may be used in a certain combination which is, for example, determined by an orchestrator sub-system. The multiple machines may be used, for example, in succession, based on the output of the previously used machine, and/or in parallel. For example, a video which was compressed and then decompressed may be analyzed by one machine (NN) for detecting pedestrians, by another machine (another NN) for detecting cars, and by another machine (another NN) for estimating the depth of all the pixels in the frames.

[00117] Notice that, for the examples described herein, “machine” and “task neural network” are referred to interchangeably, and this means any process or algorithm (learned or not learned from data) which analyzes or processes data for a certain task. In the rest of the description further details are specified concerning other assumptions made regarding the machines considered in the examples described herein.

[00118] Also, the terms “receiver-side” or “decoder-side” refer to the physical or abstract entity or device which includes one or more machines, and runs these one or more machines on some encoded and eventually decoded video representation which is encoded by another physical or abstract entity or device, the “encoder-side device”.

[00119] The encoded video data may be stored into a memory device, for example, as a file. The stored file may later be provided to another device.

[00120] Alternatively, the encoded video data may be streamed from one device to another.

[00121] FIG. 9 is a general illustration of the pipeline 900 of video coding for machines. A VCM encoder 904 encodes the input video 902 into a bitstream 906. A bitrate 910 may be computed (908) from the bitstream 906 in order to evaluate the size of the bitstream 906. A VCM decoder 912 decodes the bitstream 906 output by the VCM encoder 904. The output of the VCM decoder is referred to in FIG. 9 as "Decoded data for machines" 914. This data may be considered as the decoded or reconstructed video. However, in some implementations of this pipeline, this data may not have the same or similar characteristics as the original video which was input to the VCM encoder. For example, this data may not be easily understandable by a human by simply rendering the data onto a screen. The output of the VCM decoder 914 is then input to one or more task neural networks (task-NNs) 916. In FIG. 9, for the sake of illustrating that there may be any number of task-NNs 916, there are three example task-NNs (one for object detection 916-1, one for object segmentation 916-2, and another for object tracking 916-3), and a non-specified one (Task-NN X 916-4). The goal of VCM is to obtain a low bitrate while guaranteeing that the task-NNs still perform well in terms of the evaluation metric associated to each task.

[00122] Shown also within the pipeline 900 of Video Coding for Machines is the evaluation 918 of the various task NNs (including respective evaluations 918-1, 918-2, 918-3, and 918-4), and the performance 920 of the various task NNs (including respective performances 920-1, 920-2, 920-3, and 920-4).

[00123] One of the possible approaches to realize video coding for machines is an end-to-end learned approach. In this approach, the VCM encoder 1004 and VCM decoder 1012 mainly consist of neural networks. FIG. 10 illustrates an example of a pipeline 1001 for the end-to-end learned approach. The video 1002 is input to a neural network encoder 1024. The output of the neural network encoder 1024 is input to a lossless encoder 1026, such as an arithmetic encoder, which outputs a bitstream 1006. The lossless codec may use a probability model 1028 (including 1028 and 1028-2), both in the lossless encoder 1026 and in the lossless decoder 1030, which predicts the probability of the next symbol to be encoded and decoded. The probability model 1028 may also be learned, for example, it may be a neural network. At the decoder-side, the bitstream 1006 is input to a lossless decoder 1030, such as an arithmetic decoder, whose output is input to a neural network decoder 1032. The output of the neural network decoder 1032 is the decoded data for machines 1014, that may be input to one or more task-NNs 1016.

[00124] As further shown in FIG. 10, each task NN 1016 (including task NNs 1016-1, 1016- 2, 1016-3, and 1016-4) has an output 1022, shown respectively as outputs 1022-1, 1022-2, 1022-3, and 1022-4.

[00125] FIG. 11 illustrates an example of how the end-to-end learned system may be trained. For the sake of simplicity, only one task-NN 1116 is illustrated. A rate loss 1136 may be computed (at 1134) from the output of the probability model 1128. The rate loss 1136 provides an approximation of the bitrate required to encode the input video data. A task loss 1142 may be computed (at 1140) from the output 1122 of the task-NN 1116.

[00126] The rate loss 1136 and the task loss 1142 may then be used to train (at 1138 and 1144) the neural networks used in the system, such as the neural network encoder 1124, the probability model 1128, and the neural network decoder 1132. Training may be performed by first computing gradients of each loss with respect to the neural networks that are contributing or affecting the computation of that loss. The gradients are then used by an optimization method, such as Adam, for updating the trainable parameters of the neural networks.
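
A schematic training step consistent with this description is sketched below in PyTorch. The tiny convolutional modules are hypothetical stand-ins for the neural network encoder 1124, probability model 1128, neural network decoder 1132 and task-NN 1116, and the rate and task losses are placeholders; only the overall flow (compute losses, backpropagate gradients, update with Adam) reflects the description above.

import torch
import torch.nn as nn

# Hypothetical stand-in modules; real models would be far larger.
encoder = nn.Conv2d(3, 8, kernel_size=3, padding=1)
decoder = nn.Conv2d(8, 3, kernel_size=3, padding=1)
prob_model = nn.Conv2d(8, 8, kernel_size=1)
task_nn = nn.Conv2d(3, 1, kernel_size=3, padding=1)

params = list(encoder.parameters()) + list(prob_model.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

video_batch = torch.rand(2, 3, 64, 64)      # illustrative input frames
target = torch.rand(2, 1, 64, 64)           # illustrative task ground truth

latent = encoder(video_batch)
rate_loss = prob_model(latent).abs().mean()                     # placeholder for a rate loss
decoded = decoder(latent)
task_loss = nn.functional.mse_loss(task_nn(decoded), target)    # placeholder task loss

loss = rate_loss + task_loss
optimizer.zero_grad()
loss.backward()      # gradients of each loss w.r.t. the networks contributing to it
optimizer.step()     # Adam update of the trainable parameters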

[00127] As further shown in FIG. 11, the video 1102 is input into the neural network encoder 1124, and the output of the training 1138 and the output of the neural network encoder 1124 are input to the neural network decoder 1132, which generates decoded data for machines 1114.

[00128] The machine tasks may be performed at decoder side (instead of at encoder side) for multiple reasons, for example, because the encoder-side device does not have the capabilities (computational, power, memory) for running the neural networks that perform these tasks, or because some aspects or the performance of the task neural networks may have changed or improved by the time that the decoder-side device needs the tasks results (e.g., different or additional semantic classes, better neural network architecture). Also, there could be a customization need, where different clients would run different neural networks for performing these machine learning tasks.

[00129] Terminology: encoder, encoder-side, decoder, decoder-side

[00130] In this description, the difference between decoder and decoder-side is that the decoder comprises operations which are necessary to decode the encoded data, whereas the decoder-side comprises the decoder and any additional operations which are not necessary to decode the encoded data. The additional operations performed at decoder-side may comprise enhancing some aspects of the decoded data. For example, an additional operation may be a post-processing filter which enhances the visual quality (for example, according to an objective quality metric such as Peak Signal-to-Noise Ratio (PSNR)), where this post-processing filter may not be part of the decoder, but may be part of the decoder-side.

[00131] It is to be understood that in the context of data compression (including compression of images and videos), the encoder may include one or more operations performed by the decoder. In some cases, the encoder or encoder-side may perform one or more operations performed by the decoder or decoder-side.

[00132] Motion estimation and motion compensation

[00133] One of the fundamental steps in video compression is motion estimation and motion compensation. However, it is to be understood that motion estimation and motion compensation may be performed for other applications or use-cases than for video compression. Video compression is considered herein as an example application or use-case, but the proposed embodiments are applicable to other applications or use-cases where motion estimation and/or motion compensation is performed.

[00134] Motion estimation comprises estimating a motion field or optical flow, based on a target frame and one or more reference frames. Motion compensation comprises estimating or predicting a target frame, based on a motion field and one or more reference frames, where the motion field may be the output of motion estimation or may be derived from the output of motion estimation.

[00135] FIG. 12 illustrates an example of how motion estimation 1206 and motion compensation 1212 may be used to first obtain a motion field 1208 based on a target frame 1202 and one or more reference frames (1204) and then use the obtained motion field 1208 and one or more reference frames (1210) for obtaining a predicted target frame 1214.

[00136] In the context of video compression, motion estimation may be done as part of the encoding operations at the encoder-side, or as part of the decoding operations at the decoder-side. Motion compensation is done as part of the decoding operations at the decoder-side, or at both the decoder-side and encoder-side, to estimate or predict a target frame based on a motion field (or a signal derived from a motion field) and one or more reference frames. This way, the encoder does not need to send the target frame directly. At the decoder-side, the reference frames may be available in decompressed or reconstructed form, thus they may be affected by compression artifacts. Therefore, it is to be understood that when the term “reference frame” is used in some of the proposed embodiments, its meaning may be a reconstructed reference frame when it is available at the decoder-side. As the encoder-side may also comprise one or more decoder-side operations, in some embodiments, the encoder-side may also use one or more reconstructed reference frames as the reference frames used for performing motion estimation and/or motion compensation.

[00137] In one case, the one or more reference frames 1004 used for motion estimation 1006 may be the same or substantially the same reference frames 1010 used for motion compensation 1012.

[00138] In another case, the one or more reference frames 1004 used for motion estimation 1006 may be a subset or a superset of the one or more reference frames 1010 used for motion compensation 1012.

[00139] In another case, the one or more reference frames 1010 used for motion compensation 1012 may be a reconstructed version of the reference frames 1004 used for motion estimation 1006.

[00140] The one or more reference frames (1004, 1010) may be past and/or future frames with respect to the display order. In this context the display order is the order by which the video frames are in the input video, e.g., the order by which the video frames are displayed or shown to a viewer. In one example, two reference frames may be used for either or both motion estimation 1006 and motion compensation 1012, where one reference frame is a past frame with respect to the target frame (e.g., to be displayed before the target frame) and where the other reference frame is a future frame with respect to the target frame (e.g., to be displayed after the target frame). In another example, one reference frame may be used for either or both motion estimation and motion compensation, where the one reference frame is a past frame with respect to the target frame (e.g., to be displayed before the target frame). In yet another example, two reference frames may be used for either or both motion estimation and motion compensation, where the two reference frames are past frames with respect to the target frame (e.g., to be displayed before the target frame). Any combination of the above examples may be considered.

[00141] In the following, more detailed information is provided about the motion estimation and the motion compensation operations.

[00142] Motion estimation

[00143] Motion estimation 1006 comprises estimating motion occurring between two or more frames, such as between a target frame 1002 and one or more reference frames 1004. The target frame may be an uncompressed target frame, or a reconstructed frame. The one or more reference frames may be uncompressed frames and/or reconstructed frames. The output or result of motion estimation may be a motion field or optical flow 1008, that is, a signal which represents the motion of the content (objects, background, etc.) in the target frame and/or in the reference frames. The output 1008 of motion estimation may represent the motion between the target frame 1002 and each of the reference frames 1004. In case of two reference frames, the output of motion estimation may represent the motion between the target frame and a first reference frame, and the motion between the target frame and a second reference frame. In case of using more than one reference frame for estimating motion, the motion information between the target frame and each reference frame may be represented as a separate signal. For example, when two reference frames are used and when a motion signal comprises two matrices, where one matrix represents the motion in the horizontal direction and another matrix represents the motion in the vertical direction, the output of motion estimation may comprise four matrices, where two matrices represent the motion between the target frame and one reference frame, and the other two matrices represent the motion between the target frame and another reference frame.
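
For concreteness, the sketch below shows one possible, assumed layout of the motion estimation output for the two-reference-frame example above: two matrices (horizontal and vertical motion) per reference frame, four matrices in total. The frame size is illustrative.

import numpy as np

H, W = 64, 64   # spatial size of the frames (illustrative)

# Motion between the target frame and one reference frame: two matrices,
# one for horizontal and one for vertical displacement.
flow_ref1 = np.zeros((2, H, W), dtype=np.float32)

# With two reference frames, the motion estimation output comprises four
# matrices in total; here kept as a separate signal per reference frame.
flow_ref2 = np.zeros((2, H, W), dtype=np.float32)

motion_output = {'ref1': flow_ref1, 'ref2': flow_ref2}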

[00144] In some cases, motion estimation 1006 may be performed at the encoder-side. The input to the motion estimation operation may be a target frame and one or more reference frames, or features extracted therefrom. The encoder may encode the motion field into the bitstream. The motion field may be compressed in a lossy and/or lossless manner, such as via quantization and entropy coding. The decoder would decode the motion field and perform motion compensation by using at least the decoded motion field and one or more reconstructed reference frames. In order to achieve a higher compression rate, the encoder may predict the motion field based on at least one previously reconstructed motion field. Then, only a motion prediction error may need to be encoded into the bitstream. The motion prediction error may be compressed in a lossy and/or lossless manner, such as via quantization and entropy coding. The decoder would decode the motion prediction error, predict the motion field based on at least one previously reconstructed motion field, and perform motion compensation by using at least the predicted motion field and one or more reconstructed reference frames.
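
A minimal sketch of the motion prediction error path described above, under the simplifying assumptions of a copy predictor (the predicted motion field equals a previously reconstructed motion field) and scalar quantization; entropy coding is omitted.

import numpy as np

def encode_motion(current_flow, previous_reconstructed_flow, qstep=0.25):
    # Predict the motion field from a previously reconstructed motion field
    # (here a trivial copy predictor) and encode only the prediction error.
    prediction_error = current_flow - previous_reconstructed_flow
    return np.round(prediction_error / qstep)       # quantized error, to be entropy coded

def decode_motion(quantized_error, previous_reconstructed_flow, qstep=0.25):
    # Decoder: predict the motion field the same way and add the decoded error.
    return previous_reconstructed_flow + quantized_error * qstep

prev = np.zeros((2, 64, 64), dtype=np.float32)
curr = prev + 0.5
q = encode_motion(curr, prev)
rec = decode_motion(q, prev)      # reconstructed motion field used for motion compensation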

[00145] In some other cases, motion estimation may be performed at decoder-side. The input to the motion estimation operation may be features extracted from a target frame, and one or more reference frames or features extracted therefrom. The features extracted from a target frame may be provided by an encoder. In one example, the features extracted from a target frame may be the output of one or more neural network layers at encoder-side. The motion estimation operation may be a neural network which is trained jointly with one or more neural network layers at encoder-side. The target frame and the one or more reference frames may be reconstructed frames.

[00146] FIG. 13 shows an example where the motion estimation (1306-1 and 1306-2) is performed at decoder-side.

[00147] Motion compensation

[00148] In this description, motion compensation 1012 may also be referred to as frame prediction. Motion compensation comprises using one or more motion fields 1008 and one or more reference frames 1010 to obtain a motion-compensated target frame (also referred to as predicted frame 1014). Motion compensation may comprise performing a warping operation on the one or more reference frames according to the one or more motion fields. In the context of video compression, motion compensation is performed as part of the decoding operations at the decoder side or at both the decoder side and encoder side, where the one or more motion fields may be reconstructed or decoded motion fields, and where the one or more reference frames may be reconstructed reference frames. In one example, a reconstructed motion field (or decoded motion field) may be obtained by decoding an encoded motion field. In another example, a reconstructed motion field (or decoded motion field) may be obtained by first predicting a motion field based on at least a previously reconstructed motion field, then decoding or reconstructing a motion prediction error, and combining the reconstructed motion prediction error with the predicted motion field, where the combining may comprise adding the reconstructed motion prediction error to the predicted motion field.
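
One common way to realize the warping operation mentioned above is backward warping with bilinear sampling. The PyTorch sketch below assumes dense motion fields given in pixel units; it is only one possible implementation, not necessarily the one used in the embodiments.

import torch
import torch.nn.functional as F

def warp(reference, flow):
    # Backward warping of a reference frame (N, C, H, W) according to a dense
    # motion field (N, 2, H, W) in pixels; grid_sample expects sampling
    # positions normalized to [-1, 1].
    n, _, h, w = reference.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid_x = (xs[None] + flow[:, 0]) / (w - 1) * 2 - 1
    grid_y = (ys[None] + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1).float()
    return F.grid_sample(reference, grid, mode='bilinear', align_corners=True)

ref = torch.rand(1, 3, 64, 64)
zero_flow = torch.zeros(1, 2, 64, 64)   # zero motion: the warped output equals the reference
pred = warp(ref, zero_flow)             # predicted (motion-compensated) frame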

[00149] The examples described herein target the improvement of motion estimation 1006, by considering the specific use case of video compression, although the proposed embodiments are applicable to other applications of motion estimation 1006.

[00150] Described herein are a set of methods for improving the performance of motion estimation with respect to a metric related to the task for which motion estimation is used. The proposed set of methods comprises adapting the motion estimation operation or its input or its output according to one or more aspects derived from the input video, or according to the values of one or more aspects derived from an input data stream. In case the motion estimation is performed at the decoder-side, the values of the one or more aspects derived from the input video may be made available to the decoder-side by one or more of the following methods:

- Indicating the values of the one or more aspects derived from the input video in or along the bitstream from the encoder to the decoder;

- Delivering the values of the one or more aspects derived from the input video at a session setup, e.g., as a part of media presentation description; or

- Delivering the values of the one or more aspects derived from the input video from the encoder to the decoder prior to or during bitstream delivery.

[00151] Described herein and considered herein is the use case of using motion estimation within a video codec. Other use cases may be considered. For this specific use case, the metric which is improved by the proposed method is rate-distortion performance.

[00152] In one embodiment, the motion estimation operation or its input is adapted according to the resolution of the video.

[00153] In another embodiment, the output of the motion estimation operation is adapted according to the resolution of the video and/or according to information derived from the resolution of the video.

[00154] In another embodiment, the motion estimation operation or its input is adapted according to the output of an analysis operation performed on the video, or according to information derived from the output of an analysis operation performed on the video, where an example analysis operation may be object detection.

[00155] In another embodiment, the output of the motion estimation operation is adapted according to the output of an analysis operation performed on the video, or according to information derived from the output of an analysis operation performed on the video, where an example analysis operation may be object detection.

[00156] In one embodiment, the motion estimation operation or its input is adapted according to the temporal distance information of the video.

[00157] In another embodiment, the output of the motion estimation operation is adapted according to the temporal distance information of the video and/or according to information derived from the temporal distance information of the video.

[00158] As used herein, in one example, to adapt the motion estimation operation refers to changing or modifying the motion estimation operation, which changing or modifying may, for example, be the result of machine learning or other artificial intelligence method. In another example, to adapt the motion estimation operation refers to changing or modifying the output of the motion estimation operation by changing at least part of the input to the motion estimation operation, which changing or modifying may, for example, be the result of machine learning or other artificial intelligence method.

[00159] Further details are provided herein

[00160] In the embodiments, the use case of video compression is considered as an example use case. In particular, the specific case of end-to-end learned video compression is considered, where most of the components of the video codec are neural networks. In addition, the case where the motion estimation operation is performed by a neural network is considered. This motion estimation neural network may be initialized by pretraining it as a separate module with respect to the video codec, or it may be initialized randomly based on one of the available random initialization methods found in the literature. The initialized motion estimation neural network may then be finetuned together with one or more other neural networks that are part of the video codec, or it may be kept unmodified when one or more other neural networks that are part of the video codec are trained.

[00161] However, the scope of the proposed embodiments is not limited to the use case of video compression, or to the specific case of end-to-end learned video compression, or to the case of using a neural network for performing motion estimation. The scope of the proposed embodiments may be valid for other use cases that comprise a motion estimation step, and for other types of motion estimation operations other than neural networks.

[00162] Embodiment 1: resolution information as auxiliary input to motion estimation

[00163] In this embodiment, with reference to FIG. 14 and FIG. 15, resolution information is provided as one of the inputs (refer to inputs 1434, 1436, 1534, and 1535) to the motion estimation operation (1406, 1506), or as one of the inputs to one or more other operations whose outputs are input to the motion estimation operation. For example, the inputs to the motion estimation operation (1406, 1506) may comprise a target frame (1402, 1502), one reference frame (1404, 1504) and the resolution information. In another example, the inputs to the motion estimation operation (1406, 1506) may comprise features extracted from a target frame (1402, 1502) and features extracted from one reference frame (1404, 1504) where the features extracted from a target frame (1402, 1502) may be extracted based on a target frame (1402, 1502) and the resolution information, and/or where the features extracted from the one reference frame (1404, 1504) may be extracted based on the one reference frame (1404, 1504) and the resolution information. The feature extraction operation may be performed by a neural network.

[00164] The resolution information may be represented as two non-zero integer values: a width value (1420, 1520) and a height value (1422, 1522). Other forms or formats for representing the resolution information may be considered and are in the scope of the examples described herein. An alternative representation of the resolution information may be two real numbers that are obtained by normalizing a width value (1420, 1520) and a height value (1422, 1522). Another alternative representation of the resolution information may be a number that is obtained by multiplying a width value (1420, 1520) and a height value (1422, 1522), or by multiplying a normalized version of a width value (1420, 1520) and a normalized version of a height value (1422, 1522).

[00165] Example implementation 1 of Embodiment 1

[00166] The width (1420, 1520) and height values (1422, 1522) may be normalized by using one or more maximum value and one or more minimum value. The normalization may comprise using one or more maximum value and one or more minimum value for restricting the range of the normalized width and height to a specific range, such as between 0 and 1.

[00167] Two matrices may be derived (at 1424, 1524) from the width value (1420, 1520) and height value (1422, 1522). One matrix may be derived by setting all its elements to a value equal to the normalized width value, and another matrix may be derived by setting all its elements to a value equal to the normalized height value. These two matrices may be referred to as width matrix (1426, 1526) and height matrix (1428, 1528), respectively.

[00168] The width matrix (1426, 1526) and the height matrix (1428, 1528) may be stacked or concatenated to one or more matrices or tensors that are input to the motion estimation operation. For example, when the inputs to the motion estimation neural network comprise a tensor of features extracted from the target frame (1402, 1502) and a tensor of features extracted from a reference frame (1404, 1504), the width matrix (1426, 1526) and the height matrix (1428, 1528) may be stacked (at 1430, 1530) to the tensor of features extracted from the target frame (1402, 1502) and (at 1432, 1532) to the tensor of features extracted from a reference frame (1404, 1504). In this embodiment and in other embodiments, to stack refers to joining or concatenating two or more matrices or tensors along a given axis.

[00169] FIG. 14 illustrates the Example implementation 1 of Embodiment 1. As further shown in FIG. 14 and FIG. 15, the motion estimation operation (1406, 1506) generates a motion field (1408, 1508).
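As a non-normative illustration of Example implementation 1, the following Python sketch (using PyTorch-style tensors) normalizes the width and height values, derives the width and height matrices, and stacks them to feature tensors extracted from the target and reference frames. The function name, variable names, and normalization bounds are assumptions made for illustration and are not part of the embodiment.

```python
import torch

def stack_resolution_planes(target_feat, ref_feat, width, height,
                            min_dim=16.0, max_dim=8192.0):
    """Normalize width/height, build constant width/height matrices, and
    stack them to the target-frame and reference-frame feature tensors."""
    # Normalize to [0, 1] using assumed minimum/maximum dimension bounds.
    w_norm = (width - min_dim) / (max_dim - min_dim)
    h_norm = (height - min_dim) / (max_dim - min_dim)

    _, h, w = target_feat.shape            # feature tensors of shape (C, H, W)
    width_matrix = torch.full((1, h, w), w_norm)
    height_matrix = torch.full((1, h, w), h_norm)

    # "Stacking" here means concatenation along the channel axis.
    target_in = torch.cat([target_feat, width_matrix, height_matrix], dim=0)
    ref_in = torch.cat([ref_feat, width_matrix, height_matrix], dim=0)
    return target_in, ref_in               # passed to the motion estimation network
```

For example, stack_resolution_planes(target_feat, ref_feat, 1920, 1080) would append two constant planes to each feature tensor before motion estimation.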

[00170] Example implementation 2 of Embodiment 1

[00171] The width and height values may be normalized by using one or more maximum value and one or more minimum value. The normalization may comprise using one or more maximum value and one or more minimum value for restricting the range of the normalized width and height to a specific range, such as between 0 and 1.

[00172] Two matrices may be derived from the width value and height value. One matrix may be derived by setting all its elements to a value equal to the normalized width value, and another matrix may be derived by setting all its elements to a value equal to the normalized height value. These two matrices may be referred to as width matrix and height matrix, respectively.

[00173] With reference to FIG. 15, the width matrix 1526 and the height matrix 1528 may be first mapped to features by one or more neural network layers (refer to extract features 1538), thus obtaining width features 1540 and height features 1542. The width features and the height features may be represented as two tensors (respectively 1540 and 1542).

[00174] The width feature tensor 1540 and the height feature tensor 1542 may be stacked or concatenated to one or more matrices or tensors that are input to the motion estimation operation. For example, when the inputs to the motion estimation neural network comprise a tensor of features extracted from the target frame and a tensor of features extracted from a reference frame, the width feature tensor 1540 and the height feature tensor 1542 may be stacked (at 1530) to the tensor of features extracted from the target frame 1502 and (at 1532) to the tensor of features extracted from a reference frame 1504.

[00175] FIG. 15 illustrates the Example implementation 2 of Embodiment 1.
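A possible sketch of Example implementation 2 is given below, again using PyTorch-style modules. The class name, channel counts, and layer choices are illustrative assumptions; the point is that the width and height matrices pass through learnable layers before being stacked to the motion estimation inputs.

```python
import torch
import torch.nn as nn

class ResolutionFeatureExtractor(nn.Module):
    """Maps the width and height matrices to width/height feature tensors
    (extract features 1538) with a small stack of convolutional layers."""
    def __init__(self, out_channels=4):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, width_matrix, height_matrix):
        # Inputs: (1, H, W) constant matrices; outputs: (out_channels, H, W) features.
        width_features = self.layers(width_matrix.unsqueeze(0)).squeeze(0)
        height_features = self.layers(height_matrix.unsqueeze(0)).squeeze(0)
        return width_features, height_features

# The width/height feature tensors are then stacked to the frame features,
# e.g. torch.cat([target_feat, width_features, height_features], dim=0),
# before being passed to the motion estimation network.
```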

[00176] Embodiment 2: resolution-based scaling

[00177] In this embodiment, with reference to FIG. 16 and FIG. 17, resolution information is used to adapt the output of the motion estimation operation.

[00178] The resolution information may be represented as two non-zero integer values: a width value (1620, 1720) and a height value (1622, 1722). Other forms or formats for representing the resolution information may be considered and are in the scope of the examples described herein.

[00179] Example implementation 1 of Embodiment 2

[00180] The width and height values may be normalized by using one or more maximum value and one or more minimum value. The normalization may comprise using one or more maximum value and one or more minimum value for restricting the range of the normalized width and height to a specific range, such as between 0 and 1.

[00181] Two matrices may be derived (at 1624, 1724) from the width value and height value. One matrix may be derived by setting all its elements to a value equal to the normalized width value, and another matrix may be derived by setting all its elements to a value equal to the normalized height value. These two matrices may be referred to as width matrix (1626, 1726) and height matrix (1628, 1728), respectively.

[00182] The height matrix and the width matrix may be mapped (using extract features 1638, 1738) to resolution features (1644, 1740, 1742) by one or more neural network layers. The resolution features (1644, 1740, 1742) may be represented by a tensor which has the same shape as the shape of the output (1608, 1708) of the motion estimation block (1606, 1706). For example, the resolution feature tensor may have shape (2, 256, 256), where the value on the first dimension may be the number of feature maps provided by a convolutional neural network and where the value on the second and third dimensions may be the height and width of the feature maps, respectively, and the output (1608, 1708) of the motion estimation block (1606, 1706) may be a tensor of shape (2, 256, 256). The resolution features (1644, 1740, 1742) may be used to modify (at 1646, 1746) the output (1608, 1708) of the motion estimation operation (1606, 1706). For example, the modification may comprise performing scaling (1646, 1746) of the output (1608, 1708) of the motion estimation operation (1606, 1706) by the resolution features (1644, 1740, 1742), where the scaling (1646, 1746) may comprise performing element-wise multiplication between the resolution features (1644, 1740, 1742) and the output (1608, 1708) of the motion estimation operation (1606, 1706).

[00183] FIG. 16 illustrates the Example implementation 1 of Embodiment 2. As further shown in FIG. 16, the target frame 1602 and reference frame 1604 are input into the motion estimation operation 1606, and the scaling 1646 generates scaled motion field 1648.
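The scaling described above may be sketched as follows; this is a hypothetical PyTorch-style module whose layer configuration and names are illustrative. The extract-features layers map the stacked width and height matrices to a tensor with the same shape as the motion field, which then scales the motion field element-wise.

```python
import torch
import torch.nn as nn

class ResolutionScaler(nn.Module):
    """Derives a resolution feature tensor with the same shape as the motion
    field (e.g. (2, 256, 256)) and uses it to scale the motion field."""
    def __init__(self, flow_channels=2):
        super().__init__()
        # Maps the stacked width/height matrices (2 planes) to flow_channels planes.
        self.extract_features = nn.Sequential(
            nn.Conv2d(2, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(8, flow_channels, kernel_size=3, padding=1),
        )

    def forward(self, motion_field, width_matrix, height_matrix):
        # motion_field: (2, H, W); width/height matrices: (1, H, W) each.
        resolution_input = torch.cat([width_matrix, height_matrix], dim=0)
        resolution_features = self.extract_features(
            resolution_input.unsqueeze(0)).squeeze(0)
        # Scaling: element-wise multiplication of the motion field by the features.
        return motion_field * resolution_features
```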

[00184] Example implementation 2 of Embodiment 2

[00185] The width and height values may be normalized by using one or more maximum value and one or more minimum value. The normalization may comprise using one or more maximum value and one or more minimum value for restricting the range of the normalized width and height to a specific range, such as between 0 and 1.

[00186] Two matrices may be derived from the width value and height value. One matrix may be derived by setting all its elements to a value equal to the normalized width value, and another matrix may be derived by setting all its elements to a value equal to the normalized height value. These two matrices may be referred to as width matrix and height matrix, respectively.

[00187] With reference to FIG. 17, the height matrix 1728 and the width matrix 1726 may be mapped to corresponding height features 1742 and width features 1740 by one or more neural network layers (e.g., within extract features 1738). The height features 1742 and the width features 1740 may be represented as two tensors, where the two tensors may have a combined shape (that is, when stacked along one of the dimensions) that is same as the shape of the output 1708 of the motion estimation block 1706. For example, the height feature tensor 1742 and the width feature tensor 1740 may each have shape (1, 256, 256), where the value on the first dimension may be the number of feature maps provided by a convolutional neural network and where the value on the second and third dimensions may be the height and width of the feature maps, respectively, and the output 1708 of the motion estimation block 1706 may be a tensor of shape (2, 256, 256). The height features may be used to modify at least part of the output of the motion estimation operation. The width features may be used to modify at least part of the output of the motion estimation operation. For example, the modification may comprise performing scaling 1746 of the output 1708 of the motion estimation operation 1706 by the height features 1742 and by the width features 1740, where the scaling 1746 may comprise performing element-wise multiplication between the height features 1742 and the output 1708 of the motion estimation operation 1706, or between the width features 1740 and the output 1708 of the motion estimation operation 1706.

[00188] FIG. 17 illustrates the Example implementation 2 of Embodiment 2. As further shown in FIG. 17, the target frame 1702 and reference frame 1704 are input into the motion estimation operation 1706, and the scaling 1746 generates scaled motion field 1748.
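The per-channel variant may be sketched as follows. Pairing the width features with the horizontal flow channel and the height features with the vertical flow channel is an assumption made for illustration and is not mandated by the embodiment; the names and layer choices are likewise illustrative.

```python
import torch
import torch.nn as nn

class AxisWiseResolutionScaler(nn.Module):
    """Maps the width and height matrices to separate (1, H, W) feature tensors
    and scales the two channels of the motion field with them."""
    def __init__(self):
        super().__init__()
        self.extract_features = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, motion_field, width_matrix, height_matrix):
        # motion_field: (2, H, W); channel 0 = horizontal, 1 = vertical (assumed).
        width_features = self.extract_features(width_matrix.unsqueeze(0)).squeeze(0)
        height_features = self.extract_features(height_matrix.unsqueeze(0)).squeeze(0)
        scaled_x = motion_field[0:1] * width_features   # (1, H, W)
        scaled_y = motion_field[1:2] * height_features  # (1, H, W)
        return torch.cat([scaled_x, scaled_y], dim=0)   # scaled motion field (2, H, W)
```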

[00189] Embodiment 1 and Embodiment 2 were tested in the context of the herein described end-to-end learned video codec. The plots in FIG. 18 show the validation loss (smaller is better) for the baseline (1802), Embodiment 1 (1804) and Embodiment 2 (1806). In these plots, the validation loss is a rate-distortion loss equal to distortion + alpha*rate, where distortion is 1 minus MS-SSIM, and where rate is an estimated bitrate of the output of the encoder. MS-SSIM is computed based on the output of the decoder and the original or ground-truth video frames.
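For clarity, the validation loss used in this comparison may be written as the following sketch; ms_ssim_fn and rate_bits_estimate are placeholder names for any MS-SSIM implementation and for the estimated bitrate of the encoder output, respectively.

```python
def rate_distortion_loss(decoded, original, rate_bits_estimate, alpha, ms_ssim_fn):
    """Validation loss: distortion + alpha * rate, where
    distortion = 1 - MS-SSIM(decoded, original).

    ms_ssim_fn is assumed to return an MS-SSIM value in [0, 1] computed on the
    decoder output and the original (ground-truth) frames."""
    distortion = 1.0 - ms_ssim_fn(decoded, original)
    return distortion + alpha * rate_bits_estimate
```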

[00190] Embodiment 3: analysis based auxiliary input

[00191] In this embodiment, with reference to FIG. 19, analysis results (1954, 1956, 1958, 1960) are input to the motion estimation operation 1906. Here, analysis results refer to one or more outputs of one or more analysis algorithms (1950, 1952, etc.) performed on a certain input.

[00192] The analysis algorithms (1950, 1952) may be neural networks. The analysis algorithms may provide analysis results (1954, 1956, 1958, 1960) with respect to analysis tasks, such as object detection 1950, semantic segmentation 1952, image classification, etc. In the case of an object detection task 1950, the output of an object detection algorithm may be a set of bounding boxes (1954, 1956) and associated object class labels. Each bounding box represents the estimated area within the input frame where an object belonging to the class indicated by the class label is located. In the case of a semantic segmentation task 1952, the output (1958, 1960) of a semantic segmentation algorithm may be a tensor of the same spatial resolution as the input frame, where each matrix of the tensor represents the segmentation results for one class of object or area.

[00193] The input to the analysis algorithms (1950, 1952) may be the target frame 1902 and/or one or more reference frames 1904.

[00194] Example implementation

[00195] The analysis results may be mapped (at 1939) to features by one or more neural network layers - these features may be referred to as analysis results features (1962, 1964). This may be done, for example, for mapping analysis results which are not in matrix or tensor form to a feature tensor. For example, the results of semantic segmentation 1952 may be already in matrix or tensor form, whereas the output of object detection 1950 may not be in matrix or tensor form (or vice versa). However, even when the analysis results are already in tensor form, analysis results features may be extracted from the analysis results.

[00196] The analysis results (1954, 1956, 1958, 1960) or features extracted from the analysis results (1962, 1964) may be stacked (1930, 1932) to one or more inputs of the motion estimation operation 1906. For example, when the inputs to the motion estimation neural network are features extracted from the target frame 1902 and features extracted from a reference frame 1904, the analysis results (1954, 1956, 1958, 1960) or features extracted from the analysis results (1962, 1964) may be stacked (at 1930) to the tensors representing the target frame’s features and/or (at 1932) to the tensors representing the reference frame’s features.

[00197] FIG. 19 illustrates this example implementation. As further shown in FIG. 19, the stacking 1930 generates a stacked target frame and its analysis results 1966, and the stacking 1932 generates a stacked reference frame and its analysis results 1968. The motion estimation block 1906 outputs a motion field 1908.
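One hypothetical way to bring object detection results into tensor form before stacking is sketched below; the per-class binary-plane encoding and the names used are illustrative assumptions and not the only option covered by this embodiment.

```python
import torch

def boxes_to_mask_planes(boxes, labels, num_classes, height, width):
    """Map object detection results (bounding boxes plus class labels) to a
    tensor of per-class binary planes, set to 1 inside each detected box, so
    they can be stacked to the motion estimation inputs."""
    planes = torch.zeros((num_classes, height, width))
    for (x0, y0, x1, y1), cls in zip(boxes, labels):
        planes[cls, y0:y1, x0:x1] = 1.0
    return planes

# Example: stack the analysis planes to the target frame's feature tensor
# before motion estimation (reference-frame features handled the same way):
# target_in = torch.cat(
#     [target_feat, boxes_to_mask_planes(boxes, labels, 3, 256, 256)], dim=0)
```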

[00198] Embodiment 4: analysis-based scaling

[00199] In this embodiment, with reference to FIG. 20, analysis results (2054, 2056) are used to modify the output 2008 of the motion estimation operation 2006.

[00200] Here, analysis results refer to one or more outputs (e.g., 2054, 2056) of one or more analysis algorithms 2050 performed on a certain input (2002, 2004).

[00201] The analysis algorithms (e.g., 2050) may be neural networks. The analysis algorithms may provide analysis results with respect to analysis tasks, such as object detection 2050, semantic segmentation, image classification, etc. In the case of an object detection task 2050, the output of an object detection algorithm may be a set of bounding boxes (2054, 2056) and associated object class labels. Each bounding box represents the estimated area within the input frame where an object belonging to the class indicated by the class label is located. In the case of a semantic segmentation task, the output of a semantic segmentation algorithm may be a tensor of the same spatial resolution as the input frame, where each matrix of the tensor represents the segmentation results for one class of object or area.

[00202] The input to the analysis algorithms may be the target frame 2002 and/or one or more reference frames 2004.

[00203] Example implementation

[00204] The analysis results may be mapped (at 2039) to features (2062, 2064) by one or more neural network layers - these features may be referred to as analysis results features. This may be done, for example, for mapping analysis results which are not in matrix or tensor form to a feature tensor. For example, the results of semantic segmentation may be already in matrix or tensor form, whereas the output of object detection may not be in matrix or tensor form. However, even when the analysis results are already in tensor form, analysis results features may be extracted from the analysis results.

[00205] The analysis results (2054, 2056) or features extracted from the analysis results (2062, 2064) may be used to modify the output 2008 of the motion estimation operation 2006. For example, the modification may comprise performing scaling 2046 of the output 2008 of the motion estimation operation 2006 by using the analysis results (e.g., 2054, 2056) or features extracted from the analysis results (2062, 2064), where the scaling 2046 may comprise performing element-wise multiplication between the analysis results (2054, 2056) or features extracted from the analysis results (2062, 2064) and the output 2008 of the motion estimation operation 2006.

[00206] FIG. 20 illustrates this example implementation. As further shown in FIG. 20, scaling 2046 generates the scaled motion field 2048.
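A minimal sketch of analysis-based scaling is given below, assuming the analysis results are already available as a tensor of planes (e.g., segmentation outputs); the module name and layer choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AnalysisScaler(nn.Module):
    """Maps analysis result planes to features with the same shape as the
    motion field and scales the motion field by element-wise multiplication."""
    def __init__(self, in_planes, flow_channels=2):
        super().__init__()
        self.extract_features = nn.Conv2d(in_planes, flow_channels,
                                          kernel_size=3, padding=1)

    def forward(self, motion_field, analysis_planes):
        # motion_field: (2, H, W); analysis_planes: (in_planes, H, W).
        features = self.extract_features(analysis_planes.unsqueeze(0)).squeeze(0)
        return motion_field * features  # scaled motion field
```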

[00207] Embodiment 5: temporal distance information as auxiliary input to motion estimation

[00208] In this embodiment, with reference to FIG. 21 and FIG. 22, temporal distance information (2170, 2172, 2270, 2272) is provided as one of the inputs to the motion estimation operation, or as one of the inputs to one or more other operations whose outputs are input to the motion estimation operation. For example, the inputs to the motion estimation operation may comprise a target frame, one reference frame, and the temporal distance information. In another example, the inputs to the motion estimation operation may comprise features extracted from a target frame and features extracted from one reference frame, where the features extracted from a target frame may be extracted based on a target frame and the temporal distance information, and/or where the features extracted from the one reference frame may be extracted based on the one reference frame and the temporal distance information. The feature extraction operation may be performed by a neural network.

[00209] The temporal distance information may be the distance between the reference frame and the target frame; the distance between the reference frame and the beginning of the video; the distance between the target frame and the beginning of the video; the distance between the reference frame and an anchor frame; the distance between the target frame and an anchor frame; or any combination of the aforementioned distances. The anchor frame used to derive the temporal distance may be a frame where a scene cut occurs or a frame where an action begins. The beginning of an action may be determined by an auxiliary component, often implemented by a neural network (NN), on the decoder side or the encoder side. When the auxiliary component is on the encoder side, the encoder may signal the anchor frame and features related to the detected action to the decoder.

[00210] The temporal distance may be represented by a number of frames or by a measure of time, for example, in seconds or milliseconds.

[00211] The temporal distance information may be represented by one or more floating-point or integer values. Other forms or formats for representing the temporal distance information may be considered and are in the scope of the examples described herein.

[00212] Example implementation 1 of Embodiment 5

[00213] With reference to FIG. 21, the temporal distance values (2170, 2172) may be normalized by using one or more maximum value and one or more minimum value. The normalization may comprise using one or more maximum value and one or more minimum value for restricting the range of the normalized temporal distance value to a specific range, such as between 0 and 1.

[00214] One or more matrices (2174, 2176) may be derived (at 2124) from the temporal distance values (2170, 2172). Each matrix (2174, 2176) may be derived (at 2124) by setting all its elements to a value equal to one of the normalized temporal distance values.

[00215] The temporal distance matrices (2174, 2176) may be stacked or concatenated to one or more matrices or tensors that are input (2178, 2180) to the motion estimation operation 2106. For example, when the inputs to the motion estimation neural network comprise a tensor of features extracted from the target frame and a tensor of features extracted from a reference frame, the temporal distance matrices (2174, 2176) may be stacked (at 2130) to the tensor of features extracted from the target frame 2102 and (at 2132) to the tensor of features extracted from a reference frame 2104.

[00216] FIG. 21 illustrates the Example implementation 1 of Embodiment 5. As further shown in FIG. 21, motion estimation 2106 generates a motion field 2108.
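A sketch of Example implementation 1 of Embodiment 5 follows; the normalization bound max_distance and the function name are assumptions made for illustration.

```python
import torch

def stack_temporal_distance_planes(target_feat, ref_feat, distances,
                                   max_distance=32.0):
    """Expand normalized temporal distance values into constant matrices and
    stack them to the motion estimation inputs."""
    _, h, w = target_feat.shape
    # One matrix per temporal distance value, filled with the normalized value.
    planes = torch.cat(
        [torch.full((1, h, w), min(d / max_distance, 1.0)) for d in distances],
        dim=0)
    target_in = torch.cat([target_feat, planes], dim=0)
    ref_in = torch.cat([ref_feat, planes], dim=0)
    return target_in, ref_in  # passed to the motion estimation network
```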

[00217] Example implementation 2 of Embodiment 5

[00218] With reference to FIG. 22, the temporal distance values (2270, 2272) may be normalized by using one or more maximum value and one or more minimum value. The normalization may comprise using one or more maximum value and one or more minimum value for restricting the range of the normalized temporal distance values to a specific range, such as between 0 and 1.

[00219] One or more matrices (2274, 2276) may be derived (at 2224) from the temporal distance values (2270, 2272). Each matrix may be derived by setting all its elements to a value equal to one of the normalized temporal distance values.

[00220] Each of the temporal distance matrices (2274, 2276) may be first mapped (at 2238) to features (2282, 2284) by one or more neural network layers, thus obtaining temporal distance features (2282, 2284). The temporal distance features (2282, 2284) may be represented as one or more tensors.

[00221] The temporal distance features (2282, 2284) may be stacked or concatenated to one or more tensors that are input (2278, 2280) to the motion estimation operation. For example, when the inputs (2278, 2280) to the motion estimation neural network 2206 comprise a tensor of features extracted from the target frame and a tensor of features extracted from a reference frame, the temporal distance feature tensors (2282, 2284) may be stacked (at 2230) to the tensor of features extracted from the target frame 2202 and (at 2232) to the tensor of features extracted from a reference frame 2204.

[00222] FIG. 22 illustrates the Example implementation 2 of Embodiment 5. As further shown in FIG. 22, motion estimation 2206 generates the motion field 2208.

[00223] Embodiment 6: temporal distance-based scaling

[00224] In this embodiment, with reference to FIG. 23 and FIG. 24, temporal distance information is used to adapt the output of the motion estimation operation.

[00225] The temporal distance information (2370, 2372, 2470, 2472) may be represented by one or more floating-point or integer values. Other forms or formats for representing the temporal distance information may be considered and are in the scope of the examples described herein.

[00226] Example implementation 1 of Embodiment 6

[00227] With reference to FIG. 23, the temporal distance values (2370, 2372) may be normalized by using one or more maximum value and one or more minimum value. The normalization may comprise using one or more maximum value and one or more minimum value for restricting the range of the normalized temporal distance values to a specific range, such as between 0 and 1.

[00228] One or more matrices (2374, 2376) may be derived (at 2324) from the temporal distance values (2370, 2372). Each matrix (2374, 2376) may be derived by setting all its elements to a value equal to one of the normalized temporal distance values.

[00229] The temporal distance matrices (2374, 2376) may be first mapped (at 2338) to a temporal distance feature tensor (2382, 2384) by one or more neural network layers of the extract features operation 2338. The temporal distance feature tensor (2382, 2384) may be represented by a tensor which has the same shape as the shape of the output 2308 of the motion estimation block 2306. For example, the temporal distance feature tensor may have shape (2, 256, 256), where the value on the first dimension may be the number of feature maps provided by a convolutional neural network and where the value on the second and third dimensions may be the height and width of the feature maps, respectively, and the output of the motion estimation block may be a tensor of shape (2, 256, 256). The temporal distance feature tensor (2382, 2384) may be used to modify the output of the motion estimation operation. For example, the modification may comprise performing scaling 2346 of the output 2308 of the motion estimation operation 2306 by the temporal distance feature tensor (2382, 2384), where the scaling 2346 may comprise performing element-wise multiplication between the temporal distance feature tensor (2382, 2384) and the output 2308 of the motion estimation operation 2306.

[00230] FIG. 23 illustrates the Example implementation 1 of Embodiment 6. As further shown in FIG. 23, motion estimation operation 2306 takes as input a target frame 2302 and a reference frame 2304. The output of the scaling operation 2346 is the scaled motion field 2348.
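Example implementation 1 of Embodiment 6 may be sketched as follows, with the layer configuration and names chosen for illustration: the temporal distance matrices are mapped to a feature tensor matching the motion field's shape, which then scales the motion field element-wise.

```python
import torch
import torch.nn as nn

class TemporalDistanceScaler(nn.Module):
    """Maps temporal distance matrices to a feature tensor with the same shape
    as the motion field (e.g. (2, 256, 256)) and scales the motion field by
    element-wise multiplication."""
    def __init__(self, num_distance_planes=2, flow_channels=2):
        super().__init__()
        self.extract_features = nn.Sequential(
            nn.Conv2d(num_distance_planes, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(8, flow_channels, kernel_size=3, padding=1),
        )

    def forward(self, motion_field, distance_matrices):
        # motion_field: (2, H, W); distance_matrices: (num_distance_planes, H, W).
        features = self.extract_features(distance_matrices.unsqueeze(0)).squeeze(0)
        return motion_field * features   # scaled motion field
```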

[00231] Example implementation 2 of Embodiment 6

[00232] With reference to FIG. 24, the temporal distance values may be normalized by using one or more maximum value and one or more minimum value. The normalization may comprise using one or more maximum value and one or more minimum value for restricting the range of the normalized temporal distance values to a specific range, such as between 0 and 1.

[00233] One or more matrices (2474, 2476) may be derived (at 2424) from the temporal distance values (2470, 2472). Each matrix may be derived by setting all its elements to a value equal to one of the normalized temporal distance values.

[00234] Each of the temporal distance matrices (2474, 2476) may be first mapped (at 2438) to temporal distance features (2482, 2484) by one or more neural network layers of extract features operation 2438. The temporal distance features (2482, 2484) may be represented as tensors which have the same shape as the output 2408 of the motion estimation block 2406. For example, the temporal distance features (2482, 2484) may each have shape (1, 256, 256), where the value on the first dimension may be the number of feature maps provided by a convolutional neural network and where the value on the second and third dimensions may be the height and width of the feature maps, respectively, and the output 2408 of the motion estimation block 2406 may be a tensor of shape (2, 256, 256). Each of the temporal distance features (2482, 2484) may be used to modify at least part of the output 2408 of the motion estimation operation 2406. For example, the modification may comprise performing scaling 2446 of the output 2408 of the motion estimation operation 2406 by one or more of the temporal distance features (2482, 2484), where the scaling 2446 may comprise performing element-wise multiplication between the temporal distance features (2482, 2484) and the output 2408 of the motion estimation operation 2406.

[00235] FIG. 24 illustrates the Example implementation 2 of Embodiment 6. As further shown in FIG. 24, motion estimation operation 2406 takes as input a target frame 2402 and a reference frame 2404. The output of the scaling operation 2446 is the scaled motion field 2448.

[00236] Embodiment 7: learning the proper image size in conjunction with original size and resized size

[00237] Learning the most suitable size of the input image or frame for a certain neural network performing a computer vision task has recently gained some interest.

[00238] In this embodiment, with reference to FIG. 25, the original size 2586 is provided as an auxiliary input to a learnable image resizer 2590, by using, for example, some of the aspects in Embodiment 1. The image/video frame 2588 is also input into the learnable image resizer 2590. The output of the learnable resizer, e.g., the resized image size 2592, may then be input to the motion estimation 2506. The original image size 2586 may optionally also be input to the motion estimation 2506. The proposed architecture is shown in FIG. 25.
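A minimal sketch of a learnable image resizer conditioned on the original size is given below; the architecture (a bilinear resize combined with a learned residual branch) and all names are assumptions made for illustration, not the specific resizer of the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableResizer(nn.Module):
    """Resizes a frame to a target size and refines it with a learned residual
    branch that is conditioned on the normalized original size, provided as
    constant auxiliary planes (Embodiment 1 style)."""
    def __init__(self, target_size=(256, 256)):
        super().__init__()
        self.target_size = target_size
        # 3 image channels + 2 planes carrying the normalized original size.
        self.refine = nn.Sequential(
            nn.Conv2d(5, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 3, kernel_size=3, padding=1),
        )

    def forward(self, frame, orig_w_norm, orig_h_norm):
        # frame: (N, 3, H, W); orig_*_norm: normalized original width/height.
        resized = F.interpolate(frame, size=self.target_size,
                                mode="bilinear", align_corners=False)
        n, _, h, w = resized.shape
        size_planes = torch.stack([
            torch.full((h, w), orig_w_norm),
            torch.full((h, w), orig_h_norm),
        ]).unsqueeze(0).expand(n, -1, -1, -1)
        # Residual refinement conditioned on the original-size planes.
        return resized + self.refine(torch.cat([resized, size_planes], dim=1))
```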

[00239] In yet another embodiment, an ensemble of learnable image resizers may be used, where all of the learnable image resizers or a subset of them may be chosen for obtaining the input size to motion estimation.

[00240] FIG. 26 is an example apparatus 2600, which may be implemented in hardware, configured to implement learned adaptive motion estimation for neural video coding, based on the examples described herein. The apparatus 2600 comprises a processor 2602, at least one non-transitory or transitory memory 2604 including computer program code 2605, wherein the at least one memory 2604 and the computer program code 2605 are configured to, with the at least one processor 2602, cause the apparatus to implement learned adaptive motion estimation / motion compensation 2606, based on the examples described herein. The apparatus 2600 optionally includes a display or I/O 2608 that may be used to display content during motion estimation, or receive user input from, for example, a keypad. The apparatus 2600 optionally includes one or more network (N/W) interfaces (I/F(s)) 2610. The N/W I/F(s) 2610 may be wired and/or wireless and communicate over the Internet/other network(s) via any communication technique. The N/W I/F(s) 2610 may comprise one or more transmitters and one or more receivers. The N/W I/F(s) 2610 may comprise standard well-known components such as an amplifier, filter, frequency-converter, (de)modulator, and encoder/decoder circuitry(ies) and one or more antennas. In some examples, the processor 2602 is configured to implement learned adaptive motion estimation / motion compensation 2606 without use of memory 2604.

[00241] The memory 2604 may be implemented using any suitable data storage technology, such as semiconductor based memory devices, flash memory, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The memory 2604 may comprise a database for storing data. Interface 2612 enables data communication between the various items of apparatus 2600, as shown in FIG. 26. Interface 2612 may be one or more buses, or interface 2612 may be one or more software interfaces configured to pass data within computer program code 2605. For example, the interface 2612 may be an object-oriented interface in software, or the interface 2612 may be one or more buses such as address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, and the like. The apparatus 2600 need not comprise each of the features mentioned, or may comprise other features as well. The apparatus 2600 may be an embodiment of any of the apparatuses shown in FIGS. 1 through 17 and FIGS. 19 through 25, including any combination of those. The apparatus 2600 may be an encoder or decoder.

[00242] FIG. 27 is an example method 2700 to implement learned adaptive motion estimation for neural video coding, based on the examples described herein. At 2702, the method includes receiving an input data stream, the input data stream comprising a target frame and one or more reference frames. At 2704, the method includes adapting a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation with respect to rate-distortion performance of the motion estimation operation. At 2706, the method includes wherein the motion operation is adapted based on one or more aspects derived from the input data stream, or based on values of one or more aspects derived from the input data stream. At 2708, the method includes encoding at least one of the following into a bitstream: the output of the motion estimation operation, the output of the motion estimation operation being a motion field or optical flow that represents motion of content in the target frame and/or the one or more reference frames; or a motion prediction error based on a prediction of the output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation. Method 2700 may be implemented with an encoder apparatus.

[00243] FIG. 28 is another example method 2800 to implement learned adaptive motion estimation for neural video coding, based on the examples described herein. At 2802, the method includes receiving a bitstream comprising at least one of: an encoded output of a motion estimation operation, the encoded output of the motion estimation operation being a motion field or optical flow that represents motion of content in a target frame and/or one or more reference frames; or an encoded motion prediction error based on a prediction of the encoded output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation. At 2804, the method includes wherein the motion estimation operation or an input of the motion estimation operation or the output of the motion estimation operation has been adapted based on one or more aspects derived from an input data stream, or based on values of one or more aspects derived from the input data stream, with respect to rate-distortion performance of the motion estimation operation. At 2806, the method includes decoding the encoded output of the motion estimation operation, or decoding the encoded motion prediction error to predict the output of the motion estimation operation based on at least one previously decoded motion estimation output. At 2808, the method includes performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames. At 2810, the method includes generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation. Method 2800 may be implemented with a decoder apparatus.

[00244] FIG. 29 is another example method 2900 to implement learned adaptive motion estimation for neural video coding, based on the examples described herein. At 2902, the method includes receiving a bitstream. At 2904, the method includes adapting a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation, based on one or more aspects derived from the bitstream, or based on values of one or more aspects derived from the bitstream, with respect to rate-distortion performance of the motion estimation operation. At 2906, the method includes decoding an encoded output of the motion estimation operation, or decoding an encoded motion prediction error to predict an output of the motion estimation operation based on at least one previously decoded motion estimation output. At 2908, the method includes performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames. At 2910, the method includes generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation. Method 2900 may be implemented with a decoder apparatus.

[00245] References to a ‘computer’, ‘processor’, etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential/parallel architectures but also specialized circuits such as field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device such as instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device, etc.

[00246] As used herein, the term ‘circuitry’, ‘circuit’ and variants may refer to any of the following: (a) hardware circuit implementations, such as implementations in analog and/or digital circuitry, and (b) combinations of circuits and software (and/or firmware), such as (as applicable): (i) a combination of processor(s) or (ii) portions of processor(s)/software including digital signal processor(s), software, and memory(ies) that work together to cause an apparatus to perform various functions, and (c) circuits, such as a microprocessor(s) or a portion of a microprocessor (s), that require software or firmware for operation, even when the software or firmware is not physically present. As a further example, as used herein, the term ‘circuitry’ would also cover an implementation of merely a processor (or multiple processors) or a portion of a processor and its (or their) accompanying software and/or firmware. The term ‘circuitry’ would also cover, for example, and when applicable to the particular element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or another network device. Circuitry or circuit may also be used to mean a function or a process used to execute a method.

[00247] An example apparatus includes at least one processor; and at least one non-transitory memory including computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to: receive an input data stream, the input data stream comprising a target frame and one or more reference frames; adapt a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation with respect to rate-distortion performance of the motion estimation operation; wherein the motion operation is adapted based on one or more aspects derived from the input data stream, or based on values of one or more aspects derived from the input data stream; and encode at least one of the following into a bitstream: the output of the motion estimation operation, the output of the motion estimation operation being a motion field or optical flow that represents motion of content in the target frame and/or the one or more reference frames; or a motion prediction error based on a prediction of the output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation.

[00248] The apparatus may further include wherein the input data stream is video.

[00249] The apparatus may further include wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: estimate the motion between the one or more reference frames and the target frame to generate a predicted frame, using the adapted motion estimation operation.

[00250] The apparatus may further include wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: adapt the motion estimation operation or an input of the motion estimation operation based on resolution information of the input data stream.

[00251] The apparatus may further include wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: determine features from resolution information of the input data stream; and adapt the motion estimation operation or an input of the motion estimation operation using the features determined from the resolution information.

[00252] The apparatus may further include wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: adapt the output of the motion estimation operation based on resolution information of the input data stream.

[00253] The apparatus may further include wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: determine features from resolution information of the input data stream; and adapt the output of the motion estimation operation using the features determined from the resolution information.

[00254] The apparatus may further include wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: adapt the motion estimation operation or an input of the motion estimation operation based on an output of an analysis operation performed on the input data stream; or adapt the motion estimation operation or an input of the motion estimation operation based on information derived from the output of the analysis operation performed on the input data stream.

[00255] The apparatus may further include wherein the analysis operation is object detection.

[00256] The apparatus may further include wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: adapt the output of the motion estimation operation based on an output of an analysis operation performed on the input data stream; or adapt the output of the motion estimation operation based on information derived from the output of the analysis operation performed on the input data stream.

[00257] The apparatus may further include wherein the analysis operation is object detection.

[00258] The apparatus may further include wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: adapt the motion estimation operation or an input of the motion estimation operation based on temporal distance information of the input data stream.

[00259] The apparatus may further include wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: determine features from temporal distance information of the input data stream; and adapt the motion estimation operation or an input of the motion estimation operation using the features determined from the temporal distance information.

[00260] The apparatus may further include wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: adapt the output of the motion estimation operation based on temporal distance information of the input data stream.

[00261] The apparatus may further include wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: determine features from temporal distance information of the input data stream; and adapt the output of the motion estimation operation using the features determined from the temporal distance information.

[00262] The apparatus may further include wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: determine features from information of the input data stream using a neural network; wherein the information is at least one of: resolution information of the input data stream, information derived from an output of an analysis operation performed on the input data stream, or temporal distance information of the input data stream; and adapt the motion estimation operation or the input of the motion estimation operation or the output of the motion estimation operation using the neural network determined features.

[00263] The apparatus may further include wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: input the input data stream, and an original size of a frame of the input data stream into a learnable resizer; output from the learnable resizer a resized frame of the input data stream; adapt the motion estimation operation using the resized frame of the input data stream.

[00264] An example apparatus includes at least one processor; and at least one non-transitory memory including computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to: receive a bitstream comprising at least one of: an encoded output of a motion estimation operation, the encoded output of the motion estimation operation being a motion field or optical flow that represents motion of content in a target frame and/or one or more reference frames; or an encoded motion prediction error based on a prediction of the encoded output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation; wherein the motion estimation operation or an input of the motion estimation operation or the output of the motion estimation operation has been adapted based on one or more aspects derived from an input data stream, or based on values of one or more aspects derived from the input data stream, with respect to rate-distortion performance of the motion estimation operation; decode the encoded output of the motion estimation operation, or decode the encoded motion prediction error to predict the output of the motion estimation operation based on at least one previously decoded motion estimation output; perform a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and generate a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

[00265] The apparatus may further include wherein the motion estimation operation or the input of the motion estimation operation or the output of the motion estimation operation has been adapted based on at least one of: resolution information of an input data stream; features determined from the resolution information; an output of an analysis operation performed on the input data stream; information derived from the output of the analysis operation performed on the input data stream; temporal distance information of the input data stream; or features determined from the temporal distance information.

[00266] An example apparatus includes at least one processor; and at least one non-transitory memory including computer program code; wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to: receive a bitstream; adapt a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation, based on one or more aspects derived from the bitstream, or based on values of one or more aspects derived from the bitstream, with respect to rate-distortion performance of the motion estimation operation; decode an encoded output of the motion estimation operation, or decode an encoded motion prediction error to predict an output of the motion estimation operation based on at least one previously decoded motion estimation output; perform a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and generate a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

[00267] The apparatus may further include wherein the motion estimation operation or the input of the motion estimation operation or the output of the motion estimation operation is adapted based on at least one of: resolution information of the bitstream; features determined from the resolution information; an output of an analysis operation performed on the bitstream; information derived from the output of the analysis operation performed on the bitstream; temporal distance information of the bitstream; or features determined from the temporal distance information.

[00268] The apparatus may further include wherein the at least one memory and the computer program code are further configured to, with the at least one processor, cause the apparatus at least to: perform the motion estimation operation to output a motion field or optical flow.

[00269] An example apparatus includes means for receiving an input data stream, the input data stream comprising a target frame and one or more reference frames; means for adapting a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation with respect to rate-distortion performance of the motion estimation operation; wherein the motion operation is adapted based on one or more aspects derived from the input data stream, or based on values of one or more aspects derived from the input data stream; and means for encoding at least one of the following into a bitstream: the output of the motion estimation operation, the output of the motion estimation operation being a motion field or optical flow that represents motion of content in the target frame and/or the one or more reference frames; or a motion prediction error based on a prediction of the output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation.

[00270] An example apparatus includes means for receiving a bitstream comprising at least one of: an encoded output of a motion estimation operation, the encoded output of the motion estimation operation being a motion field or optical flow that represents motion of content in a target frame and/or one or more reference frames; or an encoded motion prediction error based on a prediction of the encoded output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation; wherein the motion estimation operation or an input of the motion estimation operation or the output of the motion estimation operation has been adapted based on one or more aspects derived from an input data stream, or based on values of one or more aspects derived from the input data stream, with respect to rate-distortion performance of the motion estimation operation; means for decoding the encoded output of the motion estimation operation, or decoding the encoded motion prediction error to predict the output of the motion estimation operation based on at least one previously decoded motion estimation output; means for performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and means for generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

[00271] An example apparatus includes means for receiving a bitstream; means for adapting a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation, based on one or more aspects derived from the bitstream, or based on values of one or more aspects derived from the bitstream, with respect to rate-distortion performance of the motion estimation operation; means for decoding an encoded output of the motion estimation operation, or decoding an encoded motion prediction error to predict an output of the motion estimation operation based on at least one previously decoded motion estimation output; means for performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and means for generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

[00272] An example method includes receiving an input data stream, the input data stream comprising a target frame and one or more reference frames; adapting a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation with respect to rate-distortion performance of the motion estimation operation; wherein the motion operation is adapted based on one or more aspects derived from the input data stream, or based on values of one or more aspects derived from the input data stream; and encoding at least one of the following into a bitstream: the output of the motion estimation operation, the output of the motion estimation operation being a motion field or optical flow that represents motion of content in the target frame and/or the one or more reference frames; or a motion prediction error based on a prediction of the output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation.

[00273] An example method includes receiving a bitstream comprising at least one of: an encoded output of a motion estimation operation, the encoded output of the motion estimation operation being a motion field or optical flow that represents motion of content in a target frame and/or one or more reference frames; or an encoded motion prediction error based on a prediction of the encoded output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation; wherein the motion estimation operation or an input of the motion estimation operation or the output of the motion estimation operation has been adapted based on one or more aspects derived from an input data stream, or based on values of one or more aspects derived from the input data stream, with respect to rate-distortion performance of the motion estimation operation; decoding the encoded output of the motion estimation operation, or decoding the encoded motion prediction error to predict the output of the motion estimation operation based on at least one previously decoded motion estimation output; performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

[00274] An example method includes receiving a bitstream; adapting a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation, based on one or more aspects derived from the bitstream, or based on values of one or more aspects derived from the bitstream, with respect to rate-distortion performance of the motion estimation operation; decoding an encoded output of the motion estimation operation, or decoding an encoded motion prediction error to predict an output of the motion estimation operation based on at least one previously decoded motion estimation output; performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

[00275] An example non-transitory program storage device readable by a machine, tangibly embodying a program of instructions executable with the machine for performing operations is provided, the operations comprising: receiving an input data stream, the input data stream comprising a target frame and one or more reference frames; adapting a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation with respect to rate-distortion performance of the motion estimation operation; wherein the motion estimation operation is adapted based on one or more aspects derived from the input data stream, or based on values of one or more aspects derived from the input data stream; and encoding at least one of the following into a bitstream: the output of the motion estimation operation, the output of the motion estimation operation being a motion field or optical flow that represents motion of content in the target frame and/or the one or more reference frames; or a motion prediction error based on a prediction of the output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation.

[00276] An example non-transitory program storage device readable by a machine, tangibly embodying a program of instructions executable with the machine for performing operations is provided, the operations comprising: receiving a bitstream comprising at least one of: an encoded output of a motion estimation operation, the encoded output of the motion estimation operation being a motion field or optical flow that represents motion of content in a target frame and/or one or more reference frames; or an encoded motion prediction error based on a prediction of the encoded output of the motion estimation operation, based on at least one previously decoded output of the motion estimation operation; wherein the motion estimation operation or an input of the motion estimation operation or the output of the motion estimation operation has been adapted based on one or more aspects derived from an input data stream, or based on values of one or more aspects derived from the input data stream, with respect to rate-distortion performance of the motion estimation operation; decoding the encoded output of the motion estimation operation, or decoding the encoded motion prediction error to predict the output of the motion estimation operation based on at least one previously decoded motion estimation output; performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

[00277] An example non-transitory program storage device readable by a machine, tangibly embodying a program of instructions executable with the machine for performing operations is provided, the operations comprising: receiving a bitstream; adapting a motion estimation operation or an input of the motion estimation operation or an output of the motion estimation operation, based on one or more aspects derived from the bitstream, or based on values of one or more aspects derived from the bitstream, with respect to rate-distortion performance of the motion estimation operation; decoding an encoded output of the motion estimation operation, or decoding an encoded motion prediction error to predict an output of the motion estimation operation based on at least one previously decoded motion estimation output; performing a motion compensation operation using the decoded output of the motion estimation operation and one or more decoded reference frames, or using the predicted output of the motion estimation operation and one or more decoded reference frames; and generating a predicted frame during a reconstructing of the bitstream, based on the motion compensation operation.

[00278] It should be understood that the foregoing description is only illustrative. Various alternatives and modifications may be devised by those skilled in the art. For example, features recited in the various dependent claims could be combined with each other in any suitable combination(s). In addition, features from different embodiments described above could be selectively combined into a new embodiment. Accordingly, the description is intended to embrace all such alternatives, modifications and variances which fall within the scope of the appended claims.

[00279] The following acronyms and abbreviations that may be found in the specification and/or the drawing figures are defined as follows:

3GPP 3rd generation partnership project

4G fourth generation of broadband cellular network technology

5G fifth generation cellular network technology

802.x family of IEEE standards dealing with local area networks and metropolitan area networks

ASIC application specific integrated circuit

AVC advanced video coding

CDMA code-division multiple access

Cnet convolutional neural network

DCT discrete cosine transform

Dec decoding/decoder

DSP digital signal processor

ECSEL Electronic Components and Systems for European Leadership

Enc encoding/encoder

FDMA frequency division multiple access

FPGA field programmable gate array

Gnet graph convolutional network

GSM global system for mobile communications

H.222.0 MPEG-2 systems, standard for the generic coding of moving pictures and associated audio information

H.2xx family of video coding standards in the domain of the ITU-T

HEVC high efficiency video coding

IBC intra block copy

IEC International Electrotechnical Commission

IEEE Institute of Electrical and Electronics Engineers

I/F interface

IMD integrated messaging device

IMS instant messaging service

Inv. inverse

I/O input/output

IoT internet of things

IP internet protocol

ISO International Organization for Standardization

ISOBMFF ISO base media file format

ITU International Telecommunication Union

ITU-T ITU Telecommunication Standardization Sector

JU joint undertaking

LTE long-term evolution

MC motion compensation

ME motion estimation

MMS multimedia messaging service

MPEG moving picture experts group

MPEG-2 H.222/H.262 as defined by the ITU

MSE mean squared error

NAL network abstraction layer

Net network

NN neural network

N/W network

Opt optimization

PC personal computer

PDA personal digital assistant

PID packet identifier

PLC power line communication

Pred prediction

RFID radio frequency identification

RGB red green blue

RFM reference frame memory

SEI supplemental enhancement information

SMS short messaging service

TCP-IP transmission control protocol-internet protocol

TDMA time division multiple access

TS transport stream

TV television

UICC universal integrated circuit card

UMTS universal mobile telecommunications system

USB universal serial bus

VCM video coding for machines

VSEI versatile supplemental enhancement information

VVC versatile video coding/codec

WLAN wireless local area network