


Title:
POST PROCESSING FILTERS SUITABLE FOR NEURAL-NETWORK-BASED CODECS
Document Type and Number:
WIPO Patent Application WO/2023/208638
Kind Code:
A1
Abstract:
The embodiments relate to a method for encoding and decoding. The method for decoding comprises decoding a received bitstream comprising information corresponding to an image or a video to form decoded information; and filtering the decoded information to form an output image or video, the filtering performed to improve visual quality of the output image or video for human consumption, to improve performance of machine tasks for machine consumption, or to improve both the visual quality and the performance, the filtering using auxiliary information derived from a codec. The embodiments also relate to technical equipment for implementing the methods.

Inventors:
GHAZNAVI YOUVALARI RAMIN (FI)
LE NAM HAI (FI)
CRICRÌ FRANCESCO (FI)
ZHANG HONGLEI (FI)
HANNUKSELA MISKA MATIAS (FI)
AHONEN JUKKA ILARI (FI)
REZAZADEGAN TAVAKOLI HAMED (FI)
Application Number:
PCT/EP2023/059951
Publication Date:
November 02, 2023
Filing Date:
April 18, 2023
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H04N19/117; H04N19/12; H04N19/154; H04N19/46; H04N19/80; H04N19/85
Domestic Patent References:
WO2023280558A1, 2023-01-12
Other References:
LI CHEN ET AL: "CNN based post-processing to improve HEVC", 2017 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), IEEE, 17 September 2017 (2017-09-17), pages 4577 - 4580, XP033323442, DOI: 10.1109/ICIP.2017.8297149
YASIN HAJAR MASEEH ET AL: "Review and Evaluation of End-to-End Video Compression with Deep-Learning", 2021 INTERNATIONAL CONFERENCE OF MODERN TRENDS IN INFORMATION AND COMMUNICATION TECHNOLOGY INDUSTRY (MTICTI), IEEE, 4 December 2021 (2021-12-04), pages 1 - 8, XP033997612, DOI: 10.1109/MTICTI53925.2021.9664790
AHONEN JUKKA I ET AL: "Learned Enhancement Filters for Image Coding for Machines", 2021 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA (ISM), IEEE, 29 November 2021 (2021-11-29), pages 235 - 239, XP034000019, DOI: 10.1109/ISM52913.2021.00046
Attorney, Agent or Firm:
NOKIA EPO REPRESENTATIVES (FI)
Claims:
CLAIMS

What is claimed is:

1. A method, comprising: decoding a received bitstream comprising information corresponding to an image or a video to form decoded information; and filtering the decoded information to form an output image or video, the filtering performed to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance, the filtering using auxiliary information derived from a codec.

2. The method according to claim 1, wherein: the decoding is performed using a decoding path through which the bitstream passes for the decoding, the bitstream is an end result of an encoding path, and the codec from which the auxiliary information is derived is in one or both of the encoding path or decoding path; the filtering uses auxiliary information derived from a codec in one or both of the encoding path or decoding path.

3. The method according to claim 1 or 2, wherein: the auxiliary information forms part of the received bitstream; and the decoding comprises: demultiplexing the received bitstream into the information and into the auxiliary information; decoding the information corresponding to an image or a video to form the decoded information, the decoding the information performed at least in part by a first neural network; and filtering the decoded information using the auxiliary information to create the output image or video.

4. The method according to claim 1 or 2, wherein: the decoding the received bitstream comprises decoding the information using at least a neural network to form the decoded information; the method further comprises encoding the decoded information using at least conventional video compression techniques to form encoded information, decoding the encoded information using at least conventional video compression techniques to form second decoded information; the method further comprises extracting the auxiliary information using the second decoded information; the filtering is performed using the decoded information and the auxiliary information to form enhanced information; and the method comprises combining the decoded information and the enhanced information, using at least a mask determined using the auxiliary information, to form combined information as the output image or video.

5. The method according to claim 1 or 2, wherein: the decoding the received bitstream comprises decoding the information using at least a neural network to form the decoded information; the method further comprises encoding the decoded information using at least conventional video compression techniques to form encoded information, decoding the encoded information using at least conventional video compression techniques to form second decoded information; the method further comprises extracting the auxiliary information using the second decoded information; the filtering is performed using the decoded information and the auxiliary information to form enhanced information; and the method comprises combining the decoded information and the enhanced information, using at least a mask received at the decoder, to form combined information as the output image or video.

6. The method according to claim 1 or 2, wherein: the decoding the received bitstream comprises decoding the information using at least a neural network to form the decoded information; the method further comprises encoding the decoded information using at least conventional video compression techniques to form encoded information, decoding the encoded information using at least conventional video compression techniques to form second decoded information; the method further comprises extracting the auxiliary information using the second decoded information; the filtering is performed using the decoded information and the auxiliary information to form enhanced information; and the method comprises combining the decoded information and the enhanced information, using at least a mask derived from the auxiliary information, to form combined information as the output image or video.
7. The method according to claim 1 or 2, wherein: the decoding the received bitstream comprises: decoding the information using at least a lossless decoder to form a latent tensor; and decoding a first part of the latent tensor at least by using a neural network to form the decoded information; and the filtering uses information from a second part of the latent tensor as the auxiliary information and the decoded information to form enhanced information as the output image or video.

8. The method according to claim 1 or 2, wherein: the received bitstream comprises a bitstream for a learned image compression-encoded intra frame and a bitstream for a conventional video compression-encoded inter frame, the intra frame and inter frame corresponding to video; the decoding the received bitstream comprises: decoding the received bitstream using a decoding performing learned image compression techniques to form a learned image compression-decoded intra frame; encoding the learned image compression-decoded intra frame using conventional video compression techniques to form a bitstream for a conventional video compression-encoded intra frame; decoding the conventional video compression-encoded intra frame to form the auxiliary information; ordering the bitstream for the conventional video compression-encoded intra frame and the bitstream for the conventional video compression-encoded inter frame to form a conventional video compression bitstream; decoding the conventional video compression bitstream to form the decoded information; and the filtering filters the decoded information using the auxiliary information to form the output video.

9. The method according to one of claims 1 to 8, wherein the auxiliary information comprises one or more of the following: high level information including picture or slice type; quantization parameters of one or more of pictures, slices, CTUs, CUs, or PUs; temporal layer ID of one or more of pictures or slices; block partitioning information; block level coding mode information, comprising one or more of intra or inter coded block information; block level intra prediction information; block level inter prediction information; reference picture resampling information; block level transform modes; block level DCT, DST or LFNST coefficients or other representation of an input block; block level information about residual information of that corresponding block; information about in-loop and post processing filters comprising one or more of the following: which of these filters is active for each CU, PU, TU, CTU, tile, slice or picture; any weighting or scaling which is applied to the output of any of these filters; or coefficients and parameters of a filter or filters that were determined at an encoder side; information on encoding configuration; information on encoder-side analysis results; or information about pre-processing operations performed prior to encoding.

10. A method, comprising: encoding information corresponding to an image or a video to form encoded information; extracting, using a codec, auxiliary information from one of the encoded information or the information corresponding to the image or video; forming a bitstream from the auxiliary information and the encoded information; and outputting the bitstream for use by a decoder to perform filtering to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance.
11. The method according to claim 10, wherein: the encoding information performs encoding using at least a first neural network; the extracting is performed on the encoded information and comprises: decoding the encoded information using at least a second neural network to form decoded information; performing, using the codec, encoding and decoding of the decoded information to form second decoded information; and extracting the auxiliary information from the second decoded information.

12. The method according to claim 10 or 11, wherein the codec uses conventional video compression techniques for its encoding and its decoding.

13. The method according to claim 10, wherein: the encoding information performs encoding using at least a neural network; the extracting is performed on the information corresponding to the image or video and uses a codec that uses conventional video compression techniques for encoding and decoding the information.

14. The method according to one of claims 10 to 13, wherein the forming the bitstream comprises multiplexing the auxiliary information and the encoded information together to form the bitstream.

15. The method according to one of claims 10 to 14, wherein the auxiliary information comprises one or more of the following: high level information including picture or slice type; quantization parameters of one or more of pictures, slices, CTUs, CUs, or PUs; temporal layer ID of one or more of pictures or slices; block partitioning information; block level coding mode information, comprising one or more of intra or inter coded block information; block level intra prediction information; block level inter prediction information; reference picture resampling information; block level transform modes; block level DCT, DST or LFNST coefficients or other representation of an input block; block level information about residual information of that corresponding block; information about in-loop and post processing filters comprising one or more of the following: which of these filters is active for each CU, PU, TU, CTU, tile, slice or picture; any weighting or scaling which is applied to the output of any of these filters; or coefficients and parameters of a filter or filters that were determined at an encoder side; information on encoding configuration; information on encoder-side analysis results; or information about pre-processing operations performed prior to encoding.

16. A computer program, comprising code for performing the methods of any of claims 1 to 15, when the computer program is run on a computer.

17. The computer program according to claim 16, wherein the computer program is a computer program product comprising a computer-readable medium bearing computer program code embodied therein for use with the computer.

18. The computer program according to claim 16, wherein the computer program is directly loadable into an internal memory of the computer.

19. An apparatus comprising means for performing: decoding a received bitstream comprising information corresponding to an image or a video to form decoded information; and filtering the decoded information to form an output image or video, the filtering performed to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance, the filtering using auxiliary information derived from a codec.
20. The apparatus according to claim 19, wherein: the decoding is performed using a decoding path through which the bitstream passes for the decoding, the bitstream is an end result of an encoding path, and the codec from which the auxiliary information is derived is in one or both of the encoding path or decoding path; the filtering uses auxiliary information derived from a codec in one or both of the encoding path or decoding path.

21. The apparatus according to claim 19 or 20, wherein: the auxiliary information forms part of the received bitstream; and the decoding comprises: demultiplexing the received bitstream into the information and into the auxiliary information; decoding the information corresponding to an image or a video to form the decoded information, the decoding the information performed at least in part by a first neural network; and filtering the decoded information using the auxiliary information to create the output image or video.

22. The apparatus according to claim 19 or 20, wherein: the decoding the received bitstream comprises decoding the information using at least a neural network to form the decoded information; the means are further configured for performing: encoding the decoded information using at least conventional video compression techniques to form encoded information, decoding the encoded information using at least conventional video compression techniques to form second decoded information; the means are further configured for performing: extracting the auxiliary information using the second decoded information; the filtering is performed using the decoded information and the auxiliary information to form enhanced information; and the means are further configured for performing: combining the decoded information and the enhanced information, using at least a mask determined using the auxiliary information, to form combined information as the output image or video.

23. The apparatus according to claim 19 or 20, wherein: the decoding the received bitstream comprises decoding the information using at least a neural network to form the decoded information; the means are further configured for performing: encoding the decoded information using at least conventional video compression techniques to form encoded information, decoding the encoded information using at least conventional video compression techniques to form second decoded information; the means are further configured for performing: extracting the auxiliary information using the second decoded information; the filtering is performed using the decoded information and the auxiliary information to form enhanced information; and the means are further configured for performing: combining the decoded information and the enhanced information, using at least a mask received at the decoder, to form combined information as the output image or video.
24. The apparatus according to claim 19 or 20, wherein: the decoding the received bitstream comprises decoding the information using at least a neural network to form the decoded information; the means are further configured for performing: encoding the decoded information using at least conventional video compression techniques to form encoded information, decoding the encoded information using at least conventional video compression techniques to form second decoded information; the means are further configured for performing: extracting the auxiliary information using the second decoded information; the filtering is performed using the decoded information and the auxiliary information to form enhanced information; and the means are further configured for performing: combining the decoded information and the enhanced information, using at least a mask derived from the auxiliary information, to form combined information as the output image or video.

25. The apparatus according to claim 19 or 20, wherein: the decoding the received bitstream comprises: decoding the information using at least a lossless decoder to form a latent tensor; and decoding a first part of the latent tensor at least by using a neural network to form the decoded information; and the filtering uses information from a second part of the latent tensor as the auxiliary information and the decoded information to form enhanced information as the output image or video.

26. The apparatus according to claim 19 or 20, wherein: the received bitstream comprises a bitstream for a learned image compression-encoded intra frame and a bitstream for a conventional video compression-encoded inter frame, the intra frame and inter frame corresponding to video; the decoding the received bitstream comprises: decoding the received bitstream using a decoding performing learned image compression techniques to form a learned image compression-decoded intra frame; encoding the learned image compression-decoded intra frame using conventional video compression techniques to form a bitstream for a conventional video compression-encoded intra frame; decoding the conventional video compression-encoded intra frame to form the auxiliary information; ordering the bitstream for the conventional video compression-encoded intra frame and the bitstream for the conventional video compression-encoded inter frame to form a conventional video compression bitstream; decoding the conventional video compression bitstream to form the decoded information; and the filtering filters the decoded information using the auxiliary information to form the output video.
27. The apparatus according to one of claims 19 to 26, wherein the auxiliary information comprises one or more of the following: high level information including picture or slice type; quantization parameters of one or more of pictures, slices, CTUs, CUs, or PUs; temporal layer ID of one or more of pictures or slices; block partitioning information; block level coding mode information, comprising one or more of intra or inter coded block information; block level intra prediction information; block level inter prediction information; reference picture resampling information; block level transform modes; block level DCT, DST or LFNST coefficients or other representation of an input block; block level information about residual information of that corresponding block; information about in-loop and post processing filters comprising one or more of the following: which of these filters is active for each CU, PU, TU, CTU, tile, slice or picture; any weighting or scaling which is applied to the output of any of these filters; or coefficients and parameters of a filter or filters that were determined at an encoder side; information on encoding configuration; information on encoder-side analysis results; or information about pre-processing operations performed prior to encoding.

28. An apparatus comprising means for performing: encoding information corresponding to an image or a video to form encoded information; extracting, using a codec, auxiliary information from one of the encoded information or the information corresponding to the image or video; forming a bitstream from the auxiliary information and the encoded information; and outputting the bitstream for use by a decoder to perform filtering to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance.

29. The apparatus according to claim 28, wherein: the encoding information performs encoding using at least a first neural network; the extracting is performed on the encoded information and comprises: decoding the encoded information using at least a second neural network to form decoded information; performing, using the codec, encoding and decoding of the decoded information to form second decoded information; and extracting the auxiliary information from the second decoded information.

30. The apparatus according to claim 28 or 29, wherein the codec uses conventional video compression techniques for its encoding and its decoding.

31. The apparatus according to claim 28, wherein: the encoding information performs encoding using at least a neural network; the extracting is performed on the information corresponding to the image or video and uses a codec that uses conventional video compression techniques for encoding and decoding the information.

32. The apparatus according to one of claims 28 to 31, wherein the forming the bitstream comprises multiplexing the auxiliary information and the encoded information together to form the bitstream.
33. The apparatus according to one of claims 28 to 32, wherein the auxiliary information comprises one or more of the following: high level information including picture or slice type; quantization parameters of one or more of pictures, slices, CTUs, CUs, or PUs; temporal layer ID of one or more of pictures or slices; block partitioning information; block level coding mode information, comprising one or more of intra or inter coded block information; block level intra prediction information; block level inter prediction information; reference picture resampling information; block level transform modes; block level DCT, DST or LFNST coefficients or other representation of an input block; block level information about residual information of that corresponding block; information about in-loop and post processing filters comprising one or more of the following: which of these filters is active for each CU, PU, TU, CTU, tile, slice or picture; any weighting or scaling which is applied to the output of any of these filters; or coefficients and parameters of a filter or filters that were determined at an encoder side; information on encoding configuration; information on encoder-side analysis results; or information about pre-processing operations performed prior to encoding.

34. The apparatus of any preceding apparatus claim, wherein the means comprises: at least one processor; and at least one memory including computer program code, the at least one memory and computer program code configured to, with the at least one processor, cause the performance of the apparatus.

35. An apparatus, comprising: one or more processors; and one or more memories including computer program code, wherein the one or more memories and the computer program code are configured, with the one or more processors, to cause the apparatus to: decode a received bitstream comprising information corresponding to an image or a video to form decoded information; and filter the decoded information to form an output image or video, the filtering performed to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance, the filtering using auxiliary information derived from a codec.

36. A computer program product comprising a computer-readable storage medium bearing computer program code embodied therein for use with a computer, the computer program code comprising: code for decoding a received bitstream comprising information corresponding to an image or a video to form decoded information; and code for filtering the decoded information to form an output image or video, the filtering performed to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance, the filtering using auxiliary information derived from a codec.
37. An apparatus, comprising: one or more processors; and one or more memories including computer program code, wherein the one or more memories and the computer program code are configured, with the one or more processors, to cause the apparatus to: encode information corresponding to an image or a video to form encoded information; extract, using a codec, auxiliary information from one of the encoded information or the information corresponding to the image or video; form a bitstream from the auxiliary information and the encoded information; and output the bitstream for use by a decoder to perform filtering to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance.

38. A computer program product comprising a computer-readable storage medium bearing computer program code embodied therein for use with a computer, the computer program code comprising: code for encoding information corresponding to an image or a video to form encoded information; code for extracting, using a codec, auxiliary information from one of the encoded information or the information corresponding to the image or video; code for forming a bitstream from the auxiliary information and the encoded information; and code for outputting the bitstream for use by a decoder to perform filtering to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance.

Description:
POST PROCESSING FILTERS SUITABLE FOR NEURAL-NETWORK-BASED CODECS

TECHNICAL FIELD

[0001] Exemplary embodiments herein relate generally to video coding and decoding and, more specifically, relate to filtering in video coding and decoding.

BACKGROUND

[0002] Video coding and decoding is performed by encoders and decoders, typically with the encoder used prior to transmission of coded video, and a decoder used after reception of coded video. The encoder takes input video and encodes it, and often the encoding substantially reduces the size of the video, so that transmission requires less bandwidth. The decoder then recreates a version of the input video based on the received coded video. The encoder and decoder are often contained together in a codec (coder/decoder), and the codec can therefore perform whichever of the encoding or decoding functions is needed.

[0003] Video coding and decoding has recently begun using neural networks (NNs). As such, neural networks and other types of machine learning (ML) devices may play important roles in video encoding and decoding.

[0004] A post-processing filter may be used in a video decoder to improve the quality of the output video. Within these and other systems, the post-processing may be improved.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] In the attached Drawing Figures:

[0006] FIG. 1 is a block diagram illustrating examples of how NNs can function as components of a traditional codec’s pipeline;

[0007] FIG. 2 is a block diagram of an approach where a traditional video coding pipeline might be reused, where most or all of the components are replaced with NNs;

[0008] FIG. 3 is a block diagram of an approach where the entire codec is redesigned using NN components;

[0009] FIG. 4 is a block diagram of a typical neural network-based end-to-end learned video coding system;

[0010] FIG. 5 is a general illustration of the pipeline of video coding for machines;

[0011] FIG. 6 is an illustration of FIG. 5, where the encoder and decoder are implemented in part with NNs;

[0012] FIG. 7 illustrates an example of how the end-to-end learned system, such as in FIG. 6, may be trained;

[0013] FIG. 8 illustrates an example including an encoder, a decoder, a post-processing filter, and a set of task-NNs;

[0014] FIG. 9 illustrates an MLC system that allows an end-to-end learned image compression system (LIC) to be used together with a conventional video compression (CVC) codec;

[0015] FIG. 10 illustrates FIG. 9 but with CVC pre-filtering added;

[0016] FIG. 11 is a block diagram of a system where the NNC-decoded content is re-encoded and then decoded with a CVC codec for extracting the auxiliary information, in accordance with an exemplary embodiment;

[0017] FIG. 11A is a block diagram of a system similar to that in FIG. 11, only with a different chain of codecs;

[0018] FIG. 12 is a block diagram of a system where transcoding or re-encoding of NNC-decoded data with CVC is performed in the encoder side of the system, in accordance with an exemplary embodiment;

[0019] FIG. 13 is a block diagram of a system where the auxiliary information obtained from a CVC decoder or from a region-proposal neural network (or from similar sources) is used to combine the output of the post-processing filter with its input, in an exemplary embodiment;

[0020] FIG. 14 is a block diagram of a system where the bitstream output by the encoder of the NNC codec is lossless decoded into a latent tensor which includes two portions or subsets of the latent tensor, in accordance with an exemplary embodiment;

[0021] FIG. 15 is a block diagram of a system that illustrates an example of using the methods herein in an MLC system, in accordance with an exemplary embodiment; and

[0022] FIG. 16 is a block diagram of a communication device suitable for implementing the exemplary embodiments.

DETAILED DESCRIPTION OF THE DRAWINGS

[0023] Abbreviations that may be found in the specification and/or the drawing figures are defined below, at the end of the detailed description section.

[0024] The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments. All of the embodiments described in this Detailed Description are exemplary embodiments provided to enable persons skilled in the art to make or use the invention and not to limit the scope of the invention which is defined by the claims.

[0025] When more than one drawing reference numeral, word, or acronym is used within this description with “/”, and in general as used within this description, the “/” may be interpreted as “or”, “and”, or “both”.

[0026] As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “comprising”, “has”, “having”, “includes” and/or “including”, when used herein, specify the presence of stated features, elements, and/or components and the like, but do not preclude the presence or addition of one or more other features, elements, components and/or combinations thereof.

[0027] For ease of reference, this document is divided into sections. These sections are not considered to be limiting.

[0028] I. Overview of exemplary technical areas

[0029] This section contains information about the technical areas that might be applicable to the exemplary embodiments.

[0030] I.a. Fundamentals of neural networks

[0031] A neural network (NN) is a computation graph comprising several layers of computation. Each layer comprises one or more units, where each unit performs an elementary computation. A unit is connected to one or more other units, and the connection may be associated with a weight. The weight may be used for scaling the signal passing through the associated connection. Weights are learnable parameters, i.e., values which can be learned from training data. There may be other learnable parameters, such as those of batch-normalization layers.

[0032] Two of the most widely used architectures for neural networks are feedforward and recurrent architectures. Feed-forward neural networks are such that there is no feedback loop: each layer takes input from one or more of the layers before and provides its output as the input for one or more of the subsequent layers. Also, units inside a certain layer take input from units in one or more of preceding layers, and provide output to one or more of following layers.

[0033] Initial layers (those close to the input data) extract semantically low-level features such as edges and textures in images, and intermediate and final layers extract more high-level features. After the feature extraction layers, there may be one or more layers performing a certain task, such as classification, semantic segmentation, object detection, denoising, style transfer, super-resolution, and the like. In recurrent neural nets, there is a feedback loop, so that the network becomes stateful, i.e., it is able to memorize information or a state.
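To make the feed-forward structure just described concrete, the following purely illustrative Python/NumPy sketch (not taken from the application; the layer sizes and weights are invented) stacks two feature-extraction layers and a final task layer, with each layer consuming the output of the previous one:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Invented layer sizes: input -> two feature layers -> 10-way task layer.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 32))   # low-level features (e.g., edges, textures)
W2 = rng.standard_normal((32, 16))   # higher-level features
W3 = rng.standard_normal((16, 10))   # task layer, e.g., a 10-class classifier

def forward(x):
    h1 = relu(x @ W1)                # output of layer 1 feeds layer 2
    h2 = relu(h1 @ W2)
    return h2 @ W3                   # task output (class scores)

x = rng.standard_normal((1, 64))
print(forward(x).shape)              # (1, 10)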

[0034] Neural networks are being utilized in an ever-increasing number of applications for many different types of devices, such as mobile phones. Examples include image and video analysis and processing, social media data analysis, device usage data analysis, and the like.

[0035] The most important property of neural nets (and other machine learning tools) is that they are able to learn properties from input data, either in a supervised way or in an unsupervised way. Such learning is a result of a training algorithm, or of a meta-level neural network providing the training signal.

[0036] In general, the training algorithm includes changing some properties of the neural network so that its output is as close as possible to a desired output. For example, in the case of classification of objects in images, the output of the neural network can be used to derive a class or category index which indicates the class or category to which the object in the input image belongs. Training usually happens by minimizing or decreasing the output’s error, also referred to as the loss. Examples of losses are mean squared error, cross-entropy, and the like. In recent deep learning techniques, training is an iterative process, where at each iteration the algorithm modifies the weights of the neural net to make a gradual improvement of the network’s output, i.e., to gradually decrease the loss.

[0037] In this document, the terms “model”, “neural network”, “neural net” and “network” are used interchangeably, and also the weights of neural networks are sometimes referred to as learnable parameters or simply as parameters.

[0038] Training a neural network is an optimization process, but the final goal is different from the typical goal of optimization. In optimization, the only goal is to minimize a function. In machine learning, the goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset. In other words, the goal is to learn to use a limited training dataset in order to learn to generalize to previously unseen data, i.e., data which was not used for training the model. This is usually referred to as generalization. In practice, data is usually split into at least two sets, the training set and the validation set. The training set is used for training the network, i.e., to modify its learnable parameters in order to minimize the loss. The validation set is used for checking the performance of the network on data which was not used to minimize the loss, as an indication of the final performance of the model. In particular, the errors on the training set and on the validation set are monitored during the training process to understand the following things:

[0039] 1) If the network is learning at all - in this case, the training set error should decrease, otherwise the model is in the regime of underfitting.

[0040] 2) If the network is learning to generalize - in this case, also the validation set error needs to decrease and to be not too much higher than the training set error. If the training set error is low, but the validation set error is much higher than the training set error, or the validation set error does not decrease, or it even increases, the model is in the regime of overfitting. This means that the model has just memorized the training set’s properties and performs well only on that set, but performs poorly on a set not used for tuning its parameters.
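The two checks above can be expressed as a short monitoring sketch; the loss values and the 0.3 gap threshold below are invented solely for illustration:

# Hypothetical per-epoch losses monitored during training.
train_loss = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18]
val_loss   = [1.05, 0.80, 0.62, 0.55, 0.60, 0.70]

is_learning = train_loss[-1] < train_loss[0]         # check 1: otherwise underfitting
val_rising = val_loss[-1] > min(val_loss)            # validation error increasing again
large_gap = (val_loss[-1] - train_loss[-1]) > 0.3    # validation error much higher
overfitting = val_rising or large_gap                # check 2: generalization failing

print("learning:", is_learning, "overfitting suspected:", overfitting)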

[0041] Lately, neural networks have been used for compressing and decompressing data such as images, i.e., in an image codec. The most widely used architecture for realizing one component of an image codec is the auto-encoder, which is a neural network including two parts: a neural encoder and a neural decoder (these are referred to herein simply as encoder and decoder, even though we may refer to algorithms which are learned from data instead of being tuned by hand). The encoder takes as input an image and produces a code which requires less bits than the input image. This code may be obtained by applying a binarization or quantization process to the output of the encoder. The decoder takes in this code and reconstructs the image which was input to the encoder.

[0042] Such encoder and decoder are usually trained to minimize a combination of bitrate and distortion, where the distortion may be based on one or more of the following metrics: Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), or similar. These distortion metrics are meant to be correlated to the human visual perception quality, so that minimizing or maximizing one or more of these distortion metrics results in improving the visual quality of the decoded image as perceived by humans.
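As a rough sketch of such a rate-distortion objective (the blocks, the bit count, and the trade-off weight below are invented for illustration; a real codec would obtain the rate from its entropy model):

import numpy as np

def mse(x, x_hat):
    return float(np.mean((x - x_hat) ** 2))

def psnr(x, x_hat, max_val=255.0):
    return 10.0 * np.log10(max_val ** 2 / mse(x, x_hat))

rng = np.random.default_rng(1)
x = rng.uniform(0, 255, (8, 8))              # "original" block
x_hat = x + rng.normal(0, 3.0, (8, 8))       # "decoded" block
rate_bits = 120.0                            # stand-in for the estimated bitrate
lmbda = 0.01                                 # bitrate/distortion trade-off weight

loss = mse(x, x_hat) + lmbda * rate_bits     # combination of distortion and bitrate
print(round(psnr(x, x_hat), 2), round(loss, 2))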

[0043] I.b. Fundamentals of video/image coding

[0044] A video codec includes an encoder that transforms the input video into a compressed representation suited for storage/transmission, and a decoder that can decompress the compressed video representation back into a viewable form. Typically, the encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate).

[0045] Typical hybrid video codecs, for example ITU-T H.263 and H.264, encode the video information in two phases. Firstly, pixel values in a certain picture area (or “block”) are predicted for example by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner). Secondly the prediction error, i.e., the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g., Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients. By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (e.g., picture quality) and size of the resulting coded video representation (e.g., file size or transmission bitrate).
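The two phases can be illustrated with a toy block coder; this is only a sketch under simplified assumptions (a fixed 8x8 orthonormal DCT, a single scalar quantization step, no entropy coding), not the pipeline of any particular standard:

import numpy as np

N = 8
n, k = np.meshgrid(np.arange(N), np.arange(N))
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0, :] = np.sqrt(1.0 / N)                   # orthonormal 8x8 DCT-II matrix

def encode_block(block, prediction, qstep):
    residual = block - prediction            # phase 1: prediction error
    coeffs = C @ residual @ C.T              # phase 2: transform the difference
    return np.round(coeffs / qstep)          # quantize (the lossy, rate-controlling step)

def decode_block(levels, prediction, qstep):
    residual = C.T @ (levels * qstep) @ C    # dequantize and inverse transform
    return prediction + residual

rng = np.random.default_rng(0)
orig = rng.uniform(0, 255, (N, N))
pred = orig + rng.normal(0, 5, (N, N))       # stand-in for a motion/spatial prediction
rec = decode_block(encode_block(orig, pred, qstep=10.0), pred, qstep=10.0)
print(float(np.abs(rec - orig).max()))       # larger qstep: fewer bits, larger error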

[0046] Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, exploits temporal redundancy. In inter prediction, the sources of prediction are previously decoded pictures.

[0047] Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in the spatial or transform domain, i.e., either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra coding, where no inter prediction is applied.

[0048] One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently if they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.

[0049] The decoder reconstructs the output video by applying prediction similar to the encoder to form a predicted representation of the pixel blocks (e.g., using the motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (e.g., the inverse operation of the prediction error coding, recovering the quantized prediction error signal in the spatial pixel domain). After applying the prediction and prediction error decoding means, the decoder sums up the prediction and prediction error signals (pixel values) to form the output video frame. The decoder (and encoder) can also apply additional filtering to improve the quality of the output video before passing the video for display and/or storing the video as a prediction reference for the forthcoming frames in the video sequence.

[0050] In typical video codecs, the motion information is indicated with motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement of the image block in the picture to be coded (in the encoder side) or decoded (in the decoder side) and the prediction source block in one of the previously coded or decoded pictures. In order to represent motion vectors efficiently, they are typically coded differentially with respect to block-specific predicted motion vectors. In typical video codecs, the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks. Another way to create motion vector predictions is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures, and to signal the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of a previously coded/decoded picture can be predicted. The reference index is typically predicted from adjacent blocks and/or co-located blocks in the temporal reference picture. Moreover, typical high efficiency video codecs employ an additional motion information coding/decoding mechanism, often called merging/merge mode, where all the motion field information, which includes a motion vector and corresponding reference picture index for each available reference picture list, is predicted and used without any modification/correction. Similarly, predicting the motion field information is carried out using the motion field information of adjacent blocks and/or co-located blocks in temporal reference pictures, and the used motion field information is signaled among a motion field candidate list filled with the motion field information of available adjacent/co-located blocks.

[0051] In typical video codecs, the prediction residual after motion compensation is first transformed with a transform kernel (like DCT) and then coded. The reason for this is that there often still exists some correlation within the residual, and the transform can in many cases help reduce this correlation and provide more efficient coding.
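A minimal sketch of the median motion-vector predictor described in paragraph [0050]; the neighbouring motion vectors are invented, and real codecs add availability and scaling rules not shown here:

import numpy as np

def mv_predictor(neighbour_mvs):
    # Component-wise median of already-coded neighbouring motion vectors.
    return np.median(np.array(neighbour_mvs, dtype=float), axis=0)

neighbours = [(4, -1), (5, 0), (3, -2)]            # e.g., left, top, top-right blocks
mv = np.array([6.0, -1.0])                         # vector found by motion estimation

pred = mv_predictor(neighbours)                    # predicted motion vector
mvd = mv - pred                                    # only this difference is entropy coded
print(pred, mvd, mv_predictor(neighbours) + mvd)   # decoder recovers the original mv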

[0052] Typical video encoders utilize Lagrangian cost functions to find optimal coding modes, e.g., the desired Macroblock mode and associated motion vectors. This kind of cost function uses a weighting factor to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area:

[0053] C = D + λR,

[0054] where C is the Lagrangian cost to be minimized, D is the image distortion (e.g., Mean Squared Error) with the mode and motion vectors considered, λ is the weighting factor (the Lagrange multiplier), and R is the number of bits needed to represent the required data to reconstruct the image block in the decoder (including the amount of data to represent the candidate motion vectors).
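For example, mode selection with this cost can be sketched as follows; the candidate modes, their distortion and rate values, and the lambda are invented numbers used only to show the comparison:

def rd_cost(D, R, lmbda):
    return D + lmbda * R                      # C = D + lambda * R

candidates = {
    "intra":       {"D": 120.0, "R": 300},    # D: e.g., sum of squared errors; R: bits
    "inter_16x16": {"D": 90.0,  "R": 420},
    "merge":       {"D": 150.0, "R": 60},
}
lmbda = 0.4
best = min(candidates, key=lambda m: rd_cost(**candidates[m], lmbda=lmbda))
print(best, rd_cost(**candidates[best], lmbda=lmbda))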

[0055] Video coding specifications may enable the use of supplemental enhancement information (SEI) messages or alike. Some video coding specifications include SEI NAL units, and some video coding specifications contain both prefix SEI NAL units and suffix SEI NAL units, where the former type can start a picture unit or alike and the latter type can end a picture unit or alike. An SEI NAL unit contains one or more SEI messages, which are not required for the decoding of output pictures but may assist in related processes, such as picture output timing, post-processing of decoded pictures, rendering, error detection, error concealment, and resource reservation. Several SEI messages are specified in H.264/AVC, H.265/HEVC, H.266/VVC, and H.274/VSEI standards, and the user data SEI messages enable organizations and companies to specify SEI messages for their own use. The standards may contain the syntax and semantics for the specified SEI messages but a process for handling the messages in the recipient might not be defined. Consequently, encoders may be required to follow the standard specifying a SEI message when they create SEI message(s), and decoders might not be required to process SEI messages for output order conformance. One of the reasons to include the syntax and semantics of SEI messages in standards is to allow different system specifications to interpret the supplemental information identically and hence interoperate. It is intended that system specifications can require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally the process for handling particular SEI messages in the recipient can be specified.

[0056] A design principle has been followed for SEI message specifications: the SEI messages are generally not extended in future amendments or versions of the standard.

[0057] I.c. Information on filters in video codecs

[0058] Conventional image and video codecs use a set of filters to enhance the visual quality of the predicted visual content; these filters can be applied either in-loop or out-of-loop, or both. In the case of in-loop filters, the filter applied on one block in the currently-encoded frame will affect the encoding of another block in the same frame and/or in another frame which is predicted from the current frame. An in-loop filter can affect the bitrate and/or the visual quality. In fact, an enhanced block will cause a smaller residual (difference between the original block and the predicted-and-filtered block), thus requiring fewer bits to be encoded. An out-of-loop filter is applied on a frame after the frame has been reconstructed; the filtered visual content is not used as a source for prediction, and thus the out-of-loop filter may only impact the visual quality of the frames that are output by the decoder.
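The placement difference can be sketched with a toy decoding loop; every function below is an invented placeholder, the point being only where each filter sits relative to the reference buffer:

def in_loop_filter(frame):  return frame + 0.1        # e.g., deblocking/SAO/ALF stand-in
def post_filter(frame):     return frame + 0.2        # e.g., an NN post-processing filter
def reconstruct(bits, ref): return (ref or 0.0) + 1.0  # prediction + residual stand-in

reference = None
outputs = []
for bits in ["frame0", "frame1", "frame2"]:
    recon = in_loop_filter(reconstruct(bits, reference))
    reference = recon                   # in-loop result is reused for later prediction
    outputs.append(post_filter(recon))  # out-of-loop result only affects the output
print(outputs)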

[0059] I.d. Information on neural network-based image/video coding

[0060] Recently, neural networks (NNs) have been used in the context of image and video compression, by following mainly two approaches.

[0061] In one approach, NNs are used to replace one or more of the components of a traditional codec such as VVC/H.266. Here, by “traditional”, it is meant those codecs whose components and their parameters are typically not learned from data. Examples of components that may be implemented as neural networks are:

[0062] 1) An additional in-loop filter, for example by having the NN as an additional in-loop filter with respect to the traditional loop filters.

[0063] 2) A single in-loop filter, for example by having the NN replacing all traditional in-loop filters.

[0064] 3) Intra-frame prediction.

[0065] 4) Inter-frame prediction.

[0066] 5) A transform and/or inverse transform.

[0067] 6) A probability model for the arithmetic codec.

[0068] Referring to FIG. 1, this figure illustrates examples of how NNs (neural networks) can function as components of a traditional codec’s pipeline. It is noted that deep NNs are used. The term “deep NN” is typically defined as a neural network with more than two layers, such as a network that has an input layer, an output layer and at least one hidden layer in between.

[0069] The block diagram illustrates circuitry (or functions implemented in circuitry, or both) for an exemplary and typical codec 50. The following elements are in the signal path from the video signal 21 to the reconstruction signal 22 or to the bitstream signal 23: a switch 25 (controlled by the encoder control function); a resolution adaptation function; a T/Q (transform/quantize) function; an encoder control function; a luma intra-prediction function (Luma Intra Pred); a chroma intra-prediction function (Chroma Intra Pred); an intra-prediction function (Intra Pred); an inter-prediction function (Inter Pred); a switch 30 to select between the intra- and inter-prediction functions; an ME/MC function (ME/MC); a post-processing filter function (Post Filter); a Q⁻¹/T⁻¹ (dequantize/inverse transform) function; an adder 32; an entropy coding function; an upsampling function; a switch 35 that selects between bypassing or selecting the upsampling function; and an in-loop filter function.

[0070] Block 20 indicates exemplary, but non-limiting, locations where various NNs can be implemented: 1-a deep intra-prediction neural network (Deep Intra Pred Net) can be applied to the luma intra-prediction function; 2-a deep cross-component prediction neural network (Deep Cross Component Pred Net) can be applied to the chroma intra-prediction function; 3-a deep inter-prediction neural network (Deep Inter Pred Net) can be applied to the inter-prediction function; 4-a deep probability estimation neural network (Deep Probability Estimation Net) can be applied at the entropy coding function; 5-a deep transformation neural network (Deep Transform Net) can be applied to the T/Q function; 6-a deep loop filter can be applied to the in-loop filter function; 7-a deep post-processing filter neural network can be applied to the post-processing filter function; 8-a deep up-/down-sampling coding neural network can be applied for the resolution adaptation function; and/or 9-a deep encoder optimization neural network can be applied to the encoder control function.

[0071] In another approach, commonly referred to as “end-to-end learned compression”, NNs are used as the main components of the image/video codecs. In this second approach, there are two main options. In a first option, Option 1, this option re-uses the traditional video coding pipeline, but replaces most or all the components with NNs. See FIG. 2, which is a block diagram of an approach where a traditional video coding pipeline might be reused, where most or all of the components are replaced with NNs.

[0072] In FIG. 2, an input signal (X) 205 is applied to an adder 207, which is coupled to a neural transform 215. The rest of the elements in FIG. 2 are as follows: an encoder parameter control function 210; a quantization function 220; an inverse quantization/neural transform 225; another adder 227; an entropy coding function 230 that produces a bitstream 290; a neural intra-codec 235 comprising an encoder 250, an intra-coding function 240, and a decoder 235, and producing an output 236; an inter-prediction function 255 comprising an ME/MC function 260 and a Gnet(Cnet(*)) function, the inter-prediction function 255 producing an output 267; a switch 257 that can select between outputs 236 and 267; a decoded picture buffer 270 comprising reconstructed frames 275 and an enhanced reference frame 20; and a deep loop filter 285.

[0073] In a second option, Option 2, the entire pipeline is redesigned, as follows:

[0074] 1) An encoder NN performs a non-linear transform.

[0075] 2) Quantization and lossless encoding is performed of the encoder NN’s output.

[0076] 3) Lossless decoding and dequantization is performed.

[0077] 4) A decoder NN performs a non-linear inverse transform.

[0078] The encoder and decoder NNs may be parts of a neural auto-encoder architecture.
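A compact sketch of the Option 2 chain; the "neural networks" here are single random linear maps, the quantizer is a fixed step, and the lossless codec is a no-op placeholder, all invented purely to show the order of the steps:

import numpy as np

rng = np.random.default_rng(0)
We = rng.standard_normal((16, 4))        # encoder NN weights (analysis transform)
Wd = rng.standard_normal((4, 16))        # decoder NN weights (synthesis transform)

x = rng.standard_normal(16)
y = np.tanh(x @ We)                      # 1) non-linear transform by the encoder NN
q = np.round(y / 0.125)                  # 2) quantization to discrete values
bitstream = q.astype(int).tolist()       # 2) stand-in for lossless (arithmetic) encoding
y_hat = np.array(bitstream) * 0.125      # 3) lossless decoding and dequantization
x_hat = y_hat @ Wd                       # 4) decoder NN (inverse transform; kept linear here)
print(x.shape, len(bitstream), x_hat.shape)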

[0079] See FIG. 3, which is a block diagram of an approach where the entire codec is redesigned using NN components. In this figure, the Analysis Network is the Encoder NN, and the Synthesis Network is the Decoder NN. In FIG. 3, a reconstruction loss 304 and an auxiliary loss 310 can be determined using a target 315 and a reconstruction 320 (e.g., video), where the target 315 is applied to an analysis network 335 of a set of spatial correlation tools 330, then to a quantization function 350, then to an adaptive codelength regularization function 345 and an arithmetic coding function 355, where 350, 345, and 355 are part of the bitrate control strategy function 360. The arithmetic coding function 355 produces output (e.g., a bitstream) 370 routed to the adaptive codelength regularization function 345 and to an arithmetic decoding function 365. The arithmetic decoding function 365 produces output for the synthesis network 340 (part of the set of spatial correlation tools 330), which produces the reconstruction 320.

[0080] In Option 2, the input data is analyzed by the Encoder NN (Analysis Network 335), which outputs a new representation of that input data. The new representation may be more compressible. This new representation may then be quantized (via 350) to a discrete number of values. The quantized data is then lossless encoded, for example by an arithmetic encoder 355, thus obtaining a bitstream 370. On the decoding side, the bitstream is first lossless decoded, for example by using an arithmetic decoder 365. The lossless decoded data is dequantized and then input to the Decoder NN (Synthesis Network 340). The output is the reconstructed or decoded data (see 320).

[0081] In case of lossy compression, the lossy steps may comprise the Encoder NN and/or the quantization.

[0082] In order to train this system, a training objective function (also called “training loss”) is typically utilized, which usually comprises one or more terms, or loss terms, or simply losses. In one example, the training loss comprises a reconstruction loss term and a rate loss term. The reconstruction loss encourages the system to decode data that is similar to the input data, according to some similarity metric. Reconstruction losses may be derived from one or more of the following metrics:

[0083] 1) Mean squared error (MSE).

[0084] 2) Multi-scale structural similarity (MS-SSIM).

[0085] 3) Losses derived from the use of a pretrained neural network. For example, error(f1, f2), where f1 and f2 are the features extracted by a pretrained neural network for the input data and the decoded data, respectively, and error() is an error or distance function, such as the L1 norm or the L2 norm.

[0086] 4) Losses derived from the use of a neural network that is trained simultaneously with the end-to-end learned codec. For example, adversarial loss can be used, which is the loss provided by a discriminator neural network that is trained adversarially with respect to the codec, following the settings proposed in the context of Generative Adversarial Networks (GANs) and their variants.

[0087] The rate loss encourages the system to compress the output of the encoding stage, such as the output of the arithmetic encoder. By “compressing”, what is meant is that the number of bits output by the encoding stage is reduced, relative to what would occur without compression.

[0088] When an entropy-based lossless encoder is used, such as an arithmetic encoder, the rate loss typically encourages the output of the encoder NN to have low entropy. Rate losses may be computed on the output of the encoder, either before or after quantization, or on the output of the probability model. Example of rate losses are the following:

[0089] 1) A differentiable estimate of the entropy.

[0090] 2) A sparsification loss, i.e., a loss that encourages the output of the encoder NN or the output of the quantization to have many zeros. Examples are the L0 norm, the L1 norm, and the L1 norm divided by the L2 norm.

[0091] 3) A cross-entropy loss applied to the output of a probability model, where the probability model may be a NN used to estimate the probability of the next symbol to be encoded by an arithmetic encoder.

[0092] One or more of the reconstruction losses may be used, and one or more of the rate losses may be used, as a weighted sum. Typically, the different loss terms are weighted using different weights, and these weights determine how the final system performs in terms of rate-distortion loss. For example, if more weight is given to the reconstruction losses with respect to the rate losses, the system may learn to compress less but to reconstruct with higher accuracy (as measured by a metric that correlates with the reconstruction losses). These weights are usually considered to be hyper-parameters of the training session, and may be set manually by the person designing the training session, or automatically, for example by grid search or by using additional neural networks.

[0093] It is to be understood that even in end-to-end learned approaches, there may be components which are not learned from data, such as the arithmetic codec.

[0094] I.e. Exemplary fundamentals of neural network-based end-to-end learned video coding

[0095] As shown in FIG. 4, a typical neural network-based end-to-end learned video coding system contains an encoder 410, a quantizer 415, a probability model 430, an entropy (also referred to as arithmetic) codec 470 (for example arithmetic encoder 420/decoder 425), a dequantizer 435, and a decoder 440. The “x” indicates the video input 405, the “x̂” indicates the reconstructed video output 450, and the bitstream 480 is subject to whatever channel through which it is communicated. The encoder 410 and decoder 440 are typically two neural networks, or mainly comprise neural network components. The probability model 430 may also comprise mainly neural network components. The probability model 430 provides one of the inputs to the entropy codec, i.e., a probability estimate. The quantizer 415, dequantizer 435, and entropy codec 470 are typically not based on neural network components, but they may potentially also comprise neural network components.

[0096] An entropy model 455 comprises a quantizer 415, an arithmetic encoder 420, an arithmetic decoder 425, a dequantizer 435, and a probability model 430. For simplicity, in FIGS. 5, 6, 7, and 8, we use an entropy model 555 that interacts with a bitstream to represent the conversion between a latent representation and a bitstream.

[0097] On the encoder side, the encoder 410 component takes a video 405 as input and converts the video from its original signal space into a latent representation that may comprise a more compressible representation of the input. In this document, latent representation and latent feature(s) are considered to be equivalent. A latent representation is a representation of the input signal in a new space, normally called a latent space, for the purpose of analysis or compression. The latent representation is obtained by computation from an observed input. In the case of an input image, for instance, one exemplary latent representation may be a 3-dimensional tensor, where two dimensions represent the vertical and horizontal spatial dimensions, and the third dimension represents the “channels” which contain information at that specific location. If the input image is a 128x128x3 RGB image (with horizontal size of 128 pixels, vertical size of 128 pixels, and 3 channels for the Red, Green, Blue color components), and if the encoder downsamples the input tensor by 2 and expands the channel dimension to 32 channels, then the latent representation is a tensor of dimensions (or “shape”) 64x64x32 (i.e., with horizontal size of 64 elements, vertical size of 64 elements, and 32 channels). Note that the order of the different dimensions may differ depending on the convention which is used; in some cases, for the input image, the channel dimension may be the first dimension, so for the above example, the shape of the input tensor may be represented as 3x128x128, instead of 128x128x3. In the case of an input video (instead of just an input image), another dimension in the input tensor may be used to represent temporal information. Similar information may be used for audio, as audio latent feature 535.
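As a minimal sketch of the shape example above (assuming a PyTorch environment; the single layer below is illustrative and not the actual encoder architecture), one strided convolution produces exactly the kind of latent tensor described: a 128x128x3 input mapped to a 64x64 latent with 32 channels.

```python
# Sketch: an illustrative encoder layer that downsamples by 2 and expands to 32 channels.
import torch
import torch.nn as nn

# Channel-first convention: input shape is (batch, 3, 128, 128).
x = torch.rand(1, 3, 128, 128)

encoder_layer = nn.Conv2d(in_channels=3, out_channels=32,
                          kernel_size=3, stride=2, padding=1)

latent = encoder_layer(x)
print(latent.shape)  # torch.Size([1, 32, 64, 64]) -> 64x64 spatial, 32 channels
```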

[0098] It should also be noted that the video or audio latent features can be generalized to features. For instance, for the channels in the new (e.g., video) space described above, these could correspond to features such as spatial resolution, color bit depth, chroma subsampling, gamma, color space, frame rate, field rate (temporal), and/or the like, or other compression techniques may be used with corresponding features. Audio can have similar features, as is known.

[0099] The quantizer 415 component quantizes the latent representation into discrete values given a predefined set of quantization levels. The probability model 430 and arithmetic encoder 420 component work together to perform lossless compression for the quantized latent representation and generate a bitstream 480 to be sent to the decoder side. Given a symbol to be encoded into the bitstream 480, the probability model 430 estimates the probability distribution of all possible values for that symbol based on a context that is constructed from available information at the current encoding/decoding state, such as the data that has already been encoded/decoded. Then, the arithmetic encoder 420 encodes the input symbols to bitstream using the estimated probability distributions.
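A minimal sketch of the quantization step described above (assuming uniform, predefined quantization levels; this is an illustration, not the actual quantizer of the codec) is the following; each latent value is simply mapped to the nearest allowed level, and the resulting integer indices are the symbols on which the probability model and arithmetic coder operate.

```python
# Sketch: quantizing a latent tensor to the nearest of a predefined set of levels.
import numpy as np

latent = np.random.randn(4, 4).astype(np.float32)          # toy latent values
levels = np.linspace(-2.0, 2.0, num=9, dtype=np.float32)   # predefined quantization levels

# For each latent value, pick the index of the nearest level (these indices
# are the symbols that the probability model and arithmetic coder operate on).
indices = np.abs(latent[..., None] - levels).argmin(axis=-1)
quantized = levels[indices]                                  # dequantized values

print(indices)    # integer symbols to be entropy-coded
print(quantized)  # reconstruction of the latent after (de)quantization
```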

[00100] On the decoder side, the opposite operations are performed. The arithmetic decoder 425 and the probability model 430 first decode symbols from the bitstream to recover the quantized latent representation. Then, the dequantizer 435 reconstructs the latent representation in continuous values and passes the result to the decoder 440 to recover the input video/image, e.g., as the reconstructed video 450. Note that the probability model in this system is shared between the encoding and decoding systems. In practice, this means that a copy of the probability model is used at the encoder side, and another exact copy is used at the decoder side.

[00101] In this system, the encoder, probability model, and decoder are normally based on deep neural networks. The system is trained in an end-to-end manner by minimizing the following rate-distortion loss function:

[00102] L = D + λR

[00103] where D is a distortion loss term, R is a rate loss term, and λ is a weight that controls the balance between the two losses. The distortion loss term may be the mean square error (MSE), structural similarity (SSIM) or other metrics that evaluate the quality of the reconstructed video. Multiple distortion losses may be used and integrated into D, such as a weighted sum of MSE and SSIM. The rate loss term is normally the estimated entropy of the quantized latent representation, which indicates the number of bits necessary to represent the encoded symbols, for example, bits-per-pixel (bpp).
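The following is a minimal numerical sketch of the rate-distortion loss L = D + λR, with MSE as the distortion term and a simple histogram-based entropy estimate (in bits per pixel) standing in for the learned rate term; the arrays and the entropy estimate are illustrative only.

```python
# Sketch: rate-distortion loss L = D + lambda * R with MSE distortion and a
# histogram-based entropy estimate (bits per pixel) as a stand-in rate term.
import numpy as np

original = np.random.rand(64, 64)
reconstructed = original + 0.05 * np.random.randn(64, 64)
symbols = np.random.randint(0, 16, size=(32, 32))  # toy quantized latent symbols

# Distortion term D: mean squared error of the reconstruction.
D = np.mean((original - reconstructed) ** 2)

# Rate term R: empirical entropy of the symbols, expressed in bits per pixel.
counts = np.bincount(symbols.ravel(), minlength=16)
p = counts / counts.sum()
entropy_bits = -np.sum(p[p > 0] * np.log2(p[p > 0])) * symbols.size
R = entropy_bits / original.size   # bits per pixel of the original image

lam = 0.01                         # weight balancing distortion and rate
L = D + lam * R
print(f"D={D:.5f}, R={R:.3f} bpp, L={L:.5f}")
```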

[00104] For lossless video/image compression, the system contains only the probability model and arithmetic encoder/decoder. The system loss function contains only the rate loss, since the distortion loss is always zero (i.e., no loss of information).

[00105] I.f. Information on Video Coding for Machines (VCM)

[00106] Reducing the distortion in image and video compression is often intended to increase human perceptual quality, as humans are considered to be the end users, i.e., consuming/watching the decoded image. Recently, with the advent of machine learning, especially deep learning, there is a rising number of machines (i.e., autonomous agents) that analyze data independently from humans and that may even take decisions based on the analysis results without human intervention. Examples of such analysis are object detection, scene classification, semantic segmentation, video event detection, anomaly detection, pedestrian tracking, and the like. Example use cases and applications are self-driving cars, video surveillance cameras and public safety, smart sensor networks, smart TV and smart advertisement, person re-identification, smart traffic monitoring, drones, and the like. This may raise the following question: when decoded data is consumed by machines, shouldn't one aim at a different quality metric (other than human perceptual quality) when considering media compression in inter-machine communications? Also, dedicated algorithms for compressing and decompressing data for machine consumption are likely to be different than those for compressing and decompressing data for human consumption. The set of tools and concepts for compressing and decompressing data for machine consumption is referred to here as video coding for machines.

[00107] It is likely that the receiver-side device has multiple “machines” such as neural networks (NNs) or other machine learning machines. These multiple machines may be used in a certain combination which is, for example, determined by an orchestrator sub-system. The multiple machines may be used for example in succession, based on the output of the previously used machine, and/or in parallel. For example, a video which was compressed and then decompressed may be analyzed by one machine (NN) for detecting pedestrians, by another machine (another NN) for detecting cars, and by another machine (another NN) for estimating the depth of all the pixels in the frames.

[00108] Please note that in this document, the terms “machine” and “task neural network” are used interchangeably, and what is meant is any process or algorithm (learned or not from data), implemented via hardware, which analyzes or processes data for a certain task. In the rest of the document, further details may be given about other assumptions made regarding the machines considered herein.

[00109] Also, please note that the terms “receiver-side” or “decoder-side” are used to refer to the physical entity or device which contains one or more machines, and runs these one or more machines on some encoded and eventually decoded video representation, which is encoded by another physical entity or device, the “encoder-side device”.

[00110] The encoded video data may be stored into a memory device, for example as a file. The stored file may later be provided to another device. Alternatively, the encoded video data may be streamed from one device to another.

[00111] Referring to FIG. 5, this figure is a general illustration of the pipeline of Video Coding for Machines (VCM). A VCM encoder 510 encodes the input video 505 into a bitstream 515. A bitrate 525 may be computed 520 from the bitstream 515 in order to evaluate the size of the bitstream. A VCM decoder 530 decodes the bitstream 515 output by the VCM encoder 510. The output of the VCM decoder 530 is referred to in the figure as “Decoded data 540 for machines”. This data 540 may be considered as the decoded or reconstructed video. However, in some implementations of this pipeline, this data may not have the same or similar characteristics as the original video which was input to the VCM encoder 510. For example, this data 540 may not be easily understandable by a human by simply rendering the data onto a screen. The output of the VCM decoder 530 is then input to one or more task neural networks 545-1 through 545-X, which produce corresponding outputs 560-1 through 560-X. In the figure, for the sake of illustrating that there may be any number (e.g., “X”) of task-NNs, there are three example task-NNs, and a nonspecified one (Task-NN X). The goal of VCM is to obtain a low bitrate while guaranteeing that the task-NNs still perform well in terms of the evaluation metric associated with each task. These tasks may have their performance evaluated 550-1 through 550-X, resulting in the corresponding task's performance 555-1 through 555-X.

[00112] One of the possible approaches to realize video coding for machines is an end-to-end learned approach. In this approach, the VCM encoder and VCM decoder mainly comprise neural networks. FIG. 6 is an example of a pipeline for the end-to-end learned approach of FIG. 5, where the encoder and decoder are implemented in part with NNs. The VCM encoder 510 includes a neural network encoder 565, a lossless encoder 570, and a probability model 580. The VCM decoder 530 includes a neural network decoder 595, a lossless decoder 585, and a probability model 590.

[00113] The video 505 is input to a neural network encoder 565. The output of the neural network encoder 565 is input to a lossless encoder 570, such as an arithmetic encoder, which outputs a bitstream 515. The lossless codec (e.g., 580 and 590) may include a probability model, both in the lossless encoder (probability model 580) and in the lossless decoder (probability model 590), which predicts the probability of the next symbol to be encoded and decoded. The probability model may also be learned, for example it may be a neural network. At decoder-side, the bitstream is input to a lossless decoder 585, such as an arithmetic decoder, whose output is input to a neural network decoder 595. The output of the neural network decoder 595 is the decoded data 540 for machines, which may be input to one or more task-NNs 545.

[00114] FIG. 7 illustrates an example of how the end-to-end learned system, such as in FIG. 6, may be trained. For the sake of simplicity, only one task-NN 545 is illustrated. A rate loss 715 may be computed 710 from the output of the probability model 580. The rate loss 715 provides an approximation of the bitrate required to encode the input video 505. A task loss 730 may be computed 735 from the output of the task-NN 545.

[00115] The rate loss 715 and the task loss 730 may then be used to train the neural networks used in the system, such as the neural network encoder 565, the probability model 580, and the neural network decoder 595. Training may be performed by first computing gradients of each loss with respect to the neural networks that contribute to or affect the computation of that loss. See the training in blocks 720 and 735. The gradients are then used by an optimization method, such as Adam, for updating the trainable parameters of the neural networks.
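A minimal sketch of such a training step, assuming a PyTorch setup with toy stand-in networks and proxy loss terms (none of these modules reflect the actual architectures described herein), could look as follows.

```python
# Sketch: one training step combining a rate loss and a task loss, optimized with Adam.
import torch
import torch.nn as nn

encoder = nn.Conv2d(3, 8, kernel_size=3, padding=1)   # toy stand-in for the NN encoder
decoder = nn.Conv2d(8, 3, kernel_size=3, padding=1)   # toy stand-in for the NN decoder
task_nn = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # toy stand-in for a task-NN

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

video = torch.rand(1, 3, 32, 32)          # toy input frame
target = torch.rand(1, 1, 32, 32)         # toy task ground truth

latent = encoder(video)
rate_loss = latent.abs().mean()           # sparsification-style proxy for the rate loss
decoded = decoder(latent)
task_loss = torch.mean((task_nn(decoded) - target) ** 2)

loss = rate_loss + task_loss              # weighted sum (weights of 1 here)
optimizer.zero_grad()
loss.backward()                           # gradients w.r.t. encoder and decoder parameters
optimizer.step()                          # Adam update of the trainable parameters
print(loss.item())
```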

[00116] The machine tasks may be performed at decoder side (instead of at encoder side) for multiple reasons, for example because the encoder-side device does not have the capabilities (computational, power, memory) for running the neural networks that perform these tasks, or because some aspects or the performance of the task neural networks may have changed or improved by the time that the decoder-side device needs the tasks results (e.g., different or additional semantic classes, better neural network architecture). Also, there could be a customization need, where different clients would run different neural networks for performing these machine learning tasks.

[00117] Alternatively to an end-to-end trained codec, a video codec for machines can be realized by using a traditional codec such as H.266/VVC.

[00118] Alternatively, as described already above for the case of video coding for humans, another possible design may comprise using a traditional codec such as H.266/VVC, which includes one or more neural networks. In one possible implementation, the one or more neural networks may replace one of the components of the traditional codec, such as:

[00119] 1) One or more in-loop filters.

[00120] 2) One or more intra-prediction modes.

[00121] 3) One or more inter-prediction modes.

[00122] 4) One or more transforms.

[00123] 5) One or more inverse transforms.

[00124] 6) One or more probability models, for lossless coding.

[00125] 7) One or more post-processing filters.

[00126] In another possible implementation, the one or more neural networks may function as an additional component, such as:

[00127] 1) One or more additional in-loop filters.

[00128] 2) One or more additional intra-prediction modes.

[00129] 3) One or more additional inter-prediction modes.

[00130] 4) One or more additional transforms.

[00131] 5) One or more additional inverse transforms.

[00132] 6) One or more additional probability models, for lossless coding.

[00133] 7) One or more additional post-processing filters.

[00134] Alternatively, another possible design may comprise using any codec architecture (such as a traditional codec, or a traditional codec which includes one or more neural networks, or an end-to-end learned codec), and having a post-processing neural network which adapts the output of the decoder so that the output can be analyzed more effectively by one or more machines or task neural networks. For example, the encoder and decoder may be conformant to the H.266/VVC standard, a post-processing neural network takes the output of the decoder, and the output of the post-processing neural network is then input to an object detection neural network. In this example, the object detection neural network is the machine or task neural network.

[00135] FIG. 8 illustrates an example including an encoder 810, a decoder 830, a post-processing filter 835, and a set of task-NNs 545. The encoder 810 and decoder 830 may represent a traditional image or video codec, such as a codec conformant with the VVC/H.266 standard, or may represent an end-to-end (E2E) learned image or video codec. The encoder 810 produces a bitstream 515-1, and the post-processing filter 835 produces decoded data 540-1 for machines. The post-processing filter 835 may be a neural network-based filter. The task-NNs 545 may be neural networks that perform tasks such as object detection, object segmentation, object tracking, and the like.

[00136] I.g. Enabling Neural Network based Intra Coding for Conventional Video Decoding

[00137] This section contains an example of enabling neural network based intra coding for conventional video decoding.

[00138] I.g.1. General Framework

[00139] In FIG. 9, a framework 900 is illustrated that allows an end-to-end learned image compression (LIC) system to be used together with a conventional video compression (CVC) codec, where the LIC performs intra-frame coding and the CVC performs primarily inter-frame coding, and where the LIC-decoded intra frame may be used as a reference frame in a CVC codec. This framework 900 may also be considered to be a system. The CVC codec may be any conventional codec, for example, VVC or HEVC.

[00140] The framework 900 has a Mixed Learned and Conventional codec (MLC) encoder 915 that has inputs of “intra” frames 905 and “inter” frames 910, and produces an output of a bitstream 965 for LCVC-encoded inter frames. The “intra” frames 905 are handled by the LIC codec 930, which produces LIC-decoded intra frames 940. The LIC codec 930 has an LIC encoder 920 that forms a bitstream 925 for LIC-encoded intra frames, which are operated on by an LIC decoder 935, creating the LIC-decoded intra frames 940. The CVC encoder 945 in the MLC encoder 915 has an LL-CVC codec 950 that operates on the LIC-decoded intra frames 940 and produces LL-CVC-decoded intra frames 955. The “inter” frames 910 are operated on by the LCVC encoding 960, which also uses the LL-CVC-decoded intra frames 955 and produces the bitstream 965 for LCVC-encoded inter frames.

[00141] The MLC decoder 970 comprises an LIC decoder 973 that operates on the bitstream 925 for LIC-encoded intra frame and forms LIC-decoded intra-frames 978. An LL-CVC encoder 979 produces a bitstream 980 for LL-CVC-encoded intra frames from the LIC-decoded intra-frame 978. The bitstreams 980 and 965 are ordered 985 to a CVC bitstream 990. A CVC decoder 995 operates on the CVC bitstream 990 to produce decoded frames 998.

[00142] An output of a LIC (de)coder, i.e., a LIC-decoded intra frame 940, is input to a first set of algorithms that are part of a CVC encoder 945. This set is referred to as a LL-CVC encoder 950. Outputs of an LL-CVC encoder 950 comprise a decoded intra frame 955 and, in some embodiments, additional information such as partitioning information. An output 955 of the LL-CVC encoder, i.e., at least an LL-CVC-decoded intra frame, is input to a second set of algorithms that are part of a CVC encoder, referred to as a LCVC encoder 960, where the encoder may be used for inter-frame coding purposes. The LCVC encoder 960 may perform lossy compression. In some embodiments, the first set of algorithms and the second set of algorithms may be the same set of algorithms. In some other embodiments, the first set of algorithms and the second set of algorithms may be different. In one example, the first set of algorithms is a set of lossless coding algorithms, whereas the second set of algorithms is a set of lossy coding algorithms.

[00143] The transmitted bitstream from MLC encoder 915 to MLC decoder 970 may comprise the bitstream 925 output by the LIC encoder 920, and the bitstream 965 output by the LCVC encoder 960.

[00144] In the MLC encoder 915, the intra frame is encoded and decoded by the LIC codec 930. The LIC encoder 920 gets as input an intra frame 905 and outputs a bitstream 925 representing the LIC-encoded intra frame. The LIC encoder 920 may for example comprise one or more NN encoders, one or more quantization operations, one or more probability models, and one or more arithmetic encoders (see the section above for examples of end-to-end learned image codecs). The bitstream 925 output by the LIC encoder 920 is input to the LIC decoder 935, which outputs the LIC-decoded intra frame 940. The LIC decoder 935 may for example comprise one or more arithmetic decoders, one or more probability models, one or more inverse quantization operations, and one or more NN decoders.

[00145] I.g.2. Pre-filtering of intra frames

[00146] The output 940 of the LIC decoder 935 may be filtered before providing the output 940 to the LL-CVC encoding. There may be one or more filters, which are referred to as CVC pre-filter. This proposed filtering may modify the LIC-decoded intra frame 940 to be more similar to intra frames that are expected by LCVC encoding, i.e., more similar to the intra frames that were considered when the LCVC encoding tools were designed, where the similarity may be measured for example based on a distortion metric such as mean-squared error (MSE). However, the instant techniques are not limited to any specific similarity metric. FIG. 10 is an illustration of the proposed filtering for a framework 900-1.

[00147] The CVC pre-filter 1010, 1020 is used both at encoder side and decoder side, to filter the LIC-decoded intra frame 940, 978, respectively. At encoder side, the output of the CVC pre-filter 1010 is input to the LL-CVC codec 950. At decoder side, the output of the CVC pre-filter 1020 is input to the LL-CVC encoding 979.

[00148] The current conventional video codec international standard for both human consumption and machine consumption is the Versatile Video Coding standard (H.266/VVC). End-to-end (E2E) learned codecs in VCM tend to blur (or, anyway, decrease the visual quality of) the information that is not useful for the target task network, while keeping the important objects/areas relatively sharp. The post-processing filters that have been previously designed aim to enhance the entire frame. However, this may not be a very good approach in some situations, as the taskNN cares mostly about the target objects of the machine.

[00149] One way of having post-processing filters (often referred to as post-filters) for enhancing only the important areas is to use the “loss term” of the target taskNN in the training stage of the post-filter. However, in VCM use cases, quite often the target taskNN is not known during the development/training phase and at the encoder side. Moreover, even if the taskNN was known, the generated post-processing filter may not be useful for other machine vision tasks, and this is problematic in generalizing the post-filter usage for multiple tasks.

[00150] II. Overview of Exemplary Embodiments

[00151] The problems described above are addressed by the exemplary embodiments, which propose different designs of a post-processing filter for Neural Network (NN)-based image and video codecs that use certain auxiliary information derived from, e.g., conventional video compression (CVC) codecs in order to enhance the subjective and/or objective quality of the decoded content for human and/or machine consumption. It is noted that the quality may be subjective, objective, or both subjective and objective. The objective quality for human consumption may be measured by MSE or PSNR. The objective quality for machine tasks may be measured by the performance of machine tasks, for example, mean Average Precision (mAP) for object detection and instance segmentation tasks. Subjective quality is mainly measured by experiments involving evaluations by testers. Often, metrics such as mean opinion score (MOS) are reported after the viewing session.
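As an illustrative sketch of one such objective quality metric, PSNR can be computed from the MSE between the original and the decoded image as follows; the images here are random stand-ins used purely for demonstration.

```python
# Sketch: PSNR as an objective quality metric for human consumption.
import numpy as np

def psnr(original, decoded, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((original.astype(np.float64) - decoded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_value ** 2) / mse)

original = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
noise = np.random.randint(-5, 6, size=original.shape)
decoded = np.clip(original.astype(int) + noise, 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(original, decoded):.2f} dB")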

[00152] The auxiliary information may be obtained through various means such as re-encoding the NN-decoded image or video in the decoder and/or encoder side by CVC encoders. The CVC codec may output certain information to be used as enhancement features for the postprocessing filter(s). Alternatively, the auxiliary information may be extracted from the CVC bitstream by parsing some or all of the CVC bitstream in the decoder side.

[00153] The auxiliary information may be used as additional input to the post-processing filter, for example, as one or more additional input channels. In an alternative method, the auxiliary information may be injected into one or more of the components of the NN instead of or in addition to using them as additional input channels to the post-processing NN.
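A minimal sketch, assuming a PyTorch environment, of using auxiliary information as an additional input channel is given below; the per-pixel QP map is a hypothetical example of auxiliary information, and the small convolutional network is a stand-in for an NN post-processing filter, not the actual filter architecture.

```python
# Sketch: auxiliary information (e.g., a per-pixel QP map) as an extra input channel
# to a stand-in NN post-processing filter.
import torch
import torch.nn as nn

decoded_frame = torch.rand(1, 3, 64, 64)             # NNC-decoded RGB frame
qp_map = torch.full((1, 1, 64, 64), 32.0) / 63.0     # hypothetical per-pixel QP, normalized

post_filter = nn.Sequential(                          # toy stand-in post-filter
    nn.Conv2d(4, 16, kernel_size=3, padding=1),       # 3 image channels + 1 auxiliary channel
    nn.ReLU(),
    nn.Conv2d(16, 3, kernel_size=3, padding=1),
)

filter_input = torch.cat([decoded_frame, qp_map], dim=1)  # concatenate along channels
enhanced = post_filter(filter_input)
print(enhanced.shape)  # torch.Size([1, 3, 64, 64])
```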

[00154] III. Further Details of Exemplary Embodiments

[00155] The exemplary embodiments propose methods for enhancing the objective and/or subjective quality of decoded images and video with neural network-based codecs (referred to as NNC-codecs hereafter) by using one or more NN-based post-processing filters. However, at least some of the embodiments herein are applicable also to the case of enhancing the objective and/or subjective quality of images and video decoded by conventional codecs. The postprocessing filters may use certain auxiliary information as “helpers” in order to identify the regions-of-interest (ROIs) for the target use cases. The target consumption use case may be for human consumption and/or one or more machine consumption tasks.

[00156] In an embodiment, the auxiliary/helper information may be obtained through encoding the NNC-decoded content with conventional video codecs (CVC) such as ones that apply the Versatile Video Coding (VVC/H.266) standard.

[00157] The auxiliary information from CVC may be one or more of the following items:

[00158] - High level information such as picture/slice type (for example, intra/inter picture/slice, or any other picture/slice types that are defined in video coding standards such as IDR, CRA, RADL, RASL, and the like).

[00159] - Quantization parameters of the pictures/slices/CTUs/CUs/PUs.

[00160] - Temporal layer ID of the pictures/slices.

[00161] - Block partitioning information such as CTU partitioning, CU partitioning, sub-block partitioning, and the like.

[00162] - Block level coding mode information, such as intra or inter coded block.

[00163] - Block level intra prediction information such as intra prediction directions, intra prediction modes, and the like.

[00164] - Block level inter prediction information such as inter prediction modes (affine, merge, AMVP, and the like.), reference pictures used for inter prediction, motion vectors or motion vector field information in block level or sub-block level.

[00165] - Reference picture resampling information, such as the scaling windows of the current picture and the reference picture(s).

[00166] - Block level transform modes such as DCT, DST, or corresponding LFNST transform modes or sub-block transform coding information.

[00167] - Block level DCT, DST or LFNST coefficients or alike representation of the input block.

[00168] - Block level information about the residual information of that block. For example, whether the block includes any residual coding or not. The auxiliary information may include the block level residuals as well.

[00169] - The information about in-loop and post processing filters such as Adaptive Loop Filter (ALF), Sample Adaptive Offset (SAO), De-Blocking Filter (DBF), neural network-based filters, and the like. This information may include the following:

[00170] - a) Which of these filters is active for each CU, PU, TU, CTU, tile, slice or picture, i.e., ON/OFF signaling.

[00171] - b) Any weighting or scaling which is applied to the output of any of these filters.

[00172] - c) Coefficients and parameters of the filter(s) which were determined at encoder side, for example the adaptation parameters of the ALF.

[00173] - Information on encoding configuration, such as the lambda value or alike indicating a weighting between rate and distortion in encoder's coding mode choices, or bit budget for pictures/slices/CTUs/CUs/PUs. Such information could be delivered to the method used herein through for example SEI message mechanism of the underlying codec.

[00174] - Information on encoder-side analysis results, such as scene cut detection. Such information could be delivered to the method of the invention through for example SEI message mechanism of the underlying codec.

[00175] - The information about pre-processing operations prior to encoding for example Motion Compensated Temporal Filter (MCTF), denoising, and the like. Such information could be delivered to the method of the invention through for example SEI message mechanism of the underlying codec.

[00176] Note that the term “block” may refer to a Coding Unit (CU), Prediction Unit (PU), Transform Unit (TU), Coding Tree Unit (CTU), a Tile, a Slice, and the like.

[00177] FIG. 11 illustrates an example where the NNC-decoded content is re-encoded and then decoded with a CVC codec for extracting the auxiliary information, in accordance with an exemplary embodiment. The system (which may also be considered to be a framework) 1100 includes an encoder side 1110 and a decoder side 1130, communicating using a communications channel 1127, which may include wired and/or wireless networks 1170. In FIG. 11 (and all other figures herein) there is an encoding path and a decoding path. The encoding path in FIG. 11 is from the original video 1115 to an output (or outputs in other examples) of the encoder side, which in this case is the bitstream 1125. The decoding path is from the received bitstream (or other received information) to the output of the decoder side 1130, which in this case is the enhanced video 1165. It is also noted that in this example (and this is applicable to other examples herein) video 1115 is input and enhanced video 1165 is output, although text is also possible for both.

[00178] The encoder side has original video 1115 that is input to and processed by the NNC encoder 1120 to produce the bitstream 1125 that is communicated to the decoder side 1130 via the communications channel 1127. The decoder side 1130 includes an NNC decoder 1135, which produces the decoded video 1140 from the bitstream 1125. The decoded video is routed to the CVC encoder 1145 and the NN post-filter 1160. The CVC encoder 1145 encodes the decoded video, which the CVC decoder 1150 then processes and the decoder’s output is routed to the auxiliary information extraction 1155. The auxiliary information extraction 1155 extracts the auxiliary information 1156 and sends this to the NN post-filter 1160, which uses this information when processing the decoded video 1140 to create the enhanced video 1165.

[00179] In an embodiment, the CVC codec 1180 (e.g., a combination of 1145 and 1150) may output certain information 1156 to be used as enhancement features for the postprocessing filter(s) 1160 (e.g., the extraction 1155 might not be performed). Alternatively, the auxiliary information 1156 may be extracted 1155 from the CVC bitstream 1146 by parsing some or all of the CVC bitstream (from the encoder 1145) in the decoder side 1130. In one implementation option of this alternative embodiment, the bitstream 1146 may be input to the NN post-filter 1160 directly. In another implementation option of this alternative embodiment, the bitstream 1146 may be input to a second neural network (e.g., as the auxiliary information extraction 1155), which extracts one or more features from the bitstream. The extracted one or more features may be then input to the NN post-filter 1160.

[00180] In an embodiment, the auxiliary information may be extracted from the decisions made by the CVC encoder 1145. In an embodiment, the encoder side 1110 may signal information about whether to use the auxiliary information as additional input to the NN post-filter.

[00181] In an embodiment, the encoder side 1110 may signal information about which of the auxiliary information items outlined above shall be used as additional input to the NN post-filter. For example, the encoder may signal that only information about the QP shall be used for a certain block or picture or video.

[00182] In an embodiment, one or more auxiliary features may be generated from the one or more auxiliary information. The one or more auxiliary features 1156 may be used as one or more inputs to a task network 1190 (e.g., along with the enhanced video 1165), for example, injected into one of the intermediate layers of a NN post-processing filter 1160. In an embodiment, multiple types of auxiliary information may be combined to generate one auxiliary feature.

[00183] In an embodiment, instead of or in addition to using CVC to get the auxiliary information, a detection network 1151 may be used to locate the “important” regions, e.g., of video. The detection network 1151 may be a region-proposal neural network, i.e., a neural network that outputs one or more regions or bounding boxes that are likely to contain objects. The output of the detection network may be used to enhance the auxiliary information that falls into the important regions and/or to attenuate the auxiliary information that is outside of the important regions.
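The following is a minimal sketch (with a hypothetical bounding box and random arrays standing in for real data) of how auxiliary information falling inside detected important regions could be kept while information outside them is attenuated; the specific weighting values are illustrative only.

```python
# Sketch: weighting an auxiliary-information map using detected "important" regions.
import numpy as np

aux_map = np.random.rand(64, 64)        # toy auxiliary feature map (e.g., per-pixel QP)
boxes = [(10, 10, 30, 40)]              # hypothetical detections: (top, left, bottom, right)

weight = np.full(aux_map.shape, 0.25)   # attenuate outside the important regions
for top, left, bottom, right in boxes:
    weight[top:bottom, left:right] = 1.0  # keep (or enhance) inside the regions

weighted_aux = aux_map * weight
print(weighted_aux.mean(), aux_map.mean())
```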

[00184] In an embodiment, the region of interest or important regions of the video could be obtained by some algorithmic approaches 1151 other than a detection network. For example, regions may be determined based on specific frequency features such as high frequencies, saliency estimation methods, feature contrastive algorithms, and the like. Such algorithmic approaches 1151 could rely on the video or the frequency domain transforms of the video codec and/or the auxiliary information.

[00185] In an embodiment, the architecture of obtaining auxiliary information for post-processing filter(s) may be a chain of codecs. FIG. 11A is a block diagram of a system similar to that in FIG. 11, only with a different chain of codecs. In FIG. 11, the chain of codecs is an NNC codec (1120 and 1135) followed by a CVC codec 1180, whereas these are reversed in the chain shown in FIG. 11A. The system 1100-1 has an encoder side 1110-1 with a CVC encoder 1145, and the decoder side 1130-1 has a CVC decoder 1150 and a codec 1180-1 with an NNC encoder 1120 and NNC decoder 1135.

[00186] Thus, as part of a chain of codecs, a CVC codec (1145 and 1150) may be used as the main codec and an NNC codec 1180-1 is used at the decoder side 1130-1. In another example, an NNC codec (1191, 1192) may be used as the main codec and a second NNC codec 1180-1 may be used at the decoder side 1130 to provide auxiliary information for the NN post-filter 1160. The NNC at the decoder side 1130 may be used to extract features from the one or more layers of the NNC, and the extracted features may be used as the auxiliary information for the NN post-filter. The input to the NNC codec at the decoder side may be the decoded image or frame from the main codec.

[00187] In an embodiment, the CVC codec (1145 and 1150) may use fast operations to determine the auxiliary information. For example, one or more of the below processes may be used for CVC encoding (and/or decoding):

[00188] 1) Downsampling the NNC decoded data before encoding the data with CVC.

[00189] 2) One or more of the components or channels of NNC decoded images and videos may be used for CVC encoding.

[00190] 3) Temporal subsampling may be applied to the NNC-decoded videos.

[00191] 4) Entropy coding of the CVC may be bypassed, including the estimation of probability of symbols to entropy-encode/decode.

[00192] 5) CVC encoder 1145 may be configured in such a way that the encoder uses a subset of coding modes of the CVC codec.

[00193] In an embodiment, the transcoding or re-encoding of NNC-decoded data with CVC may be performed in the encoder side 1110 of the system 1100, as shown in FIG. 12. An encoding path is from the original video 1115 to the bitstream 1215, and a decoding path is from the (received) bitstream 1215 to the enhanced video 1165. The system 1200 is similar to system 1100, and only differences are described. In this example, on the encoder side 1110, the bitstream output of the NNC encoder 1120 is an NN bitstream 1125, which is input to a MUX (multiplexer) 1210. An NNC decoder 1220 decodes the NN bitstream 1125 to create decoded video 1225, which is encoded by a CVC encoder 1230, the output of which is decoded by a CVC decoder 1235. Auxiliary information 1270 is extracted 1240 from the output of the CVC decoder 1235, and the extracted auxiliary information is multiplexed with the NN bitstream 1125 to form the bitstream 1215.

[00194] On the decoder side, the bitstream 1215 is demultiplexed by the Demux (demultiplexer) 1250 into an NN bitstream 1260 and auxiliary information 1270. The NN bitstream 1260 is decoded by the NNC decoder 1135 into decoded video 1140, and the NN postfilter 1160 uses the auxiliary information 1270 when processing the decoded video 1140 to create the enhanced video 1165.

[00195] In this case, the auxiliary information 1270 is derived in the encoder side 1110 and signaled to the decoder side 1130 by encapsulating the information into the final bitstream 1215 or alternatively could be signaled through the SEI mechanism of the codec or any out-of-band signaling mechanism.

[00196] In an embodiment, the encoder side 1110 includes only a CVC encoder 1230, where the information extracted during encoding, for example from CVC encoder decisions, could become available at the decoder side to the NN post-filter 1160 for example by signaling them into the bitstream or SEI-like mechanisms, or out-of-band signaling. In an embodiment, the auxiliary information 1270 may be compressed before signaling to the decoder side 1130.

[00197] In another embodiment, the transcoding of NNC-decoded data may be performed both in the encoder side 1110 and decoder side 1130. In this case, the encoder side transcoding may be performed more than once to find the best possible configurations to achieve suitable auxiliary information for post-processing filter(s). The best configuration parameter(s) is/are signaled into the bitstream for performing the transcoding in the decoder side 1130. The configuration may indicate for example enabling or disabling certain tools in CVC codec, quantization parameters, and the like. In another example, the auxiliary information may be generated at both the encoder side 1110 and decoder side 1130. The auxiliary information 1270 generated at the encoder side 1110 may be transferred to the decoder side 1130 and combined with the auxiliary information generated at the decoder side (e.g., auxiliary information 1156 as generated in FIG. 11) to serve as input to the post-processing filter(s) 1160.

[00198] In one or more of the previous embodiments, the auxiliary information 1156/1270 obtained from a CVC decoder or from a region-proposal neural network (or from similar sources) may be used to combine the output of the post-processing filter 1160 with its input. In particular, the auxiliary information 1156/1270 may be used to weight the contribution of the post-processing filter 1160 to the final output of the decoder.

[00199] In one example, the combination may be performed as follows: output = Ppfilter(input) * M + (1-M)*input, where output is the final output of the decoder side 1130, Ppfilter is the post-processing filter, input is the input to the post-processing filter 1160, and M is a mask (for example a binary mask, or a mask of real numbers in the range [0,1]) that is either determined based at least on the auxiliary information (aux) or signaled from the encoder to the decoder. The mask M is used to combine the enhanced video output and the decoded video output with different weights. For example, the mask values are greater than 0.5 for the pixels in the regions where the enhanced video output has better quality than the decoded video output. In one example, mask M may be determined from the auxiliary information using one or more predefined rules or by one or more neural networks. In another example, mask M is determined at the encoder side by optimizing a rate-distortion loss function and signaled from the encoder to the decoder.
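A minimal numerical sketch of this combination, with random arrays standing in for the post-filter input and output and a toy mask, is given below; it is an illustration of the formula above, not the actual decoder implementation.

```python
# Sketch: output = Ppfilter(input) * M + (1 - M) * input, with a mask M in [0, 1].
import numpy as np

decoded = np.random.rand(64, 64, 3)          # input to the post-processing filter
enhanced = np.clip(decoded + 0.1, 0.0, 1.0)  # stand-in for Ppfilter(input)

# Toy mask: favor the enhanced output in the centre region, the decoded output elsewhere.
M = np.zeros((64, 64, 1))
M[16:48, 16:48, :] = 0.8

output = enhanced * M + (1.0 - M) * decoded
print(output.shape)
```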

[00200] FIG. 13 is a block diagram of a system 1300 where the auxiliary information obtained from a CVC decoder or from a region-proposal neural network (or from similar sources) is used to combine the output of the post-processing filter with its input, in an exemplary embodiment. An encoding path is from the original video 1115 to the bitstream 1125, and a decoding path is from the (received) bitstream 1125 to the combined video 1320. This example is similar to FIG. 11, and only the differences are described. The combiner 1310 combines the enhanced video 1165 and the decoded video 1140 using the mask that is either determined by the auxiliary information or signaled from the encoder to the decoder, creating combined video 1320, which is used as the output of the decoder side 1130 in this system 1300. The mask 1391 can be determined by two alternative methods, for instance: either determined using the auxiliary information or signaled from encoder to decoder. For the former case, there is a link (indicated by 1391-2) from the auxiliary information 1155 to the combiner 1310. The mask is derived from the auxiliary information 1155 extracted from the output of CVC decoder 1150. For the latter case, the NNC encoder 1120 derives the mask 1391-1 and sends it via the bitstream to the combiner 1310.

[00201] In an alternative or additional embodiment, the bitstream output by the encoder of the NNC codec is losslessly decoded into a latent tensor, which is an intermediate representation of the input data. The latent tensor may include two portions or subsets of the latent tensor. Consider FIG. 14, which illustrates this concept. An encoding path is from the original video 1115 to the bitstream 1125, and a decoding path is from the (received) bitstream 1125 to the enhanced video 1165. In this example (e.g., as compared with FIG. 11), the system 1400 has a modified encoder side 1110 and a modified decoder side 1130. The modified encoder side 1110 includes a CVC codec 1450 (including both a CVC encoder and decoder, not shown in this example), that forms the auxiliary information 1155, which is input to the NNC encoder 1120. The NNC encoder 1120 forms a latent tensor 1420 having portions 1 and 2, and this is described below. The decoder side 1130 includes a lossless decoder 1410 that outputs to a latent tensor 1420. Note that it is assumed the communications channel 1127 (or receiver, not shown) is lossless or any errors are corrected in this example.

[00202] The latent tensor 1420 has two output portions, which on the decoder side are a first portion 1421 (routed to the NNC decoder 1135) and a second portion 1422 (routed to the NN post-filter 1160). In further detail, the first portion 1421 is input to the NNC decoder 1135, whereas the second portion 1422 is input to the post-filter 1160 as auxiliary data with respect to the NNC-decoded data, i.e., the NN post-filter 1160 gets as input both the NNC-decoded data (decoded video 1140) and the second portion 1422 of the latent tensor. The second portion 1422 of the latent tensor may have been derived at the encoder side 1110 also based on the CVC codec’s auxiliary information. For example, the CVC codec’s auxiliary information may be input to the NNC encoder 1120 together with the image/frame/video (e.g., original video 1115) to encode. The NNC encoder 1120 may output a first portion 1421 of a latent tensor representing the image/frame/video encoded by the NNC encoder, and a second portion 1422 of latent tensor representing the CVC encoder’s auxiliary information processed and encoded by the NNC encoder 1120. In this example, the latent tensor is “intermediate” data inside an NNC encoder 1120 in this figure. The NNC encoder 1120 comprises an NN encoder, quantizer, probability model and entropy encoder. The latent tensor 1420 refers to the quantized data of the output of the NN encoder.
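A minimal sketch of splitting the losslessly decoded latent tensor into the two portions described above (the channel dimensions and split point are arbitrary and chosen only for illustration) might look as follows.

```python
# Sketch: splitting a latent tensor into a portion for the NNC decoder and a
# portion used as auxiliary input to the NN post-filter.
import torch

latent = torch.rand(1, 48, 32, 32)        # losslessly decoded latent tensor (toy shape)
split_channels = 32                        # arbitrary split point for illustration

portion_1 = latent[:, :split_channels]     # routed to the NNC decoder
portion_2 = latent[:, split_channels:]     # routed to the NN post-filter as auxiliary data

print(portion_1.shape, portion_2.shape)    # (1, 32, 32, 32) and (1, 16, 32, 32)
```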

[00203] In one implementation option, the system may be trained end-to-end, i.e., both the NNC codec (1120 and 1135) and the post-processing filter(s) 1160 may be trained jointly at least for part of the training of each of the NNC codec and the post-processing filter(s).

[00204] Additionally, the second portion 1422 of the latent tensor 1420 (when performed by the NNC encoder 1120) could be compressed using a feature compression scheme, and provided as additional information along with the NNC bitstream 1125. Such information could be part of the bitstream, transported via some SEI-based scheme, or out of band.

[00205] The proposed methods could be also used in the MLC codec where a CVC encoder is available in the MLC decoder. FIG. 15 illustrates an example of using the methods herein in an MLC system 1500. Compare with FIG. 9. An encoding path is from the original video 1115 to the outputs 925, 965, and a decoding path is from the (received) outputs 925, 965 to the decoded frames 998. In the example of FIG. 15, the MLC encoder 915 includes a CVC encoder 1545 that accepts input of LIC-decoded intra frames 940 and includes a CVC codec 1550 that outputs CVC-decoded intra frames 1555. The CVC encoder 945 also includes CVC encoding 1560, which accepts both inter frames 910 and CVC-decoded intra frames 1555, and outputs the bitstream 965 for CVC-encoded inter frames.

[00206] The MLC decoder 970 includes CVC encoding 1579, which accepts LIC- decoded intra frames 978 from the LIC decoder 973, and produces bitstream 1580 for CVC- encoded intra frames. The MLC decoder 970 also includes a CVC decoder 1510 that outputs to the auxiliary information extraction 1520, which outputs auxiliary information 1521 to the postprocessing filter 1530. The bitstreams 1580 and 965 are ordered 985 to a CVC bitstream 990. The CVC decoder 995 takes this bitstream 990 and outputs to the post-processing filter 1530, which also uses the auxiliary information 1521 and outputs decoded frames 998.

[00207] In further detail, the LIC-coded images (corresponding to NNC-decoded images) are re-encoded with a CVC codec 1550. The one or more items of auxiliary information 1521 may be derived from the corresponding bitstreams of the CVC-coded intra frames. The one or more of the extracted auxiliary information or the features may be transferred to the MLC decoder 970 and further used as additional input to the post-processing filter 1530 for filtering the corresponding intra frames and/or also non-intra frames. The one or more of the extracted auxiliary information or features may be compressed at the MLC encoder side before transferring to the MLC decoder and decompressed at the MLC decoder side.

[00208] In an embodiment, different auxiliary information 1521 may be derived from the intra frames and inter frames based on the CVC coded bitstream to be used as additional input to the post-processing filters. In an embodiment, the intra and inter frames auxiliary information may be combined, for example a weighted averaging, and then used as inputs to the post-processing filter.

[00209] Turning to FIG. 16, this figure is a block diagram of an apparatus 110 suitable for implementing the exemplary embodiments. One nonlimiting example of the apparatus 110 is a wireless, typically mobile device that can access a wireless network. The apparatus 110 includes one or more processors 120, one or more memories 125, one or more transceivers 130, and one or more network (N/W) interfaces (IF(s)) 161, interconnected through one or more buses 127. Each of the one or more transceivers 130 includes a receiver, Rx, 132 and a transmitter, Tx, 133. The one or more buses 127 may be address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, and the like.

[00210] The apparatus 110 may communicate via wired, wireless, or both interfaces. For wireless communication, the one or more transceivers 130 are connected to one or more antennas 128. The one or more memories 125 include computer program code 123. The N/W I/F(s) communicate via one or more wired links 162.

[00211] The apparatus 110 includes a control module 140, comprising one of or both parts 140-1 and/or 140-2, which include reference 90 that includes encoder 1110, decoder 1130, or a codec of both 1110/1130, and which may be implemented in a number of ways. For ease of reference, reference 90 will be referred to herein as a codec. The control module 140 may be implemented in hardware as control module 140-1, such as being implemented as part of the one or more processors 120. The control module 140-1 may be implemented also as an integrated circuit or through other hardware such as a programmable gate array. In another example, the control module 140 may be implemented as control module 140-2, which is implemented as computer program code 123 and is executed by the one or more processors 120. For instance, the one or more memories 125 and the computer program code 123 may be configured to, with the one or more processors 120, cause the user equipment 110 to perform one or more of the operations as described herein. The codec 90 may be similarly implemented as codec 90-1 as part of control module 140-1, or as codec 90-2 as part of control module 140-2, or both.

[00212] The computer readable memories 125 may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, flash memory, firmware, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The computer readable memories 125 may be means for performing storage functions. The processors 120 may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs) and processors based on a multi-core processor architecture, as non-limiting examples. The processors 120 may be means for performing functions, such as controlling the apparatus 110, and other functions as described herein.

[00213] In general, the various embodiments of the apparatus 110 can include, but are not limited to, cellular telephones (such as smart phones, mobile phones, cellular phones, voice over Internet Protocol (IP) (VoIP) phones, and/or wireless local loop phones), tablets, portable computers, room audio equipment, immersive audio equipment, vehicles or vehicle-mounted devices for, e.g., wireless V2X (vehicle-to-everything) communication, image capture devices such as digital cameras, gaming devices, music storage and playback appliances, Internet appliances (including Internet of Things, loT, devices), loT devices with sensors and/or actuators for, e.g., automation applications, as well as portable units or terminals that incorporate combinations of such functions, laptops, laptop-embedded equipment (LEE), laptop-mounted equipment (LME), Universal Serial Bus (USB) dongles, smart devices, wireless customer-premises equipment (CPE), an Internet of Things (loT) device, a watch or other wearable, a head-mounted display (HMD), a vehicle, a drone, a medical device and applications (e.g., remote surgery), an industrial device and applications (e.g., a robot and/or other wireless devices operating in an industrial and/or an automated processing chain contexts), a consumer electronics device, a device operating on commercial and/or industrial wireless networks, and the like. That is, the apparatus 110 could be any device that may be capable of wireless or wired communication.

[00214] Without in any way limiting the scope, interpretation, or application of the claims appearing below, a technical effect and advantage of one or more of the example embodiments disclosed herein is enhancement of the subjective and objective quality of the decoded content for human and/or machine consumption. Another technical effect and advantage of one or more of the example embodiments disclosed herein is potentially improved spatial localization of quality enhancement filtering when compared to filtering that is applied to an entire picture.

[00215] The following are additional examples.

[00216] Example 1. A method, comprising:

[00217] decoding a received bitstream comprising information corresponding to an image or a video to form decoded information; and

[00218] filtering the decoded information to form an output image or video, the filtering performed to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance, the filtering using auxiliary information derived from a codec.

[00219] Example 2. The method according to example 1, wherein:

[00220] the decoding is performed using a decoding path through which the bitstream passes for the decoding, the bitstream is an end result of an encoding path, and the codec from which the auxiliary information is derived is in one or both of the encoding path or decoding path;

[00221] the filtering uses auxiliary information derived from a codec in one or both of the encoding path or decoding path.

[00222] Example 3. The method according to example 1 or 2, wherein:

[00223] the auxiliary information forms part of the received bitstream; and

[00224] the decoding comprises:

[00225] demultiplexing the received bitstream into the information and into the auxiliary information;

[00226] decoding the information corresponding to an image or a video to form the decoded information, the decoding the information performed at least in part by a first neural network; and

[00227] filtering the decoded information using the auxiliary information to create the output image or video.

[00228] Example 4. The method according to example 1 or 2, wherein:

[00229] the decoding the received bitstream comprises decoding the information using at least a neural network to form the decoded information;

[00230] the method further comprises encoding the decoded information using at least conventional video compression techniques to form encoded information, decoding the encoded information using at least conventional video compression techniques to form second decoded information;

[00231] the method further comprises extracting the auxiliary information using the second decoded information;

[00232] the filtering is performed using the decoded information and the auxiliary information to form enhanced information; and

[00233] the method comprises combining the decoded information and the enhanced information, using at least a mask determined using the auxiliary information, to form combined information as the output image or video.

[00234] Example 5. The method according to example 1 or 2, wherein:

[00235] the decoding the received bitstream comprises decoding the information using at least a neural network to form the decoded information;

[00236] the method further comprises encoding the decoded information using at least conventional video compression techniques to form encoded information, decoding the encoded information using at least conventional video compression techniques to form second decoded information;

[00237] the method further comprises extracting the auxiliary information using the second decoded information;

[00238] the filtering is performed using the decoded information and the auxiliary information to form enhanced information; and

[00239] the method comprises combining the decoded information and the enhanced information, using at least a mask received at the decoder, to form combined information as the output image or video.

[00240] Example 6. The method according to example 1 or 2, wherein:

[00241] the decoding the received bitstream comprises decoding the information using at least a neural network to form the decoded information;

[00242] the method further comprises encoding the decoded information using at least conventional video compression techniques to form encoded information, decoding the encoded information using at least conventional video compression techniques to form second decoded information;

[00243] the method further comprises extracting the auxiliary information using the second decoded information;

[00244] the filtering is performed using the decoded information and the auxiliary information to form enhanced information; and

[00245] the method comprises combining the decoded information and the enhanced information, using at least a mask derived from the auxiliary information, to form combined information as the output image or video.

[00246] Example 7. The method according to example 1 or 2, wherein:

[00247] the decoding the received bitstream comprises:

[00248] decoding the information using at least a lossless decoder to form a latent tensor; and

[00249] decoding a first part of the latent tensor at least by using a neural network to form the decoded information; and

[00250] the filtering uses information from a second part of the latent tensor as the auxiliary information, together with the decoded information, to form enhanced information as the output image or video.
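A sketch of Example 7, under the assumption (made here only for illustration) that the auxiliary part of the latent tensor occupies its last few channels:

def decode_split_latent(bitstream, lossless_decoder, nn_decoder, nn_filter,
                        aux_channels=4):
    # Losslessly decode the bitstream into the full latent tensor, e.g. shape (C, H, W).
    latent = lossless_decoder(bitstream)
    # Split: the first part is decoded into the picture, the second part
    # (here, the last aux_channels channels) serves as auxiliary information.
    content, aux = latent[:-aux_channels], latent[-aux_channels:]
    decoded = nn_decoder(content)
    # The filter enhances the decoded picture using the auxiliary part of the latent.
    return nn_filter(decoded, aux)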

[00251] Example 8. The method according to example 1 or 2, wherein:

[00252] the received bitstream comprises a bitstream for a learned image compression-encoded intra frame and a bitstream for a conventional video compression-encoded inter frame, the intra frame and inter frame corresponding to video;

[00253] the decoding the received bitstream comprises:

[00254] decoding the received bitstream using a decoder performing learned image compression techniques to form a learned image compression-decoded intra frame;

[00255] encoding the learned image compression-decoded intra frame using conventional video compression techniques to form a bitstream for conventional video compression-encoded intra frame;

[00256] decoding the conventional video compression-encoded intra frame to form the auxiliary information;

[00257] ordering the bitstream for the conventional video compression-encoded intra frame and the bitstream for the conventional video compression-encoded inter frame to form a conventional video compression bitstream;

[00258] decoding the conventional video compression bitstream to form the decoded information; and

[00259] the filtering filters the decoded information using the auxiliary information to form the output video.
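The mixed pipeline of Example 8 might be arranged as in the following sketch, where lic_decode, cvc_encode, cvc_decode, order_bitstreams and nn_filter are hypothetical stand-ins for the learned image compression decoder, the conventional encoder and decoder, the bitstream ordering step and the filter:

def decode_mixed_lic_cvc(lic_intra_bits, cvc_inter_bits, lic_decode,
                         cvc_encode, cvc_decode, order_bitstreams, nn_filter):
    # Decode the intra frame with the learned image compression (LIC) decoder.
    intra = lic_decode(lic_intra_bits)
    # Re-encode the decoded intra frame with the conventional video codec.
    cvc_intra_bits = cvc_encode(intra)
    # Decoding the re-encoded intra frame yields the auxiliary information.
    aux = cvc_decode(cvc_intra_bits)
    # Order the re-encoded intra bitstream with the received inter bitstream
    # into one conventional bitstream and decode it to obtain the video.
    decoded = cvc_decode(order_bitstreams(cvc_intra_bits, cvc_inter_bits))
    # Filter the decoded video using the auxiliary information.
    return nn_filter(decoded, aux)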

[00260] Example 9. The method according to one of examples 1 to 8, wherein the auxiliary information comprises one or more of the following:

[00261] high level information including picture or slice type;

[00262] quantization parameters of one or more of pictures, slices, CTUs, CUs, or PUs;

[00263] temporal layer ID of one or more of pictures or slices;

[00264] block partitioning information;

[00265] block level coding mode information, comprising one or more of intra or inter coded block information;

[00266] block level intra prediction information;

[00267] block level inter prediction information;

[00268] reference picture resampling information;

[00269] block level transform modes;

[00270] block level DCT, DST or LFNST coefficients or other representation of an input block;

[00271] block level information about residual information of that corresponding block;

[00272] information about in-loop and post processing filters comprising one or more of the following:

[00273] which of these filters is active for each CU, PU, TU, CTU, tile, slice or picture;

[00274] any weighting or scaling which is applied to the output of any of these filters; or

[00275] coefficients and parameters of a filter or filters that were determined at an encoder side;

[00276] information on encoding configuration;

[00277] information on encoder-side analysis results; or

[00278] information about pre-processing operations performed prior to encoding.
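Purely to illustrate how a decoder implementation might carry the items listed in Example 9, the dataclass below groups a subset of them; the field names and types are hypothetical and do not correspond to any standardized syntax.

from dataclasses import dataclass, field
from typing import Any, Dict, Optional
import numpy as np

@dataclass
class AuxiliaryInfo:
    slice_type: Optional[str] = None              # high level info, e.g. "I", "P", "B"
    qp_map: Optional[np.ndarray] = None           # QPs of pictures/slices/CTUs/CUs/PUs
    temporal_layer_id: Optional[int] = None       # temporal layer ID of picture or slice
    partition_map: Optional[np.ndarray] = None    # block partitioning information
    coding_mode_map: Optional[np.ndarray] = None  # intra/inter coded block information
    transform_info: Dict[str, Any] = field(default_factory=dict)  # DCT/DST/LFNST data
    filter_info: Dict[str, Any] = field(default_factory=dict)     # in-loop/post filter data
    encoder_config: Dict[str, Any] = field(default_factory=dict)  # encoding configuration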

[00279] Example 10. A method, comprising:

[00280] encoding information corresponding to an image or a video to form encoded information;

[00281] extracting, using a codec, auxiliary information from one of the encoded information or the information corresponding to the image or video;

[00282] forming a bitstream from the auxiliary information and the encoded information; and

[00283] outputting the bitstream for use by a decoder to perform filtering to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance.
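A compact sketch of the encoder side of Example 10; encoder, extract_aux and mux are hypothetical callables supplied by the implementation.

def encode_with_aux(source, encoder, extract_aux, mux):
    # Encode the source image or video (e.g. with a learned encoder).
    coded = encoder(source)
    # Extract auxiliary information, using a codec, from the coded data or the source.
    aux = extract_aux(coded, source)
    # Form a single bitstream from the auxiliary information and the encoded
    # information, to be output for use by the decoder-side filter.
    return mux(coded, aux)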

[00284] Example 11. The method according to example 10, wherein:

[00285] the encoding information performs encoding using at least a first neural network;

[00286] the extracting is performed on the encoded information and comprises:

[00287] decoding the encoded information using at least a second neural network to form decoded information;

[00288] performing, using the codec, encoding and decoding of the decoded information to form second decoded information; and

[00289] extracting the auxiliary information from the second decoded information.

[00290] Example 12. The method according to example 10 or 11, wherein the codec uses conventional video compression techniques for its encoding and its decoding.
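The encoder-side extraction of Examples 11 and 12 might look as follows, with nn_decoder standing in for the second neural network and codec_encode/codec_decode for a conventional video codec; all names are hypothetical.

def extract_aux_encoder_side(coded, nn_decoder, codec_encode, codec_decode, parse_aux):
    # Decode the neural-network-coded data with a second neural network.
    decoded = nn_decoder(coded)
    # Round-trip the decoded content through a conventional codec (per Example 12).
    second_decoded = codec_decode(codec_encode(decoded))
    # Extract the auxiliary information from the second decoded information.
    return parse_aux(second_decoded)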

[00291] Example 13. The method according to example 10, wherein:

[00292] the encoding information performs encoding using at least a neural network;

[00293] the extracting is performed on the information corresponding to the image or video and uses a codec that uses conventional video compression techniques for encoding and decoding the information.

[00294] Example 14. The method according to one of examples 10 to 13, wherein the forming the bitstream comprises multiplexing the auxiliary information and the encoded information together to form the bitstream.
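One elementary realization of the multiplexing in Example 14 is a length-prefixed container, shown below; this particular framing is an assumption made for illustration, not a normative bitstream syntax.

import struct

def mux(coded: bytes, aux: bytes) -> bytes:
    # Length-prefix each part so the decoder can later demultiplex the bitstream.
    return struct.pack(">II", len(coded), len(aux)) + coded + aux

def demux(bitstream: bytes):
    # Recover the encoded information and the auxiliary information.
    n_coded, n_aux = struct.unpack(">II", bitstream[:8])
    return bitstream[8:8 + n_coded], bitstream[8 + n_coded:8 + n_coded + n_aux]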

[00295] Example 15. The method according to one of examples 10 to 14, wherein the auxiliary information comprises one or more of the following:

[00296] high level information including picture or slice type;

[00297] quantization parameters of one or more of pictures, slices, CTUs, CUs, or PUs;

[00298] temporal layer ID of one or more of pictures or slices;

[00299] block partitioning information;

[00300] block level coding mode information, comprising one or more of intra or inter coded block information;

[00301] block level intra prediction information;

[00302] block level inter prediction information;

[00303] reference picture resampling information;

[00304] block level transform modes;

[00305] block level DCT, DST or LFNST coefficients or other representation of an input block;

[00306] block level information about residual information of that corresponding block;

[00307] information about in-loop and post processing filters comprising one or more of the following:

[00308] which of these filters is active for each CU, PU, TU, CTU, tile, slice or picture;

[00309] any weighting or scaling which is applied to the output of any of these filters; or

[00310] coefficients and parameters of a filter or filters that were determined at an encoder side;

[00311] information on encoding configuration;

[00312] information on encoder-side analysis results; or

[00313] information about pre-processing operations performed prior to encoding.

[00314] Example 16. A computer program, comprising code for performing the methods of any of examples 1 to 15, when the computer program is run on a computer.

[00315] Example 17. The computer program according to example 16, wherein the computer program is a computer program product comprising a computer-readable medium bearing computer program code embodied therein for use with the computer.

[00316] Example 18. The computer program according to example 16, wherein the computer program is directly loadable into an internal memory of the computer.

[00317] Example 19. An apparatus comprising means for performing:

[00318] decoding a received bitstream comprising information corresponding to an image or a video to form decoded information; and

[00319] filtering the decoded information to form an output image or video, the filtering performed to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance, the filtering using auxiliary information derived from a codec.

[00320] Example 20. The apparatus according to example 19, wherein:

[00321] the decoding is performed using a decoding path through which the bitstream passes for the decoding, the bitstream is an end result of an encoding path, and the codec from which the auxiliary information is derived is in one or both of the encoding path or decoding path;

[00322] the filtering uses auxiliary information derived from a codec in one or both of the encoding path or decoding path.

[00323] Example 21. The apparatus according to example 19 or 20, wherein:

[00324] the auxiliary information forms part of the received bitstream; and

[00325] the decoding comprises:

[00326] demultiplexing the received bitstream into the information and into the auxiliary information;

[00327] decoding the information corresponding to an image or a video to form the decoded information, the decoding the information performed at least in part by a first neural network; and

[00328] filtering the decoded information using the auxiliary information to create the output image or video.

[00329] Example 22. The apparatus according to example 19 or 20, wherein:

[00330] the decoding the received bitstream comprises decoding the information using at least a neural network to form the decoded information;

[00331] the means are further configured for performing: encoding the decoded information using at least conventional video compression techniques to form encoded information, decoding the encoded information using at least conventional video compression techniques to form second decoded information;

[00332] the means are further configured for performing: extracting the auxiliary information using the second decoded information;

[00333] the filtering is performed using the decoded information and the auxiliary information to form enhanced information; and

[00334] the means are further configured for performing: combining the decoded information and the enhanced information, using at least a mask determined using the auxiliary information, to form combined information as the output image or video.

[00335] Example 23. The apparatus according to example 19 or 20, wherein:

[00336] the decoding the received bitstream comprises decoding the information using at least a neural network to form the decoded information;

[00337] the means are further configured for performing: encoding the decoded information using at least conventional video compression techniques to form encoded information, decoding the encoded information using at least conventional video compression techniques to form second decoded information;

[00338] the means are further configured for performing: extracting the auxiliary information using the second decoded information;

[00339] the filtering is performed using the decoded information and the auxiliary information to form enhanced information; and

[00340] the means are further configured for performing: combining the decoded information and the enhanced information, using at least a mask received at the decoder, to form combined information as the output image or video.

[00341] Example 24. The apparatus according to example 19 or 20, wherein:

[00342] the decoding the received bitstream comprises decoding the information using at least a neural network to form the decoded information;

[00343] the means are further configured for performing: encoding the decoded information using at least conventional video compression techniques to form encoded information, decoding the encoded information using at least conventional video compression techniques to form second decoded information;

[00344] the means are further configured for performing: extracting the auxiliary information using the second decoded information;

[00345] the filtering is performed using the decoded information and the auxiliary information to form enhanced information; and

[00346] the means are further configured for performing: combining the decoded information and the enhanced information, using at least a mask derived from the auxiliary information, to form combined information as the output image or video.

[00347] Example 25. The apparatus according to example 19 or 20, wherein:

[00348] the decoding the received bitstream comprises:

[00349] decoding the information using at least a lossless decoder to form a latent tensor; and

[00350] decoding a first part of the latent tensor at least by using a neural network to form the decoded information; and

[00351] the filtering uses information from a second part of the latent tensor as the auxiliary information and the decoded information to form enhanced information as the output image or video.

[00352] Example 26. The apparatus according to example 19 or 20, wherein:

[00353] the received bitstream comprises a bitstream for a learned image compression-encoded intra frame and a bitstream for a conventional video compression-encoded inter frame, the intra frame and inter frame corresponding to video;

[00354] the decoding the received bitstream comprises:

[00355] decoding the received bitstream using a decoder performing learned image compression techniques to form a learned image compression-decoded intra frame;

[00356] encoding the learned image compression-decoded intra frame using conventional video compression techniques to form a bitstream for conventional video compression-encoded intra frame;

[00357] decoding the conventional video compression-encoded intra frame to form the auxiliary information;

[00358] ordering the bitstream for the conventional video compression-encoded intra frame and the bitstream for the conventional video compression-encoded inter frame to form a conventional video compression bitstream;

[00359] decoding the conventional video compression bitstream to form the decoded information; and

[00360] the filtering filters the decoded information using the auxiliary information to form the output video.

[00361] Example 27. The apparatus according to one of examples 19 to 26, wherein the auxiliary information comprises one or more of the following:

[00362] high level information including picture or slice type;

[00363] quantization parameters of one or more of pictures, slices, CTUs, CUs, or PUs;

[00364] temporal layer ID of one or more of pictures or slices;

[00365] block partitioning information;

[00366] block level coding mode information, comprising one or more of intra or inter coded block information;

[00367] block level intra prediction information;

[00368] block level inter prediction information;

[00369] reference picture resampling information;

[00370] block level transform modes;

[00371] block level DCT, DST or LFNST coefficients or other representation of an input block;

[00372] block level information about residual information of that corresponding block;

[00373] information about in-loop and post processing filters comprising one or more of the following:

[00374] which of these filters is active for each CU, PU, TU, CTU, tile, slice or picture;

[00375] any weighting or scaling which is applied to the output of any of these filters; or

[00376] coefficients and parameters of a filter or filters that were determined at an encoder side;

[00377] information on encoding configuration;

[00378] information on encoder-side analysis results; or

[00379] information about pre-processing operations performed prior to encoding.

[00380] Example 28. An apparatus comprising means for performing:

[00381] encoding information corresponding to an image or a video to form encoded information;

[00382] extracting, using a codec, auxiliary information from one of the encoded information or the information corresponding to the image or video;

[00383] forming a bitstream from the auxiliary information and the encoded information; and

[00384] outputting the bitstream for use by a decoder to perform filtering to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance.

[00385] Example 29. The apparatus according to example 28, wherein:

[00386] the encoding information performs encoding using at least a first neural network;

[00387] the extracting is performed on the encoded information and comprises:

[00388] decoding the encoded information using at least a second neural network to form decoded information;

[00389] performing, using the codec, encoding and decoding of the decoded information to form second decoded information; and

[00390] extracting the auxiliary information from the second decoded information.

[00391] Example 30. The apparatus according to example 28 or 29, wherein the codec uses conventional video compression techniques for its encoding and its decoding.

[00392] Example 31. The apparatus according to example 28, wherein:

[00393] the encoding information performs encoding using at least a neural network;

[00394] the extracting is performed on the information corresponding to the image or video and uses a codec that uses conventional video compression techniques for encoding and decoding the information.

[00395] Example 32. The apparatus according to one of examples 28 to 31, wherein the forming the bitstream comprises multiplexing the auxiliary information and the encoded information together to form the bitstream.

[00396] Example 33. The apparatus according to one of examples 28 to 32, wherein the auxiliary information comprises one or more of the following:

[00397] high level information including picture or slice type;

[00398] quantization parameters of one or more of pictures, slices, CTUs, CUs, or PUs;

[00399] temporal layer ID of one or more of pictures or slices;

[00400] block partitioning information;

[00401] block level coding mode information, comprising one or more of intra or inter coded block information;

[00402] block level intra prediction information;

[00403] block level inter prediction information;

[00404] reference picture resampling information;

[00405] block level transform modes;

[00406] block level DCT, DST or LFNST coefficients or other representation of an input block;

[00407] block level information about residual information of that corresponding block;

[00408] information about in-loop and post processing filters comprising one or more of the following:

[00409] which of these filters is active for each CU, PU, TU, CTU, tile, slice or picture;

[00410] any weighting or scaling which is applied to the output of any of these filters; or

[00411] coefficients and parameters of a filter or filters that were determined at an encoder side;

[00412] information on encoding configuration;

[00413] information on encoder-side analysis results; or

[00414] information about pre-processing operations performed prior to encoding.

[00415] Example 34. The apparatus of any preceding apparatus example, wherein the means comprises:

[00416] at least one processor; and

[00417] at least one memory including computer program code, the at least one memory and computer program code configured to, with the at least one processor, cause the performance of the apparatus.

[00418] Example 35. An apparatus, comprising:

[00419] one or more processors; and

[00420] one or more memories including computer program code,

[00421] wherein the one or more memories and the computer program code are configured, with the one or more processors, to cause the apparatus to:

[00422] decode a received bitstream comprising information corresponding to an image or a video to form decoded information; and

[00423] filter the decoded information to form an output image or video, the filtering performed to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance, the filtering using auxiliary information derived from a codec.

[00424] Example 36. A computer program product comprising a computer-readable storage medium bearing computer program code embodied therein for use with a computer, the computer program code comprising:

[00425] code for decoding a received bitstream comprising information corresponding to an image or a video to form decoded information; and

[00426] code for filtering the decoded information to form an output image or video, the filtering performed to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance, the filtering using auxiliary information derived from a codec.

[00427] Example 37. An apparatus, comprising:

[00428] one or more processors; and

[00429] one or more memories including computer program code,

[00430] wherein the one or more memories and the computer program code are configured, with the one or more processors, to cause the apparatus to:

[00431] encode information corresponding to an image or a video to form encoded information;

[00432] extract, using a codec, auxiliary information from one of the encoded information or the information corresponding to the image or video;

[00433] form a bitstream from the auxiliary information and the encoded information; and

[00434] output the bitstream for use by a decoder to perform filtering to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance.

[00435] Example 38. A computer program product comprising a computer-readable storage medium bearing computer program code embodied therein for use with a computer, the computer program code comprising:

[00436] code for encoding information corresponding to an image or a video to form encoded information;

[00437] code for extracting, using a codec, auxiliary information from one of the encoded information or the information corresponding to the image or video;

[00438] code for forming a bitstream from the auxiliary information and the encoded information; and

[00439] code for outputting the bitstream for use by a decoder to perform filtering to improve visual quality of the output image or video for human consumption or improve performance of machine tasks for machine consumption or improve both the visual quality and the performance.

[00440] As used in this application, the term “circuitry” may refer to one or more or all of the following:

[00441] (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry) and

[00442] (b) combinations of hardware circuits and software, such as (as applicable): (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions) and

[00443] (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.

[00444] This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.

[00445] Embodiments herein may be implemented in software (executed by one or more processors), hardware (e.g., an application specific integrated circuit), or a combination of software and hardware. In an example embodiment, the software (e.g., application logic, an instruction set) is maintained on any one of various conventional computer-readable media. In the context of this document, a “computer-readable medium” may be any media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer, with one example of a computer described and depicted, e.g., in FIG. 16. A computer-readable medium may comprise a computer-readable storage medium (e.g., memories 125 or other device) that may be any media or means that can contain, store, and/or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer. A computer-readable storage medium does not comprise propagating signals.

[00446] If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined.

[00447] Although various aspects are set out above, other aspects comprise other combinations of features from the described embodiments, and not solely the combinations described above.

[00448] It is also noted herein that while the above describes example embodiments of the invention, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present invention.

[00449] The following abbreviations that may be found in the specification and/or the drawing figures are defined as follows:

[00450] ALF Adaptive Loop Filter

[00451] AVC Advanced Video Coding

[00452] AMVP Advanced Motion Vector Prediction

[00453] codec coder/decoder

[00454] CRA Clean Random Access

[00455] CTU Coding Tree Unit

[00456] CU Coding Unit

[00457] CVC Conventional Video Compression

[00458] Demux demultiplexer

[00459] DCT Discrete Cosine Transform

[00460] DST Discrete Sine Transform

[00461] E2E End-to-End

[00462] HEVC High Efficiency Video Coding

[00463] ID identification

[00464] IDR Instantaneous Decoder Refresh

[00466] LFNST Low-Frequency Non-Separable Transform

[00467] LIC Learned Image Compression

[00468] mAP mean Average Precision

[00469] ML Machine Learning

[00470] MLC Mixed Learned and Conventional

[00471] MSE Mean Squared Error

[00472] MUX multiplexer

[00473] NAL Network Abstraction Layer

[00474] NN Neural Network

[00475] NNC neural network-based codec

[00476] PSNR Peak signal-to-noise ratio

[00477] PU Prediction Unit

[00478] QP Quantization Parameter

[00479] RADL Random Access Decodable Leading

[00480] RASL Random Access Skipped Leading

[00481] ROI Region-of-Interest

[00482] SEI Supplemental Enhancement Information

[00483] SSIM Structural Similarity Index Measure

[00484] TU Transform Unit

[00485] VCM Video Coding for Machines

[00486] VSEI Versatile Supplemental Enhancement Information

[00487] VVC Versatile Video Coding