Title:
METHOD OR APPARATUS RESCALING A TENSOR OF FEATURE DATA USING INTERPOLATION FILTERS
Document Type and Number:
WIPO Patent Application WO/2024/076518
Kind Code:
A1
Abstract:
At least a method and an apparatus are presented for efficiently encoding or decoding video using neural networks wherein the bitstream is adapted to hybrid machine/human vision applications. For example, the scalable decoding comprises applying to a tensor of reconstructed data a neural network-based feature synthesis processing to generate a tensor of input feature representative of a feature of image data samples, and resizing the tensor of input feature to generate a tensor of output feature intended to be fed to a neural network-based vision inference processing to generate a collection of inference results. Advantageously, resizing the tensor of input feature adapts at least a dimension of the tensor of input feature to the neural network-based vision inference processing.

Inventors:
CHOI HYOMIN (US)
RACAPE FABIEN (US)
OZYILKAN EZGI (US)
UL HAQ SYED (CA)
Application Number:
PCT/US2023/034255
Publication Date:
April 11, 2024
Filing Date:
October 02, 2023
Assignee:
INTERDIGITAL VC HOLDINGS INC (US)
International Classes:
H04N19/117; G06N3/045; G06T9/00; G06V10/82; H04N19/187; H04N19/33; H04N19/70; H04N19/85
Domestic Patent References:
WO2022128137A1, 2022-06-23
Foreign References:
US20220101095A1, 2022-03-31
Other References:
HYOMIN CHOI ET AL: "Scalable Image Coding for Humans and Machines", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 18 July 2021 (2021-07-18), XP091012673
H. CHOI, I. V. BAJIC: "Scalable Image Coding for Humans and Machines", IEEE TRANSACTIONS ON IMAGE PROCESSING, vol. 31, 2022, pages 2739 - 2754
Attorney, Agent or Firm:
KOLCZYNSKI, Ronald (US)
Claims:
CLAIMS

1. A method, comprising: obtaining a tensor of reconstructed data (Ŷ_1) representative of image data samples partially reconstructed from a base layer of a bitstream, the tensor of reconstructed data comprises a number of channels (C_1) of two-dimensional data; applying to the tensor of reconstructed data (Ŷ_1) a neural network-based feature synthesis processing (f_s) to generate a tensor of input feature (F_I) representative of a feature of image data samples, the tensor of input feature comprises a number of channels (C_l) of two-dimensional data; resizing the tensor of input feature (F_I) to generate a tensor of output feature (F_O), the tensor of output feature comprises a number of channels of two-dimensional data; and applying to the tensor of output feature (F_O), a neural network-based vision inference processing (v_e) to generate a collection of inference results (T); wherein resizing the tensor of input feature comprises applying at least one interpolation filter to the tensor of input feature to adapt at least a dimension of the tensor of input feature to the neural network-based vision inference processing.

2. An apparatus comprising a memory and one or more processors, wherein the one or more processors are configured to obtain a tensor of reconstructed data (Ŷ_1) representative of image data samples partially reconstructed from a base layer of a bitstream, the tensor of reconstructed data comprises a number of channels (C_1) of two-dimensional data; apply to the tensor of reconstructed data (Ŷ_1) a neural network-based feature synthesis processing (f_s) to generate a tensor of input feature (F_I) representative of a feature of image data samples, the tensor of input feature comprises a number of channels (C_l) of two-dimensional data; resize the tensor of input feature (F_I) to generate a tensor of output feature (F_O), the tensor of output feature comprises a number of channels of two-dimensional data; and apply to the tensor of output feature (F_O), a neural network-based vision inference processing (v_e) to generate a collection of inference results (T); wherein to resize the tensor of input feature, at least one interpolation filter is applied to the tensor of input feature to adapt at least a dimension of the tensor of input feature to the neural network-based vision inference processing.

3. The method of claim 1 or the apparatus of claim 2, wherein the number of channels of the tensor of input feature and the number of channels of the tensor of output feature are equal and wherein resizing the tensor of input feature further comprises applying one 2D interpolation filter per channel of the tensor of input feature.

4. The method of claim 1 or the apparatus of claim 2, wherein the number of channels of the tensor of input feature and the number of channels of the tensor of output feature are equal and wherein resizing the tensor of input feature further comprises applying a same 2D interpolation filter for each channel of the tensor of input feature.

5. The method of any of claims 3, 4 or the apparatus of any of claims 3, 4 further comprising obtaining parameters representative of a 2D interpolation filter from metadata of the bitstream.

6. The method of claim 4 or the apparatus of claim 4, further comprising obtaining an index from metadata of the bitstream, the index indicating an interpolation filter among a set of interpolation filters.

7. The method of claim 1 or the apparatus of claim 2, wherein the number of channels of the tensor of input feature and the number of channels of the tensor of output feature are different and wherein resizing the tensor of input feature further comprises applying at least one convolutional filter to the tensor of input feature to scale the number of channels.

8. The method of claim 7 or the apparatus of claim 7 further comprising obtaining parameters representative of at least one convolutional filter from metadata of the bitstream.

9. The method of any of claims 5, 6, 8 or the apparatus of any of claims 5, 6, 8 wherein obtaining parameters representative of at least one interpolation filter from metadata of the bitstream further comprises: parsing a flag indicating a resizing of the tensor of input feature (F_I), responsive to the flag indicating a resizing of the tensor of input feature (F_I), parsing a flag indicating whether the resizing is inferred from a configuration associated with an expected dimension of the neural network-based vision inference processing or the resizing is obtained from metadata of the bitstream; and responsive to the flag indicating the resizing is obtained from metadata of the bitstream, parsing parameters representative of a resized dimension of the tensor of output feature (F_O), parsing an indication of a number of interpolation filters used in the resizing, parsing parameters representative of an interpolation filter among the number of interpolation filters used in the resizing.

10. The method of any of claims 1, 3-9 or the apparatus of any of claims 2-9 further comprising obtaining a tensor of enhancement data (Ŷ_2) representative of image data samples partially reconstructed from an enhancement layer of the bitstream; applying to the tensor of reconstructed data (Ŷ_1) and to the tensor of enhancement data (Ŷ_2), a neural network-based image synthesis processing (g_s) to generate a reconstructed image.

11. A method, comprising: obtaining a tensor of input feature (F_I) representative of a feature of image data samples, the tensor of input feature comprises a number of channels (C_l) of two-dimensional data; and resizing the tensor of input feature (F_I) to generate a tensor of output feature (F_O), the tensor of output feature comprises a number of channels of two-dimensional data adapted to a dimension of a neural network-based vision inference processing (v_e); wherein resizing the tensor of input feature comprises applying at least one interpolation filter to the tensor of input feature to adapt a dimension of the tensor of input feature to the dimension of the neural network-based vision inference processing (v_e).

12. An apparatus comprising a memory and one or more processors, wherein the one or more processors are configured to obtain a tensor of input feature (F_I) representative of a feature of image data samples, the tensor of input feature comprising a number of channels (C_l) of two-dimensional data; and resize the tensor of input feature (F_I) to generate a tensor of output feature (F_O), the tensor of output feature comprising a number of channels of two-dimensional data adapted to a dimension of a neural network-based vision inference processing (v_e); wherein to resize the tensor of input feature, at least one interpolation filter is applied to the tensor of input feature to adapt a dimension of the tensor of input feature to the dimension of the neural network-based vision inference processing (v_e).

13. The method of claim 11 or the apparatus of claim 12, wherein the number of channels of the tensor of input feature and the number of channels of the tensor of output feature are equal and wherein to resize the tensor of input feature, one 2D interpolation filter is applied per channel of the tensor of input feature.

14. The method of claim 11 or the apparatus of claim 12, wherein the number of channels of the tensor of input feature and the number of channels of the tensor of output feature are equal and wherein to resize the tensor of input feature, a same 2D interpolation filter is applied to each channel of the tensor of input feature.

15. The method of claim 11 or the apparatus of claim 12, wherein the number of channels of the tensor of input feature and the number of channels of the tensor of output feature are different and wherein to resize the tensor of input feature, at least one convolutional filter is applied to the tensor of input feature to scale the number of channels.

16. A non-transitory program storage device, readable by a computer, tangibly embodying a program of instructions executable by the computer for performing the method according to any one of claims 1, 3-10.

17. A bitstream comprising scalable neural network-based coded data representative of image data samples for a neural network-based vision inference processing and associated metadata, wherein the associated metadata comprises at least one of: an indication of a resizing of a tensor of input feature; an indication of whether the resizing is inferred from a configuration associated with an expected dimension of the neural network-based vision inference processing or the resizing is obtained from the associated metadata of the bitstream; one or more parameters representative of a resized dimension of the tensor of output feature; an indication of a number of interpolation filters used in the resizing; and one or more parameters representative of an interpolation filter among a number of interpolation filters used in the resizing.

18. A computer readable medium comprising a bitstream according to claim 17.

Description:
METHOD OR APPARATUS RESCALING A TENSOR OF FEATURE DATA USING INTERPOLATION FILTERS

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/414053, filed October 7, 2022, which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

At least one of the present embodiments generally relates to a method or an apparatus for video encoding or decoding, and more particularly, to a method or an apparatus for decoding a video using scalable NN-based transforms, the decoding further comprising rescaling a tensor of feature data intended to be fed to an NN-based vision inference task.

BACKGROUND

Traditional compression standards reach low bitrates by transforming and degrading the video content using methods optimized to preserve signal fidelity or visual quality. To that end, traditional image and video coding schemes usually employ prediction, including motion vector prediction, and transform to leverage spatial and temporal redundancy in the video content. Generally, intra or inter prediction is used to exploit the intra or inter frame correlation, then the differences between the original image and the predicted image, often denoted as prediction errors or prediction residuals, are transformed, quantized, and entropy coded. To reconstruct the video, the compressed data are decoded by inverse processes corresponding to the entropy coding, quantization, transform, and prediction.

In recent years, novel image and video compression methods based on neural networks (NNs) have been developed. In contrast with traditional methods which apply pre-defined prediction modes and transforms, neural network-based methods rely on many parameters that are learned on a large dataset during a training stage, by iteratively minimizing a loss function using some gradient descent algorithm. In the case of compression, the loss function is defined by the rate-distortion cost, where the rate stands for the estimation of the bitrate of the encoded bitstream, and the distortion quantifies the quality of the decoded video against the original input with respect to some visual quality metric. Traditionally, the quality of the decoded input image is optimized, for example based on the measure of the mean squared error or an approximation of the human-perceived visual quality.
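To make the rate-distortion objective concrete, the following minimal sketch computes a cost of the form R + λ·D with mean squared error as the distortion term; the function name, the per-pixel rate normalization and the λ value are illustrative assumptions rather than the training code of any particular codec.

```python
import torch.nn.functional as F

def rate_distortion_loss(x, x_hat, bits_estimate, lam=0.01):
    """Illustrative rate-distortion cost L = R + lambda * D (hypothetical values).

    x             : original image batch of shape (B, 3, H, W)
    x_hat         : decoded image batch, same shape
    bits_estimate : estimated number of bits of the encoded latent (e.g., from an entropy model)
    lam           : rate-distortion trade-off factor
    """
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    rate = bits_estimate / num_pixels        # rate term: bits per pixel
    distortion = F.mse_loss(x_hat, x)        # distortion term: MSE against the original
    return rate + lam * distortion
```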

However, an increasing amount of visual content is now also analyzed directly by machines via deep learning-based computer vision algorithms. Existing methods for coding and decoding show some limitations as compression schemes are not optimized for computer vision algorithms. Therefore, there is a need to improve the state of the art by proposing a compression scheme of images and videos targeting both human and machine consumption.

SUMMARY

The drawbacks and disadvantages of the prior art are solved and addressed by the general aspects described herein.

According to a first aspect, there is provided a method. The method comprises scalable video decoding by obtaining a tensor of reconstructed data representative of image data samples partially reconstructed from a base layer of a bitstream, the tensor of reconstructed data comprises a number of channels of two-dimensional data; applying to the tensor of reconstructed data a neural network-based feature synthesis processing to generate a tensor of input feature representative of a feature of image data samples, the tensor of input feature comprises a number of channels of two-dimensional data; resizing the tensor of input feature to generate a tensor of output feature, the tensor of output feature comprises a number of channels of two-dimensional data; and applying to the tensor of output feature, a neural network-based vision inference processing to generate a collection of inference results. Advantageously, resizing the tensor of input feature comprises applying at least one interpolation filter to the tensor of input feature to adapt at least a dimension of the tensor of input feature to the neural network-based vision inference processing.

According to another aspect, there is provided an apparatus. The apparatus comprises one or more processors, wherein the one or more processors are configured to implement the method for video decoding according to any of its variants. According to another aspect, the apparatus for video decoding comprises means for implementing the method for video decoding according to any of its variants.

According to another general aspect of at least one embodiment, a 2D interpolation filter per channel of the tensor of input feature is applied to resize the spatial dimensions of the tensor.

According to another general aspect of at least one embodiment, at least one convolutional filter is applied to the tensor of input feature to scale the number of channels of the tensor.

According to another general aspect of at least one embodiment, information (filter type, filter coefficients, index of a filter in a pre-determined set of filters) representative of a filter to use in feature tensor resizing is parsed from metadata of the bitstream.

According to another general aspect of at least one embodiment, there is provided a device comprising an apparatus according to any of the decoding embodiments; and at least one of (i) an antenna configured to receive a signal, the signal including the video block, (ii) a band limiter configured to limit the received signal to a band of frequencies that includes the video block, or (iii) a display configured to display an output representative of the video block. According to another general aspect of at least one embodiment, there is provided a non-transitory computer readable medium containing data content generated according to any of the described decoding embodiments or variants.

According to another general aspect of at least one embodiment, there is provided a signal comprising video data generated according to any of the described decoding embodiments or variants.

According to another general aspect of at least one embodiment, a bitstream is formatted to include data content generated according to any of the described decoding embodiments or variants.

According to another general aspect of at least one embodiment, there is provided a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the described decoding embodiments or variants.

These and other aspects, features and advantages of the general aspects will become apparent from the following detailed description of exemplary embodiments, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, examples of several embodiments are illustrated.

Figure 1 illustrates a block diagram of an example apparatus in which various aspects of the embodiments may be implemented.

Figure 2 illustrates a block diagram of an embodiment of traditional video encoder.

Figure 3 illustrates a block diagram of an embodiment of traditional video decoder.

Figure 4 illustrates a block diagram of an embodiment of an end-to-end neural-network-based video compression scheme.

Figure 5 illustrates a block diagram of an embodiment of a basic pipeline of an NN-based machine vision processing.

Figure 6 illustrates a block diagram of an embodiment of a basic pipeline of an NN-based video compression and machine vision processing.

Figure 7 illustrates a block diagram of a variant embodiment of a basic pipeline of an NN-based video compression and machine vision processing in which various aspects of the embodiments may be implemented.

Figure 8 illustrates a block diagram of a detailed embodiment of a basic pipeline of an NN-based vision processing.

Figures 9 and 10 illustrate a generic tensor scaling method according to a general aspect of at least one embodiment.

Figure 11 illustrates a generic decoding method implementing tensor scaling according to a general aspect of at least one embodiment.

Figures 12, 13, 14, 15 illustrate variant embodiments of the generic tensor scaling method.

Figure 16 illustrates a generic method implementing parsing of information related to tensor scaling filtering according to a general aspect of at least one embodiment.

Figure 17 shows two remote devices communicating over a communication network in accordance with an example of present principles in which various aspects of the embodiments may be implemented.

Figure 18 shows the syntax of a signal in accordance with an example of present principles.

DETAILED DESCRIPTION

Various embodiments relate to a video coding system in which, in at least one embodiment, it is proposed to adapt video decoding tools to hybrid machine/human vision applications. Different embodiments are proposed hereafter, introducing some tools modifications to increase coding efficiency and improve the codec consistency when both applications are targeted. Amongst others, a decoding method, and a decoding apparatus implementing a tensor resizing module based on this principle are proposed.

The present aspects are described in the context of ISO/MPEG Working Group 2, called Video Coding for Machines (VCM), and of JPEG-AI. Video Coding for Machines (VCM) is an MPEG activity aiming to standardize a bitstream format generated by compressing either a video stream or previously extracted features. The bitstream should enable multiple machine vision tasks by embedding the necessary information for performing multiple tasks at the receiver, such as segmentation, object tracking, as well as reconstruction of the video contents for human consumption. In parallel, JPEG is standardizing JPEG-AI, which is projected to involve an end-to-end NN-based image compression method that can also be optimized for some machine analytics tasks. One can easily envision other standards of a similar flavor and forthcoming systems in the near future for the VCM paradigm, as use cases such as video surveillance, autonomous vehicles, and smart cities are already ubiquitous.

The present aspects are not limited to those standardization works and can be applied, for example, to other standards and recommendations, whether pre-existing or future-developed, and extensions of any such standards and recommendations. Unless indicated otherwise, or technically precluded, the aspects described in this application can be used individually or in combination.

The acronyms used herein reflect the current state of video coding developments and thus should be considered as examples of naming that may be renamed at later stages while still representing the same techniques.

Figure 1 illustrates a block diagram of an example of a system in which various aspects and embodiments can be implemented. System 100 may be embodied as a device including the various components described below and is configured to perform one or more of the aspects described in this application. Examples of such devices include, but are not limited to, various electronic devices such as personal computers, laptop computers, smartphones, tablet computers, digital multimedia set top boxes, digital television receivers, personal video recording systems, connected home appliances, and servers. Elements of system 100, singly or in combination, may be embodied in a single integrated circuit, multiple ICs, and/or discrete components. For example, in at least one embodiment, the processing and encoder/decoder elements of system 100 are distributed across multiple ICs and/or discrete components. In various embodiments, the system 100 is communicatively coupled to other systems, or to other electronic devices, via, for example, a communications bus or through dedicated input and/or output ports. In various embodiments, the system 100 is configured to implement one or more of the aspects described in this application.

The system 100 includes at least one processor 110 configured to execute instructions loaded therein for implementing, for example, the various aspects described in this application. Processor 110 may include embedded memory, an input/output interface, and various other circuitries as known in the art. The system 100 includes at least one memory 120 (e.g., a volatile memory device, and/or a non-volatile memory device). System 100 includes a storage device 140, which may include non-volatile memory and/or volatile memory, including, but not limited to, EEPROM, ROM, PROM, RAM, DRAM, SRAM, flash, magnetic disk drive, and/or optical disk drive. The storage device 140 may include an internal storage device, an attached storage device, and/or a network accessible storage device, as non-limiting examples.

System 100 includes an encoder/decoder module 130 configured, for example, to process data to provide an encoded video or decoded video, and the encoder/decoder module 130 may include its own processor and memory. The encoder/decoder module 130 represents module(s) that may be included in a device to perform the encoding and/or decoding functions. As is known, a device may include one or both of the encoding and decoding modules. Additionally, the encoder/decoder module 130 represents module(s) that may be included in a device to perform the machine vision processing (or network) on the decoded data to accomplish an inference output, thus implementing decoding tools adapted to hybrid machine/human vision applications with NN-based tools. Additionally, encoder/decoder module 130 may be implemented as a separate element of system 100 or may be incorporated within processor 110 as a combination of hardware and software as known to those skilled in the art.

Program code to be loaded onto processor 110 or encoder/decoder 130 to perform the various aspects described in this application may be stored in storage device 140 and subsequently loaded onto memory 120 for execution by processor 110. In accordance with various embodiments, one or more of processor 110, memory 120, storage device 140, and encoder/decoder module 130 may store one or more of various items during the performance of the processes described in this application. Such stored items may include, but are not limited to, the input video, the decoded video or portions of the decoded video, the bitstream, matrices, variables, and intermediate or final results from the processing of equations, formulas, operations, and operational logic, tensors, network or filter weights.

In several embodiments, memory inside of the processor 110 and/or the encoder/decoder module 130 is used to store instructions and to provide working memory for processing that is needed during encoding or decoding. In other embodiments, however, a memory external to the processing device (for example, the processing device may be either the processor 110 or the encoder/decoder module 130) is used for one or more of these functions. The external memory may be the memory 120 and/or the storage device 140, for example, a dynamic volatile memory and/or a non-volatile flash memory. In several embodiments, an external non-volatile flash memory is used to store the operating system of a television. In at least one embodiment, a fast external dynamic volatile memory such as a RAM is used as working memory for video coding and decoding operations, such as for HEVC or VVC.

The input to the elements of system 100 may be provided through various input devices as indicated in block 105. Such input devices include, but are not limited to, (i) an RF portion that receives an RF signal transmitted, for example, over the air by a broadcaster, (ii) a Composite input terminal, (iii) a USB input terminal, and/or (iv) an HDMI input terminal.

In various embodiments, the input devices of block 105 have associated respective input processing elements as known in the art. For example, the RF portion may be associated with elements suitable for (i) selecting a desired frequency (also referred to as selecting a signal, or band-limiting a signal to a band of frequencies), (ii) down converting the selected signal, (iii) band-limiting again to a narrower band of frequencies to select (for example) a signal frequency band which may be referred to as a channel in certain embodiments, (iv) demodulating the down converted and band-limited signal, (v) performing error correction, and (vi) demultiplexing to select the desired stream of data packets. The RF portion of various embodiments includes one or more elements to perform these functions, for example, frequency selectors, signal selectors, band-limiters, channel selectors, filters, down converters, demodulators, error correctors, and demultiplexers. The RF portion may include a tuner that performs various of these functions, including, for example, down converting the received signal to a lower frequency (for example, an intermediate frequency or a near-baseband frequency) or to baseband. In one set-top box embodiment, the RF portion and its associated input processing element receives an RF signal transmitted over a wired (for example, cable) medium, and performs frequency selection by filtering, down converting, and filtering again to a desired frequency band. Various embodiments rearrange the order of the above-described (and other) elements, remove some of these elements, and/or add other elements performing similar or different functions. Adding elements may include inserting elements in between existing elements, for example, inserting amplifiers and an analog-to-digital converter. In various embodiments, the RF portion includes an antenna.

Additionally, the USB and/or HDMI terminals may include respective interface processors for connecting system 100 to other electronic devices across USB and/or HDMI connections. It is to be understood that various aspects of input processing, for example, Reed-Solomon error correction, may be implemented, for example, within a separate input processing IC or within processor 110 as necessary. Similarly, aspects of USB or HDMI interface processing may be implemented within separate interface ICs or within processor 110 as necessary. The demodulated, error corrected, and demultiplexed stream is provided to various processing elements, including, for example, processor 110, and encoder/decoder 130 operating in combination with the memory and storage elements to process the datastream as necessary for presentation on an output device.

Various elements of system 100 may be provided within an integrated housing. Within the integrated housing, the various elements may be interconnected and transmit data therebetween using suitable connection arrangement 115, for example, an internal bus as known in the art, including the I2C bus, wiring, and printed circuit boards.

The system 100 includes communication interface 150 that enables communication with other devices via communication channel 190. The communication interface 150 may include, but is not limited to, a transceiver configured to transmit and to receive data over communication channel 190. The communication interface 150 may include, but is not limited to, a modem or network card and the communication channel 190 may be implemented, for example, within a wired and/or a wireless medium.

Data is streamed to the system 100, in various embodiments, using a Wi-Fi network such as IEEE 802.11. The Wi-Fi signal of these embodiments is received over the communications channel 190 and the communications interface 150, which are adapted for Wi-Fi communications. The communications channel 190 of these embodiments is typically connected to an access point or router that provides access to outside networks including the Internet for allowing streaming applications and other over-the-top communications. Other embodiments provide streamed data to the system 100 using a set-top box that delivers the data over the HDMI connection of the input block 105. Still other embodiments provide streamed data to the system 100 using the RF connection of the input block 105.

The system 100 may provide an output signal to various output devices, including a display 165, speakers 175, and other peripheral devices 185. The other peripheral devices 185 include, in various examples of embodiments, one or more of a stand-alone DVR, a disk player, a stereo system, a lighting system, and other devices that provide a function based on the output of the system 100. In various embodiments, control signals are communicated between the system 100 and the display 165, speakers 175, or other peripheral devices 185 using signaling such as AV.Link, CEC, or other communications protocols that enable device-to-device control with or without user intervention. The output devices may be communicatively coupled to system 100 via dedicated connections through respective interfaces 160, 170, and 180. Alternatively, the output devices may be connected to system 100 using the communications channel 190 via the communications interface 150. The display 165 and speakers 175 may be integrated in a single unit with the other components of system 100 in an electronic device, for example, a television. In various embodiments, the display interface 160 includes a display driver, for example, a timing controller (T Con) chip.

The display 165 and speaker 175 may alternatively be separate from one or more of the other components, for example, if the RF portion of input 105 is part of a separate set-top box. In various embodiments in which the display 165 and speakers 175 are external components, the output signal may be provided via dedicated output connections, including, for example, HDMI ports, USB ports, or COMP outputs.

Figure 2 illustrates an example video encoder 200, such as a VVC (Versatile Video Coding) encoder. Figure 2 may also illustrate an encoder in which improvements are made to the VVC standard or an encoder employing technologies similar to VVC.

In the present application, the terms “reconstructed” and “decoded” may be used interchangeably, the terms “encoded” or “coded” may be used interchangeably, and the terms “image,” “picture” and “frame” may be used interchangeably. Usually, but not necessarily, the term “reconstructed” is used at the encoder side while “decoded” is used at the decoder side. Before being encoded, the video sequence may go through pre-encoding processing (201), for example, applying a color transform to the input color picture (e.g., conversion from RGB 4:4:4 to YCbCr 4:2:0), or performing a remapping of the input picture components in order to get a signal distribution more resilient to compression (for instance using a histogram equalization of one of the color components). Metadata can be associated with the preprocessing, and attached to the bitstream.

In the encoder 200, a picture is encoded by the encoder elements as described below. The picture to be encoded is partitioned (202) and processed in units of, for example, CUs. Each unit is encoded using, for example, either an intra or inter mode. When a unit is encoded in an intra mode, it performs intra prediction (260). In an inter mode, motion estimation (275) and compensation (270) are performed. The encoder decides (205) which one of the intra mode or inter mode to use for encoding the unit, and indicates the intra/inter decision by, for example, a prediction mode flag. Prediction residuals are calculated, for example, by subtracting (210) the predicted block from the original image block.

The prediction residuals are then transformed (225) and quantized (230). The quantized transform coefficients, as well as motion vectors and other syntax elements, are entropy coded (245) to output a bitstream. The encoder can skip the transform and apply quantization directly to the non-transformed residual signal. The encoder can bypass both transform and quantization, i.e., the residual is coded directly without the application of the transform or quantization processes.

The encoder decodes an encoded block to provide a reference for further predictions. The quantized transform coefficients are de-quantized (240) and inverse transformed (250) to decode prediction residuals. Combining (255) the decoded prediction residuals and the predicted block, an image block is reconstructed. In-loop filters (265) are applied to the reconstructed picture to perform, for example, deblocking/SAO (Sample Adaptive Offset) filtering to reduce encoding artifacts. The filtered image is stored at a reference picture buffer (280).

Figure 3 illustrates a block diagram of an example video decoder 300, such as a VVC decoder. In the decoder 300, a bitstream is decoded by the decoder elements as described below. Video decoder 300 generally performs a decoding pass reciprocal to the encoding pass as described in Figure 2. The encoder 200 also generally performs video decoding as part of encoding video data.

In particular, the input of the decoder includes a video bitstream, which can be generated by video encoder 200. The bitstream is first entropy decoded (330) to obtain transform coefficients, motion vectors, and other coded information. The picture partition information indicates how the picture is partitioned. The decoder may therefore divide (335) the picture according to the decoded picture partitioning information. The transform coefficients are dequantized (340) and inverse transformed (350) to decode the prediction residuals. Combining (355) the decoded prediction residuals and the predicted block, an image block is reconstructed. The predicted block can be obtained (370) from intra prediction (360) or motion-compensated prediction (i.e., inter prediction) (375). In-loop filters (365) are applied to the reconstructed image. The filtered image is stored at a reference picture buffer (380).

The decoded picture can further go through post-decoding processing (385), for example, an inverse color transform (e.g., conversion from YCbCr 4:2:0 to RGB 4:4:4) or an inverse remapping performing the inverse of the remapping process performed in the pre-encoding processing (201). The post-decoding processing can use metadata derived in the pre-encoding processing and signaled in the bitstream.

At least some embodiments relate to a method for decoding a video using scalable NN-based transforms, the decoding further comprising rescaling/resizing a tensor of feature data intended to be fed to an NN-based vision inference task. By enabling the resizing of an estimated deep feature tensor to another tensor with a different size, the specific input size constraint imposed by the vision network can be satisfied.

Figure 4 illustrates a block diagram of an embodiment of an end-to-end neural-network-based video compression scheme. The encoder g_a (also known as analysis transform) transforms the input image X into a latent space representation: Y = g_a(X). In most neural network-based compression frameworks, the latent representation Y is formed as a 3-dimensional tensor (referred to as a latent tensor). Then, Y is quantized (Q) and entropy coded (EC) as a binary stream (bitstream) for storage or transmission. At the decoder, the bitstream is entropy decoded (ED) to obtain Ŷ, the quantized version of Y. The decoder network g_s (also known as synthesis transform) generates the reconstructed input X̂ = g_s(Ŷ), an approximation of the original X, from the quantized latent representation Ŷ. For the sake of completeness, those skilled in the art will appreciate that, in this diagram, other modules, such as the hyper-prior and context prediction used to further improve the rate-distortion performance, are omitted to keep the processing pipeline simple in the provided figure. The same style of omission for clarity will apply to the rest of the figures provided in this document.
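The data flow just described can be sketched in a few lines of PyTorch; the layer counts, channel widths and the use of rounding as quantization are assumptions made for illustration only, and the entropy coding stage and hyper-prior are omitted as in the figure.

```python
import torch
import torch.nn as nn

# Minimal sketch of the g_a / Q / g_s pipeline described above (illustrative shapes only).
g_a = nn.Sequential(                                   # analysis transform
    nn.Conv2d(3, 128, 5, stride=2, padding=2), nn.ReLU(),
    nn.Conv2d(128, 192, 5, stride=2, padding=2),       # latent tensor Y, spatial scale s = 4
)
g_s = nn.Sequential(                                   # synthesis transform
    nn.ConvTranspose2d(192, 128, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
    nn.ConvTranspose2d(128, 3, 5, stride=2, padding=2, output_padding=1),
)

x = torch.rand(1, 3, 256, 256)      # input image X
y = g_a(x)                          # latent representation Y
y_hat = torch.round(y)              # Q: scalar quantization (EC/ED omitted)
x_hat = g_s(y_hat)                  # reconstructed input, approximation of X
print(y.shape, x_hat.shape)         # (1, 192, 64, 64) and (1, 3, 256, 256)
```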

Figure 5 illustrates a block diagram of an embodiment of a basic pipeline of a neural-network-based machine vision processing. The NN-based vision inference task (Vision Network) takes an image, such as a reconstructed image X, as input. Most object detection and segmentation networks require the input to be resized to a specific resolution, or to meet a size constraint, before conducting the inference in order to maximize the task accuracy. This is because these networks either need to be run on images of a pre-defined size or are trained on images of sizes that make it easier for the algorithm to output bounding boxes associated with object categories onto identified objects. As such, X is first resized to X' and subsequently fed into the vision network to output a collection of inference results T.

Further optimizing already existing video encoders directly for machine consumption, such as a computer vision network, is not a trivial task because the handcrafted coding tools of the compression schemes illustrated in figures 2 and 3 or in figure 4 are optimized for rate-distortion (RD) cost only in the standard codecs. The performance of NN-based computer vision algorithms may be deteriorated by the artifacts, such as ringing, blocking artifacts and the loss of high spatial frequencies, produced by the classical standard codecs targeting human consumption.

Figure 6 illustrates a block diagram of an embodiment of a basic pipeline of an NN-based video compression and machine vision processing. According to an embodiment for Video Coding for Machines (VCM), the schemes of figures 4 and 5 are concatenated, wherein the compressed input X is reconstructed and used as input to the vision network for inference in a sequential, 'chain-like' fashion. In this pipeline, the end-to-end compression network, including the encoder g_a and the decoder g_s, and possibly the vision network, can be jointly optimized for the two tasks under consideration, that is machine inference and input reconstruction, by maximizing the task accuracy altogether.

Figure 7 illustrates a block diagram of a variant embodiment of a basic pipeline of an NN-based video compression and machine vision processing. More recently, H. Choi and I. V. Bajic in "Scalable Image Coding for Humans and Machines" (IEEE Transactions on Image Processing, vol. 31, pp. 2739-2754, 2022) introduced a scalable architecture of NN-based compression for VCM which is illustrated in Figure 7. Unlike in Figure 6, the analysis transform g_a produces two latent tensors, which are then quantized to obtain Ŷ_1 ∈ ℝ^(C_1 × H/s × W/s) and Ŷ_2 ∈ ℝ^(C_2 × H/s × W/s), where C_1 and C_2 denote the number of channels for the first (base) and the second (enhanced) latent representation, respectively, and s is a scale factor between the spatial resolution of the latent tensors and the input images. This scale factor is typically determined by g_a, which generally consists of several convolutional layers with a stride of 2. Consecutively, the independently encoded latent tensors, i.e., the first and the second bitstreams, are used as input to g_s and f_s at the decoder side. Advantageously, this architecture is designed to support functional scalability from some simpler task (e.g., object detection) to more complicated ones (e.g., input reconstruction). For example, reconstructing every pixel with signal fidelity is desired for input reconstruction, but it is unnecessary for object detection. Therefore, the first bitstream carries information with the latent representation Ŷ_1 for object detection and the second bitstream delivers the remaining latent representation Ŷ_2 containing enhancement information to be used together with Ŷ_1 for the input reconstruction. As Choi describes both two- and three-task supporting VCM architectures, the VCM framework proposed in Figure 7 is not limited to the two-task variant, and the skilled person will easily adapt the described framework to more than two tasks. For the base task (e.g., object detection), the feature synthesis module f_s (also referred to as "latent space transform" in Choi) solely takes Ŷ_1 as input to estimate the deep feature tensor F̂. This feature tensor is then input to v_e, which corresponds to the part of the vision network from the l-th layer to the last layer. The vision network outputs a collection of inference results T. In Choi, the decoder advantageously requires less computation to carry out the inference task since the front part of the vision network, from the input layer to the (l-1)-th layer, is already omitted by the proposed architecture. To reconstruct X, it is necessary to take both Ŷ = {Ŷ_1, Ŷ_2} as input. However, the VCM architecture of Choi raises the issue of incompatible spatial resolution between the estimated feature tensor F̂, generated when compressing the input X at the original scale as shown in Figure 7, and the expected feature tensor resolution of F, which is computed by inputting the resized input X' into the front-end vision network as implemented in the sequential scheme of Figure 6. Due to the inconsistent dimensions between F̂ and F, it is highly unlikely that the compression and vision task pipeline attains optimal performance for all tasks simultaneously.
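A minimal sketch of the two decode paths described above is given below; the channel counts, the internal structure of f_s and g_s, and the upsampling used inside g_s are all assumptions chosen only to make the tensor shapes concrete, since the real networks are learned transforms.

```python
import torch
import torch.nn as nn

# Sketch of the two decode paths: Y1_hat -> f_s -> F_hat -> v_e (base/vision path),
# and (Y1_hat, Y2_hat) -> g_s -> X_hat (reconstruction path). Shapes are hypothetical.
C1, C2, C_l, H, W, s = 128, 64, 256, 256, 384, 16

f_s = nn.Sequential(                    # latent space transform (base/vision path)
    nn.Conv2d(C1, C_l, 3, padding=1), nn.ReLU(),
    nn.Conv2d(C_l, C_l, 3, padding=1),
)
g_s = nn.Sequential(                    # image synthesis (reconstruction path)
    nn.Conv2d(C1 + C2, 128, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=s, mode='nearest'),
    nn.Conv2d(128, 3, 3, padding=1),
)

y1_hat = torch.rand(1, C1, H // s, W // s)       # base latent: sufficient for detection
y2_hat = torch.rand(1, C2, H // s, W // s)       # enhancement latent: adds signal fidelity

f_hat = f_s(y1_hat)                              # deep feature estimate fed to v_e at layer l
x_hat = g_s(torch.cat([y1_hat, y2_hat], dim=1))  # reconstructed image, shape (1, 3, H, W)
```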

Figure 8 illustrates a block diagram of a detailed embodiment of a basic pipeline of an NN-based machine vision processing. Figure 8 presents the detailed exemplary dimensions of the associated intermediate tensors. The input X ∈ ℝ^(3 × H × W) is resized to X' ∈ ℝ^(3 × N × M) to comply with the constraint of input resolution N × M imposed by the pretrained vision network. Then, at the intermediate layer l of the vision network, the feature tensor F turns out to have a dimension of C_l × N/k × M/k, where C_l denotes the number of channels at the l-th layer, and k corresponds to the scale factor between the spatial resolution of the tensor and the input image, which is typically determined by the front-end vision network that involves several convolutional layers with a stride of 2 as well as some pooling operations. Therefore, when the compression framework shown in Figure 7 takes the resized X' as input, instead of X, the optimal performance for the vision task is expected to be achieved, as f_s generates F̂ with dimension C_l × N/k × M/k, since that feature tensor size meets the dimension that v_e expects as input at the l-th layer when X' is used as input to the front-end vision network shown in Figure 8. In this scenario, however, g_s would produce the reconstructed input X' with the resolution 3 × N × M. Unless some auxiliary post-processing module is incorporated to resize X' to the original resolution of 3 × H × W, the coded bitstreams can only reconstruct the input with the resized resolution 3 × N × M.
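The mismatch can be made concrete with a few hypothetical numbers (none of these values come from the figure itself):

```python
# Hypothetical numbers illustrating the resolution mismatch described above.
H, W = 1080, 1920      # original input resolution of X
N, M = 800, 1344       # input resolution expected by the pretrained vision network
k = 16                 # downscale factor of the front-end vision network up to layer l
r = 16                 # downscale factor of the feature synthesis f_s

expected = (N // k, M // k)   # spatial size v_e expects at layer l: (50, 84)
produced = (H // r, W // r)   # spatial size of the feature when X is coded at full size: (67, 120)
print(expected, produced)     # the sizes differ, so the feature tensor must be resized
```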

A similar issue of inconsistent resolution happens when coding X instead of X'. In that case, the reconstructed feature tensor F̂ needs to be resized.

This is solved and addressed by the general aspects described herein, which are directed to a method for resizing a tensor of input feature (F_I) to generate a tensor of output feature (F_O), the tensor of output feature being adapted to a dimension of a neural network-based vision inference processing (v_e). Advantageously, the resizing module is implemented at the decoder side for the case(s) producing deep feature tensor(s) for various vision task algorithms as shown in figure 11. Figure 9 illustrates a generic tensor scaling method according to a general aspect of at least one embodiment. The block diagram of figure 9 partially represents modules of a decoder method, for instance implemented in the exemplary decoder of Figure 11.

A feature tensor is obtained from the entropy decoding (ED) and the neural network-based feature synthesis processing f_s of the first (base) layer bitstream at the decoder. The feature tensor is intended to be fed to the neural network-based vision inference processing v_e to produce a result T such as segmentation, object detection, object tracking, etc. Advantageously, the decoder further includes the proposed resize operation so that the size of the feature tensor is adapted to the expected size of the tensor at the input of the NN-based vision inference processing. Therefore, the proposed resizing operation is applied to any task pipeline when supporting more than two tasks (meaning that the decoder may support more than one vision task and an input reconstruction task) so that the size of the tensor of a given task pipeline (one vision task) is independent from the size of the tensor of another given task pipeline (input reconstruction task). As shown in Figure 9, the ED firstly reconstructs a tensor of reconstructed data Ŷ_1 ∈ ℝ^(C_1 × H/s × W/s) from the 1st layer bitstream, where the tensor of reconstructed data Ŷ_1 is partially representative of image data samples to reconstruct. Subsequently, Ŷ_1 is fed into the feature synthesis module f_s to generate the feature tensor F_I. The generated tensor F_I can be a tensor with the rescaled spatial resolution H/r × W/r, where r is a scale determined by the architecture of f_s, and with a number of channels C_l set with respect to the input channels of v_e. Then, the resize module resizes F_I ∈ ℝ^(C_l × H/r × W/r) to obtain F_O ∈ ℝ^(C_l × N/k × M/k) using interpolation filters. These filters' information can be conveyed either by signaling an index associated with standard filters shared in the decoder or by encoding the filter coefficients in the form of some bitstream. Finally, the resized F_O is used as input to v_e to accomplish the inference task and ultimately, the output T is obtained.
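As a minimal sketch of this resize operation (shapes reused from the hypothetical example above; the bilinear mode is an assumption, since in the described system the filter may instead be signaled in the bitstream):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the resize module: F_I of shape (C_l, H/r, W/r) is mapped to
# F_O of shape (C_l, N/k, M/k). Shapes and the bilinear mode are illustrative only.
C_l, H_r, W_r = 256, 67, 120
N_k, M_k = 50, 84

f_i = torch.rand(1, C_l, H_r, W_r)                    # tensor of input feature F_I
f_o = F.interpolate(f_i, size=(N_k, M_k), mode='bilinear', align_corners=False)
print(f_o.shape)                                      # torch.Size([1, 256, 50, 84])
```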

Figure 10 illustrates another generic tensor scaling method according to a general aspect of at least one embodiment. The block diagram of figure 10 partially represents modules of a decoder method, for instance implemented in the exemplary decoder of figure 11. In this variant embodiment, f_s produces F_I with a number of channels equal to C_1, instead of the expected number of channels C_l, as shown in figure 10. In this variant, further information about a set of convolutional filters, with or without bias parameters, is signaled in the bitstream so that the resize module conducts not only the spatial resolution resizing but also the convolutional operation to generate a tensor with resolution C_l × N/k × M/k.

According to yet another variant, depending on the vision task to support at each task layer, there may be different constraints on the input size and the number of input channels of v_e. Therefore, the proper information about interpolation filters and/or convolutional filters can be carried in each bitstream for different task layers and be applied to the feature tensors.

Figure 11 illustrates a generic decoding method (1100) implementing tensor scaling according to a general aspect of at least one embodiment. In a preliminary step not shown on figure 11, a bitstream is received. As explained with the scheme of Choi, the bitstream may comprise scalable data representative of video images, including a base layer for image data intended for a computer vision task and an enhancement layer representative of additional image data intended for human vision. The bitstream may further comprise additional metadata used for processing the received bitstream. In a first step 1110, a tensor of reconstructed data Ŷ_1 is obtained, the tensor of reconstructed data being partially representative of image data samples to reconstruct; the tensor of reconstructed data comprises a C_1 number of channels of two-dimensional data of size H/s × W/s (with the same notation as above). In a second step 1120, an NN-based feature synthesis processing is applied to the tensor of reconstructed data to generate a tensor of input feature F_I ∈ ℝ^(C_l × H/r × W/r) representative of a feature of the image data samples; the tensor of input feature comprises a C_l number of channels of two-dimensional data of size H/r × W/r, where r is a rescaling ratio. In a step 1130, the tensor of input feature F_I is resized/rescaled to generate a tensor of output feature F_O. Advantageously, the resizing allows generating a tensor of output feature F_O comprising a C_l number of channels of two-dimensional data of size N/k × M/k, matching the size of the tensor expected at the input of the neural network-based vision inference processing 1140 (indeed at a defined layer l) to generate a collection of inference results T.
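The chain of steps 1110 to 1140 can be summarized in a short sketch; all callables below are placeholders for the actual entropy decoder, feature synthesis network and partial vision network, and the bilinear interpolation is only one possible filter choice.

```python
import torch.nn.functional as F

def decode_for_machine(bitstream, entropy_decode, f_s, v_e, target_hw):
    """Hypothetical wrapper around steps 1110-1140 of the decoding method."""
    y1_hat = entropy_decode(bitstream)            # 1110: tensor of reconstructed data
    f_i = f_s(y1_hat)                             # 1120: tensor of input feature F_I
    f_o = F.interpolate(f_i, size=target_hw,      # 1130: resize F_I to the size v_e expects
                        mode='bilinear', align_corners=False)
    return v_e(f_o)                               # 1140: collection of inference results T
```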

According to another embodiment, the decoding method further comprises obtaining 1150 information on the interpolation filter to apply for the resizing according to any of the signaling variants described hereafter.

According to another embodiment, the decoding method further comprises, in a step 1160, obtaining a tensor of enhancement data Ŷ_2, the tensor of enhancement data being complementary representative of image data samples to reconstruct with regard to the tensor of reconstructed data Ŷ_1; the tensor of enhancement data comprises a C_2 number of channels of two-dimensional data of size H/s × W/s (with the same notation as above). From an NN-based image synthesis processing 1170, the decoding method generates a reconstructed image of size H × W, for instance to be reproduced on a display for human vision.

According to yet another embodiment, the decoding may further output an additional tensor F_n intended for a different vision inference task; the NN-based feature synthesis process 1120 and the resize process 1130 are thus instantiated an additional time (in parallel steps) to output the additional tensor F_n intended for a different vision inference task having its own tensor size requirements, where j is here the scale factor between the spatial resolution of the tensor F_n and the input image for the related vision inference task.

Various embodiments of the generic decoding method are described in the following.

Figure 12 illustrates a generic tensor rescaling method (1130) according to a general aspect of at least one embodiment.

According to a first variant, the number (C_l) of channels of the tensor of input feature and the number (C_l) of channels of the tensor of output feature are equal. In a variant, for each channel, the 2D data of size H/r × W/r of the input tensor is rescaled to 2D data of size N/k × M/k by applying one 2D interpolation filter. Alternatively, for each channel, 1D-separable filters are applied to both dimensions of the 2D data of size H/r × W/r to generate 2D data of size N/k × M/k, in any order. In another variant, parameters, such as the coefficients of the interpolation filters, representative of a 2D interpolation filtering are parsed from metadata embedded in the bitstream. Figure 12 shows an example of resizing the input feature tensor F_I ∈ ℝ^(C_l × H/r × W/r) to obtain an output feature tensor F_O ∈ ℝ^(C_l × N/k × M/k) using a sequence of 2D interpolation kernels K = {K_1, K_2, ..., K_i, ..., K_Cl} parsed from the corresponding bitstream, where K_i is used to resize the i-th channel in the tensor F_I. It is also possible to use separable filters for each channel instead of 2D kernels, and this can be indicated in the bitstream.
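One way to realize per-channel kernels of this kind in code is to first resample each channel to the target grid and then apply the parsed kernels as a depthwise convolution; this decomposition, the nearest-neighbor resampling step and the kernel size are assumptions made for the sketch, not a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def resize_per_channel(f_i, kernels, target_hw):
    """Resize F_I channel by channel: resample to the target grid, then apply the
    per-channel 2D kernels K_1..K_Cl as a depthwise (grouped) convolution.
    `kernels` has shape (C_l, kh, kw) and is assumed parsed from bitstream metadata."""
    c_l = f_i.shape[1]
    up = F.interpolate(f_i, size=target_hw, mode='nearest')   # bring each channel to N/k x M/k
    weight = kernels.unsqueeze(1)                             # (C_l, 1, kh, kw) depthwise weights
    pad_h, pad_w = kernels.shape[-2] // 2, kernels.shape[-1] // 2
    return F.conv2d(up, weight, padding=(pad_h, pad_w), groups=c_l)

# Usage with hypothetical shapes: 256 channels, one 3x3 kernel per channel.
f_i = torch.rand(1, 256, 67, 120)
kernels = torch.rand(256, 3, 3)
f_o = resize_per_channel(f_i, kernels, (50, 84))              # shape (1, 256, 50, 84)
```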

Figure 13 illustrates another generic tensor rescaling method (1130) according to a general aspect of at least one embodiment.

In yet another variant, a set of 2D interpolation filters is predefined in the decoder, such as a bilinear filter, a bi-cubic filter, or a trilinear filter. An index indicating a 2D interpolation filter among the set of interpolation filters may be parsed from metadata embedded in the bitstream.

As shown on figure 13, the input feature tensor F_I ∈ ℝ^(C_l × H/r × W/r) is resized to obtain the output feature tensor F_O ∈ ℝ^(C_l × N/k × M/k) using a filter pre-existing in the decoder. In this variant, only the index j is parsed from the bitstream, and then the interpolation filter corresponding to j is applied to all channels in F_I to output F_O ∈ ℝ^(C_l × N/k × M/k). According to yet another variant, it is possible to choose different interpolation filters for different channels by parsing multiple filter indices from the bitstream.
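A sketch of this index-driven variant is given below; the mapping from index to filter and the set of predefined filters are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Sketch of the index-driven variant: an index j parsed from the bitstream selects one
# of the interpolation filters assumed to pre-exist in the decoder, and the selected
# filter is applied to every channel of F_I. The index-to-mode mapping is hypothetical.
PREDEFINED_FILTERS = {0: 'bilinear', 1: 'bicubic', 2: 'nearest'}

def resize_with_index(f_i, j, target_hw):
    mode = PREDEFINED_FILTERS[j]
    kwargs = {} if mode == 'nearest' else {'align_corners': False}
    return F.interpolate(f_i, size=target_hw, mode=mode, **kwargs)

f_o = resize_with_index(torch.rand(1, 256, 67, 120), j=1, target_hw=(50, 84))
```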

In a variant combining the embodiments of Figure 12 and Figure 13, each channel can have a different filter type among the pre-defined filters (bilinear, bicubic, etc.) and a customized adaptive filter with coefficients. In this case, an individual filter index associated with each channel is signaled, which covers any of the filter embodiments. For the case where the filter index indicates the use of an adaptive filter, the subsequently transferred filter coefficients should be properly parsed to be used for that channel.

Figure 14 illustrates another generic tensor rescaling method (1130) according to a general aspect of at least one embodiment.

According to a second variant, the number (C_1) of channels of the tensor of input feature and the number (C_l) of channels of the tensor of output feature are different. Accordingly, resizing the tensor of input feature further comprises applying at least one convolutional filter to the tensor of input feature to scale the number of channels. Indeed, for use cases where F_I has a different number of channels than F_O (i.e., C_1 rather than C_l), one must also resize the tensor along the channel axis, in addition to the spatial axes. One way to resize the number of channels is to conduct a convolutional operation with relevant filter information that should be transmitted in the bitstream. Figure 14 shows the method for resizing the input feature tensor using parsed filter information obtained from the bitstream. A convolutional operation is shown that precedes the spatial interpolation operation. However, it is also possible to swap the order of these operations. The parsed convolutional filters can include a set of filters with a kernel size and bias terms if needed. Then, the output of the convolution block generates an intermediate tensor with C_l channels. Subsequently, the intermediate tensor is spatially resized by the parsed interpolation filters obtained from the bitstream to produce F_O ∈ ℝ^(C_l × N/k × M/k). According to yet another variant, the resize network may be more complex than presented in Figures 12-14. The resize network may include, but is not limited to, more than one convolution layer or any type of trainable layers and activation layers before the interpolation or even after the interpolation filter. In this case, not only the filter coefficients, but also all the corresponding weights for the layers constructing the resize network should be signaled to the decoder. According to non-limiting examples, the layers of the resize network may include a fully connected layer, a convolutional layer, a deconvolutional layer, a pooling layer such as max pooling or average pooling, and an activation layer.
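The following sketch illustrates this variant under the assumption that the parsed filter information amounts to a single convolution (weights and optional bias) followed by a bilinear spatial interpolation; the kernel size and tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def resize_channels_then_space(f_i, conv_weight, conv_bias, target_hw):
    """Map C_1 input channels to C_l channels with a convolution whose parameters are
    assumed parsed from the bitstream, then spatially resize to the expected N/k x M/k."""
    inter = F.conv2d(f_i, conv_weight, conv_bias,
                     padding=conv_weight.shape[-1] // 2)        # intermediate tensor, C_l channels
    return F.interpolate(inter, size=target_hw, mode='bilinear', align_corners=False)

# Hypothetical shapes: C_1 = 128 input channels, C_l = 256 output channels, 1x1 kernel.
f_i = torch.rand(1, 128, 67, 120)
w = torch.rand(256, 128, 1, 1)
b = torch.rand(256)
f_o = resize_channels_then_space(f_i, w, b, (50, 84))           # shape (1, 256, 50, 84)
```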

Figure 15 illustrates another generic tensor rescaling method (1130) according to a general aspect of at least one embodiment. In the variant shown on figure 15, the input feature tensor F_i ∈ ℝ^(C_i × r × r) is resized to yield F_o ∈ ℝ^(C_o × k × k) with a filter pre-existing in the decoder for the spatial interpolation operation. For the convolutional operation, the same process described above for figure 14 is applied to generate the intermediate tensor with a number of channels equal to C_o. Then, the intermediate tensor is spatially resized by the interpolation filter corresponding to the index j parsed from the bitstream to generate F_o ∈ ℝ^(C_o × k × k), as described with Figure 13.

Figure 16 illustrates a generic method implementing parsing of information related to tensor scaling filtering according to a general aspect of at least one embodiment.

According to a variant embodiment, a process of parsing filter information from the bitstream to resize the relevant tensors is described. This process can be carried out when parsing an individual bitstream associated with each task pipeline. In a first step 1610, a flag indicating the need for resizing the input tensor for the vision inference task is received and parsed. Responsive to the need for resizing the input tensor being true (flag equal to 1), further information about the resizing filters is parsed in 1620. Responsive to there being no need to resize the input tensor at the output of the synthesis module (flag inferred to be 0), the method ends. When the flag is equal to 1, the target size or resolution to be generated by the resizing module, so as to comply with the associated vision task network, may need to be parsed. Furthermore, there is a flag indicating whether parsing the target size is needed. If that flag is equal to 0, the relevant information can be inferred 1630 by referencing a configuration associated with the vision network. If that flag is equal to 1, the target dimension to be achieved by the resize module can be parsed 1640. For example, the process parses 1640 the number of channels, the width, and the height of the feature tensor output by the resizing module. In another example, the process parses 1640 a scaling factor for each of the tensor dimensions (channel, width, and height) between the input and the output tensors. Subsequently, num_filter_minus_1 is parsed 1650 to specify how many filters will be applied to resize the input feature tensor; the actual number of filters to apply therefore corresponds to num_filter_minus_1 + 1. In a variant, a filter type can then be consecutively parsed 1660. To give some examples, the filter type can be a convolutional filter, an interpolation filter, etc. Depending on the filter type, the relevant parsing process is invoked to parse the filter coefficients and the relevant information about the filter. After parsing the information associated with a filter, i.e., the filter type and coefficients, etc., the same parsing flow is repeated to parse the rest of the filter information until 1670 the number of parsed filter sets meets num_filter_minus_1 + 1. The order of the parsed filter information is the same as the order in which the filters are applied to the input tensor in the resize module.
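
The parsing flow of Figure 16 could be sketched roughly as follows; the bitstream reader API (read_flag, read_uint, read_filter) and all names other than num_filter_minus_1 are hypothetical placeholders, not syntax defined by the application.

```python
def parse_resize_info(reader, vision_net_config):
    """Parse tensor-resizing information for one task bitstream (rough sketch of Figure 16)."""
    if not reader.read_flag():                      # 1610: is resizing of the input tensor needed?
        return None                                 # flag inferred to be 0: nothing more to parse
    info = {"filters": []}                          # 1620: parse further resizing information
    if reader.read_flag():                          # flag: is the target size explicitly signaled?
        info["channels"] = reader.read_uint()       # 1640: target dimensions of the output tensor
        info["width"] = reader.read_uint()          #       (a variant parses scaling factors instead)
        info["height"] = reader.read_uint()
    else:                                           # 1630: infer the size from the vision network
        info["channels"], info["width"], info["height"] = vision_net_config.expected_input_dims
    num_filters = reader.read_uint() + 1            # 1650: num_filter_minus_1 + 1 filters follow
    for _ in range(num_filters):                    # 1660/1670: loop until all filter sets are parsed
        filter_type = reader.read_uint()            # e.g. convolutional filter, interpolation filter
        coefficients = reader.read_filter(filter_type)       # type-dependent coefficient parsing
        info["filters"].append((filter_type, coefficients))  # kept in application order
    return info
```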

Normative methods to describe, compress and transmit neural network parameters have already been standardized. For instance, the so-called MPEG Neural Network Representations (NNR) standard provides tools and syntax to compress and transmit neural networks. NNR can be envisioned as a means to transmit the parameters of the proposed filters, as they are generally composed of convolutional operations that are supported by NNR. If no compression is needed, for instance when the size of the kernel parameters is negligible, exchange formats such as ONNX or NNEF can also be used as a syntax to specify the filter structure and its parameter values.
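
As a minimal sketch of the uncompressed case, assuming PyTorch and a toy resize module, an exchange-format description of the filter structure and its parameter values could be produced with the ONNX exporter as follows; the module layout and the file name are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResizeModule(nn.Module):
    """Toy resize network: a 1x1 convolution for channel scaling, then bilinear upsampling."""
    def __init__(self, c_in=256, c_out=128, out_size=(32, 32)):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=1)
        self.out_size = out_size

    def forward(self, x):
        x = self.conv(x)
        return F.interpolate(x, size=self.out_size, mode="bilinear", align_corners=False)

dummy_input = torch.randn(1, 256, 16, 16)            # a batch of one C_i x r x r feature tensor
torch.onnx.export(ResizeModule(), dummy_input, "resize_module.onnx")   # exchange-format description
```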

Such described syntax for machine vision processing may include additional information, for instance related to the image, to a part of the image, or to the bitstream itself, that may be shared by both human and machine vision tasks. For instance, the additional information may include, but is not limited to, the padding size for the input image and the input image resolution. This information typically exists in the bitstream for input reconstruction (traditional image/video codec), and those skilled in the art will appreciate that such information is also needed for the vision task bitstream and should therefore be available for it, in particular since the present principles are adapted to a scalable bitstream. According to another variant, in case the encoder codes only a region of interest, it would be necessary for the decoder to know the top-left corner coordinates, the width and the height of the coded area. For instance, the additional information may include, but is not limited to, the top-left corner coordinates of a coded area and the width and height of the coded area. According to yet another variant, details of the vision network configuration and network architecture may be useful, especially for the layers interfacing with the encoder and decoder (resize module), and may be signaled as additional data.

Figure 17 shows two remote devices communicating over a communication network in accordance with an example of present principles in which various aspects of the embodiments may be implemented. According to an example of the present principles, illustrated in Figure 17, in a transmission context between two remote devices A and B over a communication network NET, the device A comprises a processor in relation with memory RAM and ROM which are configured to implement any one of the embodiments of the method for NN scalable encoding as described in relation with Figure 7, and the device B comprises a processor in relation with memory RAM and ROM which are configured to implement any one of the embodiments of the method for NN scalable decoding for hybrid human/machine vision applications as described in relation with Figures 7, 9-11 or 16. In accordance with an example, the network is a broadcast network, adapted to broadcast/transmit encoded images from device A to decoding devices including the device B. A signal, intended to be transmitted by the device A, carries at least one scalable bitstream comprising coded data representative of at least one image along with metadata allowing the resizing information to be applied. According to yet another embodiment, an encoding method and an encoding apparatus embedding signaling information on a tensor resizing module implemented at the decoder, based on the present principles, are proposed.

Figure 18 shows an example of the syntax of such a signal when the at least one coded image is transmitted over a packet-based transmission protocol. Each transmitted packet P comprises a header H and a payload PAYLOAD. The payload PAYLOAD may carry the above described scalable bitstream including metadata relative to the machine vision application. In a variant, the payload comprises scalable neural-network based coded data representative of image data samples for a neural network-based vision inference processing and associated metadata, wherein the associated metadata comprises at least one of: an indication of a resizing of a tensor of input feature; an indication of whether the resizing is inferred from a configuration associated with an expected dimension of the neural network-based vision inference processing or whether the resizing is embedded in the associated metadata of the bitstream; one or more parameters representative of a resized dimension of the tensor of output feature; an indication of a number of interpolation filters used in the resizing; and one or more parameters representative of an interpolation filter among a number of interpolation filters used in the resizing.
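
Purely as a reading aid, the metadata items enumerated above could be grouped as in the following sketch; the field names are invented for illustration and do not correspond to syntax elements defined by the application.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ResizeMetadata:
    resize_enabled: bool                       # indication of a resizing of the tensor of input feature
    size_inferred_from_config: bool            # inferred from the vision network configuration vs. signaled
    output_channels: Optional[int] = None      # parameters representative of the resized dimensions
    output_width: Optional[int] = None
    output_height: Optional[int] = None
    num_filters: int = 0                       # number of interpolation filters used in the resizing
    filter_indices: List[int] = field(default_factory=list)   # chosen filter(s) among the predefined set
```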

Additional Embodiments and Information

Various methods are described herein, and each of the methods comprises one or more steps or actions for achieving the described method. Unless a specific order of steps or actions is required for proper operation of the method, the order and/or use of specific steps and/or actions may be modified or combined. Additionally, terms such as “first”, “second”, etc. may be used in various embodiments to modify an element, component, step, operation, etc., for example, a “first decoding” and a “second decoding”. Use of such terms does not imply an ordering to the modified operations unless specifically required. So, in this example, the first decoding need not be performed before the second decoding, and may occur, for example, before, during, or in an overlapping time period with the second decoding.

Unless indicated otherwise, or technically precluded, the aspects described in this application can be used individually or in combination.

Various numeric values are used in the present application. The specific values are for example purposes and the aspects described are not limited to these specific values.

Various implementations involve decoding. “Decoding,” as used in this application, may encompass all or part of the processes performed, for example, on a received encoded sequence in order to produce a final output suitable for display. In various embodiments, such processes include one or more of the processes typically performed by a decoder, for example, entropy decoding, inverse quantization, inverse transformation, and differential decoding. Whether the phrase “decoding process” is intended to refer specifically to a subset of operations or generally to the broader decoding process will be clear based on the context of the specific descriptions and is believed to be well understood by those skilled in the art. Various implementations involve encoding. In an analogous way to the above discussion about “decoding”, “encoding” as used in this application may encompass all or part of the processes performed, for example, on an input video sequence in order to produce an encoded bitstream.

Note that the syntax elements as used herein are descriptive terms. As such, they do not preclude the use of other syntax element names.

The implementations and aspects described herein may be implemented as various pieces of information, such as for example syntax, that can be transmitted or stored, for example. This information can be packaged or arranged in a variety of manners, including for example manners common in video standards such as putting the information into an SPS, a PPS, a NAL unit, a header (for example, a NAL unit header, or a slice header), or an SEI message. Other manners are also available, including for example manners common for system level or application level standards such as putting the information into one or more of the following:

- SDP (session description protocol), a format for describing multimedia communication sessions for the purposes of session announcement and session invitation, for example as described in RFCs and used in conjunction with RTP (Real-time Transport Protocol) transmission;

- DASH MPD (Media Presentation Description) Descriptors, for example as used in DASH and transmitted over HTTP; a Descriptor is associated with a Representation or a collection of Representations to provide additional characteristics to the content Representation;

- RTP header extensions, for example as used during RTP streaming;

- ISO Base Media File Format, for example as used in OMAF and using boxes, which are object-oriented building blocks defined by a unique type identifier and length, also known as 'atoms' in some specifications;

- HLS (HTTP Live Streaming) manifest transmitted over HTTP. A manifest can be associated, for example, with a version or a collection of versions of a content to provide characteristics of the version or collection of versions.

The implementations and aspects described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed may also be implemented in other forms (for example, an apparatus or program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, for example, computers, cell phones, portable/personal digital assistants (“PDAs”), and other devices that facilitate communication of information between end-users.

Reference to “one embodiment” or “an embodiment” or “one implementation” or “an implementation”, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment” or “in one implementation” or “in an implementation”, as well as any other variations, appearing in various places throughout this application are not necessarily all referring to the same embodiment.

Additionally, this application may refer to “determining” various pieces of information. Determining the information may include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.

Further, this application may refer to “accessing” various pieces of information. Accessing the information may include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, moving the information, copying the information, calculating the information, determining the information, predicting the information, or estimating the information.

Additionally, this application may refer to “receiving” various pieces of information. Receiving is, as with “accessing”, intended to be a broad term. Receiving the information may include one or more of, for example, accessing the information, or retrieving the information (for example, from memory). Further, “receiving” is typically involved, in one way or another, during operations, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.

It is to be appreciated that the use of any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of “A, B, and/or C” and “at least one of A, B, and C”, such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This may be extended, as is clear to one of ordinary skill in this and related arts, for as many items as are listed.

Also, as used herein, the word “signal” refers to, among other things, indicating something to a corresponding decoder. For example, in certain embodiments the encoder signals a quantization matrix for de-quantization. In this way, in an embodiment the same parameter is used at both the encoder side and the decoder side. Thus, for example, an encoder can transmit (explicit signaling) a particular parameter to the decoder so that the decoder can use the same particular parameter. Conversely, if the decoder already has the particular parameter as well as others, then signaling can be used without transmitting (implicit signaling) to simply allow the decoder to know and select the particular parameter. By avoiding transmission of any actual functions, a bit savings is realized in various embodiments. It is to be appreciated that signaling can be accomplished in a variety of ways. For example, one or more syntax elements, flags, and so forth are used to signal information to a corresponding decoder in various embodiments. While the preceding relates to the verb form of the word “signal”, the word “signal” can also be used herein as a noun.

As will be evident to one of ordinary skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry the bitstream of a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.

We describe a number of embodiments. Features of these embodiments can be provided alone or in any combination, across various claim categories and types. Further, embodiments can include one or more of the following features, devices, or aspects, alone or in any combination, across various claim categories and types:

- Adapting the size of a feature tensor intended for a machine vision task in a NN-based scalable decoder and/or encoder.

- Selecting a filter to apply for resizing a feature tensor in the decoder and/or encoder.

- Signaling information relative to the resizing of a feature tensor to apply in the decoder.

- Deriving information relative to a filtering process to apply for resizing a feature tensor, the deriving being applied in the decoder and/or encoder.

- Inserting, in the signaling, syntax elements that enable the decoder to identify the filtering process to use, such as filter indices.

- Selecting, based on these syntax elements, the at least one filtering process to apply at the decoder.

- A bitstream or signal that includes one or more of the described syntax elements, or variations thereof.

- A bitstream or signal that includes syntax conveying information generated according to any of the embodiments described.

- Inserting, in the signaling, syntax elements that enable the decoder to apply a feature tensor resizing process in a manner corresponding to that used by an encoder.

- Creating and/or transmitting and/or receiving and/or decoding a bitstream or signal that includes one or more of the described syntax elements, or variations thereof.

- Creating and/or transmitting and/or receiving and/or decoding according to any of the embodiments described.

- A method, process, apparatus, medium storing instructions, medium storing data, or signal according to any of the embodiments described.

- A TV, set-top box, cell phone, tablet, or other electronic device that performs a NN-based scalable decoding process adapted to resize a feature tensor intended for a machine vision task according to any of the embodiments described.

- A TV, set-top box, cell phone, tablet, or other electronic device that performs a NN-based scalable decoding process adapted to resize a feature tensor intended for a machine vision task according to any of the embodiments described, and that displays (e.g. using a monitor, screen, or other type of display) an image intended for human vision.

- A TV, set-top box, cell phone, tablet, or other electronic device that selects (e.g. using a tuner) a channel to receive a signal including an encoded image, and that performs a NN-based scalable decoding process adapted to resize a feature tensor intended for a machine vision task according to any of the embodiments described.

- A TV, set-top box, cell phone, tablet, or other electronic device that receives (e.g. using an antenna) a signal over the air that includes an encoded image, and that performs a NN-based scalable decoding process adapted to resize a feature tensor intended for a machine vision task according to any of the embodiments described.