


Title:
A METHOD AND TECHNICAL EQUIPMENT FOR VIDEO PROCESSING
Document Type and Number:
WIPO Patent Application WO/2018/150083
Kind Code:
A1
Abstract:
The invention relates to a method and technical equipment for media compression/decompression. The method comprises receiving media data (300) for compression; determining, by a first neural network (310), an indication (320) of at least one part of the media data (300) that is determinable based on at least one other part of the media data (300); and providing the media data (300) and the indication (320) to a data compressor (330). Another aspect of the method comprises receiving media data (340) with an indication (360) of at least part of the media data (340) that is determinable based on at least one other part of the media data (340), and parameters (360) of a neural network (380); decompressing (350) the media data (340); and regenerating a final media data (390) in the neural network (380) by using the indication (360) and the parameters (360).

Inventors:
CRICRI FRANCESCO (FI)
AKSU EMRE BARIS (FI)
Application Number:
PCT/FI2018/050049
Publication Date:
August 23, 2018
Filing Date:
January 23, 2018
Assignee:
NOKIA TECHNOLOGIES OY (FI)
International Classes:
H04N19/17; G06N3/04; G06N3/08; G06T7/10; G06T9/00; G06T5/00; G10L19/04; H03M7/30
Domestic Patent References:
WO2012033966A12012-03-15
Foreign References:
US20090067491A12009-03-12
US20070248272A12007-10-25
Other References:
PATHAK, DEEPAK ET AL.: "Context Encoders: Feature Learning by Inpainting", PROCEEDINGS OF THE 29TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2016, 12 December 2016 (2016-12-12), Las Vegas, NV, USA, pages 2536 - 2544, XP033021434, ISSN: 1063-6919, [retrieved on 20180612]
GREGOR, KAROL ET AL.: "Towards Conceptual Compression", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 29 April 2016 (2016-04-29), XP080805978, Retrieved from the Internet [retrieved on 20180608]
LIU, DONG ET AL.: "Image Compression With Edge-Based Inpainting", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, vol. 17, no. 10, 1 October 2007 (2007-10-01), pages 1273 - 1287, XP011193147, ISSN: 1051-8215, [retrieved on 20180613]
YEH, RAYMOND ET AL.: "Semantic Image Inpainting with Perceptual and Contextual Losses", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 14 November 2016 (2016-11-14), XP055532717, Retrieved from the Internet [retrieved on 20180613]
BOESEN LINDBO LARSEN, ANDERS ET AL.: "Autoencoding beyond pixels using a learned similarity metric", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 10 February 2016 (2016-02-10), XP055379931, Retrieved from the Internet [retrieved on 20180606]
MAKHZANI, ALIREZA ET AL.: "Adversarial Autoencoders", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 25 May 2016 (2016-05-25), XP055532752, Retrieved from the Internet [retrieved on 20180608]
DONG LIU ET AL., INPAINTING WITH IMAGE PATCHES FOR COMPRESSION, 31 August 2011 (2011-08-31)
JIANG WEI, RATE-DISTORTION OPTIMIZED IMAGE COMPRESSION BASED ON IMAGE INPAINTING, 2 November 2014 (2014-11-02)
See also references of EP 3583777A4
Attorney, Agent or Firm:
NOKIA TECHNOLOGIES OY et al. (FI)
Claims:

1. A method, comprising:

- receiving media data for compression;

- determining, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data; and

- providing the media data and the indication to a data compressor.

2. The method according to claim 1, further comprising compressing the media data with the data compressor according to the indication and transmitting the compressed media data and the indication to a receiver.

3. The method according to claim 1, further comprising regenerating the media data by a second neural network to obtain regenerated media data, and training the first and the second neural network based on a quality indicator obtained by comparing the regenerated media data to training data by a third neural network.

4. The method according to claim 3, further comprising transmitting parameters of the second neural network to the receiver.

5. The method according to any of the claims 1 to 4, wherein the media data comprises visual media data and said at least one part of the media data comprises a region of an image or a video frame.

6. The method according to claim 5, wherein the indication of said at least one part of the media comprises a binary mask indicating at least one region that is determinable based on the at least one other part of the media data.

7. The method according to any of the claims 4 to 6, wherein parameters for the first neural network and the parameters for the second neural network are updated based on a context of the media data.

8. The method according to claim 7, further comprising transmitting the updated parameters of the second neural network to the receiver.

9. A method, comprising:

- receiving media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network;

- decompressing the media data;

- regenerating a final media data in the neural network by using the indication and the parameters.

10. The method according to claim 9, wherein the media data comprises visual media data and said at least one part of the media data comprises a region of an image or a video frame.

11. The method according to claim 10, wherein the indication of said at least one part of the media comprises a binary mask.

12. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform a method according to any of the claims 1 to 8.

13. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform a method according to any of the claims 9 to 11.

14. A computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to implement a method according to any of the claims 1 to 8.

15. A computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to implement a method according to any of the claims 9 to 11.

Description:
A METHOD AND TECHNICAL EQUIPMENT FOR VIDEO PROCESSING

Technical Field

The present solution generally relates to virtual reality and machine learning. In particular, the solution relates to streaming and processing of media content.

Background

Many practical applications rely on the availability of semantic information about the content of the media, such as images, videos, etc. Semantic information may be represented by metadata which may express the type of scene, the occurrence of a specific action/activity, the presence of a specific object, etc. Such semantic information can be obtained by analyzing the media.

The analysis of media is a fundamental problem which has not yet been completely solved. This is especially true when considering the extraction of high-level semantics, such as object detection and recognition, scene classification (e.g., sport type classification), action/activity recognition, causal information, reasoning about objects and entities, etc.

Recently, the development of various neural network techniques has enabled learning to recognize image content directly from the raw image data, whereas previous techniques consisted of learning to recognize image content by computing manually-designed image features from the content.

Summary

An improved method, and technical equipment implementing the method, has now been invented for media compression/decompression. Various aspects of the invention include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.

According to a first aspect, there is provided a method comprising receiving media data for compression; determining, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data; and providing the media data and the indication to a data compressor.

According to a second aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to receive media data for compression; to determine, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data; and to provide the media data and the indication to a data compressor.

According to a third aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive media data for compression; to determine, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data; and to provide the media data and the indication to a data compressor.

According to an embodiment, the media data is compressed with the data compressor according to the indication and the compressed media data and the indication are transmitted to a receiver. According to an embodiment, the media data is regenerated by a second neural network to obtain regenerated media data, and the first and the second neural network are trained based on a quality indicator obtained by comparing the regenerated media data to training data by a third neural network. According to an embodiment, parameters of the second neural network are transmitted to the receiver.

According to an embodiment, the media data comprises visual media data and said at least one part of the media data comprises a region of an image or a video frame. According to an embodiment, the indication of said at least one part of the media comprises a binary mask indicating at least one region that is determinable based on the at least one other part of the media data. According to an embodiment, parameters for the first neural network and the parameters for the second neural network are updated based on a context of the media data.

According to an embodiment, the updated parameters of the second neural network are transmitted to the receiver.

According to a fourth aspect, there is provided a method, comprising receiving media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network; decompressing the media data; and regenerating a final media data in the neural network by using the indication and the parameters.

According to a fifth aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to receive media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network; to decompress the media data; and to regenerate a final media data in the neural network by using the indication and the parameters.

According to a sixth aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network; to decompress the media data; and to regenerate a final media data in the neural network by using the indication and the parameters.

According to an embodiment, the media data comprises visual media data and said at least one part of the media data comprises a region of an image or a video frame. According to an embodiment, the indication of said at least one part of the media comprises a binary mask.

Description of the Drawings

In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which

Fig. 1 shows a computer graphics system suitable to be used in a computer vision process according to an embodiment;

Fig. 2 shows an example of a Convolutional Neural Network;

Fig. 3 shows a general overview of a method according to an embodiment;

Fig. 4 shows an embodiment for training neural networks for encoding and decoding;

Fig. 5 shows an example of a method of an encoder;

Fig. 6 shows an example of a method of a decoder;

Fig. 7a is a flowchart illustrating a method according to an embodiment;

Fig. 7b is a flowchart illustrating a method according to another embodiment; and

Fig. 8 shows an apparatus according to an embodiment in a simplified block chart.

Description of Example Embodiments

Figure 1 shows a computer graphics system suitable to be used in image processing, for example in a media compression or decompression process according to an embodiment. The generalized structure of the computer graphics system will be explained in accordance with the functional blocks of the system. Several functionalities can be carried out with a single physical device, e.g. all calculation procedures can be performed in a single processor if desired. A data processing system of an apparatus according to an example of Fig. 1 comprises a main processing unit 100, a memory 102, a storage device 104, an input device 106, an output device 108, and a graphics subsystem 110, which are all connected to each other via a data bus 112.

The main processing unit 100 is a conventional processing unit arranged to process data within the data processing system. The main processing unit 100 may comprise or be implemented as one or more processors or processor circuitry. The memory 102, the storage device 104, the input device 106, and the output device 108 may include conventional components as recognized by those skilled in the art. The memory 102 and the storage device 104 store data in the data processing system 100. Computer program code resides in the memory 102 for implementing, for example, a computer vision process or a media compression process. The input device 106 inputs data into the system while the output device 108 receives data from the data processing system and forwards the data, for example to a display or for transmission to a receiver. The data bus 112 is a conventional data bus and, while shown as a single line, it may be any combination of the following: a processor bus, a PCI bus, a graphical bus, an ISA bus. Accordingly, a skilled person readily recognizes that the apparatus may be any data processing device, such as a computer device, a personal computer, a server computer, a mobile phone, a smart phone or an Internet access device, for example an Internet tablet computer.

It needs to be understood that different embodiments allow different parts to be carried out in different elements. For example, various processes of the media compression or decompression system may be carried out in one or more processing devices; for example, entirely in one computer device, or in one server device or across multiple user devices. The elements of media compression or decompression process may be implemented as a software component residing on one device or distributed across several devices, as mentioned above, for example so that the devices form a so-called cloud.

The present embodiments relate to data compression, communication, and decompression, and to the field of machine learning and artificial intelligence.

Data compression, such as image and video compression, comprises reducing the amount of data used to represent certain information. The output of such an operation is a reduced set of data, which occupies less memory space or can be transmitted using less bitrate or bandwidth. For example, image compression consists of removing data from the original image that can be easily predicted from the rest of the data, by exploiting for example redundancies (smooth regions). An example of an image compressor is the JPEG (Joint Photographic Experts Group) standard. In the video domain, compression also exploits temporal redundancy, as objects and regions usually move at a low pace compared to the frame sampling rate. An example of a video compressor is the H.264 standard. In general, compression can be either lossless or lossy, meaning that the reconstruction of the original data from the compressed data may be perfect or imperfect, respectively. Reconstruction of the original data, or an estimate of the original data, from the compressed data may be referred to as decompression.
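
To make the lossless/lossy distinction concrete, the following is a toy Python sketch (not any standardized codec such as JPEG or H.264): a smooth run of pixel values is compressed losslessly by run-length encoding, while a lossy variant quantizes the values first, so the original values can no longer be reconstructed exactly.

```python
def rle_encode(values):
    """Lossless run-length encoding: exploit redundancy in smooth regions."""
    runs, count = [], 1
    for prev, cur in zip(values, values[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append((prev, count))
            count = 1
    runs.append((values[-1], count))
    return runs

def rle_decode(runs):
    return [value for value, count in runs for _ in range(count)]

row = [10, 10, 10, 10, 11, 11, 200]
assert rle_decode(rle_encode(row)) == row          # lossless: perfect reconstruction

lossy = [(v // 4) * 4 for v in row]                # lossy: quantize before encoding
print(rle_encode(lossy))                           # fewer, longer runs -> smaller output
print(rle_decode(rle_encode(lossy)) == row)        # False: reconstruction is not exact
```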

Machine learning is a field which studies how to learn mappings from a certain input to a certain output, where the learning is performed based on data. In particular, a sub-field of machine learning which has recently been particularly successful is deep learning. Deep learning studies how to use artificial neural networks for learning from raw data, without preliminary feature extraction. Deep learning techniques may be used for recognizing and detecting objects in images or videos with great accuracy, outperforming previous methods. The fundamental difference of deep learning image recognition techniques compared to previous methods is learning to recognize image objects directly from the raw data, whereas previous techniques are based on recognizing the image objects from hand-engineered features (e.g. SIFT features). During the training stage, deep learning techniques build hierarchical computation layers which extract features of an increasingly abstract level.

An example of a feature extractor in deep learning techniques is included in the Convolutional Neural Network (CNN), shown in Fig. 2. A CNN is composed of one or more convolutional layers, fully connected layers, and a classification layer on top. CNNs are easier to train than other deep neural networks and have fewer parameters to be estimated. Therefore, CNNs are a highly attractive architecture to use, especially in text, image, video and speech applications.

In Fig. 2, the input to the CNN is an image, but any other media content object, such as a video file, could be used as well. Each layer of a CNN represents a certain abstraction (or semantic) level, and the CNN extracts multiple feature maps. A feature map may, for example, comprise a dense matrix of real numbers representing values of the extracted features. The CNN in Fig. 2 has only three feature (or abstraction, or semantic) layers C1, C2, C3 for the sake of simplicity, but CNNs may have many more convolution layers.

The first convolution layer C1 of the CNN may be configured to extract 4 feature maps from the first layer (i.e. from the input image). These maps may represent low-level features found in the input image, such as edges and corners. The second convolution layer C2 of the CNN, which may be configured to extract 6 feature maps from the previous layer, increases the semantic level of the extracted features. Similarly, the third convolution layer C3 may represent more abstract concepts found in images, such as combinations of edges and corners, shapes, etc. The last layer of the CNN, referred to as a fully connected Multi-Layer Perceptron (MLP), may include one or more fully-connected (i.e., dense) layers and a final classification layer. The MLP uses the feature maps from the last convolution layer in order to predict (recognize), for example, the object class. For example, it may predict that the object in the image is a house.

An artificial neural network is a computation graph consisting of successive layers of computation, usually performing a highly non-linear mapping in a high-dimensional manifold. Neural networks work in two phases: the development or training phase, and the test or utilization phase. During training, the network exploits training data for learning the mapping. Training can be done unsupervised (where there are no manually-provided labels or targets) or supervised (where the network receives manually-provided labels or targets).

One of the most successful techniques for unsupervised learning is Generative Adversarial Networks (GAN), also sometimes referred to as Adversarial Training. In a GAN, the teacher is another neural network, called the Discriminator, which indirectly teaches the first neural network (i.e. the Generator) to generate data which looks realistic. One common use of GANs is in image generation, although GANs may also be used for other purposes, like style transfer, super-resolution, inpainting, etc. The Generator tries to generate images which look similar (but not identical) to those in the training dataset, with the goal of fooling the Discriminator (i.e., convincing the Discriminator that the image is from the training set and not generated by the Generator). More precisely, the Generator tries to model the probability distribution of the data, so that generated images look like they were drawn (or sampled) from the true probability distribution of the data. The Discriminator sometimes receives images from the training set, and sometimes from the Generator, and has the goal of learning to correctly discriminate them. The loss is computed on the Discriminator's side, by checking its classification (or discrimination) accuracy. This loss is then used for training both the Discriminator and the Generator.
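
As a rough illustration of the CNN of Fig. 2 described above (convolution layers C1 with 4 feature maps and C2 with 6 feature maps, a third convolution layer C3, and a fully connected MLP with a classification layer on top), a minimal PyTorch sketch could look as follows. The kernel sizes, the 32x32 RGB input, the 8 feature maps in C3 and the 10 output classes are assumptions chosen only to make the example runnable.

```python
import torch
import torch.nn as nn

# Minimal sketch of the CNN of Fig. 2: three convolution layers (C1, C2, C3)
# followed by a fully connected MLP classifier. The exact sizes are illustrative.
class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 4, kernel_size=3, padding=1),   # C1: 4 low-level feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(4, 6, kernel_size=3, padding=1),   # C2: 6 mid-level feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(6, 8, kernel_size=3, padding=1),   # C3: more abstract features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(                 # fully connected MLP on top
            nn.Flatten(),
            nn.Linear(8 * 4 * 4, 32),
            nn.ReLU(),
            nn.Linear(32, num_classes),                  # final classification layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Example: classify a batch of two 32x32 RGB images.
logits = SmallCNN()(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```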

When compressing images or videos, the known solutions mostly focus on the low-level characteristics by using traditional signal processing methodologies. For example, when compressing a face, the known algorithms need to compress and then store/transmit every part of the face, although, for an intelligent agent (e.g. a human) it would be easy to imagine how one eye would look when the other eye is already visible, or even how one eye would look when only half of it is visible. If a compressor (and a de-compressor) were able to perform such "imagining" operations, the whole pipeline would greatly benefit from it by obtaining big savings in bitrate. In fact, the "imaginable" or "determinable" parts of the image may be fully discarded from storage/transmission or kept with lower representation precision (e.g., lower bit-rate).

Recent advances in deep learning have shown that neural networks are able to extrapolate such semantic information, even to the most difficult point of generating or imagining the data itself. For example, neural networks have been trained to imagine missing parts of an image, thus performing "inpainting" or "image completion". Thus, in the present embodiments, a deep learning system is presented to cope with the problem of leveraging semantic aspects of data such as images and videos, in order to obtain a bit-rate reduction. In particular, a novel pipeline is proposed for both training and utilizing neural networks for this goal. The present application also discloses network topology parameters that can be streamed and sent to the client in parallel to the encoded bitstream, so that the neural network can be adapted and/or changed on-the-fly during a streaming session.

The present embodiments are targeted to a neural network based framework for compression, streaming and de-compression of data such as images and videos. In an example, an image is compressed. The image may be an image of a face. The basic idea of the present embodiments is to have a neural network that is able to decide which regions of the image should be encoded with higher quality and which other regions can be encoded with lower quality. The decision is based on how easy or difficult it is for a second neural network to imagine those regions. In particular, the regions which are encoded with low quality are those regions which are easily imaginable, such as specular regions (e.g. the right eye after having observed the left eye and the general pose of the face) and regions which do not change much among different examples of the same region type (e.g., a certain region of the face which does not change much among different persons).

In addition, in the present embodiments, a method for achieving collaborative-adversarial training is disclosed. In this approach, there are three neural networks which are trained simultaneously, each of them for a different task, but in effect they implicitly contribute to the training of each other. One of the neural networks is called the Collaborator (or Friend) "C" network. This network receives an input image and generates a masked image, where the masked or missing parts are supposed to be "easily imaginable" by a second neural network. Following this step, there are two alternatives: either the masked image is encoded by encoding only the non-missing (non-easily imaginable) parts, or it is encoded by using two or more different qualities (or bitrates) for the different regions of the image. For the actual encoding, any suitable encoder may be used, such as JPEG for images. The second neural network, called the Generator "G", receives the masked image, or the mask and the image separately, and tries to imagine the missing parts or to improve the quality of the lower-quality parts. Finally, a third neural network, called the Adversary "A" network, tries to understand whether the imagined image is a real image (from the training set) or an imagined image (from the Generator). The output of the third neural network A is used to produce a loss metric or a loss value, which may be a real number. This loss may then be used for updating the learnable parameters of all three networks (for example the "weights" of the neural networks).

Once the training process has converged, the Collaborator C represents the encoder or compressor, the Generator G represents the decoder or decompressor, and the Adversary A may be discarded or kept for future further training of the Collaborator C and Generator G networks. The decoding side needs to have appropriate parameters and topology information about the Generator neural network. The appropriate parameters and topology information refer to Generator parameters which were trained jointly with the Collaborator parameters used by the encoder. Therefore, parameters of Collaborator and Generator need to be compatible with each other and thus the version of the Collaborator and the Generator needs to be trackable, as multiple versions may be available at different points in time due to retraining and other updates. To this end, one simple method is to signal the Generator parameters version number inside the encoded image format, for example as a signalling field in the header portion of the image.

As an additional embodiment, there may be different neural network versions for different contexts, such as sport, concert, indoor, outdoor, artificial (man-made) scene, natural (e.g. forest) scene, etc. The system may decide to use one of these networks for encoding and decoding. The decision may be manual or automated. An automated decision may be implemented by using a context classifier at the encoder's side, after which the classified context is signaled to the decoder's side. In another embodiment, there may be different trained neural network instances and/or topologies based on the inpainting operation (face, building, natural content, synthetic content, etc.). The server may communicate to the client which neural network topology type is to be used for inpainting.

In another embodiment, the server may stream the network topology in-band or out-of-band with respect to the video bitstream and have the new topology ready in the client before it is used for inpainting. Furthermore, instead of sending the whole topology and parameters at every update time, the system may send only the difference between the currently used topology and parameters and their updated or latest version, in order to further reduce the bitrate.

The present embodiments are discussed in more detail next.

The embodiments can be used to reduce the required data rate in any type of media transmission, for example transmission of images, audio or video through local wired or wireless connections, and streaming, multicasting or broadcasting over wired or wireless networks such as cellular networks or terrestrial, satellite or cable broadcast networks.

For the sake of clarity, in the present disclosure, a neural network can be implemented in different ways, also depending on the type of input data. As the present solution mainly concentrates on images (although the solution is easily extendible to video, audio and other types of data), one common neural network is the Convolutional Neural Network (CNN), which consists of a set of layers of convolutional kernel matrices and non-linearity functions.

The encoding side may be considered as a system that receives an input image and produces an encoded image as an output. The encoding side may comprise various components, e.g. a neural network and an image/video compression block. The decoding side may be considered as a system that receives an encoded image and outputs a decoded image, and may comprise various components, e.g. a decoding algorithm (such as JPEG, JPEG2000, H.264, H.265 or the like) and a neural network. The encoded image may be transmitted by a transmitter to a receiver, where the decoder resides, or it may be stored locally as a file in a memory. The encoded image is assumed to require fewer bits to be represented than the original image. The receiver may comprise an apparatus similar to apparatus 50 or the computer graphics system of Figure 1. The receiver may also be considered to be at least one physical or logical sub-function of such an apparatus or system. For example, the term receiver may refer to decompressor circuitry or a memory storing a neural network, which may reside in apparatus 50 or the computer graphics system of Figure 1.

According to an embodiment, a neural network is trained and/or used to decide which regions of the input image are encoded and which ones are not encoded at all. Alternatively, the neural network may be trained and/or used to decide which regions are encoded with a higher bitrate and which ones are encoded with a lower bitrate. Furthermore, the specific bitrate used for a certain region may be adaptive to the region itself and need not be fixed to either of the two values.

The neural network may also be configured to decide about the regions based on the semantics of the region. The deep learning field has made it possible for neural networks to generate missing data in images and videos, or to "inpaint" them. Therefore, the neural network may decide based on how easy it is to imagine the considered region. The bitrate for that region will then be inversely proportional to how well it can be imagined. It is worth noticing that image enhancement techniques based on neural networks may be used not only for image inpainting (replacing missing content regions with plausible content) but also for quality improvement of existing data, such as increasing the resolution of images.
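
As a hypothetical illustration of the inverse relation between imaginability and bitrate, the sketch below maps a per-region imaginability score in [0, 1], as the first neural network might output, to a JPEG-style quality setting. The linear mapping and the quality range are assumptions; the embodiments do not prescribe any particular mapping.

```python
def quality_for_region(imaginability: float,
                       q_min: int = 20, q_max: int = 95) -> int:
    """Map an imaginability score in [0, 1] to an encoder quality setting.

    Highly imaginable regions get low quality (few bits); hard-to-imagine
    regions get high quality. The linear mapping is an illustrative assumption.
    """
    imaginability = min(max(imaginability, 0.0), 1.0)
    return round(q_max - imaginability * (q_max - q_min))

print(quality_for_region(0.9))  # easily imaginable -> low quality, e.g. 28
print(quality_for_region(0.1))  # hard to imagine   -> high quality, e.g. 88
```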

Figure 3 illustrates a general overview of the solution according to an embodiment. A first neural network 310 receives an input image 300 for analysis. In the analysis, a mask 320 of easily imaginable regions is produced as an output. The non-easily imaginable regions are encoded with a higher bitrate 330, and streamed in a bitstream 340 to a receiver. The masked regions may be either encoded at a lower bitrate and then streamed, or not encoded and streamed at all. Bitstream 340 may also include information about the mask and/or parameters for the second neural network 380. At the receiver's side, the bitstream is de-compressed 350 to produce a de-compressed image 370. The de-compression may take into account the encoding method used in the compression stage 330, as in common encoding-decoding techniques. The de-compressed image 370 and the mask and the neural network parameters 360 are input to a second neural network 380. Then the easily-imaginable regions are regenerated, determined, or imagined by the second neural network 380 to produce an output image 390.
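
The data flow of Figure 3 can be sketched end-to-end as below. The "networks" and the "codec" here are trivial stand-ins (a block-variance test, coarse quantization and mean filling) used only to show how the mask 320 travels alongside the bitstream 340; in the actual embodiments they would be trained neural networks and a real image/video codec.

```python
import numpy as np

def collaborator(image: np.ndarray) -> np.ndarray:
    """310/320: mark low-variance 8x8 blocks as 'easily imaginable' (stand-in)."""
    mask = np.zeros(image.shape, dtype=bool)
    for y in range(0, image.shape[0], 8):
        for x in range(0, image.shape[1], 8):
            block = image[y:y + 8, x:x + 8]
            mask[y:y + 8, x:x + 8] = block.var() < 1e-3
    return mask

def compress(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """330: full precision for non-imaginable pixels, coarse for masked ones."""
    coded = np.round(image * 255).astype(np.uint8)
    coded[mask] = (coded[mask] // 64) * 64        # fewer effective bits
    return coded

def decompress_and_regenerate(coded: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """350 + 380: decode, then 'imagine' masked pixels (mean fill as a stand-in)."""
    decoded = coded.astype(np.float32) / 255.0    # 370: de-compressed image
    if mask.any() and (~mask).any():
        decoded[mask] = decoded[~mask].mean()     # 380/390: regenerated regions
    return decoded

image = np.random.rand(32, 32).astype(np.float32)
mask = collaborator(image)                        # 320: sent alongside bitstream 340
output = decompress_and_regenerate(compress(image, mask), mask)
print(output.shape, int(mask.sum()))
```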

In order to train the first and second neural networks 310 and 380 (the former used for compression at the transmitter side, and the latter used for decompression at the receiver side), the present embodiments provide a training framework called "Collaborative-Adversarial Training" (CAT), shown in Figure 4. As shown in Figure 4, the CAT uses one Generator network G and two auxiliary networks: a Collaborator C and an Adversary A. The first neural network C 410 receives an input image 400 and is configured to output a mask of regions 420 which are easily imaginable by a second network G 430. The second network G 430 receives the masked image 420 (or, alternatively, an image where the masked regions are encoded at a lower bitrate) and is configured to imagine or reconstruct the masked regions. The imagined image 440 output by the second network G 430 is then analyzed by a third neural network A 450. The purpose of the third neural network A 450 is to try to discriminate this image from an image in the training set. Images in the training set may be natural images, i.e., images not modified in their content and semantics with respect to the original images. In other words, the third neural network A 450 receives images that are either imagined images from the Generator or training set images, and it is configured to determine whether a received image originated from the Generator or from the training set. The output of the third neural network A 450 is a classification probability, which is used to compute a loss 460 for training all three networks, by comparison to the origin of the image 400 input to the third neural network A 450. In particular, the first neural network C is trained to help or collaborate with the second neural network G, the second neural network G is trained to fool the third neural network A, and the third neural network A is trained to discriminate the second neural network G. At each iteration, the loss 470 tells the first neural network C how well the third neural network A has managed to discriminate the second neural network G in the last batch of training images, and thus the first neural network C needs to update its parameters accordingly, in order to help the second neural network G to improve in fooling the third neural network A.
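
A minimal PyTorch sketch of one Collaborative-Adversarial Training step along the lines of Figure 4 is given below. The three networks are passed in as arbitrary nn.Module instances; the soft-mask convention (values near 1 meaning "easily imaginable"), the masking by multiplication and the use of binary cross-entropy are assumptions made only to show how a single Adversary-side loss can drive updates of all three networks.

```python
import torch
import torch.nn.functional as F

def cat_training_step(real_images, collaborator, generator, adversary,
                      opt_c, opt_g, opt_a):
    """One illustrative Collaborative-Adversarial Training step (cf. Fig. 4).
    Architectures, mask convention and loss choice are assumptions."""

    # Adversary update (450): learn to tell training images from imagined ones.
    with torch.no_grad():
        mask = torch.sigmoid(collaborator(real_images))     # 410/420
        imagined = generator(real_images * (1.0 - mask))    # 430/440
    p_real = torch.sigmoid(adversary(real_images))
    p_fake = torch.sigmoid(adversary(imagined))
    loss_a = F.binary_cross_entropy(p_real, torch.ones_like(p_real)) + \
             F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake))
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()

    # Collaborator + Generator update (460/470): collaborate to fool the Adversary.
    mask = torch.sigmoid(collaborator(real_images))
    imagined = generator(real_images * (1.0 - mask))
    p_fake = torch.sigmoid(adversary(imagined))
    loss_cg = F.binary_cross_entropy(p_fake, torch.ones_like(p_fake))
    opt_c.zero_grad(); opt_g.zero_grad(); loss_cg.backward()
    opt_c.step(); opt_g.step()
    return loss_a.item(), loss_cg.item()
```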

Training is performed in an unsupervised way, or more precisely, in a self-supervised regime. In fact, no manually-provided labels are needed, but only images such as those in the ImageNet dataset. The target needed to compute the loss is represented by the true label about the origin of the input image to A, i.e., either the Generator or the training set.

Each network may have a different architecture. For example, the Collaborator C may consist of a "neural encoder" that is configured to apply several convolutions and non-linearities, and optionally to reduce the resolution by using pooling layers or strided convolutions. In addition, the Collaborator C may consist of a "neural encoder-decoder network", formed by a "neural decoder" that is configured to mirror a "neural encoder", by using convolutions, non-linearities and eventually up-sampling. The Collaborator may output a layer matrix representing the masked regions and the un-masked regions with one value per pixel representing how well that pixel is imaginable by the Generator G. If a binary mask is desired, the values may simply be thresholded. Alternatively, a binary mask may be obtained by having an output layer which performs regression on the coordinates of the masked regions. Then, the areas delimited by the coordinates will represent the masked regions. The binary mask isolates the masked media part from the rest of the image or video frame. The Generator G may be a "neural encoder-decoder", too, where the first part of the network extracts features and the second part reconstructs or imagines the image. The Adversary A may be a traditional classification CNN, formed by a set of convolutional layers and non-linearities, followed by fully-connected layers and a softmax layer for outputting a probability vector.

There are two possible alternatives after the Collaborator C has generated a mask of easily imaginable regions. In both cases the masked image is processed by a compression algorithm, but in different ways. According to an embodiment, the compression algorithm encodes only the non-easily imaginable regions. In such an embodiment, the easily imaginable regions are not encoded and not streamed. In one embodiment, the non-easily imaginable regions of the image may be divided into a plurality of sub-images that collectively cover the non-easily imaginable regions. The sub-images may be encoded separately. In one embodiment, the masked region may be assigned a predetermined pixel value before encoding. The whole image may be encoded together (including both masked and non-masked regions), but the masked region will be encoded with a very low number of bits because of the fixed pixel values in the masked region.
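
The thresholding of the Collaborator's per-pixel output into a binary mask, and the option of assigning a predetermined pixel value to the masked region before a standard encoder is run, could be sketched as below. The 0.5 threshold, the fill value of 128 and the use of NumPy arrays are illustrative assumptions.

```python
import numpy as np

def binarize_mask(soft_mask: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Threshold the Collaborator's per-pixel 'imaginability' values into a
    binary mask (True = easily imaginable). The 0.5 threshold is an assumption."""
    return soft_mask > threshold

def prepare_for_encoding(image: np.ndarray, mask: np.ndarray,
                         fill_value: int = 128) -> np.ndarray:
    """Assign a predetermined pixel value to the masked region so that a standard
    encoder spends very few bits on it (the whole image is still encoded)."""
    out = image.copy()
    out[mask] = fill_value
    return out

soft_mask = np.random.rand(64, 64)                    # stand-in Collaborator output
image = (np.random.rand(64, 64) * 255).astype(np.uint8)
masked_image = prepare_for_encoding(image, binarize_mask(soft_mask))
```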

Alternatively, both non-easily imaginable and easily imaginable regions are encoded, but at different bitrates, where the bitrate may be inversely proportional to the values output by the Collaborator C in that region. In both examples the information about the mask may be included into the encoded bitstream or sent separately.

Once training has converged, the Adversary A can be discarded and only the Collaborator C and the Generator G may be kept. In particular, the first neural network C will be included as part of the encoder's (i.e. transmitter's) side, whereas the Generator G will be part of the decoder's (i.e. receiver's) side. The Adversary A may still be kept in case the whole system will be updated further by continuing training after deployment. Figure 5 illustrates an overview of the encoder's side, and Figure 6 illustrates an overview of the decoder's side.

As shown in Figure 5, an input image 500 is received by the Collaborator 510 that outputs a mask 520. The input image 500 and the mask 520 are input to a compressor 530, which compresses image 500 according to the previous embodiments, so that non-easily imaginable parts of the image 500 are less compressed, resulting in a higher bitrate, and easily imaginable parts of the image 500, corresponding to the mask 520, are more compressed, resulting in a lower bitrate. As discussed earlier, the portions covered by the mask 520 may alternatively be removed and not encoded at all.

As shown in Figure 6, the receiver receives a bitstream 600 including image data, a mask, and/or a G version. The image data is de-compressed 610 to provide a decompressed image 630 having regions with a higher bitrate (solid lines) and a lower bitrate (hatched lines). The decompressed image 630 is transmitted to the Generator 640 that may also receive the mask and the G version 620. With such data, the Generator 640 is able to regenerate the image 650.

In the above, the term "G version" is used. In the present examples, the trained Generator G is optimized to work on images that have been encoded using a specific Collaborator C trained in the same training session. Therefore, the Collaborator C and the Generator G need to be compatible or synchronized. To this end, the version of the Generator G which is needed for decoding an image encoded via a certain Collaborator C network needs to be signaled to the decoder's side, for example as part of the bitstream. As shown in Figures 5 and 6, the version data is signaled between the transmitter and the receiver, and delivered to the Generator 640. In some embodiments the parameters that define the Generator 640 may be sent to the receiver in advance or be pre-configured at the receiver. Therefore, it is not always necessary to signal any Generator related information during the media streaming.
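
As a purely hypothetical illustration of signalling the G version inside the bitstream, the sketch below prepends a tiny header field carrying the Generator parameters version number to the encoded payload. The 4-byte marker and the field layout are invented for this example and are not part of any real image or video format.

```python
import struct

MAGIC = b"CGV1"  # invented 4-byte marker, purely illustrative

def write_header(generator_version: int, payload: bytes) -> bytes:
    """Prepend a tiny header carrying the Generator parameters version number."""
    return MAGIC + struct.pack(">I", generator_version) + payload

def read_header(data: bytes):
    """Return (generator_version, payload); the decoder picks the matching G."""
    assert data[:4] == MAGIC
    (version,) = struct.unpack(">I", data[4:8])
    return version, data[8:]

stream = write_header(7, b"...encoded image bytes...")
version, payload = read_header(stream)
print(version)  # 7 -> load the Generator trained jointly with Collaborator version 7
```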

As an additional embodiment, there may be different neural network versions for different contexts, such as sport, concert, indoor, outdoor, artificial (man-made) scene, natural (e.g. forest) scene, etc. These neural networks (at both the encoder's and the decoder's sides) would then be selected by using a "context neural network" which classifies the context. In such a case, the server is configured to provide the context related neural network topologies to the client once the client needs them for inpainting. The server may send the context related neural network simply by streaming the topology to the client either inside the video bitstream (e.g. utilizing metadata carriage mechanisms of the video bitstream or media segment file) or totally out of band by embedding the neural network topology representation inside an HTTP(S) response which is sent to the client. The information sent by the server may include an "effective-start-time" or a time interval parameter which indicates where in the presentation time the new network topology context can be utilized. Furthermore, the topology and parameters to be sent at every update may include only the difference between those in the currently used version and those in the updated version, in order to further reduce the bitrate (a sketch of such a difference-only update is given below).

Although the proposed Collaborative-Adversarial Training is contextualized here within the data compression domain, it is to be understood that it can be generalized to other domains. For example, the masked image produced by the Collaborator network may be used for different final tasks than reducing the bitrate. Or, even the entire Collaborator network may be trained for a completely different task, and it may also be trained to support/help multiple Generator networks instead of only one.
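
The difference-only parameter update mentioned above could be sketched with PyTorch state_dicts as follows; it assumes both sides hold an identical topology (same parameter names and shapes), and it omits the quantization or entropy coding that a real deployment would likely apply to the delta before transmission.

```python
import torch

def parameter_delta(old_state: dict, new_state: dict) -> dict:
    """Server side: difference between two versions of the same topology."""
    return {name: new_state[name] - old_state[name] for name in new_state}

def apply_delta(old_state: dict, delta: dict) -> dict:
    """Client side: reconstruct the updated parameters from the delta."""
    return {name: old_state[name] + delta[name] for name in old_state}

# Illustration with a tiny model; only the (sparse, usually small) delta is sent.
old_model = torch.nn.Linear(4, 2)
new_model = torch.nn.Linear(4, 2)
delta = parameter_delta(old_model.state_dict(), new_model.state_dict())
restored = apply_delta(old_model.state_dict(), delta)
assert all(torch.allclose(restored[k], new_model.state_dict()[k]) for k in restored)
```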

Regarding the data compression, not only visual data may be considered but also other types of data which have statistical structure and semantics (i.e. data which carry useful information, and not random data chunks), such as audio where frequency values may be predicted from other frequency values not based on redundancy (as in normal audio encoders) but based on semantics (e.g. semantics of speech).

Figure 7a is a flowchart illustrating a method according to an embodiment. A method implemented on a transmitter comprises receiving media data for compression 710; determining, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data 720; and providing the media data and the indication to a data compressor 730.

Figure 7b is a flowchart illustrating a method according to another embodiment. A method implemented on a receiver comprises receiving media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network 740; decompressing the media data 750; and regenerating a final media data in the neural network by using the indication and the parameters 760.

A part of media data that is determinable based on at least one other part of the media data may refer to a portion of the media data that has been removed from the original media data or that has been modified in some way, for example compressed at a higher compression level, such that other parts of the media data include information usable in at least partially recovering, reconstructing, or deducing the missing part or the original form of the modified part. The determinable part may also be referred to as an imaginable part, and these terms are used interchangeably throughout the specification.

An apparatus according to an embodiment comprises means for receiving media data for compression; means for determining, by a first neural network, an indication of at least one part of the media data that is determinable based on at least one other part of the media data; and means for providing the media data and the indication to a data compressor. The means comprise a processor, a memory, and computer program code residing in the memory.

An apparatus according to another embodiment comprises means for receiving media data with an indication of at least part of the media data that is determinable based on at least one other part of the media data, and parameters of a neural network; means for decompressing the media data; and means for regenerating a final media data in the neural network by using the indication and the parameters. The means comprise a processor, a memory, and computer program code residing in the memory.

An apparatus according to an embodiment is shown in Figure 8 as a simplified block chart. The apparatus 50 may comprise a housing for incorporating and protecting the device. The apparatus 50 may further comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any display technology suitable for displaying an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, a speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as a solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera system 42 capable of recording or capturing images and/or video. The camera system 42 may contain one or more cameras. The camera system is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage. The apparatus 50 may further comprise an infrared port for short range line-of-sight communication to other devices. According to an embodiment, the apparatus may further comprise any suitable short range communication solution such as, for example, a Bluetooth wireless connection or a USB (Universal Serial Bus)/firewire wired connection.

The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The controller 56 may be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller. The apparatus may be formed as a part of a server or cloud computing system. The apparatus may be configured to receive video and audio data from a capture device, such as for example a mobile phone, through one or more wireless or wired connections. The apparatus may be configured to analyze the received audio and video data and to generate a widened video field of view as described in the previous embodiments. The apparatus may be configured to transmit the generated video and/or audio data to an immersive video display apparatus, such as for example a head-mounted display or a virtual reality application of a mobile phone.

The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network. The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).

The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Accordingly, a computer program may be configured to carry out the features of one or more embodiments.

If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.

Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.

It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.