Title:
METHODS, SYSTEMS AND COMPUTER PROGRAMS FOR PROCESSING AND ADAPTING IMAGE DATA FROM DIFFERENT DOMAINS
Document Type and Number:
WIPO Patent Application WO/2023/280423
Kind Code:
A1
Abstract:
The present application discloses a method, a system and computer program for processing image data and adapting them to different domains. In particular, the method of the present application allows an online domain adaptation between source domain and target domain. After the adaptation the image data can be used as training set for a machine learning algorithm or as input images for an already trained machine learning system.

Inventors:
SCHMIDT NICO (DE)
SCHLICHT PETER (DE)
TERMÖHLEN JAN-AIKE (DE)
KLINGNER MARVIN (DE)
BRETTIN LEON J (DE)
FINGSCHEIDT TIM (DE)
Application Number:
PCT/EP2021/069167
Publication Date:
January 12, 2023
Filing Date:
July 09, 2021
Assignee:
CARIAD ESTONIA AS (EE)
VOLKSWAGEN AG (DE)
International Classes:
G06N3/04; B60W60/00; G05D1/02; G06N3/08
Other References:
YANG YANCHAO ET AL: "FDA: Fourier Domain Adaptation for Semantic Segmentation", 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 13 June 2020 (2020-06-13), pages 4084 - 4094, XP033805492, DOI: 10.1109/CVPR42600.2020.00414
ARNESH KUMAR ISSAR ET AL: "Reproducibility of 'FDA: Fourier Domain Adaptation for Semantic Segmentation'", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 30 April 2021 (2021-04-30), XP081946494
TERMOHLEN JAN-AIKE ET AL: "Continual Unsupervised Domain Adaptation for Semantic Segmentation by Online Frequency Domain Style Transfer", 2021 IEEE INTERNATIONAL INTELLIGENT TRANSPORTATION SYSTEMS CONFERENCE (ITSC), IEEE, 19 September 2021 (2021-09-19), pages 2881 - 2888, XP033993795, DOI: 10.1109/ITSC48978.2021.9564566
Attorney, Agent or Firm:
2SPL PATENTANWÄLTE PARTG MBB (DE)
Claims:
1. A method for processing image data comprising the steps of:

(i) Obtaining source image data D_S

(ii) Computing and storing a frequency spectrum of the source image data, ℛ^{D_S}

(iii) Obtaining target image data D_T from one or more sensors

(iv) Computing a frequency spectrum of the target image data D_T

(v) Extracting a stored frequency spectrum of the source image data

(vi) Replacing at least a part of the frequency spectrum of the target image data with the corresponding stored frequency spectrum of the source image data of step (v)

(vii) Generating a new image X^{D_T→S} based on the frequency spectrum of step (vi)

2. The method according to claim 1, wherein in step (ii) only an amplitude spectrum of the source image data is stored as ℛ^{D_S}

3. The method according to any of the previous claims wherein in step (vi) only an amplitude spectrum of the target data D_T is replaced

4. The method according to any of the previous claims wherein in step (v) the extracted frequency spectrum is the one having similarities with the frequency spectrum of the target image data D_T, said similarities between spectra being computed via a Euclidean metric

5. The method according to any of the previous claims wherein a discrete Fourier transform is used for computing the frequency spectrum of both the source data D_S and the target data D_T

6. The method according to any of the previous claims wherein the frequency spectrum is computed as an average over the frequency spectra of several images

7. The method according to any of the previous claims wherein the frequency spectra of the source image data D_S are computed during training of a machine learning algorithm and saved on a computer readable medium or on a server

8. The method according to any of the previous claims wherein the new image X^{D_T→S} is used as input for a machine learning system trained with the source image data D_S

9. The method according to claims 1 to 7 wherein the new image X^{D_T→S} is used for training a machine learning algorithm

10. The method according to claims 7 to 9 wherein the machine learning algorithm performs image segmentation or object classification or object detection

11. A system for processing image data, the system comprising an interface, one or more storage units and one or more processing devices, wherein the system is configured for:

Storing the frequency spectrum of the source domain data in the one or more storage units

Obtaining the target image data via the interface

Performing the method according to claims 1 to 10 via the one or more processing devices

12. The system according to claim 11 wherein the storage unit is a computer readable storage medium, a network storage or a cloud-server

13. A vehicle (500) comprising the system according to claims 11 to 12, wherein the system is configured to process the image data to perform object detection or image segmentation or object classification for use in a driving assistance feature of the vehicle.

14. The vehicle according to claim 13 further comprising one or more sensors (501) to capture the target image data

15. A computer program having a program code for performing at least one of the methods of one of the claims 1 to 10, when the computer program is executed on a computer, a processor, or a programmable hardware component

Description:
Methods, Systems and Computer Programs for Processing and Adapting Image Data from Different Domains

The present application discloses a method, a system and computer program for processing image data and adapting them to different domains. The present application further discloses a vehicle comprising such a system.

Machine learning has great potential for modern driver assistance systems and automated driving. Functions based on deep neural networks process raw sensor data (e.g. from camera, radar, lidar) to derive relevant information. This includes, for example, the type and position of objects in the vehicle environment, their behaviour, roadway geometries and topologies, as well as the detection and classification of free space for the vehicle. Among these types of networks, convolutional neural networks (CNNs) and the more recent vision transformer networks (ViT) have proven to be particularly suitable for image processing.

Methods based on machine learning algorithms must undergo a training process in order to be reliable. During the training process the machine learning algorithms are fed with a large amount of data: based on which type of data is used during training, machine learning methods can be divided into supervised, unsupervised and semi-supervised machine learning methods.

Unsupervised machine learning systems use machine learning algorithms to analyse and cluster unlabelled (not annotated) datasets. Those algorithms do not require the intervention of a human user during the training process. The data clustering and analysis can be done, for example, through k-means clustering or probabilistic clustering methods. A drawback of unsupervised machine learning systems is that they are non-transparent for humans, as humans have no influence on the way data are clustered and classified.

Semi-supervised machine learning algorithms use a smaller labelled data set to guide classification and feature extraction from a large, unlabelled data set. Semi-supervised machine learning has the drawback that the clustering assumptions of the small labelled data set do not always hold for the full unlabelled data set.

Supervised machine learning methods are based on the use of labelled (annotated) data sets for the training process.

Typically, human agents have to manually annotate the data by using annotation computer programs, into which the images to be annotated have been uploaded. Through a user interface the human agent draws geometric figures, for example bounding boxes or polygons, around each relevant object in an image which has to be annotated. A label is then assigned to the object in the geometric figure. A label describes the class to which an object belongs. Examples of classes are vehicles, pedestrians, traffic signs, traffic lights and so on. A drawback of supervised machine learning systems is that the process of manually annotating data is quite expensive and time-consuming.

A possibility for obtaining a larger amount of labelled data at lower cost is the use of data augmentation. Data augmentation makes it possible to artificially enlarge existing labelled data sets by applying small modifications to them. For example, during data augmentation existing labelled images can be randomly rotated, translated or cropped, or their colours can be modified and/or altered. Additionally, random objects can be added to the data. In a particular type of data augmentation, known as synthetic data generation, scenes can be generated that would be too dangerous to create in real life, e.g. traffic accidents. However, when models are trained on these synthetic datasets, the domain gap to real data typically leads to decreased performance of the machine learning algorithm.

One possibility to overcome this domain gap is to label additional target domain data; this, however, is time-consuming and expensive.

Another possibility to overcome the domain gap is to use methods for unsupervised domain adaptation (UDA): such methods allow information to be transferred from a (labelled) source domain to an (unlabelled) target domain, e.g. by adversarial learning. For most unsupervised adaptation methods, both (labelled) source and (unlabelled) target data must be provided during training. This means that the target domain must be known in advance and data from it must be available. In automotive applications this is not always the case, as the environment of the vehicle (the target domain) is not always known in advance.

Another problem is that unsupervised domain adaptation methods aim at training a network to generalize well towards unseen/unknown target domains, but those target domains must already be employed during source domain training; as a consequence, unsupervised domain adaptation methods are not suitable for the task of adapting an already trained model to a new domain.

In image processing it is known that the low-level spectrum of an image, in particular the low-frequency amplitude, can vary in a significant way. This variation does not affect the perception of the objects represented. In other words, the appearance of an object such as a vehicle, a pedestrian or a traffic sign in an image is independent of the low-level spectrum of the image itself. It is thus possible to alter the low-level spectrum of an image without altering its high-level semantics, leaving the perception of the objects unaltered.

This approach is studied in “FDA: Fourier Domain Adaptation for Semantic Segmentation” by Yang and Soatto (arXiv:2004.05498). In the paper the authors describe a method for unsupervised domain adaptation, wherein the style transfer between the source image data and the target image data is performed by replacing the low-frequency spectrum of the target image data with the low-frequency spectrum of the source image data. This approach is only suitable for offline domain adaptation, since the machine learning algorithm is trained on source image data that are transferred to look like target image data. This means that data from both domains (target and source domain) need to be available during training.

It is therefore a goal of the present application to improve the unsupervised domain adaptation according to the method of claim 1.

The method of the present application comprises the following steps:

(i) Obtaining source image data D_S

(ii) Computing and storing a frequency spectrum of the source image data, ℛ^{D_S}

(iii) Obtaining target image data D_T from one or more sensors

(iv) Computing a frequency spectrum of the target image data D_T

(v) Extracting a stored frequency spectrum R_n of the source image data

(vi) Replacing at least a part of the frequency spectrum of the target image data with the corresponding stored frequency spectrum of the source image data of step (v)

(vii) Generating a new image X^{D_T→S} based on the frequency spectrum of step (vi)

According to the method of claim 1, the new image is generated directly from the target data, i.e. an online domain adaptation is performed. The generation of the new image X^{D_T→S} takes place in real time, as the target image data D_T are continuously modified according to the method of claim 1. This has the advantage of increasing the performance on multiple unseen target domains, since those target data do not have to be available during training.
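Steps (i) to (vii) can be sketched with NumPy's FFT routines. This is an illustrative sketch only: the function name `adapt_image`, the `(H, W, C)` image layout and the mask half-width `b` are assumptions, not part of the claims, and the codebook of source amplitude spectra is assumed to be precomputed offline.

```python
import numpy as np

def adapt_image(target_img, codebook, b=8):
    """Online domain adaptation sketch for steps (i)-(vii).

    target_img: float array of shape (H, W, C).
    codebook:   array of shape (N, H, W, C) holding stored source
                amplitude spectra (step (ii), precomputed offline).
    b:          half-width of the low-frequency region to replace.
    """
    H, W, _ = target_img.shape
    # Steps (iii)-(iv): DFT of the target image, per colour channel.
    spec_t = np.fft.fft2(target_img, axes=(0, 1))
    amp_t, phase_t = np.abs(spec_t), np.angle(spec_t)

    # Step (v): extract the stored source spectrum closest in the
    # Euclidean metric (cf. claim 4).
    n_star = np.argmin([np.linalg.norm(amp_t - r) for r in codebook])
    amp_s = codebook[n_star]

    # Step (vi): replace the low-frequency part of the target amplitude
    # spectrum; the mask is built in centralized form and shifted back.
    mask = np.zeros((H, W, 1))
    mask[H // 2 - b:H // 2 + b, W // 2 - b:W // 2 + b] = 1.0
    mask = np.fft.ifftshift(mask, axes=(0, 1))
    amp_mix = mask * amp_s + (1.0 - mask) * amp_t

    # Step (vii): inverse DFT with the unaltered target phase.
    new_img = np.fft.ifft2(amp_mix * np.exp(1j * phase_t), axes=(0, 1))
    return np.real(new_img)
```

The adapted image can then be fed to a machine learning system trained on the source domain only, as described below.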

In the automotive field this also has the advantage that target data taken by one or more of the vehicle's sensors can undergo a style transfer according to the method of claim 1 before being used as input by the trained machine learning algorithm. In other words, the trained machine learning system can also be used for target domains that differ from the source domain without undergoing an additional training phase.

In a preferred embodiment the source image data are labelled data. Preferably the machine learning algorithm which is trained using such source image data is a semantic segmentation algorithm. Semantic segmentation refers to the process of linking each pixel in an image to a class. Examples of classes are vehicles, pedestrians, traffic signs, traffic lights and so on. The machine learning algorithm is preferably a neural network, for example a convolutional neural network.

Alternatively, the machine learning algorithm which is trained using the source data is an object detection algorithm. An object detection algorithm is a machine learning algorithm which takes as input an image comprising several objects, gives as output one or more bounding boxes around each object and assigns a class annotation to each bounding box.

Alternatively, the machine learning algorithm which is trained using the source data is an object classification algorithm. An object classification algorithm is a machine learning algorithm which takes as input an image comprising one object and assigns as output a class annotation (also called a class label) to said object.

The machine learning algorithm receives as input a source image data set D_S; those data sets are formed by images x ∈ G^{H×W×C}, where G denotes the set of integer grey values, H and W respectively the image height and width in pixels, and C = {1,2,3} the set of colour channels.

The method of the present invention further comprises the steps of computing and storing the frequency spectrum of the source image data D_S. Preferably the storing of the frequency spectrum of the source image data D_S takes place during the training process.

The frequency spectrum of the source image data set is computed by using a two-dimensional discrete Fourier transform (DFT) X_c = (X_{c,k,l}) ∈ ℂ^{H×W} for each colour channel c ∈ C = {1,2,3}:

X_{c,k,l} = Σ_{h=0}^{H−1} Σ_{w=0}^{W−1} x_{c,h,w} · e^{−2πi(kh/H + lw/W)},

with k, l denoting the indices along spectrum height and width, x_{c,h,w} denoting the pixel values of colour channel c in the input image x with the indices h, w along image height and width, and ℂ being the set of complex numbers.

The complex DFT spectrum of the entire input image is obtained by concatenating the complex spectra of all colour channels and is denoted as X = (X_{c,k,l}) ∈ ℂ^{H×W×C}. The amplitude spectrum of the source domain image is given by

|X_{c,k,l}| = √(Re[X_{c,k,l}]² + Im[X_{c,k,l}]²),

with Re[·], Im[·] denoting the real and imaginary parts of the complex spectrum, respectively.
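As a purely illustrative sketch, the per-channel DFT and the amplitude spectrum defined above could be computed with NumPy as follows (the function name and the `(H, W, C)` array layout are assumptions):

```python
import numpy as np

def amplitude_spectrum(x):
    """Compute |X| = sqrt(Re[X]^2 + Im[X]^2) for an image x of
    shape (H, W, C), applying the 2-D DFT to each colour channel."""
    X = np.fft.fft2(x, axes=(0, 1))                 # X_{c,k,l} for every channel c
    return np.sqrt(np.real(X) ** 2 + np.imag(X) ** 2)  # equals np.abs(X)
```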

The computed frequency spectra of the source image data are stored as a set ℛ^{D_S} = {R_n}, with n ∈ 𝒩 = {1, ..., N} denoting the index and N denoting the number of vectors and thereby the number of representations of the source domain that are being stored. This set will be called the codebook in the following. The codebook comprises codebook vectors R_n = (R_{n,c,k,l}) ∈ ℝ₊^{C×H×W}, with c, k, l being the indices along the channel, spectrum height and spectrum width dimensions and ℝ₊ being the set of all positive real-valued numbers. In a preferred embodiment, the codebook is generated using the Linde-Buzo-Gray (LBG) algorithm, employing the same source image data that are used for the training of the machine learning method.
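A minimal sketch of codebook construction is given below. It uses plain k-means as a simplified stand-in for the LBG algorithm named above (LBG additionally refines the codebook by iterative splitting); all function and parameter names are hypothetical.

```python
import numpy as np

def build_codebook(amp_spectra, num_entries, iters=10, seed=0):
    """Cluster flattened source amplitude spectra into num_entries
    codebook vectors R_n with plain k-means (simplified LBG stand-in).

    amp_spectra: array of shape (num_images, H, W, C)."""
    rng = np.random.default_rng(seed)
    data = amp_spectra.reshape(len(amp_spectra), -1)
    # initialise the centres with randomly chosen spectra
    centers = data[rng.choice(len(data), num_entries, replace=False)]
    for _ in range(iters):
        # assign each spectrum to its nearest centre (Euclidean metric)
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # move each centre to the mean of its assigned spectra
        for n in range(num_entries):
            if np.any(labels == n):
                centers[n] = data[labels == n].mean(axis=0)
    return centers.reshape((num_entries,) + amp_spectra.shape[1:])
```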

In a preferred embodiment the frequency spectrum can be computed over more than one image: an average is then calculated and stored in the codebook.

In another preferred embodiment, the frequency spectrum is divided into several frequency ranges or frequency subsets. Preferably only the frequency spectra corresponding to a chosen subset of frequencies are stored in the codebook. Preferably the parts of the frequency spectrum corresponding to the low-level spectrum of the image are stored.

In a preferred embodiment only the amplitude spectra of the low-level spectrum are stored in the codebook.

The method of the present invention further comprises the step of obtaining target image data D_T. According to a preferred embodiment, the target image data are taken by one or more of the vehicle's sensors. The sensor can be a camera, a stereocamera, a radar, a lidar or a combination thereof.

The frequency spectrum of the target image data D_T is computed by using a two-dimensional discrete Fourier transform (DFT) in the same way as explained above for the source image data D_S.

While the phase spectrum φ^{D_T} is kept unaltered, the amplitude spectrum |X^{D_T}| is modified according to the method of claim 1.

In a first step, the stored frequency spectrum of the source image data having similarities with the frequency spectrum of the target image is extracted from the codebook. In other words, the best matching source domain representation from the codebook for the current target domain sample |X^{D_T}| is extracted. This is accomplished via a Euclidean metric by selecting the codebook entry R_{n*} that holds

n* = argmin_{n ∈ 𝒩} ‖ |X^{D_T}| − R_n ‖₂.
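The codebook lookup described above could be sketched as follows; `best_entry` is a hypothetical helper name, and the codebook is assumed to be an iterable of amplitude-spectrum arrays:

```python
import numpy as np

def best_entry(amp_target, codebook):
    """Select n* = argmin_n || |X^{D_T}| - R_n ||_2 (Euclidean metric)."""
    dists = [np.linalg.norm(amp_target - r) for r in codebook]
    return int(np.argmin(dists))
```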

The selected codebook entry R_{n*} is then mixed with the target domain amplitude spectrum |X^{D_T}| using the following masking technique:

|X^{D_T→S}| = M ∘ R_{n*} + (1 − M) ∘ |X^{D_T}|,

with ∘ denoting element-wise multiplication, 1 being an all-ones tensor, and M denoting the employed mask.

The binary mask M is written in centralized form as M = (M_{c,k',l'}) ∈ {0,1}^{H×W×C}, with k' = k − H/2 and l' = l − W/2 denoting the shifted indices. Following Yang and Soatto (arXiv:2004.05498), the binary mask is defined as

M_{c,k',l'} = 1 if |k'| ≤ b and |l'| ≤ b, and M_{c,k',l'} = 0 otherwise,

with B = 2b ≪ H, W and B×B being the size of the non-zero part of the mask. To apply the mask to the amplitude spectra, the amplitude spectra must be shifted, since they are in non-centralized form. To accomplish this, the first and second quadrants are swapped with the third and fourth quadrants, respectively.
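The mask construction, quadrant swap and mixing described above can be sketched with NumPy's `fftshift`/`ifftshift`; the function and argument names are illustrative assumptions:

```python
import numpy as np

def mix_amplitudes(amp_t, amp_s, b):
    """Mix source and target amplitude spectra with a centralized
    binary mask M (non-zero B x B block, B = 2b):
    |X^{D_T->S}| = M o R_{n*} + (1 - M) o |X^{D_T}|."""
    H, W = amp_t.shape[:2]
    M = np.zeros_like(amp_t)
    M[H // 2 - b:H // 2 + b, W // 2 - b:W // 2 + b] = 1.0
    # the amplitude spectra arrive in non-centralized form, so swap
    # the quadrants before masking and swap them back afterwards
    a_t = np.fft.fftshift(amp_t, axes=(0, 1))
    a_s = np.fft.fftshift(amp_s, axes=(0, 1))
    mixed = M * a_s + (1.0 - M) * a_t
    return np.fft.ifftshift(mixed, axes=(0, 1))
```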

Finally, a new image or style-transferred image X^{D_T→S} = (x^{D_T→S}_{c,h,w}) is computed via a two-dimensional inverse discrete Fourier transform by using the unaltered phase spectrum φ^{D_T} and the mixed amplitude spectrum |X^{D_T→S}| as follows:

x^{D_T→S}_{c,h,w} = (1/(H·W)) Σ_{k=0}^{H−1} Σ_{l=0}^{W−1} X^{D_T→S}_{c,k,l} · e^{2πi(kh/H + lw/W)},

with the complex spectrum obtained by

X^{D_T→S}_{c,k,l} = |X^{D_T→S}_{c,k,l}| · e^{i·φ^{D_T}_{c,k,l}}.
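The inverse transform above can be sketched as follows, assuming the mixed amplitude spectrum and the unaltered target phase are given as NumPy arrays (the helper name is hypothetical):

```python
import numpy as np

def reconstruct(amp_mixed, phase_t):
    """Recombine the mixed amplitude and the unaltered target phase
    into the complex spectrum X = |X| * exp(i*phi) and invert the
    2-D DFT to obtain the style-transferred image."""
    spec = amp_mixed * np.exp(1j * phase_t)
    return np.real(np.fft.ifft2(spec, axes=(0, 1)))
```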

According to a preferred embodiment the new image or style-transferred image X^{D_T→S} = (x^{D_T→S}_{c,h,w}) is then used as input for a previously trained machine learning algorithm, said machine learning algorithm having been trained on the source image data only. In a preferred embodiment the machine learning algorithm is a semantic segmentation neural network that has been trained on the source domain only.

According to yet another embodiment of the present invention the new image or style-transferred image X^{D_T→S} = (x^{D_T→S}_{c,h,w}) is used as a training image for training a machine learning algorithm. This has the advantage of improving the robustness and precision of the machine learning algorithm, especially if different target image data are used. Preferably both the new image and the source image data are used for training the machine learning algorithm. As a consequence, machine learning algorithms trained using both the new image and the source image data perform better in domain adaptation.

The present application discloses a system according to claim 11. The system comprises an interface, one or more storage units and one or more processing devices capable of performing the method for generating the style-transferred image X^{D_T→S}.

The interface may correspond to one or more inputs and/or outputs for receiving and/or transmitting information, which may be in digital (bit) values according to a specified code, within a module, between modules or between modules of different entities. For example, the interface may comprise interface circuitry configured to receive and/or transmit information.

The one or more processing devices may be implemented using one or more processing units or any means for processing, such as a processor, a computer or a programmable hardware component being operable with accordingly adapted software. In other words, the described function of the one or more processing devices may as well be implemented in software, which is then executed on one or more programmable hardware components. Such hardware components may comprise a general-purpose processor, a Digital Signal Processor (DSP), a micro-controller, etc.

In at least some embodiments, the one or more storage units may comprise at least one element of the group of a computer readable storage medium, such as a magnetic or optical storage medium, e.g. a hard disk drive, a flash memory, a floppy disk, Random Access Memory (RAM), Programmable Read Only Memory (PROM), Erasable Programmable Read Only Memory (EPROM), Electronically Erasable Programmable Read Only Memory (EEPROM), or a network storage. The storage unit may also be an external cloud-server.

The one or more storage units store the codebook: in this way, the frequency spectra and in particular the amplitude spectra of the source data can be retrieved at any time.

The present application discloses a vehicle according to claim 13. The vehicle may be equipped with one or more sensors. Examples of sensors are radar, LiDAR, camera, stereocamera or a combination thereof.

In a preferred embodiment the target data are data captured by one or more sensors mounted on the vehicle. The target data are then processed according to the method disclosed above and used for performing object detection, object classification or image segmentation for use in a driving assistance feature (e.g. an autonomous or semi-autonomous driving operation) of the vehicle.

The method of the present application can be implemented as a computer program according to claim 15.

Special embodiments of the present invention are described below.

Fig. 1 shows a flow chart of a method for training a machine learning algorithm using source image data D_S and for generating a codebook ℛ^{D_S}

Fig. 2 shows a flow chart of a method for domain adaptation between target domain data D_T and source domain data D_S

Fig. 3 shows a preferred embodiment of the present application

Fig. 4 describes the function that generates the mixed amplitude spectrum |X^{D_T→S}|

Fig. 5 shows a vehicle equipped with a system according to the present application

In step 101 in Figure 1 source image data D_S are obtained. Preferably the source image data are obtained through an interface. The source image data D_S comprise a set of images x^{D_S}. Those images are used in step 102 as training images for a machine learning algorithm. The machine learning algorithm transforms the images into outputs y^{D_S}. For example, in the case of a semantic segmentation algorithm the outputs y^{D_S} give the a posteriori probability that a given pixel belongs to a given class on which the semantic segmentation algorithm is trained.

Additionally, in step 103 a two-dimensional Fourier transform is applied to the source image data D_S in order to compute the frequency spectrum X^{D_S}. Note that step 103 can be performed at the same time as the training process in step 102, as well as before or after step 102. In other words, step 102 and step 103 are not connected to each other and as such can be performed independently from each other. In step 104 the frequency spectra X^{D_S} are saved in the codebook. In a preferred embodiment only the amplitude spectra |X^{D_S}| are stored in the codebook. In yet another preferred embodiment only the low-level part of the amplitude spectra is saved in the codebook.

In Fig. 2 a method for domain adaptation between target domain data D_T and source domain data D_S is shown. The method is an online method. In step 201 a set of target image data is obtained. In this example the target image data are captured directly by one or more of the vehicle's sensors. In step 202 a two-dimensional Fourier transform is applied to the target image data D_T in order to compute the frequency spectrum X^{D_T}. The frequency spectrum presents a phase φ^{D_T} and a spectral amplitude |X^{D_T}|. In step 203 the spectral amplitude |X^{D_T}| is replaced by the most similar spectral amplitude from the codebook computed in Fig. 1. The output of step 203 is the new amplitude spectrum |X^{D_T→S}|.

In step 204 a new image or style-transferred image is computed via a two-dimensional inverse discrete Fourier transform by using the unaltered phase spectrum φ^{D_T} from step 202 and the mixed amplitude spectrum |X^{D_T→S}| from step 203.

The new image or style-transferred image is then used as input to the trained machine learning algorithm of Fig. 1.

In Fig.3 another embodiment of the present application is described.

In step 301a source domain data D_S are obtained, while in step 301b target domain data D_T are obtained. The source domain data D_S are formed by a set of images x^{D_S}, while the target domain data D_T are formed by a set of images x^{D_T}.

In step 302a the two-dimensional Fourier transform of the source domain data D_S is computed, while in step 302b the two-dimensional Fourier transform of the target domain data D_T is computed. In step 303 the frequency spectra X^{D_S} are saved in the codebook. In a preferred embodiment only the amplitude spectra |X^{D_S}| are saved in the codebook.

In step 304 the amplitude spectra of the target domain data D_T are replaced by the stored amplitude spectra of the codebook entry R_{n*} which satisfies

n* = argmin_{n ∈ 𝒩} ‖ |X^{D_T}| − R_n ‖₂.

The output of step 304 is a mixed amplitude spectrum |X^{D_T→S}|.

In step 305 a new image or style-transferred image X^{D_T→S} is computed via a two-dimensional inverse discrete Fourier transform by using the unaltered phase spectrum φ^{D_T} from step 302b and the mixed amplitude spectrum from step 304.

In step 306 the new image or style-transferred image is used as a training input for a machine learning algorithm. Alternatively, the source domain data images x^{D_S} can also be used as training images together with the style-transferred images X^{D_T→S}.

Fig. 4 describes the function 400 which generates the mixed amplitude spectrum |X^{D_T→S}|.

Given as input the spectral amplitude |X^{D_T}| (step 401) and the codebook ℛ^{D_S} (step 402), the codebook entry R_{n*} is calculated in step 403. In step 404 a masking function is applied and the mixed amplitude spectrum |X^{D_T→S}| is calculated.

Fig. 5 shows a vehicle 500. The vehicle comprises one or more sensors 501. For example, the one or more sensors 501 can be one or more cameras. The sensors take one or more images of the vehicle's environment. The images constitute the target image dataset D_T.

The target image dataset is then sent to one or more processing devices 502. The processing devices 502 calculate the frequency spectra of the target image data according to the method explained above. In addition, the processing devices are connected to one or more storage units 503. The one or more storage units 503 store the codebook ℛ^{D_S}. In this example the storage units 503 are located on an external server and the vehicle 500 is connected to them via a well-known vehicle-to-infrastructure technology.

The one or more storage units 503 could also be mounted directly in the vehicle and be connected to the processing devices 502 via cables.

The processing devices 502 take the information from the codebook stored in the storage units 503 and modify the target data according to the methods explained above in order to generate a new image X^{D_T→S}.

The new image is then used as input for a machine learning algorithm, previously trained with the source image dataset. The machine learning algorithm is preferably stored in the processing devices 502.

In a preferred embodiment the machine learning algorithm is an object classification algorithm. Given as input the new image X^{D_T→S}, the machine learning algorithm classifies as output the objects present in the input image according to classes. Examples of classes are vehicles, pedestrians, traffic signs, traffic lights and so on.

In a preferred embodiment the machine learning algorithm is an object detection algorithm. Given as input the new image X^{D_T→S}, the machine learning algorithm gives as output the object class and the object localization in the image (typically by drawing a bounding box around the object).

In another preferred embodiment the machine learning algorithm stored in the processing devices 502 is a semantic segmentation algorithm. Given as input the new image X^{D_T→S}, the semantic segmentation algorithm labels each pixel of the image with a corresponding class. For example, the algorithm could label each pixel of the image according to the classes "road", "vehicle" or "pedestrian".

The results of the machine learning algorithm can be used for controlling the vehicle, for example by determining a trajectory to avoid collision with other vehicles and/or objects.

Various features, aspects, and embodiments have been described herein. The features, aspects, and embodiments are susceptible to combination with one another as well as to variation and modification, as will be understood by those having skill in the art.
