

Title:
DEVICE AND METHOD FOR SELECTING A DIGITAL IMAGE
Document Type and Number:
WIPO Patent Application WO/2023/012012
Kind Code:
A1
Abstract:
A computer-implemented method for selecting a digital image (xI) out of a provided set of digital images (xI) depending on provided query data (xQ), wherein a cross-modal embedder (60) is provided that is configured to receive digital image data (xI) and provide an embedding (rI) depending on the received digital image data (xI) and wherein said cross-modal embedder (60) is also configured to receive query data (xQ) and provide an embedding (rQ) depending on said query data (xQ), said method comprising the steps of: - inputting at least one digital image (xI) of said provided set of digital images (xI) into said cross-modal embedder (60) to obtain a set of corresponding image embeddings (rI), - inputting said query data (xQ) into said cross-modal embedder (60) to obtain a query embedding (rQ), and - selecting digital images (xI) from said provided set of digital images (xI) based on said query embedding (rQ) and based on at least one embedding (rI) of said obtained set of embeddings (rI).

Inventors:
GUO ZE (DE)
Application Number:
PCT/EP2022/071111
Publication Date:
February 09, 2023
Filing Date:
July 27, 2022
Assignee:
BOSCH GMBH ROBERT (DE)
International Classes:
G06F16/58
Foreign References:
CN 111753116 A (2020-10-09)
US 2020/0394213 A1 (2020-12-17)
DE 10 2021 206375 A (2021-06-22)
Other References:
ZHONGWEI XIE ET AL: "Learning TFIDF Enhanced Joint Embedding for Recipe-Image Cross-Modal Retrieval Service", arXiv.org, Cornell University Library, 2 August 2021 (2021-08-02), XP091025949, DOI: 10.1109/TSC.2021.3098834
Claims

1. A computer-implemented method for selecting a digital image (xI) out of a provided set of digital images (xI) depending on provided query data (xQ), wherein a cross-modal embedder (60) is provided that is configured to receive digital image data (xI) and provide an embedding (rI) depending on the received digital image data (xI) and wherein said cross-modal embedder (60) is also configured to receive query data (xQ) and provide an embedding (rQ) depending on said query data (xQ), said method comprising the steps of:

- inputting at least one digital image (xI) of said provided set of digital images (xI) into said cross-modal embedder (60) to obtain a set of corresponding image embeddings (rI),

- inputting said query data (xQ) into said cross-modal embedder (60) to obtain a query embedding (rQ), and

- selecting digital images (xI) from said provided set of digital images (xI) based on said query embedding (rQ) and based on at least one embedding (rI) of said obtained set of embeddings (rI).

2. The method according to claim 1, wherein said selected digital image (xI) is obtained based on a determined similarity between said query embedding (rQ) and said at least one embedding (rI).

3. The method according to claim 2, wherein said selected digital image (xI) is obtained as that one of said plurality of provided digital images (xI) of which the computed image embedding (rI) is most similar to said query embedding (rQ).

4. The method according to claim 2, wherein said selected digital image (xI) is obtained if said determined similarity between said query embedding (rQ) and said at least one embedding (rI) is deemed to be more similar than a provided threshold value.

5. The method according to any one of the above claims, wherein said cross-modal embedder (60) is trained based on pairs (xI, xQ) of corresponding image data (xI) and query data (xQ), wherein said training comprises adjusting parameters (O) of said cross-modal embedder (60) such as to optimize a loss function which is configured to encourage the embeddings (rI, rQ) of corresponding image data (xI) and query data (xQ) obtained with said cross-modal embedder (60) to be close to each other.

6. The method according to any one of the above claims, in which said set of digital image data (xI) is received from a digital camera (30).

7. The method according to any one of the above claims, in which the selected images (xI) are transmitted to a remote server (50).

8. A computer program comprising instructions which cause a computer (45, 145) to carry out the method according to any one of the above claims if the computer (45, 145) is running said computer program.

9. A machine-readable storage medium (46, 146) on which the computer program according to claim 8 is stored.

10. A camera-control system (101) comprising the machine-readable storage medium (46) according to claim 9.

11. A system (41) for selecting a digital image (xI) comprising the machine-readable storage medium (46) according to claim 8 and a computer (45) configured to carry out the computer program stored on said machine-readable storage medium (46), said system (41) also comprising a machine-readable storage medium (St4) on which said query data (xQ) is stored.

12. A method for updating the system according to claim 11, including the steps of receiving, by said system (41), new query data (xQ) and updating said stored query data (xQ) with said new query data (xQ).

Description:
Device and method for selecting a digital image

The invention concerns a method for selecting a digital image, a computer program and a machine-readable storage medium, a control system, and a system for selecting a digital image.

Prior art

Semantic image search can be applied for realizing offline active learning, i.e. for finding relevant cases from a large image database. Such search methods may be called content-based image retrieval.

If the query input for the search is text, then every image in the database will need to be tagged beforehand (e.g. it may be necessary to generate metadata tags).

If the query input is an image, then every image may be encoded into some representation using a feature extractor, and the query is then compared to every feature in the database. The final search result may be the top results with the highest similarity.

Encoding of images and text into representative features and associating images with corresponding text descriptions by training a cross-modal model is known from “Learning Transferable Visual Models From Natural Language Supervision”, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever, arXiv preprint arXiv:2103.00020v1 (2021).

Unpublished DE 10 2021 206375 discloses a method for online data curation of digital image data.

For online data curation, it is possible to train different scene classifiers on some joint features and deploy these scene classifiers to a target device for identifying different scenes in real time.

Advantages of the invention

The method with the features of independent claim 1 can be used both for content-based image search (also known as reverse image search) and for online data filtering (which may also be called online data selection).

For content-based/reverse image search, this approach does not need any tag-generation step and can directly realize accurate semantic search, based on either natural-language text or an image, for finding relevant images for machine-learning development.

For online data filtering, which may happen on a target device (e.g. in a vehicle like a passenger car) in real time, this approach does not require training numerous kinds of scene classifiers at all. Therefore, it provides high convenience and flexibility in customizing the filtering criteria while ensuring system integrity. As changing the selection criteria merely amounts to updating some text or image features (usually a few kilobytes) on the device, no changes are made to the existing on-device software, and therefore no testing and verification of the system as a whole are needed.

Further improvements are presented in the dependent claims. Further aspects of the invention are presented in the parallel independent claims.

Disclosure of the invention

According to one aspect of the invention, it may be envisioned to have a computer-implemented method for selecting a digital image out of a provided set of digital images depending on provided query data, wherein a cross-modal embedder is provided that is configured to receive digital image data and provide an embedding depending on the received digital image data and wherein said cross-modal embedder is also configured to receive query data and provide an embedding depending on said query data, said method comprising the steps of: - inputting at least one digital image of said provided set of digital images into said cross-modal embedder to obtain a set of corresponding image embeddings (i.e. each one of said provided digital images is inputted into said cross-modal embedder to obtain an embedding corresponding to said inputted one digital image),

- inputting said query data into said cross-modal embedder to obtain a query embedding, and

- selecting digital images from said provided set of digital images based on said query embedding and based on at least one embedding of said obtained set of embeddings.

Said set of images may be given by a single digital image in some embodiments, or a plurality of digital images in other embodiments.

Preferably, said selected digital image is obtained based on a determined similarity between said query embedding and said at least one embedding.

For example, said selected digital image may be obtained as that one of said plurality of provided digital images whose corresponding computed image embedding is most similar to said query embedding. This is a very convenient way to realize content-based image search using flexible queries.
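As a non-authoritative sketch of this most-similar selection (cosine similarity is one possible measure, mentioned later in the description; the function name and the use of NumPy are illustrative assumptions, not part of the application):

```python
import numpy as np

def select_most_similar(image_embeddings: np.ndarray, query_embedding: np.ndarray) -> int:
    """Return the index of the image whose embedding (rI) is most similar,
    by cosine similarity, to the query embedding (rQ).

    image_embeddings: shape (N, D), one row per provided digital image
    query_embedding:  shape (D,)
    """
    # Normalising both sides turns the dot product into the cosine similarity.
    imgs = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    return int(np.argmax(imgs @ q))
```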

Alternatively, said selected digital image may be obtained if said determined similarity between said query embedding and said at least one embedding is smaller than a provided threshold value. This offers a convenient way especially for selecting images that match a given query sufficiently well.

Training of said cross-modal embedder may be based on pairs of corresponding image data and query data, wherein said training comprises adjusting parameters of said cross-modal embedder such as to optimize a loss function which is configured to encourage the embeddings of corresponding image data and query data obtained with said cross-modal embedder to be close to each other.
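As a sketch only (the symbol E for the embedder, the parameter vector θ standing in for the parameters (O), and the concrete choice of distance d are assumptions, not specified by the application), such a training objective could take the form:

```latex
\min_{\theta}\;\mathcal{L}(\theta)
  \;=\; \sum_{(x_I,\,x_Q)} d\bigl(E_{\theta}(x_I),\,E_{\theta}(x_Q)\bigr),
\qquad
d(r_I, r_Q) \;=\; 1 - \frac{r_I \cdot r_Q}{\lVert r_I \rVert\,\lVert r_Q \rVert}
```

so that minimizing the loss over the training pairs drives the embeddings of corresponding image and query data towards each other.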

Hence, it is also possible in some embodiments to envision the method as a two-step procedure, with the training of said cross-modal embedder being carried out in a first step and the use of said trained cross-modal embedder being carried out in a subsequent second step.

In some embodiments, it may be envisioned that said set of digital image data is received from a digital camera, for example as a stream of digital images.

In further embodiments, it may be envisioned that the selected images, in particular only the selected digital images, are transmitted to a remote server. This may be especially useful if the method is carried out in a mobile device, for example a vehicle with a limited-bandwidth connection between said vehicle and said remote server.

It should be noted that if said query data is stored on said system for selecting images, it is especially easy to update the criteria according to which it is decided whether or not to select images, as merely new query data has to be uploaded to the system to replace the existing query data.

Embodiments of the invention will be discussed with reference to the following figures in more detail. The figures show:


Figure 1 a flow-chart diagram of a method for training said cross-modal embedder;

Figure 2 a flow-chart diagram of a method for selecting images from an object database by a received query;

Figure 3 a flow-chart diagram of a method for selecting images from a stream against a provided query;

Figure 4 an embodiment of a training device for training said cross-modal embedder;

Figure 5 an embodiment of a device for selecting images from an object database by a received query;

Figure 6 an embodiment of a device for selecting images from a stream against a provided query;

Figure 7 a preferred embodiment illustrating the use of the device for selecting images.

Description of the embodiments

Shown in figure 1 is a flow-chart diagram of a method for training said cross-modal embedder, which may, for example, be a machine-learning model like an artificial neural network. As will be appreciated by a person having skill in the art, said training may proceed following the principles of training such a machine-learning model by optimization of a loss function.

The training method begins by providing (1000) connected pairs of images and corresponding description data, which may preferably be textual. For example, such an image may be an image showing a road with no other traffic participants present, and said textual description may read “empty road”. Said description may also be given by audio data, for example by a recording of a person saying the words “empty road”. Said image data will be denoted as ‘xI’, said corresponding description data as ‘xQ’.

Next, said image data and said corresponding description data are inputted (1100) into a cross-modal embedder. Such a cross-modal embedder is known, for example, from the aforementioned publication “Learning Transferable Visual Models From Natural Language Supervision” and generally is a machine-learning model that computes an embedding, i.e. a numeric vector representation in a lower-dimensional space (known as the ‘latent space’ to the person having skill in the art), from either said image data (xI) (said corresponding embedding being called ‘rI’) or from said corresponding description data (xQ) (said corresponding embedding being called ‘rQ’).
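As a minimal sketch of such a dual-encoder arrangement in PyTorch (the class and method names are hypothetical, and the application does not prescribe any specific encoder architecture):

```python
import torch
import torch.nn as nn

class CrossModalEmbedder(nn.Module):
    """Sketch of a dual-encoder cross-modal embedder: one encoder per
    modality, both mapping into the same D-dimensional latent space."""

    def __init__(self, image_encoder: nn.Module, query_encoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. a CNN or vision transformer
        self.query_encoder = query_encoder   # e.g. a transformer over tokenized text

    def embed_image(self, x_i: torch.Tensor) -> torch.Tensor:
        r_i = self.image_encoder(x_i)
        return r_i / r_i.norm(dim=-1, keepdim=True)   # unit-length embedding rI

    def embed_query(self, x_q: torch.Tensor) -> torch.Tensor:
        r_q = self.query_encoder(x_q)
        return r_q / r_q.norm(dim=-1, keepdim=True)   # unit-length embedding rQ
```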

Having obtained said embeddings (rI) and (rQ), a loss function is then computed (1200) which imposes a penalty if said embeddings (rI) and (rQ) differ from each other. Preferably, the bigger the distance |rI − rQ| between said embeddings is, the larger said loss penalty is. Said loss function may be given, for example, by a cross-entropy loss term.

Then, parameters (O) characterizing the behavior of said cross-modal embedder are adjusted (1300) depending on said loss function. If said cross-modal embedder is a neural network, said parameters (O) are commonly known as ‘weights’. Said adjustment may, for example, be carried out by an algorithm known as stochastic gradient descent, in combination with backpropagation through the layers of said neural network.

It is then checked (1400) whether the method has finished, e.g. by checking whether convergence criteria of said method are satisfied. In the case of convergence, the method finishes (1500); otherwise it iterates back to step (1000).
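A hedged sketch of one training iteration, steps (1100) to (1300), reusing the hypothetical CrossModalEmbedder above; the symmetric cross-entropy (contrastive) loss and the temperature value are assumptions in the spirit of the cited CLIP publication, not mandated by the application:

```python
import torch
import torch.nn.functional as F

def training_step(embedder, optimizer, x_i_batch, x_q_batch, temperature: float = 0.07):
    """One optimisation step: embed a batch of paired image/query data,
    compute a contrastive loss that pulls matching embeddings together,
    and adjust the parameters by gradient descent."""
    r_i = embedder.embed_image(x_i_batch)     # (B, D), normalised image embeddings rI
    r_q = embedder.embed_query(x_q_batch)     # (B, D), normalised query embeddings rQ

    logits = (r_i @ r_q.t()) / temperature    # pairwise similarities, shape (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: the i-th image should match the i-th query and vice versa.
    loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    optimizer.zero_grad()
    loss.backward()    # backpropagation through the network layers
    optimizer.step()   # e.g. a stochastic-gradient-descent update
    return loss.item()
```

In a full loop, this step would be repeated over batches of pairs until the convergence check (1400) is satisfied.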

It will be appreciated that above steps may be carried out for individual pairs of image data and corresponding description data, or for a plurality of such pairs. Such pluralities of pairs are commonly known as ‘batches’, or ‘mini-batches’.

Shown in figure 2 is a flow-chart diagram of a method for selecting images (xI) from an image database by a received query (xQ). A plurality of images (xI) is provided (2000) in a database. A query (xQ) is received (2100). Said query (xQ) may, for example, be an image, or it may be textual. In some embodiments the cross-modal embedder has been trained, for example with the method depicted in figure 1, to compute an embedding (rQ) for query data of said modality.

Using said cross-modal embedder, embeddings (rI) corresponding to each image (xI) in said database and an embedding (rQ) corresponding to said query (xQ) are computed (2200). A measure characterizing corresponding similarities between said query embedding (rQ) and each of said embeddings (rI) is computed next (2300), for example a cosine similarity. Then, using a predefined number of results (K), those images (xI) whose corresponding embeddings (rI) yield the measure indicating the largest similarity with said query (xQ) are outputted (2400). This concludes the method.
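A minimal NumPy sketch of steps (2200) to (2400), assuming the embeddings have already been computed and cosine similarity is the chosen measure (the function name and the use of np.argpartition are illustrative):

```python
import numpy as np

def retrieve_top_k(image_embeddings: np.ndarray, query_embedding: np.ndarray, k: int) -> np.ndarray:
    """Rank all database images by cosine similarity to the query embedding
    and return the indices of the K best matches (assumes k < number of images)."""
    imgs = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    sims = imgs @ q                          # (N,) cosine similarities
    top_k = np.argpartition(-sims, k)[:k]    # unordered top-K candidates
    return top_k[np.argsort(-sims[top_k])]   # sorted by decreasing similarity
```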

It should be noted that when the number of images in the database becomes large, e.g. if it exceeds a billion, computing the similarity measure against the whole dataset can become rather expensive. To address this issue, approximate nearest neighbor search may be applied. Such methods are known from e.g. https://hal.inria.fr/inria-00514462v2/document and https://arxiv.org/abs/1702.08734.
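As an illustrative sketch of such an approximate nearest-neighbor search with the FAISS library (the second cited reference; the inverted-file parameters nlist and nprobe below are arbitrary example values, and embeddings are assumed to be L2-normalized float32 so that inner product equals cosine similarity):

```python
import faiss  # https://github.com/facebookresearch/faiss
import numpy as np

def build_ann_index(image_embeddings: np.ndarray, nlist: int = 1024) -> faiss.Index:
    """Index normalized float32 embeddings for approximate inner-product search."""
    d = image_embeddings.shape[1]
    quantizer = faiss.IndexFlatIP(d)  # coarse quantizer for the inverted file
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(image_embeddings)     # learn the coarse clustering
    index.add(image_embeddings)       # add the database embeddings
    index.nprobe = 16                 # clusters visited per query (speed/recall trade-off)
    return index

# Usage sketch: sims, ids = index.search(query_embedding[None, :].astype("float32"), 10)
```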

Shown in figure 3 is a flow-chart diagram of an embodiment of a method for selecting received images (xI) that match a predefined criterion. Said criterion is provided as query data (xQ). As in the method illustrated in figure 2, said query data may be given in one of several modalities, such as text. Next, an input image (xI) is received (3100) from a source, e.g. from a video sensor. It may be possible that a stream of such input images (xI) is received. Then, an embedding (rI) of said received image (xI) and an embedding (rQ) of said query data (xQ) are computed (3200). Then, as in the method illustrated in figure 2, a similarity measure that characterizes a similarity between said embeddings (rI, rQ) is computed, for example a cosine similarity.

Next (3400), said similarity measure is compared to a predefined similarity threshold. In some embodiments, if said similarity measure is such that larger values indicate a higher degree of similarity between said embeddings (rI, rQ), it may be provided that if said similarity measure exceeds said similarity threshold, the image (xI) is selected and may, for example, be transmitted to a remote server (3500). Otherwise, it is discarded (3600). Alternatively, in different embodiments, if said similarity measure is such that smaller values indicate a higher degree of similarity between said embeddings (rI, rQ), it may be provided that if said similarity measure is smaller than said similarity threshold, the image (xI) is selected and may, for example, be transmitted to a remote server (3500). Otherwise, it is discarded (3600). This concludes this method.
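A sketch of this filtering branch, steps (3100) to (3600), under the assumption that larger similarity values mean more similar (all function names here are placeholders, not part of the application):

```python
import numpy as np

def filter_stream(embed_image, query_embedding, image_stream, threshold,
                  send_to_server, discard):
    """Select incoming images whose cosine similarity to the stored query
    embedding exceeds the threshold; discard the rest."""
    q = query_embedding / np.linalg.norm(query_embedding)   # embedding rQ
    for x_i in image_stream:                                # step (3100)
        r_i = embed_image(x_i)                              # step (3200)
        r_i = r_i / np.linalg.norm(r_i)
        if float(r_i @ q) > threshold:                      # step (3400)
            send_to_server(x_i)                             # step (3500)
        else:
            discard(x_i)                                    # step (3600)
```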

Shown in figure 4 is an embodiment of a training system (140) to carry out the method illustrated in figure 1. Preferably, said training system (140) comprises a processor (145) and a computer memory (146), and said method is executed by carrying out a computer program which is stored on said computer memory (146). Said computer program is carried out by said processor (145). The training system (140) comprises a first storage (St1) and a second storage (St2). The parameters (O) of said cross-modal embedder (60) are stored in said first storage (St1). Stored on said second storage (St2) are pairs of image data (xI) and corresponding description data (xQ). Upon computing corresponding embeddings (rI, rQ) of said pairs, an adjuster (70), which is also comprised by said training system (140), computes adjusted parameters (O') as described in connection with figure 1, which update the stored parameters (O).

Figure 5 illustrates an embodiment of an image retriever (40) that is configured to select images from an object database by a received query with the method illustrated in connection with figure 2. The image retriever (40) comprises a computer memory (46) on which a computer program configured to carry out the aforementioned method is stored, and a computer (45) for executing said computer program. The image retriever (40) further comprises a first storage (St1), a third storage (St3), said cross-modal embedder (60) and a rater (80). Stored on said first storage (St1) are parameters (O) that characterize the behavior of the cross-modal embedder (60) and which have been trained, for example, with the training system (140) with the method described in connection with figure 1.

Stored on said third storage (St3) is said plurality of images (xI). The image retriever (40) is configured to input these into the embedder (60) to compute embeddings (rI) and pass them on to said rater (80). The image retriever (40) is further configured to receive query data (xQ), to input it into said embedder (60) and compute said corresponding embedding (rQ), which is also passed on to said rater (80). Said rater (80) is then configured to compute a result (r) which comprises those of said plurality of images (xI) that best match said query data (xQ), as illustrated by steps (2300) and (2400) of the method described in connection with figure 2.

Figure 6 illustrates an embodiment of an image selector (41), which in this embodiment is configured for selecting images from a stream against a provided query with the method illustrated in connection with figure 3. The image selector (41) comprises a computer memory (46) on which a computer program configured to carry out the aforementioned method is stored, and a computer (45) for executing said computer program. The image selector (41) further comprises said first storage (St1), a fourth storage (St4), said cross-modal embedder (60) and a selector (90). Stored on said first storage (St1) are parameters (O) that characterize the behavior of the cross-modal embedder (60) and which have been trained, for example, with the training system (140) with the method described in connection with figure 1.

Stored on said fourth storage (St4) is said query data (xQ). The image selector (41) is configured to input it into the embedder (60) to compute a corresponding embedding (rQ) and pass it on to said selector (90). The image selector (41) is further configured to receive image data (xI), to input it into said embedder (60) and compute said corresponding embedding (rI), which is also passed on to said selector (90). Said selector (90) is then configured to decide whether or not to output said image data (xI), as illustrated by steps (3400), (3500) and (3600) of the method described in connection with figure 3.

Figure 7 illustrates an embodiment of a practical application of the image selector (41). Shown in figure 7 is a robotic device, especially a vehicle (100), which comprises a visual sensor (30), or camera, which senses an environment of said vehicle (100) to output corresponding image data (xI). Note that said sensor (30) may be a video sensor, a radar sensor, an ultrasound sensor, a Lidar sensor or a thermal camera. If outputted by said image selector (41), said image data (xI) is passed on to a transmitter (10), e.g. a mobile transmitter, and transmitted to a remote server (50) which may be distant from said vehicle (100).

A system comprising said visual sensor (30), said image selector (41), and said transmitter (10) is called a camera-control system (101). Note that such a camera-control system (101) may also be put to use elsewhere, not necessarily in a vehicle (100), but in any device that requires sensing images, selecting some of them and passing on only those selected, such as a robotic device.