IDENTIFYING AT LEAST ONE OBJECT WITHIN AN IMAGE

Title:

IDENTIFYING AT LEAST ONE OBJECT WITHIN AN IMAGE

Document Type and Number:

WIPO Patent Application WO/2020/234602

Abstract:

There is provided a computer-implemented method of identifying at least one object within an image, comprising the steps of: receiving a dataset of images, each image comprising a plurality of pixels and relating to at least one object; determining a sub-set of pixels within the dataset, wherein the sub-set of pixels provide a sparse representation of the dataset; training an artificial neural network based on the sub-set of pixels; receiving an image comprising a plurality of pixels, the image relating to at least one object; and using the artificial neural network, identifying the at least one object in the image.

Inventors:

CHHABRA PUNEET (GB)
JELODARI MAHDI (GB)

Application Number:

PCT/GB2020/051247

Publication Date:

November 26, 2020

Filing Date:

May 21, 2020

Assignee:

HEADLIGHT AI LTD (GB)

International Classes:

G06K9/62

Foreign References:

CN106096656A

2016-11-09

Other References:

VANIKA SINGHAL ET AL: "How to Train Your Deep Neural Network with Dictionary Learning", 22 December 2016 (2016-12-22), XP055729386, Retrieved from the Internet [retrieved on 20200910]
MATTHIEU COURBARIAUX ET AL: "Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1", ARXIV.ORG, 17 March 2016 (2016-03-17), pages 1 - 11, XP055405835, Retrieved from the Internet
JOSEPH REDMON ET AL: "You Only Look Once: Unified, Real-Time Object Detection", 9 May 2016 (2016-05-09), pages 1 - 10, XP055556774, Retrieved from the Internet [retrieved on 20190214], DOI: 10.1109/CVPR.2016.91

Attorney, Agent or Firm:

MATHYS & SQUIRE et al. (GB)

Claims:

1. A computer-implemented method of identifying at least one object within an image, comprising the steps of:

receiving a dataset of images, each image comprising a plurality of pixels and relating to at least one object;

determining a sub-set of pixels within the dataset, wherein the sub-set of pixels provide a sparse representation of the dataset;

training an artificial neural network based on the sub-set of pixels;

receiving an image comprising a plurality of pixels, the image relating to at least one object; and

using the artificial neural network, identifying the at least one object in the image.

2. A method according to Claim 1, further comprising the step of collecting weights from the trained artificial neural network.

3. A method according to Claim 1 or 2, comprising the steps of determining whether a weight is either greater or less than the mean of all weights in the network; and representing a weight in binary terms accordingly.

4. A method according to Claim 3, wherein using the artificial neural network comprises applying the binarised weights to the artificial neural network.

5. A method according to Claim 3 or 4, further comprising the step of saving the binarised weights to a memory.

6. A method according to any preceding claim, further comprising the step of training the artificial neural network based on a further sub-set of pixels providing a different sparse representation of the dataset, wherein the sparsity of the further sub-set of pixels is different from that of the sub-set of pixels.

7. A method according to any preceding claim, wherein the size of the sub-set of pixels is no more than one-half, preferably no more than one-fifth, and more preferably no more than one-tenth the size of the dataset.

8. A method according to any preceding claim, further comprising the step of determining the size of the sub-set of pixels based on a required accuracy of identification.

9. A method according to any preceding claim, wherein the sub-set of pixels provide a sparse representation of a non-linear transformation of the dataset.

10. A method according to Claim 9, wherein the sub-set of pixels comprises a dictionary and/or sparse codes, preferably a product of the dictionary and sparse codes; wherein the dictionary is a discriminative orthogonal dictionary.

11. A method according to any preceding claim, further comprising, prior to training the artificial neural network, determining parameters of the artificial neural network based on the sub-set of pixels.

12. A method according to any preceding claim, wherein identifying the at least one object comprises obtaining a sparse representation of the image, preferably of a non-linear transformation of the image.

13. A method according to Claim 12, wherein the sparsity of the sparse representation of the image is different to the sparsity of the sub-set of pixels.

14. A method according to any preceding claim, wherein identifying the at least one object comprises detecting and localising the at least one object.

15. A method according to any preceding claim, wherein identifying the at least one object comprises classifying the at least one object, optionally by comparison to a plurality of pre-determined labels.

16. A method according to any preceding claim, wherein identifying the at least one object comprises tracking the at least one object, optionally by associating the at least one object with at least one bounding box.

17. A method according to any preceding claim, further comprising the step of moving a powered device based on the identified at least one object.

18. A method according to any preceding claim, wherein prior to determining a sub-set of pixels within the dataset, the dataset is transformed to mimic conditions of poor visibility.

19. Apparatus for identifying at least one object within an image, comprising:

means for receiving an image comprising a plurality of pixels, the image relating to at least one object; and

means for processing the received image, the means for processing comprising an artificial neural network configured to identify the at least one object in the image; the artificial neural network being trained based on a determined sub-set of pixels within a dataset of images, each image comprising a plurality of pixels and relating to at least one object, wherein the sub-set of pixels provide a sparse representation of the dataset.

20. Apparatus according to Claim 19, further comprising a data store for storing binarised weights for the artificial neural network.

21. Apparatus according to Claim 19 or 20, wherein the means for processing comprises a field programmable gate array (FPGA).

22. A system for identifying at least one object within an image, comprising:

a server, comprising:

means for receiving a dataset of images, each image comprising a plurality of pixels and relating to at least one object;

means for determining a sub-set of pixels within the dataset, wherein the sub-set of pixels provide a sparse representation of the dataset;

means for training an artificial neural network based on the sub-set of pixels; and

a mobile device, comprising:

means for receiving an image comprising a plurality of pixels, the image relating to at least one object; and

the artificial neural network, being configured to identify the at least one object in the image.

Description:

Identifying at least one object within an image

Field of disclosure

The present invention relates to a computer-implemented image processing method, in particular a method of identifying at least one object within an image. The invention extends to a corresponding apparatus and system.

Background

In many situations, it is necessary to identify features within data, such as detecting a certain object within an image. Problematically, present systems often struggle to quickly, correctly, and reliably identify such features within data. Furthermore, present systems often struggle to identify features reliably when the sensing rate (i.e. the number of pixels measured against the number of actual pixels needed to make a good prediction) acquired by sensors such as cameras is low (such as below 10%), which can be the case when data is corrupt, when environmental conditions are poor (e.g. during fog), when sensors are occluded, or when a low resolution sensor is used.

Summary of Invention

According to at least one aspect of the present disclosure, there is described herein a computer-implemented method of identifying at least one object within an image, comprising the steps of: receiving a dataset of images, each image comprising a plurality of pixels and optionally relating to at least one object; determining a sub-set of pixels within the dataset, wherein the sub-set of pixels provide a sparse representation of the dataset (i.e. obtaining a sparse representation of the images (in the dataset)); training an (artificial) neural network based on the sub-set of pixels (i.e. the sparse representation); receiving an image comprising a plurality of pixels, the image relating to at least one object; and using the artificial neural network, identifying the at least one object in the image.

Obtaining a sparse representation of the image may comprise determining (learning) a (preferably non-linear) dictionary and sparse codes (or “sparse coding”, or “representation”) which provide a sparse representation of the image. The artificial neural network may be trained based on the dictionary, on the sparse codes, or on both (e.g. a product of the sparse codes and the dictionary). In other words, the sub-set of pixels may comprise the (learned) dictionary and/or sparse codes (e.g. a product of the dictionary and sparse codes).

In another aspect, there is provided a computer-implemented method of creating neural networks for object detection and classification (or training an artificial neural network for use in identifying at least one object within an image), comprising the steps of: receiving a dataset of images, each image comprising a plurality of pixels and relating to at least one object; determining a sub-set of pixels within the dataset, wherein the sub-set of pixels provide a sparse representation of the dataset (i.e. obtaining a sparse representation of the images (in the dataset)); and training an artificial neural network based on the sub-set of pixels.

In another aspect, there is provided a computer-implemented method of identifying at least one object within an image, comprising the steps of: receiving an image comprising a plurality of pixels, the image relating to at least one object; and using an artificial neural network, identifying the at least one object in the image; wherein the artificial neural network is trained based on a determined sub-set of pixels within a dataset of images, each image comprising a plurality of pixels and relating to at least one object, wherein the sub-set of pixels provide a sparse representation of the dataset.

Preferably, the method further comprises the step of collecting weights (corresponding to the connections between neurons) from the trained artificial neural network. The method may further comprise the steps of determining whether a weight (or each weight in the network) is either greater or less than the mean of all weights in the network; and representing a (or each) weight in binary terms accordingly (i.e. binarizing the neural network / neural network weights). Using the artificial neural network may comprise applying the binarised weights to the artificial neural network. The binarised weights may be saved to a memory, optionally the memory of a memory constrained device such as a mobile device or robot.
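The mean-threshold binarisation described above can be sketched as follows. This is a minimal illustration only, assuming NumPy. The disclosure states that weights are represented "in binary terms" according to whether they lie above or below the network-wide mean; the +1/-1 encoding, the handling of weights exactly equal to the mean, the per-layer list layout, and the name `binarise_weights` are assumptions made for the example.

```python
import numpy as np

def binarise_weights(layers):
    """Binarise weights against the mean of all weights in the network.

    `layers` is a list of NumPy arrays, one per layer (hypothetical layout).
    Returns int8 arrays with values in {-1, +1} and the same shapes.
    """
    # The mean is taken over every weight in the network, not per layer.
    all_weights = np.concatenate([w.ravel() for w in layers])
    mean = all_weights.mean()
    # Assumption: weights strictly above the mean become +1, the rest -1.
    return [np.where(w > mean, 1, -1).astype(np.int8) for w in layers]

# Toy weights; the network-wide mean here is 1.4 / 6, roughly 0.233.
layers = [np.array([[0.5, -1.0], [2.0, 0.1]]), np.array([0.4, -0.6])]
binary = binarise_weights(layers)
```

Storing the result as `int8` uses one byte per weight; on a memory-constrained device the values could plausibly be packed further to one bit per weight, for example by applying `np.packbits` to a boolean mask of the positive entries.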

Optionally, the method may further comprise the step of training the artificial neural network based on a further sub-set of pixels (e.g. dictionary and/or sparse codes) providing a different sparse representation of the dataset, wherein the sparsity of the further sub-set of pixels is different from that of the sub-set of pixels.

The size of the sub-set of pixels may be no more than one-half, preferably no more than one-fifth, and more preferably no more than one-tenth the size of the dataset. Optionally, the method may further comprise the step of determining the size of the sub-set of pixels (in other words, the (degree of) sparsity of the sparse codes and/or the dimensions of the dictionary) based on a required accuracy of identification. Accordingly, the method may allow balancing the trade-off between sparsity and final model accuracy depending e.g. on the specific model application. For example, a high degree of sparsity (and thus reduced power and/or memory usage) may be prioritised for a wildlife tracking camera system at the cost of reduced model accuracy, while model accuracy may be prioritised over sparsity for autonomous robots. The sub-set is preferably an optimal sub-set of pixels (for providing a sparse representation of the full set of data), which may be selected automatically or semi-automatically.
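One way the size of the sub-set might be chosen from a required accuracy is sketched below. This is a hedged illustration, not a procedure given in the disclosure: the helper `pick_subset_fraction` is hypothetical, and the `measured` table of sub-set sizes and accuracies contains made-up numbers that stand in for validation results obtained at each candidate size.

```python
def pick_subset_fraction(measured, required_accuracy):
    """Return the smallest sub-set fraction whose accuracy meets the target.

    `measured` maps a sub-set size (as a fraction of the dataset) to the
    accuracy observed for a model trained at that size. Returns None if no
    candidate meets the requirement.
    """
    candidates = [f for f, acc in measured.items() if acc >= required_accuracy]
    return min(candidates) if candidates else None

# Illustrative numbers only, not results from the disclosure.
measured = {0.5: 0.93, 0.2: 0.90, 0.1: 0.84}

wildlife_camera = pick_subset_fraction(measured, 0.80)   # favours sparsity
autonomous_robot = pick_subset_fraction(measured, 0.92)  # favours accuracy
```

With these placeholder figures the wildlife camera would use the one-tenth sub-set, while the robot would need the one-half sub-set to reach its accuracy requirement.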

Identifying the at least one object preferably comprises detecting and localising the at least one object. Identifying the at least one object preferably comprises classifying the at least one object, optionally by comparison to a plurality of pre-determined labels. Identifying the at least one object preferably comprises tracking the at least one object, optionally by associating the at least one object with at least one bounding box.

The method may further comprise the step of moving a powered device based on the identified at least one object.

Optionally, the sub-set of pixels provide a sparse representation of a non-linear transformation of the dataset. Optionally, the non-linear transformation is of a lower dimension than the dataset. Optionally, the non-linear transformation is obtained by applying a non-linear function (preferably a radial basis function) to each pixel in the dataset.

Optionally, the sub-set of pixels comprises a dictionary and/or sparse codes, preferably a product of the dictionary and sparse codes; wherein the dictionary is a discriminative orthogonal dictionary.

Optionally, the dictionary is one or more of: orthogonal, discriminative, non-linear, incoherent, kernelized, and/or complete.

Optionally, the dictionary is a single dictionary (i.e. a single dictionary is determined for the entire dataset (of images)).

Optionally, the dictionary is learned using a kernel dictionary learning method. As part of the learning method, a kernel matrix may be determined, wherein the kernel matrix is preferably approximated using a low-rank approximation method, preferably using the Krylov and/or Nystrom method.
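The low-rank approximation of the kernel matrix can be illustrated with a basic Nystrom sketch, assuming NumPy. The disclosure names the Nystrom method but gives no details, so the choice of a radial basis function kernel, the uniform random sampling of landmark rows, the `gamma` value, and the function names `rbf_kernel` and `nystrom` are all assumptions made for the example.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) between rows of a and b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom(x, m, gamma, seed=0):
    """Approximate the n x n kernel matrix of x from m landmark rows.

    Returns (C, W_pinv) such that K is approximately C @ W_pinv @ C.T,
    so the full n x n matrix never has to be formed or stored.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x), size=m, replace=False)  # uniform landmark sampling
    c = rbf_kernel(x, x[idx], gamma)                 # n x m: all rows vs landmarks
    w = c[idx]                                       # m x m: landmarks vs landmarks
    return c, np.linalg.pinv(w)

# 200 synthetic 16-dimensional samples standing in for image features.
x = np.random.default_rng(1).normal(size=(200, 16))
c, w_pinv = nystrom(x, m=50, gamma=0.05)

# The full matrices are formed here only to measure the approximation error.
k_approx = c @ w_pinv @ c.T
k_exact = rbf_kernel(x, x, gamma=0.05)
rel_error = np.linalg.norm(k_exact - k_approx) / np.linalg.norm(k_exact)
```

Only the n x m and m x m factors need to be kept, which is where the memory saving over the exact n x n kernel matrix comes from.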

Optionally, the method comprises selecting and/or optimising neural network parameters, preferably based on the dataset of images and/or the sub-set of pixels and/or (process) requirements (e.g. required training time).

Optionally, prior to training the artificial neural network, the method comprises determining parameters of the artificial neural network based on the sub-set of pixels (i.e. based on the sparse representation of the dataset). Optionally, the method comprises selecting the input to the artificial neural network (based on which the network is trained) among: the dictionary, the sparse codes, and a product thereof. Optionally, identifying the at least one object comprises obtaining a sparse representation of the image, preferably of a non-linear transformation of the image.

Optionally, the sparsity of the sparse representation of the image is different to the sparsity of the sub-set of pixels (sparse representation of the dataset). In other words, a different sparse representation may be used for training than that used for inference. Accordingly, different requirements may be prioritised at training (e.g. reduced training time) than at inference (increased detection / classification accuracy).

Optionally, the image and/or images may be visible-spectrum images. The present disclosure is particularly applicable to visible-spectrum images as such images are often of poor quality (e.g. segments of the image being blurred or not visible) for classification/detection purposes as visible spectrum images may be adversely affected by ‘conditions of poor visibility’ (e.g. rain, fog, or haze). Such problems do not typically apply to images in other wavelengths/frequencies such as radar images (e.g. Synthetic Aperture Radar (SAR)).

Preferably, the sub-set of pixels provide a sparse representation of the entire dataset. In other words, preferably, no pre-filtering of the image is implemented (e.g. a candidate area where the object to be detected may be is not selected). This may improve the versatility and reduce the complexity (and thereby computational and/or memory cost) of the method.

Optionally, the method further comprises, using the artificial neural network, classifying the image (in other words, assigning one or more classes to the image).

Optionally, the sub-set of pixels is a sparse representation of the dataset.

Optionally, prior to determining a sub-set of pixels within the dataset, the dataset is transformed to mimic conditions of poor visibility.
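A transformation that mimics poor visibility might look like the sketch below, assuming NumPy. The disclosure does not say how the dataset is transformed; the example substitutes a simple uniform-haze blend, hazy = image * t + airlight * (1 - t), and the function name `add_uniform_haze` and the default `transmission` and `airlight` values are assumptions.

```python
import numpy as np

def add_uniform_haze(images, transmission=0.4, airlight=0.9):
    """Blend images toward a constant airlight value.

    `images` holds floats in [0, 1]. A single transmission value is used
    for every pixel, i.e. no depth map is modelled (a simplification).
    """
    images = np.asarray(images, dtype=np.float64)
    hazy = images * transmission + airlight * (1.0 - transmission)
    return np.clip(hazy, 0.0, 1.0)

# A tiny 2x2 "image" with full contrast, for illustration.
clear = np.array([[0.0, 0.5], [1.0, 0.25]])
hazy = add_uniform_haze(clear)
```

The blend compresses the intensity range toward the airlight value, which lowers contrast in the way fog or haze does in a visible-spectrum image.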

Optionally, the method further comprises mapping the dictionary into the domain of the dataset (i.e. into the input domain).

In another aspect, there is provided apparatus for identifying at least one object within an image, comprising: means for receiving a dataset of images, each image comprising a plurality of pixels and relating to at least one object; means for determining a sub-set of pixels within the dataset, wherein the sub-set of pixels provide a sparse representation of the dataset (i.e. means for obtaining a sparse representation of the images (in the dataset)); means for training an artificial neural network based on the sub-set of pixels; means for receiving an image comprising a plurality of pixels, the image relating to at least one object; and the artificial neural network for identifying the at least one object in the image.

In another aspect, there is provided apparatus for creating neural networks for object detection and classification (or training an artificial neural network for use in identifying at least one object within an image), comprising: means for receiving a dataset of images, each image comprising a plurality of pixels and relating to at least one object; means for determining a sub-set of pixels within the dataset, wherein the sub-set of pixels provide a sparse representation of the dataset (i.e. means for obtaining a sparse representation of the images (in the dataset)); and means for training an artificial neural network based on the sub-set of pixels.

In another aspect, there is provided apparatus for identifying at least one object within an image, comprising: means for receiving an image comprising a plurality of pixels, the image relating to at least one object; and means for processing the received image, the means for processing comprising an artificial neural network configured to identify the at least one object in the image; the artificial neural network being trained based on a determined sub-set of pixels within a dataset of images, each image comprising a plurality of pixels and relating to at least one object, wherein the sub-set of pixels provide a sparse representation of the dataset.

Optionally, the apparatus further comprises a data store for storing binarised weights for the artificial neural network, optionally wherein said binarised weights are obtained from the training of the artificial neural network.

Optionally, the means for processing comprises a field programmable gate array (FPGA).

Optionally, the means for processing comprises a cloud service (offering) made up of a number of separate servers (computers), such as Amazon® Web Services. The means for receiving an image may transmit the received image to the means for processing, and the means for processing may transmit data related to the at least one identified object in the image (e.g. object identifier(s)) to the means for receiving. In this way, inference may be performed remotely (from the means for receiving) on the means for processing, and the results (data related to identified object(s)) fed back to the means for receiving (such as a robot), thereby reducing the computational and/or memory requirements on the means for receiving.

Optionally, the apparatus further comprises means for applying the binarised weights to the artificial neural network.

In another aspect, there is provided a system for identifying at least one object within an image, comprising: a server, comprising: means for receiving a dataset of images, each image comprising a plurality of pixels and relating to at least one object; means for determining a sub-set of pixels within the dataset, wherein the sub-set of pixels provide a sparse representation of the dataset (i.e. means for obtaining a sparse representation of the images (in the dataset)); means for training an artificial neural network based on the sub-set of pixels; and a mobile device (such as a robot or drone), comprising: means for receiving an image comprising a plurality of pixels, the image relating to at least one object; and the artificial neural network, being configured to identify the at least one object in the image.

Optionally, the mobile device comprises means for transmitting the image to the server; and the server, rather than the mobile device, comprises the artificial neural network, being configured to identify the at least one object in the image, and means for transmitting data relating to the at least one object to the mobile device.

In overview, the invention may provide a method and system using a binary neural network that learns in the compressive domain for edge devices (i.e. devices having relatively low memory or processing power for machine learning applications). Binary weights are generated based on the model learnt in the compressed domain - in other words, the “binarisation” and “compressed domain” aspects interact so as to lead to further advantageous improvements.

Use of the words “apparatus”, “server”, “device”, “processor”, “communication interface” and so on is intended to be general rather than specific. Whilst these features of the disclosure may be implemented using an individual component, such as a computer or a central processing unit (CPU), they can equally well be implemented using other suitable components or a combination of components. For example, they could be implemented using a hard-wired circuit or circuits, e.g. an integrated circuit, using embedded software, and/or software module(s) including a function, API interface, or SDK. Further, they may be more than just a singular component. For example, a server may include not only a single hardware device but also a system of microservices or a serverless architecture, either of which is configured to operate in the same or a similar way as the singular server described.

The invention extends to methods, system and apparatus substantially as herein described and/or as illustrated with reference to the accompanying figures.

The invention also provides a computer program or a computer program product for carrying out any of the methods described herein, and/or for embodying any of the apparatus features described herein, and a computer readable medium having stored thereon a program for carrying out any of the methods described herein and/or for embodying any of the apparatus features described herein. The invention also provides a signal embodying a computer program or a computer program product for carrying out any of the methods described herein, and/or for embodying any of the apparatus features described herein, a method of transmitting such a signal, and a computer product having an operating system which supports a computer program for carrying out the methods described herein and/or for embodying any of the apparatus features described herein.

Any feature in one aspect of the invention may be applied to other aspects of the invention, in any appropriate combination. In particular, method aspects may be applied to apparatus aspects, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure, such as a suitably programmed processor and associated memory.

Furthermore, features implemented in hardware may generally be implemented in software, and vice versa. Any reference to software and hardware features herein should be construed accordingly.

As used herein, the term ‘object’ preferably connotes an object that is intended to be detected by the system and/or is a target for the system; more preferably an object that is relevant for navigation and/or mapping (in particular for a vehicle or device including the system). As used herein, all references to the term ‘object’ in the singular sense should be understood to additionally refer to ‘objects’ in a plural sense, and vice versa.

As used herein, the term “weight” is preferably synonymous with the term “bias”.

The disclosure will now be described by way of example, with references to the accompanying drawings in which:

Figure 1 shows an image including a car;

Figure 2 shows an exemplary computer apparatus on which the methods described herein may be implemented;

Figure 3 is a flowchart for a method of identifying an object;

Figure 4 is a flowchart for a method of training an artificial neural network for use in identifying an object;

Figure 5 is a graph showing datapoints alongside basis elements;

Figure 6 is a representation of a deep learning model;

Figure 7 shows a method of converting weights and models to a binary representation;

Figure 8 shows examples of images in a 32x32 resolution;

Figure 9 shows reduced representations of the images of Figure 8;

Figure 10 shows a normal and a reduced representation of an image;

Figure 11 shows reduced representations of images;

Figures 12 and 13 show classification using reduced representations of images;

Figure 14 shows a conventional image classification method; and

Figure 15 shows an example implementation of the method of Figure 4.

Detailed description

Referring to Figure 1, there is shown an image including a car. In many applications, it is desirable to identify objects (such as cars) in image data, in particular where this identification involves locating the object within the image and classifying the object. This is particularly desirable in applications such as vehicle navigation.

In vehicle navigation, decisions often need to be taken based on limited input data using a constrained hardware platform in a short period of time (often under a second). Computationally heavy approaches for object identification are thus often not ideal for such applications, in particular in situations where input data is particularly limited (e.g. conditions of low visibility).

Computing device

Referring to Figure 2, there is shown a computer device 2000 suitable for executing the computer-implemented object identification methods described herein. The computer device 2000 comprises a processor in the form of a CPU 2002, a communication interface 2004, a memory 2006, storage 2008 and a user interface 2012 coupled to one another by a bus 2014. The user interface 2012 comprises a display 2016 and an input/output device, which in this embodiment is a keyboard 2018 and a mouse 2020.

The CPU 2002 is a computer processor, e.g. a microprocessor. It is arranged to execute instructions in the form of computer executable code, including instructions stored in the memory 2006 and the storage 2008. The instructions executed by the CPU 2002 include instructions for coordinating operation of the other components of the computer device 2000, such as instructions for controlling the communication interface 2004 as well as other features of the computer device 2000 such as the user interface 2012.

The memory 2006 is implemented as one or more memory units providing Random Access Memory (RAM) for the computer device 2000. In the illustrated embodiment, the memory 2006 is a volatile memory, for example in the form of an on-chip RAM integrated with the CPU 2002 using System-on-Chip (SoC) architecture. However, in other embodiments, the memory 2006 is separate from the CPU 2002. The memory 2006 is arranged to store the instructions processed by the CPU 2002, in the form of computer executable code. Typically, only selected elements of the computer executable code are stored by the memory 2006 at any one time, which selected elements define the instructions essential to the operations of the computer device 2000 being carried out at the particular time. In other words, the computer executable code is stored transiently in the memory 2006 whilst some particular process is handled by the CPU 2002.

The storage 2008 is provided integral to and/or removable from the computer device 2000, in the form of a non-volatile memory. The storage 2008 is in most embodiments embedded on the same chip as the CPU 2002 and the memory 2006, using SoC architecture, e.g. by being implemented as a Multiple-Time Programmable (MTP) array. However, in other embodiments, the storage 2008 is an embedded or external flash memory, or suchlike. The storage 2008 stores computer executable code defining the instructions processed by the CPU 2002. The storage 2008 stores the computer executable code permanently or semi-permanently, e.g. until overwritten. That is, the computer executable code is stored in the storage 2008 non-transiently. Typically, the computer executable code stored by the storage 2008 relates to instructions fundamental to the operation of the CPU 2002.

The communication interface 2004 is configured to support short-range wireless communication, in particular Bluetooth® and WiFi communication, long-range wireless communication, in particular cellular communication, and/or an Ethernet network adaptor. In particular, the communication interface 2004 is configured to establish communication connections with other computing devices and/or a server. The server may be used to store data and to perform certain processing, in particular more computationally complex processing. In general the server may act to train an artificial neural network, which may then be communicated to a computer device 2000 such that “on-the-fly” calculations can be performed.

The storage 2008 provides mass storage for the computer device 2000. In different implementations, the storage 2008 is an integral storage device in the form of a hard disk device, a flash memory or some other similar solid state memory device, or an array of such devices. In some embodiments, there is provided removable storage, which provides auxiliary storage for the computer device 2000. In different implementations, the removable storage is a storage medium for a removable storage device, such as an optical disk, for example a Digital Versatile Disk (DVD), a portable flash drive or some other similar portable solid state memory device, or an array of such devices. In other embodiments, the removable storage is remote from the computer device 2000, and comprises a network storage device or a cloud-based storage device.

A computer program product is provided that includes instructions for carrying out aspects of the method(s) described below. The computer program product is stored, at different stages, in any one of the memory 2006, the storage 2008 and the removable storage. The storage of the computer program product is non-transitory, except when instructions included in the computer program product are being executed by the CPU 2002, in which case the instructions are sometimes stored temporarily in the CPU 2002 or memory 2006. It should also be noted that the removable storage is removable from the computer device 2000, such that the computer program product is held separately from the computer device 2000 from time to time.

Dataset classification

Referring to Figure 3, there is shown a method of identifying at least one object within an image by use of an artificial neural network. In a first step 302, an image is received, for example via a camera. In a second step 302, the image is fed into an artificial neural network, which processes the image so as to identify (and optionally classify) objects in the image data.

Traditional artificial neural networks are an interconnected group of nodes (neurons). The connections between these neurons are modelled as weights and all inputs to the network are modified based on the weights and summed to produce an outcome. Modern neural networks (shallow, recurring or deep, depending on the number of layers), building on the above principle, are being used for predictive modelling and decision making within the fields of robotics, machine intelligence and computer vision. They follow a standard pipeline: given a large set of training input data and its corresponding output labels, the network learns or finds a mathematical way to turn the input data into the output. These relationships can either be linear or nonlinear and, when a new test image is passed through the network, an output label, which in the context of machine learning may correspond to an image type (e.g. cat, dog, vehicle, cyclist, etc.), is generated. Such artificial neural networks can be computationally complex.

Referring to Figure 4, a flowchart for a method of training a neural network for use in the method of the present invention is shown.

In a first step 402, a dataset of images is received for use in training. Several common datasets for the purposes of training machine learning models are known. Each image generally relates to at least one object (and often a plurality of objects of different categories), and is formed of a plurality of pixels.

In a second step 404, a sub-set of pixels within the dataset is determined, where the sub-set of pixels provide a sparse representation of the dataset. The pixels within the sub-set may be referred to as “basis elements” or “dictionary atoms” for the dataset. Each image within the dataset can be represented by K number of learned atoms, where K represents the degree of sparsity. Although the sub-set is referred to as consisting of “pixels”, it will be appreciated that the actual datapoints may relate to pixel values of the image, or may relate to values representing pixels within an image, such as line data. The sub-set preferably relates to optimal representations of the images within the dataset or the dataset as a whole, which may be combined to represent images within the dataset or the dataset as a whole. Since the subset represents the full input dataset, it is possible to use the subset as input data for an artificial neural network in place of the full dataset.

In a third step 406, a neural network is trained based on the basis elements. Training may comprise supervised or unsupervised training. Typically, this may involve the neural network classifying a feature of the subset and this classification then being identified as being correct or incorrect. Typically, training comprises the use of data corresponding to, for example, 10,000 or 100,000 images, to form the neural network. Thereafter, the neural network is able to classify a feature of an input image. Training a neural network based on a sparse representation of an input dataset reduces the size and complexity of training and subsequent processing significantly.

In a fourth step 408, the weights (and biases) of the trained model are collected. The weights are the values allocated to the connections between “neurons” (i.e. datapoints) of the neural network, where the value of the weight indicates the degree of association between neurons.

In a fifth step 410, the weights are “binarised”: individual weights are compared to the mean of all weights in the network and, if the value of a weight is greater than the mean, the value of the weight is set as binary “1”. Conversely, if the value of the weight is less than the mean, the value of the weight is set as binary “0”. In other words, binary weights are generated based on the model learnt in the compressed domain. This step considerably reduces the complexity of the weights associated with the neural network. Alternatively, the weights may be quantized to a larger set of possible values than “0” and “1”. For example, the weights may be quantized to three values, where the value of a weight is set as 0, 1, or 2, depending on the percentile of the weight’s value (e.g. the value of the weight is set to 0 for values below the 33rd percentile, 1 for values between the 33rd and 66th percentiles, and 2 for values above the 66th percentile).
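The binarisation and three-level quantisation of step 410 can be illustrated with a minimal NumPy sketch. The function names and the example values are assumptions made for illustration only, and all weights are treated here as a single flat array; how weights exactly equal to the mean are handled is not stated above, so the sketch maps them to 0.

```python
import numpy as np


def binarise_weights(weights):
    """Set each weight to 1 if it exceeds the mean of all weights, else 0."""
    weights = np.asarray(weights, dtype=float)
    # Assumption: a weight exactly equal to the mean is mapped to 0.
    return (weights > weights.mean()).astype(np.uint8)


def quantise_by_percentile(weights, cut_points=(33, 66)):
    """Map each weight to 0, 1, 2, ... according to its percentile band."""
    weights = np.asarray(weights, dtype=float)
    thresholds = np.percentile(weights, cut_points)
    # np.digitize returns the index of the band each value falls into.
    return np.digitize(weights, thresholds).astype(np.uint8)


# Hypothetical weights of a small trained model, flattened into one array.
example = np.array([-0.8, -0.1, 0.05, 0.3, 0.9, 1.2])
binary = binarise_weights(example)
ternary = quantise_by_percentile(example)
print(binary, ternary)
```

Storing the result as 8-bit integers (or packing the binary case into bits) is what reduces the memory needed on the target device.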

In a sixth step 412, the binarised weights are transferred to an on-chip memory of a device such as a robot vehicle (or other mobile device). The test (input) database may be transformed into the compressed domain using a projection and the learned dictionaries. In practical implementation, the saved weights may be fed into a deep learning model (optionally on-device) to perform predictions.

The described training process may be repeated by varying the degree of sparsity K to improve results.

In general terms, the described method may reduce the data footprint needed (to produce neural network-based models) by a significant percentage (e.g. 50% in some cases). The described models may also have a major impact on learning/training time and hardware resource allocation, e.g. memory requirements. This may enable deployment on memory-constrained and battery-budgeted edge devices (e.g. mobile phones and FPGAs), and/or on cloud services. These methods may further enable robots to see (i.e. perform detection, classification and localisation) in hazy, rainy or foggy environments (which may be described generically as “conditions of poor visibility”). By sensing only 10% of the input data, taken from a camera, classification accuracy of around 85% and 70% is reported on benchmark datasets. Although the robot may “see” less (in particular in conditions of poor visibility), the described methods allow it to infer more about its surroundings.

The described method may thereby improve object detection based on relatively poor sensed data, either as a result of conditions or as a result of the sensor itself. For example, a low resolution camera may be used. In an example, a 10 megapixel image may be reconstructed using a single pixel image sensor (e.g. a charge coupled device or CCD), as opposed to the use of a CCD with 2592 x 3872 pixels (i.e. 10MP).

The described training aspects may be performed at a server, whereas the trained model (and binarised weights) may be provided for implementation on a local device such as a robot or other powered machine.

The system will now be described in further detail. Referring to Figure 5, input data and basis elements (e.g. atoms which can be combined linearly or non-linearly to reconstruct every single input datum) determined therefrom are shown. For the sake of simplicity and visualization, in Figure 5 each input datum is represented as a dot. In practice, these dots may represent an image. It can be seen that in the example of Figure 5 there are three distinctive clusters, each representing a class/image type, e.g. a cat, dog or vehicle.

Traditional deep learning builds/learns neural network (NN) models on the entire dataset. In contrast, the method described herein involves finding an optimal representation, which inherently is of much lower dimension than the original input data. For example, for an image of size 40000 pixels, the method may use 10% of the pixels (dictionary atoms) to learn a NN model and still maintain higher detection accuracy.

Referring to Figure 6, a neural network for classifying input data is typically trained using a deep learning model, for example a ‘you only look once’ (YOLO) model. As mentioned, once training has occurred, the weights and biases of the trained model are collected.

Binarisation and testing

Referring to Figure 7, the weights and biases of the trained neural network are binarised and transferred to an on-chip memory.

Classification using the described methods shows a significant training time improvement over traditional models. Using the compressed sensing approach, dictionaries (D) and corresponding coefficients (Gamma) are learned from the training dataset. It is possible to be selective about the number of learning atoms and the sparsity per pixel in every image in the dataset. The size of D and the degree of sparsity affect the performance of the model. For instance, the sparser the representations, the faster the training in the learning stage of the pipeline. However, this may affect the model’s accuracy. Depending on the required degree of accuracy, the learning parameters may be adjusted.

Figure 8 shows two examples from the Cifar-10 test dataset for a Concorde aircraft and a Jeep vehicle, respectively, in 32x32 default resolution. Figure 9 shows the constructed 100 atom (with r = 10) of the same image used for inference. The entire test dataset has been transformed to the compressed domain - that is a reduced set of basis elements has been determined - and used for inference. The trained model on 50,000 reconstructed images yields 91% accuracy when applied on the transformed dataset of Figure 8.

Larger images

Learning dictionaries on a small dataset can be unhealthy as shown here (Gamma * D): This experiment is conducted to show the impact of learning small dictionaries on small datasets (dic_size = 5 and sparsity = 2). In some embodiments, the Yolo-v2 model (one of the world’s object recognition frameworks with a 33 layer CNN) is used to train on the car dataset. The pipeline is as described above. The image sets are in the CS (compressive sensing, i.e. training based on a sub-set of data as described above) domain so the amount of data per pixel is significantly reduced (1, 3, 4). The training phase (2) is supposed to show significant improvement due to the reduced number of features required to be learnt.

Figures 12a and 13a show detections in the regular domain using a neural network trained in the compressed sensed domain. This method has faster training time, lower power consumption, and lower data footprint.

Figures 12b and 13b show detections in the compressive sensing domain using a network which is trained in the regular domain. Detections have very low confidence (2-3%). This method may lead to a fully trained model failing to confidently recognise the objects.

Figures 12c and 13c show detections in the compressive sensing domain using a network also trained in the compressive sensing domain. This method has faster training time, lower power consumption, lower data footprint, and ‘visibility’ in poor conditions.

Table 1 below shows classification accuracy of the described method on MNIST, Cifar-10 and Yolov2/car datasets.

Table 1

Further details

The method of Figure 4 will now be described in further detail.

The dataset received at step 402 may be denoted by an (original) input data matrix Y (of dimensions determined by: number of images x number of pixels per image). Matrix Y is typically large in size, and may be prohibitively large for using as input to an artificial neural network that is to be fast to train and suitable for deployment on constrained devices.

Thus, to at least partially alleviate these problems, the method of Figure 4 may include:

• At step 404: determining (learning) a dictionary D from the input data Y, and determining the corresponding sparse codes (or sparse coding or representation) X.

• At step 406: training a neural network using Y_transformed = D * X as input (training) data (see further details below).

• At step 410: binarizing or quantizing the weights. This may be particularly beneficial for deployment of the method on ‘edge’ devices which are often constrained (in particular in terms of available memory).

The method of learning the dictionary D is preferably a non-linear method (as described in further detail below). Further, the dictionary is preferably generic (e.g. applicable to (appropriate for) a range of input images and/or classes of objects (e.g. vehicles, persons) within the input data Y), and/or the sparse codes are preferably highly sparse (a large proportion of the matrix elements being zero). Preferably, the dictionary atoms are orthogonal, more preferably the dictionary matrix D is orthogonal. Preferably, the dictionary is a discriminative orthogonal dictionary. Preferably, the dictionary is complete (i.e. the number of dictionary samples is equal to the input data (signal) dimension (e.g. number of pixels per image in the dataset); in other words, the dictionary matrix is a square matrix). Alternatively, the dictionary may be undercomplete (i.e. number of samples in dictionary is lower than the input data (signal) dimension) or overcomplete (i.e. number of samples in dictionary is greater than the input data (signal) dimension).

Training data for artificial neural network

The input training data for the artificial neural network may comprise the sparse codes X and/or the dictionary D and/or a combination of both (e.g. a product as denoted by Y_transformed).

If the input data comprises sparse codes (alone or in combination with the dictionary), particularly good results (e.g. high accuracy) may be obtained if the sparsity level is reduced (i.e. higher proportion of non-zero entries in X). However, this has an implication that when sparse codes are used in combination with the dictionary and D is multiplied with X (with more non-zero entries), more columns in D are selected than if the sparse codes had a higher sparsity level. This may be an issue if the dictionary D is of poor quality (e.g. poorly represents the input dataset) - however, it is acceptable if a good D is obtained.

Using input data comprising the (learned) dictionary D (preferably for which the number of entries is significantly lower than the original image matrix/dataset Y, and hence the training time significantly reduced) was found to result in improved performance over using the sparse codes X alone, but nonetheless worse than using a combination of D and X (e.g. as Y_transformed).

Determining the dictionary and sparse codes

Example pseudo-code for determining the dictionary and sparse codes (stages 1 and 2) is shown below. The pseudo-code corresponds to a non-linear method of determining a (non-linear) dictionary. Stage 3 of the pseudo-code corresponds to classification (inference).

In order to determine the dictionary and sparse codes, the input data is first transformed into a non-linear domain (the input data undergoes (is subject to) a non-linear transformation) - see stage 1 in pseudo-code below. Subsequently, the dictionary and sparse codes are determined for (learned on) the transformed input data (see stage 2 in pseudo-code below). Optionally, the input data is pre-processed (see further details below). Optionally, the non-linear transformation of the input data further comprises one or more (preferably low-rank) approximations.

In more detail, at stage 1, the input data (represented by Y_train for training, and Y_test for testing/inference) is transformed into a non-linear space and its dimensionality is reduced by approximating the kernel matrix. In other words, a non-linear function (e.g. a radial basis function) is applied to each pixel of the input (training) matrix Y_train, and the resulting matrix Y_s is multiplied by itself (see line 5) to arrive at matrix H. As matrix H is typically very large, it is preferably approximated (see line 6). The matrix H is preferably approximated using a low-rank (dimension) approximation method, such as the Nyström method, and/or the Krylov method, and/or the alternating projections algorithm. Next, the original data (Y_train and/or Y_test) is transformed into the non-linear domain to arrive at transformed data F_train / F_test.
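As an illustration of a low-rank kernel approximation of the kind mentioned for stage 1, the following is a minimal sketch of a standard Nyström feature map for a radial basis function kernel. It is not the pseudo-code of this description: the function names, the kernel width gamma, the random data and the choice of 40 uniformly sampled landmark rows are all assumptions made for the example.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2) for every pair of rows."""
    sq_dists = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def nystrom_features(data, landmarks, gamma, eps=1e-10):
    """Return features whose inner products approximate the full kernel matrix."""
    w = rbf_kernel(landmarks, landmarks, gamma)   # small (m x m) kernel block
    c = rbf_kernel(data, landmarks, gamma)        # (n x m) cross-kernel block
    eigvals, eigvecs = np.linalg.eigh(w)
    keep = eigvals > eps                          # drop numerically null directions
    w_inv_sqrt = eigvecs[:, keep] / np.sqrt(eigvals[keep])
    return c @ w_inv_sqrt                         # features @ features.T ~= kernel


rng = np.random.default_rng(0)
y_train = rng.normal(size=(200, 16))              # 200 hypothetical inputs, 16 values each
idx = rng.choice(200, size=40, replace=False)     # uniformly sampled landmark rows
landmarks = y_train[idx]
phi_train = nystrom_features(y_train, landmarks, gamma=0.05)
print(phi_train.shape)
```

Only the 40 x 40 landmark block is decomposed, rather than the full 200 x 200 kernel matrix, which is where the saving in computation comes from.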

Further details of the inner-workings of the function compute_kernel in the pseudo-code above (see lines 3 and 4) are provided below:

Vector quantization (function vector_quantisation in pseudo-code above - see line 2) is a k-means algorithm to cluster data. Given a large matrix (e.g. the kernel matrix in the compute_kernel explanation above, or matrix H in the pseudo-code above), the aim is to approximate it, as it is typically very large in size and singular value decomposition (SVD) may therefore be very time consuming. Accordingly, the matrix is approximated, e.g. using the Nyström method. A subset of the matrix is built either by uniformly sampling the original matrix or by vector quantization, which performs k-means clustering and uses the cluster means to approximate the non-linear input matrix (H) (see line 6 in pseudo-code above).
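The two ways of building the subset described above (uniform sampling and k-means cluster means) can be sketched as follows. This is a plain Lloyd's-algorithm sketch written for illustration, not the vector_quantisation function of the pseudo-code; the function names, iteration count and synthetic data are assumptions.

```python
import numpy as np


def uniform_landmarks(data, n_landmarks, seed=0):
    """Pick landmark rows by uniform sampling without replacement."""
    rng = np.random.default_rng(seed)
    return data[rng.choice(len(data), size=n_landmarks, replace=False)]


def kmeans_landmarks(data, n_clusters, n_iter=20, seed=0):
    """Run a few Lloyd iterations and return the cluster means as landmarks."""
    centres = uniform_landmarks(data, n_clusters, seed).copy()
    for _ in range(n_iter):
        # Squared distance from every sample to every centre: (n, n_clusters).
        dists = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = data[labels == k]
            if len(members):              # an empty cluster keeps its old centre
                centres[k] = members.mean(axis=0)
    return centres


rng = np.random.default_rng(1)
# Three synthetic clusters of 2-D points, standing in for real input data.
data = np.concatenate([rng.normal(loc=c, scale=0.3, size=(50, 2))
                       for c in ([0, 0], [5, 5], [0, 5])])
centres = kmeans_landmarks(data, n_clusters=3)
print(centres)
```

The returned cluster means can then be used as the landmark rows of a Nyström-style approximation in place of uniformly sampled rows.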

At stage 2, the dictionary D and sparse codes X are learned on (determined based on) the transformed data F_train (which is treated as the input dataset for the sparse dictionary learning problem). The stage 2 pseudo-code corresponds to a method for learning an orthogonal dictionary; using an orthogonal dictionary may significantly reduce training and/or inference time. The dictionary D and sparse codes X may be determined (the sparse dictionary learning problem may be solved) via a range of methods such as K-SVD, stochastic gradient descent, or the Lagrange dual method. In order to arrive at a highly sparse X, the sparse dictionary learning problem may be formulated such that sparsity of X is assigned a high importance (i.e. the trade-off between arriving at D and X such that DX ≈ F, and sparsity of X, is balanced such that a solution with a high sparsity of X is obtained).

Applying the described method of dictionary learning (kernel dictionary learning) to classification tasks (in particular, image classification) offers a solution for learning non-linearity in the input data (in particular, in an image), which may greatly increase classification accuracy. However, decomposing a kernel matrix for large datasets is a computationally intensive task. Existing methods of dictionary learning using a (optimal) kernel approximation method improve computation run-time (as compared to other typical dictionary learning methods) but learn an over-complete dictionary (i.e. number of samples in dictionary is (typically much) greater than the input data (signal) dimension). In contrast, the described method preferably learns a discriminative (and/or complete) orthogonal dictionary (preferably for which the number of dictionary samples is equal to the input data (signal) dimension). As a result, learning and classification run-time may be significantly reduced as compared to existing methods.
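One generic way of learning a complete orthogonal dictionary with sparse codes is to alternate between hard-thresholded coding and an orthogonal Procrustes update. The sketch below shows that generic scheme only; it is not the stage 2 pseudo-code or the K-SAD algorithm of this description. It assumes that signals are stored as columns, that sparsity is imposed as "at most k non-zero entries per column", and that the data and function names are illustrative.

```python
import numpy as np


def keep_k_largest(coeffs, k):
    """Zero all but the k largest-magnitude entries in each column."""
    out = np.zeros_like(coeffs)
    rows = np.argsort(-np.abs(coeffs), axis=0)[:k]
    cols = np.arange(coeffs.shape[1])
    out[rows, cols] = coeffs[rows, cols]
    return out


def learn_orthogonal_dictionary(phi, k, n_iter=30):
    """Alternate sparse coding and dictionary update with D kept orthogonal."""
    d = np.eye(phi.shape[0])                  # start from the identity basis
    errors = []
    for _ in range(n_iter):
        # For orthogonal D the best k-sparse code is a thresholded D^T phi.
        x = keep_k_largest(d.T @ phi, k)
        # Orthogonal Procrustes step: best orthogonal D for the current codes.
        u, _, vt = np.linalg.svd(phi @ x.T)
        d = u @ vt
        errors.append(np.linalg.norm(phi - d @ x))
    x = keep_k_largest(d.T @ phi, k)
    return d, x, errors


rng = np.random.default_rng(1)
phi = rng.normal(size=(8, 50))                # 50 hypothetical signals of dimension 8
d, x, errors = learn_orthogonal_dictionary(phi, k=3)
print(errors[0], errors[-1])
```

Because D is orthogonal, the coding step is a single matrix product followed by thresholding, with no iterative pursuit, which is one reason an orthogonal dictionary can be cheaper at both training and inference time.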

The above described algorithm (corresponding to the pseudo-code above), Kernelized simultaneous approximation, and discrimination (K-SAD), learns a single highly discriminative and incoherent non-linear dictionary. The algorithm is particularly well suited to small to medium-scale real-world datasets (with the corresponding drawbacks, such as images taken in conditions of poor visibility). Experimental data shows that the algorithm may obtain over 97% classification accuracy on such datasets and shows that the algorithm can scale better both in space and time when compared to existing dictionary learning algorithms.

In summary, some of the key features/advantages of the described method of dictionary learning include:

i. An improvement (reduction) in run-time and/or (increased) classification accuracy as compared to existing Kernel deep learning methods - thanks to learning a discriminative orthogonal dictionary instead of an over-complete dictionary (as is typical in existing methods);

ii. An improvement in run-time and/or classification accuracy - as a result of using an efficient SVD method for large matrices when approximating the (kernel) matrix (preferably using the Krylov method), in contrast to existing methods which typically do not use a kernel matrix as described above and/or do not approximate it; and/or

iii. High classification (and/or detection) accuracy and faster run-time on image datasets (in particular on high-dimensional RGB-D and/or face recognition databases (datasets)) - thanks to learning a single (preferably generic) Kernelized orthogonal dictionary (which notably allows learning of non-linearity in input data).

The method may further comprise mapping the kernel dictionary back into the input domain in order to better understand the dictionary structure and diversity.

Y_transformed may be determined based on the learned sparse codes X and dictionary D and used as training data input for the artificial neural network. Alternatively, the sparse codes X may be determined based on the dictionary D and the transformed input F (e.g. via X = D^T F for an orthogonal matrix D), and used as training data input for the artificial neural network. Accordingly, neural network weights are determined (learned), and may subsequently be quantized (e.g. so as to make the neural network particularly appropriate for edge devices). Once the artificial neural network is trained, it may be used for inference - in particular, classification and/or detection (e.g. detection of object(s) in the input image).

At inference, the input to the trained artificial neural network may be determined in a similar way as for training - the operations and variables used in stages 1 and 2 at inference are denoted with a “test” subscript in the above pseudo-code (e.g. the transformed input is denoted by F_test).

Optionally, the transformed input data at inference (e.g. F_test) is stored and used for future training. The transformed input data is encoded and smaller in size (number of samples) and (optionally) dimensions than the original input data (e.g. Y_test), so storing transformed data as opposed to original inference input data may allow: (1) faster training (the transformation need not be repeated during training); and/or (2) reduced data usage (storing the transformation may take up less memory than the original data).

The described methods create binarized (smaller footprint) models by transforming images into a sparse domain where the dictionary and the sparse codes themselves are learned from the input data.

Data pre-processing

The input training data may optionally be pre-processed, as part of step 404 of the method of Figure 4 (before and/or after non-linear transformation), and/or between steps 402 and 404 of the method of Figure 4 (the original data being processed), and/or between steps 404 and 406 of the method of Figure 4 (the sparse representation (codes / coding) being processed).

Preferably, the original input data is (pre-)processed between steps 402 and 404. In particular, the original data may be transformed to mimic conditions of poor visibility - in other words, the original data may be augmented with artificial effects of conditions of poor visibility (such as artificial fog, rain and smoke). Accordingly, the artificial neural network is trained using images corresponding to conditions of poor visibility, and may perform better with inference input data obtained in conditions of poor visibility.
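One simple way of augmenting an image with an artificial poor-visibility effect is to blend it towards a bright "airlight" value, following the commonly used haze model I = J * t + A * (1 - t). The sketch below assumes a uniform transmission t across the image (no depth information) and pixel values scaled to [0, 1]; the function name and default values are illustrative and are not taken from this description.

```python
import numpy as np


def add_uniform_fog(image, transmission=0.6, airlight=0.9):
    """Blend an image in [0, 1] towards a constant airlight value."""
    if not 0.0 <= transmission <= 1.0:
        raise ValueError("transmission must lie in [0, 1]")
    image = np.asarray(image, dtype=float)
    # Lower transmission means denser fog and lower contrast.
    return image * transmission + airlight * (1.0 - transmission)


rng = np.random.default_rng(0)
clear = rng.random((32, 32, 3))          # hypothetical 32x32 RGB training image
foggy = add_uniform_fog(clear, transmission=0.6)
print(clear.std(), foggy.std())
```

Applying the function with a range of transmission values to each training image would yield a training set covering several fog densities.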

Optionally, the parameter selection and optimization may be automated.

Artificial neural network parameters

Optionally, prior to the training data entering the artificial neural network (between steps 404 and 406 in the method of Figure 4), the method comprises determining neural network (learning) parameters in dependence on the training data (e.g. the sparse representation (sparse coding)). A parameter space (a pre-defined range of possible parameters) is considered (explored) and the best / optimal (e.g. with respect to model accuracy, and/or training and/or inference time, and/or model complexity) configuration found (selected) with respect to the training (transformed) data. In other words, neural network parameters are optimally selected based on (given) the input dataset.

For example, in order to identify the optimal parameters in the parameter space, the method may use algorithms such as grid search, random search, or Bayesian optimization.

Example neural network parameters that may be determined in this way include: the number of layers, the number of nodes per layer, the activation function (e.g. sigmoid, tanh etc.), number of epochs, number of hidden units, and/or learning rate.
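A grid search over such a parameter space can be sketched as follows. The parameter names, candidate values and the scoring function are placeholders: in practice the score would come from training and evaluating a network on the transformed data, which is replaced here by a toy function so that the example runs on its own.

```python
import itertools


def grid_search(param_space, score_fn):
    """Evaluate every combination in the space and return the best one."""
    names = list(param_space)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_space[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:            # ties keep the first combination found
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in for 'train and evaluate'; peaks at 3 layers and rate 0.01."""
    return -abs(params["n_layers"] - 3) - 100 * abs(params["learning_rate"] - 0.01)


space = {
    "n_layers": [1, 2, 3, 4],
    "learning_rate": [0.001, 0.01, 0.1],
    "activation": ["sigmoid", "tanh"],
}
best, score = grid_search(space, toy_score)
print(best, score)
```

Random search follows the same pattern but draws a fixed number of combinations at random instead of enumerating all of them, which scales better when the space is large.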

Optionally, the method further comprises selecting the optimal input (for training and/or inference) to the artificial neural network in dependence on the (original and/or transformed) dataset. For example, the input may be selected among: the learned dictionary D, the sparse codes X, and a product of D * X.

Training the artificial neural network

At step 406 of the method of Figure 4, the artificial neural network is trained based on the training dataset. As part of step 406, the method may comprise saving the model(s) (e.g. weights, or parameters) and logging to a datetime directory. Further, visualisation instructions may be printed to the console after training.
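Saving the trained weights to a directory named after the date and time of the run might look like the following sketch. The directory layout, the file name weights.npz and the use of NumPy's .npz format are assumptions chosen for the example.

```python
import datetime
import pathlib
import tempfile

import numpy as np


def save_weights(weights, root):
    """Write a dict of weight arrays to <root>/<YYYYmmdd-HHMMSS>/weights.npz."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    run_dir = pathlib.Path(root) / stamp
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / "weights.npz"
    np.savez(path, **weights)
    print(f"Saved weights to {path}")     # simple console log after training
    return path


# Usage with hypothetical weights, written to a temporary directory.
with tempfile.TemporaryDirectory() as root:
    saved = save_weights({"layer0": np.ones((2, 3)), "bias0": np.zeros(3)}, root)
    with np.load(saved) as loaded:
        restored_shape = loaded["layer0"].shape
        restored_bias = loaded["bias0"].copy()
```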

In addition to the training dataset, the training process may be based on further inputs, such as: global configuration settings, and/or network configuration settings ((hyper)parameters).

Deployment of trained artificial neural network

Optionally, the method comprises inference using the trained neural network models (‘playing the models in action’) and deploying the models on to the target environment (e.g. Edge and/or Cloud). The combination of non-linear dictionary learning and sparse codes in the described methods allows obtaining an improved data format (that balances sparse code sparsity and representation accuracy) which is then passed into a neural network. This may significantly reduce training time by making use of very limited data (at times 1/10th of the original data set), thereby creating efficient deep learning models. The models are further quantised / binarized for edge devices. Moreover, the described methods may allow automated neural network parameter selection and optimisation.

Comparison to prior systems

Referring to Figure 14, a conventional image classification method using an artificial neural network is shown. Conventional artificial neural networks are an interconnected group of nodes (neurons). The connections between these neurons are modelled as weights and all inputs to the network (e.g. all pixels of an input image) are modified based on these weights and summed to produce an outcome (e.g. a classification of the input image). In other words, the training data for the artificial neural network (e.g. based on the ‘YOLO’ model) comprises the original dataset (e.g. comprised of original images).

Referring to Figure 15, an example implementation of the method of Figure 4 is shown. In contrast to the conventional method of Figure 14, only a subset of the input data (or a lower-dimension representation of the input data) is fed into the neural network. This subset is determined from the original dataset using above-described methods (denoted by “Intelligent Encoding” in Figure 15). By using only a subset of the input data, a neural network may be generated (trained) by making use of just enough information, which is selected intelligently (e.g. to optimise data size, and/or classification performance), and trained at a much faster rate than training of the conventional neural network of Figure 14.

Once determined, the subset is fed as training input into an artificial neural network (or any other deep learning (DL) network). Once trained, the neural network may further be optimised, and/or its parameters (e.g. weights) may be binarized.

By using (sensing) only a limited amount (sub-set) of the input data, taken from a camera or any other sensor, the method of Figure 15 has been found to achieve classification accuracy of 85% and 70% on MNIST and CIFAR benchmark datasets respectively.

The method of Figure 15 may be applied in a wide range of applications, for example as part of a computer vision system. Example applications include drones, autonomous vehicles, cloud services (e.g. for inference on the cloud rather than on a device). Further, since the computational and memory requirements of the method of Figure 15 are lower than for the conventional method of Figure 14, the method may be implemented using low-power devices such as an FPGA, or a Raspberry Pi.

The method of Figure 15 is particularly beneficial for hardware resource allocation, e.g. for memory constrained devices. The method can also be deployed on cloud services, e.g. Amazon® Web Services (AWS) or Google® Cloud Platform (GCLOUD). It can also be deployed on memory limited and battery budgeted edge devices (e.g. mobile phones and FPGAs). The way models are built using the method of Figure 15 may also enable robots to ‘see’ (e.g. perform detection, classification and/or localisation tasks) in conditions of poor visibility (e.g. hazy, rainy or foggy environments).

The described methods also differ from the conventional use of sparse coding techniques for machine learning. Conventional methods obtain sparse codes by making use of a fixed (or ready-made) dictionary (e.g. contourlets) as feature vectors. In contrast, the above described methods correspond to learning dictionaries and sparse codes from input images (or image blocks / segments / patches). Furthermore, the above described methods may reduce the training time (e.g. by using a low-dimensionality sparse coding) and weight binarization may reduce inference time, computational cost and memory usage.

Alternative classification methods

In an alternative example, classification of input data may be performed without the use of an artificial neural network. If specific (preferably generic) dictionaries are determined for each class of objects (e.g. car, person, tree, etc.), the input data may be classified by finding the dictionary with the sparsest representation (sparsest sparse coding).
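Classification by "sparsest representation" can be sketched as follows, under several assumptions made for illustration: each class has its own orthogonal dictionary, sparsity is measured as the number of coefficients needed to capture 95% of the signal energy, the signal is non-zero, and the dictionaries and signals are synthetic.

```python
import numpy as np


def atoms_needed(dictionary, signal, energy=0.95):
    """Number of largest coefficients needed to reach the given energy share."""
    coeffs = dictionary.T @ signal            # exact codes for an orthogonal dictionary
    power = np.sort(coeffs ** 2)[::-1]
    cumulative = np.cumsum(power) / power.sum()
    return int(np.searchsorted(cumulative, energy)) + 1


def classify_by_sparsity(dictionaries, signal):
    """Return the label whose dictionary represents the signal most sparsely."""
    counts = {label: atoms_needed(d, signal) for label, d in dictionaries.items()}
    return min(counts, key=counts.get), counts


rng = np.random.default_rng(0)
dim = 16
d_a = np.eye(dim)                                   # hypothetical class A dictionary
d_b, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # hypothetical class B dictionary
dictionaries = {"class_a": d_a, "class_b": d_b}

signal_b = 1.0 * d_b[:, 2] - 2.0 * d_b[:, 5]        # built from two class B atoms
signal_a = 1.5 * d_a[:, 3]                          # built from one class A atom
label_b, counts_b = classify_by_sparsity(dictionaries, signal_b)
label_a, counts_a = classify_by_sparsity(dictionaries, signal_a)
print(label_a, counts_a, label_b, counts_b)
```

A signal built from a few atoms of one dictionary needs many atoms of an unrelated dictionary to reach the same energy, which is what the comparison exploits.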

In a further alternative example, the dictionary and sparse codes may be used for image classification as shown in Stage 3 (lines 15-20) of the pseudo-code above.

In a yet further alternative example, the artificial neural network may be trained using the dictionary rather than the sparse codes as described above.

Alternatives and modifications

Various other modifications will be apparent to those skilled in the art. For example, while the detailed description has described a method of identifying/classifying objects in images, the method is more broadly useable for any dataset that contains a plurality of datapoints, where these datapoints may relate, for example, to audio or textual data.

It will be appreciated that the described methods may be used as an input in motive/navigation systems for robots or other powered machines. Decisions may be taken based on the detected objects, for example to control motive systems so as to avoid or head towards a particular detected object.

In various embodiments, the size of the reduced set of basis elements is determined using a predetermined ratio or determined based on an accuracy requirement. As an example, the set of basis elements may be selected to be one-fifth, or one-tenth of the size of the number of datapoints. There may also be determined a maximum, or a desired, amount of information loss, where the set of basis elements may be sized so as to avoid losing a certain amount of information from the number of datapoints.
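The two sizing rules (a fixed ratio, or a cap on information loss) can be illustrated with the sketch below. Measuring "information" as the share of squared singular values retained is an assumption made for this example, as are the function names and the small test matrix.

```python
import numpy as np


def size_from_ratio(n_datapoints, ratio):
    """Size of the reduced set as a fixed fraction of the number of datapoints."""
    return max(1, int(round(n_datapoints * ratio)))


def size_from_loss(data, max_loss):
    """Smallest number of components keeping at least (1 - max_loss) of the energy."""
    singular_values = np.linalg.svd(data, compute_uv=False)
    energy = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
    count = int(np.searchsorted(energy, 1.0 - max_loss)) + 1
    return min(count, len(singular_values))


# A 40000-pixel image reduced to one-tenth of its datapoints.
print(size_from_ratio(40000, 0.1))

# A small matrix with singular values 3, 2, 1 and 0.1.
data = np.diag([3.0, 2.0, 1.0, 0.1])
print(size_from_loss(data, max_loss=0.05), size_from_loss(data, max_loss=0.10))
```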

Optionally, the described process of identifying images using an artificial neural network comprises obtaining a sparse representation of the image (comprising a plurality of basis elements) based on a plurality of pixels within the image. The basis elements preferably form an optimal representation of the image that may be combined to obtain the full set of datapoints. Identifying a feature based on the basis elements as opposed to the full set of datapoints may enable a feature to be extracted quickly and at a low processing cost. Optionally, determining a sparse representation comprises compressing the image, optionally by lossless compression.

It will be understood that the present disclosure has been described above purely by way of example, and modifications of detail can be made within the scope of the disclosure.

Reference numerals appearing in the claims are by way of illustration only and shall have no limiting effect on the scope of the claims.
