


Title:
METHOD AND SYSTEM OF SELECTING ONE OR MORE IMAGES FOR HUMAN LABELLING
Document Type and Number:
WIPO Patent Application WO/2024/002534
Kind Code:
A1
Abstract:
There is provided a method and system of selecting unlabelled images for human labelling in active learning, the method comprising: receiving unlabelled images; applying a plurality of models to a selected image, wherein the plurality of models is trained on a single labelled dataset, and wherein each of the plurality of models outputs a set of bounding boxes, wherein each bounding box represents a location of a detected object; associating bounding boxes between sets of bounding boxes, wherein the associated bounding boxes each represent a location of the same object; and generating a normalised object uncertainty score for each object. The method may further comprise generating a normalised image uncertainty score for the selected image; and outputting the selected image for human labelling where the normalised image uncertainty score is above a threshold.

Inventors:
SAHU SHWETA (SG)
KANAGARAJ SARANYA (SG)
Application Number:
PCT/EP2023/056856
Publication Date:
January 04, 2024
Filing Date:
March 17, 2023
Assignee:
CONTINENTAL AUTOMOTIVE TECH GMBH (DE)
International Classes:
G06F18/10; G06F18/21; G06F18/22; G06F18/40; G06V10/82
Other References:
HAUSSMANN ELMAR ET AL: "Scalable Active Learning for Object Detection", 2020 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV), 9 April 2020 (2020-04-09), XP093043312, ISBN: 978-1-7281-6673-5, Retrieved from the Internet [retrieved on 20230501], DOI: 10.1109/IV47402.2020.9304793
ADRIAN ROSEBROCK: "Simple object tracking with OpenCV - PyImageSearch", 23 July 2018 (2018-07-23), XP093043314, Retrieved from the Internet [retrieved on 20230501]
VIKAS S SHETTY: "Object detection through Ensemble of models | by Vikas S Shetty | inspiringbrilliance | Medium", 20 November 2020 (2020-11-20), XP093043315, Retrieved from the Internet [retrieved on 20230501]
ANGELA CASADO-GARCÍA, JÓNATHAN HERAS: "Ensemble Methods for Object Detection", 29 August 2020 (2020-08-29), XP093043316, Retrieved from the Internet [retrieved on 20230501]
ROMAN SOLOVYEV ET AL: "Weighted boxes fusion: Ensembling boxes from different object detection models", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 6 February 2021 (2021-02-06), XP081875378, DOI: 10.1016/J.IMAVIS.2021.104117
LAKSHMINARAYANAN: "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles", 31ST CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS, 2017
Attorney, Agent or Firm:
CONTINENTAL CORPORATION (DE)
Claims:
CLAIMS

1. A computer-implemented method of selecting one or more unlabelled images for human labelling in active learning, the method comprising: receiving one or more unlabelled images; applying a plurality of models to an image selected from the one or more unlabelled images, wherein the plurality of models is trained on a single labelled dataset, and wherein each of the plurality of models outputs a set of bounding boxes, wherein each bounding box represents a location of an object detected by each of the plurality of models; associating bounding boxes between sets of bounding boxes, wherein the associated bounding boxes each represent a location of the same object within the selected image; and generating a normalised object uncertainty score for each object based on a location of the associated bounding boxes.

2. The computer-implemented method of claim 1, wherein each of the plurality of models is trained on the entire single labelled dataset with different and/or random weight initialisations.

3. The computer-implemented method of any of the preceding claims, wherein each of the plurality of models has a MobileNet-SSD architecture.

4. The computer-implemented method of any of the preceding claims, wherein associating bounding boxes between sets of bounding boxes comprises: assigning an identifier to each bounding box, preferably using a tracking algorithm, wherein each identifier is associated with an object within the selected image; and associating bounding boxes with the same assigned identifier.

5. The computer-implemented method of any of the preceding claims, wherein generating the normalised object uncertainty score for each object comprises: generating an object uncertainty score for each object based on the associated bounding boxes representing the object; and normalising the object uncertainty score based on the number of models that detected the object.

6. The computer-implemented method of claim 5, wherein generating the object uncertainty score for each object comprises or is done by calculating a cumulative absolute difference between coordinates of at least two points of pairs of the associated bounding boxes.

7. The computer-implemented method of claim 6, wherein the bounding boxes are rectangular and the at least two points are vertices diagonal to each other.

8. The computer-implemented method of claim 5, wherein generating the object uncertainty score for each object comprises or is done by calculating a cumulative distance, preferably a Euclidean distance, between the location of pairs of the associated bounding boxes.

9. The computer-implemented method of any of the preceding claims, further comprising: generating a normalised image uncertainty score for the selected image based on the normalised object uncertainty scores generated for all objects within the selected image; and outputting the selected image for human labelling where the normalised image uncertainty score is above a threshold.

10. The computer-implemented method of claim 9, wherein generating a normalised image uncertainty score for the selected image comprises: calculating an image uncertainty score for the selected image based on the generated normalised object uncertainty score for all objects within the selected image; and normalising the image uncertainty score based on the number of models applied.

11. The computer-implemented method of claim 10, wherein calculating an image uncertainty score for the selected image comprises or is done by summing the normalised object uncertainty score for all objects within the selected image.

12. A system for selecting one or more images for human labelling in active learning, the system comprising one or more processors in communication with a data storage device, the one or more processors configured to carry out a computer-implemented method according to any one of claims 1 to 11.

13. A computer program, a machine-readable storage medium, or a data carrier signal that comprises instructions, that upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of a computer-implemented method according to any one of claims 1 to 11.

Description:
METHOD AND SYSTEM OF SELECTING ONE OR MORE IMAGES FOR HUMAN LABELLING

TECHNICAL FIELD

[0001] The invention relates generally to methods and systems for computer vision, and more particularly to selection of the most informative images for human labelling during active learning.

BACKGROUND

[0002] Object detection is a process of locating and localizing objects in an image. Object detection is widely used in many applications, including autonomous driving, robotics, and surveillance. For example, detection of objects such as vehicles and pedestrians is required in autonomous driving to detect obstacles and to understand the driving scene. Object detection may be achieved using several known algorithms, such as conventional image processing algorithms or artificial intelligence approaches. Deep learning-based approaches are used widely for object detection. To achieve a high accuracy output using deep learning, either a large amount of labelled data or well-chosen labelled samples are required.

[0003] Labelling is a process of adding one or more meaningful and informative labels to an input sample to provide context so that a deep learning model can learn from it. There are situations in which unlabelled data is abundant but manual labelling is expensive. In such a scenario, learning algorithms, also known as learners, can actively query a human, also known as a user or teacher, for labels. This type of iterative supervised learning is known as active learning. Since the learner chooses the samples to be labelled by the user, the number of samples required to learn a concept can often be much lower than the number required in normal supervised learning. Thus, the main aim of active learning is to obtain higher accuracy with less labelled or annotated data by selecting the most informative samples for human labelling. The most informative samples are those that contain information that has yet to be learned by the model being trained. Once the most informative samples have been labelled by a human, these samples may be added to the initial training dataset used to retrain the model or may be used to generate a new dataset for training other models.

[0004] In active learning, there are several known algorithms, such as the entropy method, the least confidence method, and the margin sampling method, that calculate probabilistic scores to determine good, or informative, samples for human labelling. Uncertainty estimation is one such active learning method used to achieve a certain accuracy with minimal labelling effort. In such settings, the model learns to select the most informative unlabelled samples for labelling based on an estimated uncertainty generated by the model for each sample. The higher the uncertainty of a sample, the more likely the sample contains information that has yet to be learned by the model. Therefore, highly uncertain predictions are assumed to be more informative for improving model performance. An uncertainty estimate describes how unsure or uncertain a model is about the results predicted by the model. Monte Carlo dropout and model ensemble learning are methods that have garnered attention for calculating uncertainty because they mesh well with existing neural network architectures and are less computationally constrained than other methods. The former varies the dropout masks, while the latter varies the models used. In both methods, uncertainty must be calculated between the predictions of different dropout masks or the predictions of different models. Furthermore, dropout-based methods require the addition of dropout layers to the existing network architecture.

[0005] There are various methods proposed for uncertainty estimation in active learning. A first method estimates uncertainty by iteratively increasing the noise in the reference image, detecting the object, and calculating the intersection over union (IOU) between the predicted bounding box (BBOX) and the reference BBOX for "localization stability". The iterative process makes this method computationally heavy for the calculation of uncertainty scores for each image. The method also requires an additional region proposal network or a selective search mechanism to identify intermediate region proposals of foreground objects and compute the IOU, thus making the method even more computationally heavy. Furthermore, it is unclear whether the method can be used for multi-object detection. In addition, although the method is likely to be robust against noise additions to the image, the method may not be robust to geometric variation in the position of objects.

[0006] Another method uses the Dirichlet distribution to calculate uncertainty scores for object detection. Feature extraction takes place in several stages, using region of interest (ROI) pooling and Feature Pyramid Networks (FPN). The third-party object detection network, ROI pooling layer, and two classifiers make the setup computationally intensive.

[0007] Yet another method uses a probabilistic distribution concept for handling localization and classification of objects in a single forward pass. The method uses a deep neural network followed by a mixture density network comprising localization and classification heads for estimating uncertainty scores. Aleatoric and epistemic uncertainty scores are estimated to measure every object's informativeness score, and the top k scored images are submitted for labelling. The method requires the computation of 3 groups of parameters summing up to 12 parameters for each detected bounding box, with the uncertainty score estimated based on these 12 parameters, which may be computationally inefficient. Yet another method estimates uncertainty for a one-stage object detector using the combination of a Bayesian neural network and YOLOv3, a real-time object detection algorithm that identifies specific objects. This method computes aleatoric uncertainty for the bounding box detected for each object. During training, this method trains one model with dropout and one without dropout, with dropout added to five different convolutional layers. This step of adding dropout layers to the existing network architecture is not useful where an active learning method must be developed without disturbing the existing network. As the initial training without dropout and aleatoric loss is crucial for stable training, the method can only be used for one-stage object detectors, and retraining the model for a new class or kind of data would be tedious.

SUMMARY

[0008] Embodiments of the present invention improve the selection of one or more images for human labelling in active learning by estimating object and/or image uncertainty based on the output of a model ensemble comprising a plurality of models trained on a single labelled dataset.

[0009] It shall be noted that all embodiments of the present invention concerning a method might be carried out with the order of the steps as described; nevertheless, this does not have to be the only or essential order of the steps of the method. The herein presented methods can be carried out with another order of the disclosed steps without departing from the respective method embodiment, unless explicitly mentioned to the contrary hereinafter.

[0010] To solve the above technical problems, the present invention provides a computer-implemented method of selecting one or more unlabelled images for human labelling in active learning, the method comprising: receiving one or more unlabelled images; applying a plurality of models to an image selected from the one or more unlabelled images, wherein the plurality of models is trained on a single labelled dataset, and wherein each of the plurality of models outputs a set of bounding boxes, wherein each bounding box represents a location of an object detected by each of the plurality of models; associating bounding boxes between sets of bounding boxes, wherein the associated bounding boxes each represent a location of the same object within the selected image; and generating a normalised object uncertainty score for each object based on a location of the associated bounding boxes.

[0011] The present invention is advantageous over prior art solutions as the existing network architecture remains untouched and undisturbed. Therefore, the present invention may be easily used to develop an active learning-based process at the time of evaluation or after the core development of the network architecture. The present invention may be used with models of any architecture, including both one-stage and two-stage object detectors. The present invention also provides several additional advantages over known methods, including the reduction of labelling costs, time, and resources. A plurality of models, also known as a model ensemble, makes better predictions or detections and achieves better performance than any single contributing model, and also reduces the spread or dispersion of the predictions or detections and of model performance. Although training the ensembled models requires time and resources, this training is a one-time process, and during evaluation the checkpoint can be loaded once to compute uncertainty for any number of images. In addition, unlike prior art solutions that use groups of parameters during evaluation, which makes such solutions computationally heavy, the present invention only uses coordinates of predicted objects for uncertainty estimation, which makes the present invention more computationally efficient. Furthermore, unlike prior art solutions that have complicated iterative processes comprising adding noise to pixel values to evaluate localisation stability, as well as complicated architectures with multiple networks or search mechanisms to identify intermediate region proposals of foreground objects, the present invention may be more computationally efficient as the architecture and method are more streamlined. In comparison with prior art solutions such as the Dirichlet distribution approach, which has additional steps like Feature Pyramid Networks (FPN), a Region Proposal Network (RPN) and regions with convolutional neural network features (R-CNN), the present invention may consume less time, memory, and computation power.

[0012] A preferred method of the present invention is a computer-implemented method as described above, wherein each of the plurality of models is trained on the entire single labelled dataset with different and/or random weight initialisations.

[0013] The above-described aspect of the present invention is advantageous as using different and/or random weight initialisations would give increased performance over other methods of training a model ensemble, such as bagging or bootstrapping, because each model is trained on the entire dataset, which reduces bias.

[0014] A preferred method of the present invention is a computer-implemented method as described above, wherein each of the plurality of models has a MobileNet-SSD architecture.

[0015] The above-described aspect of the present invention is advantageous as MobileNet-SSD is fast and accurate. MobileNet is a convolutional neural network (CNN) that has lower computational costs due to the implementation of depthwise separable convolutions. Although MobileNet on its own has reduced accuracy compared to other neural networks, including a single shot detector (SSD) algorithm complements MobileNet by increasing the accuracy while retaining the speed of MobileNet without training an additional CNN.

[0016] A preferred method of the present invention is a computer-implemented method as described above, wherein associating bounding boxes between sets of bounding boxes comprises: assigning an identifier to each bounding box, preferably using a tracking algorithm, wherein each identifier is associated with an object within the selected image; and associating bounding boxes with the same assigned identifier.

[0017] The above-described aspect of the present invention is advantageous as the method may be used to address multi-object detection irrespective of the object class used. Tracking algorithms, particularly multi-object tracking algorithms, can locate, differentiate, track and/or associate objects over multiple images or image frames whilst maintaining the identities of the objects.

[0018] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein generating the normalised object uncertainty score for each object comprises: generating an object uncertainty score for each object based on the associated bounding boxes representing the object; and normalising the object uncertainty score based on the number of models that detected the object.

[0019] The above-described aspect of the present invention is advantageous as an overall object uncertainty score may be calculated based on the output of a plurality of models, which may make better detections and achieve better uncertainty estimation than a single model, as well as reduce the spread or dispersion of the uncertainty. Normalisation of the object uncertainty score is advantageous as it allows different object uncertainty scores to be compared or combined in subsequent steps.

[0020] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein generating the object uncertainty score for each object comprises or is done by calculating a cumulative absolute difference between coordinates of at least two points of pairs of the associated bounding boxes.

[0021] The above-described aspect of the present invention is advantageous as calculating an absolute difference between coordinates allows the comparison of the spatial location and size of the generated bounding boxes without a comparison of the total area or IOU.

[0022] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein the bounding boxes are rectangular and the at least two points are vertices diagonal to each other.

[0023] The above-described aspect of the present invention is advantageous as comparing vertices that are diagonal to each other for a rectangular bounding box achieves good performance using minimal computational power as only two points are compared to obtain an accurate representation of the dimensions of the bounding boxes.

[0024] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein generating the object uncertainty score for each object comprises or is done by calculating a cumulative distance, preferably a Euclidean distance, between the location of pairs of the associated bounding boxes.

[0025] The above-described aspect of the present invention is advantageous as calculating a distance between the locations of the bounding boxes allows the comparison of the spatial location of the generated bounding boxes without a comparison of the total area or IOU.

[0026] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, further comprising: generating a normalised image uncertainty score for the selected image based on the normalised object uncertainty scores generated for all objects within the selected image; and outputting the selected image for human labelling where the normalised image uncertainty score is above a threshold.

[0027] The above-described aspect of the present invention is advantageous as the overall human effort required to label the unlabelled images is reduced. Time and cost of labelling an unlabelled dataset is reduced as only samples or images with the highest uncertainty, or the most informative samples, are sent for human labelling. This could potentially lead to an increase in performance of around 10%.

[0028] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein generating a normalised image uncertainty score for the selected image comprises: calculating an image uncertainty score for the selected image based on the generated normalised object uncertainty score for all objects within the selected image; and normalising the image uncertainty score based on the number of models applied.

[0029] The above-described aspect of the present invention is advantageous as the image uncertainty score obtained provides an accurate estimation of uncertainty of the applied models in detecting or identifying objects within the image. Normalisation is advantageous as it allows the image uncertainty score obtained to be compared to a threshold in subsequent steps to determine which images to be sent for labelling.

[0030] A preferred method of the present invention is a computer-implemented method as described above or as described above as preferred, wherein calculating an image uncertainty score for the selected image comprises or is done by summing the normalised object uncertainty score for all objects within the selected image.

[0031] The above-described aspect of the present invention is advantageous as the image uncertainty score calculated accounts for all the objects within the selected image.

[0032] The above-described advantageous aspects of a computer-implemented method of the invention also hold for all aspects of a below-described system of the invention. All below-described advantageous aspects of a system of the invention also hold for all aspects of an above-described computer-implemented method of the invention.

[0033] The invention also relates to a system for selecting one or more images for human labelling in active learning, the system comprising one or more processors in communication with a data storage device, the one or more processors configured to carry out a computer-implemented method according to the invention.

[0034] The above-described advantageous aspects of a computer-implemented method, computer system or vehicle of the invention also hold for all aspects of a below-described computer program, machine-readable medium, or a data signal of the invention. All below-described advantageous aspects of a computer program, machine-readable medium, or a data signal of the invention also hold for all aspects of an above-described computer-implemented method, computer system or vehicle of the invention.

[0035] The invention also relates to a computer program, a machine-readable medium, or a data carrier signal that comprises instructions, that upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of a computer-implemented method according to the invention. The machine-readable medium may include any medium and/or mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device). The machine-readable medium may be any medium, such as for example, read-only memory (ROM); random access memory (RAM); a universal serial bus (USB) stick; a compact disc (CD); a digital video disc (DVD); a data storage device; a hard disk; electrical, acoustical, optical, or other forms of propagated signals (e.g., digital signals, data carrier signal, carrier waves), or any other medium on which a program element as described above can be transmitted and/or stored.

[0036] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term "sensor" refers to any sensor that detects or responds to some type of input from a perceived environment. Examples of sensors include image sensors, cameras, video cameras, LiDAR sensors, radar sensors, depth sensors, light sensors, colour sensors, or red, green, blue, and distance (RGBD) sensors.

[0037] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term “bounding box” refers to a bounding region of an object, and may include a bounding box, a bounding circle, a bounding ellipse, or any other suitably shaped region representing an object. In particular, a bounding box region contains at least all pixels that are deemed to belong to a detected or predicted object within an image. A bounding box surrounding an object can have a rectangular shape, a square shape, a polygon shape, a blob shape, or any other suitable shape.

[0038] As used in this summary, in the description below, in the claims below, and in the accompanying drawings, the term “model” refers to any type of model, such as a machine learning model, a deep learning model, a recognition model, or an object detection model. Examples of models include neural networks, convolutional neural networks, YOLO models, and single shot detectors.

BRIEF DESCRIPTION OF THE DRAWINGS

[0039] These and other features, aspects, and advantages will become better understood with regard to the following description, appended claims, and accompanying drawings where:

[0040] Fig. 1 is a high-level block diagram of a computer-implemented method for selecting one or more images for human labelling in active learning, in accordance with embodiments of the present disclosure;

[0041] Fig. 2 is a functional block diagram of a system for selecting one or more unlabelled images for human labelling in active learning, in accordance with embodiments of the present disclosure;

[0042] Fig. 3 is a flowchart of examples of output of object detection module, data association module and uncertainty estimation module based on a selected image, in accordance with embodiments of the present disclosure;

[0043] Fig. 4 is a pictorial representation of a model that may be trained by the training module on a labelled dataset stored on a labelled database, in accordance with embodiments of the present disclosure;

[0044] Fig. 5 is a schematic illustration of the convolutional layers and parameters of a model having a MobileNet-SSD architecture, in accordance with embodiments of the present disclosure; and

[0045] Fig. 6 is a schematic illustration of a computer system within which a set of instructions, when executed, may cause one or more processors of the computer system to perform one or more of the methods described herein, in accordance with embodiments of the present disclosure.

[0046] In the drawings, like parts are denoted by like reference numerals.

[0047] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems embodying the principles of the present subject matter. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and executed by a computer or processor, whether or not such computer or processor is explicitly shown.

DETAILED DESCRIPTION

[0048] In the summary above, in this description, in the claims below, and in the accompanying drawings, reference is made to particular features (including method steps) of the invention. It is to be understood that the disclosure of the invention in this specification includes all possible combinations of such particular features. For example, where a particular feature is disclosed in the context of a particular aspect or embodiment of the invention, or a particular claim, that feature can also be used, to the extent possible, in combination with and/or in the context of other particular aspects and embodiments of the invention, and in the inventions generally.

[0049] In the present document, the word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment or implementation of the present subject matter described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

[0050] While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described in detail below. It should be understood, however, that it is not intended to limit the disclosure to the forms disclosed, but on the contrary, the disclosure is to cover all modifications, equivalents, and alternatives falling within the scope of the disclosure.

[0051] The following description sets forth exemplary methods, parameters, and the like. It should be recognized, however, that such description is not intended as a limitation on the scope of the present disclosure but is instead provided as a description of exemplary embodiments.

[0052] Although the following description uses terms “first,” “second,” etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another.

[0053] The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that on-going technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation.

[0054] For the sake of convenience, the operations of the present disclosure are described as interconnected functional blocks or distinct software modules. This is not necessary, however, and there may be cases where these functional blocks or software modules are equivalently aggregated into a single logic device, program, or operation with unclear boundaries. In any event, the functional blocks and software modules or described features can be implemented by themselves, or in combination with other operations in either hardware or software. Further, the boundaries of the functional building blocks or modules have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. The terms "comprises", "comprising", "includes" or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a setup, device or method that includes a list of components or steps does not include only those components or steps but may include other components or steps not expressly listed or inherent to such setup or device or method. In other words, one or more elements in a system or apparatus preceded by "comprises... a" does not, without more constraints, preclude the existence of other elements or additional elements in the system or method. It must also be noted that as used herein and in the appended claims, the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise.

[0055] Fig. 1 is a high-level block diagram of a computer-implemented method 100 for selecting one or more images for human labelling in active learning, in accordance with embodiments of the present disclosure. Method 100 for selecting one or more images for human labelling in active learning may be implemented by a data processing device on any architecture and/or computing system. For example, various architectures employing, for example, multiple integrated circuit (IC) chips and/or packages, and/or various computing devices and/or consumer electronic (CE) devices such as multi-function devices, tablets, smart phones, etc., may implement the techniques and/or arrangements described herein. Method 100 may be stored as executable instructions that, upon execution on a data processing device and/or control unit, cause the data processing device and/or control unit to perform the steps of method 100.

[0056] Method 100 for selecting one or more images for human labelling in active learning may commence at step 108, wherein one or more images are received. In some embodiments, the one or more images may be received from an unlabelled dataset stored on a data storage device. In some embodiments, the one or more images may be received from one or more sensors. Preferably, the one or more images may be received from image sensors like cameras.

[0057] According to some embodiments, method 100 may comprise step 116 wherein a plurality of models, also known as a model ensemble, is applied to an image selected from the one or more images received in step 108. Model ensembling is a process where multiple diverse models, or a model ensemble, are created or trained to predict an outcome. The models trained or created may be any algorithm or model, including machine learning or artificial intelligence models such as neural networks. According to some embodiments, the model ensemble or plurality of models may be created or trained by using different modelling algorithms, different neural network architectures, or different hyperparameters (e.g., weight initialization, learning rate, optimizers). An example of a known method of training a model ensemble is bagging or bootstrapping, wherein the ensemble models are trained on different bootstrap samples of the same dataset to induce diversity. Preferably, the plurality of models is trained on a single labelled dataset with different and/or random weight initializations to increase the performance of the model ensemble. An example of a method of training the plurality of models on a single labelled dataset with random initialisations may be found in section 2.4 of a paper titled "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles" by Lakshminarayanan et al. for the 31st Conference on Neural Information Processing Systems (NIPS 2017). Each model predicts a set of bounding boxes for an image, wherein each bounding box represents a location of an object detected by the model. Applying N models will result in N sets of bounding boxes for each image. According to some embodiments, the number of models can vary from a minimum of three to any number. Increasing the number of models beyond three may result in a trade-off between calculation performance and computational efficiency and therefore the number of models used may be adjusted based on the desired calculation performance and/or computational efficiency.
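For illustration, the short Python sketch below shows one way step 116 might be organised. It assumes a hypothetical detect(model, image) helper, which is not part of this disclosure, that wraps whatever object detector is used and returns bounding boxes as (x1, y1, x2, y2) tuples.

```python
# A minimal sketch of step 116, assuming a hypothetical detect(model, image)
# helper that wraps the chosen object detector and returns bounding boxes
# as (x1, y1, x2, y2) tuples. Names here are illustrative, not from the
# application itself.
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def apply_model_ensemble(models: Sequence[object],
                         image: object,
                         detect: Callable[[object, object], List[Box]]) -> List[List[Box]]:
    """Apply every model of the ensemble to one selected image.

    Applying N models produces N sets of bounding boxes, one set per model,
    as described for step 116.
    """
    return [detect(model, image) for model in models]
```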

[0058] According to some embodiments, method 100 may comprise step 124 wherein bounding boxes are associated between sets of bounding boxes, wherein the associated bounding boxes each represent a location of the same object within the selected image. Examples of objects include humans, vehicles, and animals. Preferably, associating the bounding boxes between sets of bounding boxes may comprise assigning an identifier to each bounding box and associating bounding boxes with the same assigned identifier. The identifier may be any identifier, including labels, numbers and/or letters. Preferably, a tracking algorithm is used to assign the identifier, also known as a track identifier. In some embodiments, a centroid-based tracker may be used to assign the identifier. In some embodiments, a multiple object tracking algorithm may be used. An example of a known centroid-based tracker that may be employed is the OpenCV-based tracker described at https://pyimagesearch.com/2018/07/23/simple-object-tracking-with-opencv/. In some embodiments, neural network-based object trackers such as a convolutional neural network may be used.
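The following simplified sketch illustrates the idea of centroid-based association for step 124. It is deliberately more basic than the OpenCV tracker referenced above, and the max_dist matching threshold is an illustrative assumption.

```python
# A simplified, illustrative centroid-based association for step 124.
# The max_dist threshold and the data layout are assumptions for this sketch.
import math
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def centroid(box: Box) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def associate_boxes(box_sets: List[List[Box]],
                    max_dist: float = 50.0) -> Dict[int, Dict[int, Box]]:
    """Assign an identifier to each box and group boxes sharing an identifier.

    Returns {object_id: {model_index: box}}; each inner dict holds the
    associated bounding boxes that represent the same object.
    """
    objects: Dict[int, Dict[int, Box]] = {}
    next_id = 0
    for model_idx, boxes in enumerate(box_sets):
        for box in boxes:
            cx, cy = centroid(box)
            best_id, best_d = None, max_dist
            for obj_id, members in objects.items():
                if model_idx in members:
                    continue  # at most one box per model per object
                ox, oy = centroid(next(iter(members.values())))
                d = math.hypot(cx - ox, cy - oy)
                if d < best_d:
                    best_id, best_d = obj_id, d
            if best_id is None:
                best_id = next_id
                next_id += 1
                objects[best_id] = {}
            objects[best_id][model_idx] = box
    return objects
```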

[0059] According to some embodiments, method 100 may comprise step 132 wherein a normalised object uncertainty score is generated for each object in the selected image based on a location of the bounding boxes associated in step 124.

[0060] According to some embodiments, step 132 of generating the normalised object uncertainty score for each object may comprise generating an object uncertainty score for each object based on the associated bounding boxes representing the object. In some embodiments, generating the object uncertainty score for each object may comprise calculating a cumulative absolute difference between coordinates of at least two points of pairs of the associated bounding boxes. In some embodiments, the coordinates of the bounding boxes may be an x-coordinate and a y-coordinate. The absolute difference may be calculated between the bounding boxes generated by all possible pairwise combinations of the plurality of models regardless of whether the object is detected by a particular model. In some embodiments, where an object is not detected by a particular model, the coordinates of the points of the bounding boxes may be set as (0, 0) or (0, 0, 0, 0). In some embodiments, the object uncertainty score is the sum of the absolute differences calculated for the at least two points. In some embodiments, the object uncertainty score for each object may be generated using Equation 1, wherein Equation 1 is:

UncertaintyScore_objn = Σ over all model pairs (Ma, Mb) of AbsoluteDifference(objn, Ma, Mb)    (1)

where objn is the n-th object detected in the selected image, and Ma and Mb are any two models of the plurality of models.

[0061] In some embodiments, where the bounding box is rectangular or square in shape, the at least two points may be vertices diagonal to each other, as diagonal vertices are sufficient to accurately represent the dimensions, i.e., height and width, of a rectangular or square bounding box and may achieve good performance or accuracy while keeping the computing power required low by comparing only two points. In some embodiments, calculating an absolute difference between a first point and a second point of the bounding boxes may use Equation 2, wherein Equation 2 is:

AbsoluteDifference(objn, Ma, Mb) = |FPx(Ma) - FPx(Mb)| + |FPy(Ma) - FPy(Mb)| + |SPx(Ma) - SPx(Mb)| + |SPy(Ma) - SPy(Mb)|    (2)

where
objn is the n-th object detected in the selected image,
Ma and Mb are any two models of the plurality of models,
FPx is the x-coordinate of the first point of the bounding box,
FPy is the y-coordinate of the first point of the bounding box,
SPx is the x-coordinate of the second point of the bounding box, and
SPy is the y-coordinate of the second point of the bounding box.

[0062] In some embodiments, the first point FP and the second point SP may be the top left vertex and lower right vertex of the bounding box. In some embodiments, the first point FP and the second point SP may be the lower left vertex and top right vertex of the bounding box. In other embodiments, where the bounding box is rectangular, all four vertices of the bounding boxes may be used to calculate the object uncertainty score.

[0063] According to some embodiments, generating the object uncertainty score for each object may comprise calculating a cumulative distance, preferably a Euclidean distance, between the location of the associated bounding boxes.

[0064] According to some embodiments, step 132 of generating a normalised object uncertainty score for each object may further comprise normalizing the object uncertainty score based on the number of models that detected the object. In some embodiments, normalising the object uncertainty score is based on the number of models that detected the object, regardless of the number of models applied. For example, where 3 models were applied and only 2 models detected the object, the cumulative absolute difference is normalized by dividing the cumulative absolute difference by 2.

[0065] In some embodiments, normalising the object uncertainty score may use Equation 3, wherein Equation 3 is:

NormalizedUncertaintyScore_objn = UncertaintyScore_objn / Num_objn    (3)

where
objn is the n-th object detected in the selected image,
UncertaintyScore_objn is the object uncertainty score of the n-th object, and
Num_objn is the number of models that detected the n-th object.
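As a hedged illustration of Equations 1 to 3, the sketch below computes the cumulative pairwise absolute difference and normalises it by the number of models that detected the object. The data layout (a mapping from model index to a box given as [FPx, FPy, SPx, SPy]) is an assumption made for the example, not a structure prescribed by the application.

```python
# Illustrative sketch of Equations (1)-(3). A model that did not detect the
# object contributes the box (0, 0, 0, 0), as described in paragraph [0060].
from itertools import combinations
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (FPx, FPy, SPx, SPy)
MISSING: Box = (0.0, 0.0, 0.0, 0.0)

def object_uncertainty_score(detections: Dict[int, Box], num_models: int) -> float:
    """Equations (1) and (2): cumulative absolute coordinate difference over
    all pairwise combinations of the applied models."""
    score = 0.0
    for a, b in combinations(range(num_models), 2):
        box_a = detections.get(a, MISSING)
        box_b = detections.get(b, MISSING)
        score += sum(abs(ca - cb) for ca, cb in zip(box_a, box_b))
    return score

def normalized_object_uncertainty_score(detections: Dict[int, Box], num_models: int) -> float:
    """Equation (3): divide by the number of models that detected the object."""
    return object_uncertainty_score(detections, num_models) / max(len(detections), 1)
```

For the first vehicle of the worked example discussed with Fig. 3 below, normalized_object_uncertainty_score({0: (0, 50, 20, 30), 1: (0, 50, 20, 30), 2: (0, 40, 20, 30)}, 3) returns 20/3, i.e. approximately 6.66.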

[0066] According to some embodiments, method 100 may further comprise step 140 wherein a normalised image uncertainty score is generated for the selected image based on the normalised object uncertainty scores generated in step 132 for all objects within the selected image. Generating a normalised image uncertainty score for the selected image may comprise calculating an image uncertainty score for the selected image based on the generated normalised object uncertainty score for all objects within the selected image. In some embodiments, calculating an image uncertainty score for the selected image may comprise or is done by summing the normalised object uncertainty score for all objects within the selected image. In some embodiments, calculating the image uncertainty score may use Equation 4, wherein Equation 4 is:

UncertaintyScore_Image = Σ (object = 1 to n) NormalizedUncertaintyScore_object    (4)

[0067] According to some embodiments, generating the normalised image uncertainty score may comprise normalizing the image uncertainty score based on the number of models applied. For example, where 3 models were applied, the cumulative object uncertainty score is normalized by dividing the cumulative object uncertainty score by 3.

[0068] In some embodiments, normalising the image uncertainty score may use Equation 5, wherein Equation 5 is:

NormalizedUncertaintyScore_Image = UncertaintyScore_Image / Num_models    (5)

where
Image is the selected image,
UncertaintyScore_Image is the image uncertainty score of the selected image, and
Num_models is the number of models applied to the selected image.
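Equations 4 and 5 can be illustrated by the short sketch below, which sums the normalised object uncertainty scores and divides by the number of models applied; it assumes the per-object scores have already been computed as in the earlier sketch.

```python
# A minimal sketch of Equations (4) and (5).
from typing import Iterable

def normalized_image_uncertainty_score(object_scores: Iterable[float],
                                       num_models: int) -> float:
    """Sum the normalised object uncertainty scores of all objects in the
    selected image (Equation 4) and divide by the number of models applied
    (Equation 5)."""
    image_score = sum(object_scores)   # Equation (4)
    return image_score / num_models    # Equation (5)
```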

[0069] According to some embodiments, method 100 may further comprise step 148 wherein the selected image is output for human labelling where the normalised image uncertainty score is above a threshold. The threshold may be user-defined or may be automatically defined. In some embodiments, the threshold may be adjusted based on the desired performance of active learning or the desired number of images to be sent for human labelling.
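A minimal sketch of step 148 follows; the score_image() helper stands in for steps 116 to 140, and the default threshold value is illustrative only, not a value prescribed by the application.

```python
# A minimal sketch of step 148; score_image() is a hypothetical helper that
# wraps steps 116-140 and returns the normalised image uncertainty score.
from typing import Callable, Iterable, List

def select_for_labelling(images: Iterable,
                         score_image: Callable[[object], float],
                         threshold: float = 300.0) -> List:
    """Return only the images whose normalised image uncertainty score is
    above the threshold, i.e. the most informative images for human labelling."""
    return [image for image in images if score_image(image) > threshold]
```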

[0070] According to some embodiments, method 100 may further comprise step 156 wherein the outputted image is displayed on a display for human labelling. In some embodiments, after the outputted image has been labelled or annotated by a human, the labelled or annotated image may be added to the labelled database 224. The labelled database 224 would therefore include additional, more informative samples that comprise information that previously may have been missing in the original labelled database 224. The labelled database 224 may then be used to train other models or retrain the plurality of models. In other embodiments, after the outputted image has been labelled or annotated by a human, the labelled or annotated image may be added to a new database. In some embodiments, once a predefined number of labelled or annotated images have been added to a new database, the labelled or annotated images form a new labelled dataset which may be used to train additional models. This labelled dataset would comprise the most informative image set and thus would be able to produce a higher performing model as compared to models trained on a labelled dataset comprising random samples.

[0071] According to some embodiments, after the outputted image has been labelled by a human, the labelled outputted image may be added to the original labelled database used to train the plurality of models or may be used to train one or more object detectors. In some embodiments, the labelled outputted images may be stored as a separate dataset for training one or more object detectors. Computer-implemented method 100 may be used for active learning in many fields that require object detection, including for mobile robots or autonomous vehicles to detect obstacles, pedestrians, vehicles, and other nearby objects, for traffic analysis using surveillance cameras that detect cars, people, or objects, and for indoor surveillance.

[0072] Fig. 2 is a functional block diagram of a system 200 for selecting one or more unlabelled images for human labelling in active learning, in accordance with embodiments of the present disclosure. Elements/modules of the system 200 may be, for example, performed by one or more processors. The system 200 may comprise a training module 204, an object detection module 208, a data association module 212, an uncertainty estimation module 216, and an output module 218. In some embodiments, the training module 204 may receive labelled or annotated images 220 from a labelled database 224 comprising one or more labelled datasets. Labelled database 224 may be stored on a data storage device and may be received or retrieved by training module 204 through a wired or wireless connection. In some embodiments, training module 204 builds a plurality of models 228 from scratch using the same labelled dataset of images 220 obtained from labelled database 224. Preferably, the plurality of models 228 is trained on a single labelled dataset of images 220 with different and/or random weight initializations to increase the performance of the model ensemble. An example of a method of training the plurality of models on a single labelled dataset with random initialisations may be found in section 2.4 of a paper titled "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles" by Lakshminarayanan et al. for the 31st Conference on Neural Information Processing Systems (NIPS 2017). Preferably, a minimum of three models 228a-228c may be built by training module 204. The plurality of models 228 is trained to detect or predict one or more objects within an image and generate a set of bounding boxes, wherein each bounding box represents a spatial location of a detected or predicted object. The plurality of models may be any known object detection model or algorithm, including convolutional neural networks (CNN), single-shot detectors or support vector machines. According to some embodiments, object detection module 208 may communicate with the training module 204 and may receive as input one or more unlabelled images 232 from an unlabelled database 236, as well as the plurality of models 228 trained by training module 204. Unlabelled database 236 may be stored on a data storage device and may be received or retrieved by object detection module 208 through a wired or wireless connection.

[0073] According to some embodiments, object detection module 208 may apply the plurality of models 228 to an image selected from the one or more unlabelled images 232 to generate a plurality of sets of bounding boxes 236, wherein each bounding box represents a location of an object in the selected image detected by the model. For example, model 228a applied to the selected image would generate a set of bounding boxes 236a, model 228b applied to the selected image would generate a set of bounding boxes 236b, and model 228c applied to the selected image would generate a set of bounding boxes 236c.

[0074] According to some embodiments, data association module 212 may communicate with the object detection module 208 and may receive as input N sets of bounding boxes, wherein N corresponds to the number of models 228. Data association module 212 may associate bounding boxes between sets of bounding boxes, wherein the associated bounding boxes each represent a location of the same object within the selected image. In some embodiments, data association module 212 may assign an identifier to each bounding box generated to generate bounding boxes with identifiers 240, wherein each identifier is associated with an object within the selected image. The identifier may be a tracking identifier assigned using any known tracking algorithm. In some embodiments, a centroid-based tracker may be used to assign the identifier. In some embodiments, a multiple object tracking algorithm may be used. In some embodiments, neural network-based object trackers such as a convolutional neural network may be used.

[0075] According to some embodiments, uncertainty estimation module 216 may communicate with data association module 212 and may receive as input N sets of bounding boxes with identifiers 240. Uncertainty estimation module 216 may determine a normalised object uncertainty score 244 for each object within the selected image based on the associated bounding boxes with the same identifier associated with the object. The normalised object uncertainty score 244 may be determined using the method described in step 132 of method 100.

[0076] According to some embodiments, uncertainty estimation module 216 may further determine a normalised image uncertainty score 248 for the selected image based on the normalised object uncertainty scores 244 generated for all objects within the selected image. The normalised image uncertainty score 248 may be determined using the method described in step 140 of method 100.

[0077] According to some embodiments, output module 218 may communicate with uncertainty estimation module 216 and may receive the normalised image uncertainty score 248 as input. Output module 218 may determine whether the normalised image uncertainty score 248 of the selected image is above a threshold and may output the selected image for human labelling where the normalised image uncertainty score 248 is above the threshold.

[0078] Fig. 3 is a flowchart of examples of output of object detection module 208, data association module 212 and uncertainty estimation module 216 based on a selected image 332, in accordance with embodiments of the present disclosure. As illustrated in Fig. 3, object detection module 208 applies a first model 328a, a second model 328b, and a third model 328c to selected image 332 to generate multiple sets of bounding boxes 336a-336c for the single selected image 332. First model 328a identified 2 objects (i.e., a first vehicle and a second vehicle) and the set of bounding boxes 336a generated comprises 2 bounding boxes. Second model 328b identified 3 objects (i.e., a first vehicle, a second vehicle, and a pedestrian) and the set of bounding boxes 336b generated comprises 3 bounding boxes. Third model 328c identified 4 objects (i.e., a first vehicle, a second vehicle, a first pedestrian, and a second pedestrian) and the set of bounding boxes 336c generated comprises 4 bounding boxes.

[0079] As illustrated in Fig. 3, data association module 212 assigns identifiers to each bounding box to generate multiple sets of bounding boxes with identifiers 340a-340c. As illustrated in Fig. 3, the identifiers assigned may be numbers. For example, a first vehicle on the left of the image may be assigned the identifier of 0, a second vehicle on the right of the image may be assigned the identifier of 1, a first pedestrian on the left of the image may be assigned the identifier of 3, and a second pedestrian on the right of the image may be assigned the identifier of 4. As illustrated in Fig. 3, the same identifier is assigned to bounding boxes representing the same object of the image.

[0080] As illustrated in Fig. 3, uncertainty estimation module 216 generates an object uncertainty score 344 for each object. For example, the first vehicle assigned the identifier 0 was identified by all three models 328a-328c applied. The coordinates of the top left and lower right vertices of each bounding box may be expressed in a combined matrix as [TLx, TLy, LRx, LRy], where TLx is the x-coordinate of the top left vertex, TLy is the y-coordinate of the top left vertex, LRx is the x-coordinate of the lower right vertex, and LRy is the y-coordinate of the lower right vertex.

[0081] For example, the associated bounding box of the object with identifier 0 generated by first model 328a may be expressed as [0, 50, 20, 30], the associated bounding box of the object with identifier 0 generated by second model 328b may be expressed as [0, 50, 20, 30], and the associated bounding box of the object with identifier 0 generated by third model 328c may be expressed as [0, 40, 20, 30]. Using Equation 2, the absolute difference for each coordinate is calculated as [0, 20, 0, 0], and adding up the absolute differences would generate an object uncertainty score of 20 for the object assigned the identifier 0. Using Equation 3, the object uncertainty score may be normalized by dividing 20 by 3, as the object assigned the identifier 0 was identified by all three models 328a-328c applied, to obtain a normalized object uncertainty score of 6.66.

[0082] For example, the associated bounding box of the object with identifier 3 generated by second model 328b may be expressed as [100, 10, 130, 60], and the associated bounding box of the object with identifier 3 generated by third model 328c may be expressed as [100, 20, 100, 50]. As the object with identifier 3 was not detected by first model 328a, the combined matrix [0, 0, 0, 0] may be used when calculating the absolute difference between associated bounding boxes. Using Equation 2, the absolute difference for each coordinate is calculated as [200, 40, 260, 120], and adding up the absolute differences would generate an object uncertainty score of 620 for the object assigned the identifier 3. Using Equation 3, the object uncertainty score may be normalized by dividing 620 by 2, as the object assigned the identifier 3 was only identified by the second model 328b and third model 328c, to obtain a normalized object uncertainty score of 310.

[0083] For example, the associated bounding box of the object with identifier 4 generated by third model 328c may be expressed as [150, 20, 150, 70]. As the object with identifier 4 was not detected by first model 328a or second model 328b, the combined matrix [0, 0, 0, 0] may be used when calculating the absolute difference between associated bounding boxes. Using Equation 2, the absolute difference for each coordinate is calculated as [300, 40, 300, 140], and adding up the absolute differences would generate an object uncertainty score of 780 for the object assigned the identifier 4. Using Equation 3, the object uncertainty score may be normalized by dividing 780 by 1, as the object assigned the identifier 4 was only identified by the third model 328c, to obtain a normalized object uncertainty score of 780.

[0084] As illustrated in Fig. 3, a normalised image uncertainty score 348 is calculated once a normalised object uncertainty score 344 has been calculated for each object. Applying Equation 4, the image uncertainty score 348 for image 332 is 6.66 + 310 + 780 = 1096.66. Using Equation 5, the image uncertainty score may be normalised by dividing 1096.66 by 3, as three models 328a-328c were applied, to obtain a normalised image uncertainty score of 365.553. If the threshold is 300, image 332 would be output by output module 218 for human labelling. If the threshold is 400, image 332 would not be output by output module 218 for human labelling.
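A corresponding sketch of the image-level calculation in paragraph [0084] is given below; it assumes that Equation 4 simply sums the normalised object uncertainty scores and that Equation 5 divides by the number of models applied.

```python
# Sketch of the normalised image uncertainty score of paragraph [0084].
object_scores = [6.66, 310, 780]      # normalised object uncertainty scores
num_models = 3                        # models 328a-328c applied to the image

image_score = sum(object_scores)                    # Equation 4: 1096.66
normalised_image_score = image_score / num_models   # Equation 5: ~365.55

threshold = 300
select_for_labelling = normalised_image_score > threshold
print(normalised_image_score, select_for_labelling)  # ~365.55 True
```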

[0085] Fig. 4 is a pictorial representation of a model that may be trained by training module 204 on a labelled dataset stored on labelled database 224, in accordance with embodiments of the present disclosure. It is emphasized that model 400 described herein is only an example of a model of the plurality of models 228 that may be trained by training module 204, and models for object detection with other architectures may be employed. Examples of other object detection models that may be employed include the Fast Region-Based Convolutional Network (Fast R-CNN), the Faster Region-Based Convolutional Network (Faster R-CNN), the Histogram of Oriented Gradients (HOG) detector, and Region-based Convolutional Neural Networks (R-CNN).

[0086] In some embodiments, model 400 may be trained on a labelled dataset comprising labelled images capturing one or more objects. Examples of the objects may include pedestrians, vehicles and animals. An example of a labelled dataset that can be used to train model 400 is the Microsoft COCO dataset described at https://cocodataset.org/#home.

[0087] According to some embodiments, model 400 may comprise at least one neural network, which may comprise input nodes, hidden nodes and output nodes arranged in layers. In some embodiments, the neural network may be a convolutional neural network, or CNN. A convolutional neural network (CNN) is a multi-layered feed-forward neural network made by stacking many hidden layers on top of each other in sequence. The sequential design may allow CNNs to learn hierarchical features. The hidden layers are typically convolutional layers followed by activation layers, some of them followed by pooling layers. The CNN may be configured to identify patterns in data. A convolutional layer may include convolutional kernels that are used to look for patterns across the input data. A convolutional kernel may return a large positive value for a portion of the input data that matches the kernel's pattern and may return a smaller value for another portion of the input data that does not match the kernel's pattern. A CNN is preferred as it may be able to extract informative features from the training data without the need for manual processing of the training data. A CNN may produce accurate results where large unstructured data is involved, such as in image classification, speech recognition and natural language processing. A CNN is also computationally efficient, as it is able to assemble patterns of increasing complexity using the relatively small kernels in each hidden layer.
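Purely as an illustration of the stacked convolution, activation and pooling layers described above, a minimal CNN might be sketched as follows (using the PyTorch library as an assumption; this toy network is not model 400):

```python
import torch
import torch.nn as nn

# Minimal illustrative CNN: convolutional layers followed by activation layers,
# some followed by pooling layers, as described in paragraph [0087].
class TinyCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution: kernels look for patterns
            nn.ReLU(),                                   # activation layer
            nn.MaxPool2d(2),                             # pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.mean(dim=(2, 3))  # global average pooling over the spatial dimensions
        return self.classifier(x)

# Example: a batch of one 3-channel 224x224 image.
print(TinyCNN()(torch.zeros(1, 3, 224, 224)).shape)  # torch.Size([1, 2])
```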

[0088] According to some embodiments, model 400 may comprise one CNN for both feature extraction and object detection. In some embodiments, model 400 may comprise two CNNs: a first CNN for feature extraction and a second CNN for object detection. Preferably, model 400 may comprise a CNN 404 for feature extraction and an algorithm for object detection 408 to perform bounding box regression and classification. Model 400 with CNN 404 and algorithm for object detection 408 is preferred as object detection can be carried out on an input image quickly and accurately on a computationally limited device. The algorithm for object detection 408 may be a one-stage or a two-stage object detector. A one-stage object detector takes only one shot to detect multiple objects in an image using multi-box, as it performs both classification and localization in one single network, and is preferable due to end-to-end learning and model simplicity.

[0089] According to some embodiments of the present disclosure, CNN 404 receives an unlabelled image 412 as an input and applies one or more convolutional filters or layers 416a-416e to unlabelled image 412 to generate one or more feature maps. There may be any number of convolutional filters or layers 416 based on the application or requirement of a user, although only 5 are depicted in Fig. 4. Preferably, CNN 404 for feature extraction may have a MobileNet architecture. MobileNet is an architecture of a convolutional neural network for object detection that was optimized primarily for speed. The main building blocks of MobileNet are depthwise separable convolutions, which factorize or separate the standard convolution filter into two distinct operations: (i) a first operation where a separate convolution kernel is applied to each input channel ("depthwise convolution"); and (ii) a second operation where a pointwise (1 x 1) convolution is used to combine the information of the first operation ("pointwise convolution"). On the other hand, standard convolution filters perform the channel-wise and spatial-wise computation in a single step. The separation or factorization of the standard convolution into two distinct operations has lower parameter and computational costs than a standard convolution due to fewer mult-adds (multiplication and addition operations). In some embodiments, the CNN 404 may comprise a first standard convolution layer, followed by a plurality of depthwise and pointwise convolution layers, an average pooling layer, a fully connected layer, and a softmax classifier. Each layer in the CNN 404 may be followed by batch normalization (BN) and a Rectified Linear Activation Function (ReLU) nonlinearity, with the exception of the final fully connected layer, which has no nonlinearity and feeds into a softmax layer for classification. Additional information on the MobileNet architecture may be found in "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" by Howard et al.
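As an illustration only, one depthwise separable convolution block of the kind described above might be written as follows (again using PyTorch as an assumption; MobileNet itself is defined in the cited paper by Howard et al.):

```python
import torch.nn as nn

# Sketch of one depthwise separable convolution block as described in paragraph
# [0089]: a per-channel (depthwise) 3x3 convolution followed by a 1x1 pointwise
# convolution, each followed by batch normalization and ReLU.
def depthwise_separable_block(in_channels, out_channels, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=stride,
                  padding=1, groups=in_channels, bias=False),        # depthwise convolution
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),  # pointwise convolution
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )

# Example: one block that halves the spatial resolution and doubles the channels.
block = depthwise_separable_block(32, 64, stride=2)
```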

[0090] According to some embodiments, the algorithm for object detection 408 is a Single Shot Detector (SSD), which is a model that carries out object detection based on feature maps generated by the one or more convolutional filters or layers 416a-416e of CNN 404. SSD uses CNN 404 as a base network and applies additional convolution filters, or convolution kernels, to the feature maps output by CNN 404 to perform classification and localization regression (i.e., bounding box regression). In some embodiments, the convolution filters or convolution kernels may be layers 420a-420e that output the location of objects as bounding boxes and the object classification. There may be any number of convolutional filters or layers 420 based on the application or requirement of a user, although only 5 are depicted in Fig. 4. In some embodiments, the convolution filter or convolution kernel layers 420a-420e may comprise a first set of layers to output location and a second set of layers to output classification (not shown). The use of SSD is preferred as it does not require the training of an additional CNN. Furthermore, the use of SSD improves the accuracy of MobileNet while maintaining the speed of MobileNet. SSD uses different input feature maps to perform classification and localization regression, wherein some of the input feature maps are output from layers 416a-416e of the base network CNN 404. The different feature maps used by SSD are generally of different sizes to leverage both high- and low-level information. In general, where CNN 404 is combined with algorithm for object detection 408, the last few layers of CNN 404 are omitted, and the output of CNN 404 is the feature maps on which algorithm for object detection 408 or SSD bases its detections. For example, where MobileNet is combined with SSD, the average pooling layer, fully connected layer and softmax classifier of the MobileNet are omitted such that the output of MobileNet is feature maps to be used by SSD. The SSD adds more convolution layers in which the intermediate tensors are kept, such that a stack of feature maps with a variety of sizes is generated for detections 428. For example, where a feature layer has a size of (a x b) with c channels, a convolution (e.g., 3 x 3) is applied on the (a x b x c) feature layer to generate k bounding boxes or detections per class, wherein each bounding box has a probability score assigned for each location of the objects identified. Non-maxima suppression 432 is then used to ensure that only one bounding box is generated around an object by suppressing all bounding boxes with non-maximum probability values. In particular, bounding boxes with a probability less than a certain threshold are discarded, and the remaining bounding boxes with a probability higher than the threshold are retained. Of the remaining bounding boxes, for each object the bounding box with the greatest probability is retained and the other bounding boxes are suppressed. More information on the MobileNet-SSD architecture may be found at https://iq.opengenus.org/ssd-mobilenet-vl-architecture/.
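Non-maxima suppression as outlined above may be sketched as follows. The sketch assumes the common formulation in which bounding boxes are taken to belong to the same object when their overlap (intersection over union) exceeds a threshold; this is an interpretation for illustration rather than a detail specified in this disclosure, and the thresholds are placeholders.

```python
import numpy as np

# Sketch of non-maxima suppression (paragraph [0090]): discard boxes below a
# probability threshold, then for each object keep only the highest-probability
# box, suppressing the overlapping boxes. Boxes are [x1, y1, x2, y2].
def non_maxima_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    keep = scores >= score_threshold              # discard low-probability boxes
    boxes, scores = boxes[keep], scores[keep]
    order = np.argsort(scores)[::-1]              # highest probability first
    selected = []
    while order.size > 0:
        i = order[0]
        selected.append(boxes[i])
        rest = order[1:]
        # Overlap (IoU) of the remaining boxes with the selected box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[rest, 2] - boxes[rest, 0])
                 * (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_threshold]        # suppress overlapping boxes
    return selected

# Example: two overlapping detections of the same object; only the box with the
# greater probability (0.9) is retained.
print(non_maxima_suppression([[0, 0, 10, 10], [1, 1, 11, 11]], [0.9, 0.6]))
```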

[0091] Fig. 5 is a schematic illustration of the convolutional layers and parameters of model 400 having a MobileNet-SSD architecture, in accordance with embodiments of the present disclosure. It is noted that Fig. 5 only illustrates an example of the convolutional layers and parameters for model 400 having a MobileNet-SSD architecture, and models with other architectures, layers, and/or parameters may be employed. Each row in the table in Fig. 5 represents a layer of the MobileNet-SSD architecture. The parameters of each layer include the type of convolutional layer and stride (first column), the shape of the filter applied (second column), and the input size (third column). As illustrated in Fig. 5, model 400 may comprise a first standard convolution layer 504 (indicated by "conv"). The stride is indicated by "s", wherein "s1" represents a stride of 1 and "s2" represents a stride of 2. The plurality of convolutions 508 after first standard convolution layer 504 are separated into depthwise convolutional layers (indicated by "conv dw", with filter shape indicated by "dw") and pointwise convolutional layers (indicated by "conv"). Fig. 5 illustrates 13 depthwise separable convolutions after first standard convolution layer 504, although the number of depthwise separable convolutions may vary depending on the application and requirements. The subsequent layers 512 are convolutional layers or filters of the SSD for classification and localization regression.

[0092] Prior to training model 400, the initial dataset from labelled database 224 may be split into a training set for training model 400 and a validation set for validating model 400. For example, 70% of the initial dataset may be assigned to the training set and the remaining 30% to the validation set. In some embodiments, the same training set and validation set may be used to train each model of the plurality of models 228.
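As an illustration only, such a 70/30 split might be performed as follows, where integer indices stand in for the labelled images; the actual splitting utility used by training module 204 is not specified here.

```python
import random

# Illustrative 70/30 split into a training set and a validation set
# (paragraph [0092]); the integer indices stand in for labelled images.
indices = list(range(1000))
random.seed(0)                        # fixed seed so the split is reproducible
random.shuffle(indices)
split_point = int(0.7 * len(indices))
training_set = indices[:split_point]
validation_set = indices[split_point:]
print(len(training_set), len(validation_set))  # 700 300
```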

[0093] According to some embodiments, training each model 400 may be carried out using the training set to adjust the parameters, i.e., using the labelled training set as positive examples for model 400 to generate predictions on. When training each model 400, the activation function may be set as ReLU, and the weights may be randomly initialized. The different and/or random weight initializations provide variations to each model 400 trained, such that the plurality of models 228a-228c generated by training module 204 have varied hyperparameters. The learning rate may be 0.001 and may be adjusted or varied during the training process over training iterations. The batch size may be set at 4, and the number of epochs or training iterations may be set at 10,000. The loss function representing the magnitude of error of model 400 may be focal loss as disclosed in "Focal Loss for Dense Object Detection" by Lin et al. In some embodiments, the optimization algorithm used may be the Adam (adaptive moment estimation) optimization algorithm as disclosed in "Adam: A Method for Stochastic Optimization" by Kingma & Ba to minimize the loss function and train the network. In some embodiments, the optimization algorithm may be the well-known stochastic gradient descent. It is emphasized that the training parameters disclosed are examples of training parameters, and that different training parameters may be used according to the requirements for the model. According to some embodiments, each model 400 may be validated using the validation set to ensure that the parameters of the trained model 400 are satisfactory to generate an accurate prediction. Such parameters include the weights and biases of the neurons of model 400. If the predictions generated are not accurate, training of model 400 may be continued using the training set.
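A hedged sketch of this training configuration is given below (in PyTorch, as an assumption). The two-layer placeholder network stands in for model 400 and a standard cross-entropy loss stands in for the focal loss of Lin et al., purely to keep the example self-contained.

```python
import torch
import torch.nn as nn

# Hedged sketch of the training hyperparameters in paragraph [0093]. The tiny
# placeholder network and cross-entropy loss are illustrative stand-ins only;
# model 400 and the focal loss are not implemented here.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))  # weights randomly initialized
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer, learning rate 0.001
criterion = nn.CrossEntropyLoss()                           # placeholder for focal loss
batch_size, num_iterations = 4, 10_000

# One illustrative training step on random placeholder data.
inputs = torch.randn(batch_size, 8)
targets = torch.randint(0, 2, (batch_size,))
optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()
```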

[0094] Fig. 6 is a schematic illustration of a computer system 600 within which a set of instructions, when executed, may cause one or more processors 608 of the computer system to perform one or more of the methods described herein, in accordance with embodiments of the present disclosure. It is noted that computer system 600 described herein is only an example of a computer system that may be employed, and computer systems with other hardware or software configurations may be employed. In some embodiments, computer system 600 may be connected to one or more data storage devices 616, wherein the connection to the one or more data storage devices 616 may be wired or wireless. The data storage device 616 may include a plurality of data storage devices. The storage device 616 may include, for example, long-term storage (e.g., a hard drive, a tape storage device, flash memory), short-term storage (e.g., a random-access memory, a graphics memory), and/or any other type of computer-readable storage. The modules and devices described herein can, for example, utilize the one or more processors to execute computer-executable instructions and/or include a processor to execute computer-executable instructions (e.g., an encryption processing unit, a field programmable gate array processing unit).

[0095] In some embodiments, computer system 600 may comprise a server computer, a laptop, a personal computer, a desktop computer, or any machine capable of executing a set of instructions that specify actions to be taken by the computer system. Computer system 600 may comprise one or more processors 608 and one or more memories 628 which communicate with each other via a bus 636. Computer system 600 may further comprise a network interface device 644 which allows computer system 600 to communicate with a network 652. In some embodiments, computer system 600 may further comprise a disk drive unit 660 which may include a machine-readable storage medium 668 on which is stored one or more sets of instructions 676 embodying one or more of the methods described herein. The one or more sets of instructions 676 may also reside in the one or more processors 608 or the one or more memories 628. In some embodiments, the one or more sets of instructions 676 may be received as a data carrier signal received by computer system 600. In some embodiments, computer system 600 may comprise an I/O interface 684 for communication with another information processing system, for receiving information through an input device 692, or for outputting information to an output device 698. In some embodiments, the input device 692 may be any input device that may be controlled by a human, such as a mouse, a keyboard or a touchscreen. In some embodiments, the output device 698 may include a display.

[0096] Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the embodiments of the present invention are intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.