Title:
METHODS AND SYSTEMS FOR VERIFYING AND PREDICTING THE PERFORMANCE OF MACHINE LEARNING ALGORITHMS
Document Type and Number:
WIPO Patent Application WO/2022/023697
Kind Code:
A1
Abstract:
A computer-implemented method comprising: receiving data to be input to a machine learning algorithm, the output from the machine learning algorithm to be used in carrying out a task; determining a degree of dissimilarity between the received data and training data that has been used to train the machine learning algorithm; determining, based on the degree of dissimilarity, whether the machine learning algorithm is likely to achieve a required level of performance when processing the received data; and in the event that it is determined that the machine learning algorithm is unlikely to reach the required level of performance, modifying the way in which the task is carried out.

Inventors:
HOND DARRYL (GB)
ASGARI HAMID (GB)
JEFFERY DANIEL (GB)
Application Number:
PCT/GB2021/051625
Publication Date:
February 03, 2022
Filing Date:
June 25, 2021
Assignee:
THALES HOLDINGS UK PLC (GB)
International Classes:
G06N3/04; B60W60/00; G06N3/08
Domestic Patent References:
WO2019214309A1, 2019-11-14
Other References:
YUCHI TIAN ET AL.: "DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars", arXiv.org, Cornell University Library, 29 August 2017, XP080968652
"DeepXplore: Automated Whitebox Testing of Deep Learning Systems", Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17), New York, NY, USA, 2017, pages 1-18, XP055572312, ISBN: 978-1-4503-5085-3, DOI: 10.1145/3132747.3132785
ASLANSEFAT, K. ET AL.: "SafeML: Safety Monitoring of Machine Learning Classifiers through Statistical Difference Measure", 2020
MA, L., JUEFEI-XU, F., ZHANG, F., SUN, J., XUE, M., LI, B., ZHAO, J.: "DeepGauge: Multi-granularity testing criteria for deep learning systems", Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, September 2018, pages 120-131, XP033720314, DOI: 10.1145/3238147.3238202
K. SIMONYAN, A. ZISSERMAN: "Very Deep Convolutional Networks for Large-Scale Image Recognition", International Conference on Learning Representations, 2015
NOA GARCIA, GEORGE VOGIATZIS: "Learning Non-Metric Visual Similarity for Image Retrieval", Image and Vision Computing Journal, vol. 82, Elsevier, September 2017
RICHARD ZHANG ET AL.: "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
NOA GARCIA, GEORGE VOGIATZIS: "Learning Non-Metric Visual Similarity for Image Retrieval", Image and Vision Computing, April 2019
B. E. ROGOWITZ ET AL.: "Perceptual Image Similarity Experiments", Proc. IS&T/SPIE Conf. Human Vision and Electronic Imaging III, July 1998, pages 576-590
Attorney, Agent or Firm:
GRANT, David (GB)
Claims:
CLAIMS

1. A computer-implemented method comprising: receiving data to be input to a machine learning algorithm, the output from the machine learning algorithm to be used in carrying out a task; determining a degree of dissimilarity between the received data and training data that has been used to train the machine learning algorithm; determining, based on the degree of dissimilarity, whether the machine learning algorithm is likely to achieve a required level of performance when processing the received data; and in the event that it is determined that the machine learning algorithm is unlikely to reach the required level of performance, modifying the way in which the task is carried out.

2. A computer-implemented method according to claim 1, wherein the step of determining whether the machine learning algorithm is likely to achieve the required level of performance comprises referencing a predetermined relationship between the performance of the machine learning algorithm when exposed to test data and the degree of dissimilarity between the training data and the test data.

3. A computer-implemented method according to claim 2, comprising inferring a likely level of performance of the machine learning algorithm, based on the predetermined relationship.

4. A computer-implemented method according to any one of the preceding claims, comprising defining a threshold degree of dissimilarity between the received data and the training data, wherein in the event the degree of dissimilarity between the received data and training data is above the threshold, it is determined that the machine learning algorithm is unlikely to reach the required level of performance.

5. A computer-implemented method according to claim 4 as dependent on claim 2 or 3, wherein the threshold is defined using the predetermined relationship.

6. A computer-implemented method according to any one of the preceding claims, wherein the machine learning algorithm comprises a neural network, and the degree of dissimilarity between the received data and the training data is determined by comparing the activation of neurons in the neural network when using the training data as input and when using the received data as input.

7. A computer-implemented method according to claim 6, wherein the received data comprises one or more individual data items and determining the degree of dissimilarity comprises: determining, for one or more neurons in the neural network, one or more first activation values for the neuron, wherein each first activation value for a neuron comprises an activation value of the neuron as observed when executing the machine learning algorithm with a respective one of the data items as input; and determining, for each of the one or more neurons, second activation values for the respective neuron, the second activation values comprising activation values of the neuron as observed when using the training data as input for the machine learning algorithm.

8. A computer-implemented method according to claim 7, wherein the received data comprises a single data item, and determining the degree of dissimilarity further comprises: for each neuron, forming a comparison between the first activation value for the respective neuron with the second activation values for the respective neuron.

9. A computer-implemented method according to claim 8, comprising: defining, for each of the one or more neurons, a measure of the distribution of second activation values; and comparing the first activation value to the measure of the distribution of second activation values.

10. A computer-implemented method according to claim 9, wherein the measure of the distribution of second activation values comprises a range of second activation values.

11. A computer-implemented method according to claim 10, wherein the upper limit of the range is defined by the maximum one of the second activation values for the respective neuron and/or the lower limit of the range is defined by the minimum one of the second activation values for the respective neurons.

12. A computer-implemented method according to claim 10 or 11 wherein comparing the first activation value of each respective neuron to the range of activation values for the neuron comprises determining whether or not the first activation value falls within the range of activation values.

13. A computer-implemented method according to claim 12, wherein the degree of dissimilarity is based on the number of neurons for which the first activation value does not fall within the range of activation values for the neuron.

14. A computer-implemented method according to claim 12, wherein determining the degree of dissimilarity comprises: for each one of the neurons for which the first activation value falls outside the range of activation values, determining a difference between the first activation value and whichever end of the range of second activation values the first activation value is closest to.

15. A computer-implemented method according to claim 7, wherein the received data comprises a plurality of individual data items and determining the degree of dissimilarity further comprises: defining, for each of the one or more neurons, a measure of the distribution of second activation values; and comparing the first activation values to the measure of the distribution of second activation values.

16. A computer-implemented method according to claim 15, wherein comparing the first activation values to the measure of the distribution of second activation values comprises: defining, for each of the one or more neurons, a measure of the distribution of the first activation values; and comparing the measure of the distribution of the first activation values to the measure of the distribution of second activation values.

17. A computer-implemented method according to any one of the preceding claims, wherein determining the degree of dissimilarity between the received data and the training data includes comparing the activation of neurons in a reference neural network when using the training data as input to the reference neural network and when using the received data as input to the reference neural network.

18. A computer-implemented method according to claim 17, wherein the machine learning algorithm comprises a neural network and the method comprises: determining a first measure of the dissimilarity by comparing the activation of neurons in the reference neural network when using the training data as input to the reference neural network and when using the received data as input to the reference neural network; and determining a second measure of the degree of dissimilarity by comparing the activation of neurons in the machine learning algorithm network when using the training data as input to the machine learning algorithm and when using the received data as input to the machine learning algorithm.

19. A computer-implemented method according to claim 18, wherein determining whether the machine learning algorithm is likely to achieve the required level of performance comprises determining whether the value of the first measure of the degree of dissimilarity lies within a predetermined range of values, and whether the value of the second measure of the degree of dissimilarity also lies within a predetermined range of values.

20. A computer-implemented method according to claim 18 or 19, wherein determining whether the machine learning algorithm is likely to achieve the required level of performance comprises referencing a predetermined relationship between the performance of the machine learning algorithm when exposed to test data and the degree of dissimilarity between the training data and the test data, the relationship being used to predict the performance for the first measure of the degree of dissimilarity and the second measure of the degree of dissimilarity.

21. A computer-implemented method according to any one of the preceding claims, wherein the training data and the received data each comprise one or more images.

22. A computer-implemented method according to any one of the preceding claims, wherein the machine learning algorithm comprises a classifier algorithm.

23. A computer-implemented method according to any one of the preceding claims, wherein the task comprises controlling one or more functions of an autonomous vehicle.

24. A method of using a neural network to determine a degree of dissimilarity between an individual test data item and a set of two or more other data items, the neural network being configured to receive as input data items and to execute, for each data item, a machine learning task to produce an output for the data item, the method comprising: executing the machine learning task using the test data item as input to the neural network; determining, for one or more neurons in the neural network, a respective first activation value for the neuron, wherein the first activation value comprises an activation value of the neuron as observed when executing the machine learning task with the test data item as input; for each of the one or more neurons, comparing the respective first activation value of the neuron with second activation values for the neuron, the second activation values comprising activation values of the neuron as observed when using respective ones of the two or more other data items as input for the machine learning task; and based on the comparison, defining a degree of dissimilarity between the individual test data item and the set of two or more other data items.

25. A method according to claim 24, comprising: defining, for each of the one or more neurons, a measure of the distribution of second activation values; and comparing the first activation value to the measure of the distribution of second activation values.

26. A method according to claim 25, wherein the measure of the distribution of second activation values comprises a range of second activation values.

27. A method according to claim 26, wherein the upper limit of the range is defined by the maximum one of the second activation values for the respective neuron and/or the lower limit of the range is defined by the minimum one of the second activation values for the respective neurons.

28. A method according to claim 26 or 27 wherein comparing the first activation value of each respective neuron to the range of activation values for the neuron comprises determining whether or not the first activation value falls within the range of activation values.

29. A method according to claim 28, wherein the degree of dissimilarity is based on the number of neurons for which the first activation value does not fall within the range of activation values for the neuron.

30. A method according to claim 29, wherein determining the degree of dissimilarity comprises: for each one of the neurons for which the first activation value falls outside the range of activation values, determining a difference between the first activation value and whichever end of the range of second activation values the first activation value is closest to.

31. A method according to any one of claims 24 to 30, wherein the set of two or more other data items comprises data items used to train the neural network.

32. A computer-implemented method for training a machine learning algorithm, the method comprising: receiving test data for testing a performance of the machine learning algorithm; determining a degree of dissimilarity between the test data and training data on which the machine learning algorithm has been trained; specifying a performance requirement for the machine learning algorithm, based on the degree of dissimilarity between the test data and the training data; determining the performance of the machine learning algorithm when processing the test data; and in the event that the determined performance of the machine learning algorithm does not meet the performance requirement, subjecting the machine learning algorithm to further training and/or modifying the machine learning algorithm.

33. A computer-implemented method according to claim 32, wherein specifying a performance requirement for the machine learning algorithm comprises defining a trend in performance as a function of dissimilarity between the test data and training data.

34. A computer-implemented method according to claim 33, comprising defining one or more requirements for the gradient of the trend.

35. A computer-implemented method according to any one of claims 32 to 34, wherein subjecting the machine learning algorithm to further training comprises supplementing the training data with further training data; wherein the further training data comprises data items that are dissimilar to data items included in the training data.

36. A computer-implemented method according to any one of claims 32 to 35, wherein the machine learning algorithm comprises a neural network and specifying a performance requirement for the machine learning algorithm comprises: specifying a performance level for the machine learning algorithm with reference to a first measure of dissimilarity, wherein the first measure of dissimilarity comprises a measure of the dissimilarity between the training data and the test data as obtained by comparing the activation of neurons in the machine learning algorithm when using the training data as input and when using the test data as input to the machine learning algorithm.

37. A computer-implemented method according to claim 36, wherein specifying a performance requirement for the machine learning algorithm further comprises: specifying a performance level for the machine learning algorithm with reference to a second measure of dissimilarity, wherein the second measure of dissimilarity comprises a measure of the dissimilarity between the training data and the test data as obtained by comparing the activation of neurons in a reference neural network when using the training data as input and when using the test data as input to the reference neural network.

38. A computer-implemented method according to claim 37, comprising determining a mapping between the first measure of the degree of dissimilarity and the second measure of the degree of dissimilarity; and using the mapping to define a performance requirement in terms of the value of the first measure of the degree of dissimilarity.

39. A computer-implemented method according to any one of claims 17 to 20 and 37 to 38 wherein the reference network comprises a neural network that has been pre-trained on a training data set having a greater number of data items than the training data used to train the machine learning algorithm.

40. A computer-implemented method according to any one of claims 17 to 20 and 37 to 39, wherein the reference network is used to provide a measure of semantic differences between the received data or test data and the training data used to train the machine learning algorithm.

41. A computer implemented method for managing how a task is to be carried out, the method comprising: receiving test data for testing the performance of a machine learning algorithm; determining a degree of dissimilarity between the test data and training data on which the machine learning algorithm has been trained; specifying a performance requirement for the machine learning algorithm, based on the degree of dissimilarity between the test data and the training data; determining the performance of the machine learning algorithm when processing the test data; and in the event that the determined performance of the machine learning algorithm does not meet the performance requirement, subjecting the machine learning algorithm to further training and/or modifying the machine learning algorithm to meet the performance requirement; receiving new data to be input to the machine learning algorithm, wherein the output from the machine learning algorithm when processing the new data is to be used in carrying out the task; determining a degree of dissimilarity between the new data and the training data; determining, based on the degree of dissimilarity between the new data and the training data, whether the machine learning algorithm is likely to meet the performance requirement when processing the new data; and in the event that it is determined that the machine learning algorithm is unlikely to meet the performance requirement when processing the new data, modifying the way in which the task is carried out.

42. A computer readable medium comprising computer executable instructions that when executed by a computer will cause the computer to carry out a method according to any one of the preceding claims.

43. A computer system configured to carry out a method according to any one of claims 1 to 41.

44. An autonomous vehicle comprising a computer system according to claim 43.

Description:
Methods and systems for verifying and predicting the performance of machine learning algorithms

FIELD

Embodiments described herein relate to methods and systems for verifying and predicting the performance of machine learning algorithms.

BACKGROUND

AI-based systems are predicted to play a significant role in global technological progress for the foreseeable future. Among other applications, these systems provide new opportunities to expand the capabilities of Unmanned Surface Vehicles (USVs), Unmanned Aerial Vehicles (UAVs), Connected Autonomous Vehicles (CAVs), and Maritime Autonomous Systems (MAS), each of which continues to be developed for use in a number of different sectors. Such sectors include: the oil and gas sector, where both unmanned surface and underwater vehicles may be used for pipeline inspection; the growing offshore wind farm industry, where MAS and UAVs will be applicable for routine inspection and maintenance; the automotive industry, where CAVs will revolutionise mobility for different sectors of society; the security and defence industry, where UAVs may be employed for collecting intelligence, surveillance and reconnaissance (ISR) data; and, more generally, the scientific community, which may use such systems to collect important environmental data.

Among the core types of machine learning algorithms used in autonomous systems are object and signal classifiers; indeed, these algorithms can be regarded as key components of such systems. Autonomous vehicles, for example, employ a variety of sensor types (visible light and IR cameras, LIDAR, sonar, etc.) to capture data about their surrounding environment, and use machine learning algorithms such as neural networks or other classifiers (Support Vector Machines, logistic regression) to identify objects within those surroundings. The classification of objects may be carried out in order to detect obstacles that must be navigated around, or to identify objects of interest on which to carry out further data analysis, for example.

Conventionally, such classifiers are trained on a training set that may comprise either real data captured by sensors, or synthetic data. The training data includes examples of positive occurrences (for example, images containing an object that the classifier is being trained to identify) as well as examples of negative occurrences (for example, images in which the object is absent). In the case of a supervised learning algorithm, each item in the data set also includes a label identifying the item as a positive or negative example. The classifier itself comprises a model that extracts features from the training data to determine whether a particular item is a positive or negative example. Each time the classifier is presented with an example from the training set, it judges whether the example is a positive or negative occurrence, and compares its output with the label for that example. A cost function can be defined that reflects the difference between the classifier output and the label, as a function of the parameters of the model. By repeatedly exposing the model to different items of training data and modifying the parameters of the model to minimise the cost function, it is possible to improve the classifier's performance, such that it identifies positive and negative examples with increasing accuracy.
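
By way of a purely illustrative sketch of the training loop described above (not taken from the application), the following trains a simple logistic-regression classifier by gradient descent on a hypothetical labelled data set; the feature values, labels and learning rate are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical labelled training set: each row of `features` is one training
# item, and each label is 1 (positive example) or 0 (negative example).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4))
labels = (features[:, 0] + features[:, 1] > 0).astype(float)

weights = np.zeros(4)
bias = 0.0
learning_rate = 0.1

for epoch in range(100):
    # Classifier output for every training item (a simple logistic model).
    logits = features @ weights + bias
    outputs = 1.0 / (1.0 + np.exp(-logits))

    # Cost function: the difference between the classifier output and the
    # labels (cross-entropy), expressed as a function of the model parameters.
    cost = -np.mean(labels * np.log(outputs + 1e-12)
                    + (1 - labels) * np.log(1 - outputs + 1e-12))

    # Modify the parameters so as to reduce the cost function.
    grad = (outputs - labels) / len(labels)
    weights -= learning_rate * (features.T @ grad)
    bias -= learning_rate * grad.sum()
```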

As AI systems continue to surpass human levels of performance on increasingly complex tasks, there is an increasing desire to apply them to high-risk, safety-critical systems. However, it has been widely observed that whilst AI systems can sometimes surpass human performance, they often exhibit shortcomings: for example, AI systems can easily be fooled into making erroneous classifications or decisions by small changes to the input data that are imperceptible to humans. Such shortcomings can pose a significant risk to safety, depending on the type of application. Taking the example of a USV, UAV or MAS, these systems need to be able to correctly analyse real-world image data across a range of environmental conditions. In the event that the algorithm fails to behave as expected, this could present a risk to the crew and passengers of any nearby vessels, vehicles or aircraft, as well as to the integrity of the USV or UAV itself. This is particularly true where the system is operating in an uncontrolled environment such as the sky or ocean, or on roads where autonomous vehicles are operational.

In order to fully exploit the opportunities presented by machine learning algorithms, it is therefore necessary to provide meaningful verification evidence of their behavioural performance for regulation, certification and assurance purposes. However, verification and validation (V&V) of such systems is not, in general, a mature area.

One of the core components of a V&V process is to verify that a given system satisfies a specification, where that specification includes one or more key performance indicators (KPIs). In the case of a machine learning algorithm, the specification may be to create an approximation of a classification function that maps an input to a label. The specification is often based on the explicit or implicit assumption that data items are independently and identically distributed (IID). The verification process of such a system will then involve presenting an unseen test data set to the trained algorithm and measuring its performance on that data set. Doing so provides an indication of how well the classification function has been approximated, and as such how well the specification has been met.

The IID assumption will often not hold for test data sets, however, nor is it common to subject a test data set to statistical tests to establish to what extent the assumption does hold. Furthermore, the IID assumption is restrictive, as there is no guarantee that data items processed during real-world operation will come from exactly the same source as the training data. In the absence of a more robust verification methodology that takes these considerations into account, there will remain limits on the extent to which such algorithms can be implemented in systems whose safety of operation is a primary concern, including UAVs, MAS and USVs.

Distance measures for estimating the safety of machine learning classifiers are discussed in a paper entitled "SafeML: Safety Monitoring of Machine Learning Classifiers through Statistical Difference Measure" (Aslansefat, K., et al., 2020). It is desirable to continue developing means for systematically verifying the performance of a machine learning algorithm, and in particular, to ensure that decisions and outputs from the algorithm can be relied on when exposing the algorithm to new input data. When doing so, it is further desirable to specify one or more KPIs for the system and to dynamically test the system to ensure these requirements are met.

SUMMARY

According to a first aspect of the present invention, there is provided a computer- implemented method comprising: receiving data to be input to a machine learning algorithm, the output from the machine learning algorithm to be used in carrying out a task; determining a degree of dissimilarity between the received data and training data that has been used to train the machine learning algorithm; determining, based on the degree of dissimilarity, whether the machine learning algorithm is likely to achieve a required level of performance when processing the received data; and in the event that it is determined that the machine learning algorithm is unlikely to reach the required level of performance, modifying the way in which the task is carried out.

The step of determining whether the machine learning algorithm is likely to achieve the required level of performance may comprise referencing a predetermined relationship between the performance of the machine learning algorithm when exposed to test data and the degree of dissimilarity between the training data and the test data. The method may comprise inferring a likely level of performance of the machine learning algorithm, based on the predetermined relationship.

The method may comprise defining a threshold degree of dissimilarity between the received data and the training data, wherein in the event the degree of dissimilarity between the received data and training data is above the threshold, it is determined that the machine learning algorithm is unlikely to reach the required level of performance. The threshold may be defined using the predetermined relationship.

The machine learning algorithm may comprise a neural network. The degree of dissimilarity between the received data and the training data may be determined by comparing the activation of neurons in the neural network when using the training data as input and when using the received data as input.

The received data may comprise one or more individual data items. Determining the degree of dissimilarity may comprise: determining, for one or more neurons in the neural network, one or more first activation values for the neuron, wherein each first activation value for a neuron comprises an activation value of the neuron as observed when executing the machine learning algorithm with a respective one of the data items as input; and determining, for each of the one or more neurons, second activation values for the respective neuron, the second activation values comprising activation values of the neuron as observed when using the training data as input for the machine learning algorithm.

The received data may comprise a single data item. Determining the degree of dissimilarity may further comprise: for each neuron, forming a comparison between the first activation value for the respective neuron with the second activation values for the respective neuron.

The method may comprise: defining, for each of the one or more neurons, a measure of the distribution of second activation values; and comparing the first activation value to the measure of the distribution of second activation values.

The measure of the distribution of second activation values may comprise a range of second activation values. The upper limit of the range may be defined by the maximum one of the second activation values for the respective neuron. The lower limit of the range may be defined by the minimum one of the second activation values for the respective neuron.

Comparing the first activation value of each respective neuron to the range of activation values for the neuron may comprise determining whether or not the first activation value falls within the range of activation values. The degree of dissimilarity may be based on the number of neurons for which the first activation value does not fall within the range of activation values for the neuron.

Determining the degree of dissimilarity may comprise: for each one of the neurons for which the first activation value falls outside the range of activation values, determining a difference between the first activation value and whichever end of the range of second activation values the first activation value is closest to.
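
A minimal sketch of the comparison described above is given below, assuming that the per-neuron activation values have already been extracted from the network; the array names and the particular summary measures returned are illustrative assumptions rather than the claimed method itself.

```python
import numpy as np

def dissimilarity_from_activations(train_activations, test_activations):
    """train_activations: array of shape (n_training_items, n_neurons) holding the
    activation values observed when the training data is input (the second
    activation values); test_activations: array of shape (n_neurons,) holding the
    activation values observed for a single received data item (the first
    activation values)."""
    # Range of second activation values per neuron, defined by the minimum and
    # maximum observed over the training data.
    lower = train_activations.min(axis=0)
    upper = train_activations.max(axis=0)

    # Neurons whose first activation value falls outside that range.
    below = test_activations < lower
    above = test_activations > upper
    out_of_range = below | above

    # One possible measure: the fraction of neurons that are out of range.
    fraction_outside = out_of_range.mean()

    # Another possible measure: the distance from the nearest end of the range,
    # summed over the out-of-range neurons.
    distance = np.where(below, lower - test_activations,
               np.where(above, test_activations - upper, 0.0)).sum()
    return fraction_outside, distance
```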

The received data may comprise a plurality of individual data items. Determining the degree of dissimilarity may further comprise: defining, for each of the one or more neurons, a measure of the distribution of second activation values; and comparing the first activation values to the measure of the distribution of second activation values.

Comparing the first activation values to the measure of the distribution of second activation values may comprise: defining, for each of the one or more neurons, a measure of the distribution of the first activation values; and comparing the measure of the distribution of the first activation values to the measure of the distribution of second activation values.

Determining the degree of dissimilarity between the received data and the training data may include comparing the activation of neurons in a reference neural network when using the training data as input to the reference neural network and when using the received data as input to the reference neural network. The machine learning algorithm may comprise a neural network and the method may comprise: determining a first measure of the dissimilarity by comparing the activation of neurons in the reference neural network when using the training data as input to the reference neural network and when using the received data as input to the reference neural network; and determining a second measure of the degree of dissimilarity by comparing the activation of neurons in the machine learning algorithm network when using the training data as input to the machine learning algorithm and when using the received data as input to the machine learning algorithm.

Determining whether the machine learning algorithm is likely to achieve the required level of performance may comprise determining whether the value of the first measure of the degree of dissimilarity lies within a predetermined range of values, and whether the value of the second measure of the degree of dissimilarity also lies within a predetermined range of values.

Determining whether the machine learning algorithm is likely to achieve the required level of performance may comprise referencing a predetermined relationship between the performance of the machine learning algorithm when exposed to test data and the degree of dissimilarity between the training data and the test data, the relationship being used to predict the performance for the first measure of the degree of dissimilarity and the second measure of the degree of dissimilarity.

The training data and the received data may each comprise one or more images.

The machine learning algorithm may comprise a classifier algorithm.

The task may comprise controlling one or more functions of an autonomous vehicle.

According to a second aspect of the present invention, there is provided a method of using a neural network to determine a degree of dissimilarity between an individual test data item and a set of two or more other data items, the neural network being configured to receive as input data items and to execute, for each data item, a machine learning task to produce an output for the data item, the method comprising: executing the machine learning task using the test data item as input to the neural network; determining, for one or more neurons in the neural network, a respective first activation value for the neuron, wherein the first activation value comprises an activation value of the neuron as observed when executing the machine learning task with the test data item as input; for each of the one or more neurons, comparing the respective first activation value of the neuron with second activation values for the neuron, the second activation values comprising activation values of the neuron as observed when using respective ones of the two or more other data items as input for the machine learning task; and based on the comparison, defining a degree of dissimilarity between the individual test data item and the set of two or more other data items.

The method may comprise: defining, for each of the one or more neurons, a measure of the distribution of second activation values; and comparing the first activation value to the measure of the distribution of second activation values.

The distribution of second activation values may comprise a range of second activation values. The upper limit of the range may be defined by the maximum one of the second activation values for the respective neuron. The lower limit of the range may be defined by the minimum one of the second activation values for the respective neurons.

Comparing the first activation value of each respective neuron to the range of activation values for the neuron may comprise determining whether or not the first activation value falls within the range of activation values. The degree of dissimilarity may be based on the number of neurons for which the first activation value does not fall within the range of activation values for the neuron.

Determining the degree of dissimilarity may comprise: for each one of the neurons for which the first activation value falls outside the range of activation values, determining a difference between the first activation value and whichever end of the range of second activation values the first activation value is closest to.

The set of two or more other data items may comprise data items used to train the neural network. According to a third aspect of the present invention, there is provided a computer- implemented method for training a machine learning algorithm, the method comprising: receiving test data for testing a performance of the machine learning algorithm; determining a degree of dissimilarity between the test data and training data on which the machine learning algorithm has been trained; specifying a performance requirement for the machine learning algorithm, based on the degree of dissimilarity between the test data and the training data; determining the performance of the machine learning algorithm when processing the test data; and in the event that the determined performance of the machine learning algorithm does not meet the performance requirement, subjecting the machine learning algorithm to further training and/or modifying the machine learning algorithm.

Specifying a performance requirement for the machine learning algorithm may comprise defining a trend in performance as a function of dissimilarity between the test data and training data. The method may comprise defining one or more requirements for the gradient of the trend.

Subjecting the machine learning algorithm to further training may comprise supplementing the training data with further training data. The further training data may comprise data items that are dissimilar to data items included in the training data.

The machine learning algorithm may comprise a neural network and specifying a performance requirement for the machine learning algorithm may comprise: specifying a performance level for the machine learning algorithm with reference to a first measure of dissimilarity, wherein the first measure of dissimilarity comprises a measure of the dissimilarity between the training data and the test data as obtained by comparing the activation of neurons in the machine learning algorithm when using the training data as input and when using the test data as input to the machine learning algorithm.

Specifying a performance requirement for the machine learning algorithm may further comprise: specifying a performance level for the machine learning algorithm with reference to a second measure of dissimilarity, wherein the second measure of dissimilarity comprises a measure of the dissimilarity between the training data and the test data as obtained by comparing the activation of neurons in a reference neural network when using the training data as input and when using the test data as input to the reference neural network. The method may comprise determining a mapping between the first measure of the degree of dissimilarity and the second measure of the degree of dissimilarity. The method may comprise using the mapping to define a performance requirement in terms of the value of the first measure of the degree of dissimilarity.

The reference network may comprise a neural network that has been pre-trained on a training data set having a greater number of data items than the training data used to train the machine learning algorithm. The reference network may be used to provide a measure of semantic differences between the received data or test data and the training data used to train the machine learning algorithm.

According to a fourth aspect of the present invention, there is provided a computer implemented method for managing how a task is to be carried out, the method comprising: receiving test data for testing the performance of a machine learning algorithm; determining a degree of dissimilarity between the test data and training data on which the machine learning algorithm has been trained; specifying a performance requirement for the machine learning algorithm, based on the degree of dissimilarity between the test data and the training data; determining the performance of the machine learning algorithm when processing the test data; and in the event that the determined performance of the machine learning algorithm does not meet the performance requirement, subjecting the machine learning algorithm to further training and/or modifying the machine learning algorithm to meet the performance requirement; receiving new data to be input to the machine learning algorithm, wherein the output from the machine learning algorithm when processing the new data is to be used in carrying out the task; determining a degree of dissimilarity between the new data and the training data; determining, based on the degree of dissimilarity between the new data and the training data, whether the machine learning algorithm is likely to meet the performance requirement when processing the new data; and in the event that it is determined that the machine learning algorithm is unlikely to meet the performance requirement when processing the new data, modifying the way in which the task is carried out.

According to a fifth aspect of the present invention, there is provided a computer readable medium comprising computer executable instructions that, when executed by a computer, will cause the computer to carry out a method according to any one of the previous aspects of the present invention.

According to a sixth aspect of the present invention, there is provided a computer system configured to carry out a method according to any one of the first to fourth aspects of the present invention.

According to a seventh aspect of the present invention, there is provided an autonomous vehicle comprising a computer system according to the sixth aspect of the present invention.

Embodiments described here can contribute to the development of comprehensive schemes for the verification of the performance of machine learning based algorithms, including Artificial Neural Networks (ANNs) and Deep Neural Networks (DNNs), across a host of application areas. Embodiments have particular use in verifying the correct performance of autonomous systems that implement such machine learning classifiers. Such classifiers might be used to allocate labels to real-world data, including images, for example. Embodiments seek to characterise the dissimilarity of data sets, assess the ability of classifiers to generalise, and specify how the classifiers are required to generalise.

BRIEF DESCRIPTION OF FIGURES

Embodiments of the invention will now be described by way of example with reference to the accompanying drawings in which:

Figure 1 shows an example of a classifier resilience function (CRF) according to an embodiment;

Figure 2 shows an example of how a single test data set may be split into subsets based on data item property values according to an embodiment;

Figure 3 shows a flow-chart of steps used in determining whether the performance of a machine learning algorithm meets a requirement defined by a CRF according to an embodiment;

Figure 4 shows a flow-chart of steps used in determining whether the performance of a machine learning algorithm meets a requirement defined by a CRF according to an embodiment;

Figure 5 shows a flow-chart of steps used in an online performance verification process, according to an embodiment;

Figure 6 shows an example of how the estimated performance of a machine learning algorithm when processing received data may vary dependent on the degree of dissimilarity between the received data and the data used to train the algorithm;

Figure 7 shows a flow-chart of steps used in determining a degree of dissimilarity between different data items, according to an embodiment;

Figure 8 shows a schematic representation of the activation values of a single neuron in an ANN when training data is input to the ANN;

Figure 9 shows the neuron of Figure 8 with activation values of the neuron when test data is input into the ANN;

Figure 10 shows a schematic representation of activation values of two neurons in an ANN when test data is input to the ANN;

Figure 11 shows a schematic representation of activation values of three neurons in an ANN when test data is input to the ANN;

Figure 12 shows a schematic representation of activation values of two neurons in an ANN when two test data items are input to an ANN;

Figure 13 shows an example of a machine learning algorithm being used to classify data contained in test data sets, and using the machine learning algorithm and a separate reference network to determine respective measures of dissimilarity between those test data sets and the training data used to train the machine learning algorithm;

Figure 14 shows example data reflecting the value of the dissimilarity measures as obtained from the machine learning algorithm and reference network of Figure 13, together with the accuracy (performance) of the machine learning algorithm for the different test data sets;

Figure 15 shows the machine learning algorithm of Figure 13 being used to classify data contained in additional test data sets and to determine the dissimilarity between those test data sets and training data used to train the machine learning algorithm;

Figure 16 shows example data reflecting the value of the dissimilarity measure as obtained from the machine learning algorithm in Figure 15, together with the accuracy (performance) of the machine learning algorithm for the different test data sets; and

Figure 17 shows a graphical plot of the data in Figures 14 and 16.

DETAILED DESCRIPTION

In embodiments described herein, a framework is provided by means of which the performance of machine learning algorithms incorporated within autonomous systems can be verified. The elements of this framework can further be used to validate and ensure the safe operation of autonomous systems, and in particular to set safeguards for when the operation of those systems is to be modified, or in some cases terminated.

In more detail, embodiments provide a means of measuring the generalisation capability of a machine learning classifier. Generalisation in this context refers to the ability of a classifier to perform effectively when supplied with inputs which do not belong to the data set used for training. The performance of a classifier will, in general, be dependent on the relationship between an input data item and the training data set, or between a test or operational data set and the training data set. Therefore, if classifier performance is measured for a data set for which this relationship is not known, the information gained is limited and insufficient for practical use. More information is gained if classifier performance is measured for a data set whose relationship to the training data set has been established. Ideally, then, classifier performance requirements should state the level of performance that needs to be attained for each of a set of values of a quantifiable test data set property, where that property describes the relationship between a training data set and a test data set. A classifier with a greater ability to generalise than another will exhibit superior classification performance over a series of test instances or test data sets that differ progressively more from the data included in the training data set.

In order to measure the generalisation capability of a machine learning algorithm, and so provide a basis for verification of performance, embodiments described herein introduce the concept of a classifier resilience function (CRF). An example CRF is shown in Figure 1.

The CRF captures the relationship between performance and some measure of test data set dissimilarity (denoted Z in Figure 1). That is, the CRF shows the extent to which performance is maintained as the algorithm is tested on a series of data sets which, by some measure, are progressively further from the data set on which the algorithm was trained. By doing so, the CRF indicates how well a machine learning algorithm, such as an ANN, can generalise to progressively more difficult data sets, i.e. data sets that differ from the training set to an increasing degree.

As noted above, the parameter Z quantifies the dissimilarity between the test data set and the set of data used to train the machine learning algorithm. The parameter Z may be based on one of a number of different quantifiable properties of the data sets; that is, the dissimilarity between the training data set and the test data set may be obtained by quantifying a particular property of the training data set and the test data set and determining the difference in those two values. In the case of images, for example, the properties could be expressed in terms of the number of pixels of a particular colour per image, or the mean brightness of pixels in the images, etc. In the case of audio signals, the properties could be expressed in terms of the amplitudes of certain frequency components, for example. Properties of a data set that might be quantified may also include the intrinsic separability of classes within the data set, for example.
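
As a purely illustrative sketch of how such a property-based value of Z might be computed for image data, assuming mean pixel brightness is the chosen property (the function names are hypothetical):

```python
import numpy as np

def mean_brightness(images):
    """Quantify a simple property of an image data set:
    the mean pixel brightness over all images in the set."""
    return float(np.mean([img.mean() for img in images]))

def dissimilarity_z(training_images, test_images):
    # Z is taken here as the absolute difference between the quantified
    # property of the test data set and that of the training data set.
    return abs(mean_brightness(test_images) - mean_brightness(training_images))
```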

It will be appreciated that a test data set may comprise one or more individual data items. Thus, the value Z may define a measure of dissimilarity between an individual test sample and a training set; that is, Z can be defined as a difference between a single, individual test data item and a collection of items contained within the training set. In other embodiments, the value Z may define a difference between a set of (multiple) test data items and a collection of items contained within the training set.

In more detail, as examples of the different ways in which the value Z may be derived, we can consider three scenarios. The first of these scenarios concerns the case where there is a single test data item whilst the training data set contains a plurality of data items. The second and third scenarios both concern the case where the test data set and the training data set each comprise multiple data items.

In the first scenario, the value Z may be derived as the distance between a measured value of a property of the single test data item and a value of that property for the training data set - the latter value may, for example, comprise the mean value of that property for the data items contained in the training data set, or some other measure of the distribution of values of that property across the training data set. In the second scenario, the distance between the test data set and the training data set may be derived from the individual distances between each test data item and the training data set. As an example of this, the distance may be found between the mean value of the property for the training data set and the value of that property for each test data item. The mean of these individual distances could then be used as the distance between the test and training data sets. In the third scenario, the distance between the test and the training data set may be defined without reference to the individual distances between each test data item and the training data set. As an example of this, the distance between the mean value of the property for the test data set and the mean value of the property for the training data set could be used to derive the value of Z.

It will be recognised in the above that where reference is made to a mean value, this is simply by way of example, and other parameters that reflect the distribution of values may also be used when deriving the value of Z.
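
A minimal sketch of the three scenarios is given below, using the mean value purely by way of example and a hypothetical per-item property function (here simply the mean of the item's values); none of these names are taken from the application.

```python
import numpy as np

def property_value(item):
    # Hypothetical per-item property, e.g. the mean brightness of an image.
    return float(np.mean(item))

def z_single_item(test_item, training_items):
    """Scenario 1: a single test data item compared with the training data set."""
    train_mean = np.mean([property_value(i) for i in training_items])
    return abs(property_value(test_item) - train_mean)

def z_mean_of_distances(test_items, training_items):
    """Scenario 2: the mean of the individual item-to-training-set distances."""
    return float(np.mean([z_single_item(t, training_items) for t in test_items]))

def z_distance_of_means(test_items, training_items):
    """Scenario 3: the distance between the per-set mean property values,
    without reference to individual item distances."""
    test_mean = np.mean([property_value(t) for t in test_items])
    train_mean = np.mean([property_value(i) for i in training_items])
    return abs(test_mean - train_mean)
```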

The CRF for a particular application or system may be defined outright; for example, a user may simply specify the form of the CRF that a system is required to meet. Alternatively, the CRF may be obtained based on measurement. In the latter case, the CRF may be obtained by subjecting a trained machine learning algorithm to multiple test data sets that each have a different value of Z (i.e. data sets that each differ from the original training set to a greater or lesser degree) and monitoring the performance of the algorithm as a function of the parameter Z. The performance (which, in the case of a classifier, may be defined as the accuracy with which the algorithm is able to classify particular objects present in each test data set, for example) can be plotted as a function of Z, so as to generate the CRF. Thus a CRF may map dataset dissimilarity values to accuracy values.

Accuracy may be measured by: 1) manually establishing the ground truth for a test data set; 2) recording a score generated by the classifier for each of a number of classes for each test data item; 3) counting the number of test data items for which the score was highest for the correct class (with reference to the ground truth); and 4) returning the number correct as a fraction of the total number of test items.
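
A minimal sketch of this accuracy measure, assuming the classifier scores and the manually established ground truth are available as arrays, could be:

```python
import numpy as np

def classification_accuracy(scores, ground_truth):
    """scores: array of shape (n_items, n_classes) holding the classifier score
    per class for each test data item; ground_truth: array of shape (n_items,)
    holding the manually established correct class index for each item.
    Returns the fraction of items whose highest score is for the correct class."""
    predicted = np.argmax(scores, axis=1)
    n_correct = int(np.sum(predicted == ground_truth))
    return n_correct / len(ground_truth)
```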

Where there is a statistical relationship between the measured performance and the measured values of Z, rather than a deterministic one, regression analysis can be used to find a function (CRF) which best fits the data. As well as generating an estimate of the underlying function, statistical analysis can extract further information describing the relationship, which can be exploited for performance evaluation and prediction. A CRF could be multivariate in that performance could be expressed as a function of more than one measure. For example, for imagery, test data sets may differ in both brightness and contrast.
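
As an illustration, a CRF could be fitted to measured (Z, accuracy) pairs by least-squares regression; the data values and the quadratic form used below are assumptions for illustration only, and a multivariate fit could equally be used when more than one dissimilarity measure is available.

```python
import numpy as np

# Hypothetical measurements: one (Z, accuracy) pair per test data set.
z_values = np.array([0.05, 0.10, 0.20, 0.35, 0.50])
accuracy = np.array([0.97, 0.95, 0.90, 0.78, 0.65])

# Fit a simple CRF by least-squares regression (quadratic chosen arbitrarily).
coefficients = np.polyfit(z_values, accuracy, deg=2)
crf = np.poly1d(coefficients)

# Predicted performance at a previously unseen dissimilarity value.
predicted_accuracy = crf(0.30)
```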

In the event that there is only a single test data set available, then this can be used to generate multiple test data sets by splitting that data set into subsets based on calculated data item property values. This assumes that property values can indeed be returned for individual data items. One method of splitting the data on the basis of data item property value is by sorting those values into ascending order, and then splitting the sorted values into quarters by calculating quartiles (q). This is shown in Figure 2 - the data item property in this case being dissimilarity to the training data set. Here, multiple plots are presented for a series of data sets generated from a source data set by adding noise of progressively increasing variance, thereby providing test data sets of increasing dissimilarity compared to the training data set. The legend in Figure 2 lists the variance of the additive noise for the test data set associated with each plot. Each test data set in the series has been split into quarters - 'q1' refers to quartile 1, 'q2' to quartile 2 (the median) and 'q3' to quartile 3. Each plot therefore comprises four points, corresponding to the four subsets (quarters) generated from each data set in the series. The figure displays the performance attained for each subset, but does not, in this case, record an explicit global dissimilarity measure for each subset. Other methods for obtaining multiple data sets, such as a sliding window, could also be used.
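
A sketch of the quartile-based split, assuming a per-item property value can be computed for each item in the single test data set, might look as follows (the function name is hypothetical):

```python
import numpy as np

def split_by_quartiles(property_values):
    """Split the indices of a single test data set into four subsets based on
    per-item property values (e.g. dissimilarity to the training data set),
    using the quartiles q1, q2 (the median) and q3."""
    values = np.asarray(property_values)
    q1, q2, q3 = np.quantile(values, [0.25, 0.5, 0.75])
    subsets = [
        np.where(values <= q1)[0],
        np.where((values > q1) & (values <= q2))[0],
        np.where((values > q2) & (values <= q3))[0],
        np.where(values > q3)[0],
    ]
    return subsets
```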

It will be appreciated that the process of defining the CRF may be an iterative one, in which an initial, desired form of the CRF is specified, and then during training and testing cycles, the CRF is iteratively altered so as to achieve a realistic or achievable performance specification. For example, it may become apparent during testing that there is a limit to the performance that can be achieved once the value of Z increases beyond a certain point, and the CRF will need to be altered to acknowledge this fact.

Classifier resilience functions capture and quantify the generalisation capabilities of machine learning algorithms. Therefore requirements which cover generalisation performance can be expressed in terms of CRFs. This in turn enables verification to be more systematic. For example, the specification may define one or more permitted forms of a classifier resilience function that the algorithm’s performance must match. In so doing, the specification will state the required generalisation capability of the machine learning algorithm as a function of one or more quantifiable properties of a data set. A specification may also define multiple classifier resilience functions, each one of which the algorithm must be capable of meeting. By defining the requirements in this way, machine learning classifiers can in turn be systematically verified with respect to their required generalisation competence. Verification can be carried out using existing methods such as formal verification or dynamic testing.

The CRF may be used in both online and offline modes as a means for verifying the performance of the machine learning algorithm. Here, “offline” can be understood to refer to the training and testing phases of the algorithm, and “online” to refer to the real world operation of the algorithm.

The use of the CRF in the offline mode is explained by reference to Figures 3 and 4. In this case, the CRF can provide an indication as to whether or not the machine learning algorithm has undergone sufficient training in order to meet requirements for safe operation. Referring first to Figure 3, the method commences in step S300 by specifying the requirements - by reference to the CRF, for example. Following this, the method proceeds by initialising the machine learning algorithm (step S301) and then training the algorithm with a set of training data (step S303). As discussed above, training of the algorithm may include defining a cost function that specifies a difference between the output of the algorithm and the expected values when using the training set, and iteratively adjusting the parameters of the model used in the algorithm so as to minimise the cost function. At the same time, or after training is complete, one or more properties of the training data set are obtained and quantified, to be compared with the test data set as part of the verification process.

Having trained the algorithm, the method proceeds to step S305 in which the performance of the algorithm is verified using a test data set. The properties of the test data set are obtained and quantified in the same way as for the training data set, following which the performance of the algorithm is measured when the test data set is used as input to the algorithm. In step S307, the values obtained by quantifying the one or more properties of the training data set and the test data set are compared in order to determine a value of Z for the two sets of data i.e. to quantify the extent to which the data in the test data set differs from that in the training data set. A determination is then made as to whether the measured performance of the algorithm, at the given value of Z, is consistent with the CRF and hence meets the requirements set out in the specification.

In the event that the measured performance is found to be below the acceptable level for the given value of Z, a decision may be made to alter the model used in the machine learning algorithm (step S309). For example, the number of parameters included within the model may be adjusted up or down and/or the value of constants used in the model (such as a regularization parameter, for example) may be varied. In the case of an ANN or DNN, the number of layers included in the network and/or the number of nodes within the layers may be changed. Another response would be to select a different class of neural network architecture design, or a different type of classifier. Alternatively, or in addition, a decision may be made to re-train the algorithm using a different set of training data from the original training data set, or by supplementing the original training data with further training data, so as to try to find a more optimal set of parameter values for the model. Having reconfigured and/or retrained the model, steps S301 to S309 may be repeated, until such time as the performance requirement is met.

In some instances, the performance requirement(s) as set out in the verification specification (and reflected in the CRF) may also define one or more values of Z for which the performance must be above a threshold. In other words, the performance requirement may only be deemed to be met if the measured performance on the test data set is found to be acceptable for a specified value of Z. As an example, the specification may state that “an algorithm X must perform with a performance level greater than or equal to Y for data sets of dissimilarity equal to or below Z_threshold”. Figure 4 shows an example of such a case. Here, steps S400 to S405 are the same as steps S300 to S305 of Figure 3, respectively. In step S407, a check is made as to whether the value of Z for at least one of the test data sets is equal to or above the value Z_threshold and less than or equal to Z_threshold + e, where e is a specified, positive number. In the event that the measured value of Z is smaller than Z_threshold for every test data set, then it can be inferred that the initial test data set(s) were not challenging enough; the algorithm cannot be verified because the dissimilarity is too low to draw a meaningful conclusion as to how well the algorithm will perform when exposed to a wider range of input data. To address this, the range of test data that is to be input into the algorithm may be expanded, so as to obtain a larger value of Z between at least one set of test data and the training data. Once a positive determination has been made in step S407, the method proceeds to step S409, in which a determination is made as to whether the performance of the machine learning algorithm is greater than or equal to Y for one or more test data sets that have a dissimilarity value of Z that is equal to or below Z_threshold + e. If the required performance as set out in the specification is not achieved for these data sets, it will be necessary to adapt the model used in the algorithm and/or replace or supplement the original training data with new training data before then re-training the algorithm (step S411). If the performance does meet the specified requirement, then the machine learning algorithm’s performance can be verified.
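A minimal sketch of the checks performed in steps S407 and S409 is given below, assuming that the dissimilarity Z and the measured performance are already available for every test data set; the function name and its arguments are illustrative, and whether the performance check is applied to all qualifying data sets or only some of them is a design choice.

    def offline_verification(z_values, performances, y_required, z_threshold, eps):
        # z_values[i]     : dissimilarity Z between test data set i and the training data
        # performances[i] : measured performance of the algorithm on test data set i
        # Step S407: at least one test set must probe the region [Z_threshold, Z_threshold + e].
        if not any(z_threshold <= z <= z_threshold + eps for z in z_values):
            return "test data not challenging enough: expand the range of test data"
        # Step S409: performance must be >= Y for the test sets with Z <= Z_threshold + e
        # (here taken to mean every such test set).
        for z, p in zip(z_values, performances):
            if z <= z_threshold + eps and p < y_required:
                return "not verified: adapt the model and/or re-train (step S411)"
        return "verified"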

A CRF may also be characterised by two properties, referred to herein as the instantaneous resilience and the average resilience. Together, these properties capture the extent to which the machine learning classifier can generalise to data sets whose quantifiable properties change in a graded manner. The instantaneous resilience of the classifier for the particular data set property in question is defined as the gradient of the function at a point along the x-axis, for the particular value of the data set property at this location. The average resilience of the classifier for the particular data set property in question is defined as the average gradient of the function between two different values of this property. In some embodiments, the performance specification may define additional requirements relating to the resilience of the algorithm. For example, in addition to defining a threshold Z_threshold at which the algorithm must maintain an acceptable performance level, verification may also require that the rate of decline in the algorithm’s performance for a given range of Z must be less than or equal to that of the CRF; in other words, the algorithm must demonstrate a degree of resilience that is consistent with the trend shown in the CRF.
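As a simple sketch of these two quantities, the instantaneous and average resilience might be computed numerically from a CRF supplied as a callable; the step size dz and the example CRF in the comments are illustrative assumptions.

    def instantaneous_resilience(crf, z, dz=1e-3):
        # Gradient of the CRF at data set property value z (central finite difference).
        return (crf(z + dz) - crf(z - dz)) / (2.0 * dz)

    def average_resilience(crf, z_a, z_b):
        # Average gradient of the CRF between two property values z_a and z_b.
        return (crf(z_b) - crf(z_a)) / (z_b - z_a)

    # Example with a hypothetical CRF in which accuracy falls linearly with dissimilarity:
    # crf = lambda z: 0.95 - 0.05 * z
    # instantaneous_resilience(crf, 2.0)  -> approximately -0.05
    # average_resilience(crf, 0.0, 4.0)   -> -0.05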

In addition to providing a method to verify the behaviour of a machine learning classifier during training or development of the algorithm, the CRF also provides a method for online performance estimation. If a CRF has been established for test data sets whose level of dissimilarity with the training data set is comparable with that of data received online, then online algorithmic performance can be predicted. It is also the case that the CRF allows online self-verification to be conducted: the dissimilarity to the training data set as measured for data received online must be below or equal to a threshold stated in requirements. Such online performance prediction and verification may take place whilst the algorithm is actually used in carrying out an operational task. In some embodiments, the machine learning algorithm may generate outputs used for selecting actions to be performed by an autonomous function or a robotic agent interacting with a real-world environment, or a virtual agent interacting with a simulated environment. In such cases, the “task” may be one that is carried out by the function / agent, dependent on the output from the machine learning algorithm. The manner in which the agent performs that task or the decisions made in doing so may be influenced by the output from the machine learning algorithm.

In one example, the task may comprise operating a system on-board an autonomous vehicle, such as a guidance system, for example. Given the inherent emergent behaviour of autonomy algorithms, this may provide the Command and Control Module of a real-world autonomous system with a reliable method for predicting the behaviour of the algorithm in real-time, and allow that system to assess instantaneous performance, based on the dissimilarity between the current data being sensed and the training data set. A multivariate CRF for two or more dissimilarities may be used for on-line performance prediction. Alternatively, univariate CRFs for a number of dissimilarities may be used in combination (e.g. averaging) for on-line performance prediction.

The online performance verification process is shown in Figure 5. In step S501, the system receives data captured online; that is to say sets of data (e.g. one or more images) that are captured by the system over methodically determined time intervals as the system proceeds in carrying out a task, such as navigating, for example. In step S503, a comparison is made between the newly received data and the training data, so as to quantify the dissimilarity between the received data and the training data. Having established the level of dissimilarity, the expected level of performance is now estimated by referring to the CRF for the algorithm (step S505).

Here, the CRF defines a predetermined relationship, based on measurement, that can be used to infer how well the algorithm will perform on the newly received input data (step S507). As an arbitrary example, referring to Figure 6, for a given value of dissimilarity of Z1, the machine learning algorithm can be estimated to have a performance level Y1. Similarly, for a given dissimilarity of Z2, the machine learning algorithm can be estimated to have a performance level Y2. In the event the expected performance of the algorithm is of an acceptable level, the output from the algorithm may continue to be used in performing the task at hand (step S509). Otherwise, an alternative action (e.g. fail-safe) may be activated in which the performance of the task is either modified, or terminated entirely (step S511).
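The estimation and decision steps S505 to S511 might be sketched as follows, assuming the CRF has been measured at a set of discrete dissimilarity values and that linear interpolation between those points is acceptable; all names are illustrative.

    import numpy as np

    def predict_performance(z_online, crf_z_points, crf_y_points):
        # Estimate the expected performance for an online dissimilarity value by
        # interpolating the measured CRF (crf_z_points must be in ascending order).
        return float(np.interp(z_online, crf_z_points, crf_y_points))

    def select_action(z_online, crf_z_points, crf_y_points, y_acceptable):
        y_expected = predict_performance(z_online, crf_z_points, crf_y_points)  # steps S505/S507
        if y_expected >= y_acceptable:
            return "continue task using the algorithm output"       # step S509
        return "modify or terminate the task (fail-safe action)"    # step S511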

In the example shown in Figure 5, the specification may define a requirement in the form of a maximum permitted dissimilarity between the input data and the training data. The maximum permitted dissimilarity may be expressed in a similar form to that above, namely that “an algorithm X must perform with a performance level greater than or equal to Y for data sets of dissimilarity equal to or below Z_threshold”. In this case, the maximum permitted dissimilarity will be Z_threshold (or, if it is higher, the maximum value of Z for which the performance has been found to be of an acceptable level). If the requirement for self-verification is not met, the system can indicate that this is the case and respond accordingly.

By way of example, the online performance verification as shown in Figure 5 may be implemented in an autonomous vehicle, in which one or more images of the path ahead of the vehicle are captured by one or more onboard cameras. Here, the machine learning algorithm may comprise a classifier that is used to identify the presence of any obstacles in the path ahead that the vehicle needs to steer around to avoid. The task at hand is then to control the steering of the vehicle, so as to avoid any such obstacles. In this context, the images captured by the cameras will comprise the online data that is to be input to the machine learning algorithm, with the machine learning algorithm processing the images to identify the presence of any obstacles in the path ahead. The output from the machine learning algorithm may be input to a separate steering system that will use the classifications obtained from the machine learning algorithm to carry out the task of steering the vehicle.

By considering the level of dissimilarity between the training data and the input into the classifier (in this case, the images received from the onboard cameras) and comparing this to the CRF, a prediction can be made as to how well the classifier is likely to perform on the received images, and in turn whether the algorithm can be relied upon to identify obstacles in its path to a sufficient degree of accuracy. In the event it is determined that the performance of the machine learning algorithm is unlikely to meet the acceptable level, then the way in which the task is carried out may be modified, such that the output from the machine learning algorithm is given less weight in terms of deciding how and when to steer the vehicle. For example, the output from the machine learning algorithm may still be included in the decision making process, subject to there being additional checks or input from other computer systems and/or oversight from a manual operator. In other embodiments, a fail-safe may be implemented whereby the vehicle reverts to a fully manual mode, with control of the vehicle reverting to a manual operator and the output from the machine learning algorithm ceasing to have any role in deciding how and when the vehicle should turn.

It will be appreciated that where the performance level is determined as being acceptable, then the manner in which the task is carried out may still be modified, dependent on the estimated performance level. For example, in the case of the autonomous vehicle described above, in the event that it is determined that the performance of the machine learning algorithm is likely to be well above a minimum acceptable level, the output from the machine learning algorithm may be given a higher weighting in the overall process of deciding how and when to steer the vehicle. This may then result in a reduced need for corroboration of the classifier output by other systems on board the vehicle.

Accordingly, embodiments described herein can facilitate the expanded use of autonomous systems, without any reduction in safety.

As discussed above, the term “dissimilarity” as used herein relates to a quantification of the difference between two sets of data. Dissimilarity may be determined for a single data item with respect to a second data set and/or may be determined for all, or a portion, of a first data set with respect to a second data set. For example, the dissimilarity between two sets of data may be a measure of the image quality of a first set of images with respect to a second set of images. As a second example, the dissimilarity may be a measure of the difference between the distribution of training data set items in some feature space, and the distribution of test data set items in the same space. The dissimilarity between a set of test data and the training data is an indication of the challenge, and hence the difficulty, that the input data poses to a machine learning algorithm.

In what follows, a methodology for determining the dissimilarity between training data sets and test data sets / received data sets will be described, with reference to Figures 7 to 12. Here, the internal state of the machine learning algorithm when exposed to the different data sets is itself used as a measure of the dissimilarity between those data sets. The method will be described for the specific case of an ANN, but it will be appreciated that the methodology can be extended to other forms of machine learning algorithm. For an ANN, neuron feature space refers to the space in which each axis corresponds to the activation values output by a particular neuron. Not all the network neurons need be represented in a neuron feature space.

The method is based on observing the internal neuron activations of a neural network when training data is input to the network (these activations can be considered “training activations”). The internal neuron activations are similarly observed when previously unseen input data is presented (these activations can be considered “testing activations”). The training activations and testing activations are then compared in order to determine a measure of dissimilarity between the training data and the unseen (test) data. The present approach is inspired by one described in a paper by Ma et al (Ma, L., Juefei-Xu, F., Zhang, F., Sun, J., Xue, M., Li, B. & Zhao, J. (2018, September), “DeepGauge: Multi-granularity testing criteria for deep learning systems”, Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, p120-131). In that paper, the authors propose a set of neuron test coverage metrics to show the extent to which a network is exercised by an entire data set. The DeepGauge metrics summarise test coverage for an entire data set, with a single value being assigned per metric; in other words, the DeepGauge criteria focus only on trends over an entire test data set, and not on individual measurements for each item (as described below) of a test data set. In contrast, embodiments described herein allow a measure of dissimilarity to be determined between an individual test item and a set of training data.

In the embodiments described throughout the present application, references to a “single” or “individual” data item can be understood to mean an item of data that contains the minimum amount of information required to meet the input requirements for the machine learning algorithm, and so enable the algorithm to perform at least one run and produce an output.

By way of example, a classifier algorithm may receive a single image or portion of an image that of itself contains sufficient information for the classifier to execute its function and perform a classification. This would then be distinguished from a set of data which provides insufficient information for the classifier to run. For example, the classifier may be constructed in such a way that it requires more than a single pixel intensity value to perform a classification. In this case, a single pixel intensity value will not, by itself, qualify as a “single” data item as defined herein. Meanwhile, a set of multiple images that can each be input separately into the classifier algorithm, with the algorithm performing a separate run to classify each respective image, can be considered to comprise a group of data items, as opposed to a single data item.

It will be understood that in practice, not all the data contained in the input data item need actually be used in the classification process. For example, in the case of an image, certain regions of the image may be cropped or disregarded in a pre-processing stage of the machine learning algorithm; an image of a car may be cropped, for example, to remove items in the background such as trees and other scenery. Nevertheless, even if only a small portion of the image will actually be used in determining a classification for the image, the machine learning algorithm may still require the input image to be of a certain size or dimension in order for the algorithm to run. Thus, a single data item may be one that satisfies the input requirements of the algorithm, even if not all of the data contained within that data item will ultimately contribute to the activation values observed for that image.

The fact that the dissimilarity can be measured on an individual image-by-image basis allows for on-going self-verification and performance prediction during operation of a task that relies on the output from the algorithm. For example, it might be the case that the dissimilarity of a set of data items is defined as some form of average of the dissimilarity of each of the individual items to the training set. Further, new data items might be received at regular time intervals. When this pertains, if the average of the individual dissimilarities is calculated as a rolling average, then the value can be updated on the receipt of each new data item.
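A rolling average of the per-item dissimilarities, updated on receipt of each new data item, could be maintained along the following lines; the window size is an arbitrary illustrative choice.

    from collections import deque

    class RollingDissimilarity:
        # Maintains a rolling average of per-item dissimilarities received online.
        def __init__(self, window=100):
            self.values = deque(maxlen=window)

        def update(self, item_dissimilarity):
            self.values.append(item_dissimilarity)
            return sum(self.values) / len(self.values)   # current rolling average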

The steps in obtaining the measure of dissimilarity between the two sets of data are explained with reference to Figure 7. In step S701, an item of test data is received and input into the algorithm. In step S703, activation values of neurons in the network are measured. The activation values observed for the respective neurons are compared with the activation values observed when using the training data as input to the algorithm (step S705). In step S707, a degree of dissimilarity between the input data item and the training data is defined, based on the comparison between the activation values associated with the test data as input and the activation values associated with the training data as input.

In embodiments described herein, when passing training data through an ANN which has completed its training, the minimum and maximum activations of each neuron are saved and serve as a basis for defining a region termed the “Major Function Region” MFR (which is particular to each neuron). When the trained network is supplied with a new data item, a count is made of the number of neurons which are activated outside the MFR. The total count can then be defined as the “Neuron Region Distance” NRD for the specific test input. The NRD values can be measured by extending a standard Convolutional Neural Network (CNN) implementation to include extra modules for storing and analysing neuron output values. Other types of ANN may be extended in a similar fashion to store and analyse the neuron output.
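A minimal sketch of recording the MFR and counting the NRD for a single test item is given below, assuming that the activation values of the monitored neurons have already been collected into NumPy arrays; the function names are illustrative.

    import numpy as np

    def record_mfr(training_activations):
        # training_activations: shape (num_training_items, num_neurons). The per-neuron
        # minimum and maximum over the training set define the Major Function Region.
        return training_activations.min(axis=0), training_activations.max(axis=0)

    def neuron_region_distance(test_activation, mfr_min, mfr_max):
        # Count the neurons whose activation for a single test item lies outside the MFR.
        outside = (test_activation < mfr_min) | (test_activation > mfr_max)
        return int(outside.sum())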

It will be appreciated that the main computation in the NRD is a set of two comparisons: 1) is the test point less than the lower bound of the MFR? and 2) is the test point greater than the upper bound of the MFR? This results in a complexity of O(2n), where n is the number of neurons (strictly O(n), as constants do not contribute to the complexity). This means that the complexity does not scale with the number of training examples. Additionally, most deep learning frameworks such as Tensorflow provide GPU-accelerated implementations of matrix comparison operations which reduce the execution time required.

One of the core assumptions of machine learning is that the training and test data are independently and identically distributed. This assumption is unlikely to hold when classifiers are deployed in varied real-world scenarios. By providing a measure of how dissimilar each test data item in a test data set is from the learnt training distribution, the NRD provides context as to how well a model can be expected to perform, or is required to perform, on an entire test data set. By measuring the dissimilarity in the form of the NRD, it is possible to train a classifier which functions within an autonomous system so that it will perform in a proportionate and appropriate manner given the nature of the data to be or being processed.

The concept of the NRD can be understood further by reference to Figures 8 to 11. Figure 8 shows the activation of a single neuron in a one-dimensional (1D) format. The activation values 801 of the neuron when exposed to training data may take any value from zero upwards (assuming that the neuron applies a rectified linear unit (ReLU) activation function). The maximum activation value 803 of the neuron when exposed to the training data and the minimum activation value 805 of the neuron when exposed to the training data can be used to define respective upper and lower limits a_max and a_min. In one dimension, the MFR may be defined as a region that exists between a_max and a_min. It will be appreciated that the values a_max and a_min may correspond exactly to the maximum and minimum activation values of the neuron, or else may be derived from those values. As a simple example, the maximum and minimum activation values may be rounded up or down to the closest whole number to provide the values a_max and a_min.

Figure 9 shows the same neuron as illustrated in Figure 8, this time with the activation values of the neuron when exposed to the test data. Some of the test data activation values 903 lie within the MFR, whilst certain ones of the activation values 901, 905 lie below a_min and above a_max respectively. Those activation values 901, 905 that lie outside the range a_min to a_max may be regarded as corner cases.

Figures 10 and 11 show NRD measurements returned for multiple activation values which have been generated for individual test data items by more than one neuron, corresponding to the case of a network with 2 neurons (2 dimensions) and 3 neurons (3 dimensions) respectively. In practice, it will be appreciated that the MFR defined for the network will comprise an n-dimensional (nD) space, where n is the number of neurons in the network.

Referring to Figure 10, the first neuron N1 has upper and lower activation boundaries of a_max(1) and a_min(1), respectively. The second neuron N2 has upper and lower activation boundaries of a_max(2) and a_min(2), respectively. In this case, the MFR for the network, assuming the network solely consists of these two neurons, may be defined by the region of overlap between the areas defined by the 1D MFR of each neuron. A test data item for which the activation value lies within the MFR of a particular neuron may be assigned a value of 0 for that neuron, and a test data item whose activation value for that neuron lies outside the MFR may be assigned a value of 1. Therefore, in 2D, any test data item having a vector of assigned values [0,0] will have activation values for the two corresponding neurons that fall within the 2D network MFR. The NRD of a test data item may then be based on the summation of the vector of the assigned values. For example, test data item 1001 has a vector [0,0] and an NRD of 0. Test data item 1003 has a vector of values [1,1] and an NRD of 2. Test data item 1005 has a vector of values [1,0] and an NRD of 1. Test data items whose activation values tend to lie outside the network MFR, and so have a larger NRD, can be understood to be more dissimilar to the data items contained within the training set than those whose activation values fall within the network MFR. A system of at least two neurons facilitates the identification of unexpected activation values (for the case where it has been assumed that the test and training data sets are drawn from the same distribution).

Figure 11 illustrates the activation of a model with three neurons in three dimensions. Here, a_max(1) and a_min(1) are the respective upper and lower activation values of a first neuron N1. The values a_max(2) and a_min(2) are the respective upper and lower activation values of a second neuron N2, and the values a_max(3) and a_min(3) are the respective upper and lower activation values of a third neuron N3. The MFR of the 3D system is the 3D volume defined by the upper and lower activation boundaries of each of the three neurons.

In some embodiments, the NRD can be normalized by dividing the NRD by the total number of network neurons. The normalized value may be termed the “Fractional Neuron Region Distance” FNRD. In the event that the data set contains a plurality of data items, the dissimilarity measure for the data set can be obtained by taking the median (or another effective descriptive statistic) of the respective FNRD values obtained from each item of data in the data set.

The NRD may be construed as the Hamming Distance of a test example from the MFR. The Hamming Distance is traditionally used to compare two binary strings and shows how many symbols are different in the two strings. For example, the binary strings 100 and 111 have a Hamming Distance of two, as there are two symbols that differ between them. In terms of the vectors referred to above, the NRD is thus given by the Hamming Distance between a vector of zeros and the vector generated for a data item. However, in the neuron space, the process of mapping activation values to either being inside or outside of the MFR is a binary decision and may result in a loss of information. This can be understood with reference to Figure 12, which shows the activation values 1201, 1203 for two test data items (Image 1 and Image 2) relative to the 2D MFR defined by the training data. Here, it can be seen that both Image 1 and Image 2 have an NRD of 2, as in each case there are two neurons whose activation values lie outside the MFR. However, the activation values of the network when using Image 1 as the input lie relatively close to the MFR, whereas the activation values seen when Image 2 is input to the network lie much further outside the MFR. This difference in the relative distances of the activation values from the MFR is not captured by the NRD.

In order to reduce this loss in information, other measures of dissimilarity are proposed here. In one embodiment, the degree of dissimilarity may be based on a difference between the activation values of neurons and the closest MFR boundary. One example of such a measure may be referred to as the “pseudo Manhattan Neuron Region Distance” (pMNRD). This is based upon the Manhattan distance measure, which defines the distance d between two n-dimensional vectors p and q as:

d(p, q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|

The pMNRD is defined as the sum of the absolute differences between each neuron’s test activation and the neuron’s closest major function region boundary. The pMNRD can be normalised in a number of ways. One possibility is to divide the pMNRD by the median of the lengths of all the neuron-wise 1D MFRs in the network. Another possibility is to divide the absolute difference calculated for each neuron by the length of that neuron’s MFR (prior to summation). For example, a neuron whose activation value lies a single unit outside of a 100-unit long MFR is indicating less dissimilarity than one whose activation value lies a single unit outside of a 0.5-unit long MFR.
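A sketch of the pMNRD is given below. It assumes that a neuron whose activation lies inside its MFR contributes zero, and that activations outside the MFR contribute their distance to the nearest boundary; the optional normalisation shown divides each per-neuron distance by the length of that neuron’s MFR prior to summation, which is one of the options described above.

    import numpy as np

    def pseudo_manhattan_nrd(test_activation, mfr_min, mfr_max, normalise=False):
        below = np.clip(mfr_min - test_activation, 0.0, None)   # distance below the lower bound
        above = np.clip(test_activation - mfr_max, 0.0, None)   # distance above the upper bound
        distances = below + above                               # zero for activations inside the MFR
        if normalise:
            lengths = np.maximum(mfr_max - mfr_min, 1e-12)      # per-neuron MFR lengths
            distances = distances / lengths
        return float(distances.sum())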

In another embodiment, the information lost in the discretisation process can be preserved by measuring the Mahalanobis Distance. The Mahalanobis Distance is a long-established measure of the distance of a point from a distribution. In a single dimension, it is possible to calculate the distance of a point from a distribution in terms of the number of standard deviations from the mean. The Mahalanobis Distance is a multi-dimensional generalisation of this concept. Specifically, the Mahalanobis Distance of a point x from a set of observations with mean m and covariance matrix S can be calculated as:

M(x) = (x − m)^T S^-1 (x − m)

The Mahalanobis distance takes into account the variances of the training activation values generated for each neuron. It also reflects the relationship between activation values output for each pair of neurons (each considered as a variable) in that covariances are found for each pair and feature in the distance definition. The distance effectively standardises the raw Euclidean distance, and can be measured in neuron feature space (the space for which each axis corresponds to the activation values output by a particular neuron).
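A sketch of measuring the Mahalanobis Distance in neuron feature space is given below; the small regularisation term added to the covariance matrix is an assumption, included to guard against singularity.

    import numpy as np

    def mahalanobis_dissimilarity(test_activation, training_activations, eps=1e-6):
        m = training_activations.mean(axis=0)             # mean of the training distribution
        S = np.cov(training_activations, rowvar=False)    # covariance in neuron feature space
        S = S + eps * np.eye(S.shape[0])                  # regularisation (assumption)
        d = test_activation - m
        return float(d @ np.linalg.solve(S, d))           # (x - m)^T S^-1 (x - m), as above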

The measures listed above, including the NRD, may be calculated in a finer-grained manner with respect to the training distribution of activation values. Such an approach would partition the training data set into classes, or clusters. For example, by recording the network MFRs for each partition, the NRD between a data item and each of the partitions would be found in the manner described above for the entire training data set. A minimal NRD would then be returned, namely the member of the set of per-partition NRDs with the lowest value. The minimal NRD would be taken as the dissimilarity of the data item to the training data.

Clusters in feature space could be derived using a standard ‘k-means’ clustering algorithm based on the Euclidean distance, or on some other measure such as the Manhattan distance (there is evidence that the Manhattan distance might have advantages over the Euclidean in high-dimensional space).
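A sketch of this finer-grained, per-cluster variant is shown below, using the standard k-means implementation from scikit-learn; the number of clusters is an arbitrary illustrative choice and the function name is hypothetical.

    import numpy as np
    from sklearn.cluster import KMeans

    def minimal_nrd(test_activation, training_activations, n_clusters=5):
        # Partition the training activations into clusters, record an MFR per cluster,
        # and return the lowest NRD of the test item over all clusters.
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(training_activations)
        nrds = []
        for k in range(n_clusters):
            cluster = training_activations[labels == k]
            lo, hi = cluster.min(axis=0), cluster.max(axis=0)
            nrds.append(int(((test_activation < lo) | (test_activation > hi)).sum()))
        return min(nrds)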

It will be appreciated that setting a threshold on what is an acceptable value for a measure requires a heuristic and prior knowledge of the system. While the FNRD can help to address this by normalising the NRD values by the total number of neurons, there is no guarantee that different networks will show similar behaviour in their FNRD values. Therefore, in some embodiments, a probabilistic approach may be preferred. The data items in a training data set, when supplied to a network, produce activation values for each neuron in a network. These activation values are measurable properties since neurons have observable characteristics. The set of activation values generated by each training data item can be expressed as a point in neuron feature space; where there is more than one data item, a distribution of points can be represented in neuron feature space. The distribution of these points in neuron feature space may be modelled by means of a parameterised probability distribution. The value returned by the corresponding probability density function for a test data item mapped to the feature space may be used as a measure of dissimilarity. In particular, Inductive Conformal Prediction (ICP) may be implemented, due to the high dimensionality of the neuron space and number of training examples involved with modern machine learning tasks. ICP requires a nonconformity function and a calibration set. The nonconformity function can be any function which measures the difference between two points. The NRD and FNRD will be unsuitable for this purpose since by definition, the training data will always have a value of zero. However, the Mahalanobis Distance may be suitable.
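A minimal sketch of the ICP step is given below, assuming that a nonconformity score (for example the Mahalanobis Distance above) has been computed for every item in a held-out calibration set; the p-value returned is small when the test item is unusual relative to the training distribution.

    import numpy as np

    def icp_p_value(test_score, calibration_scores):
        # Standard inductive conformal p-value: the fraction of calibration scores
        # that are at least as nonconforming as the test item's score.
        calibration_scores = np.asarray(calibration_scores, dtype=float)
        return float((np.sum(calibration_scores >= test_score) + 1) / (len(calibration_scores) + 1))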

For the measures described thus far, which return dissimilarity values for an individual data item, a single data set can be used to produce a CRF. As discussed above with reference to Figure 2, if a single test data set is available then it can be divided into sub-data sets on the basis of the data item dissimilarities. Performance against a suitable dissimilarity metric, that is to say one which may be applied to a plurality of data items, can then be obtained on the basis of these sub-data sets.

The only requirement for implementing the NRD measure is access to a trained ANN. As an example, the NRD measure may be implemented in the Python programming language around an ANN built in Keras with a Tensorflow backend. Once the network is trained, the Keras API provides access to the outputs of each layer of the network when it is presented with input data. To record the boundaries of the MFR, the entire training data set is presented to the network in a single batch. The program then iterates through each layer of the network. The output of the layer will be a tensor of 1 rank greater than the layer weights, where the extra rank occurs because each input data item will produce a different output. Thus, for a layer with weights of dimensions N x M presented with K input data items (e.g. images), the output will have dimensions K x N x M. The maximum and minimum values along axis 0 are recorded, so that there are N x M maximums and minimums. It should be noted that for ANNs, layers often have dimensions of N x M x L, in which case the output dimensions are K x N x M x L. The next stage involves calculating which test inputs produce activations outside the MFR on particular neurons. With K testing inputs, there will again be K x N x M outputs on each layer. The Tensorflow backend allows a comparison operation of the N x M maximums/minimums to be broadcast across the entire K x N x M output data. This results in a K x N x M result with a True value for any element which is larger/smaller than the maximum/minimum respectively. A logical OR operation is performed with these two output tensors, resulting in a tensor of all of the corner values with dimensions K x N x M. The corner values tensor can then be summed over all axes other than axis 0 (the batch axis) to find the NRD of each image, giving K scalars.
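A condensed sketch of such an implementation is given below, assuming a trained tf.keras model and NumPy arrays of input data; it records the per-element MFR bounds for every layer from a single pass of the training set and then counts, per test item, the activations falling outside those bounds. Function names are illustrative and error handling is omitted.

    import numpy as np
    import tensorflow as tf

    def layer_output_model(model):
        # A model returning the outputs of every layer of the trained network.
        return tf.keras.Model(inputs=model.input,
                              outputs=[layer.output for layer in model.layers])

    def record_mfr_bounds(model, training_data):
        # Present the training set and record the element-wise min/max over the batch axis.
        activations = layer_output_model(model).predict(training_data)
        return [(a.min(axis=0), a.max(axis=0)) for a in activations]

    def nrd_per_item(model, test_data, mfr_bounds):
        # Broadcast comparison of every layer output against the recorded bounds,
        # then sum over all non-batch axes to give one NRD value per test item.
        activations = layer_output_model(model).predict(test_data)
        nrd = np.zeros(len(test_data), dtype=int)
        for a, (lo, hi) in zip(activations, mfr_bounds):
            outside = (a < lo) | (a > hi)
            nrd += outside.reshape(len(test_data), -1).sum(axis=1)
        return nrd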

The Tensorflow backend provides an efficient implementation of tensor operations which can be parallelised on a GPU. As a result, this implementation has a very short execution time.

A slower serialised implementation is also possible which makes iterative comparisons for each individual neuron for each item of test data. Iterating through large amounts of data is slow, however, and does not take advantage of the parallel GPU processing available. Thus, a parallel implementation may be preferred.

It will be appreciated that the NRD need not be calculated for the entire network. The NRD may be calculated for a subset of the network, such as one or more individual layers, for example. The NRD may be determined for at least one layer towards the end of the network. Doing so may improve the efficacy of the dissimilarity measure since it is possible that networks discriminate more significantly (classes are separated to a greater extent in feature space) towards the final layers of a network.

The NRD and related measures may be used to specify, verify and predict performance for a range of ANN forms and functions. Such measures may be used to specify, verify and predict performance for ANNs which instantiate regression functions rather than classification functions, where continuous values are output by the network rather than labels.

One advantage to specifying and verifying the form of CRFs which map performance to dissimilarity measures based on neuron feature space, is that they capture the network response to universal modifications applied at the signal level. For example a CRF might be specified for required performance as a function of additive noise, or as a function of some filter applied to the signal. However, specifying CRFs for multiple signal transforms may become difficult because of the large number of possible transforms, or the large number of combinations of transforms (for which multivariate CRFs would be employed). The representation of data items in neuron feature space is downstream of signal transforms, and so multiple or combined signal transforms can all be represented in a common frame of reference. Dissimilarities can then be measured in this frame of reference and tied to performance.

A case can be considered in which the machine learning algorithm comprises an ANN classifier. This network, whose generalisation capability is to be specified and verified, and for which a CRF is to be established and used for performance prediction, will be referred to as ‘the ANN classifier’. There may then be a second network which can contribute to these procedures and processing. The second network may be referred to as ‘the reference network’.

The reference network may take the form of a network that has been highly and diversely trained to classify data items, using a very large number of training examples, relative to the number used when training most networks (and including the ANN classifier). The reference network may have a substantial number of layers, perhaps having the same architecture as one of the Visual Geometry Group (VGG) set of deep convolutional neural networks (K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, International Conference on Learning Representations, 2015). Such a network will extract features, like any ANN, and input data items will be mapped to a point in a feature space, where the axes of this space correspond to the activation values output by individual neurons. In general, the features extracted by the two networks will differ, as will the number of features. In other words the same input data item will be mapped to different neuron activation values within the two networks. The features extracted by highly-trained networks may be particularly effective for a range of processing tasks (see Noa Garcia, George Vogiatzis, “Learning Non-Metric Visual Similarity for Image Retrieval”, Elsevier Image and Vision Computing Journal, Vol. 82, Sept. 2017). The reference network may be used in a number of ways. First, it may be employed to measure the dissimilarity between any training data set and test data set (subject to the data items being suitable for input into the network). Secondly, it may be used to calibrate the dissimilarity values produced by the ANN classifier. Thirdly, it may be used in conjunction with the ANN classifier for performance prediction.

With reference to the first use, a training data set may be used to train the ANN classifier. A test data set may be collated, and both the aforementioned training data set and the test data set are supplied to the pre-trained reference network as input. Each data item will then activate the reference network neurons. An MFR may be defined for the reference network on the basis of the training data set activations. The same dissimilarity measures described above, such as the NRD for example, can then be applied to measure the dissimilarity between the two data sets in the reference network neuron feature space. This dissimilarity can then be associated with the performance of the ANN classifier for the test data set. By repeating this procedure for a number of test data sets, a CRF can be generated which maps dissimilarity measured in reference network space to ANN classifier performance.

Due to the nature of the features extracted by the reference network, as discussed, dissimilarities measured in reference network space may have advantageous properties. First, the nature of this highly-trained network’s feature space might yield dissimilarity measurements that map to (ANN classifier) performance in a more deterministic, less statistically scattered manner than for such measurements made in ANN classifier neuron space. Therefore CRFs generated by regression from performance versus dissimilarity relationships which are based on reference network measurements might be more effective for online performance prediction. Secondly, the dissimilarities produced by means of the reference network might be consistent with human judgement of semantic data set differences, especially if measures are only applied to deeper network layers - that is to say solely based on the activation values of neurons closer to the output layer of the network (see Richard Zhang, et al. “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric”, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition and Noa Garcia, George Vogiatzis, “Learning Non-Metric Visual Similarity for Image Retrieval”, Image and Vision Computing, April 2019). As an example, a “semantic distance” may reflect the presence of different weather conditions seen in images of the same overall scene. Producing objective measures of the semantic differences between data sets is challenging but of great practical significance, since real-world image data sets might well vary in many complex ways, and are less likely to differ along a single dimension. Dissimilarities which reflect semantic differences between data sets may be used to specify and verify network classification performance as a function of the semantic differences between data sets; CRFs can be generated to specify and measure how well networks generalise to semantically different data.

It might be the case that the ANN classifier and the reference network do not differ significantly in terms of their architecture, or with regard to the nature of the data sets used for their training. It would therefore follow that the ANN classifier would yield dissimilarity measurements, made in ANN classifier neuron space, which have some of the beneficial attributes of the reference network measurements, such as sensitivity to semantic changes. However, the reference network, as its name suggests, is intended to be stable, and its behaviour and properties to have been well-documented, whilst the ANN classifier is being trained de novo, and dynamically.

The dissimilarities generated by the reference network may themselves be used in the same manner as the dissimilarities returned by the ANN classifier. The dissimilarities can be used to specify CRFs, for example by stating how the ANN classifier performance should generalise with respect to the semantic differences between test and training data sets. The dissimilarities can also be used to measure CRFs, for example to assess how the ANN classifier performance does generalise with respect to semantic differences between test and training data sets. Finally, the dissimilarities generated by the reference network may be used for performance prediction. For example, a CRF that captures ANN classifier performance as a function of dissimilarities measured by the reference network can be used to predict online performance with the reference network itself operating online. The performance of the ANN classifier may also be predicted by exploiting both dissimilarities obtained from the ANN classifier feature space, and dissimilarities returned for the reference network feature space.

It will be understood that in the examples discussed above, the reference network is not being used for classification, but simply to measure the dissimilarity between data sets. It will be further understood that there is a key difference between the dissimilarity values returned for the ANN classifier and those returned for the reference network: since it is the ANN classifier, and not the reference network, that is used for classification, only dissimilarity values returned for the ANN classifier are measured in the feature space in which classification is performed. The ANN classifier output depends on how features are extracted, and as a consequence how data items are represented in neuron feature space and then mapped to an output label. Therefore, dissimilarities between test data items and the training data set as measured in ANN classifier feature space directly gauge the classification and generalisation challenge that the ANN classifier is set. It therefore follows that CRFs for which the inputs are dissimilarities measured with respect to the ANN classifier have a specialised role to play in specifying and verifying ANN classifier generalisation capabilities.

The second listed use of the reference network is its use for calibrating dissimilarity values produced by the ANN classifier. Since the reference network has been highly trained, features extracted by the reference network may have properties that prove of versatile use, and hence the reference network can be used as a point of reference to calibrate any other ANN classifier. The reference network is intended to be stable and its behaviour to have been extensively charted. For example, its use in producing an objective measure for the semantic differences between data sets would have been thoroughly investigated and documented. In turn, the relationship between such objective semantic measures and the perceived similarity as judged by human subjects may have been determined (see B. E. Rogowitz, et al. “Perceptual Image Similarity Experiments”, Proc. IS&T/SPIE Conf. Human Vision Electronic Imaging III, San Jose, CA, July 1998, pp. 576-590).

The dissimilarities between a training data set and a number of test data sets as measured by an ANN classifier may be mapped to the dissimilarities as measured by the reference network. This mapping, in view of the preceding discussion, can be considered as a calibration. For example if the reference network dissimilarity indicates a degree of semantic difference between two data sets, then the corresponding dissimilarity returned by the ANN classifier, for the same two data sets, reflects the same degree of semantic difference. This calibration can be used to extend the interpretation of the dissimilarity values returned by the ANN classifier.
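Such a calibration mapping could, as one possibility, be realised by fitting a simple interpolating function to the paired measurements obtained for the same data sets; all names below are hypothetical and the piecewise-linear form is an assumption.

    import numpy as np

    def fit_calibration(classifier_dissims, reference_dissims):
        # Piecewise-linear mapping from dissimilarities measured in ANN classifier
        # feature space to the corresponding reference network dissimilarities,
        # fitted from paired measurements on the same training/test data set pairs.
        xs = np.asarray(classifier_dissims, dtype=float)
        ys = np.asarray(reference_dissims, dtype=float)
        order = np.argsort(xs)
        xs, ys = xs[order], ys[order]
        return lambda d: float(np.interp(d, xs, ys))

    # calibrate = fit_calibration(classifier_values, reference_values)
    # calibrate(some_classifier_dissimilarity) -> implied reference network dissimilarity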

The third listed use of the reference network is to predict ANN classifier performance in tandem with measurements made with respect to the ANN classifier. CRFs that map dissimilarity to ANN classifier performance may be established for dissimilarities measured with respect to both the ANN classifier and the reference network. One CRF would map dissimilarities measured with respect to the ANN classifier to ANN classifier performance; the other CRF would map dissimilarities measured with respect to the reference network to ANN classifier performance. Both networks could operate online. The two performance values predicted by the two networks may be mapped to a single predicted performance in some manner, for example by a weighted average. Another approach would be to establish a bivariate CRF where one input variable would be the dissimilarity measured with respect to the ANN classifier, and the second input variable would be the dissimilarity measured with respect to the reference network. Alternatively, either the ANN classifier or reference network could be used alone for this purpose.
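The weighted-average combination of the two predictions might be sketched as follows; the weight w and the two CRF callables are illustrative assumptions rather than part of any specific embodiment.

    def combined_performance_prediction(z_classifier, z_reference,
                                        crf_classifier, crf_reference, w=0.5):
        # crf_classifier maps a dissimilarity measured in ANN classifier space to a
        # predicted performance; crf_reference does the same for reference network space.
        y1 = crf_classifier(z_classifier)
        y2 = crf_reference(z_reference)
        return w * y1 + (1.0 - w) * y2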

The use of the reference network in combination with the ANN classifier (machine learning algorithm) can be further understood with reference to Figures 13 to 17. As before, a performance requirement for the ANN classifier may be expressed as a statement such as: ‘the algorithm must perform at a level greater than or equal to Y for data sets of dissimilarity equal to or below a threshold Z’. The requirement may be specified in terms of the dissimilarity between a test data set and the training data set as measured in reference network neuron feature space. In the case where the input data comprises image data, the particular form of measurement employed will return an effective measure of image visual content dissimilarity, which has been established by extensive reference network experimentation. The measure of dissimilarity between the test data set and the training data set as determined by the reference network can be quantified using arbitrary units. For example, the performance requirement may specify that the ANN classifier should generalise to imagery which is semantically at a distance of 4 units from the training data set, with the value of 4 being selected on the basis of tables drawn up during the reference network experimentation. In this case, the performance requirement R1 may be stated in the form that “ANN classifier accuracy must be greater than or equal to 0.8 for a visual dissimilarity of 4 units as measured by the reference network.”

Referring to Figure 13, a number of test data sets (1301, 1303, 1305) may be collated, where those test data sets are judged to be progressively dissimilar in visual content to the training data set used to train the ANN classifier. The process continues by verifying the performance requirement R1 using what is termed here a Dynamic Testing Procedure DT1. Here, the visual dissimilarity between each test data set and the training data set is measured using the reference network; it is assumed that there is some formula which returns visual dissimilarity for each test data set based on reference network measurements. The ANN classifier performance on those test data sets is also determined as described above.

In addition to measuring the performance of the ANN classifier, the median FNRD in the ANN classifier space may also be determined. The median FNRD may be relevant if extending the performance requirements as discussed in more detail below. To recap, the FNRD for a single data item will be zero if all neuron activations for that data item lie within the network MFR of the training data set, and the FNRD for a single item will be 1 if all neuron activations for that data item lie outside of the network MFR of the training data set. Figure 14 shows a table of example results for the dynamic testing DT1. It can be seen that the performance requirement R1 is verified; in the case where the visual dissimilarity is 4 (row 3 of the table), the ANN classifier performance exceeds the minimum required level of 0.8 with a measured value of 0.85.

Following the above, a further performance requirement R2 can be specified to provide an extra factor of safety to augment requirement R1. The theory behind R2 can be summarised as follows:

1) The median FNRD measures the dissimilarity between a test data set and the data set used to train the ANN Classifier. The median FNRD is measured in the ANN Classifier neuron feature space. Therefore, the median FNRD directly gauges the degree to which the ANN classifier must generalise in order to perform satisfactorily when classifying items in the test data set.

2) Columns 2 and 3 in the table of Figure 14 provide a calibration for the median FNRD, in the sense that a) the median FNRD (column 2) is mapped to the visual dissimilarity measured by the reference network (column 3), and that b) it is presumed that the reference network has been subject to experimental analysis which is sufficiently extensive to confer upon visual dissimilarity the status, at least to some degree, of a standard measure. This allows further requirements on the median FNRD to be drawn up which take corresponding visual dissimilarity values into account.

3) Still referring to Figure 14, it is seen that the requirement R1 has been verified, and it has been established that a visual dissimilarity of 4 yields an ANN classifier accuracy of 0.85.

The test data set that returned a visual dissimilarity of 4 generated a median FNRD of 0.3. Note that the performance for test data sets which lie beyond a median FNRD of 0.3 from the training data set has not been established at this stage; verifying R1 did not necessitate the assessment of test data sets occupying this region of feature space.

4) If the ANN classifier performance is now specified and verified for test data sets that do lie beyond a median FNRD of 0.3 from the training data set, further confidence can be gained in the classifier’s generalisation performance beyond that provided by R1. A margin of safety relative to R1 is established by verifying performance for further test data sets whose dissimilarities to the training data set, as measured directly in ANN classifier feature space, surpass the dissimilarities recorded for the test data sets needed to verify R1.

5) In contrast to R1, which was stated in terms of visual dissimilarity, performance is now specified and verified in terms of the median FNRD, the dissimilarity as measured in ANN classifier neuron space itself. To the extent that the median FNRD is effective as a dissimilarity measure, the relationship between ANN classifier performance and the median FNRD might prove more reliable than the relationship between ANN classifier performance and visual content dissimilarity as measured in the reference network.

The requirement R2 can be drawn up based on the results of the dynamic testing procedure DT1 shown in Figures 13 and 14. The performance requirement R2 may be specified in terms of the performance required for values of the FNRD greater than 0.3. For example, the performance requirement R2 may specify that ANN Classifier accuracy must be greater than or equal to 0.7 for median FNRD values of greater than or equal to 0.3 and less than or equal to 0.6.

Having defined the second performance requirement R2, this can be verified by a further Dynamic Testing Procedure DT2 as shown in Figure 15. Here, a further set of appropriate test data sets 1501, 1503, 1505 are collated and the median FNRD of those test data sets is measured, together with the ANN Classifier performance. Figure 16 shows example results showing the requirement R2 has been verified.

The results presented in the tables for DT1 and DT2 (Figures 14 and 16, respectively) can be plotted as example CRFs, as shown in Figure 17. The graph in Figure 17 shows two CRFs: 1) accuracy as a function of visual dissimilarity between test data sets and the training data set, and 2) accuracy as a function of dissimilarity between test data sets and the training data set as measured by the median FNRD. Here, the points on the line 1701 correspond to test data sets 1301, 1303, 1305 (as shown in Figures 13 and 14) and the points on the line 1703 correspond to test data sets 1501, 1503, 1505 as shown in Figures 15 and 16. It is noted that real-world data may return statistical relationships between the median FNRD, visual dissimilarity and accuracy. CRFs could be estimated from such statistical relationships by regression methods. A bivariate CRF, a function of median FNRD and visual dissimilarity (as measured by the reference network), may be used for joint online performance prediction.

It will be appreciated that the reference network need not always be trained on more data items than the machine learning algorithm (ANN classifier). However, in general, the reference network will be trained by means of a very large and varied dataset. The reference network is characterised, as its name suggests, by a second attribute: the dissimilarity measures between datasets that it returns will have been extensively recorded, analysed and documented for a large number of datasets. Reference tables might exist that document the range of dissimilarities that tend to be returned between certain types of datasets.

Implementations of the subject matter and the operations described in this specification can be realized in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be realized using one or more computer programs, i.e., one or more modules of computer program instructions, encoded on computer storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially generated propagated signal. The computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).

While certain embodiments have been described, these embodiments have been presented by way of example only and are not intended to limit the scope of the invention. Indeed, the novel methods, devices and systems described herein may be embodied in a variety of forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the invention. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the invention.




 