Title:
IMPROVED CLASSIFICATION METHODS FOR MACHINE LEARNING
Document Type and Number:
WIPO Patent Application WO/2024/044815
Kind Code:
A1
Abstract:
There is provided a method to train a machine learning classifier, the machine learning classifier comprises a feature extraction module, a global classifier and an interpretable classifier. Where the feature extraction module is configured to extract a first set of features from input data. The global classifier is configured to extract a second set of features from the first set of features, and generate a classification output based on the first set and second set of features. The interpretable classifier is configured to: extract a third set of features from the first set of features, determine a plurality of prototypes based on the third set of features, generate similarity maps based on similarities between the plurality of prototypes and one or more portions of the input data, assign similarity scores based on the generated similarity maps; and generate a classification output based on the assigned similarity scores. The training method comprises distilling knowledge from the global classifier to the interpretable classifier to improve accuracy of the interpretable classifier and the machine learning classifier.

Inventors:
WANG CHONG (AU)
CHEN YUANHONG (AU)
LIU YUYUAN (AU)
TIAN YU (AU)
LIU FENGBEI (AU)
MCCARTHY DAVIS (AU)
ELLIOTT MICHAEL (AU)
FRAZER HELEN (AU)
CARNEIRO GUSTAVO (AU)
Application Number:
PCT/AU2023/050836
Publication Date:
March 07, 2024
Filing Date:
August 29, 2023
Assignee:
ST VINCENTS INSTITUTE OF MEDICAL RES (AU)
UNIV ADELAIDE (AU)
ST VINCENTS HOSPITAL MELBOURNE LTD (AU)
BREASTSCREEN VICTORIA (AU)
International Classes:
G06N3/08; A61B5/00; A61B6/00; G06N3/045; G06N3/0464; G06T7/00; G16H30/40; G16H50/20
Attorney, Agent or Firm:
FPA PATENT ATTORNEYS PTY LTD (AU)
Claims:
CLAIMS

1. A method to train a machine learning classifier, the machine learning classifier comprising: a feature extraction module configured to extract a first set of features from input data; a global classifier configured to: extract a second set of features from the first set of features, and generate a classification output based on the first set and second set of features; an interpretable classifier configured to: extract a third set of features from the first set of features, determine a plurality of prototypes based on the third set of features, generate similarity maps based on similarities between the plurality of prototypes and one or more portions of the input data, assign similarity scores based on the generated similarity maps; and generate a classification output based on the assigned similarity scores; wherein the method comprises distilling knowledge from the global classifier to the interpretable classifier to improve accuracy of the interpretable classifier and the machine learning classifier.

2. The method of claim 1, wherein distilling knowledge from the global classifier to the interpretable classifier includes: providing the classification output of the global classifier to the interpretable classifier; determining a knowledge distillation loss function based on the classification output of the global classifier and the classification output of the interpretable classifier; and minimizing the knowledge distillation loss function by modifying parameters of the interpretable classifier.

3. The method of claim 2, further comprising: determining a loss function for the machine learning classifier based on the knowledge distillation loss function, a loss function of the interpretable classifier and a loss function of the global classifier.

4. The method of claim 3, further comprising minimizing the loss function for the machine learning classifier.

5. The method of any one of the preceding claims, further comprising: determining distances between the plurality of prototypes and a plurality of portions of the input data; and for each prototype, selecting a portion of the input data from a subset of the plurality of portions of the input data that has the lowest distance from the prototype and is not selected for any other prototype of the plurality of prototypes.

6. The method of any one of the preceding claims, further comprising: training the global classifier and the feature extraction module using a set of training image data.

7. The method of claim 6, further comprising terminating the training of the global classifier and the feature extraction module once an area under a curve (AUC) exceeds 95%.

8. The method of claim 7, further comprising training the interpretable classifier after terminating the training of the global classifier and the feature extraction module.

9. The method of claim 8, wherein during training of the interpretable classifier, the feature extraction module communicates the first set of features to the global classifier and the interpretable classifier substantially simultaneously.
10. A machine learning classifier, comprising: a feature extraction module comprising one or more first convolution and pooling layers to extract a first set of features from input data; a global classifier comprising: one or more second convolution and pooling layers to extract a second set of features from the first set of features, and one or more fully connected layers to classify the input data based on the first set and second set of features; an interpretable classifier comprising: one or more third convolution layers to extract a third set of features from the first set of features, a prototype layer configured to determine a plurality of prototypes based on the third set of features, generate similarity maps based on similarities between the plurality of prototypes and one or more portions of the input data, and assign similarity scores based on the generated similarity maps; and a fully connected layer configured to classify the input data based on the assigned similarity scores; wherein knowledge from the global classifier is distilled to train the interpretable classifier.

11. The machine learning classifier of claim 10, wherein distilling knowledge from the global classifier to the interpretable classifier includes: providing the classification output of the global classifier to the interpretable classifier; determining a knowledge distillation loss function based on the classification output of the global classifier and the classification output of the interpretable classifier; and minimizing the knowledge distillation loss function by modifying parameters of the interpretable classifier.

12. The machine learning classifier of claim 11, wherein distilling knowledge from the global classifier to the interpretable classifier further includes: determining a loss function for the machine learning classifier based on the knowledge distillation loss function, a loss function of the interpretable classifier and a loss function of the global classifier.

13. The machine learning classifier of claim 12, wherein distilling knowledge from the global classifier to the interpretable classifier further includes minimizing the loss function for the machine learning classifier.

14. The machine learning classifier of any one of claims 10-13, wherein diversity of the interpretable classifier is increased by: determining distances between the plurality of prototypes and a plurality of portions of the input data; and for each prototype, selecting a portion of the input data from the subset of the plurality of portions of the input data that has the lowest distance from the prototype and is not selected to update any other prototype of the plurality of prototypes.

15. The machine learning classifier of any one of claims 10-14, wherein the global classifier and the feature extraction module are initially trained using a set of training image data.

16. The machine learning classifier of claim 15, wherein training of the global classifier and the feature extraction module is terminated once an area under a curve (AUC) exceeds 95%.

17. The machine learning classifier of claim 16, wherein the interpretable classifier is trained after the training of the global classifier and the feature extraction module is terminated.

18. The machine learning classifier of claim 17, wherein during training of the interpretable classifier, the feature extraction module communicates the first set of features to the global classifier and the interpretable classifier substantially simultaneously.
19. A method for increasing prototype diversity of an interpretable classifier configured to determine a plurality of prototypes from input image data and assign portions of the input image data to the plurality of prototypes, the method comprising: determining distances between the plurality of prototypes and a plurality of portions of the input image data; for each prototype, storing identifiers of at least a subset of the plurality of portions of the input image data in order of ascending distances from the prototype; updating each prototype sequentially, and for each selected prototype selecting a portion of the subset of the plurality of portions of the input image data that has the lowest distance from the selected prototype and is not selected to update any other previously selected prototype; and recording, for each prototype, the identifier of the selected portion of the subset of the plurality of portions of the input image data to indicate that the selected portion has been used already and cannot be selected for any remaining prototypes.

20. A method for using the machine learning classifier of any one of claims 10-18 for classifying images, the method comprising: receiving an input image; extracting, by the feature extraction module, a set of low level features from the input image; generating, by the global classifier, a classification output based on the set of low level features and a first high level set of features extracted by the global classifier; communicating, by the global classifier, the classification output to the interpretable classifier; extracting, by the interpretable classifier, a second set of high level features based on the set of low level features; generating, by the interpretable classifier, one or more similarity maps based on similarities between one or more portions of the input image and a plurality of prototypes; assigning, by the interpretable classifier, one or more similarity scores based on the generated one or more similarity maps; and generating, by the interpretable classifier, a classification output based on the assigned similarity scores and the classification output of the global classifier.

21. The method of claim 20, wherein the input image is a mammogram image and the classifier is configured to classify the mammogram image as being cancerous or non-cancerous.

22. The method of any one of claims 20-21, wherein the classifier is further configured to output one or more prototypes, wherein each of the one or more prototypes has a highest similarity score to one or more portions of the input image.
Description:
IMPROVED CLASSIFICATION METHODS FOR MACHINE LEARNING

TECHNICAL FIELD

[0001] The present disclosure generally relates to the field of machine learning and, in particular, to improved machine learning classifiers.

BACKGROUND

[0002] Machine learning (ML) is an umbrella term for a set of techniques and tools that assist computers to learn and adapt on their own and then use this knowledge to perform real-world tasks. By learning a pattern from example inputs, an ML algorithm is able to predict and perform tasks based on the learned pattern rather than on predefined program instructions. These ML techniques are important in several problem domains where applying predefined program instructions is not possible.

[0003] One example task ML may be used for is classification. A classification task requires the use of ML algorithms to learn how to assign a class label to examples from the problem domain. For example, an ML algorithm may perform the classification task of determining whether an email is "spam" or not "spam", or processing a mammogram image and determining whether the image includes a cancerous tumour or not. There are many different types of classification tasks that ML may be used for, and specialised modelling techniques may be used for each.

SUMMARY

[0004] According to a first aspect of the present disclosure, there is provided a method to train a machine learning classifier. The machine learning classifier includes: a feature extraction module configured to extract a first set of features from input data; a global classifier configured to: extract a second set of features from the first set of features, and generate a classification output based on the first set and second set of features; an interpretable classifier configured to: extract a third set of features from the first set of features, determine a plurality of prototypes based on the third set of features, generate similarity maps based on similarities between the plurality of prototypes and one or more portions of the input data, assign similarity scores based on the generated similarity maps; and generate a classification output based on the assigned similarity scores. The method comprises distilling knowledge from the global classifier to the interpretable classifier to improve the accuracy of the interpretable classifier and the machine learning classifier.

[0005] According to a second aspect of the present disclosure, there is provided a machine learning classifier. The classifier includes: a feature extraction module comprising one or more first convolution and pooling layers to extract a first set of features from input data; a global classifier comprising: one or more second convolution and pooling layers to extract a second set of features from the first set of features, and one or more fully connected layers to classify the input data based on the first set and second set of features; an interpretable classifier comprising: one or more third convolution layers to extract a third set of features from the first set of features, a prototype layer configured to determine a plurality of prototypes based on the third set of features, generate similarity maps based on similarities between the plurality of prototypes and one or more portions of the input data, and assign similarity scores based on the generated similarity maps; and a fully connected layer configured to classify the input data based on the assigned similarity scores.
Knowledge from the global classifier is distilled to train the interpretable classifier.

[0006] According to a third aspect of the present disclosure, there is provided a method for increasing prototype diversity of an interpretable classifier configured to determine a plurality of prototypes from input image data and assign portions of the input image data to the plurality of prototypes. The method includes: determining distances between the plurality of prototypes and a plurality of portions of the input image data; for each prototype, storing identifiers of at least a subset of the plurality of portions of the input image data in order of ascending distances from the prototype; updating each prototype sequentially, and for each selected prototype using a portion of the subset of the plurality of portions of the input image data that has the lowest distance and is not used for any other previously selected prototype to update the selected prototype; and recording, for each prototype, the identifier of the selected portion of the subset of the plurality of portions of the input image data to indicate that the selected portion has been selected and cannot be selected for any remaining prototypes.

[0007] According to a fourth aspect of the present disclosure, there is provided a method for using the machine learning classifier of the second aspect for classifying images. The method includes: receiving an input image; extracting, by the feature extraction module, a set of low level features from the input image; generating, by the global classifier, a classification output based on the set of low level features and a first high level set of features extracted by the global classifier; communicating, by the global classifier, the classification output to the interpretable classifier; extracting, by the interpretable classifier, a second set of high level features based on the set of low level features; generating, by the interpretable classifier, one or more similarity maps based on similarities between one or more portions of the input image and a plurality of prototypes; assigning, by the interpretable classifier, one or more similarity scores based on the generated one or more similarity maps; and generating, by the interpretable classifier, a classification output based on the assigned similarity scores and the classification output of the global classifier.

[0008] Further aspects of the present disclosure and embodiments of the aspects summarised in the immediately preceding paragraphs will be apparent from the following detailed description and from the accompanying figures.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] Fig. 1 shows an example convolutional neural network architecture.

[0010] Fig. 2 is a block diagram illustrating a classifier according to aspects of the present disclosure.

[0011] Fig. 3A is a flowchart depicting an example method for training a global classifier.

[0012] Fig. 3B shows example extracted features for an input image.

[0013] Fig. 4 is a flowchart illustrating an example method for training the presently disclosed classifier according to aspects of the present disclosure.

[0014] Fig. 5 is a flowchart illustrating an example method for increasing the diversity of a prototype-based classifier according to aspects of the present disclosure.

[0015] Figs. 6A and 6B show typical non-cancer and cancer prototypes and the corresponding source training images, respectively.

[0016] Fig. 7 shows the interpretable reasoning process of the classifier of Fig. 2 on a cancerous test image.
[0017] Fig. 8 shows precision recall curve test results for a test data set using different classifiers.

[0018] Fig. 9 shows a visual comparison of cancer localisation performed by various classifiers.

[0019] Fig. 10 is a block diagram of a computing system with which the classifier of the present disclosure may be implemented.

[0020] While the invention is amenable to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and are described in detail. It should be understood, however, that the drawings and detailed description are not intended to limit the invention to the particular form disclosed. The intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

DETAILED DESCRIPTION

Overview

[0021] In the field of ML, a classification task refers to a predictive modelling problem where a class label is predicted for a given input. In particular, a classification model that performs a classification task identifies which category an object belongs to and then assigns a class label to the object based on the identified category.

[0022] There are several types of classification tasks, including binary, multi-class, and multi-label. In binary classification, there are only two class labels, and the task for the ML model is to classify an object into one of those two class labels. For example, an ML model performing a binary classification task may predict whether an email is "spam" or "not spam", or whether a radiography image includes a cancerous tumour or not. In multi-class classification, there may be more than two class labels and the task for the ML model may be to classify an object into one of those multiple class labels. An example of this may be an ML model that has to identify a type of animal given an input image. In multi-label classification, there may be multiple class labels and the task of the ML model may be to classify an object under one or more of those multiple class labels. For example, the ML model may be trained to identify one or more animals in an input image.

[0023] An ML model that performs a classification task is often called a 'classifier' and there are a number of different classifiers that use different classification algorithms to train their ML model to perform the classification. These different classifiers have different advantages and disadvantages that may vary depending on a given classification task.

[0024] Further, there are two main approaches to train a classifier – supervised and unsupervised learning. In supervised ML, a classifier learns to predict the correct class label using training data that has previously been classified. For example, to train a spam filter classifier, thousands if not hundreds of thousands of emails labelled as 'spam' or 'not spam' may be fed to the classifier. In this case, the classifier learns to classify new emails based on patterns it has learnt from the previously seen labelled emails. For instance, it may identify similarities between all the emails labelled as 'spam', such as irregular text patterns, misspelled names or email addresses, etc., and determine certain features or patterns to look for in an email to determine whether it is spam or not. In unsupervised ML, on the other hand, the training data is not pre-classified or tagged and there is no guide to a desired output.
Instead, in this type of learning, the classifier identifies hidden patterns or data groupings in the training data without human intervention. It may form two different data groups – 'spam' and 'not spam' – and then try to cluster new emails into one of these groups based on identification of one or more patterns in the emails.

[0025] Some classifiers use artificial neural networks (ANN) or convolutional neural networks (CNN). An ANN or CNN consists of interconnected nodes or neurons in a layered structure that resembles the human brain. The interconnected nodes create an adaptive feedback system in which computers learn from their mistakes and improve continuously.

[0026] In order to do so, each neuron produces an output after receiving one or more inputs. These outputs are then passed to the next layer of neurons, which use them as inputs for their own functions and produce further outputs. Those outputs are passed on to the next layer of neurons, and so on, until every layer of neurons has been considered and the terminal neurons (i.e., the final layer of neurons) have received their input. The terminal neurons then output the final result for the model.

[0027] Fig. 1 shows an example convolutional neural network (CNN) architecture 100 for use in a classification task. The CNN architecture 100 comprises at least the following layers: one or more convolutional layers 102, one or more pooling layers 104, and a fully-connected layer 106. In this example, there are two convolutional layers 102A and 102B and two pooling layers 104A and 104B, which together are used to extract features from input data (e.g., training data or end-use data). The combination of convolutional layers 102 and pooling layers 104 may be referred to as a feature extraction module 108.

[0028] If the classification task is to predict whether an image contains a cancerous tumour or not, the input data will be a set of images, e.g., input image 110. Images are comprised of pixels and therefore an image may be represented as a matrix of pixel values. For example, image 110 may be defined as a 5x5 matrix, where there are five pixels corresponding to each of the height and width of the image. For a colour image there may be a separate matrix for each colour channel. In some examples, there may be a separate matrix for each of the primary colours – e.g., a 5x5 matrix corresponding to each of red, green, and blue. In general, each input image in the training dataset has dimensions $W \times H \times D$, which correspond to pixel width $W$, pixel height $H$ and number of colour channels $D$.

[0029] An image of pixel dimension 5x5 is clearly very small and not practical in a real-world problem. In reality, input images are likely to contain millions of pixels – for example, an 8K image is defined by a matrix of dimension 7680x4320, i.e., over 33 million pixels. The feature extraction module 108 reduces the input images into a form that is computationally easier to process, without losing features that are critical for accurate class prediction.

[0030] In particular, the convolution layers 102 extract features from the input image. The first convolution layer 102A captures, e.g., low-level features from an input image such as edges of objects in the image, colour, etc., while the second convolution layer 102B may capture higher-level features, for example, colour gradients.

[0031] The input for the first convolution layer 102A is the data of image 110.
For example, the image data may be defined as three matrices (for red, green, and blue) of dimension 5x5 – with overall dimension 5x5x3. The first convolution layer 102A utilises a kernel or filter, $K_d$, to perform a first convolution operation. For each colour channel there is a kernel matrix $K_d$, where $d = 1, 2, 3$, that performs the convolution operation. For example, for each colour the kernel may be a 3x3 matrix.

[0032] The convolution operation is performed by a matrix multiplication operation between the kernel and a portion of the image matrix of one channel. The kernel is then shifted by a stride length (the number of rows or columns it is shifted by at a time) and the matrix multiplication operation is performed again with a new portion of the image matrix. In the example of a 5x5 red image matrix with a stride length of one, the kernel takes a total of three positions to parse the total width of the image matrix. Then the kernel returns to the first column and is shifted down by one row. In total, the kernel takes nine positions in order to parse the whole red image matrix. The resulting matrix, or convolved feature matrix, will be a 3x3 matrix corresponding to the red channel.

[0033] The above process is performed for all other channels using the corresponding kernels. In this example, a blue convolved feature matrix and a green convolved feature matrix are also calculated at the first convolution layer 102A. For each colour the convolved feature matrix is a 3x3 matrix.

[0034] The convolved feature matrices for each channel are then combined to form a single feature output. This is the final output of the first convolution layer 102A.

[0035] This feature output is the input for the first pooling layer 104A. The pooling layer 104 is generally responsible for reducing the spatial size of the convolved feature – i.e., the output matrices of the convolution layer. The reduction in dimensionality reduces the computational cost required to process the data.

[0036] There are two types of pooling, namely max pooling and average pooling. Max pooling returns the maximum value from the portion of the image covered by the kernel, whereas average pooling returns the average of all the values from the portion of the image covered by the kernel. Max pooling may also act as a form of noise suppressant, as it discards noisy activations altogether and performs de-noising along with dimensionality reduction. Average pooling, on the other hand, simply performs dimensionality reduction as a noise suppressing mechanism. Max pooling is considered to perform considerably better than average pooling for noise suppression.

[0037] Returning to the example with input image data described by three 5x5 matrices (for red, green and blue), the pooling layer 104A may be a 2x2 max pooling layer which takes the maximum value from each of the four 2x2 sub-matrices of the red convolved feature matrix, forming an output matrix of dimension 2x2. If the pooling layer 104A uses average pooling, then the average value of each of the four 2x2 sub-matrices of the red convolved feature matrix is used to form the output matrix with dimension 2x2. The pooling operation is performed on each channel. In this example, the pooling is performed on the red, green, and blue convolved feature matrices outputted from the first convolution layer 102A.
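The shape arithmetic of this 5x5 example can be reproduced with a short PyTorch sketch (illustrative only and not part of the disclosure; the layer sizes simply follow the example above):

import torch
import torch.nn as nn

# Minimal sketch: a 3x3 kernel with stride 1 over a 5x5 RGB image yields a 3x3
# feature map (per-channel products are summed into one output map), and a 2x2
# max pooling with stride 1 then reduces it to 2x2, as in paragraph [0037].
image = torch.randn(1, 3, 5, 5)                   # batch, channels (R, G, B), height, width
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3, stride=1)
features = conv(image)                            # nine kernel positions -> (1, 1, 3, 3)
pooled = nn.MaxPool2d(kernel_size=2, stride=1)(features)  # four 2x2 windows -> (1, 1, 2, 2)
print(features.shape, pooled.shape)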
[0038] The first convolution layer 102A and the first pooling layer 104A together form a first layer of the CNN architecture 100. Similarly, the second convolution layer 102B and the second pooling layer 104B together form a second layer of the CNN architecture 100. By parsing the input image data in the feature extraction module 108, the input image is converted into a suitable form for ready classification. Depending on the complexities of the images, the number of CNN layers may be increased to capture low-level details even further, but at the cost of more computational power.

[0039] The third component of the CNN architecture 100 is a classification layer 112. This layer typically includes the fully-connected layer 106. The input to the fully-connected layer 106 is the output from the final pooling layer 104 of the feature extraction module 108. The first step of the fully-connected layer 106 is to flatten the input data. Flattening the data refers to the process of taking the data outputted from the feature extraction module 108 and producing a one-dimensional vector containing all the data – for example, a column vector. This vector of data is then used in a feedforward neural network.

[0040] The goal of a feedforward neural network is to approximate some function. For example, consider a classifier that maps an input $x$ to a class $y$. A feedforward network defines a mapping $y = f(x; \theta)$ and learns the values of the mapping parameters $\theta$, i.e., the coefficients of the neural network chosen by the network itself that yield the best function approximation. This type of model is called feedforward because information flows through the function being evaluated from the input $x$, through the intermediate computations used to define $f$, and finally to the output $y$. There are no feedback connections in which outputs of the model are fed back into itself. After passing through the fully connected layers, the final layer determines the probabilities of the input image being in a particular class based on the features extracted from the image, approximating a selected cost function – hence the input is classified.
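As an illustration of the flattening step and the feedforward mapping $y = f(x; \theta)$, the following minimal Python sketch (not part of the disclosure; the layer sizes are arbitrary assumptions) builds a classification layer of the kind shown as layer 112:

import torch
import torch.nn as nn

# Minimal sketch: flatten the pooled feature maps into a one-dimensional
# vector, then map it through fully-connected layers to class probabilities.
features = torch.randn(1, 8, 2, 2)        # assumed output of the final pooling layer
flat = features.flatten(start_dim=1)      # one-dimensional vector per image: (1, 32)
head = nn.Sequential(
    nn.Linear(32, 16),                    # y = f(x; theta); theta is learnt in training
    nn.ReLU(),
    nn.Linear(16, 2),                     # two logits, e.g. cancer / non-cancer
)
probs = torch.softmax(head(flat), dim=1)  # probabilities of each class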
[0041] Although these types of classifiers (often called global classifiers) are usually highly accurate in classifying objects, it is often difficult to know how they arrive at their prediction/outcome. This might not be an issue in certain tasks – such as determining whether an image is of a cat or a dog – as long as the classifier accurately classifies the images. However, in other fields, this lack of interpretability may be an issue. For example, consider the situation where a radiologist receives a mammogram image along with a prediction of breast cancer. The radiologist may view the image and determine that they do not think the image shows any cancerous tumours. In such cases, the radiologist may not trust the output of the classifier.

[0042] To address this, another type of classifier has been created – an interpretable classifier – which has a higher level of interpretability, i.e., an ability to explain to a human what the model is actually doing. In some examples, interpretable models may provide a prediction along with a decision tree for arriving at that prediction. Humans can then easily extract decision rules and understand how and why a certain class label was assigned.

[0043] One particular example of an interpretable classifier uses a prototypical part network and is called ProtoPNet. The ProtoPNet classifier learns by estimating a set of class-specific prototypes. An input image is then classified by evaluating its similarity to one or more of these prototypes.

[0044] For example, the ProtoPNet classifier may be used for performing the classification task of predicting whether a mammogram image has a cancerous tumour or not. During training, the classifier may be fed a large number of weakly labelled mammogram images that indicate whether the image is of a cancerous breast or not. During the training process the classifier learns to predict labels for mammogram images by estimating a set of class-specific prototypes. For example, it may create a number of prototypes based on the size, shape, density, and number of tumours within mammogram images marked as cancerous and non-cancerous. It may then allocate one or more images from the training set for each identified prototype. It can then classify an input mammogram image by evaluating the image's similarity to the prototypes.

[0045] In addition to the basic feature extraction and classification layers, a prototype-based model includes a prototype layer between the convolution and fully-connected layers that is configured to determine prototypes and assign images to the prototypes.

[0046] In particular, the prototype layer receives the output from the convolutional layers and learns $m$ prototypes $P = \{p_j\}_{j=1}^{m}$ with dimension $H_1 \times W_1 \times D$, where $H_1 \le H$ and $W_1 \le W$. In some examples, the depth of each prototype is the same as that of the convolutional output, but the height and width of each prototype may be smaller than those of the whole convolutional output. Therefore, each prototype may be used to represent some prototypical activation pattern in a patch of the convolutional output, which in turn will correspond to some prototypical image patch in the original pixel space. Hence, each prototype $p_j$ can be understood as the latent representation of some prototypical part of some input mammogram image. For example, $p_1$ may correspond to a pea-sized tumour in a left breast and $p_2$ may correspond to a tumour having a particular density.

[0047] For each input image, the prototype layer generates an activation map of similarity scores whose values indicate how strongly a prototype matches a part of the input image. This activation map preserves the spatial relation of the convolutional output, and can be up-sampled to the size of the input image to produce a heat map that identifies which part of the input image is most similar to the learned prototype. The activation map of similarity scores produced by each prototype unit is then reduced using a global max pooling method to yield a single similarity score. This similarity score can be understood to show how strongly a prototypical part is present in some patch of the input image.

[0048] Lastly, the $m$ similarity scores produced by the prototype layer are multiplied by the weight matrix in the fully connected layer to produce the output logits. The output logits may then be normalized to yield the predicted probabilities of a given image belonging to the various classes.
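The path from activation maps to similarity scores to output logits described in paragraphs [0047] and [0048] can be sketched as follows (illustrative only: the tensor sizes are assumptions, and the log-based similarity shown is the one used by the original ProtoPNet):

import torch

# Minimal sketch of a prototype layer with m = 10 prototypes of depth D = 64.
conv_out = torch.randn(1, 64, 7, 7)          # convolutional output: (batch, D, H1', W1')
prototypes = torch.randn(10, 64)             # each prototype is a 1x1xD latent vector

# Squared L2 distance between every prototype and every spatial patch.
diff = conv_out.unsqueeze(1) - prototypes[None, :, :, None, None]
dists = (diff ** 2).sum(dim=2)               # activation-map distances: (1, 10, 7, 7)

# Similarity activation maps: small distance -> high similarity.
sim_maps = torch.log((dists + 1) / (dists + 1e-4))

# Global max pooling reduces each map to a single similarity score, and a
# fully connected layer turns the m scores into class logits.
scores = sim_maps.amax(dim=(2, 3))           # (1, 10)
logits = torch.nn.Linear(10, 2)(scores)
probs = torch.softmax(logits, dim=1)         # predicted class probabilities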
[0049] Accordingly, given a new image to classify, the prototype-based classification model is able to identify parts of the image that it determines look like a particular prototype. For example, the classification model may identify that an input image has one or more tumours that look very much like the pea-sized tumour of $p_1$, or has one or more tumours that have the same density as the tumour of $p_2$, etc. The classification model makes its prediction based on a weighted combination of the similarity scores between parts of the new image and the learned prototypes.

[0050] The output of the model may be a classification of the input image and the prototype image it was assessed to be similar to. Human operators can then view the classification and the corresponding image to understand why the model arrived at a particular decision. In the example discussed above, the radiologist may be provided with the breast cancer diagnosis and an image of a breast that was found to be cancerous that has similar markers/features to the diagnosed mammogram image. As such, a prototype-based model is a more meaningfully interpretable model than a global classifier because it can directly contribute to the understanding of the model's inner workings.

[0051] Although interpretable models such as prototype-based models provide this insight into their decisions, they are often less accurate in predicting class labels than global classifiers. This is because prototype-based models generally focus on the analysis of local image regions. Even though such local analysis aids in interpreting model decisions, it fails to consider the whole image, which has much richer information. For example, a prototype-based model for mammogram classification may only focus on local image regions and may miss other important information from the whole mammogram (e.g., lesions distributed in different spatial locations, and contrast between healthy and abnormal regions). This usually affects the accuracy of interpretable models.

[0052] To address one or more of these issues, aspects of the present disclosure provide an improved classifier that combines the accuracy of a global classifier with the interpretability of an interpretable classifier. By doing so, aspects of the present disclosure provide a classifier that not only has high accuracy but also offers good interpretability of the resulting classification model. In particular, the presently disclosed classifier includes a global classifier and a prototype-based model. The prototype-based model is trained using knowledge distillation from the trained global classifier to improve its classification accuracy. Further, aspects of the present disclosure introduce a new greedy algorithm that improves the prototype diversity of the interpretable classifier model such that the learned prototypes are associated with a diverse set of training images.

[0053] These and other aspects of the novel classifier will be described in the following sections.

Network architecture for improved classification

[0054] Fig. 2 shows an example architecture for the novel classifier 200 according to aspects of the present disclosure. It will be appreciated that classifier 200 may be utilized for many different types of classification tasks. However, the following description will be with respect to the classification task of predicting whether a mammogram image contains a cancerous tumour or not.

[0055] The classifier 200 includes a feature extraction module 201, a global classifier 202, and an interpretable classifier 204.

[0056] The feature extraction module (f) 201 may be similar to the feature extraction module 108, which is made up of one or more convolution and pooling layers, and therefore is not described in detail again. The global classifier (h) 202 also includes one or more convolutional layers 205, pooling layers 206, and one or more fully-connected layers 207. The additional convolutional layers 205 allow the global classifier 202 to extract additional high-level features from the input data and give it more learning capacity. The interpretable classifier 204 includes a mapping layer (t) 208, a prototype layer (g) 209, and one or more fully-connected layers (k) 210. The mapping layer 208 includes one or more convolutional layers that extract additional high-level features from the input data 110. The prototype layer 209 is configured to determine different prototypes 212, assign images to these prototypes, generate similarity maps 214 based on similarities between input images and one or more of the determined prototypes 212, and assign similarity scores based on the identified similarities. The fully-connected layer 210 is configured to classify the input images into the cancer or non-cancer classes based on the assigned similarity scores.
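The overall data flow of classifier 200 can be summarised in the following structural sketch (a minimal sketch only: the internals of each module are placeholders supplied by the caller, not the disclosed layer configurations):

import torch
import torch.nn as nn

# Structural sketch of classifier 200: one shared feature extractor feeding a
# global branch and an interpretable (prototype-based) branch.
class Classifier200(nn.Module):
    def __init__(self, f: nn.Module, h: nn.Module, t: nn.Module,
                 g: nn.Module, k: nn.Module):
        super().__init__()
        self.f = f    # feature extraction module 201
        self.h = h    # global classifier 202 (conv + pooling + fully-connected)
        self.t = t    # mapping layer 208
        self.g = g    # prototype layer 209
        self.k = k    # fully-connected layer 210

    def forward(self, x: torch.Tensor):
        features = self.f(x)                        # first set of (low-level) features
        y_global = self.h(features)                 # global branch prediction
        y_proto = self.k(self.g(self.t(features)))  # interpretable branch prediction
        return y_global, y_proto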
Training the classifier

[0057] The method of training the classifier 200 is generally divided into three sub-methods: 1) training the feature extraction module 201 and the global classifier 202, 2) training the interpretable classifier 204 using the trained feature extraction module 201 and the global classifier 202, and 3) fine-tuning the whole classifier 200.

[0058] Fig. 3A is a flowchart depicting the first sub-method 300 (i.e., the method of training the global classifier 202 to predict whether an input image includes a cancerous tumour or not). Method 300 commences at step 302, where an appropriate number (such as several thousand) of weakly labelled input images 110 (i.e., mammogram images in this case) are fed to the feature extraction module 201 and the global classifier 202.

[0059] In one example, the weakly-labelled dataset may be denoted $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{|\mathcal{D}|}$, where $\mathbf{x} \in \mathbb{R}^{H \times W}$ represents an image of dimension $H \times W$, $\mathbf{y} \in \{0, 1\}^2$ denotes a representation of the class label, i.e., cancer or non-cancer, and $|\mathcal{D}|$ denotes the size of the dataset $\mathcal{D}$. It will be appreciated that dataset $\mathcal{D}$ may include additional parameters, for example $d$, where $d \in \mathbb{N}$ (positive integers) is the anonymised identification of the patient associated with each weakly labelled image. Dataset $\mathcal{D}$ may be divided into a training and testing set in a patient-wise way.

[0060] At step 304, the feature extraction module (f) 201 extracts one or more features from the weakly labelled training dataset using one or more convolution and pooling layers as described with reference to Fig. 1. That is, each convolution layer of the feature extraction module 201 may extract one or more features from the image data and provide these features as input to the next layer, which retrieves more features, and so on. In some examples, the feature extraction module 201 may have a single convolutional layer and pooling layer that extracts 32 features (also referred to as feature maps) from an input image.

[0061] Fig. 3B shows examples of a few of these 32 features extracted by the feature extraction module 201 at this step. In particular, Fig. 3B shows three different feature maps (310B-D) for input image 310A. In Fig. 3B, each feature map is overlapped with the input image for better visualization and is computed using both the image data and parameters of the extraction module 201. In this example, the model parameters of the feature extraction module 201 are denoted $\theta_f$.
[0062] The output of the feature extraction module 201 is represented by $\mathbf{F} = f_{\theta_f}(\mathbf{x})$, where $\mathbf{F} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times D_0}$ and $\theta_f$ denotes the backbone parameters.

[0063] The output of the feature extraction module 201 is then fed to the global classifier (h) 202, which is trained to compute a cross-entropy loss at step 306. The output or prediction of the global classifier 202 is $\tilde{\mathbf{y}}_G$. This may be denoted by $\tilde{\mathbf{y}}_G = h_{\theta_h}(\mathbf{F})$, where $\tilde{\mathbf{y}}_G \in [0, 1]^2$ and $\theta_h$ denotes the global classification parameters determined by the at least one convolution operation of the convolution layers 205. Thus, the input for the global classifier 202 is the output of the feature extraction module 201, and the output of the global classifier 202, $\tilde{\mathbf{y}}_G$, is a class prediction for a given input image – i.e., cancer (1) or non-cancer (0).

[0064] The cross-entropy loss is a measure of the difference between the global classifier's prediction $\tilde{\mathbf{y}}_G$ and the actual class label or manually annotated label, $\mathbf{y}$. In particular, it represents the cross-entropy loss used to train the parameters $\theta_f$ and $\theta_h$ of the feature extraction module 201 and the global classifier 202 using the label $\mathbf{y}$ and the global model prediction $\tilde{\mathbf{y}}_G$.

[0065] At step 308, the global classifier 202 attempts to minimize the cross-entropy loss function. That is, it attempts to converge the output $\tilde{\mathbf{y}}_G$ with $\mathbf{y}$, the actual labels of the images, as both modules 201, 202 continue to update their own parameters $\theta_f$, $\theta_h$, respectively, to minimize the cross-entropy loss function. A large number of training images are fed to the feature extraction module 201 and the global classifier 202 in order to allow them to minimize the loss function. Typically, when $\tilde{\mathbf{y}}_G$ is sufficiently close to $\mathbf{y}$, the global classifier 202 is considered to be well-trained and the training may stop. For example, the global classifier 202 may be considered well-trained when the training AUC (i.e., area under the curve) reaches approximately 95% or higher.
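The first sub-method can be summarised in the following training-loop sketch (the modules and data are stand-ins, and scikit-learn is used only to compute the metric; only the cross-entropy objective and the AUC stopping rule follow the description above):

import torch
from sklearn.metrics import roc_auc_score

# Sketch of method 300 with stand-in modules and random stand-in data.
f = torch.nn.Sequential(torch.nn.Conv2d(1, 8, 3, padding=1),
                        torch.nn.AdaptiveAvgPool2d(4), torch.nn.Flatten())
h = torch.nn.Linear(8 * 4 * 4, 2)
optimizer = torch.optim.Adam(list(f.parameters()) + list(h.parameters()), lr=0.001)
ce = torch.nn.CrossEntropyLoss()

for epoch in range(100):
    scores, labels = [], []
    for _ in range(10):                        # stand-in for a data loader
        x = torch.randn(16, 1, 64, 32)         # weakly labelled mammograms
        y = torch.randint(0, 2, (16,))         # 1 = cancer, 0 = non-cancer
        logits = h(f(x))                       # global prediction for each image
        loss = ce(logits, y)                   # cross-entropy between prediction and label
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scores += logits.softmax(dim=1)[:, 1].tolist()
        labels += y.tolist()
    if roc_auc_score(labels, scores) >= 0.95:  # considered well-trained; stop
        break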
[0066] Fig. 4 is a message passing diagram illustrating the second sub-method 400 – i.e., training the interpretable classifier 204 using the trained feature extraction module 201 and global classifier 202. The message passing diagram is described with respect to the three main components of the presently disclosed classifier 200 – the feature extraction module 201, the global classifier 202, and the interpretable classifier 204.

[0067] The method 400 commences at step 402, where a training image 110 belonging to dataset $\mathcal{D}$, where the image data is denoted by $\mathbf{x} \in \mathbb{R}^{H \times W}$ and represents an image of dimension $H \times W$, is received at the trained feature extraction module 201. This step is similar to step 302 and therefore is not described in detail again.

[0068] Next, at step 404, the feature extraction module 201 processes the image to extract low-level features $\mathbf{F}$. This step is similar to step 304 of method 300. The output from the feature extraction module 201, $\mathbf{F} = f_{\theta_f}(\mathbf{x})$, is then fed to the interpretable classifier 204 and to the trained global classifier 202. It will be appreciated that the feature extraction module 201 may pass the extracted features to both classifiers at substantially the same time. Further, it will be appreciated that as the feature extraction module 201 and the global classifier 202 have been previously trained, their parameters remain fixed and are not changed during method 400.

[0069] At step 406, the interpretable classifier 204 receives the extracted features from the feature extraction module 201 and processes these to determine its own output – i.e., its own classification prediction and closest matching prototypes.

[0070] To this end, the mapping layer (t) 208 of the interpretable classifier 204, which includes one or more convolutional layers and pooling layers, receives the features $\mathbf{F} = f_{\theta_f}(\mathbf{x})$ generated by the feature extraction module 201 and extracts high-level features $\mathbf{X} \in \mathbb{R}^{\frac{H}{32} \times \frac{W}{32} \times D}$. In particular, the mapping layer 208 has parameters $\theta_t$ and produces the high-level features $\mathbf{X}$, which are then fed to the prototype layer 209 as an input.

[0071] The prototype layer 209 has $M$ learnable class-representative prototypes 212, $\mathbf{P} = \{\mathbf{p}_m\}_{m=1}^{M}$, where $\mathbf{p}_m \in \mathbb{R}^{1 \times 1 \times D}$. Each prototype $\mathbf{p}_m$ is denoted by a vector of dimension $D$. The prototypes are used to form similarity maps 214, defined by:

$\mathbf{S}_m(h, w) = \exp\left(-\frac{\|\mathbf{X}(h, w) - \mathbf{p}_m\|_2^2}{T}\right)$,  (1)

where $h \in \{1, \ldots, \frac{H}{32}\}$ and $w \in \{1, \ldots, \frac{W}{32}\}$ denote spatial indexes in the similarity maps and $T$ is a temperature factor. The temperature factor is used to prevent the saturation effect of $e^{-(\cdot)}$ in equation 1. For example, if $\|\mathbf{X}(h, w) - \mathbf{p}_m\|_2^2$ is very large, then $e^{-(\cdot)}$ will be very small, and therefore equation 1 will be insensitive to changes in $\mathbf{X}$. A large $T$, e.g., $T = 128$, may be used to alleviate this effect.

[0072] The prototype layer 209 outputs $M$ similarity scores obtained from max pooling of the similarity maps 214: $\{\max_{h,w} \mathbf{S}_m(h, w)\}_{m=1}^{M}$. The output of the prototype layer 209 is the input for the fully-connected classification layer 210, and the data are parsed to determine a classification result – i.e., cancer (1) or non-cancer (0).

[0073] The output of the interpretable classifier 204, which in this example is ProtoPNet, may be denoted by $\tilde{\mathbf{y}}_P = k_{\theta_k}(g_{\theta_g}(t_{\theta_t}(\mathbf{F})))$, where $\tilde{\mathbf{y}}_P \in [0, 1]^2$ is the interpretable model prediction of cancer or non-cancer, $t_{\theta_t}(\cdot)$ represents the output of the mapping layer (t) 208, $g_{\theta_g}(\cdot)$ represents the output of the prototype layer (g) 209 and $k_{\theta_k}(\cdot)$ represents the output of the fully-connected classification layer (k) 210.

[0074] At step 408, the interpretable classifier 204 determines its own loss function. The loss function of the prototype-based classifier is given by:

$\ell_{Pro}(\mathcal{D}, \theta_t, \theta_g, \theta_k) = \ell_{CE}(\mathbf{y}, \tilde{\mathbf{y}}_P) + \lambda_1 \ell_{CT}(\mathcal{D}, \theta_t, \theta_g, \theta_k) + \lambda_2 \max\left(0, \omega - \ell_{SP}(\mathcal{D}, \theta_t, \theta_g, \theta_k)\right)$,  (2)

where $\lambda_1$, $\lambda_2$ and $\omega$ denote hyper-parameters – that is, parameters that control the learning process – and $\ell_{CE}(\mathbf{y}, \tilde{\mathbf{y}}_P)$ is the cross-entropy loss between the label $\mathbf{y}$ and the prototype-based classifier's output $\tilde{\mathbf{y}}_P$ determined at step 406.

[0075] For each training image, the cluster loss term $\ell_{CT}(\cdot)$ encourages the input image to have at least one local feature close to one of the prototypes of its own class, while the separation loss $\ell_{SP}(\cdot)$ ensures all local features are far from the prototypes that are not from the image's class. The loss function shown in equation 2 above differs from the loss functions of conventional prototype-based classifiers such as ProtoPNet by the hinge loss term on the separation loss – i.e., $\max(0, \omega - \ell_{SP}(\cdot))$. This hinge loss can prevent overfitting during training and yields much better visual prototype results.
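A short sketch of equation 1 and the hinged loss of equation 2, as reconstructed above (the symbol names, tensor sizes and placeholder loss values are assumptions):

import torch

# Similarity maps with a temperature factor (eq. 1), reduced by max pooling.
T = 128.0                                     # temperature factor
X = torch.randn(1, 128, 12, 6)                # mapped features: (batch, D, H', W'), scaled down
p = torch.randn(20, 128)                      # M = 20 prototypes of depth D (scaled down)

d = ((X.unsqueeze(1) - p[None, :, :, None, None]) ** 2).sum(dim=2)  # squared L2 distances
S = torch.exp(-d / T)                         # similarity maps S_m(h, w), eq. 1
scores = S.amax(dim=(2, 3))                   # M max-pooled similarity scores: (1, 20)

# Hinged separation term of eq. 2: the separation loss is only encouraged
# up to the margin omega, which can prevent overfitting.
lam1, lam2, omega = 0.1, 0.1, 10.0            # assumed hyper-parameter values
l_ce, l_ct, l_sp = torch.rand(3)              # placeholder component losses
l_pro = l_ce + lam1 * l_ct + lam2 * torch.clamp(omega - l_sp, min=0)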
[0076] Returning to step 404, the output from the feature extraction module 201 is also fed to the trained global classifier 202. The convolution layers 205 and pooling layers 206 of the global classifier 202 extract global classification features, and the fully-connected layers 207 of this classifier predict a classification $\tilde{\mathbf{y}}_G$ for the image at step 410.

[0077] At step 412, the classification result $\tilde{\mathbf{y}}_G$ of the global classifier is fed to the interpretable classifier 204. This process of using the output or classification result of one classifier as an input to another classifier is known as knowledge distillation (KD), which is related to the general notion of model compression by training a smaller network to do something based on a larger, already trained network. As the output of the global classifier 202 is generally more accurate than the output of an interpretable classifier 204, the output of the global classifier 202 can help the interpretable classifier 204 improve its own result accuracy.

[0078] In particular, the output of the global classifier, $\tilde{\mathbf{y}}_G$, is used by the interpretable classifier 204 such that it can be trained to learn the exact behaviour of the global classification model. To do so, a correspondence needs to be established between the output of the interpretable classifier 204 and the output of the global classifier 202. This correspondence may involve directly passing the output of a layer in the global classifier 202 to the interpretable classifier 204, or performing some data augmentation before passing it to the interpretable classifier 204.

[0079] At step 413, the cross-entropy loss of the global classifier 202 is computed. As described previously, the cross-entropy loss is a measure of the difference between the global classifier's prediction $\tilde{\mathbf{y}}_G$ and the actual class label or manually annotated label, $\mathbf{y}$.

[0080] Next, at step 414, a KD loss function is determined by the interpretable classifier 204 based on the output from the global classifier 202 and the output of the interpretable classifier 204. The KD loss function is defined by equation 3 as follows:

$\ell_{KD}(\mathcal{D}, \theta_f, \theta_t, \theta_g, \theta_k) = \max\left(0, \tilde{y}_G(c) + \mu - \tilde{y}_P(c)\right)$,  (3)

[0081] where $\tilde{y}_G(c)$ and $\tilde{y}_P(c)$ denote the predicted probability scores of the labelled class from the global classifier 202 and the interpretable classifier 204, respectively, and $\mu$ represents a pre-defined positive margin to control the interpretable classifier's confidence gain. The value of this confidence gain can be between 0 and 1. The KD loss function is designed to distil the knowledge from the global classifier 202 to the interpretable classifier 204 to increase the classification accuracy of the interpretable classifier 204 and enable a better ensemble classification of both models.

[0082] Lastly, at step 416, the loss function of the entire classifier 200 is determined based on the KD loss function, the loss function of the interpretable classifier 204 determined at step 408, and the cross-entropy loss function of the global classifier computed at step 413. The classifier loss function is defined by equation 6 as follows:

$\ell(\mathcal{D}, \theta_f, \theta_h, \theta_t, \theta_g, \theta_k) = \ell_{Pro}(\mathcal{D}, \theta_t, \theta_g, \theta_k) + \alpha\, \ell_{GCE}(\mathcal{D}, \theta_f, \theta_h) + \beta\, \ell_{KD}(\mathcal{D}, \theta_f, \theta_t, \theta_g, \theta_k)$,  (6)

[0083] where $\alpha$, $\beta$ are hyper-parameters, $\ell_{Pro}(\cdot)$ is the prototype-based classifier loss, $\ell_{GCE}(\cdot)$ is the global classifier's cross-entropy loss function and $\ell_{KD}(\cdot)$ is the KD loss function. Once this classifier loss function has been determined, the interpretable classifier 204 attempts to produce results or outputs that minimize this loss function.
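The KD and total losses, as reconstructed in equations 3 and 6 above, can be sketched as follows (the hinge form, symbol names and values are assumptions consistent with the description):

import torch

def kd_loss(p_global: torch.Tensor, p_proto: torch.Tensor, mu: float = 0.2) -> torch.Tensor:
    # Hinge of eq. 3: zero once the interpretable branch's probability of the
    # labelled class exceeds the global branch's by the margin mu.
    return torch.clamp(p_global + mu - p_proto, min=0).mean()

p_g = torch.tensor([0.90, 0.70])          # global branch, labelled-class probabilities
p_p = torch.tensor([0.80, 0.95])          # interpretable branch probabilities
l_kd = kd_loss(p_g, p_p)                  # only the first sample incurs a penalty

alpha, beta = 1.0, 0.5                    # assumed hyper-parameter weights
l_pro, l_gce = torch.rand(2)              # placeholder branch losses
l_total = l_pro + alpha * l_gce + beta * l_kd   # total classifier loss, eq. 6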
Greedy strategy for selecting diverse prototypes

[0084] Fig. 5 illustrates an example method 500 for increasing the prototype diversity of the interpretable classifier 204 while processing input images.

[0085] Diversity of the training data ensures that the training data can provide more discriminative information for the model; diversity of the learned model (diversity in the parameters of each model or diversity among different base models) makes each parameter/model capture unique or complementary information; and diversity in inference can provide multiple choices, each of which corresponds to a specific plausible local optimal result.

[0086] The method commences at step 502, where the prototype layer 209 determines distances between the prototypes $\mathbf{p}_m$ and all image patches in the same class that have the same shape as $\mathbf{p}_m$. The distances are determined by:

$d_m(h, w) = \|\mathbf{X}(h, w) - \mathbf{p}_m\|_2^2$,  (7)

where $\mathbf{X}(h, w)$ is a feature map of a training image.

[0087] Next, at step 504, the computed distance values from step 502 are sorted in ascending order and a distance dictionary is created and stored (e.g., in non-transient memory 1010). The distance dictionary includes an ordered list of prototypes 212 and, for each prototype, it includes the distances between the prototype 212 and all the image patches that have the same shape as that prototype.

[0088] For example, the identifiers (image indices) for the set of prototypes may be stored in a distance dictionary as shown in Table A below, where, for a given prototype, the identifiers are stored in an order in which the distance of each image to the corresponding prototype is ascending:

Prototype | Image identifiers (ascending distance)
$\mathbf{p}_1$ | Image2, ...
$\mathbf{p}_2$ | Image2, Image1, ...
... | ...

Table A: example prototype-image distance dictionary

[0089] Next, the prototype-image distance dictionary computed above is used to update each prototype. That is, each prototype $\mathbf{p}_m$ is replaced by the nearest latent feature vector $\mathbf{z} \in \mathbb{R}^{1 \times 1 \times D}$ from all training images of the same class after each training epoch:

$\mathbf{p}_m \leftarrow \arg\min_{\mathbf{z} \in \mathcal{Z}_m} \|\mathbf{z} - \mathbf{p}_m\|_2^2$,  (8)

where $\mathcal{Z}_m$ denotes the set of candidate latent feature vectors for prototype $\mathbf{p}_m$.

[0090] Initially, this process will start from the first prototype at step 505, i.e., prototype $\mathbf{p}_1$ in the example Table A above.

[0091] At step 506, the prototype layer 209 determines the image nearest to the selected prototype $\mathbf{p}_1$. This may be the image that has the least distance from the selected prototype in the identifier column. In the example Table A above, the image nearest to prototype $\mathbf{p}_1$ is Image2.

[0092] At step 507, the prototype layer 209 determines whether the selected image has already been selected for any other prototype. Initially, for the first prototype, as no images have been selected as yet, the prototype layer 209 determines that the selected image (e.g., Image2) has not previously been selected and the method proceeds to step 508, where the selected image is used to update the first prototype $\mathbf{p}_1$. The identifier of the image (Image2) is recorded to indicate that the image has already been selected to update the current prototype $\mathbf{p}_1$ and should never be used for the remaining prototypes.

[0093] On the other hand, if the selected image had already been selected for updating another prototype, the next nearest image is selected at step 509 and the method then returns to step 507, where the prototype layer 209 once again determines whether the selected image has already been selected. This process repeats until, at step 507, the prototype layer 209 determines that the selected image has not yet been selected for any other processed prototype.
[0094] Once a prototype has been updated, the method proceeds to step 510, where the prototype layer 209 determines whether there are any unprocessed prototypes in the prototype- image distance dictionary – i.e., it determines whether there are any prototypes that have not yet been updated by an image patch. If the prototype layer determines that there are no such unprocessed prototypes, method 500 ends. [0095] Otherwise, the method returns to step 505, where the next unprocessed prototype is selected (e.g., prototype ^^ 2 ). Then, at step 506, the prototype layer 209 determines the image nearest to the selected prototype according to the prototype-image distance dictionary. This may be the image that has the least distance from the selected prototype in the identifier column. In the example Table above, the image nearest to prototype ^^ ^^ is Image 2. [0096] At step 507, the prototype layer 209 determines whether the selected image has already been selected for any other prototype. As Image2 has already been selected for ^^ 1 , the next nearest image is selected (e.g., Image1) and a determination is made whether image1 has also already been selected for another prototype. As it has not been previously selected, the method proceeds to step 508, where the prototype ^^ ^^ is updated by the selected image (Image1). The identifier of selected image (Image1) is also recorded to indicate that it has been selected already. And the method proceeds to 510 again. Thereafter, method steps 505-510 are repeated in a similar fashion until all the prototypes are updated with unique image patch and the same image patch is never used to update more than one prototype. [0097] This method 500 is designed to improve the prototype diversity of the interpretable classifier 204, which can provide a qualitative improvement in the model’s interpretability. [0098] Method 500 may be used in conjunction with method 400, e.g., while the prototype- based classifier is being trained. In one example, method 500 may be used after each training epoch is completed. Experimental setup [0099] To validate the classification accuracy and model interpretability of the classifier 200, experiments were performed using private Annotated Digital Mammograms and Associated Non-Image data (ADMANI) set. This dataset contains high-resolution (size of 1004859455 5416 × 4040) 4-view mammograms (including left and right craniocaudal (CC) images and left and right mediolateral oblique (MLO) images) with screening diagnosis outcome per view (i.e., malignant, and no malignant findings). The dataset has 20592 (3262 cancer, 17330 non- cancers) training images and 22525 (806 cancer, 21719 non-cancers) test images. In the test set, 410 cancer images have lesion annotations labelled by experienced radiologists for evaluating cancer localisation performance. [0100] Further, experiments were performed using the public Chinese Mammography Database (CMMD) dataset to validate the generalisation performance of the classifier 200. CMMD consists of 5200 (2632 cancers, 2568 non-cancers) mammograms of 4 views. The original image resolution is 2294 × 1914. [0101] The classifier 200 was implemented on Pytorch (An open source machine learning framework). The model is trained with Adam optimizer (a replacement optimization algorithm for stochastic gradient descent for training deep learning models) with an initial learning rate of 0.001, weight decay of 0.00001, and batch size of 16. 
[0102] For both datasets, images are pre-processed using the Otsu threshold algorithm to crop the breast region, which is subsequently resized to 1536 × 768 pixels, i.e., $H = 1536$ and $W = 768$. The feature size is $D = 128$, and the temperature factor used to compute the similarity maps is set to 128. The number of prototypes is 400 (200 for the cancer class and 200 for the non-cancer class). EfficientNet-B0 and DenseNet-121 are used as the feature extraction module 201.

[0103] Data augmentation techniques (e.g., translation, rotation, and scaling) are utilised to improve generalisation. To evaluate classification accuracy, the area under the receiver operating characteristic curve (AUC) is used. To assess model interpretability, the accuracy of the lesion localisation provided by the model is measured on the test samples that are labelled with lesion annotations. In particular, the following measures are used: intersection over union (IoU), Dice, and area under the precision-recall curve (PR-AUC).

Experimental results

[0104] Table B shows a comparison of the classifier 200 with the following models: EfficientNet-B0, DenseNet-121, Sparse Multiple Instance Learning (Sparse MIL), the Globally-aware Multiple Instance Classifier (GMIC), and ProtoPNet. For all these models, publicly available code was used for the comparison. EfficientNet-B0 and DenseNet-121 are non-interpretable classification models. Sparse MIL can localise lesions by dividing a mammogram into regions that are classified using multiple-instance learning with a sparsity constraint on the malignant probability. For a fair comparison, EfficientNet-B0 was used as the backbone in the Sparse MIL method. The GMIC method first uses a global module to select the most informative regions of the input mammogram, then relies on a local module to analyse those selected regions, and finally employs a fusion module to aggregate the global and local information for classification.

[0105] All of the methods above, including the classifier 200, were trained on the training set from ADMANI and tested on the test sets from ADMANI and CMMD, using the same cropping and resizing method as the classifier 200 described above. Table B below shows the AUC results of the different methods on both datasets. The best results are highlighted.

Table B: AUC classification results on both mammogram datasets

[0106] For the classifier 200, the classification results of the interpretable classifier 204 and global classifier 202 branches are shown independently, and their ensemble result is then shown to illustrate the importance of combining the classification results of both classifiers.

[0107] The results in Table B also show the role played by distilling knowledge from the global classifier 202 to train the interpretable classifier 204. The best result is achieved by the ensemble model trained with knowledge distillation (KD), whether DenseNet-121 or EfficientNet-B0 is used as the backbone, reaching the best results on both the ADMANI and CMMD datasets. It is also interesting to see that the original ProtoPNet classifier is inferior to its non-interpretable global classifier counterparts EfficientNet-B0 and DenseNet-121. However, with the KD strategy of the classifier 200, the interpretable classifier 204 obtains a significant performance gain over the original ProtoPNet, which demonstrates the importance of KD in training the ProtoPNet.
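For concreteness, the following Python sketch shows one common way a knowledge distillation loss between the two branches can be formed, using softened targets from the global classifier (teacher) to supervise the interpretable classifier (student). The KL-divergence formulation, the temperature value, and the loss-weighting placeholder are illustrative assumptions, not necessarily the exact formulation used by the classifier 200.

```python
# Hedged sketch of a knowledge distillation (KD) loss between the global
# classifier (teacher) and the interpretable classifier (student).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target distillation loss between two sets of class logits."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)
    # KL divergence between the softened distributions, scaled by T^2 so
    # gradient magnitudes remain comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2

# Placeholder combination of branch losses with the KD term; lambda_kd and
# the branch losses are assumptions, not the patent's actual coefficients.
# total_loss = global_loss + interp_loss \
#     + lambda_kd * kd_loss(interp_logits, global_logits.detach())
```

Detaching the teacher logits in the commented example reflects the usual choice of letting the distillation term update only the student branch.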
[0108] It is also observed that using DenseNet-121 as the feature extraction module 201 exhibits better generalisation results on CMMD than using EfficientNet-B0, which may indicate that DenseNet-121 is more robust against domain shift.

[0109] Fig. 6A shows typical non-cancer prototypes 602A and 602B, the corresponding source training images 604A and 604B, and the similarity maps 606A and 606B. Fig. 6B shows typical cancer prototypes 610A and 610B, the corresponding source training images 612A and 612B, and the similarity maps 614A and 614B. It can be seen from these figures that the cancer prototypes 610 usually come from source training images that include regions containing cancerous visual biomarkers (e.g., a malignant mass), which aligns with radiologists' criteria for diagnosing breast cancer, while the non-cancer prototypes 602 come from normal breast tissue or benign regions.

[0110] Fig. 7 shows the interpretable reasoning process of the classifier 200 on a cancerous test image 702. The images in the second row are the top-2 activated non-cancer (704, 706) and cancer (708, 710) prototypes, where "top-2 activated" means that, for a test image, the top-2 closest (most similar) prototypes to it are found. The images 712-718 in the third row are the determined similarity maps with the highest scores used for classification. The numbers below the similarity map images are the classification confidences for non-cancer (712, 714) and cancer (716, 718).

[0111] As can be seen, the model classifies the image as cancer because the lesion present in the image looks more like the cancer prototypes than the non-cancer ones – see Figs. 6A and 6B. The similarity scores of the prototypes are summed in the fully connected classification layer 208, which applies separate learned weights to the non-cancer and the cancer similarity scores, to obtain the classification result.

[0112] The cancer localisation performance of the different models is also evaluated and displayed in chart 800 in Fig. 8. For EfficientNet-B0, DenseNet-121, Sparse MIL, GMIC, and the classifier 200, cancer regions are predicted by applying a threshold of 0.5 on the Grad-CAM map (for EfficientNet-B0 and DenseNet-121), the malignant map (for Sparse MIL), the saliency map (for GMIC), and the similarity map of the top-1 activated cancer prototype (for the classifier 200), respectively. For all models, images with a predicted classification probability of less than 0.1 are excluded, since they are classified as non-cancer by the model. When computing PR-AUC, an IoU threshold is needed to determine a true cancer detection, i.e., a cancer is truly detected if the IoU between the predicted cancer region and the ground-truth cancer mask is higher than the IoU threshold. The IoU threshold is varied from 0.05 to 0.5 to obtain a series of PR-AUC values, as shown in chart 800. As seen in this chart, the classifier 200 consistently achieves superior performance over the other compared classifiers under the different IoU thresholds.

[0113] Fig. 9 displays a visual comparison of cancer localisation, where it can be observed that the prototype-based methods more accurately detect the cancer region. Image 902 is the original image. Images 904-912 show the cancer localisation produced by EfficientNet-B0, Sparse MIL, GMIC, ProtoPNet, and the classifier 200, respectively.

[0114] The effect of the greedy prototype selection strategy on prototype diversity and classification accuracy is also observed. The mean pairwise intra-class prototype distance is given in Table C. As can be observed, the greedy prototype selection strategy significantly increases prototype diversity (note the larger cosine and L2 distances) and enables the model to learn richer and more representative prototype patterns from the training samples, which is beneficial for both classification and interpretability.

Table C: Effect of greedy selection
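As a point of reference, the diversity metric reported in Table C can be computed as in the following sketch, which assumes the prototypes of one class are stored as rows of a 2-D tensor; the function name and shapes are illustrative.

```python
# Sketch of the mean pairwise intra-class prototype distance of Table C.
import torch
import torch.nn.functional as F

def mean_pairwise_distances(prototypes):
    """Return mean (cosine, L2) distances over distinct prototype pairs."""
    k = prototypes.shape[0]
    idx_i, idx_j = torch.triu_indices(k, k, offset=1)  # distinct pairs only
    a, b = prototypes[idx_i], prototypes[idx_j]
    cosine = (1 - F.cosine_similarity(a, b, dim=1)).mean()
    l2 = (a - b).norm(dim=1).mean()
    return cosine.item(), l2.item()

# Example: 200 cancer-class prototypes with feature size D = 128.
cos_dist, l2_dist = mean_pairwise_distances(torch.randn(200, 128))
```

Larger values of either distance indicate more diverse (less redundant) prototypes, which is the effect the greedy selection strategy is designed to achieve.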
Example computer system

[0115] The modules and classifiers described with respect to Fig. 2 can be implemented using one or more computer processing systems.

[0116] Fig. 10 provides a block diagram of a computer processing system 1000 configurable to implement the feature extraction module, global classifier, and interpretable classifier described herein. System 1000 is a general-purpose computer processing system. It will be appreciated that Fig. 10 does not illustrate all functional or physical components of a computer processing system. For example, no power supply or power supply interface has been depicted; however, system 1000 will either carry a power supply or be configured for connection to a power supply (or both). It will also be appreciated that the particular type of computer processing system will determine the appropriate hardware and architecture, and alternative computer processing systems suitable for implementing features of the present disclosure may have additional, alternative, or fewer components than those depicted.

[0117] Computer processing system 1000 includes at least one processing unit 1002 (for example, a general or central processing unit, a graphics processing unit, or an alternative computational device). Computer processing system 1000 may include a plurality of computer processing units. In some instances, where computer processing system 1000 is described as performing an operation or function, all processing required to perform that operation or function will be performed by processing unit 1002. In other instances, processing required to perform that operation or function may also be performed by remote processing devices accessible to and useable by (either in a shared or dedicated manner) system 1000.

[0118] Through a communications bus 1004, processing unit 1002 is in data communication with one or more computer readable storage devices which store instructions and/or data for controlling operation of the processing system 1000. In this example, system 1000 includes a system memory 1006 (e.g., a BIOS), volatile memory 1008 (e.g., random access memory such as one or more DRAM modules), and non-volatile (or non-transitory) memory 1010 (e.g., one or more hard disks, solid state drives, or other non-transitory computer readable media). Such memory devices may also be referred to as computer readable storage media (or a computer readable medium).

[0119] System 1000 also includes one or more interfaces, indicated generally by 1012, via which system 1000 interfaces with various devices and/or networks. Generally speaking, other devices may be integral with system 1000 or may be separate. Where a device is separate from system 1000, the connection between the device and system 1000 may be via wired or wireless hardware and communication protocols, and may be a direct or an indirect (e.g., networked) connection.
[0120] Wired connection with other devices/networks may be by any appropriate standard or proprietary hardware and connectivity protocols, for example Universal Serial Bus (USB), eSATA, Thunderbolt, Ethernet, HDMI, and/or any other wired connection hardware/connectivity protocol.

[0121] Wireless connection with other devices/networks may similarly be by any appropriate standard or proprietary hardware and communications protocols, for example infrared, Bluetooth, Wi-Fi, near field communications (NFC), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Long Term Evolution (LTE), code division multiple access (CDMA, and/or variants thereof), and/or any other wireless hardware/connectivity protocol.

[0122] Generally speaking, and depending on the particular system in question, devices to which system 1000 connects – whether by wired or wireless means – include one or more input/output devices (indicated generally by input/output device interface 1014). Input devices are used to input data (e.g., training, test, or actual image data) into system 1000 for processing by the processing unit 1002. Output devices allow data to be output by system 1000 (e.g., classification outputs). Example input/output devices are described below; however, it will be appreciated that not all computer processing systems will include all mentioned devices, and that additional and alternative devices to those mentioned may well be used.

[0123] For example, system 1000 may include or connect to one or more input devices by which information/data is input into (received by) system 1000. Such input devices may include keyboards, mice, trackpads (and/or other touch/contact sensing devices, including touch screen displays), and/or other input devices. System 1000 may also include or connect to one or more output devices controlled by system 1000 to output information. Such output devices may include displays (e.g., cathode ray tube displays, liquid crystal displays, light emitting diode displays, plasma displays, touch screen displays), speakers, light emitting diodes/other lights, and other output devices. System 1000 may also include or connect to devices which may act as both input and output devices, for example memory devices/computer readable media (e.g., hard drives, solid state drives, disk drives, compact flash cards, SD cards, and other memory/computer readable media devices) from which system 1000 can read data and/or to which it can write data, and touch screen displays which can both display (output) data and receive touch signals (input).

[0124] System 1000 also includes one or more communications interfaces 1016 for communication with a network. Via a communications interface 1016, system 1000 can communicate data to and receive data from networked devices, which may themselves be other computer processing systems.

[0125] System 1000 may be any suitable computer processing system, for example, a server computer system, a desktop computer, a laptop computer, a netbook computer, a tablet computing device, a mobile/smart phone, a personal digital assistant, or an alternative computer processing system.

[0126] System 1000 stores or has access to computer applications (also referred to as software or programs), i.e., computer readable instructions and data which, when executed by the processing unit 1002, configure system 1000 to receive, process, and output data. Instructions and data can be stored on non-transitory computer readable media accessible to system 1000.
For example, instructions and data may be stored on non-transitory memory 1010. Instructions and data may be transmitted to/received by system 1000 via a data signal in a transmission channel enabled (for example) by a wired or wireless network connection over an interface such as 1012.

[0127] Applications accessible to system 1000 will typically include an operating system application such as Microsoft Windows™, Apple macOS™, Apple iOS™, Android™, Unix™, or Linux™.

[0128] System 1000 also stores or has access to applications which, when executed by the processing unit 1002, configure system 1000 to perform various processing operations described herein. In some cases, part or all of methods 300-500 will be performed by a single computer processing system 1000, while in other cases the processing may be performed by multiple computer processing systems in data communication with each other.

[0129] Reference to any classification models in the specification is not an acknowledgment or suggestion that these classification models form part of the common general knowledge in any jurisdiction, or that these models could reasonably be expected to be understood, regarded as relevant, and/or combined with other models by a skilled person in the art.

[0130] As used herein, except where the context requires otherwise, the term "comprise" and variations of the term, such as "comprising", "comprises" and "comprised", are not intended to exclude further additives, components, integers or steps.