Title:
A METHOD TO PROVIDE MODEL EXPLANATIONS FOR ORDINAL MULTI-CLASS PROBLEMS
Document Type and Number:
WIPO Patent Application WO/2024/091227
Kind Code:
A1
Abstract:
Example implementations described herein involve a method, which can involve training, from historical data, a stacking model involving a learn to rank model and a classifier, the learn to rank model configured to output a ranking score, the classifier configured to output a predicted probability of each class label from a plurality of class labels; applying the trained stacking model to new data to obtain predicted class probabilities; obtaining, from executing an explainable artificial intelligence (AI) method on the trained stacking model, an explanation of the predicted class probabilities based on features extracted from the new data; and outputting the explanation and the predicted class probabilities.

Inventors:
TANG HSIU-KHUERN (US)
Application Number:
PCT/US2022/047758
Publication Date:
May 02, 2024
Filing Date:
October 25, 2022
Assignee:
HITACHI LTD (JP)
International Classes:
G06N5/02; G06N3/08; G06N20/20
Foreign References:
US20210406780A12021-12-30
US20170161613A12017-06-08
US20210150391A12021-05-20
Other References:
VADIM BORISOV: "Deep Neural Networks and Tabular Data: A Survey", IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, IEEE, USA, 1 January 2024 (2024-01-01), USA, pages 1 - 21, XP093168685, ISSN: 2162-237X, DOI: 10.1109/TNNLS.2022.3229161
Attorney, Agent or Firm:
HUANG, Ernest C. et al. (US)
Claims:
CLAIMS

What is claimed is:

1. A method, comprising: training, from historical data, a stacking model comprising a learn to rank model and a classifier, the learn to rank model configured to output a ranking score, the classifier configured to output a predicted probability of each class label from a plurality of class labels; applying the trained stacking model to new data to obtain predicted class probabilities; obtaining, from executing an explainable artificial intelligence (AI) method on the trained stacking model, an explanation of the predicted class probabilities based on features extracted from the new data; and outputting the explanation and the predicted class probabilities.

2. The method of claim 1, wherein the historical data comprises a plurality of class labels and a plurality of quantities associated with the plurality of class labels, wherein the training the stacking model from historical data comprises: executing a first machine learning workflow to extract features from the historical data and to learn the learn to rank model that is fit on the extracted features and the plurality of quantities; calculating out-of-sample ranking scores corresponding to the extracted features and the plurality of quantities from executing a loss function configured to determine ranking; and executing a second machine learning workflow to learn the classifier based on the ranking scores, the classifier configured to predict classes from the ranking scores.

3. The method of claim 2, wherein the calculating the out-of-sample ranking scores corresponding to the extracted features and the plurality of quantities from executing the loss function configured to determine the ranking comprises: partitioning the extracted features and the plurality of quantities into a number of subsets at random; for each of the random subsets, fitting a corresponding subset learn to rank model on ones of the extracted features and ones of the plurality of quantities not in said each of the random subsets; and for the each of the random subsets, applying the fitted corresponding subset learn to rank model on said each of the random subsets to determine a ranking score for each sample in said random subset.

4. The method of claim 1, wherein the applying the trained stacking model to new data to obtain the predicted class probabilities comprises: applying the learn to rank model to the features extracted from the new data to obtain ranking scores; and applying the classifier to the ranking scores to obtain the predicted class probabilities.

5. The method of claim 1, wherein the obtaining, from executing the explainable Al method on the trained stacking model, the explanation of the predicted class probabilities based on the features extracted from the new data comprises: determining a matrix of feature contributions to predictions, wherein each row of the matrix corresponds to a sample in the new data and its element measures a local importance of the corresponding feature from the features extracted from the new data to the predictions for that sample; and determining, for each of the features extracted from the new data, a summation of absolute values of the feature contributions.

6. The method of claim 5, wherein the outputting the explanation and the predicted class probabilities comprises outputting a list of ordered features from most important to least important, and outputting ones of the feature contributions associated with the list of ordered features.

7. The method of claim 1, further comprising determining, for each of the features extracted from the new data, summary statistics derived from feature values.

8. The method of claim 1, further comprising removing predictions associated with dubious explanations.

9. The method of claim 1, further comprising identifying errors in the new data or the trained stacking model.

10. The method of claim 1, further comprising scoring the trained stacking model based on frequency of dubious explanations.

11. A non-transitory computer readable medium, storing instructions for executing a process, the instructions comprising: training, from historical data, a stacking model comprising a learn to rank model and a classifier, the learn to rank model configured to output a ranking score, the classifier configured to output a predicted probability of each class label from a plurality of class labels; applying the trained stacking model to new data to obtain predicted class probabilities; obtaining, from executing an explainable artificial intelligence (AI) method on the trained stacking model, an explanation of the predicted class probabilities based on features extracted from the new data; and outputting the explanation and the predicted class probabilities.

12. The non-transitory computer readable medium of claim 11, wherein the historical data comprises a plurality of class labels and a plurality of quantities associated with the plurality of class labels, wherein the training the stacking model from historical data comprises: executing a first machine learning workflow to extract features from the historical data and to learn the learn to rank model that is fit on the extracted features and the plurality of quantities; calculating out-of-sample ranking scores corresponding to the extracted features and the plurality of quantities from executing a loss function configured to determine ranking; and executing a second machine learning workflow to learn the classifier based on the ranking scores, the classifier configured to predict classes from the ranking scores.

13. The non-transitory computer readable medium of claim 12, wherein the calculating the out-of-sample ranking scores corresponding to the extracted features and the plurality of quantities from executing the loss function configured to determine the ranking comprises: partitioning the extracted features and the plurality of quantities into a number of subsets at random; for each of the random subsets, fitting a corresponding subset learn to rank model on ones of the extracted features and ones of the plurality of quantities not in said each of the random subsets; and for the each of the random subsets, applying the fitted corresponding subset learn to rank model on said each of the random subsets to determine a ranking score for each sample in said random subset.

14. The non-transitory computer readable medium of claim 11, wherein the applying the trained stacking model to new data to obtain the predicted class probabilities comprises: applying the learn to rank model to the features extracted from the new data to obtain ranking scores; and applying the classifier to the ranking scores to obtain the predicted class probabilities.

15. The non-transitory computer readable medium of claim 11, wherein the obtaining, from executing the explainable Al method on the trained stacking model, the explanation of the predicted class probabilities based on the features extracted from the new data comprises: determining a matrix of feature contributions to predictions, wherein each row of the matrix corresponds to a sample in the new data and its element measures a local importance of the corresponding feature from the features extracted from the new data to the predictions for that sample; and determining, for each of the features extracted from the new data, a summation of absolute values of the feature contributions.

16. The non-transitory computer readable medium of claim 15, wherein the outputting the explanation and the predicted class probabilities comprises outputting a list of ordered features from most important to least important, and outputting ones of the feature contributions associated with the list of ordered features.

17. The non-transitory computer readable medium of claim 11, the instructions further comprising determining, for each of the features extracted from the new data, summary statistics derived from feature values.

18. The non-transitory computer readable medium of claim 11, the instructions further comprising removing predictions associated with dubious explanations.

19. The non-transitory computer readable medium of claim 11, the instructions further comprising identifying errors in the new data or the trained stacking model.

20. The non-transitory computer readable medium of claim 11, further comprising scoring the trained stacking model based on frequency of dubious explanations.

Description:
A METHOD TO PROVIDE MODEL EXPLANATIONS FOR ORDINAL MULTI-CLASS PROBLEMS

BACKGROUND

Field

[0001] The present disclosure is directed to artificial intelligence (AI) systems, and more specifically, to model explanations for ordinal multi-class problems.

Related Art

[0002] Explainable AI (XAI) is the concept that an AI system should be able to provide meaningful explanations for its outputs. As AI systems become more prevalent in everyday life, the need for XAI has also increased. Explainability is essential to the trustworthiness and hence adoption of an AI system.

[0003] Such explanations are important for many reasons: they can reveal errors in the data or problem formulation and uncover new insights about the process being modeled. They can help the domain expert judge the quality of the model and possibly reject a particular prediction if the explanations appear wrong.

[0004] Many methods have been developed to explain an AI model’s outputs in terms of its inputs. Popular methods include SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations), which are model agnostic: they can be used with any model. FIG. 1 shows an example of applying SHAP to a model with three features. For a particular data point (x1, x2, x3), the model predicts a label value of 0.6. If the features are all unknown, the prediction is 0.2 (the base value). The SHAP method quantifies the contribution (or effect) of each feature so that the total contribution equals the deviation of the predicted value from the base value.
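
As a minimal illustration of this additivity property, the following Python sketch applies the SHAP library to a small model with three features. The synthetic data, the random-forest model choice, and all variable names are illustrative assumptions, not part of the disclosed method.

import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # three features x1, x2, x3
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.1 * X[:, 2]  # synthetic target

model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.Explainer(model.predict, X)   # model-agnostic explainer
explanation = explainer(X[:1])                 # explain one data point (x1, x2, x3)

# The base value plus the per-feature contributions reproduces the prediction.
print(explanation.base_values[0] + explanation.values[0].sum())
print(model.predict(X[:1])[0])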

[0005] In practice, there is often a trade-off between model accuracy and explainability. Models with a simple structure (such as linear models and shallow decision trees) based on a few features are easy to explain but tend to be less accurate than complicated models with many features.

SUMMARY

[0006] Aspects of the present disclosure can involve a method, which can involve training, from historical data, a stacking model involving a learn to rank model and a classifier, the learn to rank model configured to output a ranking score, the classifier configured to output a predicted probability of each class label from a plurality of class labels; applying the trained stacking model to new data to obtain predicted class probabilities; obtaining, from executing an explainable artificial intelligence (AI) method on the trained stacking model, an explanation of the predicted class probabilities based on features extracted from the new data; and outputting the explanation and the predicted class probabilities.

[0007] Aspects of the present disclosure can involve a computer program, which can involve instructions involving training, from historical data, a stacking model involving a learn to rank model and a classifier, the learn to rank model configured to output a ranking score, the classifier configured to output a predicted probability of each class label from a plurality of class labels; applying the trained stacking model to new data to obtain predicted class probabilities; obtaining, from executing an explainable artificial intelligence (AI) method on the trained stacking model, an explanation of the predicted class probabilities based on features extracted from the new data; and outputting the explanation and the predicted class probabilities. The computer program and instructions can be stored on a non-transitory computer readable medium and executed by one or more processors.

[0008] Aspects of the present disclosure can involve a system, which can involve means for training, from historical data, a stacking model involving a learn to rank model and a classifier, the learn to rank model configured to output a ranking score, the classifier configured to output a predicted probability of each class label from a plurality of class labels; means for applying the trained stacking model to new data to obtain predicted class probabilities; means for obtaining, from executing an explainable artificial intelligence (AI) method on the trained stacking model, an explanation of the predicted class probabilities based on features extracted from the new data; and means for outputting the explanation and the predicted class probabilities.

[0009] Aspects of the present disclosure can involve an apparatus, which can involve a processor configured to train, from historical data, a stacking model involving a learn to rank model and a classifier, the learn to rank model configured to output a ranking score, the classifier configured to output a predicted probability of each class label from a plurality of class labels; apply the trained stacking model to new data to obtain predicted class probabilities; obtain, from executing an explainable artificial intelligence (AI) method on the trained stacking model, an explanation of the predicted class probabilities based on features extracted from the new data; and output the explanation and the predicted class probabilities.

[0010]

BRIEF DESCRIPTION OF DRAWINGS

[0011] FIG. 1 shows an example of applying SHAP to a model with three features.

[0012] FIG. 2 illustrates an example of the multi-class approach and the regression approach.

[0013] FIG. 3 illustrates a regression approach.

[0014] FIG. 4 illustrates an example of how SHAP feature contributions are harder to understand for multi-class models.

[0015] FIG. 5 illustrates an example of model learning, prediction, and explanation, in accordance with an example implementation.

[0016] FIG. 6(A) illustrates an example of the overall flowchart, in accordance with an example implementation.

[0017] FIG. 6(B) illustrates an example of the input data, in accordance with an example implementation.

[0018] FIG. 7 illustrates the flow for learning the stacking model, in accordance with an example implementation.

[0019] FIG. 8 illustrates one way to calculate out-of-sample ranking scores, in accordance with an example implementation.

[0020] FIG. 9 illustrates how to apply the stacking model to get the predicted class probabilities for some new data, in accordance with an example implementation.

[0021] FIG. 10 illustrates how to obtain explanations for the predictions, in accordance with an example implementation.

[0022] FIGS. 11(A) and 11(B) illustrate an example output, in accordance with an example implementation.

[0023] FIG. 12 illustrates a system involving a plurality of physical systems networked to a management apparatus, in accordance with an example implementation.

[0024] FIG. 13 illustrates an example computing environment with an example computer device suitable for use in some example implementations.

DETAILED DESCRIPTION

[0025] The following detailed description provides details of the figures and example implementations of the present application. Reference numerals and descriptions of redundant elements between figures are omitted for clarity. Terms used throughout the description are provided as examples and are not intended to be limiting. For example, the use of the term “automatic” may involve fully automatic or semi-automatic implementations involving user or administrator control over certain aspects of the implementation, depending on the desired implementation of one of ordinary skill in the art practicing implementations of the present application. Selection can be conducted by a user through a user interface or other input means, or can be implemented through a desired algorithm. Example implementations as described herein can be utilized either singularly or in combination and the functionality of the example implementations can be implemented through any means according to the desired implementations.

[0026] A multi-class prediction problem where the K ≥ 3 classes have a meaningful ordering is called ordinal regression or ordinal classification in statistics. Generally, one wants to seek models that preserve the ordering information and are amenable to explanations. One approach is to assume that the classes correspond monotonically to some unobserved linear combination y = wᵀx of the input features x and use a generalized linear model to fit the weights w and the thresholds that determine this correspondence; the ordered logit and ordered probit models fall in this category. Example implementations described herein use the same notion of an underlying quantity y. Example implementations assume that y is observed and learn the correspondence between an intermediate “surrogate” quantity for y and the classes. Further, the present disclosure is not limited to models that depend linearly on x and does not constrain the learned correspondence to be monotonic (although such modeling choices are allowed for); hence the proposed approach is more flexible.
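
For reference only, an ordered logit model of the kind mentioned above can be fit with the statsmodels package roughly as in the sketch below. The synthetic data, class thresholds, and variable names are illustrative assumptions; this generalized linear approach is distinct from the more flexible method proposed herein.

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 2)), columns=["x1", "x2"])
latent = 1.0 * X["x1"] - 0.5 * X["x2"] + rng.logistic(size=500)   # unobserved y = w'x + noise
labels = pd.cut(latent, bins=[-np.inf, -1.0, 1.0, np.inf], labels=["A", "B", "C"])

model = OrderedModel(labels, X, distr="logit")     # ordered logit (ordinal regression)
result = model.fit(method="bfgs", disp=False)
print(result.params)                               # fitted weights w and thresholds
print(result.predict(X.iloc[:3]))                  # predicted probability of each class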

[0027] Another approach transforms the K-class problem into K-1 binary problems, which can be modeled using any binary classifier. However, the multiple prediction outputs are harder to explain. In the proposed method, a special stacking model that involves a learn-to-rank (LTR) model and a classifier is utilized. Stacking is a way to try to improve prediction accuracy. The proposed method requires explaining only the LTR model. LTR models are used for information retrieval applications, such as ranking subsets of documents by their relevance to search queries. In example implementations, the LTR models are used to learn to rank all the data as an intermediate step for classification.

[0028] Example implementations described herein provide model explanations for multi-class prediction problems where the classes are ordinal, i.e., have a meaningful ordering. Multi-class problems have K ≥ 3 classes. The example implementations described herein involve a method that is well suited to both model accuracy and explainability.

[0029] In a first example, air quality is often reported using a six-level system, based on the Air Quality Index (AQI). The assignment of the numeric AQI to these levels is meant to help people better understand the impact on their health and make an appropriate decision. In a second example, suppose that a system is monitoring a piece of equipment and there is a need to predict three classes for the time to failure, which correspond to different actions. For ordinal classes, there is a numeric quantity y that corresponds monotonically to the classes. In these examples, AQI and time to failure are natural choices for y. There is also the option of using the class index 1, 2, ..., K as y.
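
To make the mapping from the numeric quantity y to ordinal classes concrete, the sketch below assigns an AQI-like value to a level; the breakpoints are illustrative and are not asserted to be the official AQI table.

def aqi_to_level(aqi: float) -> str:
    """Map a numeric air-quality value to an ordinal class label (illustrative breakpoints)."""
    levels = [
        (50, "Good"),
        (100, "Moderate"),
        (150, "Unhealthy for Sensitive Groups"),
        (200, "Unhealthy"),
        (300, "Very Unhealthy"),
    ]
    for upper, label in levels:
        if aqi <= upper:
            return label
    return "Hazardous"

print(aqi_to_level(42))    # "Good"
print(aqi_to_level(180))   # "Unhealthy"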

[0030] FIG. 2 illustrates an example of the multi-class approach and the regression approach. Since the label is categorical, it is common to use a multi-class model to predict the class label based on the relevant features from the training samples. Typically, a multi-class model outputs the probability of each class. An alternative approach is to use a regression model to predict the numeric y and then assign the class that corresponds to the predicted value y.

[0031] FIG. 3 illustrates a regression approach. In the regression approach, the predicted value y tends to be near the center of the distribution. In this example, the predicted class corresponding to y is C. An end class like A or F may never be predicted by this approach, even though it may have the largest probability P(·). For modeling, the multi-class approach is generally better: the output probabilities are well suited for decision-making (for example, they can be used to calculate the expected costs for different actions).

[0032] The regression approach is generally less accurate at predicting the class because the predicted value is a measure of centrality (such as the mean or median) of the distribution of y given the features. Hence, classes that correspond to small or large values of y may not be predicted.
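
A minimal sketch of the two approaches in FIG. 2, using scikit-learn on synthetic data, is shown below; the class cut-points on y and all names are illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=1000)   # underlying quantity
labels = np.digitize(y, bins=[-2.0, 2.0])                          # three ordered classes 0, 1, 2

# Multi-class approach: predict the class probabilities directly.
clf = RandomForestClassifier(random_state=0).fit(X, labels)
print(clf.predict_proba(X[:1]))                      # e.g., probabilities for classes 0, 1, 2

# Regression approach: predict y, then assign the class containing the prediction.
reg = RandomForestRegressor(random_state=0).fit(X, y)
y_hat = reg.predict(X[:1])
print(np.digitize(y_hat, bins=[-2.0, 2.0]))          # predicted class index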

[0033] FIG. 4 illustrates an example of how SHAP feature contributions are harder to understand for multi-class models. Even though the multi-class approach is generally more accurate, it has some disadvantages for explanations. There is a need to explain how each input feature affects all K predicted probabilities. The K effects cannot be interpreted independently: since the class probabilities must sum to 1, an increase for one class must be accompanied by a decrease elsewhere. Further, the explanations could seem counter-intuitive or contradictory. For example, a feature might be found to increase the probability of both “Good” and “Unhealthy” air quality levels.

[0034] These problems do not arise for binary classification problems, since a single model output defined as the ratio of the two class probabilities can be used. With three or more classes, there is no simple way to reduce to a single output.

[0035] In contrast, explaining the regression model avoids these pitfalls and is easier to understand, since there is only a single output y. In many applications, it is also more natural to describe the effects of the input features on the underlying quantity like AQI or time to failure, instead of the derived classes.

[0036] FIG. 5 illustrates an example of model learning, prediction, and explanation, in accordance with an example implementation. Example implementations described herein train a model that learns to rank the samples by the quantity y. This so-called LTR model predicts a numeric score r for each sample so that the rankings of the samples by r and by y are as similar as possible. Example implementations train a separate classifier to predict the class label from r.

[0037] In the model learning, data with class labels and rankings are used as training samples and validation sets. The validation sets can be a subset of the data, and used for model selection. The learn-to-rank model generates ranking scores, wherein each ri is obtained by applying an LTR model to the i-th feature vector. The ranking scores and class labels are then used to train the classifier accordingly. Further details are provided herein.

[0038] In the prediction, or deployment of the learn-to-rank model and classifier, a feature vector is provided to the learn-to-rank model, which generates the ranking score. The ranking scores are then provided to the classifier to obtain the predicted class probabilities.

[0039] For explanations, example implementations apply existing XAI techniques to the LTR model. For example, the SHAP method can quantify the contribution of each input feature to r. For a good LTR model, r accurately reproduces the order of y, so r also corresponds monotonically to the classes. Example implementations can then treat r as a numeric surrogate for the classes and interpret the feature contributions to r as effects on the predicted class index. This allows the example implementations to obtain the locally and globally important features for the multi-class problem.

[0040] FIG. 6(A) illustrates an example of the overall flowchart of FIG. 5, in accordance with an example implementation. The flows of FIG. 6(A) are described with respect to FIGS. 6(B) to 11 herein. At first, the flow receives input 601, which can include data with class labels and the underlying quantity. FIG. 6(B) illustrates an example of the input data 601, in accordance with an example implementation. Each column from 1 to m contains data of a single type, such as numbers, dates, categories, and text. The class labels and underlying quantity are in the last two columns. There is a known mapping from y to the label, such as in the air quality and time to failure examples earlier. The class labels are the prediction target. Predictions are sought for each row.

[0041] At 602, the flow learns a stacking model involving a learn-to-rank model and a classifier. FIG. 7 illustrates the flow for learning the stacking model 602, in accordance with an example implementation. Two models are learned: an LTR model and a classifier. The training process for the learn-to-rank model and the classifier receives input 701, which is the data with the class labels and underlying quantity. At 602, the flow selects a loss function suitable for ranking tasks, such as pairwise or listwise loss. The LTR model M is learned at 702 using a standard machine learning workflow: feature engineering (including feature selection), partitioning the data into training and validation sets, fitting models on the training set, and selecting the model with the best performance on the validation set. The models are fit on numeric features extracted from the data; typically, the feature vector (x1, x2, ..., xp) for each label to be predicted is a function of the corresponding data values (such as val1,1, val1,2, ..., val1,m), where the dimension p is constant.

[0042] This workflow is similar to learning a regression model to predict y. The important difference is that for fitting a regression model, the L1 or L2 loss function is used (corresponding to mean absolute error and mean squared error, respectively), whereas for an LTR model, pairwise or listwise loss is used, as these are more suitable for the ranking task. LTR models can be fit using existing methods as known in the art.
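
As one concrete illustration of fitting the LTR model at 702 with a ranking loss, the sketch below uses LightGBM's lambdarank objective on synthetic data. Because lambdarank expects integer relevance grades, the continuous quantity y is binned into quantile grades here; this binning, the library choice, and all names are illustrative assumptions rather than requirements of the disclosure.

import numpy as np
import pandas as pd
from lightgbm import LGBMRanker

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                          # extracted feature vectors
y = X @ np.array([1.5, 0.3, -2.0, 0.0, 0.0]) + rng.normal(size=1000)    # underlying quantity

grades = pd.qcut(y, q=10, labels=False)          # integer relevance grades derived from y
ltr_model = LGBMRanker(objective="lambdarank", n_estimators=200)
ltr_model.fit(X, grades, group=[len(X)])         # treat all samples as one ranked list

ranking_scores = ltr_model.predict(X)            # the ranking score r for each sample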

[0043] When selecting the best model on the validation set, the same loss function that was used for fitting the LTR model can be used for model evaluation. Example implementations described herein can also use a rank-based metric that compares the model predictions r with y. For example, the Spearman’s correlation of r and y can be used for model evaluation.
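
For instance, the rank-based evaluation can be computed with scipy as below; the score and quantity values are illustrative.

import numpy as np
from scipy.stats import spearmanr

y = np.array([3.2, 0.5, 7.1, 2.8, 5.0])     # underlying quantities for validation samples
r = np.array([0.9, -1.2, 2.4, 0.7, 1.6])    # ranking scores from a candidate LTR model

rho, _ = spearmanr(r, y)
print(f"Spearman correlation between r and y: {rho:.3f}")   # 1.000 here (identical ordering)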

[0044] The learned LTR model M can be applied to any new feature vector to get a number r, which is referred to as the ranking score. For example, the model M might calculate r as 1.5x1 + 0.3x2 - 2x3.

[0045] The classifier to be learned will use the LTR predictions as a feature, in this case the only feature. This combination of the LTR model and the classifier is an example of a stacking model. As is known in stacking, care must be taken to use out-of-sample LTR predictions for fitting the classifier, to avoid overfitting as shown in 703. Further detail will be described with respect to FIG. 8.

[0046] After the out-of-sample ranking scores are obtained, example implementations use a standard machine learning workflow to learn a classifier to predict the class labels in the input data from those scores at 704. A simple workflow is applicable since there is only one feature. For example, example implementations do not perform feature selection and only consider simpler models such as decision trees and linear discriminant analysis models. The learned classifier can be applied to any number r that represents a ranking score to produce the predicted class probabilities.

[0047] FIG. 8 illustrates one way to calculate out-of-sample ranking scores 703, in accordance with an example implementation. At 801, the input provided involves an LTR model M and the N samples (feature vectors and y) used to fit the LTR model. At 802, the flow partitions the samples into J subsets (e.g., 10), or folds, at random, such as in cross-validation. At 803, for each j, the flow fits an LTR model Mj on the samples not in fold j in the same way that M was fit on all the samples (cf. 702). At 804, for each j, the flow applies model Mj to the samples in fold j. Let ri be the result of applying the appropriate model to the i-th sample (i = 1, 2, ..., N). Each ri is a number (the ranking score). The flow then returns as output the ranking scores r1, r2, ..., rN.
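
A minimal Python sketch of this fold-based procedure (703) and the subsequent classifier learning (704) follows; the fit_ltr helper, the shallow decision tree, and the fold count are illustrative assumptions.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def out_of_sample_scores(X, y, fit_ltr, n_folds=10, seed=0):
    """Step 703: score each sample with a model fit on the samples outside its fold.
    fit_ltr(X, y) is an assumed helper returning a fitted LTR model with a .predict method."""
    scores = np.empty(len(X))
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, fold_idx in folds.split(X):
        model_j = fit_ltr(X[train_idx], y[train_idx])     # fit on samples not in fold j (cf. 702)
        scores[fold_idx] = model_j.predict(X[fold_idx])   # ranking scores for samples in fold j
    return scores

def fit_score_classifier(scores, class_labels):
    """Step 704: the ranking score is the only feature, so a simple classifier suffices."""
    return DecisionTreeClassifier(max_depth=3).fit(scores.reshape(-1, 1), class_labels)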

[0048] FIG. 9 illustrates how to apply the stacking model to get the predicted class probabilities for some new data 603, in accordance with an example implementation. This task is called prediction or inference. For each data instance, the output is a vector of K numbers (the class probabilities) that sum to 1.

[0049] At first, the input 901 involves data for which class probability predictions are desired. At 902, the flow extracts the feature vectors for the data. At 903, the flow applies the LTR model to the feature vectors to get the ranking scores. At 904, the flow applies the classifier to the ranking scores to get the predicted class probabilities for each data instance.
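
The flow of FIG. 9 can be sketched in a few lines; extract_features, ltr_model, and classifier are assumed, illustratively named objects produced by the training flow above.

import numpy as np

def predict_class_probabilities(raw_rows, extract_features, ltr_model, classifier):
    X_new = extract_features(raw_rows)                  # 902: feature vectors for the new data
    r = ltr_model.predict(X_new)                        # 903: ranking scores
    return classifier.predict_proba(r.reshape(-1, 1))   # 904: predicted class probabilities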

[0050] In an example of the flow of FIG. 9, suppose there are three classes A, B, and C in their natural order. If a single data instance comprises the values (val1,1, val1,2, ..., val1,m) (cf. FIG. 6(B)), extract the feature vector (x1, x2, ..., xp) from these values, apply the LTR model to the feature vector to get a single number r (the ranking score), then apply the classifier to r to get the predicted class probabilities P(A), P(B), and P(C). The output might be P(A)=0.65, P(B)=0.22, P(C)=0.13.

[0051] FIG. 10 illustrates how to obtain explanations for the predictions 604, in accordance with an example implementation. Example implementations obtain the explanations for the predictions by applying standard XAI techniques such as SHAP to the LTR model. As input, the flow involves an LTR model M and the samples used to fit it; new data and their predicted class probabilities at 1001.

[0052] For SHAP, the output is the contribution of each feature to each model prediction (the ranking score). If there are n ranking scores and p features, the output may be arranged in an n by p matrix as shown at 1002.
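
A sketch of computing that matrix by applying the SHAP method to the LTR model alone is shown below; ltr_model, the training features, and the new-data features are illustratively named inputs.

import shap

def contribution_matrix(ltr_model, X_train, X_new):
    """Return the n-by-p matrix of feature contributions to the ranking score r (1002)."""
    explainer = shap.Explainer(ltr_model.predict, X_train)   # explain the single output r
    explanation = explainer(X_new)
    return explanation.values     # one row per sample in the new data, one column per feature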

[0053] At 1003 and 1004, the explanations are given as lists of features sorted from most to least important. It can be useful to give just the most important features, for example, the top k features or the top x% of features, where k and x are user-specified parameters. Another way of limiting each list is to calculate the total importance (local or global) of all the features and drop features whose importance relative to the total importance is below a user-specified threshold.
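
The local and global importance lists described above can be derived from the contribution matrix as in the following sketch; the top-k cutoff and names are illustrative.

import numpy as np

def top_local_features(contribution_row, feature_names, k=3):
    """Local importance (1003): the k features with the largest absolute contribution for one sample."""
    order = np.argsort(-np.abs(contribution_row))[:k]
    return [(feature_names[i], contribution_row[i]) for i in order]

def global_importance(contributions, feature_names):
    """Global importance (1004): sum of absolute contributions per feature, most important first."""
    totals = np.abs(contributions).sum(axis=0)
    order = np.argsort(-totals)
    return [(feature_names[i], totals[i]) for i in order]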

[0054] FIGS. 11(A) and 11(B) illustrate an example output 1003, in accordance with an example implementation. Specifically, FIG. 11(A) shows the three features with the highest local importance for predicting, say, that P(A)=0.65, P(B)=0.22, and P(C)=0.13 from the feature vector (x1, x2, ..., xp). Features 7 and 23 have the largest positive contributions. This can be interpreted to mean that they have the strongest effect for moving the predicted class towards C. Similarly, feature 12 has the strongest effect for moving the predicted class towards A. The last three columns contain the minimum, median, and maximum values of these features in the training data. The feature values and these summary statistics may help a domain expert decide if the explanations are sensible or dubious.

[0055] The interpretation for FIG. 11(A) is justified for a good LTR model (as measured by the selected loss function, Spearman’s correlation, or the like). This is because the LTR model output r accurately reproduces the order of y, and since y corresponds monotonically to the classes, so does r. Hence a feature with a positive contribution to r tends to move the predicted class towards the last class, and one with a negative contribution tends to move it towards the first class.

[0056] FIG. 11(B) illustrates an example output with the feature contributions, in accordance with an example implementation. In the example of FIG. 11(B), the feature contributions are demonstrated in the form of a waterfall chart; however, other outputs are also possible and the present disclosure is not particularly limited thereto. Through use of the explainable AI techniques as described herein, by either using local importance 1003 or global importance 1004, the features can be output in order from most important to least important features with a cutoff set in accordance with the desired implementation. Such a cutoff can be useful if there are many features for consideration. Through such an output, the contribution of each feature to the predicted ranking score can thereby be known and understood.

[0057] By explaining the LTR model, which has a single output r, it is thereby possible to avoid the difficulty of explaining the multiple dependent prediction outputs produced by some existing approaches. This leads to explanations that are easier to understand.

[0058] The proposed stacking model predicts class probabilities and is hence amenable to decision-making. Existing models like the ordered logit and ordered probit also predict class probabilities but are less flexible because of their linearity and distributional assumptions.

[0059] If the classes change but the underlying quantity y remains the same, only the classifier needs to be retrained; the LTR model can remain the same. This might happen if classes are merged or split, or if the class definition in terms of y changes. Since the classifier is simple and most of the modeling complexity (feature engineering, model selection, etc.) is in the LTR model, this can yield substantial cost and performance benefits.
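
A sketch of this retraining path is shown below; new_label_fn, which maps y to the redefined classes, and the decision-tree choice are illustrative assumptions.

from sklearn.tree import DecisionTreeClassifier

def refit_classifier_for_new_classes(ranking_scores, y, new_label_fn):
    """Re-map y to the new class definitions and refit only the classifier; the LTR model is reused."""
    new_labels = [new_label_fn(value) for value in y]
    return DecisionTreeClassifier(max_depth=3).fit(ranking_scores.reshape(-1, 1), new_labels)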

[0060] Explanations can take the form of lists of locally and globally important features. A domain expert can compare them with her domain knowledge and flag cases that appear to be dubious or suspect. This can be used for rejecting predictions with dubious explanations, to help identify errors in the data, problem formulation, or model, and for model assessment based on the frequency of dubious explanations.

[0061] FIG. 12 illustrates a system involving a plurality of physical systems networked to a management apparatus, in accordance with an example implementation. One or more physical systems 1201 integrated with various sensors are communicatively coupled to a network 1200 (e.g., local area network (LAN), wide area network (WAN)) through the corresponding network interface of the sensor system installed in the physical systems 1201, which is connected to a management apparatus 1202. The management apparatus 1202 manages a database 1203, which contains historical data collected from the sensor systems from each of the physical systems 1201. In alternate example implementations, the data from the sensor systems of the physical systems 1201 can be stored to a central repository or central database such as proprietary databases that intake data from the physical systems 1201, or systems such as enterprise resource planning systems, and the management apparatus 1202 can access or retrieve the data from the central repository or central database. The sensor systems of the physical systems 1201 can include any type of sensors to facilitate the desired implementation, such as but not limited to gyroscopes, accelerometers, global positioning satellite (GPS), thermometers, humidity gauges, or any sensors that can measure one or more of temperature, humidity, gas levels (e.g., CO2 gas), and so on. Examples of physical systems can include, but are not limited to, shipping containers, lathes, air compressors, and so on. Further, the physical systems can also be represented as virtual systems, such as in the form of a digital twin.

[0062] FIG. 13 illustrates an example computing environment with an example computer device suitable for use in some example implementations, such as a management apparatus 1202 as illustrated in FIG. 12. Computer device 1305 in computing environment 1300 can include one or more processing units, cores, or processors 1310, memory 1315 (e.g., RAM, ROM, and/or the like), internal storage 1320 (e.g., magnetic, optical, solid state storage, and/or organic), and/or I/O interface 1325, any of which can be coupled on a communication mechanism or bus 1330 for communicating information or embedded in the computer device 1305. I/O interface 1325 is also configured to receive images from cameras or provide images to projectors or displays, depending on the desired implementation.

[0063] Computer device 1305 can be communicatively coupled to input/user interface 1335 and output device/interface 1340. Either one or both of input/user interface 1335 and output device/interface 1340 can be a wired or wireless interface and can be detachable. Input/user interface 1335 may include any device, component, sensor, or interface, physical or virtual, that can be used to provide input (e.g., buttons, touch-screen interface, keyboard, a pointing/cursor control, microphone, camera, braille, motion sensor, optical reader, and/or the like). Output device/interface 1340 may include a display, television, monitor, printer, speaker, braille, or the like. In some example implementations, input/user interface 1335 and output device/interface 1340 can be embedded with or physically coupled to the computer device 1305. In other example implementations, other computer devices may function as or provide the functions of input/user interface 1335 and output device/interface 1340 for a computer device 1305.

[0064] Examples of computer device 1305 may include, but are not limited to, highly mobile devices (e.g., smartphones, devices in vehicles and other machines, devices carried by humans and animals, and the like), mobile devices (e.g., tablets, notebooks, laptops, personal computers, portable televisions, radios, and the like), and devices not designed for mobility (e.g., desktop computers, other computers, information kiosks, televisions with one or more processors embedded therein and/or coupled thereto, radios, and the like).

[0065] Computer device 1305 can be communicatively coupled (e.g., via I/O interface 1325) to external storage 1345 and network 1350 for communicating with any number of networked components, devices, and systems, including one or more computer devices of the same or different configuration. Computer device 1305 or any connected computer device can be functioning as, providing services of, or referred to as a server, client, thin server, general machine, special-purpose machine, or another label.

[0066] I/O interface 1325 can include, but is not limited to, wired and/or wireless interfaces using any communication or I/O protocols or standards (e.g., Ethernet, 802.11x, Universal Serial Bus, WiMax, modem, a cellular network protocol, and the like) for communicating information to and/or from at least all the connected components, devices, and network in computing environment 1300. Network 1350 can be any network or combination of networks (e.g., the Internet, local area network, wide area network, a telephonic network, a cellular network, satellite network, and the like).

[0067] Computer device 1305 can use and/or communicate using computer-usable or computer-readable media, including transitory media and non-transitory media. Transitory media include transmission media (e.g., metal cables, fiber optics), signals, carrier waves, and the like. Non-transitory media include magnetic media (e.g., disks and tapes), optical media (e.g., CD ROM, digital video disks, Blu-ray disks), solid state media (e.g., RAM, ROM, flash memory, solid-state storage), and other non-volatile storage or memory.

[0068] Computer device 1305 can be used to implement techniques, methods, applications, processes, or computer-executable instructions in some example computing environments. Computer-executable instructions can be retrieved from transitory media, and stored on and retrieved from non-transitory media. The executable instructions can originate from one or more of any programming, scripting, and machine languages (e.g., C, C++, C#, Java, Visual Basic, Python, Perl, JavaScript, and others).

[0069] Processor(s) 1310 can execute under any operating system (OS) (not shown), in a native or virtual environment. One or more applications can be deployed that include logic unit 1360, application programming interface (API) unit 1365, input unit 1370, output unit 1375, and inter-unit communication mechanism 1395 for the different units to communicate with each other, with the OS, and with other applications (not shown). The described units and elements can be varied in design, function, configuration, or implementation and are not limited to the descriptions provided. Processor(s) 1310 can be in the form of hardware processors such as central processing units (CPUs) or in a combination of hardware and software units.

[0070] In some example implementations, when information or an execution instruction is received by API unit 1365, it may be communicated to one or more other units (e.g., logic unit 1360, input unit 1370, output unit 1375). In some instances, logic unit 1360 may be configured to control the information flow among the units and direct the services provided by API unit 1365, input unit 1370, output unit 1375, in some example implementations described above. For example, the flow of one or more processes or implementations may be controlled by logic unit 1360 alone or in conjunction with API unit 1365. The input unit 1370 may be configured to obtain input for the calculations described in the example implementations, and the output unit 1375 may be configured to provide output based on the calculations described in example implementations.

[0071] Processor(s) 1310 can be configured to execute a method or instructions involving training, from historical data, a stacking model involving a learn to rank model and a classifier, the learn to rank model configured to output a ranking score, the classifier configured to output a predicted probability of each class label from a plurality of class labels as shown at 602; applying the trained stacking model to new data to obtain predicted class probabilities as shown at 603; obtaining, from executing an explainable artificial intelligence (AI) method on the trained stacking model, an explanation of the predicted class probabilities based on features extracted from the new data as shown at 604; and outputting the explanation and the predicted class probabilities as shown at 605 of FIG. 6(A).

[0072] Processor(s) 1310 can be configured to execute a method or instructions as described above, wherein the historical data involves a plurality of class labels and a plurality of quantities associated with the plurality of class labels as shown in FIG. 6(B), wherein the training the stacking model from historical data involves executing a first machine learning workflow to extract features from the historical data and to learn the learn to rank model that is fit on the extracted features and the plurality of quantities as shown at 702; calculating out-of-sample ranking scores corresponding to the extracted features and the plurality of quantities from executing a loss function configured to determine ranking as shown at 703; and executing a second machine learning workflow to learn the classifier based on the ranking scores, the classifier configured to predict classes from the ranking scores as shown at 704 of FIG. 7.

[0073] Processor(s) 1310 can be configured to execute a method or instructions as described above, wherein the calculating the out-of-sample ranking scores corresponding to the extracted features and the plurality of quantities from executing the loss function configured to determine the ranking involves partitioning the extracted features and the plurality of quantities into a number of subsets at random as shown at 802; for each of the random subsets, fitting a corresponding subset learn to rank model on ones of the extracted features and ones of the plurality of quantities not in said each of the random subsets as shown at 803; and for the each of the random subsets, applying the fitted corresponding subset learn to rank model on said each of the random subsets to determine a ranking score for each sample in said random subset as shown at 804 of FIG. 8.

[0074] Processor(s) 1310 can be configured to execute a method or instructions as described herein, wherein the applying the trained stacking model to new data to obtain the predicted class probabilities can involve applying the learn to rank model to the features extracted from the new data to obtain ranking scores as shown at 903; and applying the classifier to the ranking scores to obtain the predicted class probabilities as shown at 904 of FIG. 9.

[0075] Processor(s) 1310 can be configured to execute the method or instructions as described above, wherein the obtaining, from executing the explainable AI method on the trained stacking model, the explanation of the predicted class probabilities based on the features extracted from the new data can involve determining a matrix of feature contributions to predictions, wherein each row of the matrix corresponds to a sample in the new data and its element measures a local importance of the corresponding feature from the features extracted from the new data to the predictions for that sample as shown at 1002 and 1003; and determining, for each of the features extracted from the new data, a summation of the absolute values of the feature contributions as shown at 1004 of FIG. 10. As described at 1003 of FIG. 10, the outputting the explanation and the predicted class probabilities can involve outputting a list of ordered features from most important to least important, and outputting ones of the feature contributions associated with the list of ordered features.

[0076] Processor(s) 1310 can be configured to execute methods or instructions as described above, and further involve determining, for each of the features extracted from the new data, summary statistics derived from the feature values as illustrated in FIG. 11(A) and FIG. 11(B). Depending on the desired implementation, Processor(s) 1310 can also be configured to execute methods or instructions involving removing predictions associated with dubious explanations.

[0077] Processor(s) 1310 can be configured to execute methods or instructions as described above, and further involve identifying errors in the new data or the trained stacking model as described in FIG. 11(A) and FIG. 11(B).

[0078] Processor(s) 1310 can be configured to execute methods or instructions as described above, and further involve scoring the trained stacking model based on frequency of dubious explanations as described in FIG. 11(A) and FIG. 11(B).

[0079] Some portions of the detailed description are presented in terms of algorithms and symbolic representations of operations within a computer. These algorithmic descriptions and symbolic representations are the means used by those skilled in the data processing arts to convey the essence of their innovations to others skilled in the art. An algorithm is a series of defined steps leading to a desired end state or result. In example implementations, the steps carried out require physical manipulations of tangible quantities for achieving a tangible result.

[0080] Unless specifically stated otherwise, as apparent from the discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” “displaying,” or the like, can include the actions and processes of a computer system or other information processing device that manipulates and transforms data represented as physical (electronic) quantities within the computer system’s registers and memories into other data similarly represented as physical quantities within the computer system’s memories or registers or other information storage, transmission or display devices.

[0081] Example implementations may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may include one or more general-purpose computers selectively activated or reconfigured by one or more computer programs. Such computer programs may be stored in a computer readable medium, such as a computer-readable storage medium or a computer-readable signal medium. A computer-readable storage medium may involve tangible mediums such as, but not limited to, optical disks, magnetic disks, read-only memories, random access memories, solid state devices and drives, or any other types of tangible or non-transitory media suitable for storing electronic information. A computer readable signal medium may include mediums such as carrier waves. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Computer programs can involve pure software implementations that involve instructions that perform the operations of the desired implementation.

[0082] Various general-purpose systems may be used with programs and modules in accordance with the examples herein, or it may prove convenient to construct a more specialized apparatus to perform desired method steps. In addition, the example implementations are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the techniques of the example implementations as described herein. The instructions of the programming language(s) may be executed by one or more processing devices, e.g., central processing units (CPUs), processors, or controllers.

[0083] As is known in the art, the operations described above can be performed by hardware, software, or some combination of software and hardware. Various aspects of the example implementations may be implemented using circuits and logic devices (hardware), while other aspects may be implemented using instructions stored on a machine-readable medium (software), which if executed by a processor, would cause the processor to perform a method to carry out implementations of the present application. Further, some example implementations of the present application may be performed solely in hardware, whereas other example implementations may be performed solely in software. Moreover, the various functions described can be performed in a single unit, or can be spread across a number of components in any number of ways. When performed by software, the methods may be executed by a processor, such as a general purpose computer, based on instructions stored on a computer-readable medium. If desired, the instructions can be stored on the medium in a compressed and/or encrypted format.

[0084] Moreover, other implementations of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the techniques of the present application. Various aspects and/or components of the described example implementations may be used singly or in any combination. It is intended that the specification and example implementations be considered as examples only, with the true scope and spirit of the present application being indicated by the following claims.