
Title:
FEDERATION OF SCORING SYSTEMS
Document Type and Number:
WIPO Patent Application WO/2024/039301
Kind Code:
A1
Abstract:
Systems and methods for federated scoring by a plurality of nodes, wherein each node comprises sensitive data based on which a first set of scoring model coefficients is generated. The first set of scoring model coefficients is broadcast to the rest of the nodes, and at least one node generates a federated scoring model based on contributory intermediate statistics received from the rest of the nodes and its respective first set of scoring model coefficients.

Inventors:
LIU NAN (SG)
LI SIQI (SG)
ONG ENG HOCK (SG)
Application Number:
PCT/SG2023/050574
Publication Date:
February 22, 2024
Filing Date:
August 18, 2023
Assignee:
NAT UNIV SINGAPORE (SG)
International Classes:
G06N20/00; G06F9/54; G16H50/30; G16H50/70
Foreign References:
CN114358912A 2022-04-15
Other References:
DAYAN ITTAI; ROTH HOLGER R.; ZHONG AOXIAO; HAROUNI AHMED; GENTILI AMILCARE; ABIDIN ANAS Z.; LIU ANDREW; COSTA ANTHONY BEARDSWORTH: "Federated learning for predicting clinical outcomes in patients with COVID-19", NATURE MEDICINE, vol. 27, no. 10, 15 September 2021 (2021-09-15), New York, pages 1735 - 1743, XP037589801, ISSN: 1078-8956, DOI: 10.1038/s41591-021-01506-3
XIE FENG, CHAKRABORTY BIBHAS, ONG MARCUS ENG HOCK, GOLDSTEIN BENJAMIN ALAN, LIU NAN: "AutoScore: A Machine Learning–Based Automatic Clinical Score Generator and Its Application to Mortality Prediction Using Electronic Health Records", JMIR MEDICAL INFORMATICS, vol. 8, no. 10, pages 1 - 19, XP093145003, ISSN: 2291-9694, DOI: 10.2196/21798
DUAN RUI, BOLAND MARY REGINA, LIU ZIXUAN, LIU YUE, CHANG HOWARD H, XU HUA, CHU HAITAO, SCHMID CHRISTOPHER H, FORREST CHRISTOPHER B: "Learning from electronic health records across multiple sites: A communication-efficient and privacy-preserving distributed algorithm", JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, vol. 27, no. 3, 1 March 2020 (2020-03-01), pages 376 - 385, XP093145011, ISSN: 1527-974X, DOI: 10.1093/jamia/ocz199
Attorney, Agent or Firm:
DAVIES COLLISON CAVE ASIA PTE. LTD. (SG)
Claims:

1. A system for federated scoring, the system comprising: a plurality of nodes, a communication network enabling communication between the plurality of nodes, each node comprising at least one processor and a memory, the memory of each node comprising sensitive data and program code, the sensitive data comprises a plurality of records; wherein the program code is executable by the respective processor of each node of the plurality of nodes to: generate a first set of scoring model coefficients based on sensitive data accessible to the respective nodes; broadcast the first set of scoring model coefficients to rest of the nodes; and wherein at least one of the nodes is configured to: receive contributory intermediate statistics from the rest of the nodes; and generate a federated scoring model based on the received contributory intermediate statistics and its respective first set of scoring model coefficients.

2. The system of claim 1, wherein each node is configured to generate a node specific scoring model based on the first set of scoring model coefficients.

3. The system of claim 2, wherein the system further comprises a central server, wherein the central server is configured to evaluate each of the node specific scoring models and the federated scoring model based on model parsimony statistics.

4. The system of claim 3, wherein each record comprises a plurality of variables and each node is further configured to: determine a rank of each of the plurality of variables based on the relevance of each of the variables to a scoring result generated by the respective scoring model; transmit the ranks of each of the plurality of variables to the central server; wherein the central server is configured to: define a global variable rank based on the ranks of the plurality of variables received from the plurality of nodes; transmit the global variable rank to at least one of the plurality of nodes.

5. The system of claim 4, wherein the relevance of each of the variables to a scoring result is evaluated based on model parsimony statistics or model area under curve statistics.

6. The system of claim 4 or claim 5, wherein the federated scoring model is generated based on the global variable rank.

7. The system of claim 6, wherein the federated model is generated by incorporating variables above a threshold in the global variable rank.

8. The system of any one of claims 4 to 7, wherein the nodes determine a rank of each of the plurality of variables using a random forest model.

9. The system of any one of claims 4 to 8, wherein the central server defines the global variable rank by averaging the rank of the plurality of variables received from each of the plurality of nodes.

10. The system of any one of claims 1 to 9, wherein the scoring models are implemented using any one of: linear classification models, logistic regression models, clinical decision support models.

11. The system of any one of claims 1 to 10, wherein each of the plurality of nodes is configured to transmit its node specific scoring model and scoring model performance data to the central server.

12. The system of claim 11, wherein the central server is configured to receive the federated model from at least one of the nodes.

13. The system of claim 10, wherein the central server is configured to transmit the federated model to at least a subset of the plurality of nodes.

14. The system of any one of claims 1 to 13, wherein the variables comprise one or more continuous data variables, and each of the nodes is further configured to transform the continuous data variables into discrete variables.

15. The system of any one of claims 1 to 14, wherein at least one of the nodes is configured to process new clinical data using the federated model to generate a score.

16. The system of any one of claims 1 to 15, wherein the contributory intermediate statistics are computed by each respective node based on the sensitive data accessible to the respective nodes.

17. A method for federated scoring comprising: providing a plurality of nodes, each node comprising: at least one processor and a memory, the memory of each node comprising sensitive data and program code the sensitive data comprises a plurality of records; providing a communication network enabling communication between the plurality of nodes, executing the program code by the respective processor of each of the plurality of nodes to: generate a first set of scoring model coefficients based on sensitive data accessible to the respective nodes; broadcast the first set of scoring model coefficients to rest of the nodes; executing the program code by at least one of the nodes to: receive contributory scoring intermediate statistics from the rest of the nodes; and generate a federated scoring model based on the received contributory scoring intermediate statistics and its respective first set of scoring model coefficients.

18. One or more non-transitory computer-readable storage media storing instructions that when executed by one or more processors cause the one or more processors to perform the method of claim 17.

Description:
Federation of Scoring Systems

Technical Field

[0001] This disclosure generally relates to methods and systems for the federation of scoring systems or scoring models.

Background

[0002] This background description is provided for the purpose of generally presenting the context of the disclosure. Contents of this background section are neither expressly nor impliedly admitted as prior art against the present disclosure.

[0003] Cross-institutional partnerships in research, including healthcare research, have been increasingly popular. With the computerization of healthcare processes, a large volume of healthcare data is being generated by interactions of individuals with healthcare systems. The volume of data being generated reflective of healthcare interactions or outcomes will continue to increase as healthcare service providers continue to further automate or computerize greater parts of their services. The increasing amount of data provides a great opportunity to develop and implement computational constructs such as Machine Learning (ML) models in healthcare systems. The increasing volume of data regarding healthcare interactions presents opportunities to develop and train ML models that are more robust and accurate. As more data is available to ML models, the ML models may be trained to more accurately model diverse real-world situations.

[0004] However, data held by individual institutions are subject to constraints including privacy constraints that limit the sharing or transmission of data to other entities or institutions. The inability to share data presents a significant hurdle in the development of ML models by an individual institution or data holder and it is desirable to provide frameworks, systems and methods that address the problem or at least provide an alternative to existing solutions.

Summary

[0005] In one embodiment, the present disclosure provides a system for federated scoring, the system comprising: a plurality of nodes, a communication network enabling communication between the plurality of nodes, each node comprising at least one processor and a memory, the memory of each node comprising sensitive data and program code, the sensitive data comprises a plurality of records; wherein the program code is executable by the respective processor of each of the plurality of nodes to: generate a first set of scoring model coefficients based on sensitive data accessible to the respective nodes; broadcast the first set of scoring model coefficients to rest of the nodes; and wherein at least one of the nodes is configured to: receive contributory intermediate statistics from the rest of the nodes; and generate a federated scoring model based on the received contributory intermediate statistics and its respective first set of scoring model coefficients.

[0006] In another embodiment, the present disclosure provides a method for federated scoring comprising: providing a plurality of nodes, each node comprising: at least one processor and a memory, the memory of each node comprising sensitive data and program code the sensitive data comprises a plurality of records; providing a communication network enabling communication between the plurality of nodes, executing the program code by the respective processor of each of the plurality of nodes to: generate a first set of scoring model coefficients based on sensitive data accessible to the respective nodes; broadcast the first set of scoring model coefficients to rest of the nodes; executing the program code by at least one of the nodes to: receive contributory intermediate statistics from the rest of the nodes; and generate a federated scoring model based on the received contributory intermediate statistics and its respective first set of scoring model coefficients.

[0007] One or more non-transitory computer-readable storage media are also disclosed, the storage medium or media storing instructions that when executed by one or more processors cause the one or more processors to perform the method described above.

Brief Description of the Drawings

[0008] Some embodiments of systems and methods for privacy-preserving federated execution of scoring models are described, by way of non-limiting example only, with reference to the accompanying drawings in which:

[0009] Figure 1 illustrates a schematic block diagram of a system for federated scoring;

[0010] Figure 2 illustrates data partitioning for an experiment using the system for federated scoring;

[0011] Figure 3 illustrates the performance of various scoring models at various nodes in an experiment;

[0012] Figure 4 illustrates the combined performance of various scoring models including pooled and federated scoring models;

[0013] Figure 5 illustrates parsimony plots generated during performance evaluation of scoring models in an experiment; and

[0014] Figure 6 is a block diagram of a system for federated scoring.

Detailed Description

[0015] Federated learning, also known as distributed learning or distributed algorithms, can address the problems associated with limitations on data sharing by collectively training algorithms without exchanging data. In the context of healthcare data, embodiments of federated scoring systems and methods disclosed herein can safeguard patient privacy by distributing the model training to the data owners and aggregating the results across the various data owners/sites. Federated learning breaks down data silos and allows for faster development of much-needed scoring systems/models for the analysis of healthcare data and decision-making based on healthcare data. In addition, federation systems disclosed herein enable the federation of interpretable models that are preferred in clinical research.

[0016] Figures 1 and 6 show system architectures for federated scoring. Firstly, with reference to Figure 6, a system 600 is used for federated scoring. The system 600 includes a plurality of nodes 630 that communicate between each other over a communication network 620. As shown for nodes 630(1) and 630(N) (indicating that there are N nodes in this system or network 600), each node comprises at least one processor 632 and a memory 634. The memory 634 comprises sensitive data 638 and program code 636. The sensitive data 638 can include a plurality of records such as patient data records.

[0017] The program code 636 is executable by the respective processor 632 of each node 630 to generate a first set of scoring model coefficients based on sensitive data accessible to the respective nodes. In general, there will be two scenarios: (i) a node 630 has access only to the sensitive data 638 stored in its own memory 634; or (ii) a node from a group of affiliated nodes (e.g. a group of clinics operated by a particular company) has access to the sensitive data stored on all the affiliated nodes. Regarding coefficients, the example model used involves logistic regression, which is $\operatorname{logit}(\Pr(Y = 1 \mid X)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$. Here $p = \Pr(Y = 1 \mid X)$ is the probability of response $Y = 1$ ($Y$ can only be either 0 or 1), and $X_1, X_2, \ldots$ are the predictors, such as gender, age etc. Since it is possible to have more than one predictor, the visualization can be multi-dimensional instead of being 2D ($\operatorname{logit}(\Pr(Y = 1 \mid X))$ would be the y axis, but there can be more than one x axis). So, the coefficients are the betas.
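As a small illustration of how the betas act as the model coefficients, the following sketch evaluates the logistic model for one record. The function name and the numeric values are hypothetical and are not taken from the disclosure.

```python
import math

def predict_probability(betas, predictors):
    """Return Pr(Y = 1 | X) for a logistic model.

    betas[0] is the intercept (beta_0); betas[1:] pair up with the predictors.
    """
    linear = betas[0] + sum(b * x for b, x in zip(betas[1:], predictors))
    # Inverse of the logit function.
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients: intercept -1.0, one predictor with beta 2.0.
probability = predict_probability([-1.0, 2.0], [0.5])
```

With these illustrative values the linear term is zero, so the predicted probability is one half.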

[0018] The instructions stored in the memory 634 also cause the processor(s) 632 to broadcast the first set of scoring model coefficients to rest of the nodes 630.

[0019] Each node 630 is also configured to generate a node specific scoring model based on the first set of scoring model coefficients.

[0020] At least one of the nodes 630 is configured to receive contributory intermediate model statistics (also referred to as contributory intermediate statistics and others) from the rest of the nodes 630, as discussed below. That node or nodes then generate a federated scoring model based on these intermediate statistics and its respective data. Intermediate statistics are, for example, the components of Equation (5) for any particular node.

[0021] Each of the plurality of nodes 630 is configured to transmit its node specific scoring model and scoring model performance data to the central server 610. Since at least one of the nodes 630 generates a federated scoring model, the central server 610 is also configured to receive the federated model from the node or nodes 630 that create such a federated scoring model. To ensure a consistent model is implemented across some or all of the nodes 630, the central server 610 is configured to transmit the federated model to at least a subset of the plurality of nodes 630. Where more than one federated scoring model is received at the central server 610, the central server 610 may select a particular federated scoring model to be deployed on all of the nodes 630. The model selection may be conducted via a parsimony plot as shown in Figure 5. In some embodiments, this step involves user selection and in other embodiments selection is automatic based on a general criterion that adding new variables to the model would not increase model performance (measured as an increase in AUC values) by more than a predetermined threshold. That predetermined threshold may be set by a user. Variables may also be added based on domain knowledge. This knowledge can either be accessed using natural language processing over publications in the relevant domain, to identify the variables most frequently mentioned, or may be based on a user's domain knowledge (i.e. user added variables). These additional variables may not necessarily completely align with the parsimony plots. For example, suppose that in the parsimony plot the model performance is already 0.85 using the first five variables (which is high enough to stop adding new variables); a further variable may still be included based on domain knowledge or domain publications.
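The automatic selection criterion described above can be sketched as follows. This is a minimal illustration under assumptions: the function name, the `min_gain` parameter and the AUC values are hypothetical, and the stopping rule shown (stop at the first variable whose AUC gain does not exceed the threshold) is only one reasonable reading of the criterion.

```python
def select_variables(ranked_variables, auc_by_size, min_gain=0.01, domain_variables=()):
    """Pick variables from a parsimony curve.

    auc_by_size[i] is the AUC of the model built from the top i + 1 ranked variables.
    Variables are added in rank order until the next one improves AUC by no more
    than min_gain; domain-knowledge variables are then appended regardless of rank.
    """
    keep = len(auc_by_size)
    for i in range(1, len(auc_by_size)):
        if auc_by_size[i] - auc_by_size[i - 1] <= min_gain:
            keep = i
            break
    selected = list(ranked_variables[:keep])
    selected += [v for v in domain_variables if v not in selected]
    return selected

# Hypothetical parsimony curve that flattens at 0.85 after five variables.
ranked = ["v1", "v2", "v3", "v4", "v5", "v6"]
aucs = [0.62, 0.71, 0.78, 0.82, 0.85, 0.853]
chosen = select_variables(ranked, aucs, min_gain=0.01, domain_variables=("v9",))
```

Here the sixth variable adds only 0.003 to the AUC, so the first five are kept, and the hypothetical domain-knowledge variable `v9` is appended afterwards.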

[0022] Some embodiments federate scoring systems across various nodes of computer systems. Scoring systems can be classification models that comprise the definition of a series of computations on input data, the computations being executed to make a prediction based on the input data. The scoring models can be implemented using various types of model, such as linear classification models, logistic regression models, and clinical decision support models. The series of computations includes computations such as addition, subtraction, multiplication etc. The models are used to assess the risk of numerous serious medical conditions since they provide efficient and interpretable predictions.

[0023] Table 1 shows an example of a scoring system.

Table 1: A scoring system for sleep apnoea screening

[0024] A doctor can easily determine whether a patient screens positive for obstructive sleep apnoea by adding points for the patient’s age, whether they have diabetes, body mass index, and sex. If the score is above a threshold, the patient would be recommended to a clinic for a diagnostic test.
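The point-adding procedure can be illustrated with a short sketch. The point values, variable names and threshold below are hypothetical placeholders chosen for illustration only; they are not the entries of Table 1.

```python
# Hypothetical point values; these are NOT the values from Table 1.
POINTS = {
    "age_60_or_older": 4,
    "diabetes": 2,
    "bmi_30_or_higher": 2,
    "male": 2,
}
THRESHOLD = 5  # hypothetical referral threshold

def screen(patient):
    """Return (score, screens_positive) for a dict of boolean risk factors."""
    score = sum(points for name, points in POINTS.items() if patient.get(name))
    return score, score > THRESHOLD

result = screen({"age_60_or_older": True, "diabetes": True})
```

The score is a plain sum of integers, which is what makes such a system easy to apply and to interpret at the bedside.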

[0025] Traditional scoring systems have largely been developed on single-source data. Consequently, training or sample data sets are often small or not representative - e.g. data taken from an affluent community will likely have a lower rate of adverse outcomes than data from a poor community. Although it is possible to develop scoring systems on pooled data, the pooling process is time-consuming and difficult to achieve due to privacy reasons. The framework shown in Figure 6 is for building scoring systems in a federated manner to address such difficulties.

[0026] The disclosed systems and method (also referred to as FedScore) provide an approach for building federated scoring systems executable across multiple computer system nodes provided at various locations. The embodiments improve robustness and remove biases from medical research, particularly research in contexts with relatively small sample sizes. Figures 1 and 6 show the overall architecture of a system according to the embodiments.

FedScore Framework

[0027] When implementing clinical models and other models where data privacy is to be maintained (e.g. financial records), users usually consider the degree of parsimony as a key characteristic of the model. A model is considered parsimonious when it is sparse (i.e., it uses the fewest variables possible) and has good prediction accuracy. Figure 6 illustrates a block diagram of a FedScore system/framework 600. The FedScore system of some embodiments may comprise a central server 610 and a plurality of nodes 630. Alternatively, a specific node of the plurality of nodes of the FedScore system may perform the functions of both the central server and a node, the system/framework 600 thereby not requiring a designated central server. As discussed above, each node 630 comprises at least one processor 632 and a memory 634. Memory 634 comprises program code 636 and sensitive data 638. The various nodes are in communication with each other over a communication network 620. The sensitive data is accessible only to the respective nodes.

[0028] To incorporate the privacy requirement while achieving good parsimony and interpretability, the FedScore framework consists of five modules: (1) federated variable ranking module; (2) federated variable transformation module; (3) federated score derivation module/scoring module; (4) model selection module and (5) model evaluation module. Some or all of these modules may be provided in a particular node, distributed across nodes or on the central server depending on the architecture of the specific embodiment. In the embodiment shown in Figure 6, these modules are stored in memory 614.

(1) Federated variable ranking module

[0029] To construct a global model across several sites or nodes, some embodiments may pre-identify a set of unified variables as candidate variables for ranking to be performed independently across the various nodes 630. For example, suppose sites A and B both use 0%, 25%, 50%, 75% and 100% to cut their variables. Due to data heterogeneity, their cut-offs for the variable age may be different: A: (,24], (24, 49], (49, 62], (62,) years; B: (,24], (24, 52], (52, 67], (67,) years. This suggests that site B has a relatively older population. In this case, federation cannot be conducted, as these categorical variables have different meanings and are not unified. Instead, when federating, the two sites specify cuts that are sensible for both sites, e.g. (,24], (24, 50], (50, 60], (60,) years, with federation then being able to be conducted. This unification may result from collecting data from all contributing sites (e.g. the ages of all patients) and segregating based on statistical parameters such as the percentiles mentioned above. In some embodiments, random forests may be utilized to perform variable ranking. In the FedScore framework, variable ranking is first performed at each local site/node 630. Each node 630 then transmits the variable rankings to a central server 610 that generates a global variable ranking. The global variable ranking may be generated by ordering variables by their averaged ranks at each site.
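The rank-averaging step performed by the central server can be sketched as below. The function name, the tie-breaking rule (alphabetical) and the three example sites are assumptions made for illustration.

```python
def global_variable_rank(site_rankings):
    """Order variables by their mean rank across sites.

    site_rankings: one list per site, each ordered from most to least important
    and containing the same pre-identified set of unified variables.
    """
    variables = site_rankings[0]
    mean_rank = {}
    for variable in variables:
        ranks = [ranking.index(variable) + 1 for ranking in site_rankings]
        mean_rank[variable] = sum(ranks) / len(ranks)
    # Ties on mean rank are broken alphabetically (an arbitrary choice).
    return sorted(variables, key=lambda v: (mean_rank[v], v))

# Hypothetical local rankings from three sites.
site_a = ["age", "bmi", "sex"]
site_b = ["bmi", "age", "sex"]
site_c = ["age", "sex", "bmi"]
global_rank = global_variable_rank([site_a, site_b, site_c])
```

Only the ordered variable names leave each site, so no record-level data is needed to build the global ranking.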

[0030] The random forests of some embodiments may comprise a collection of randomized classification and regression trees. One importance measurement of a given variable in random forests is the increase in mean of a tree's error when the observed values of a particular variable are randomly permuted in out-of-bag samples. More specifically, the importance of a variable may be quantified based on the mean square error for regression and misclassification rate for classification. In classification tasks, the Gini index is defined for each node $\theta$ of a decision tree $\Theta$ as:

$$\mathrm{Gini}(\theta) = \sum_{r=1}^{R} p_r \times (1 - p_r) \qquad (1)$$

where $p_r$ is the fraction of training samples in the $r$-th class at the node, and $R = 2$ for a binary classification task. The importance of a variable $X_m$ is the weighted total impurity decrease $w(\theta)\,\Delta\mathrm{Gini}(\theta)$ for all nodes. When averaged over all trees this metric is calculated as:

$$\mathrm{Imp}(X_m) = \frac{1}{N_{\Theta}} \sum_{\Theta} \sum_{\theta \in \Theta :\, v(\theta) = X_m} w(\theta)\,\Delta\mathrm{Gini}(\theta) \qquad (2)$$

where $N_{\Theta}$ is the number of trees, $w(\theta)$ is the proportion $N_{\theta}/N$ of samples reaching node $\theta$, $\Delta\mathrm{Gini}(\theta)$ is the impurity decrease after the split of node $\theta$, and $v(\theta)$ is the variable used in the split. In the FedScore framework, variable ranking is performed first at each local site/node, and then a global variable ranking is created by rearranging variables by their averaged ranks at each site.
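The Gini index and the weighted impurity decrease at a single split can be computed as in the sketch below. The function names are hypothetical, and the impurity decrease is taken to be the parent impurity minus the size-weighted impurity of the two children, which is the usual convention for classification trees and is assumed here rather than stated in the text.

```python
def gini(class_counts):
    """Gini index of a node given the number of samples in each class."""
    total = sum(class_counts)
    return sum((c / total) * (1 - c / total) for c in class_counts)

def weighted_impurity_decrease(parent, left, right, n_total):
    """Weight w = n_parent / n_total times the impurity decrease of the split."""
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    decrease = (
        gini(parent)
        - (n_left / n_parent) * gini(left)
        - (n_right / n_parent) * gini(right)
    )
    return (n_parent / n_total) * decrease

# A perfectly separating split of a balanced binary node that holds all 10 samples.
contribution = weighted_impurity_decrease([5, 5], [5, 0], [0, 5], n_total=10)
```

Summing such contributions over every node that splits on a given variable, and averaging over trees, gives the importance value used for ranking.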

In some embodiments, the variable ranking at each node may be determined based on a model parsimony statistic such as a model parsimony plot which demonstrates the relevance of each variable or a combination of variables to the performance of the scoring model. In some embodiments, the area under curve statistic of various models may be used to evaluate the variable ranking. The federated scoring model of some embodiments may take into account variables in the global variable rank that are above a predefined threshold such as an importance threshold. For example, suppose a random forest is used for importance measurement. After scaling importance values to between 0 and 1, the following variable importance values may result: var1: 0.8, var2: 0.5, var3: 0.3, var4: 0.2, var5: 0.15, var6: 0.08... The threshold can be empirically set to be 0.1, resulting in selection of only the first five variables. By doing so, the federated scoring model reduces the number of variables incorporated in the model and serves as a model that is more interpretable while retaining its accuracy in clinical environments, i.e. the model becomes sparse without significant loss in accuracy.
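The thresholding example above can be reproduced with a few lines. The function name is hypothetical; the importance values and the 0.1 threshold are the ones given in the example.

```python
def select_by_importance(importance, threshold=0.1):
    """Return variables whose scaled importance exceeds the threshold, most important first."""
    ordered = sorted(importance.items(), key=lambda item: -item[1])
    return [name for name, value in ordered if value > threshold]

importance = {
    "var1": 0.8, "var2": 0.5, "var3": 0.3,
    "var4": 0.2, "var5": 0.15, "var6": 0.08,
}
selected = select_by_importance(importance, threshold=0.1)
```

With these values `var6` falls below the threshold, so only the first five variables are retained.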

(2) Federated variable transformation module

[0031] The FedScore framework also transforms continuous variables into categorical/discrete variables after the global variable ranking is determined. For example, the age of a person may be banded, e.g. 0 to 20 years old, 20 to 30 years old, 30 to 45 years old, 45 to 60 years old, and 60 years old and older. The maximum number of categories for such a transformation may be pre-determined (for example, 5), and if this maximum is surpassed for a particular variable, categories may be combined so that the maximum requirement is met. A global cutoff/discrete bucket for each continuous variable is calculated by averaging the cutoff values acquired at each site for each $k$, where the $k$ values are the percentiles used to cut a continuous variable into several categorical variables. For example, suppose there is only one $k$ value of 50% for the variable age, and the 50% cutoff for age is 50 years old; the age variable would then be transformed from a continuous variable to two categorical variables: age (<=50) (true or false) and age (>50) (true or false). After defining the discrete buckets/cutoffs for each variable, the defined cutoffs are transmitted to the plurality of nodes to enable the nodes to process the continuous variable data in a unified and standardized discrete manner. In some embodiments, quantiles of continuous variables were set to be 0%, $k_1$%, $k_2$%, $k_3$%, $k_4$% and 100%, where the values of $k_1$, $k_2$, $k_3$, $k_4$ were set to 5, 20, 80 and 95. By providing standardized discrete buckets for the continuous variables, the federated variable transformation improves the accuracy of the federation, as the various nodes observe a common set of discrete buckets when providing input to their respective scoring models.
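The transformation can be sketched in three steps: each site computes its local cutoffs at the agreed percentiles, the cutoffs are averaged into global cutoffs, and every node bins its values with the shared cutoffs. The function names and the linear-interpolation percentile are assumptions; the site cutoffs in the usage lines reuse the site A and site B age cut points from the earlier example.

```python
from bisect import bisect_left

def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

def local_cutoffs(values, ks=(5, 20, 80, 95)):
    """Cutoffs computed at one site; only these summary values leave the site."""
    return [percentile(values, k) for k in ks]

def global_cutoffs(per_site_cutoffs):
    """Average the cutoffs position by position across sites."""
    n_sites = len(per_site_cutoffs)
    return [sum(column) / n_sites for column in zip(*per_site_cutoffs)]

def to_category(value, cutoffs):
    """Bucket index: 0 for value <= cutoffs[0], up to len(cutoffs) above the last cutoff."""
    return bisect_left(cutoffs, value)

# Age cut points of sites A and B at the 25%, 50% and 75% percentiles.
shared = global_cutoffs([[24, 49, 62], [24, 52, 67]])
bucket = to_category(50, shared)
```

Because the intervals are closed on the right, a value equal to a cutoff falls into the lower bucket, matching the (24, 50] style of interval used in the text.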

(3) Federated score derivation module

[0032] The score derivation process could be flexibly adjusted for different clinical modelling purposes by incorporating a suitable ML model depending on the clinical need and the context. For instance, a logistic regression model may only support binary outcomes. By switching the logistic regression in Module 3 to other suitable models, the framework can be expanded to support survival outcomes, ordinal outcomes etc. (i.e. non-binary outcomes). A step of the score derivation process includes the generation of a first set of scoring model coefficients based on sensitive data. The scoring model coefficients include the various parameters of the scoring model. For example, in embodiments where the scoring model is a linear model, the coefficients are the linear parameters and intercepts, etc. Each node generates its own first set of scoring model coefficients because each node has access to different sensitive data. In general, the data will be non-overlapping. The first set of scoring model coefficients is used by each node to define a node-specific scoring model. The node-specific scoring models serve as candidate models for comparison with the federated scoring model.

[0033] The first set of scoring model coefficients may be broadcast to the rest of the nodes and each node may generate a set of contributory/intermediate scoring model coefficients based on the received first set of scoring model coefficients and the sensitive data accessible to each respective node. The contributory scoring model coefficients may subsequently be transmitted/broadcast to the rest of the nodes. One or more nodes may generate a federated scoring model based on its first set of scoring model coefficients and the contributory scoring model coefficients received from the rest of the nodes. An example of federated scoring model generation is described with reference to an ODAL2 algorithm.

[0034] As another example, logistic regression is a common choice for modelling binary outcomes. Federated logistic regression may be implemented by some embodiments calling for multiple iterations of logistic regression or logistic regression over one iteration (one-shot approach).

[0035] FedScore is a privacy-preserving framework to provide unified and robust scoring systems across multiple sites without the need for sharing sensitive data, such as sensitive medical data or other personal information. FedScore was tested using models for clinical scoring for 30-day mortality prediction utilizing emergency department (ED) data from Singapore General Hospital (SGH) and a simulation of 10 nodes/sites that did not exchange sensitive patient data during the experiment. FedScore’s robustness and generalizability were established by achieving a high average area under the curve (AUC) on the testing data of each site with the smallest variance when compared to baseline scores.

[0036] Experiments were performed using ODAL2, a one-shot privacy-preserving distributed algorithm as disclosed in R. Duan et al., “Learning from electronic health records across multiple sites: A communication-efficient and privacy-preserving distributed algorithm", J. Am. Med. Inform. Assoc., vol. 27, no. 3, pp. 376-385, Dec. 2019, doi: 10.1093/jamia/ocz199, to execute federated logistic regression. Embodiments utilized information from the local site/node together with the first-order (ODAL1) (first set of scoring model coefficients) and second-order (ODAL2) gradients (contributory scoring model coefficients) of the likelihood function from remote sites to construct an approximation of the global likelihood function that forms a part of the federated scoring model. Data from the remote sites/nodes was not accessible in the execution of the logistic regression computations. Coefficients generated at each node during the logistic regression were transmitted to the central server.

[0037] The coefficients in a global logistic regression model are generated by optimizing the likelihood function and then are rounded to obtain scores based on each variable. A scoring table was defined by the central server and the overall score is calculated by adding all the points together. The ceiling number for total score and normalization of score breakdowns could both be adjusted to fit the needs of an intended clinical application.
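One way of turning regression coefficients into an integer scoring table is sketched below. The normalization used here (shift each variable so its lowest category scores zero, then scale so the maximum attainable total equals a chosen ceiling before rounding) is an assumption for illustration, as are the function names and coefficient values; the text only states that coefficients are rounded and that the ceiling and normalization are adjustable.

```python
def coefficients_to_points(coefficients, max_total=100):
    """Convert {variable: {category: beta}} into an integer scoring table.

    Assumes at least one variable has categories with different betas.
    """
    # Shift so the lowest category of every variable contributes zero points.
    shifted = {
        var: {cat: beta - min(cats.values()) for cat, beta in cats.items()}
        for var, cats in coefficients.items()
    }
    # Scale so the highest possible total equals the chosen ceiling.
    largest_sum = sum(max(cats.values()) for cats in shifted.values())
    scale = max_total / largest_sum
    return {
        var: {cat: round(value * scale) for cat, value in cats.items()}
        for var, cats in shifted.items()
    }

def total_score(points, patient):
    """Add up the points of the category the patient falls into for each variable."""
    return sum(points[var][patient[var]] for var in points)

# Hypothetical coefficients for two binned variables.
betas = {
    "age": {"<=50": 0.0, ">50": 1.5},
    "diabetes": {"no": 0.0, "yes": 0.5},
}
table = coefficients_to_points(betas, max_total=100)
score = total_score(table, {"age": ">50", "diabetes": "yes"})
```

Changing `max_total` rescales the whole table, which corresponds to adjusting the ceiling number for the total score.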

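[0038] The coefficients of a global logistic regression model may be obtained by optimizing a global likelihood function. Let $x_1, x_2, \ldots, x_{p-1}$ denote the $p - 1$ predictors and let $y$ denote a binary outcome. The logistic regression model can be expressed as

$$\operatorname{logit}\{\Pr(y = 1 \mid x)\} = x^{T}\beta,$$

where $x = (1, x_1, x_2, \ldots, x_{p-1})^{T}$, $\beta$ is the vector of intercept and coefficients, and $\operatorname{logit}(t) = \log\{t/(1 - t)\}$. Suppose a total of $N = \sum_{j=1}^{K} n_j$ independently and identically distributed (i.i.d.) observations are distributed at $K$ sites/nodes, with $(x_{ij}, y_{ij})$ denoting the $i$-th observation at site $j$. Then the likelihood function (LLR) of the global logistic regression obtained by pooling data from all sites is

$$L(\beta) = \frac{1}{N}\sum_{j=1}^{K}\sum_{i=1}^{n_j}\left[\, y_{ij}\, x_{ij}^{T}\beta - \log\{1 + \exp(x_{ij}^{T}\beta)\} \right].$$

The pooled estimator can be obtained by optimizing $L(\beta)$.

[0039] However, when data cannot be shared, computation of the pooled likelihood function is not possible. As envisaged by the embodiments, approximation of the likelihood function is performed as described herein. As an example, the ODAL2 algorithm applies the idea of Taylor expansion, proposing to use the first and second order gradients of the LLR (log-likelihood ratio) to perform the approximation:

$$\tilde{L}(\beta) = L_1(\beta) + \left\{\frac{1}{N}\sum_{j=1}^{K} n_j \nabla L_j(\bar{\beta}) - \nabla L_1(\bar{\beta})\right\}^{T}\beta + \frac{1}{2}(\beta - \bar{\beta})^{T}\left\{\frac{1}{N}\sum_{j=1}^{K} n_j \nabla^{2} L_j(\bar{\beta}) - \nabla^{2} L_1(\bar{\beta})\right\}(\beta - \bar{\beta}).$$

Here $\bar{\beta}$ is an initial value, $L_j(\beta) = \frac{1}{n_j}\sum_{i=1}^{n_j}\left[\, y_{ij}\, x_{ij}^{T}\beta - \log\{1 + \exp(x_{ij}^{T}\beta)\} \right]$ is the LLR of the $j$-th site ($j = 1$ is assumed to be the local site), $\nabla L_j$ is the first gradient of $L_j$, and $\nabla^{2} L_j$ is the second gradient of the LLR of site $j$.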
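The surrogate-likelihood idea can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used in the embodiments: it assumes each site's LLR is the average logistic log-likelihood, that remote sites share only their sample size, gradient and second gradient evaluated at the initial value, and that Newton's method is an acceptable optimizer. The function names, the simulated data and the Newton settings are all hypothetical.

```python
import numpy as np


def gradient(beta, X, y):
    """First gradient of the average logistic log-likelihood of one site."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (y - p) / len(y)


def hessian(beta, X, y):
    """Second gradient (Hessian) of the average logistic log-likelihood."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return -(X.T * (p * (1.0 - p))) @ X / len(y)


def newton_maximize(grad_fn, hess_fn, beta0, iters=50, tol=1e-10):
    """Newton iterations for a concave objective."""
    beta = beta0.copy()
    for _ in range(iters):
        step = np.linalg.solve(hess_fn(beta), grad_fn(beta))
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def odal2_estimate(sites, beta_bar):
    """Maximize the second-order surrogate; sites[0] is the local site."""
    n_total = sum(len(y) for _, y in sites)
    # Aggregated statistics: the only quantities a remote site would share.
    g_all = sum(len(y) * gradient(beta_bar, X, y) for X, y in sites) / n_total
    h_all = sum(len(y) * hessian(beta_bar, X, y) for X, y in sites) / n_total
    X1, y1 = sites[0]
    g_diff = g_all - gradient(beta_bar, X1, y1)
    h_diff = h_all - hessian(beta_bar, X1, y1)
    # Gradient and Hessian of: L_1(b) + g_diff.b + 0.5 (b - bb)' h_diff (b - bb)
    grad_fn = lambda b: gradient(b, X1, y1) + g_diff + h_diff @ (b - beta_bar)
    hess_fn = lambda b: hessian(b, X1, y1) + h_diff
    return newton_maximize(grad_fn, hess_fn, beta_bar)


# Simulated data for three sites (hypothetical values, illustration only).
rng = np.random.default_rng(0)
true_beta = np.array([-1.0, 0.8, -0.5])


def simulate_site(n):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    prob = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
    return X, (rng.random(n) < prob).astype(float)


sites = [simulate_site(n) for n in (400, 600, 1000)]
X1, y1 = sites[0]
# Initial value: local logistic regression at the local site.
beta_bar = newton_maximize(lambda b: gradient(b, X1, y1),
                           lambda b: hessian(b, X1, y1), np.zeros(3))
beta_tilde = odal2_estimate(sites, beta_bar)
```

A useful sanity check on the construction: if every site held the same data as the local site, the two correction terms would vanish and the surrogate estimate would coincide with the local estimate.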

[0040] When executing the ODAL2 algorithm, the initial value $\bar{\beta}$ (first set of scoring model coefficients) is first obtained from the result of a local logistic regression performed at each local site/node, e.g. based on data such as that represented in Table 1. Then $\bar{\beta}$ is broadcast to the rest of the remote sites/nodes of the system. A federated scoring model is generated based on the received contributory scoring model coefficients and a node's first set of scoring model coefficients. After receiving the broadcast first set of scoring model coefficients, a site/node may compute the gradients of its LLR evaluated at $\bar{\beta}$, e.g. $\nabla L_j(\bar{\beta})$, to build a surrogate likelihood which forms part of a federated scoring model. The federated scoring model may be generated by one or more of the nodes or the central server depending on the configuration of the FedScore framework. In some embodiments, the federated scoring model may also take into account the global variable rank generated by the (1) federated variable ranking module. Variables that are lowly ranked may be discarded by the federated scoring model in the interest of model parsimony and interpretability.

[0041] The global estimator of $\beta$ is obtained by optimizing the surrogate likelihood function. In some embodiments, two JSON files, 'site_beta.json' and 'intermediate.json', as illustrated in Figure 1, are used to store values that need to be broadcast in the process. The information in these two shared files is aggregated and does not contain any patient-level information, which guarantees data privacy. Part of the process for generating the federated scoring model is illustrated in Figure 1.

[0042] In some embodiments, the coefficients in the federated logistic regression model (federated scoring model) are rounded to get relevance scores for each variable. A scoring table is created, and an overall score is calculated by adding all the points together. A ceiling number for the total score and normalization of score breakdowns could both be adjusted to fit the needs of an intended clinical application.
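As an illustration of the file-based exchange described in paragraph [0041], the following sketch writes and reads the two shared files. The specification names the files but does not disclose their schema, so every key used here ("site", "beta", "n", "gradient", "hessian"), the helper function names and the numeric values are assumptions made for the example. The point it shows is that only aggregated quantities, and no patient-level rows, are serialized.

```python
import json
import os
import tempfile


def write_site_beta(path, site_id, beta_bar):
    """Store the local coefficients that a site broadcasts (hypothetical schema)."""
    with open(path, "w") as f:
        json.dump({"site": site_id, "beta": [float(b) for b in beta_bar]}, f)


def append_intermediate(path, site_id, n, gradient, hessian):
    """Append one site's aggregated statistics (hypothetical schema)."""
    records = []
    if os.path.exists(path):
        with open(path) as f:
            records = json.load(f)
    records.append({
        "site": site_id,
        "n": int(n),
        "gradient": [float(g) for g in gradient],
        "hessian": [[float(h) for h in row] for row in hessian],
    })
    with open(path, "w") as f:
        json.dump(records, f, indent=2)


with tempfile.TemporaryDirectory() as workdir:
    beta_path = os.path.join(workdir, "site_beta.json")
    inter_path = os.path.join(workdir, "intermediate.json")
    # The local site publishes its initial coefficients.
    write_site_beta(beta_path, 1, [-1.2, 0.7])
    # Remote sites respond with aggregated gradients only.
    append_intermediate(inter_path, 2, 500, [0.01, -0.02],
                        [[-0.2, 0.0], [0.0, -0.1]])
    append_intermediate(inter_path, 3, 800, [0.00, 0.03],
                        [[-0.3, 0.0], [0.0, -0.2]])
    with open(beta_path) as f:
        broadcast = json.load(f)
    with open(inter_path) as f:
        shared = json.load(f)
```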

(4) Model Selection Module

[0043] The scoring models trained at each node serve as candidates that could be adopted by any or all of the rest of the nodes for obtaining the most accurate results going forward. Model evaluation and selection could be performed using parsimony plots generated from the mean AUC (area under the curve) over all sites/nodes. Parsimony plots represent the performance of a scoring model as a function of the number of variables incorporated in the model. Let $i$ denote the site/node index, where $i \in \{1, 2, \ldots, K\}$. A general model selection criterion could be defined by maximizing $\Psi = \sum_{i=1}^{K} w_i\, \mu_i(p_1, p_2, p_3, \ldots, p_m)$, where $w_i$ is the weight for site $i$, $\mu_i$ is the scoring model's performance on the $i$-th validation set, and $m$ is a pre-defined number of total variables to include, which may be uniform across all sites. In some embodiments, weights may be defined as $w_i = 1/K$, indicating equal weights for all sites. Yet $w_i > 1/K$ may be flexibly assigned for a site $i$ if the performance of scores on this site is considered more important than others.

[0044] Different constraints can be added for the maximization task as well. For example, the total number of variables $m$ may not exceed an integer number $N$. The set of variables $\{p_1, p_2, \ldots, p_m\}$ may also be constrained to satisfy a predefined standard required by the system. For instance, the system of some embodiments may be configured to include in the federated scoring model a set of variables $\{x_1, x_2, \ldots, x_q\}$, where $q < m$. Moreover, $\Psi$ may be maximized using a number $d$ of variables that is smaller than $m$, as long as increasing the number of variables from $d$ to $m$ has an acceptable impact on the change in $\Psi$: $\Psi_m - \Psi_d < \varepsilon$, where $\Psi_d$ denotes the criterion evaluated with $d$ variables and the size of $\varepsilon$ may be decided intuitively by users.
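The weighted selection criterion and the tolerance rule can be sketched as follows. The sketch assumes that the validation performance (e.g. AUC) of the model using the top 1, 2, ..., m ranked variables has already been measured at every site, which is the information a parsimony plot conveys. The function name, the matrix layout and the numbers are hypothetical.

```python
import numpy as np


def select_num_variables(perf, weights=None, eps=0.01):
    """Pick the smallest d whose weighted criterion is within eps of the full model.

    perf[i, k] is the validation performance at site i of the model that
    uses the top k + 1 ranked variables (so perf has shape K x m).
    """
    perf = np.asarray(perf, dtype=float)
    num_sites, m = perf.shape
    if weights is None:
        w = np.full(num_sites, 1.0 / num_sites)  # equal weights 1/K
    else:
        w = np.asarray(weights, dtype=float)
    psi = w @ perf  # psi[k]: weighted criterion with k + 1 variables
    for d in range(1, m + 1):
        if psi[m - 1] - psi[d - 1] < eps:
            return d, psi
    return m, psi


# Hypothetical validation AUCs for 3 sites and up to 5 variables.
perf = [
    [0.70, 0.76, 0.80, 0.81, 0.81],
    [0.68, 0.75, 0.79, 0.80, 0.81],
    [0.72, 0.77, 0.81, 0.82, 0.82],
]
d_strict, psi = select_num_variables(perf, eps=0.01)
d_loose, _ = select_num_variables(perf, eps=0.02)
```

With these illustrative numbers, a tolerance of 0.01 keeps four variables while a tolerance of 0.02 keeps three, which shows how the choice of tolerance trades accuracy against parsimony. Passing unequal `weights` would emphasize sites whose performance is considered more important.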

[0045] A final selection of variables may be confirmed based on the selected federated model from among the plurality of scoring models at the respective nodes. A new scoring model may be refitted to any new data using step (2). The performance of the selected federated scoring model is validated on each testing data set of each site participating in the federated learning process. The selected federated scoring model may be transmitted to each node to allow the nodes to process new clinical data using the federated scoring model.

[0046] Notably, to maintain parsimony without loss of accuracy, variables that are interrelated may be identified, e.g. variables that have substantially the same influence on patient outcomes and vary substantially in proportion with each other (e.g. within 5% of each other). For instance, weight and height may vary substantially in proportion with each other (e.g. within 5%) for a particular sex. Of the interrelated variables, the system may select a best variable for the federated model (e.g. the variable, of the interrelated variables, that correlates most closely with outcomes such as 30-day mortality) based on each node either capturing that variable in its sensitive data, or capturing a different variable that is interrelated with the best variable. Each node that does not capture the best variable may then substitute the relevant interrelated variable in the local model.

[0047] The resulting federated scoring model is interpretable. Unless context dictates otherwise, being interpretable means that the correlation between sensitive data and outcomes is clear and explicable from the model - e.g. age correlates well with sleep apnoea for people over 60, per Table 1. This is as opposed to machine learning models that use hidden layers to identify features in the data and thus for which correlations between data and outcomes may not be readily apparent.

(5) Model Evaluation Module

[0048] The performance of the federated scoring model and/or the node-specific scoring models is validated using the testing data sets of each site engaged in the FedScore framework. Model evaluation may be performed by a designated node or a central server depending on the configuration of the FedScore framework. Following the notation defined in step (4), the overall average performance of a federated scoring model may be defined as $M_1 = \frac{1}{K}\sum_{i=1}^{K}\mu_i(p_1, p_2, p_3, \ldots, p_m)$, where $\mu_i$ is the scoring model's performance on the $i$-th testing set; and $M_2 = \sqrt{\sum_{i=1}^{K}(\mu_i - M_1)^2 / K}$ serves as a measurement of performance variation across sites. A higher $M_1$ value and a lower $M_2$ value indicate a score's better performance and generalizability.
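The two evaluation measures can be computed directly from per-site test performance. The sketch below assumes that the first measure is the plain average over sites and the second is the population standard deviation (dividing by the number of sites K), and it uses made-up AUC values rather than the figures reported in Table 2.

```python
import math


def summarize_performance(per_site_perf):
    """Return (M1, M2): mean performance and its variation across sites."""
    k = len(per_site_perf)
    m1 = sum(per_site_perf) / k
    m2 = math.sqrt(sum((mu - m1) ** 2 for mu in per_site_perf) / k)
    return m1, m2


# Hypothetical test AUCs of one scoring model on three sites.
m1, m2 = summarize_performance([0.80, 0.82, 0.84])
```

A model with a higher first value and a lower second value would be preferred, since it performs well on average and consistently across sites.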

Clinical Study Design

[0049] A retrospective analysis was conducted using the emergency health record data of Singapore General Hospital (SGH) extracted from the SingHealth Electronic Health Intelligence System. The initial study cohort of a total of 86527 admissions was identified by selecting ED admissions at SGH between 2016 and 2017. After excluding patients under the age of 18 and those with missing values, a total of 80613 admissions remained, which were then randomly divided into 10 simulated sites, in the proportion of 4%, 5%, 7%, 9%, 10%, 11%, 12%, 13%, 14%, and 15% respectively. Data partitioning for an experiment using the system for federated execution of an ML model is shown in Figure 2.
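A random partition into simulated sites with the stated proportions could be produced as in the sketch below. The specification does not describe the exact partitioning procedure, so the shuffling approach, the seed, and the choice to let the last site absorb the rounding remainder are assumptions; only the cohort size and the proportions are taken from the text.

```python
import numpy as np

# Site proportions stated in the study design (they sum to 100%).
PROPORTIONS = [0.04, 0.05, 0.07, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15]


def partition_indices(n_rows, proportions, seed=0):
    """Randomly split row indices 0..n_rows-1 into len(proportions) sites."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    # Cumulative cut points; the last site takes whatever rows remain.
    cuts = np.floor(np.cumsum(proportions)[:-1] * n_rows).astype(int)
    return np.split(shuffled, cuts)


site_indices = partition_indices(80613, PROPORTIONS)
site_sizes = [len(idx) for idx in site_indices]
```

Each admission is assigned to exactly one simulated site, so the site sizes add up to the cohort size.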

[0050] The outcome of the study was 30-day mortality, which was defined as deaths that occurred within 30 days after ED admission. The candidate predictors include a total of 29 variables: demographics information, PACS triage categories, shift time, day of the week (Friday, Monday, Weekend, Midweek), vital signs, comorbidities, and previous health care usage, etc.

Results

[0051] Analysis was performed over three groups: (1) ten local scores trained independently on each site; (2) one federated score trained using all sites without data sharing; and (3) one pooled score generated using pooled data, which is the ideal case and usually impossible in real-world applications. The models were selected based on corresponding parsimony plots, with a predefined criterion that the maximum number of variables in a model should not exceed 10 and that adding more variables should cease when there is no significant improvement in AUC. The variables selected for each model are included in the footnotes of Table 2. Figure 5 illustrates a series of parsimony plots obtained for the various scoring models. Plots (a) - (j) relate to local models generated on site 1 to site 10. Plot (k) relates to the federated scoring model generated on any one of the sites/nodes. Plot (l) relates to a scoring model generated using pooled data that would otherwise not be feasible in a real-world environment.

[0052] A total of twelve scoring models were obtained and tested, including ten local models generated independently on each site, one federated model developed by FedScore, and one pooled model generated based on pooled data from all the sites that would not be otherwise possible in a real-world scenario. The AUC values and confidence intervals (CI) of each model on different sites' testing data are presented in Table 2. The AUC values and 95% CIs for each score are plotted individually in Figure 3 to better illustrate the results of Table 2 in terms of model performance variability across all sites. The mean and standard deviation (SD) of the AUC values for each model over all 10 testing sets were also computed and presented in Table 2. Figure 4 depicts the information accordingly. As shown in both Figure 3 and Figure 4, the federated scoring model outperformed all local scores in terms of stability and generalizability by achieving the lowest SD. The federated score also exhibited a satisfactory average AUC for each site, indicating that the existing FedScore framework can generate global clinical scores that are trustworthy. The bottom row of Table 2 displays the averaged AUC values of all local models on each site, which are predominantly exceeded by the AUC values of the federated score. This suggests that for a single site, FedScore has the potential to yield a better scoring system than those developed locally, particularly when the sample size of the local site is inadequate.

Table 2: Comparison of performance of FedScore model with baseline models

[0053] In Table 2, the following abbreviations apply: AUC, area under the curve; CI, confidence interval; SD, standard deviation; SBP, systolic blood pressure; DBP, diastolic blood pressure; SpO2, oxygen saturation as measured by pulse oximetry; PACS, Patient Acuity Category Scale; ED, emergency department. Moreover:

a Local model obtained via AutoScore on Site 1; variables selected in the model (in the order of ranking): pulse, age, SBP, DBP, SpO2, respiration, day of week, ED admissions in the past year;

b Local model obtained via AutoScore on Site 2; variables selected in the model (in the order of ranking): SBP, DBP, pulse, age, SpO2, respiration, ED admissions in the past year;

c Local model obtained via AutoScore on Site 3; variables selected in the model (in the order of ranking): pulse, age, SBP, DBP, SpO2, respiration, ED admissions in the past year;

d Local model obtained via AutoScore on Site 4; variables selected in the model (in the order of ranking): pulse, SBP, age, DBP, SpO2, respiration, ED admissions in the past year, PACS triage categories;

e Local model obtained via AutoScore on Site 5; variables selected in the model (in the order of ranking): SBP, pulse, DBP, age, SpO2, respiration, day of week, ED admissions in the past year, shift time, PACS triage categories;

f Local model obtained via AutoScore on Site 6; variables selected in the model (in the order of ranking): SBP, pulse, age, DBP, SpO2, respiration, ED admissions in the past year;

g Local model obtained via AutoScore on Site 7; variables selected in the model (in the order of ranking): SBP, pulse, DBP, age, SpO2, respiration, ED admissions in the past year;

h Local model obtained via AutoScore on Site 8; variables selected in the model (in the order of ranking): SBP, pulse, DBP, age, SpO2, respiration, day of week, ED admissions in the past year;

i Local model obtained via AutoScore on Site 9; variables selected in the model (in the order of ranking): pulse, SBP, age, DBP, SpO2, respiration, day of week, ED admissions in the past year, shift time, PACS triage categories;

j Local model obtained via AutoScore on Site 10; variables selected in the model (in the order of ranking): SBP, pulse, age, DBP, SpO2, respiration, ED admissions in the past year, day of week;

k Federated model obtained via FedScore; variables selected in the model (in the order of ranking): SBP, pulse, age, DBP, SpO2, respiration, ED admissions in the past year, day of week, shift time, PACS triage categories;

l Pooled model obtained via AutoScore; variables selected in the model (in the order of ranking): SBP, pulse, age, DBP, SpO2, respiration, day of week, ED admissions in the past year.

[0054] The framework provided by the embodiments is scalable and flexible, given that each scoring model can be modified or replaced to adapt to different clinical research questions. For example, the score derivation module may be adjusted to generate ordinal outcomes. FedScore fills a gap in existing medical machine learning applications, which lack established methods for generalizing unified scores across multiple sites. It also addresses the absence of reproducible benchmark methods, especially for more interpretable models.

[0055] The embodiments generate scoring models that are more interpretable due to the preference for using fewer variables and a simpler model structure when compared with black box techniques such as deep learning based models. Conventional deep learning techniques also require a larger volume of data to generate models that are more accurate. In contrast, the disclosed FedScore system allows multiple nodes/sites to work towards a federated scoring model without the need for sharing confidential data. Since each site may have access to a limited volume of data, if each site pursues a deep learning based model, the outcomes at each site may not be optimal because of the limited volume of data. The FedScore system addresses this challenge by providing a more interpretable scoring model and generating a federated scoring model that addresses the constraints associated with black box machine learning models such as deep learning models. In addition, the scoring models being trained on data from multiple sites are more generalizable because of the diversity in the origin of data that such models as a whole are trained on. The federated scoring models generated by the embodiments are therefore more generalizable to new settings, such as data from a new clinical setting or data from a different population. The disclosed FedScore framework could serve as the foundation of a data science software platform, which deals with large-scale, multi-centre data analysis and risk scoring development.

[0056] The reference in this specification to any prior publication (or information derived from it), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that that prior publication (or information derived from it) or known matter forms part of the common general knowledge in the field of endeavour to which this specification relates.

[0057] Throughout this specification and the claims which follow, unless the context requires otherwise, the word "comprise", and variations such as "comprises" and "comprising", will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.

[0058] The scope of this disclosure encompasses all changes, substitutions, variations, alterations, and modifications to the example embodiments described or illustrated herein that a person having ordinary skill in the art would comprehend. The scope of this disclosure is not limited to the example embodiments described or illustrated herein. Moreover, although this disclosure describes and illustrates respective embodiments herein as including particular components, elements, features, functions, operations, or steps, any of these embodiments may include any combination or permutation of any of the components, elements, features, functions, operations, or steps described or illustrated anywhere herein that a person having ordinary skill in the art would comprehend. Although this disclosure describes or illustrates particular embodiments as providing particular advantages, particular embodiments may provide none, some, or all of these advantages.