| WO/2004/023232 | QUANTUM CIRCUIT AND QUANTUM COMPUTER |
| WO/2001/006337 | NEURAL NETWORKS FOR INGRESS MONITORING |
| JP06259156 | LEARNING SYSTEM FOR SELF-ORGANIZATION PATTERN |
VAN DER LAAN, Mark (278 Glorietta Blvd, Orinda, CA, 94563, US)
| Claims
We claim:
1. A method comprising: determining a data set, the data set associated with a set of variables related to an input variable, the input variable having a relationship of interest to an outcome variable; determining a plurality of candidate target adjustment sets of variables from the set of variables, a target adjustment set being variables that are determined to be adjustable; analyzing bias caused by different candidate target adjustment sets of variables on a predicted outcome for an estimation from the input variable to the output variable based on the data set; determining a target adjustment set based on the analyzed bias from the different candidate target adjustment sets; determining an estimator for the input variable using the target adjustment set of variables; and outputting the estimator for the input variable using the target adjustment set.
2. The method of claim 1, wherein analyzing comprises: ranking variables in the set of variables according to a correlation to the input variable and the outcome variable; and selecting different candidate target adjustment sets of variables based on the ranking.
3. The method of claim 2, wherein the different candidate target adjustment sets of variables are determined by truncating a number of variables at different points in the ranking.
4. The method of claim 1, further comprising using a mean square error calculation with respect to the target adjustment set to analyze the bias or variance of the estimator to determine an effective adjustment set, the effective set removing variables in the target set based on bias or variance caused in the estimator.
5. The method of claim 1, wherein determining the target adjustment set comprises: determining a measurement of bias for the different candidate target adjustment sets of variables; and selecting one of the candidate target adjustment sets of variables based on the measurement.
6. The method of claim 5, wherein the one of the candidate target adjustment sets of variables is selected based on being close to a full set of variables and still having an acceptable amount of bias or mean square error determined by a threshold.
7. The method of claim 1, further comprising outputting information relating to an amount of bias that is determined by using the target set of variables.
8. The method of claim 1, wherein determining the estimator comprises: adjusting for the target set of variables in an estimation using the estimator, the adjusting accounting for a full range of observations where variables not included in the target set are not used in the estimation.
9. The method of claim 1, wherein the estimator estimates a parameter identified by the target adjustment set of variables.
10. A method comprising: determining a family of pathwise differentiable parameters; determining estimators relying on assumptions not affecting the information bound of the pathwise differentiable parameters based on a data set; generating measures of sparse-data bias for the estimators; selecting a target parameter from the family of pathwise differentiable parameters based on the measures of sparse-data bias; and outputting the selected target parameter.
11. The method of claim 10, further comprising outputting an estimator for the selected target parameter, the estimator configured to estimate the target parameter.
12. The method of claim 10, further comprising outputting a confidence level indicating a measure of reliability for the estimator.
13. The method of claim 12, wherein different pathwise differentiable parameters comprise an effect of an input variable on an outcome adjusting for different target sets of confounders.
14. The method of claim 10, wherein selecting comprises using a criterion to select the target parameter with a desired level of sparse data bias.
15. The method of claim 10, further comprising determining an effective parameter by minimizing mean square error of an estimate of the target parameter.
16. The method of claim 10, wherein each choice in the family of parameters may be identified by a truncation level and/or a set of variables.
17. The method of claim 10, wherein sparse data bias of the estimators is measured based on influence-curve-based measures of the sparse data bias.
18. The method of claim 10, wherein the estimators are targeted maximum likelihood estimators.
19. An apparatus comprising: one or more processors; and logic encoded in one or more tangible media for execution by the one or more processors and when executed operable to: determine a data set, the data set associated with a set of variables related to an input variable, the input variable including a relationship of interest to an outcome variable; determine a plurality of target adjustment sets of variables from the set of variables, a target adjustment set being variables that are determined to be adjustable; analyze bias caused by different target adjustment sets of variables on a predicted outcome for an estimation from the input variable to the output variable based on a known data set; determine a target set based on the analyzed bias; determine, for a second data set, an estimator for the input variable using the target adjustment set of variables; and output the estimator for the input variable using the target adjustment set of variables.
20. An apparatus comprising: one or more processors; and logic encoded in one or more tangible media for execution by the one or more processors and when executed operable to: determine a family of pathwise differentiable parameters, wherein different pathwise differentiable parameters comprise an effect of an input variable on an outcome adjusting for different target sets of confounders; determine estimators relying on assumptions not affecting the information bound of the pathwise differentiable parameters based on a data set; generate influence-curve-based measures of sparse-data bias for the estimators; select a target parameter from the family of pathwise differentiable parameters based on the generated influence-curve-based measures of sparse-data bias; and output the selected target parameter.
|
PATENT COOPERATION TREATY
PCT
INTERNATIONAL SEARCH REPORT
(PCT Article 18 and Rules 43 and 44)
Applicant's or agent's file reference: 010030-00221 OPC
FOR FURTHER ACTION: see Form PCT/ISA/220 as well as, where applicable, item 5 below
International application No.: PCT/US 09/34585
International filing date (day/month/year): 19 February 2009 (19.02.2009)
(Earliest) Priority Date (day/month/year): 22 February 2008 (22.02.2008)
Applicant: THE REGENTS OF THE UNIVERSITY OF CALIFORNIA
This international search report has been prepared by this International Searching Authority and is transmitted to the applicant according to Article 18. A copy is being transmitted to the International Bureau.
This international search report consists of a total of [illegible] sheets.
[ ] It is also accompanied by a copy of each prior art document cited in this report.
1. Basis of the report
a. With regard to the language, the international search was carried out on the basis of:
[X] the international application in the language in which it was filed
[ ] a translation of the international application into ______, which is the language of a translation furnished for the purposes of international search (Rules 12.3(a) and 23.1(b))
b. [ ] This international search report has been established taking into account the rectification of an obvious mistake authorized by or notified to this Authority under Rule 91 (Rule 43.6bis(a)).
c. [ ] With regard to any nucleotide and/or amino acid sequence disclosed in the international application, see Box No. I.
2. [ ] Certain claims were found unsearchable (see Box No. II).
3. [ ] Unity of invention is lacking (see Box No. III).
4. With regard to the title,
[ ] the text is approved as submitted by the applicant
[X] the text has been established by this Authority to read as follows:
PREDICTING THE THERAPEUTIC OUTCOME OF A MEDICAL TREATMENT USING STATISTICAL INFERENCE MODELLING
5. With regard to the abstract,
[X] the text is approved as submitted by the applicant
[ ] the text has been established, according to Rule 38.2(b), by this Authority as it appears in Box No. IV. The applicant may, within one month from the date of mailing of this international search report, submit comments to this Authority.
6. With regard to the drawings,
a. the figure of the drawings to be published with the abstract is Figure No. [illegible]
[X] as suggested by the applicant
[ ] as selected by this Authority, because the applicant failed to suggest a figure
[ ] as selected by this Authority, because this figure better characterizes the invention
[ ] none of the figures is to be published with the abstract
Form PCT/ISA/210 (first sheet) (April 2007)
PREDICTING THE THERAPEUTIC OUTCOME OF A MEDICAL TREATMENT USING STATISTICAL INFERENCE MODELLING
Cross References to Related Applications
This application claims priority from U.S. Provisional Patent Application Serial No. 61/030919, entitled DATA-ADAPTIVE SELECTION OF THE ADJUSTMENT SET, filed on February 22, 2008, which is hereby incorporated by reference as if set forth in full in this application for all purposes.
Background
[01] Particular embodiments generally relate to data adaptive selection of adjustment sets for assessing the impact of a variable on an outcome.
[02] Statistical applications are concerned with estimating the impact of a treatment variable on an outcome of interest based on a data set that measures treatment variables, an outcome, and other variables on many units, such as patients. Some variables measured on each subject may be considered confounders, which may confound the relationship between the treatment and the outcome of interest. Estimators of the impact of a treatment variable on an outcome involve adjustment by a set of confounders, and to estimate a causal effect of the treatment variable the estimator needs to adjust for all confounders. The treatment may not be randomized, so an association between the treatment and the outcome does not imply the existence of a causal effect of the treatment on the outcome. In one example, an observed relationship may be that a high dose of a medicine is associated with cancer. The reason for this may be that high doses were given with high probability to people who are very ill but with low probability to people who are not as sick. It could also happen that the high dose was only given to people who were very ill, so that a full range of observations for the high dose is not available (e.g., observations of what happens when high doses are given to people who are not as sick are missing). Adjustment by such a confounder often results in biased and highly variable estimators of a causal effect and unreliable statistical inference as measured by confidence intervals, p-values, and standard error estimates. Therefore, some confounders should not be adjusted for. In general, estimators of parameters that are hardly identifiable from the collected data are typically very biased (referred to as sparse-data bias) and, due to the large bias, have relatively small variance. In these cases, even a nonparametric bootstrap distribution (i.e., the sampling distribution of the estimator based on resampling from the data set) of such estimators fails to reveal the level of sparse-data bias. Thus, current statistical practice would assume these estimators are approximately unbiased. In these cases, statistical inference based only on an assessment of the variance of the estimator will result in false (positive) conclusions.
Summary
[03] In one embodiment, a method is provided for selection of a (general) target parameter among a family of (general) parameters based on a criterion assessing the degree of lack of identifiability, and for subsequent selection of an effective parameter for the purpose of estimating this selected target parameter based on minimizing an estimated mean squared error. Therefore, particular embodiments data adaptively select target parameters for which reasonable estimators and corresponding statistical inference, as measured by (e.g.) confidence intervals and p-values, exist, so that the method may acknowledge (and output to the user) that the a priori wished target parameter might be unachievable given the data at hand. Subsequently, given the selection of a target parameter that is assessed to be reasonably well identifiable from the data, it is of interest to select data adaptively among the family of estimators indexed by the family of parameters for the purpose of estimating this target parameter.
[04] In one embodiment, a method is provided for determining a target adjustment set of variables that can be reliably adjusted for when assessing the effect of a treatment variable on an outcome. The method comprises determining a data set, the data set associated with a set of variables related to a treatment variable. A target adjustment set of variables is determined from the set of variables, the target adjustment set being variables that are determined to be adjustable based on the data set. For example, based on an identifiability criterion (e.g., sparse data bias) for the effect of treatment on an outcome, controlling for a subset of variables, the target adjustment set may be determined from the set of all variables. The target set may be chosen to be the largest set that still results in an acceptable degree of sparse data bias. The target parameter identified by the target set of variables results in more reliable statistical inference, such as confidence intervals and p-values that assess the signal-to-noise ratio. Given a collection of candidate subsets of the target set, and corresponding estimators, an effective adjustment set of variables may then be determined from the target adjustment set. For example, minimizing (over all subsets of the target set) an estimate of the mean squared error of the subset-specific estimators with respect to the target parameter defined by the target adjustment set may be used to determine the effective adjustment set. The effective adjustment set may be used to determine an estimator to estimate the effect of the treatment variable on the outcome adjusting for the target adjustment set of confounders. The estimator (of the effect of treatment on outcome adjusted for the target set of variables) that is determined may result in less bias and variance (with respect to the target parameter), and more reliable statistical inference, such as confidence intervals and p-values that assess the signal-to-noise ratio.
A data-adaptive method for selection of a truncation level for the adjustment set used in an estimator, such as an inverse probability of treatment or censoring weighted estimator, is also developed as a by-product.
[05] A further understanding of the nature and the advantages of particular embodiments disclosed herein may be realized by reference to the remaining portions of the specification and the attached drawings.
Brief Description of the Drawings
[06] Fig. 1 depicts a system for selecting a target set of variables and subsequently a corresponding estimate of the effect of input variable on an output variable adjusting for the set of target variables.
[07] Fig. 2 depicts a more detailed example of a target set determiner according to one embodiment.
[08] Fig. 3 depicts a simplified flowchart of a method for determining a target set of variables according to one embodiment.
[09] Fig. 4 depicts a simplified flowchart of a method for determining an estimator according to one embodiment.
[10] Fig. 5 depicts a simplified flowchart for selecting the targeted parameter of the data generating distribution among a family of candidate parameters based on sparse-data bias according to one embodiment.
Detailed Description of Embodiments
[11] Fig. 1 depicts a system 100 for adjusting by a set of variables to provide an estimate of an effect of an input variable on an outcome according to one embodiment. A computing device 102 is configured to receive a set of variables, determine a target set of variables, and provide the estimate of the treatment effect on the outcome adjusting for the target set of variables. The effective set of variables is determined data adaptively from a group of candidate target sets included in the selected target set of variables, based on a data set. By subsequently selecting the effective set of variables from the target set of variables data adaptively, an estimator using the effective set may provide an optimal estimator of the treatment effect adjusting for the target set of variables.
[12] The input variable may be a variable whose effect on an outcome variable is to be understood. In one embodiment, the input variable is a treatment variable, which is a variable related to the treatment of a patient (such as the treatment of a medical disease).
[13] A set of variables may be covariates. A covariate may be of direct interest or it may be a confounding or interacting variable on the relationship of the input variable to the outcome variable. In one embodiment, the set of variables are considered confounders that have a potential effect on the treatment variable and have an effect on the outcome. A confounder correlates (positively or negatively) with the input variable. Because of the correlation, there is a need to control for these factors to avoid bias in the estimator of a causal effect of treatment on outcome. Conventionally, the complete (i.e., wished) set of confounders was adjusted for to estimate the effect of treatment. However, adjusting for all of the variables the user wishes to adjust for may not yield a reliable estimate and statistical inference. For example, if a confounder cannot or should not be adjusted for, bias may result in the estimate of the parameter of interest and possibly also in the estimate of the standard error, either one resulting in false claims based on biased confidence intervals or p-values. A p-value is the probability of obtaining a result at least as extreme as the one that was actually observed, given that the null hypothesis is true.
[14] Particular embodiments truncate the set of variables that are used in the estimator. To calculate the truncated set, a set of variables is determined. The set of variables is related to the treatment variable and the outcome variable. A target set of variables is determined from the data, where these variables are ones determined to be adjustable based on the data set. For example, some variables should not be adjusted for and are not included in the target set. The target set of variables defines the target parameter of interest that needs to be estimated. An effective set of variables is then determined from the target set, where the effective set includes variables that will be adjusted for in the estimation of the target parameter. The effective adjustment set is determined data adaptively. For example, an estimate of a mean squared error with respect to the target parameter is obtained based on the data set, and is used to determine which set of variables should be used in the estimator.
[15] Each target parameter is a feature of the distribution of the data. Once the target parameter is selected, the goal is to estimate that target feature of the distribution of the data. In many examples, the target parameter is identified by a target set of variables. For example, an effect of treatment on outcome controlling for the target set of variables may be the target parameter. To estimate the target parameter identified by the target set of variables, the following strategy may be used. For each set of variables contained in the target set of variables, an estimator of the corresponding parameter is constructed. Among all these candidate estimators, the one that is considered optimal for the target parameter is then selected using a data-based criterion, and the corresponding choice of set of variables is referred to as the effective set of variables. The effective set can be smaller than the target set since, for example, the estimator defined by the target set of variables might be too variable, while smaller sets of variables exist that have much less variance at the cost of a little bias with respect to the target parameter. The goal of this estimate is not the parameter identified by the effective set of variables but the target parameter identified by the target set of variables. The target set may be determined by a threshold that is set. For example, a user may set a threshold of an acceptable level of bias, variance, or mean square error. The effective set may be determined by minimizing an empirical criterion assessing the performance of subset-specific estimators as an estimator of the target parameter identified by the target set of variables.
[16] Referring to Fig. 1, the process of first selecting the target set will now be described. The selection of the target set defines the target parameter, which provides characteristics of the relationship of interest, such as the effect of a treatment on an outcome controlling for the target set of variables. Different target sets identify different target parameters. For example, a first target parameter may be the effect of a treatment adjusted for a first set of target variables. When the target set is changed, a second target parameter may result, such as the effect of a treatment adjusted for a second set of target variables. A target set determiner 104 receives a full set of variables and determines the target set of variables that should be adjusted for in the estimation. The target set may be determined based on which target parameter is determined to be reasonably well identifiable and is close to the full (i.e., wished) set of variables. Target set determiner 104 is configured to data adaptively reduce the full set of variables to the target set of variables. The target set of variables may include fewer variables than the full set of variables.
[17] An estimator determiner 106 is configured to determine an estimator of the effect of the input variable on an outcome, adjusting for the target set of variables (i.e., the target parameter). A data set may be used to determine the effect of the input variable on the outcome adjusting for the target set of variables. In one example, effective set determiner 108 may reduce the target set to an effective set. Effective set determiner 108 may remove variables that it determines may cause too much bias or variance in the estimate with respect to the target parameter defined by the target set of variables.
[18] The estimator determined may be displayed to a user. The selected target adjustment set is also displayed to the user, and information related to how much bias or mean square error the estimator has relative to the target parameter defined by the target set is output.
[19] A particular embodiment of the determination of the target set of variables will now be described in more detail. Target set determiner 104 uses a criterion that determines whether the set of variables can be adjusted with a certain amount of bias. For example, a user can be told how biased the estimation may be based on the set of variables that were adjusted. In one example, the criterion may be used to diagnose the bias using an inverse probability of treatment weighted (IPTW) estimator or a truncated IPTW estimator. These estimators will be described in more detail below.
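As a rough illustration of the two estimators named above, the following is a minimal sketch of an IPTW estimate and its truncated variant. It assumes a binary treatment, a single discrete confounder, and a treatment probability estimated by the treated fraction within each stratum. The function names `estimate_propensity` and `iptw_effect` and the truncation parameter `delta` are hypothetical and do not come from the specification.

```python
from collections import defaultdict


def estimate_propensity(w, a):
    """Estimate P(A=1 | W) by the treated fraction within each stratum of W.

    Assumption for this sketch: W is a single discrete confounder.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_treated, n_total]
    for wi, ai in zip(w, a):
        counts[wi][0] += ai
        counts[wi][1] += 1
    return [counts[wi][0] / counts[wi][1] for wi in w]


def iptw_effect(a, y, g, delta=0.0):
    """IPTW estimate of E[Y(1)] - E[Y(0)].

    Each treatment probability in g is truncated to [delta, 1 - delta]
    before weighting; delta=0.0 gives the untruncated estimator.
    """
    total = 0.0
    for ai, yi, gi in zip(a, y, g):
        gi = min(max(gi, delta), 1.0 - delta)
        if ai == 1:
            total += yi / gi          # treated units weighted by 1 / g
        else:
            total -= yi / (1.0 - gi)  # untreated units weighted by 1 / (1 - g)
    return total / len(y)
```

Truncation bounds the weights, which typically lowers variance but changes the quantity being estimated; that trade-off is what a data-adaptive choice of truncation level has to balance.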
[20] Fig. 2 depicts a more detailed example of target set determiner 104 according to one embodiment. The target set of variables or full set may be input into a variable ranker 202, which can rank the variables based on a criterion. For example, different target sets of variables may be determined. The bias caused by different target sets of variables may be estimated and used to rank the sets. For example, a data set is used to estimate the bias for each target set. The data set may be one in which the data for the input variables and outcome variables are known or have known probability distribution. The largest target set within the wished full set of variables that still has an acceptable amount of bias may be selected as the target set to use in the estimator.
[21] Different ways of selecting the target sets may be used. To determine the different candidate target sets, the variables in the full set may be ranked based on how strongly they are correlated to the input variable. Other methods of ranking the full set of variables may also be appreciated. The correlation may be determined based on the relationship of the variables to the input variable. The relationship is used because if the relationship is very strong, it may be expected that the variables should be removed. For example, if a variable is too strongly correlated to the input variable, then it may not be adjusted for.
[22] Once the variables have been ranked, a subset determiner 204 determines a target set of variables. The process of determining the target set may be iterative. For example, different candidate target sets of variables are evaluated to determine which target set provides an optimal set of variables for the estimation. In one example, subset determiner 204 uses different algorithms to select the variables to include in the target set. The algorithms may try to include the largest number of variables in the full set within an acceptable amount of bias. For example, the whole set may be taken and evaluated. Then, a smaller set is used and evaluated, and so on. Once all the candidate sets have been evaluated, a target set may be selected.
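The ranking and truncation steps described above can be sketched as follows. This is an illustrative sketch only: it ranks by absolute Pearson correlation with the input variable, orders variables from weakest to strongest correlation, and builds nested candidate sets by dropping the most strongly correlated variable first. The function names and the choice of Pearson correlation are assumptions, not details taken from the specification.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_by_treatment_correlation(covariates, a):
    """Order covariate names from weakest to strongest |correlation| with the input variable a.

    covariates: dict mapping a variable name to its list of observed values.
    """
    return sorted(covariates, key=lambda name: abs(pearson(covariates[name], a)))


def nested_candidate_sets(ranking):
    """Full set first, then progressively smaller sets obtained by cutting the ranking.

    Because the ranking ends with the variables most strongly correlated
    with the input variable, those are the first to be dropped.
    """
    return [ranking[:k] for k in range(len(ranking), -1, -1)]
```

Each candidate set produced this way would then be passed to the bias evaluation, starting with the whole set and moving to smaller ones.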
[23] When subset determiner 204 determines a target set of variables, an estimator 206 is used to estimate the effect of interest as identified by this target set. The effect of interest may be the effect of the input variable on the output variable controlling for the target set of variables. The target set is used to form an estimator of the target parameter identified by the target set. Different estimators using different candidate target sets are formed. A criterion is then used to evaluate how biased the estimators are for the effect of interest (i.e., the target parameter), where the target set of variables is adjusted for in the estimation. For example, bias may be calculated using the different target sets of variables. The bias may be determined based on a known data set that includes data on the input variable, outcome, and other variables, and this known data set is chosen so that the effect of interest for that data set is known. The data set is input into estimator 206 and an estimate is determined using the target set of variables. The target adjustment set selection is based on an empirical criterion that measures performance of the estimator in estimating the effect of interest identified by the target set. The estimates based on the known data sets (obtained by simulating from an estimated data generating distribution), in which the corresponding true effects of interest are known, are then compared to the known effects of interest to determine the bias caused by the target set of variables.
[24] The bias may also be estimated based on influence curves of the different estimators for the different candidate target sets, and used to select the target set of variables, as described in detail below. The different estimators for the different candidate target sets may be targeted maximum likelihood estimators. The influence curve of a targeted maximum likelihood estimator evaluated at the targeted maximum likelihood estimator is sensitive to sparse data bias, so that estimates of bias and variance of this estimated influence curve provide excellent measures for sparse data bias.
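The influence-curve-based measures for targeted maximum likelihood estimators are not reproduced in this passage. As a stand-in, the sketch below uses the well-known influence curve of the simpler IPTW estimator with a given treatment mechanism, and derives a standard error from its sample variance. It is meant only to show how an estimated influence curve reacts to sparse data: treatment probabilities near 0 or 1 produce very large influence-curve values. The function names are hypothetical.

```python
import math


def iptw_influence_curve(a, y, g, psi):
    """Per-observation influence curve of the IPTW estimator of E[Y(1)] - E[Y(0)].

    a: binary treatments, y: outcomes, g: P(A=1 | W) for each observation,
    psi: the IPTW point estimate. Swapped-in technique: this is the IPTW
    influence curve, not the targeted maximum likelihood one.
    """
    ic = []
    for ai, yi, gi in zip(a, y, g):
        weighted = yi / gi if ai == 1 else -yi / (1.0 - gi)
        ic.append(weighted - psi)
    return ic


def ic_standard_error(ic):
    """Standard error of the estimator: sqrt(sample variance of the influence curve / n)."""
    n = len(ic)
    mean = sum(ic) / n
    var = sum((v - mean) ** 2 for v in ic) / (n - 1)
    return math.sqrt(var / n)
```

A few observations with extreme weights dominate both the mean and the variance of the influence-curve values, which is why summaries of the estimated influence curve can serve as a diagnostic for sparse data.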
[25] The process returns to subset determiner 204, where different candidate target sets of variables are determined and evaluated for bias. When the bias for the target sets has been determined, evaluator 208 determines the optimal target set of variables per a criterion. Based on the estimated bias, the different target sets of variables may be ranked. The estimated bias may be based on the deviation of an estimate from a true answer based on known data sets. Since these known data sets are chosen so that the true answers are known, the estimates are compared to the known answers. The average difference between the estimates and the known answers may be the bias for the target set.
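The averaging step can be sketched as a small simulation loop. The sketch assumes the caller supplies a `simulate` function that draws one data set from a distribution whose true effect is known, and an `estimator` function that maps a data set to an estimate. Both names, and the function `simulation_bias` itself, are hypothetical.

```python
import random


def simulation_bias(simulate, estimator, true_value, n_sims=200, seed=0):
    """Average difference between estimates and the known true value.

    simulate(rng) must return one simulated data set drawn from a
    distribution for which true_value is the correct answer.
    """
    rng = random.Random(seed)
    estimates = [estimator(simulate(rng)) for _ in range(n_sims)]
    return sum(estimates) / n_sims - true_value
```

In the setting described above, `simulate` would draw from the estimated data generating distribution and `estimator` would adjust for one candidate target set, so that calling the function once per candidate set yields one bias estimate per set.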
[26] Evaluator 208 may use a criterion to determine which of the target sets may be optimal. For example, the target set that is largest and still has an acceptable amount of bias may be selected. However, another target set may be selected. Many other choices will be appreciated. Overall, the choice of target parameter may be driven by properties such as reliable statistical inference, bias, variance, mean squared error, and being close to the wished parameter that the user wants to learn (but that might be impossible to learn).
[27] Fig. 3 depicts a simplified flowchart 300 of a method for determining a target set of variables according to one embodiment. The process will be described where the variables are considered confounders. However, the confounder could be any variable that is related to the relationship of interest.
[28] In step 302, target set determiner 104 determines the set of confounders. The set of confounders may be determined based on a statistical analysis being performed and may be the full set.
[29] In step 304, target set determiner 104 determines different target sets of confounders. The target sets may be determined by different algorithms that select different sets of confounders. For example, a cut-off point may be determined in a ranked list of confounders to determine the target set. Also, different nested sets may be determined for the target sets.
[30] In step 306, target set determiner 104 evaluates the different target sets against a criterion. For example, a criterion is used to determine bias for the target set using the data set that is known. The bias of each target set in estimation is used to evaluate whether the target set includes an optimal number of variables that should be adjusted in the estimation. For example, a target set that has acceptable bias or mean square error may be selected. Step 308 outputs the optimal target set as the selected target set.
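One way to read steps 306 and 308 is as a thresholding rule: keep the candidate sets whose estimated bias is acceptable and return the largest of them. The sketch below assumes a caller-supplied `bias_of` function and a user-chosen threshold `max_abs_bias`; these names are hypothetical, and other criteria (such as mean square error) could be substituted.

```python
def select_target_set(candidate_sets, bias_of, max_abs_bias):
    """Return the largest candidate set whose |estimated bias| is within the threshold.

    Returns None when no candidate set is acceptable. Ties in size are
    resolved in favor of the candidate that appears first.
    """
    acceptable = [s for s in candidate_sets if abs(bias_of(s)) <= max_abs_bias]
    if not acceptable:
        return None
    return max(acceptable, key=len)
```

Returning `None` rather than the least-biased set makes it explicit to the caller that no candidate met the threshold.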
[31] Once the truncated set is determined, an estimate may be determined using an estimator. Fig. 4 depicts a simplified flowchart 400 of a method for determining an estimator using the target set of confounders according to one embodiment.
[32] In step 402, estimator determiner 106 receives the target set of confounders that defines a target parameter.
[33] In step 404, estimator determiner 106 receives a data set and defines an estimator for many subsets of the target set. The data set may include an input variable for which an estimate is desired. For example, an effect of a treatment variable on an outcome variable, controlling for the target set of variables, may be estimated for the data set. The data set may include information for the treatment variable, outcome, and variables in the target set.
[34] In step 406, estimator determiner 106 defines a mean squared error criterion measuring performance of subset specific estimators as an estimator of target parameter. In step 408, estimator determiner 106 selects an effective set minimizing an estimated mean squared error.
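Steps 406 and 408 can be illustrated with the usual decomposition of mean squared error into squared bias plus variance. The sketch assumes that, for each candidate subset of the target set, an estimate of the bias with respect to the target parameter and an estimate of the variance are already available. How those estimates are obtained is not shown here, and the function name and input layout are hypothetical.

```python
def select_effective_set(candidates):
    """Pick the subset minimizing estimated MSE = bias^2 + variance.

    candidates: list of (subset, bias_wrt_target, variance) tuples, where
    bias is measured against the target parameter defined by the target
    set, not against the parameter defined by the subset itself.
    Returns (best_subset, estimated_mse).
    """
    def mse(candidate):
        _, bias, variance = candidate
        return bias ** 2 + variance

    best = min(candidates, key=mse)
    return best[0], mse(best)
```

The full target set has no bias relative to its own target parameter but may have a large variance, so a smaller subset can win whenever the variance it saves exceeds the squared bias it introduces.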
[35] For example, an estimator of the effect of the treatment variable on the outcome variable, controlling for the effective set of confounders, is determined. This latter estimator is optimized to estimate the effect of the treatment variable on the outcome, controlling for the target set of variables. In other words, "effective" refers to "effective for estimating the effect identified by the target set of variables".
[36] In step 410, an estimator for the target parameter may be output. In addition, the effective set of variables may be output. For example, the estimator, target set, effective set, and a bias and variance estimate, may be displayed to a user. The estimator may be generated by training it on the data set received. The bias and variance are used to indicate to a user a confidence level to indicate a measure of reliability. The level may be the amount of bias or variance that may be included if the estimator is used. This gives the user a good idea of how to evaluate the results of the estimator. For example, a user may not rely on results that include a large (sparse data) bias. This may be helpful for a researcher or other person who is basing observations or analysis on the estimator.
[37] In step 412, an estimate of the sparse data bias with respect to the target parameter defined by the target set of confounders may be outputted. The estimate may indicate to the user the risk in relying on the estimate as an estimate of the target parameter, and corresponding confidence level, such as confidence intervals for the target parameter. A confidence interval (CI) or confidence bound is an interval estimate of a population parameter. Instead of estimating the parameter by a single value, an interval likely to include the parameter is given. Thus, confidence intervals are used to indicate the reliability of an estimate. The p-value is the probability of obtaining a result at least as extreme as the one that was actually observed, given that the null hypothesis is true.
The fact that p-values are based on this assumption is crucial to their correct interpretation.
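The interval and p-value described above can be illustrated with a short sketch. It assumes the estimator is approximately normally distributed and that a standard error is available; the function name wald_summary and its parameters are hypothetical and are not part of the specification.

```python
from statistics import NormalDist


def wald_summary(estimate, std_error, null_value=0.0, level=0.95):
    """Return (lower, upper, p_value) under a normal approximation.

    The interval is estimate +/- z * std_error; the two-sided p-value is the
    probability, under the null hypothesis, of a result at least as extreme
    as the one observed.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    z = (estimate - null_value) / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return lower, upper, p_value
```

If the standard error is itself underestimated, as can happen with sparse data, both the interval and the p-value computed this way will be too optimistic.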
[38] Particular embodiments will now be described in more detail with respect to an experiment or study. However, the methods can be applied to any other estimation. Particular embodiments provide a statistical application for estimating the impact of a treatment variable on an outcome of interest. If the treatment variable has not been randomized, it is desirable to adjust such effect estimates for a set of covariates that are thought to confound the relationship of interest (i.e., a set of confounders). Such an adjustment, however, relies on the assumption of experimental treatment assignment (ETA) according to which each experimental unit has positive probability of being observed at any of the possible levels of the treatment variable, regardless of the values the confounding factors may take on. In many data analyses, this assumption is practically violated in the sense that certain values of the confounding factors cause some treatment levels to become if not impossible, at least highly unlikely. Under such a practical ETA violation, adjusted variable importance or effect estimates based on moderate amounts of data often become biased and highly variable, and the resulting statistical inference is often very unreliable.
[39] Data-adaptive selection of the adjustment set of confounders represents an automated approach for avoiding such problems. The selection is based on a criterion for deciding if a particular adjusted variable importance parameter suffers from too strong an ETA violation (i.e., a correlation between confounders and the input variable) to be reliably estimated from the data. Given a proposed adjustment set, particular embodiments use this criterion to identify a maximal subset of adjustment variables for which the ETA assumption appears reasonably well satisfied.
[40] The adjustment set defining the parameter of interest, as selected based on a given identifiability criterion, is referred to as the targeted adjustment set; the possibly smaller, data-adaptively determined adjustment set used in estimating this parameter, on the other hand, is referred to as the effective adjustment set. The effective adjustment set is thus nested in the targeted adjustment set, which in turn is nested in the full adjustment set.
[41] Even if the variable importance parameter corresponding to a particular adjustment set can be estimated reliably from the data at hand, it may be advantageous to base estimation of this parameter on an adjustment set that in fact excludes additional covariates. Particular embodiments therefore include a second step that is aimed at evaluating whether such additional exclusions can be used to obtain more efficient estimates of that parameter. This second step then results in the effective adjustment set.
[42] Many applications in modern biology measure a large number of genomic or proteomic covariates and are interested in assessing the impact of each of these covariates on a particular outcome of interest. In a study of HIV-positive patients, for example, a researcher may genotype the virus infecting each patient to ascertain the presence or absence of a large number of mutations, in the hope of identifying mutations that affect how a patient's plasma HIV RNA level (viral load) responds to a new drug regimen. Estimates of the impact of each of these mutations on viral load could then be used to inform the decision of which drugs should be included in the regimen of a patient with a particular pattern of mutations.
[43] One way to assess the impact of a particular mutation on viral load would be to compare the virologic response among patients whose virus has the mutation to that among patients whose virus does not. If it is found that patients in the first group respond much more poorly to a particular drug regimen, a clinician might be inclined not to give this regimen to a new patient entering his office who has this mutation. Patients in the first group are, however, also quite likely to differ from those in the second group in terms of the remaining mutations as well as other measured clinical covariates. The mutation of interest may, for example, be very common among patients who have previously failed several similar drug regimens, making them far more likely to also fail
the current one, but very rare among other patients. If the clinician's new patient comes from a population that differs from our original study population in that the mutation is not associated with having previously failed similar drug regimens, it might be wrong to conclude that the regimen under consideration would be a poor choice in this situation. Since the impact of the mutation of interest on viral load is confounded by the remaining mutations as well as other clinical covariates, such unadjusted estimates thus do not generalize to a new population in which the mutation of interest and the confounding factors are related to each other in a different way.
[44] Particular embodiments estimate the impact of a given mutation on viral load that is not due to associations of this mutation with any of the other measured covariates. Specifically, questions that can be considered include: What difference in virologic response would be observed if one could somehow give every patient in the study population the mutation of interest, holding the remaining covariates fixed at their current values, as opposed to the scenario in which none of the patients are given this mutation, again holding the other covariates fixed? Any observed difference could then not be due to differences between the two populations with regard to the remaining covariates and would thus be more likely to generalize to a new population in which the mutation of interest and the other covariates may be related to each other differently.
[45] While such adjusted variable importance estimates are thus often more interesting than the corresponding unadjusted estimates, they also rely on an additional assumption in order to be identifiable from the collected data. Specifically, the assumption of experimental treatment assignment (ETA) requires that the adjustment variables cannot take on a set of values such that the group of patients corresponding to those values shows no variation in the mutation of interest. This assumption would be violated if, for example, there existed a second mutation that always occurred in concordance with the mutation of interest. Since patients may never be observed that exhibited each of the two mutations in the absence of the other one, it would be
impossible to disentangle the individual effects of these two mutations, which precludes estimating the impact of either mutation on viral load while adjusting for the other.
[46] The set of adjustment variables may contain covariates that are not perfectly predictive of the mutation of interest, but still determine the presence or absence of that mutation in a nearly deterministic fashion. A second mutation may, for example, be so strongly correlated with the mutation of interest that 99% of patients with this second mutation also exhibit the mutation of interest. In such instances, a substantial amount of data would be required before the adjusted variable importance of the mutation of interest could be estimated in any reliable way. In smaller samples, it could easily occur by chance that no patients are observed that are discordant for these two mutations, again precluding obtaining an adjusted variable importance estimate. To distinguish this scenario from the one described in the previous paragraph, it is referred to as a practical rather than a theoretical violation of the ETA assumption.
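A simple empirical check for the two kinds of ETA violation described above can be sketched as follows. This is an illustration under stated assumptions: the treatment is binary (coded 0/1), the adjustment variables are discrete, and the threshold min_prob as well as the function name eta_diagnostics are hypothetical choices, not taken from the specification.

```python
from collections import defaultdict


def eta_diagnostics(rows, adjust, treatment="A", min_prob=0.05):
    """Flag covariate strata in which one treatment level is (nearly) absent.

    rows: list of dicts, one per subject.
    adjust: names of the adjustment variables defining the strata.
    Returns {stratum: observed proportion with treatment == 1} for every
    stratum whose rarer treatment level has proportion below min_prob.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n untreated, n treated]
    for row in rows:
        stratum = tuple(row[name] for name in adjust)
        counts[stratum][int(row[treatment])] += 1

    flagged = {}
    for stratum, (n0, n1) in counts.items():
        p1 = n1 / (n0 + n1)
        if min(p1, 1.0 - p1) < min_prob:
            flagged[stratum] = p1
    return flagged


# Toy data: every patient with mutation B also has the mutation of interest A.
rows = (
    [{"B": 1, "A": 1}] * 5
    + [{"B": 0, "A": 1}] * 3
    + [{"B": 0, "A": 0}] * 3
)
violations = eta_diagnostics(rows, adjust=["B"])
```

A stratum with proportion exactly 0 or 1 in the sample is consistent with either a theoretical violation or a practical one observed by chance; the sample alone cannot distinguish the two.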
[47] Under either of these two violations of the ETA assumption, the desired adjusted variable importance is not identifiable from the data at hand, making any estimates and confidence intervals of this parameter unreliable and hard to interpret. A practical ETA violation, for example, often causes such estimates to become biased, unstable and highly variable. An analysis that under such circumstances still aims to rank mutations based on adjusted variable importance estimates may lead to unsatisfying results. Suppose, for example, that a mutation with no impact on viral load is strongly correlated with a second mutation that itself has a considerable impact. The practical ETA violation caused by this correlation would likely lead to highly variable and thus statistically non-significant adjusted variable importance estimates for both mutations. In this case, more useful results could be obtained by turning to variable importance estimates that do not attempt to adjust for the other mutation. This approach would likely yield significant estimates for both mutations, allowing the user to conclude that at least one of these two mutations has an impact on viral load. While it might be acknowledged
that the individual contributions of the two mutations cannot be disentangled, such a qualified identification of two mutations would generally seem preferable to the conclusion drawn from a fully adjusted analysis, according to which neither mutation would seem important in determining viral load. In addition, a practical or theoretical ETA violation can easily result in estimates of the effect of a mutation that are so biased, and estimates of the standard error that are so much too small, that the resulting confidence intervals and p-values are wrong, resulting in false claims. That is, the data are not only unable to assess the effect of the mutation adjusting for this set of confounders, but are also unable to reliably assess the variance of the estimate of this effect, making statistical inference impossible.
[48] Particular embodiments are based on the idea of developing a criterion that can give the user a sense of the extent to which the variable importance parameter corresponding to a proposed adjustment set is identifiable from the data at hand. If this criterion suggests that the parameter corresponding to the full adjustment set is not well identified, it can then also be used to identify a smaller, more workable adjustment set. In one embodiment, two criteria may be used. The first criterion makes use of a simulation-based approach for diagnosing the bias that a so-called Inverse-Probability-of-Treatment-Weighted (IPTW) estimator is subject to if the ETA assumption is violated. The second criterion makes use of closed-form estimates derived for the asymptotic bias of a truncated IPTW estimator. In some applications the greater computational burden of the first criterion may make the second approach a more appealing option. The IPTW estimator can be replaced by other estimators that are sensitive to ETA violations, such as the nonparametric targeted maximum likelihood estimator. In addition, particular embodiments involve an approach for defining a sequence of nested candidate target adjustment sets that, in combination with a given identifiability criterion, can be used to select an appropriate effective adjustment set data-adaptively.
[49] Even if the variable importance parameter corresponding to a particular
adjustment set is identified reasonably well by the data at hand, it may be advantageous to base estimation of this parameter on an adjustment set that in fact excludes additional covariates. The adjustment set defining the parameter of interest may, for example, contain a covariate that is a good predictor of the mutation under consideration, but only a weak predictor of viral load. Such a covariate may be only a weak confounder of the relationship between the mutation and viral load, but can still lead to a mild practical violation of the ETA assumption that would cause the variable importance estimator to become more variable. Not adjusting for this covariate could thus, at the price of a slight increase in bias, offer a considerable reduction in variability, leading to an overall reduction in mean squared error. Particular embodiments therefore also involve an approach that, given an adjustment set defining the parameter of interest, can be used to evaluate whether such additional exclusions from the adjustment set can be expected to lead to a more efficient estimator with smaller mean squared error.
[50] The closed-form mean squared error estimates used for truncated IPTW estimators have an additional application in selecting an appropriate truncation level for IPTW estimators. By weighting subjects by the inverse of the conditional probability of having selected their observed treatment, given available confounders, these estimators create a new sample in which treatment assignment is independent of the measured confounders. If the ETA assumption is practically violated, observations with very small treatment probabilities and correspondingly large weights can dominate the remainder of the sample so that the estimator tends to become highly variable. In such instances, the use of truncated weights can often, at the price of a slight increase in bias, lead to a dramatic reduction in variability and thus typically also to a reduction in mean squared error. The closed-form estimates for the asymptotic bias of a truncated IPTW estimator can be used to select this truncation data-adaptively based on the goal of minimizing the mean squared error of the estimator.
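The weighting and truncation described above can be sketched as follows. This is a minimal illustration, assuming a binary treatment and treatment probabilities that are already known or estimated; the function name truncated_iptw and the parameter max_weight are hypothetical, and the closed-form bias estimates mentioned in the text are not reproduced here.

```python
def truncated_iptw(a, y, g1, max_weight):
    """IPTW estimate of the mean outcome difference (treated minus untreated)
    with inverse-probability weights capped at max_weight.

    a: observed treatments coded 0/1.
    y: observed outcomes.
    g1: for each subject, the probability of treatment 1 given confounders.
    """
    n = len(y)
    total = 0.0
    for ai, yi, gi in zip(a, y, g1):
        # Probability of the treatment level that was actually observed.
        prob_observed = gi if ai == 1 else 1.0 - gi
        # Truncation: very small probabilities no longer yield huge weights.
        weight = min(1.0 / prob_observed, max_weight)
        total += weight * yi if ai == 1 else -weight * yi
    return total / n


a = [1, 0, 1, 0]
y = [2.0, 1.0, 2.0, 1.0]
g1 = [0.5, 0.5, 0.5, 0.5]
untruncated = truncated_iptw(a, y, g1, max_weight=10.0)
heavily_truncated = truncated_iptw(a, y, g1, max_weight=1.0)
```

With all weights equal to 2 the cap of 10 is never active and the estimate is 1.0; capping the weights at 1 halves every contribution and gives 0.5, which shows how aggressive truncation trades variance for bias.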
[51] One of the criteria used for assessing if a given adjusted variable importance
parameter can be reliably estimated from the data at hand is based on closed-form estimates developed for the mean squared error of a truncated Inverse-Probability-of-Treatment-Weighted (IPTW) estimator. If the ETA assumption is practically violated, the performance of this estimator can often be improved by truncating the Inverse-Probability-of-Treatment weights. The mean squared error estimates can thus in particular be used to select an appropriate truncation level based on the goal of minimizing the mean squared error of the estimator. The data-adaptive selection of a truncation level is an important ingredient in itself, since many estimators require selection of a truncation level in order to make them more robust, and the method of selecting the truncation level by minimizing an estimated mean squared error applies to each of these cases.
[52] The method for selecting the adjustment set in a variable importance analysis generalizes to the selection of a targeted parameter among a family of parameters, based on a criterion that assesses the amount of bias in the target parameter of interest due to lack of identifiability. This bias is referred to as sparse-data bias, that is, bias due to lack of data.
[53] This unifying method is based on using the influence curve of each estimator in the family of estimators corresponding with the family of parameters. Since an influence curve can be derived for any asymptotically linear estimator of any pathwise differentiable parameter, this generalizes the methodology to all problems of interest, and thereby provides a unifying methodology.
[54] In the selection of the target parameter among a family of parameters based on sparse-data bias, consider a data set O(1),...,O(n) corresponding to a family of n experiments, whose probability distribution implies a common (in n) probability distribution P. In one case, the experiments are independent, and possibly identically distributed, but in some applications the experiments can be dependent. For discussion purposes, it is assumed that the n experiments are independent and identically distributed
with distribution P, but results have straightforward analogues for non-identical and possibly dependent experiments still allowing asymptotically linear (and thereby normally distributed) approximations of the estimators of the smooth parameters psi(delta) defined below of a common P across experiments as well as estimators of the asymptotic variance of these estimators.
[55] In many examples the data structure O on one experimental unit can be represented as a censored data structure, O=Phi(C,X), on some full data random variable X, so that its distribution is identified by the full data distribution and a conditional distribution G of the censoring variable C, given X. The typical model may then be the model implied by a model for the full data distribution F (e.g. nonparametric) and some model for G.
[56] Fig. 5 depicts a simplified flowchart 500 for selecting the targeted parameter based on sparse-data bias according to one embodiment. Step 502 defines a family of pathwise differentiable parameters. The pathwise differentiable parameters may be the effect of an input variable on an outcome adjusting for different target sets of confounders. First, a family of pathwise differentiable parameters psi(delta)=Psi(delta)(P) of P is defined, indexed by delta in an index set and ranging from potentially non-identifiable or hardly identifiable parameters to parameters that are relatively easy to identify. For example, psi(delta) might represent the delta-variable importance parameter defined by an adjustment set W(delta) contained in the full adjustment set W, where O_i=(W_i,A_i,Y_i), i=1,...,n. More generally, psi(delta) could be defined as a causal-effect parameter based on a reduction of O indexed by delta (e.g., the reduction is obtained by removing certain baseline or time-dependent covariates/confounders). The choice of delta can also index a particular algorithm applied to the data generating distribution defining the parameter psi(delta).
[57] Step 504 defines data based estimators that are estimators that only rely on assumptions not affecting the information bound of the pathwise differentiable
parameters. Particular embodiments define a corresponding family of estimators psi_n*(delta) whose consistency does not, or only minimally, rely on unknown modeling assumptions (i.e., beyond the assumptions specified by the actual model for P), but these estimators are allowed to rely on assumptions that do not affect the asymptotic information bound for psi(delta). In other words, these estimators are heavily data-based estimators that rely minimally or not at all on model-based extrapolation that would improve the information bound for psi(delta) relative to the information bound for psi(delta) in the actual model for the data generating distribution P. In particular, these estimators need to be chosen such that, if the parameter psi(delta) is hardly identifiable from the data given the actual model (i.e., the model known to be true), then the influence curve of this estimator may be a random variable with very large variance (i.e., allowing for extreme outliers), or the estimator will simply be asymptotically biased. In general, psi_n*(delta) may be a targeted maximum likelihood estimator, since the influence curves of targeted maximum likelihood estimators are very sensitive to sparse-data bias. In the delta-variable importance application, psi_n*(delta) may also be an IPTW estimator, possibly with known treatment/censoring mechanism. In this application an IPTW estimator may be selected in the model in which the treatment mechanism is known, since a model with known CAR treatment/censoring mechanism has the same information bound for the variable importance parameter as a model with unknown CAR treatment/censoring mechanism.
[58] Step 506 derives the influence curves of the family of data-based estimators. Let IC*(delta,P) represent the influence curve of the data-based estimator psi_n*(delta) of psi(delta) for each delta. For any parameter that is identifiable at P, the expectation of the influence curve equals zero, E_P IC*(delta,P)(O) = 0, but such a parameter might still suffer from a practical lack of identifiability in the sense that for some possible observations from the probability distribution P the influence curve values might be extremely large. This results in finite-sample bias of psi_n*(delta) with respect to the parameter Psi(delta)(P).
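For reference, the standard textbook definition of asymptotic linearity, which is what gives the influence curve its role in this step, can be written in the notation of the text as below. The display is a general statement about asymptotically linear estimators and is not a formula taken from this specification.

```latex
\psi_n^*(\delta) - \Psi(\delta)(P)
  = \frac{1}{n}\sum_{i=1}^{n} IC^*(\delta, P)(O_i) + o_P\!\left(n^{-1/2}\right),
\qquad
E_P\, IC^*(\delta, P)(O) = 0 .
```

Because the estimator behaves, to first order, like an average of influence curve values, a few observations with extremely large values can dominate that average in a finite sample.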
[59] Step 508 generates analytic influence-curve-based measures of sparse-data bias. As an alternative to a simulation-based approach for assessing finite-sample sparse-data bias under an estimated data generating distribution, the following analytical approaches may be used. First, a user-supplied, practically reasonable truncation level M for the influence of a single observation may be defined, and IC*(delta,M) denotes the corresponding truncated influence curve. For example, a user might decide that a single observation should never represent an influence curve contribution exceeding 2% of the sample variance of the influence curve. In essence, the user needs to decide what level of robustness is minimally required for any estimator to be admissible.
[60] As a measure of the sparse-data bias of the data-based estimator psi_n*(delta) as an estimator of psi(delta), the bias (e.g., the empirical mean) of the truncated influence curve under a particular estimated data generating distribution is used. Particular embodiments estimate the unknowns in the influence curve in such a way that the empirical mean of the untruncated influence curve equals zero. In one embodiment, a targeted maximum likelihood estimator is used to solve the equation that sets the empirical mean of the untruncated influence curve to zero. The sparse-data bias is defined as the empirical mean of the influence curve truncated at M. It is noted that, by requiring the empirical mean of the untruncated influence curve to equal zero, this measure of bias is due only to truncation of the influence curve and is therefore indeed a measure of bias due to lack of identifiability based on the data at hand. In one particular embodiment the empirical mean of the truncated influence curve is replaced by the cross-validated empirical mean of the truncated influence curve. The proposed measures of sparse-data bias are numbers on the same scale as the parameters psi(delta), thereby allowing natural quantification, and their theoretical underpinning is based on the fact that the expectation of an influence curve at a candidate P1 equals, to first order, the parameter of interest at P1 minus the true parameter. Other measures of sparse-data bias derived from the influence curve of the delta-specific (e.g., targeted maximum likelihood) estimators may be appreciated.
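The measure described above can be sketched in a few lines. The sketch assumes the influence curve has already been evaluated at each observation with its unknowns estimated so that the untruncated values average to zero, and it interprets truncation at M as clipping each value to the interval [-M, M]; both the function name sparse_data_bias and that interpretation are assumptions of this illustration.

```python
def sparse_data_bias(ic_values, m, tol=1e-8):
    """Empirical mean of the influence curve values truncated at level m.

    ic_values: influence curve evaluated at each observation; the values are
    required to have empirical mean zero (within tol) before truncation.
    m: user-supplied truncation level for a single observation's influence.
    """
    n = len(ic_values)
    if abs(sum(ic_values) / n) > tol:
        raise ValueError("untruncated influence curve values must average to zero")
    # Clip each contribution to [-m, m]; only clipped values create bias.
    truncated = [max(-m, min(m, value)) for value in ic_values]
    return sum(truncated) / n


ic = [-1.0, -1.0, -1.0, -1.0, 4.0]   # one dominant observation
bias_at_2 = sparse_data_bias(ic, m=2.0)
bias_at_10 = sparse_data_bias(ic, m=10.0)
```

Here the single large value is clipped from 4 to 2 when M is 2, giving a bias of -0.4, while a truncation level of 10 clips nothing and gives a bias of zero. Because the untruncated mean is zero by construction, any non-zero result is attributable to truncation alone.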
[61] Step 510 selects the target parameter. An example of a target parameter is an effect of an input variable on an outcome adjusting for a target set of confounders. The target parameter is now defined by setting an acceptable level of sparse-data bias, possibly relative to a particular parameter that is easy to estimate. In this way, the user is able to evaluate how far a given parameter is from the edge (in a space/set of candidate parameters) at which parameters become, for all practical purposes, impossible to identify so that statistical estimation of this parameter as well as inference (i.e., standard errors) will become completely unreliable. If there are multiple parameters within the specified identifiable range, then the user may have to make a choice. Typically, the parameters psi(delta) can be ordered with respect to their distance to the wished (but possibly impossible to estimate) parameter so that the first parameter is selected in this ordered list which satisfies the acceptable identifiable bias. Let delta* denote the selected choice of target parameter.
[62] Step 512 data-adaptively selects the effective parameter in order to estimate the target parameter based on mean squared error. Given the selection of the targeted parameter Psi(delta*) and given user-supplied wished estimators (e.g., targeted MLE) psi_n(delta) of psi(delta) with influence curves IC(delta) for all delta, an effective parameter choice is selected by minimizing over delta an estimate of the Mean Squared Error (MSE). (Note that the wished estimators psi_n(delta) need not be equal to the estimators psi_n*(delta) used to assess identifiability bias to select the target parameter.) The bias component of the MSE at delta can be estimated as the difference between the estimators psi_n(delta) and psi_n(delta*), and the variance component is estimated with the influence curve of psi_n(delta). Due to the fact that the target parameter is chosen away from the edge of becoming non-identifiable, this method for selecting delta may have a superior practical performance relative to its performance applied to a target parameter that is close to non-identifiable. The effective parameter could be an effect of an input variable on an outcome adjusting for an effective set of confounders.
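The selection in step 512 can be sketched as follows. This is an illustration under stated assumptions: the estimates and influence curve values for each delta are supplied by the caller, the variance of the estimator is approximated by the sample variance of the influence curve values divided by n, and the names select_effective, estimates and ic_values are hypothetical.

```python
from statistics import variance


def select_effective(estimates, ic_values, target):
    """Return the delta minimizing the estimated mean squared error.

    estimates: dict mapping delta to the estimate psi_n(delta).
    ic_values: dict mapping delta to the influence curve values of psi_n(delta).
    target: the selected target index delta*.
    """
    def estimated_mse(delta):
        # Bias component: difference from the estimate at the target delta*.
        bias = estimates[delta] - estimates[target]
        # Variance component: influence-curve variance divided by sample size.
        values = ic_values[delta]
        var = variance(values) / len(values)
        return bias ** 2 + var

    return min(estimates, key=estimated_mse)


ic_by_delta = {
    "full": [-10.0, 10.0, -10.0, 10.0],   # highly variable estimator
    "reduced": [-1.0, 1.0, -1.0, 1.0],    # much more stable estimator
}
small_shift = select_effective({"full": 1.0, "reduced": 1.1}, ic_by_delta, "full")
large_shift = select_effective({"full": 1.0, "reduced": 9.0}, ic_by_delta, "full")
```

When the reduced estimator moves the estimate only slightly, its lower variance wins and it is selected; when it moves the estimate by a large amount, the squared bias dominates and the target set itself is retained.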
[63] The estimate of the bias term in the MSE criterion for psi_n(delta) can be penalized with an additional finite-sample bias term, which can be estimated with the cross-validated mean of the influence curve of psi_n(delta). The purpose of this finite-sample bias term is that it picks up a contribution due to the second-order term in a Taylor expansion of the estimator psi_n(delta) relative to psi_0(delta), while the variance of the influence curve is based only on the first-order linear approximation of the estimator. For example, increasing delta (assuming an ordering exists) could correspond with increasing finite-sample bias of the corresponding estimator psi_n(delta) relative to psi(delta), while, possibly, the actual variance of psi_n(delta) decreases. By not taking into account the increase in finite-sample bias with increasing delta, one might end up with a choice providing a wrong bias-variance trade-off. Therefore, in situations in which finite-sample bias is a concern, including a finite-sample bias estimate of E psi_n(delta)-psi(delta) is recommended. This finite-sample bias may also be estimated with the bootstrap.
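The bootstrap option mentioned at the end of the paragraph above can be sketched as follows. This is the generic nonparametric bootstrap bias estimate (mean of resampled estimates minus the original estimate), assuming independent and identically distributed observations; the function name bootstrap_bias and its parameters are hypothetical, and the specification does not prescribe this particular resampling scheme.

```python
import random
from statistics import mean


def bootstrap_bias(data, estimator, n_boot=2000, seed=0):
    """Estimate the finite-sample bias of estimator by resampling data.

    data: list of observations.
    estimator: function mapping a list of observations to a number.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    original = estimator(data)
    replicates = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]
        replicates.append(estimator(resample))
    return mean(replicates) - original


def plug_in_variance(xs):
    """Variance with divisor n, a textbook example of a biased estimator."""
    center = mean(xs)
    return sum((x - center) ** 2 for x in xs) / len(xs)


bias_of_mean = bootstrap_bias([3.0] * 10, mean)
bias_of_variance = bootstrap_bias([float(i) for i in range(10)], plug_in_variance)
```

The plug-in variance is known to be biased downward in small samples, so its bootstrap bias estimate comes out negative, whereas the mean of constant data shows no bias at all.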
[64] A particular application of the selection methodology above concerns the selection among candidate targeted maximum likelihood estimators of a variable importance or a causal effect indexed by different algorithms for the nuisance parameters such as the initial regression estimator. The use of data adaptive algorithms for these nuisance parameters can potentially cause non-negligible finite sample bias in the targeted maximum likelihood estimators which is not captured by variance estimates based on the influence curve.
[65] A particular case of interest is the case in which all targeted maximum likelihood estimators (i.e., for each choice of delta) are actually known to asymptotically target the right parameter psi, such as targeted maximum likelihood estimators of the causal effect of a treatment in a randomized trial. In this case these targeted maximum likelihood estimators may be indexed by different initial regression estimators, but each targeted maximum likelihood estimator is known to be asymptotically consistent for the wished causal effect psi. That is, a class of asymptotically linear estimators psi_n(delta) of psi is
indexed by different algorithms identified by delta with influence curve IC(delta) as above, but when selecting among these candidate targeted maximum likelihood estimators the finite-sample bias of these estimators needs to be taken into account as well (beyond comparing the variance of the candidate estimators as can be derived from their influence curves). The variance of these estimators can be based on the influence curve, and the bias term can be estimated with the finite-sample bias estimate only, since none of these targeted maximum likelihood estimators has any asymptotic bias.
[66] Although the description has been described with respect to particular embodiments thereof, these particular embodiments are merely illustrative, and not restrictive. Any suitable programming language can be used to implement the routines of particular embodiments including C, C++, Java, assembly language, etc. Different programming techniques can be employed such as procedural or object oriented. The routines can execute on a single processing device or multiple processors. Although the steps, operations, or computations may be presented in a specific order, this order may be changed in different particular embodiments. In some particular embodiments, multiple steps shown as sequential in this specification can be performed at the same time.
[67] Particular embodiments may be implemented in a computer-readable storage medium or tangible medium for use by or in connection with the instruction execution system, apparatus, system, or device. Particular embodiments can be implemented in the form of control logic in software or hardware or a combination of both. The control logic, when executed by one or more processors, may be operable to perform that which is described in particular embodiments.
[68] Particular embodiments may be implemented by using a programmed general purpose digital computer, or by using application specific integrated circuits, programmable logic devices, or field programmable gate arrays; optical, chemical, biological, quantum or nanoengineered systems, components and mechanisms may also be used. In general, the functions of particular embodiments can be achieved by any means as is known in the art.
Distributed, networked systems, components, and/or circuits can be used. Communication, or transfer, of data may be wired, wireless, or by any other means.
[69] It will also be appreciated that one or more of the elements depicted in the drawings/figures can also be implemented in a more separated or integrated manner, or even removed or rendered as inoperable in certain cases, as is useful in accordance with a particular application. It is also within the spirit and scope to implement a program or code that can be stored in a machine-readable medium to permit a computer to perform any of the methods described above.
[70] As used in the description herein and throughout the claims that follow, "a", "an", and "the" include plural references unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of "in" includes "in" and "on" unless the context clearly dictates otherwise.
[71] Thus, while particular embodiments have been described herein, latitudes of modification, various changes, and substitutions are intended in the foregoing disclosures, and it will be appreciated that in some instances some features of particular embodiments will be employed without a corresponding use of other features without departing from the scope and spirit as set forth. Therefore, many modifications may be made to adapt a particular situation or material to the essential scope and spirit.
