Nadon, Robert (35 South Drive, St Catharines, Ontario L2R 4T9, CA)
Susko, Edward (80 Camelot Lane, Apt. 126 Halifax, Nova Scotia B3M 4KB, CA)
Nadon, Robert (35 South Drive, St Catharines, Ontario L2R 4T9, CA)
| 1. | A method for improving the reliability of physical measurements obtained from array hybridization studies performed on an array having a large number of genomic samples including a replicate subset containing a small number of replicates insufficient for making precise and valid statistical inferences, comprising the step of estimating an error in measurement of a sample by combining estimates obtained with individual samples in the replicate subset, and utilizing the estimated sample error as a standard for accepting or rejecting the measurement of a sample under test. |
| 2. | The method of claim 1 wherein the'combining step includes taking the difference between estimates obtained for a pair of samples in the replicate subset. |
| 3. | The method of claim 2 wherein the difference is taken between the estimates for all permutations of pairs of samples for the replicate subset. |
| 4. | The method of claim 3 wherein the difference is taken between the estimates for all combinations of pairs of samples for the replicate subset. |
| 5. | The method of any one of claims 14 used with respect to two new samples to establish a confidence level that two samples under test express different outcomes of a physical measurement. |
| 6. | The method of any one of claims 14 wherein the estimates of measurement error are used to plan, manage and control array hybridization studied on the basis of (a) the probability of detecting a true difference of specified magnitude between physical measurements of a given number of samples under test, or (b) the number of samples under test required to detect a true difference of specified magnitude. |
| 7. | The method of anyone of claims 16 wherein the sample under test is in the array. |
| 8. | The method of anyone of claims 16 wherein the sample under test is in an array other than the array. |
| 9. | The method of anyone of claims 16 used to determine whether the sample under test deviates substantially from all of the other values in a selected portion of an array. |
| 10. | The method of any preceding claim wherein there are no replicates corresponding to the sample under test. |
Background of the Invention Array-based genetic analyses start with a large library of cDNAs or oligonucleotides (robes), immobilized on a substrate. The probes are hybridized with a single labeled sequence, or a labeled complex mixture derived from a tissue or cell line messenger RNA (target). As used herein, the term"probe"will therefore be understood to refer to material tethered to the array, and the term"target"will refer to material that is applied to the probes on the array, so that hybridization may occur.
The term"element"will refer to a spot on an array. Array elements reflect probe/target interactions. The term"background"will refer to area on the substrate outside of the elements.
The term"replicates"will refer to two or more measured values of the same probe/target interaction. Replicates may be within arrays, across arrays, within experiments, across experiments, or any combination thereof.
Measured values ofprobe/target interactions are a function of their true values and of measurement error. The term"outlier"will refer to an extreme value in a distribution of values. Outlier data often result from uncorrectable measurement errors and are typically deleted from further statistical analysis.
There are two kinds of error, random and systematic, which affect the extent to which observed (measured) values deviate from their true values. Random errors produce fluctuations in observed values of the same process or attribute. The extent and the distributional form of random errors can be detected by repeated measurements of the same process or attribute. Low random error corresponds to high precision. Systematic errors produce shifts (offsets) in measured values. Measured values with systematic errors are said to be"biased".
Systematic errors cannot be detected by repeated measurements of the same process or attribute because the bias affects the repeated measurements equally. Low systematic error corresponds to high accuracy. The terms"systematic error","bias", and"offset"will be used inter-changeably in the present document.
An invention for estimating random error present in replicate genomic samples composed of small numbers of data points has been described by Ramm and Nadon in"Process for Evaluating Chemical and Biological Assays", International ApplicationNo. PCT/IB99/00734, filed 22 April 1999, the entire disclosure of which is incorporated herein by reference. In a preferred embodiment, the process described therein assumed that, prior to conducting statistical tests, systematic error in the measurements had been removed and that outliers had been deleted.
Once systematic error has been removed, any remaining measurement error is, in theory, random. Random error reflects the expected statistical variation in a measured value.
A measured value may consist, for example, of a single value, a summary of values (mean, median), a difference between single or summary values, or a difference between differences.
In order for two values to be considered significantly different from each other, their difference must exceed a threshold defined jointly by the measurement error associated with the difference and by a specified probability of concluding erroneously that the two values differ (Type I error rate). Statistical tests are conducted to determine if values differ significantly from each other.
In addition, to correct removal of systematic error, many statistical tests require the assumption that residuals be normally distributed. When it is incorrectly assumed that residuals are normally distributed, the calculation of the residuals and of subsequent statistical tests is biased. Residuals reflect the difference between values'estimated true scores and their observed (measured) scores. If a residual score is extreme (relative to other scores in the
distribution), it is called an outlier. An outlier is typically removed from further statistical analysis because it generally indicates that the measured value contains excessive measurement error that cannot be corrected. In order to achieve normally distributed residuals, data transformation is often necessary (e. g., log transform).
Two approaches have been presented in prior art.
Pietu et al. (1996) observed in their study that a histogram of probe intensities presented a bimodal distribution. They observed further that the distribution of smaller values appeared to follow a Gaussian distribution. In a manner not described in their publication, they "fitted"the distribution of smaller values to a Gaussian curve and used a threshold of 1.96 standard deviations above the mean of the Gaussian curve to distinguish nonsignals (smaller than the threshold) from signals (larger than the threshold). Based on calculation of residuals, the present invention also provides for threshold estimations. However, the present invention differs from Pietu et al. (1996) in that it: uses replicates; uses formal statistical methods to obtain threshold values; does not assume a Gaussian (or any other) distribution ; and can detect outlier values.
Chen, Dougherty, & Bittner have presented an analytical mathematical approach that estimates the distribution of non-replicated differential ratios under the null hypothesis. This approach is similar to the present invention in that it derives a method for obtaining confidence intervals and probability estimates for differences in probe intensities across different conditions.
It differs from the present invention in how it obtains these estimates. Unlike the present invention, the Chen et al. approach does not obtain measurement error estimates from replicate probe values. Instead, the measurement error associated with ratios of probe intensities between conditions is obtained via mathematical derivation of the null hypothesis distribution of ratios.
That is, Chen et al. derive what the distribution of ratios would be if none of the probes showed differences in measured values across conditions that were greater than would be expected by "chance."Based on this derivation, they establish thresholds for statistically reliable ratios of probe intensities across two conditions. The method, as derived, assumes that most genes do not show a treatment effect and that the measurement error associated with probe intensities is normally distributed (i. e., that the Treatment/Reference ratios are normally distributed around a ratio of approximately 1). The method, as derived, cannot accommodate other measurement error models (e. g., lognormal). It also assumes that all measured values are unbiased and reliable
estimates of the"true"probe intensity. That is, it is assumed that none of the probe intensities are "outlier"values that should be excluded from analysis. Indeed, outlier detection is not possible with the approach described by Chen et al. The present invention differs from Chen et al. (1997) in that it: uses replicates does not assume a Gaussian (or any other) distribution ; can be applied to single condition data (i. e., does not require 2 conditions to form ratios); does not require the assumption that most genes do not show a treatment effect; and can detect outliers.
In accordance with one aspect, the present invention is a process for estimating the extent of random error present in replicate genomic samples composed of small numbers of data points and for conducting a statistical test comparing expression level across conditions (e. g., diseased versus normal tissue). It is an alternative to the method described by Ramm and Nadon in"Process for Evaluating Chemical and Biological Assays", International Application No.
PCT/IB99/00734. As such, it can be used in addition to (or in place of) the procedures described by Ramm and Nadon. In accordance with another aspect, the present invention is a process for establishing thresholds within a single distribution of expression values obtained from one condition or from an arithmetic operation of values from two conditions (e. g., ratios, differences).
It is an alternative to the deconvolution process described in International Application No.
PCT/IB99/00734. In accordance with a third aspect, it is a process for detecting and deleting outliers.
Brief Description of the Drawings The foregoing brief description,. a well as further objects, features and advantages of the present invention will be understood more completely from the following detailed description of a presently preferred, but nonetheless illustrative embodiment, with reference being had to the accompanying drawings, in which: Figure 1 shows the results of residual estimation based on simulated data ; and Figures 2 and 3 shows results of residual estimation based on actual experimental data.
Description of the Preferred Embodiment
We assume throughout that we observe data y ;, with i =1,..., n and j =1,..., ni where: Yjjj =µi + #ij (1) and the E ii are independent and identically distributed. Our interest is in estimating the residual distribution, the distribution of the e. j. Let f, f * and F denote the density, characteristic function and cumulative distribution function of the E A tacit assumption is that n is large and ni is small, for instance 2 or 3.
Assumptions such as these arise naturally in measurement error models. While our interest in estimating the residual distribution arose in the analysis of gene expression data, we expect the methodology to be of broader applicability.
With m moderate to large, the usual estimate of the residual distribution is a discrete distribution that gives equal mass to each of the estimated residuals: This estimator is biased with the bias dependent up n the residual distribution. For instance, for a N (0, 1) residual distribution the expectation of Fe is the N (0, (m-1)/m) distribution. For a Cauchy residual distribution, the expectation is the distribution of a Cauchy random variable multiplied by 2-2/m. When the residual distribution has finite mean, the bias decreases with increasing m. With n large and in small however, the bias dominates the variance. In contrast the methods presented here give consistent (large n) estimates of the residual distribution. The basic idea uses the differences in observations, Yül-Yü2 which have distributions that depend, in a known way, upon the residual distribution alone. This differs from the usual way of calculating residuals. An example best illustrates this difference. Consider the three replicate values of 1, 2, and 3. The usual way of calculating residuals is to subtract the mean of the three values from each value in turn (1-2; 2-2; 3-2) yielding three residuals (-1,0,1). In the preferred form of the present process, the residuals are calculated instead by subtracting each replicate value from each of the other replicate values in all possible permutations. In the present example, this would be (Replicate 1-Replicate 2; Replicate 2-Replicate 1; Replicate 1-Replicate 3; Replicate 3- Replicate 1; Replicate 2-Replicate 3; Replicate 3-Replicate 2), that is, (1-2; 2-1; 1-3; 3-1; 2-3; 3-2) to yield six residuals (-1, 1,-2,2,-1,1). This approach has the advantage of not including
the potentially biasing effect of including the mean in the calculations. Alternatively, all possible combinations (rather than permutations) might be used.
Two methodologies are proposed: inversion of an estimate of the characteristic function of the residuals and an E-M algorithm approach that seeks a residual distribution that maximizes a pseudo-likelihood for the differenced data. A key reference for the characteristic function methodology is Zhang (1990). Background material for the E-M algorithm is available in Dempster, Laird and Rubin (1977) and McLachlan and Krishnan (1997).
Array Based Expression Data The estimation of residual distributions became of interest to us in the analysis of array based gene expression intensity data. Regardless of the technology used (macroarrays, microarrays, or biochips) or the labeling method (radio-is topic, fluorescent, or multi-fluorescent), the observed values reflect the total amount of hybridization (joining) of two complementary strands of DNA to form a double-stranded molecule. The log-transformed observations (radio-isotopic, fluorescent, fluorescent ratios) can be labeled y. in where g denotes the experimental condition that the observed values correspond to (for instance, drug versus control, different tissues, etc.).
The index !'indicates the genetic sequence tag used in the experiment and j indicates that the observation was the jth repeated measurement within the genetic sequence tag/condition. The model for they ; is: Y gij µ +o, eg, j where the Egi ; are assumed independent and identically distributed. Here the eg ; j are measurement errors; log ; is the true intensity value for the gth condition and ith tag. Primary interest is in gli- p2i the difference in the intensity values, for a given genetic sequence tag, between two different conditions. A gene's expression intensity reflects its activity at specific moments or circumstances according to the design of the study. A gene's activity is of interest in its own right and also because it usually reflects the production of protein, which has corollaries for the function and regulation of cells, tissues, and organs in the body. Differences in gene expression are of interest to the extent that they reflect differences across conditions of these biological processes. Gene expression data have been characterized by large measurement error variation, large numbers of comparisons (sequence tags) and small numbers of measurements for each sequence tag. The number of comparisons can range between a few hundred and hundreds of
thousands. The numbers of measurements for a given sequence tag and condition are often 2 or 3. Because the measurement error is non-negligible it is usually the case that confidence intervals for the differences, u, ;-, U2i are desired. One approach is to make the common assumption that the residuals are normally distributed, in which case (1-a) x 100% confidence intervals would be provided by Here #g2 is the measurement errior varaince for the gth condition. With known non-normal residual distributions different forms of confidence intervals wouild usually be considered but it would still be reasonable to consider intervals with center #kli - #2i and half-width a constant multiple T of What value of i to use depends upon the particular form of the residual distribution. For the normal distribution T is Za/2 for the double exponential exponential distribution it would be-log (a). Thus, for instance, to obtain a 95% confidence interval # = 1.96 would be used for a normal residual distribution and # = 3 would be used for the double exponential. These very different values of T indicate that the residual distribution for a given condition is important t the inferences of interest in the analysis of expression data. Because of the similarities in the measurement process across comparisons and the large number of comparisons, it should be possible to obtain estimates of the residual distribution with low variability. Because of the small number of measurements for each comparison, care has to be taken to avoid bias in estimation.
Characteristic Function Methodology One approach to estimation ofthe residual distribution is through the characteristic function for the Yy-Yij2. Since Yij1 - Yij2 = #ij1 - #ij2 this characteristic function is f*(t)f* (-t). The form of the characteristic function for the difference indicates several identifiability problems. If the residual distribution is not a symmetric distribution then the distribution of -#ij is not the same as the distribution ofs ; j. However, since the characteristic function of-£jj isf8 (-t), the characteristic function for the difference #ij - #ij2 is f* (t) f* (-t) whether the residual distribution is that of-e ; j or p, ij. Thus skewness in the residual distribution will not be recoverable from the distribution of the difference of two errors. A common assumption for measurement error models is that the residual
distribution is symmetric. Recognizing that we cannot detect skewness we will make this assumption here. In this case the characteristic function of the difference becomesf* (t) 2 This creates an additional difficulty in that one cannot discern the sign of the residual characteristic function from the characteristic function of the difference. To adjust for this we make the additional assumption thatf* (t) is everywhere non-negative. Examples of residual distributions that satisfy the assumptions include the normal, double exponential and Cauchy distributions.
Estimation of the Residual Characteristic Function A direct estimate of the characteristic function for the differences is available as, for instance, The estimate #e* (t) is unbiased but highly variable. Following Zhang (1990) it is valuable to consider a smoothed version of the characteristic function : #s*(t;c) = #e*(t)h*(t/c) where h * is a characteristic function in correspondence with density h. Since h * (t) < 1, C) is biased downwards. Small values of c tend to give smoother characteristic function estimates.
On the other hand as c # #, #s*(t ; c) # #e*(t). Since the characteristic function is assumed non-negative another reasonable estimate of/* (t) is . fZ'-fe* (t) if t < Z otherwise where Z is the smallest t> 0 such that #e(t) = 0.
Given a characteristic function estimate t (t) for the difference £ ü2, an estimate of the residual characteristic function is A density estimate is obtained by the inversion formula The cumulative distribution function estimate can be obtained by integration of the density estimate. The integration cannot be performed explicitly and must be done numerically.
We will refer to a density or cumulative distribution function estimate based on as a ICF (inverse characteristic function) density or cumulative distribution function estimate. The estimates vary depending upon which estimate for the characteristic function of the differences is used. We refer to the estimate based on (5) as the unsmoothed ICF estimate and an estimate based on (4) as a smoothed ICF estimate.
Rate of Convergence of the Density Estimates The use of characteristic functions for the estimation of a density of a random variable Y when Y + X is observed where X is a random variable with known density has been considered by Carroll and Hall (1988) and Zhang (1990). Here we wish to estimate the density of Y-Xwhere YandXboth have the same but unknown density. The problems are similar and we will use a modification of the results of Zhang (1990) to obtain upper bounds for the rate of convergence of the smoothed density estimates.
Theorem 1 let #(x) be the estimator off(x) given by (6) with #d*(t) given by #*(t;cn).
Suppose that h* satisfies that h (x) = h(-x), #x2h(x)dx < #, ##xh'(x)#dx < #, h*(t) = 0, ##t# > 1 (7) Suppose that the constants cn satisfjo that 1/n # c0##*(cn)#2 /cn3, c0 < #m, #f*(cn)# = ##f*(t)#
Then # E## - f#2 = 0, # f # #f# < # (9) and <BR> <BR> <BR> <BR> <BR> <BR> sup E## - f#2 # #c0C3 + (M1C1)2#/2#, #n# 1. (10)<BR> <BR> f#tf*(t)##M1 An example of a characteristic function satisfying (7) is the function h * (t) that is proportional to the density of the average of four uniform random variables but rescaled so that h * (0) = 1. We used this characteristic function for the simulations and examples in later sections.
Zhang (1990) shows that the normal, Cauchy, and double exponential distributions satisfy the assumptions of the theorem. The resulting rates of convergence are as follows: 1. Normal : With E## - f#2 = O([log(n)]-1).
2. Cauchy : f (x) = (1 + X2)/l. With cn = a log (n), as (0,1), 3. Double exponential: f (x) = exp (-) x))/2. With cn = αn1/7, α > 0 = O(n TheE-MAlgorithm for Estimation of the Residual Distribution As an alternative to the estimation using characteristic functions we consider estimation based upon maximization of a pseudo-loglikelihood E## - f#2 = O([log(n)]-2). where fd(y, µ,#, h) is calculated as the density of the difference of two random variables each having density
Here and the llj are fixed, equally spaced points symmetrically placed about 0. Let 7 : be the maximizer of pl (Z). Then f (x,, h) is used as the estimate of the residual density. We will refer to an estimate that maximizes (11) as a pseudo-likelihood estimator.
The form of the density given in (12) is flexible enough that almost any residual density should be identifiable with large enough T. This method of estimation avoids the numerical integration involved in the characteristic function approach but increases the computational cost by requiring that 7r be calculated as the solution of an optimization problem.
Indeed part of the reason for the form of the pseudo-loglikelihood is to simplify the estimation.
The E-Malgorithnl Maximization of pl (Tr) can be considered as a type of missing data problem and hence the E-M algorithm (Dempster, Laird and Rubin, 1977) can be used. The data points that we observe are the dij1j2 :=#ij1 - #ij2. These can be thought of as incomplete versions of #ij1, #ij2, i1, i2.
Here il and i2 are artificial random variables that do not have an explicit role in the algorithm. For each (i, j,, j2), the rth value in 1,..., T is assigned to il with probability gIr independently of i2 and #j2. The conditional distribution often, given ilis taken as N (µi1, h2). The generation of i2 is defined similarly. The complete data pseudo-loglikelihood is then The details are omitted but the E and M steps of the E-M algorithm can be shown to be: Given current estimates #j(k),
The constants of proportionality are determined by the constraints that the sums of the 7E (k), the p(kl) (j#i, jl, j2) and the p (J#i, j1, j2) all equal 1.
Examplesand Application to Expression Data A Simulated Data Example A brief example of the results of estimation when the true density is known is given in Figure 1. The data in this case were simulated from model (1) with n = 500, m = 2 and a standard normal residual density. Varying the smoothing parameters h in the case of the pseudo-likelihood estimate and c for the ICF estimate give significantly different estimates. Small values of h allow for more modes in the density estimate and consequently produce more variable estimates than larger values. Similarly small c tend to be associated with smooth density estimates and large c with density estimates with larger numbers of modes.
The smoothed ICF density estimates tend to underestimate the value of the density near 0. This is due to the smoothing factor h * (t/c) < 1 in the characteristic function estimates.
Smaller values of c are associated with greater bias in this region of the density. The pseudo-likelihood density estimates were better for these data. Generally the pseudo-likelihood estimates can be expected to perform well when the residual distribution is close to normal since the normal density is used as the kernel in (12).
The density estimates in Figure 1 are symmetric. Generally this will always be the case: The ICF estimates are symmetric since both the negative and positive differencesyi,-Yi2 and
Yra-Yar are included in the construction of (3) resulting in symmetric characteristic function estimates for (4) and (5). For the pseudo-likelihood estimates it can be shown that if the u are chosen to be symmetric about 0 and the initial weight 7rj for u. j is the same as the initial weight -pj in (14), then the final density estimate will be symmetric.
The Snzoothifzg Paranzeters The density estimates usually vary significantly with different smoothing parameters. The procedures for the selection of smoothing parameters discussed here were used for the expression data in the following sections relating to gene expression and simulations.
The multiplication ofthe characteristic function estimate (3) by h * (t/c) implies that the resultant characteristic function estimate will be 0 for Itl > c. Consequently a reasonable upper bound for the appropriate smoothing parameter c is Z, the smallest t > 0 such that fe (t) = 0. In our experience (see the simulations) we have found that even with values of c as large as Z there is significant bias in the distribution function estimates for the sample sizes of primary interest (n 2 00). For this reason we also consider the unsmoothed ICF density estimate.
For the pseudo-likelihood estimates we determine h using the 1. distance between (i) the unbiased estimate (3) of the distribution for the difference between two residuals and (ii) the cumulative distribution of the difference of two random variables resulting'from the residual density estimate (12) for the h under consideration. Since the variance for a random variable from (12) is at least h2, a reasonable upper bound ho for the smoothing parameter is the sample variance of the differences. We select a smoothing parameter/2 as the first h in {a'ho : 0 < a < 1} such that the lc distance for ak+'ho is greater than the loo for ak.
Gene Expression Data We illustrate the estimation of the residual distribution with the estimates obtained for gene expression data from brain tissue. The data are available at http ://idefix. upr420. vjf. cnrs. fr/hgi-bin/exgenx. sh ? CLNINDEX. html The expression values for brain tissue for n = 7483 genetic sequence tags were obtained as described in Pietu et. al., (1996). There were m = 2 repeated measurements for each sequence tag.
Plots of ICF densities with various smoothing parameters (c = oo gives the unsmoothed estimate) are given in Figure 2. The density estimates are all very similar in this case.
More important for calculating confidence intervals are the cumulative distributions which are given in Figure 3.
The 95\% confidence intervals for the differences µ1i - µ2i described previously would be obtained as where T is the 0.975th quantile of the residual distribution. The estimates of c for the ICF estimate of the residual distribution with c = 5, with no smoothing and pseudo-likelihood estimate are 2.37, 2.27 and 2.21 respectively. Thus one would construct a 95\% confidence interval from the unsmoothed ICF estimate as #1i. - #2fi. ~ 2.27 ##12/ m + #22 / m which would be larger than the conventional normal based interval: Yli.-Y2i- 1. 96V 1 2 Simulation Results To further evaluate the methodologies several simulations were considered. For each set of simulations, samples from (1) were generated from a given residual distribution with 71 = 500 and m = 2. The residual distributions considered were the normal, double exponential and Cauchy distributions. The estimators considered were (i) the unsmoothed ICF estimate resulting from (5) (ii) the smoothed ICF estimate resulting from (4) with c taken as Z, the smallest t > 0 such that fe (t) = 0, and (iii) the pseudo-likelihood estimate with the smoothing parameter h chosen using the/ criterion, discussed previously, with a = 0.8. For (i) and (ii) 10000 simulated samples were drawn. For (iii) the first 1000 samples were used. A summary of the results of the simulations are given in Tables 1-2. The estimates of the probabilities from (ii) are biased downwards. In contrast the estimates of the probabilities and the quantiles from (i) and (iii) are quite reasonable for these samples sizes.
The methodologies discussed in this article provide a means of estimating the residual distribution for models of the form (1). Such models arise in data settings, such as the analysis of gene expression data, where there are a large number of comparisons or mean estimations with a similar measurement error process. The purposes of obtaining density estimates
may vary. One could use them directly to adjust confidence intervals or to check a parametric residual distribution assumption.
Theorem 1 indicates that the ICF estimates provide for consistent residual distribution estimation. While the upper bounds on the rates of convergence given above suggest that a large number of observations are required for consistent estimation of the density function, the simulation results indicate that reasonable estimates ofthe cumulative distribution probability estimates can be obtained with n 2500, which is usually the situation for gene expression data..
The simulation results further favor less smoothing than one might expect. The pseudo-likelihood density estimates give reasonable density estimates as well. In contrast to the characteristic function based estimates however, more computational power is required to obtain them.
It should be appreciated that the outcome of the process of the invention can be applied to the original data set or array or it may be applied to a new one. Moreover, the process may be applied in three different ways: 1. It can be used to determine the reliability of differences across two different samples (obtained, say, from two different tissues), i. e. different outcomes of a physical measurement. This can be done with the original data set or array on which the process was applied. Since the original data set has repeated measurements, the process would typically be applied to the mean of the repeated measurements. It can also be applied to a new data set. The new data set may have only one measurement. Or in the case of repeated measurements in the new data set, the outcome of the original data set can be applied to the mean of the measurements or of course the process may be repeated with the new data set.
2. It can also be used to determine if a measured value deviates from all of the other measured values in the distribution. This is not the same as point 1. Here the comparison is not between two measured values but rather between one measured value and all of the others in a distribution. The idea here is that the measured values'"place"in the distribution is assessed relative to a threshold established by the random error estimation process. If the measured value exceeds the threshold, it is then said to represent a different physical measurement relative to the other values in the distribution. For example, most genes in an array may not be expressed above the background noise of the system. These genes would form the major portion of the distribution. Other genes may lie outside of this distribution
as indicated by their values exceeding a threshold determined by the random error estimation. These genes would be judged to represent a different physical process.
3. The process may also be used to establish"outlier"values. In the preceding description, they are also described as"an extreme value in a distribution of values."Outlier data often result from uncorrectable measurement errors and are typically deleted from further statistical analysis."Point 2, above, also refers to detecting an extreme value but in that case the extreme value is based on the intensity of the measurement. That is not an outlier as intended here. Here, outlier refers to an extreme residual value. An extreme residual value often reflects an uncorrectable measurement error.
Although preferred forms of the invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that many additions, modifications and substitutions are possible without departing from the scope and spirit of the invention as defined by the accompanying claims.
Table 1: The estimated mean F (qp) with n = 500, m = 2 based on simu- lation. Here qp is the pth quantile for the generating residual distribution.
Method (i) is the unsmoothed characteristic function based estimate (ii) the 'smoothed estimate with c = 5.0 and (iii) the E-M based estimate. Estimated standard deviations are given in parentheses. p Distribution Method 0. 75 0. 9 0. 95 0. 975 Normal (i) 0. 75 (0.02) 0.89 (0.01) 0.95 (0. 01) 0.98 (0. 01) (ii) 0.66 (0.02) 0.78 (0.02) 0.83 (0.02) 0.88 (0.02) (iii) 0.75 (0.01) 0.9 (0.01) 0.95 (0.01) 0.97 (0.01) Double (i) 0. 75 (0.02) 0.9 (0.01) 0.95 (0.01) 0.97 (0.01) Exponential (ii) 0.64 (0.02) 0.79 (0.01) 0.87 (0.01) 0.92 (0.01) (iii) 0.74 (0.02) 0.9 (0.01) 0.95 (0.01) 0.97 (0. 01) Cauchy (i) 0. 74 (0.02) 0.9 (0.01) 0.95 (0.01) 0.98 (0.00) (ii) 0.62 (0. 01) 0.8 (0.01) 0.9 (0. 01) 0.95 (0.01) (iii) 0.72 (0.06) 0.88 (0.07) 0.94 (0. 05) 0.97 (0.03) Table 2: The estimated mean pth quantile with n = 500, m = 2 based on simulation. Method (i) is the unsmoothed characteristic function based estimate (iii) the pseudo-likelihood estimate. Estimated standard deviations are given in parentheses. p Residual Distribution Method 0. 75 0. 9 0. 95 0. 975 Normal Actual 0. 67 1. 28 1. 64 1. 96 (i) 0.66 (0.07) 1.32 (0.08) 1.68 (0.10) 1.97 (0.15) (iii) 0.66 (0.05) 1.28 (0.05) 1.66 (0.07) 1.99 (0.09) Double Exponential Actual 0. 69 1. 61 2. 30 3. 00 (i) 0. 7 (0.08) 1.6 (0.12) 2.29 (0.2) 3.01 (0.32) (iii) 0.73 (0.08) 1.58 (0.11) 2.29 (0.18) 3.2 (0.28) Cauchy Actual 1. 00 3. 08 6. 31 12. 71 (i) 1.04 (0.11) 3.09 (0.39) 6.28 (0.83) 12.4 (2.01) (iii) 1.44 (1.18) 3.7 (2.02) 7.3 (2.31) 14.74 (3.7)
References Carrol, R. J. and Hall, P. (1988)."Optimal Rates of Convergence for Deconvolving a Density", Journal of the American Statistical Association, 83,1184-1186.
Chen, Dougherty, & Bittner, (1997)."Ratio-based Decisions and the Quantitative Aanalysis of cDNA Microarray Images", Journal of Biomedical Optics, 2,364-374.
Dempster, A. P., Laird, N. M. and Rubin, D. B., (1977)."Maximum Likelihood from Incomplete Data via the E-M Algorithm", Jourfzal of the Royal Statistical Society, Series B, 39,1-38.
McLachlan, G. and Krishnan, T. (1997). {\it The EMAlgorithm and Extensions}, Wiley, New York.
Pietu, G, Alibert, O., Guichard, V., Lamy, B., Bois, F., Leroy, E., Mariage-Smason, R., Houlgatte, R., Soularue, P. and Auffray, C. (1996)."Novel Gene Transcripts Preferentially Expressed in Human Muscles Revealed by Quantitative Hybridization of a High Density cDNA Array", Genome Research, 6,492-503.
Zhang, C. (1990)."Fourier Methods for Estimating Mixing Densities and Distributions", annals of Statistics}, 18,806-831.
The disclosures of the preceding references are incorporated herein in their entirty.
