Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
CLASSIFICATION OF DISEASE STATES USING MASS SPECTROMETRY DATA
Document Type and Number:
WIPO Patent Application WO/2005/010492
Kind Code:
A2
Abstract:
A method for identification of biological characteristics is aclüeved by collecting a data set relating to individuals having known biological charactenistics and analyzing the data set to identify biomarkers potentially relating to selected biological state classes. A system for identification of biological characteristics is also provided. A methodology is also provided for utilizing mass spectroscopy data to identify peptide and protein biomarkers that can be used to optimally dischirruinate experimental from control samples - where the experimental samples may, for instance, be derived from patients with various diseases such as ovarian cancer.

Inventors:
ZHAO HONGYU (US)
WILLIAMS KENNETH R (US)
WU BAOLIN (US)
STONE KATHRYN (US)
MCMURRAY WALTER (US)
ABBOTT THOMAS (US)
Application Number:
PCT/US2004/023077
Publication Date:
February 03, 2005
Filing Date:
July 19, 2004
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
UNIV YALE (US)
ZHAO HONGYU (US)
WILLIAMS KENNETH R (US)
WU BAOLIN (US)
STONE KATHRYN (US)
MCMURRAY WALTER (US)
ABBOTT THOMAS (US)
International Classes:
C12Q1/68; G01N33/48; G01N33/50; G06F19/00; G16B40/10; H01J49/04; G01N; (IPC1-7): G01N/
Foreign References:
US20030017515A12003-01-23
US6009199A1999-12-28
Attorney, Agent or Firm:
Flaxman, Howard N. (2000 Duke Street Suite 10, Alexandrea VA, US)
Download PDF:
Claims:
CLAIMS
1. A method for identification of biological characteristics, comprising the following steps: collecting a data set relating to individuals having known biological characteristics; analyzing the data set to identify biomarkers potentially relating to selected biological state classes.
2. The method according to claim 1, wherein the step of collecting including creating a data set of mass spectrometry spectra.
3. The method according to claim 1, wherein the step of collecting includes preprocessing of the data set.
4. The method according to claim 3, wherein the step of preprocessing includes mass alignment, normalization, smoothing and peak identification.
5. The method according to claim 3, wherein the step of preprocessing includes mass alignment.
6. The method according to claim 3, wherein the step of preprocessing includes normalization.
7. The method according to claim 3, wherein the step of preprocessing includes smoothing.
8. The method according to claim 3, wherein the step of preprocessing includes peak identification.
9. The method according to claim 1, wherein the known biological characteristic is ovarian cancer.
10. The method according to claim 1, wherein the step of analyzing is performed through application of a Random Forest algorithm.
11. The method according to claim 10, wherein the step of analyzing further includes defining sensitivity and defining specificity.
12. The method according to claim 10, wherein the selected biological state classes are no cancer and cancer.
13. The method according to claim 12, wherein the biological state class for cancer relates to ovarian cancer.
14. A system for identification of biological characteristics, comprising: means for collecting a data set relating to individuals having known biological characteristics ; means for analyzing the data set to identify biomarkers potentially relating to selected biological state classes.
15. The system according to claim 14, wherein the means for collecting includes means for creating a data set of mass spectrometry spectra.
16. The system according to claim 15, wherein the means for collecting includes means for preprocessing of the data set.
17. The system according to claim 16, wherein the means for preprocessing includes means for mass alignment, normalization, smoothing and peak identification.
18. The system according to claim 16, wherein the means for preprocessing includes means for mass alignment.
19. The system according to claim 16, wherein the means for preprocessing includes means for normalization.
20. The system according to claim 16, wherein the means for preprocessing includes means for smoothing.
21. The system according to claim 16, wherein the means for preprocessing includes means for peak identification.
22. The system according to claim 16, wherein the known biological characteristic is ovarian cancer.
23. The system according to claim 16, wherein the means for analyzing is performed through application of a Random Forest algorithm.
24. The system according to claim 23, wherein the means for analyzing further includes means for defining sensitivity and defining specificity.
25. The system according to claim 23, wherein the means for classifying further includes means for defining sensitivity.
26. The system according to claim 23, wherein the means for classifying further includes means for defining specificity.
27. The system according to claim 23, wherein the selected biological state classes are no cancer and cancer.
28. The system according to claim 27, wherein the biological state class for cancer relates to ovarian cancer.
Description:
TITLE : CLASSIFICATION OF DISEASE STATES USING MASS SPECTROMETRY DATA BACKGROUND OF THE INVENTION 1. Field of the Invention The invention relates to a comprehensive statistical, computational, and visualization approach to identifying the naturally occurring forms of peptide and protein disease biomarkers from raw data collected from mass spectrometric (MS) instruments. More particularly, the invention employs background subtraction, spectrum alignment (registration), peak identification, normalization, and outlier detection. The disease biomarker identification uses a customized Random Forest algorithm to search for features that show distinct patterns among different classes of samples.

2. Description of the Prior Art DNA microarray analysis offers a breakthrough and massively parallel approach to genome-wide expression analysis that, for many purposes, is unfortunately directed at the wrong biological molecule. Differential rates of translation of mRNAs into protein and differential rates of protein degradation in U'W are two factors that confound the extrapolation of mRNA to protein expression profiles. For instance, Gygi et al. estimate the correlation between protein and mRNA abundance for yeast is only 0.4. Gygi, S. P., Rochon, Y, Franza, B. P. , and Aebersold, R, Correlation between protein and mRNA abundance in pst, Mol. Cell. Biol. 19,1720-1730 (1999). They found yeast genes with similar mRNA levels that had protein levels that differed by 20-fold. Conversely, they found invariant, steady-state levels of proteins which had mRNA levels that varied by 30-fold, similar to the >10-fold range observed byFutcher et al. Futcher, B. , Latte, G. I., Monardo, P., McLaughlin, GS., and Garrels, J. I., A san7p theyeast panP, Mol. Cell. Biol. 19,7357- 7368 (1999). Additionally, microarray analysis is unable to detect, identify or quantify post- translational protein modifications which often play a key role in modulating protein function. Protein expression analysis offers a potentially large advantage in that it measures the level of the biological effector protein molecule, not just that of its message.

Proteomics is an integral part of the process of understanding biological systems, pursuing drug discovery, and uncovering disease mechanisms. The identification of protein biomarkers correlating with specific diseases will permit earlier detection of diseases, allow more accurate classification of diseases based upon protein expression rather than just clinical and histological data, provide more effective means for following the course of disease and facilitate the identification of proteins involved in the disease process for improving the understanding of diseases and leading to new and more effective treatments.

Because of their importance and the very high level of variability and complexity, the analysis of protein expression is as potentially exciting as it is a challenging task in life science research. Proteomics. Saw294, 5549,2074-2085 (2001). Comparative profiling of protein extracts from normal versus experimental cells and tissues enables us to potentially discover novel proteins that play important roles in disease pathology, response to stimuli, and developmental regulation. However, to conduct massively parallel analysis of thousands of proteins, over a large number of samples, in a reproducible manner so that logical decisions can be made based on qualitative and quantitative differences in protein content is an extremely challenging endeavor.

The prior art does not make it currentlypossible to carryout a massively parallel, quantitative analysis of the level of expression of tens of thousands of proteins, over a large number of samples, in a reproducible manner that approaches that of DNA microarray technology for mRNA expression. Two approaches that have been used to quantitatively and simultaneously profile approximately 500-1, 000 proteins are isotope coded affinity tags (ICAT) coupled with liquid chromatography/mass spectrometry (LC/MS) and 2D differential (fluorescence) gel electrophoresis (DIGE). Han, D. K. , Eng, J. M. , Zhou, H, and Aebersold, R., Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry, Nature Biotechnology 19,946-951 (2001); Zhou, G. , Li, H., DeCamp, D. , Chen, S. , Shu, H, Gong, Y. , Flaig, M. , Gillespie, JW. , Hu N., Taylor, PR, Emmert-Buck, MR., Liotta, LA. , Petricoin, EF. , Zhao, Y. , 2D diffetiwl in-gd elethoresis for the identification of esophageal scans cell cancer-specific protein mankers, Molecular & Cellular Proteomics, 1 (2), 117-24 (2002). The ICAT study by Han et al compared protein expression in microsomal fractions of control versus in zitm differentiated human myeloid leukemia cells. In this study, the tryptic digest of the microsomal protein extract was separated into 30 fractions via cation exchange HPLC Each of these 30 fractions was then subjected to avidin affinitychromatographyfollowed byLC/MS/MS. During this study25, 892 individual MS/MS spectra were analyzed and subjected to database searching. More than 5,000 cysteine-containing peptides were identified with this massive effort which resulted in quantifying the relative level of expression of 491 proteins (which were also identified) in only one control versus experimental sample. In comparison, in the DIGE study of Zhou et al. , a single 2D gel containing a protein extract from laser capture microdissected esophageal cancer cells that was labeled with Cy5 and a similar extract from normal cells that was labeled with Cy3 resulted in quantifying the relative (spot volume) intensities of 1,264 fluorescent spots.

Both the ICAT/LC-MS and DIGE approaches to protein profiling share the commonality of trying to quantify the relative level of expression of as many proteins as possible to uncover the (perhaps) 5%, or so, of proteins which are the most substantially up or down-regulated. With this in mind, and as will be discussed below in the Description of the Preferred Embodiment, the peptide disease biomarker approach employed in accordance with the present invention provides a novel approach in that from the beginning it is directed at finding the peptides that are of the most interest; that is, the 5-40 or so peptides whose intensities can best differentiate all control from experimental spectra. And, in most instances, it is not necessary that the peptide biomarker peaks be completely resolved as it is possible to search at the level of individual mHz (mass charge ratio) versus intensity data points. In effect, peptide disease biomarker discovery in accordance with the present invention provides a"short-cut"approach to protein profiling that enables large numbers of raw and extremely complex spectra to be effectively analyzed, thus obviating challenges resulting from biological diversity within the control and experimental samples.

The relative simplicity of the peptide disease biomarker approach, the potential importance of the resulting biomarkers, and the availability of a commercial laser desorption ionization time-of-flight MS platform that provides a"single step"approach for desalting and spotting biological samples accounts for the rapidly increasing number of researchers using this technology. Surface enhanced laser desorption ionization time-of-flight mass spectrometry (SELDI-TOF-MS) involves the use of a 10 mm x 80 mm chip having eight or sixteen 2 mm spots comprised of specific chromatographic surfaces (e. g. , anionic, cationic, hydrophobic, hydrophilic, metal, etc). Issaq, IIJ., Veenstra, T. D. , Conrads, T. P. , Felschow, D. Breakthroughs and View ; The SELDI-TOFMS Approxh toPrms : PrateinPrilingand Biomanker Identification, Biochemical and Biophysical, Research Communications 292,587- 592 (2002). After spotting a few microliters of serum or other biological sample onto the chip surface, desalting is accomplished via washing with water prior to adding and then drying onto the target a solution of an energy absorbing reagent like «-cyano-4-hydroxr cinnamic acid (that is, the"matrix"in conventional matrix assisted laser desorption ionization mass spectrometry (MALDI-MS)).

One of the reports that has helped spur more widespread interest in SELDI based detection of peptide/protein disease biomarkers is the ovarian cancer study of Petricoin et al. In this study, SELDI-MS analysis of sera from 50 control and 50 case samples from patients with ovarian cancer resulted in identifying 5 peptide biomarkers that ranged in size <BR> <BR> <BR> from 534 to 2,465 Da. Petricoin, E. F. , Ardekani, A. M. , Hitt, B. A, Levine, P. J., Fusaro, V. A., Steinberg, S. M., Mills, G. B., Simine, C, Fishman, D. A., Kohn, E. C, and Liotta, L. A., Use of proteomic patterns in serum to identify ovrian cancer, The Lancet 359,572-77 (2002); U. S. Patent Application Publication No. 2003/0004402 to Hitt et al. The pattern formed bythese markers was then used to correctly classify all 50 ovarian cancer samples in a masked set of serum samples from 116 patients who included 50 patients with ovarian cancer and 66 unaffected women or those with non-malignant disorders. Of the latter samples, 63 were correctly recognized as not being from cancer patients thus providing 100% sensitivity (50/50) for detecting cancer, 95% specificity (63/66) for detecting controls, and a positive predictive value of 94% (50/53). That is, if the 5 peptide"ovarian cancer"biomarker pattern was identified in the sample, there was a 94% probability that the patient indeed has ovarian cancer.

Similar promising results have been reported recently in two other reasonably large scale studies of serum samples from breast and prostrate cancer patients. In the case of breast cancer, Li et al. identified three biomarkers (mHz = 4,300, 8,100 and 8,900), which together demonstrated a sensitivity of 93% for 103 breast cancer patients and a specificity of 91% for 66 controls that included 41 healthywomen and 25 patients with benign breast diseases. Li, J., Zhang, Z., Rosenzweig, J., Wang, Y.Y., Chan, D.W., Proteomics and Bioinformatics Approaohes for Identification of Serum Biomarkers to Detect Breast Cancer, Clinical Chemistry 48 : 8,1296-1304 (2002). In the case of prostrate cancer, Adam et al identified nine nAz between 4,475 and 9,656 Da that demonstrated a sensitivity of 83%, a specificity of 97% and a positive predictive value of 96% based on the analysis of serum samples from 167 patients with prostrate cancer and 159 patients who were either healthy or had benign prostrate hyperplasia. Adam, B. L. , Vlahou, A., Semmes, J. O., Wright, Jr. G. L., P7ofeonic approaches to biomarker discovery in prostate and bladder cancer, Proteomics 1,1264-1270 (2001).

Finally, Vlahou et al. used a similar SELDI-MS approach to identify two biomarkers (mHz = 3,300/3, 400 and 9,500) and a protein"cluster" (which had Hz ranging from 85,000 to 92, 000) in urine which together provided a sensitivity of 87% for detecting transitional cell carcinoma of the bladder. In this latter study, a total of 94 urine samples were analyzed and the corresponding specificity was 66% and the positive predictive value was 54%. Vlahou, A. , Schellhammer, P. F. , Mendrinos, S. , Patel, K., Kondylis, F. I. , Gong, L. , Nasim, S., Wright, Jr. G. L., Deuioprlra of a Novel Proteomic Approach for the Detection of Transitional Cell CamoftheBladdErinUrzne, AmericanJournalPathologyl58 : 4,1491-1502 (2001). Taken together, these studies certainly seem sufficiently promising to warrant larger scale studies and extension of similar approaches to the study of other cancers and disease states.

Despite some of the results discussed above, traditional statistical methods for classification are not optimal or even appropriate for biomarker identification using mass spectrometrydata. As the data is very high dimensional, dimension reduction is necessary before using these methods for biomarker identification. Principal component analysis (PCA) is a common method for dimension reduction. PCA is based on SVD (singular value decomposition), and has been applied in microarraydata analysis. However, the interpretation of PCA is not straightforward. In the microarray data analysis context, Alter et al. use'Eigengenes'to interpret the results of SVD analysis, however, this is not intuitive.

Alter, O., Brown, P. O., and Botstein, D. Singular value decorrpositing for genome wide expression data processit and nzxlding, PNAS 97,18 (2000), 10101-10106. Some traditional discriminant analysis techniques, e. g. LDA (linear discriminant analysis) and QDA (quadratic discriminant analysis), are model-dependent. Fisher RA. (1936). The use of multiple measurements in taxonomic problems. Annal of Eugenics, 7: 179-188. They make strong assumptions about the underlying data distribution, which may rarely hold for complex data. As a result, they can be biased for large complex datasets. On the other hand, model independent methods, e. g.

CART (classification and regression trees), maybe highly variable due to the high <BR> <BR> <BR> dimensionality of the mass spectrometry data. Breiman L. , Friedman, J. H, Olshen, R A. and Stone, CJ. Classification and Regression Trees (1983).

As the previous discussion shows, mass spectrometry (MS) is increasingly being used for rapid identification and characterization of protein populations. There have been tremendous research efforts recently trying to utilize mass spectrometry technology to build molecular diagnosis and prognosis tools for cancers. Petricoin et al.; Adam et al.; Li et al.

Most of the papers have claimed 290% sensitivityand specificityusing a subset of selected biomarkers; some of them even report achieving perfect classification. Zhu, W., Wang, X., Ma, Y. , Rao, M., Glimm, J. , and Kovach, J. S., Detczon of cancer specific mankers amid massive nuss spatral data, PNAS 100, 25,14666-14671 (2003). But upon our closer inspection of these studies, many of the identified biomarkers actually appear to arise from background noise, which suggests some systematic bias from non-biological variation in the dataset.

Additionally, all these studies reflect the neglected importance of data preprocessing and of appropriately Interpreting large mass spectrometrydatasets. Anothercommonlyneglected fact is the correct way of using cross-validation.

As discussed in Ambroise et al., it is important to do an external cross-validation, whereby at each stage of the validation process one must not use any information from the testing set to build the classifier from the training set. Ambroise, C and McLachlan, G. J., Selection bias in gene extraction on the basis of microarray gene-expression data, PNAS 99,10 (2002), 6562-6566. Internal cross-validation is used in most current disease biomarker mass spectrometry studies, whereby the selection of biomarkers has utilized information from all the samples, which will significantly (e. g. , see below) under-estimate classification error.

We previously studied the relative performance of popular classification methods in the context of a mass spectrometry ovarian cancer dataset and published our results. Wu, B. , Abbott, T. , Fishman, D. , McMurray, W., Mor, G. , Stone, K., Ward, D. , Williams, K. , and Zhao, H, Companison of statistical methods for dasification of ovrian cancer using mass spectrometry data, Bioinformatics 19,13, 1636-1643 (2003a).

Our re-examination of data used in the Petricoin et al. study illustrates the importance of visualization tools and some of the unique challenges of analyzing mass spectrometrydata sets. Petricoin et al. employed Genetic Algorithms and Self-Organizing Maps to analyze SELDI spectra obtained on serum to identifypeptide biomarkers to distinguish ovarian cancer patients from normal individuals. David E. Goldberg, Gec Algorithms in Search, Optimization, and Machine Learring, Addison-Wesley Pub Co. (1989); Teuvo Kohonen, T. S. Huang, M. R Schroeder, SefOrgarazingMaps, Springer-Verlag (2000). However, visualization of the Wz regions around each of the 5 ovarian cancer biomarkers identified in their study suggests that many of their biomarkers may derive from variations In background noise (see Figure 2) rather than from peptide ionization. With so many (typically >90, 000 in the present study using only reflectron acquired data) data points being analyzed in each spectrum there is a reasonable probability that at least a few of theses points will (by chance alone) be able to"differentiate"cases from controls in the training sets. Obviously, however, the latter will have little subsequent value. Figure 3, which shows the 800-3500 nzlz region for two representative normal and ovarian cancer serum spectra, demonstrates the comparatively low signal/noise ratio of data in this region that was obtained by the instrumentation used byPetricoin et al. As was shown in Fig. 1, a much higher signal/noise ratio can be obtained over this region from desalted serum that is analyzed on a conventional Micromass MALDI-MS instrument equipped with a reflectron analyzer. Obviously, in this instance, the ability to easily visualize the Hz regions around biomarkers that have been selected by sophisticated statistical approaches adds substantial value to the overall analysis. In the following section, we describe robust statistical methods that address the issues discussed above, and then apply these methods to analyze on a conventional MALDI mass spectrometer an ovarian cancer data set similar to that analyzed by Petricoin et al.

More particularly, in the Petricoin et al study, SELDI-MS analysis of serum from 50 control and 50 case samples from patients with ovarian cancer resulted in identifying 5 peptide biomarkers that ranged in size from 534 to 2,465 Da. The pattern formed by these biomarkers was then used to correctly classify all 50 ovarian cancer samples in a masked set of serum samples from 116 patients who included 50 ovarian cancer patients and 66 unaffected women or those with non-malignant disorders. Of the latter samples, 63 were correctly recognized as not being from cancer patients-thus providing 100% sensitivity (50/50) for detecting cancer, 95% specificity (63/66) for detecting controls, and a positive predictive value of 94% (50/53) for this population. That is, if the 5 peptide"ovarian cancer"biomarker pattern was identified in the sample, there was a 94% probability that the patient indeed has ovarian cancer. Although similar promising results have been reported recentlyin other reasonably large-scale studies of serum samples from breast and prostrate cancer patients (Li, J. , Zhang, Z. , Rosenzweig, J. , Wang, Y. Y. , Chan, D. W., Pra 7acs ard Bioinformatics Approaches for Identification of Serum Biomarkers to Detect Breast Cancer, Clinical Chemistry 48 : 8,1296-1304 (2002); Bao-Ling Adam, Ynsheng Qu, John W. Davis, Michael D. Ward, Mary Ann Clements, Lisa H Cazares, O. John Semmes, Paul F. Schellhammer, Yutaka Yasui, Ziding Feng, and George L. Wright, Jr., Serum Protein Fingerprinting Coupled with a Patt stdving A li Distirishes Prostate Caerf mm Beniggi Pnustate Hpplasia artl HealthyMen, Cancer Res. 62: 3609-3614 (2002) ), we would like to raise two concerns about the Petricoin et al study. The first is an issue that was raised by Rockville and others and that is the very high positive predictive value (PPV) of 94% reported by Petricoin et al applies only to their artificial population of 116 patients, 50 of whom had ovarian cancer. When their estimates of sensitivity (100%) and specificity (95%) are applied to an average population of post-menopausal women with an incidence of ovarian cancer of 50 per 100,000, the PPV is reduced to a clinically insignificant value of only 1%. Rockhill, B, Proteomic patterns in serum and identification of ovarian cancer, The Lancet 360,169-170 (2002).

The second caution with regard to the Petricoin et al. study is that (as shown below) closer examination of the mass spectra around their"biomarkers"suggests strongly that the latter do not arise from biologically significant peptides.

The deceptively straightforward approaches now being used (often by non-mass spectroscopists) to uncover naturally occurring peptide. and protein biomarkers of disease hold enormous promise for bringing the power of mass spectrometryto bear on the challenge of protein profiling the large numbers of samples needed to obviate biological diversity. However, challenging statistical issues remain that often have not been well addressed in the existing work The present method and system provides a straightforward methodology that allows for application of peptide disease biomarker discovery on a far wider range of mass spectrometric instrumentation. The present method and system provides a refined statistical method to address a range of important issues including background subtraction, peak identification, and normalization of spectra; and then, we introduce visualization tools, and a new algorithmic approach to uncovering peptide and protein biomarkers of disease. Using previously published and newly acquired data on serum from control versus ovarian cancer patients, the present method provides practical guidelines for using this technology and suggest how it might be applied in the future to the far more daunting challenge of analyzing multiple spectra/sample and of proteome profiling. Our study supports the superior performance of the Random Forest approach.

We use Random Forest to estimate the unbiased classification error for our ovarian cancer mass spectrometry data. In the meantime we also empirically evaluate the impacts of a number of selected biomarkers and the sample size on classification error. Our analysis framework will provide a general guideline for the practice of utilizing mass spectrometry for cancer and other disease molecular diagnosis and prognosis.

As such, the present method and system provide an advanced mechanism whereby various diseases may be identified based upon the analysis of irregularities found in protein analysis. In accordance with the present invention, we provide an improved method for identifying various biomarkers, for example, those associated with ovarian cancer. In doing so, the present *invention overcomes some of the challenges of statistically analyzing MALDI-MS datasets that inherently are noisy and have a very high ratio of variables (ie, n-tlz vs. intensity data points) to samples. The present invention also demonstrates how the serum disease biomarker discovery approach can be extended to more commonly available "MALDI-MS"instrument platforms, customizes a Random Forest algorithm for identifying biomarkers, and suggests how the disease biomarker strategy might be extended to even more sophisticated mass spectrometry platforms, to the analysis of multiple spectra/sample, and to proteome-level profiling.

SUMMARY OF THE INVENTION It is, therefore, an object of the present invention to provide a method for identification of biological characteristics that is achieved by collecting a data set relating to individuals having known biological characteristics and analyzing the data set to identify biomarkers potentially relating to selected biological state classes.

It is also an object of the present invention to provide a system for identification of biological characteristics which includes means for collecting a data set relating to individuals having known biological characteristics and means for classifying the data set to identify biomarkers potentially relating to selected biological state classes.

It is another object of the present invention to provide methodology for utilizing mass spectroscopy data to identify peptide and protein biomarkers that can be used to optimally discriminate experimental from control samples-where the experimental samples may, for instance, be derived from patients with various diseases such as ovarian cancer.

Other objects and advantages of the present invention will become apparent from the following detailed description when viewed in conjunction with the accompanying drawings, which set forth certain embodiments of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS Figure 1 shows mass spectrometry spectra (obtained with a reflectron analyzer on a Micromass M@LDI-R mass spectrometer) for 4 selected samples. Sample 1 & 2 are normal subjects, sample 3 & 4 are cancer subjects. The x-axis is the mass-to-charge (m/z) measurements that range from 800 Da to 3500 Da and the y-axis is the measured raw intensities that have a wide dynamic range for different samples. Viewing these spectra (e. g., spectra 2-4) one can also see the characteristic decreasing trend in the measured intensities obtained with a reflectron analyzer as the rtilz ratio increases.

Figure 2 shows regions around 5 identified biomarkers from the Petricoin et al. study. There are a total of 50 case samples and 50 control samples. Instead of overlaying 100 samples in each plot, we plotted several quantiles for the case/control group. In the plot, qO. 25 is the 25th percentile, and q0. 75 is the 75th percentile. We plotted-50 measurements around each biomarker. One can clearly see that at least 3 of these 5 biomarkers are very likely to arise from background noise as there do not appear to be any discemable peptide peaks at positions corresponding to the 534,989 and 2464 biomarkers.

In addition, Petricoin et al. attempt to identify biomarkers within the range of Hz <650 Da where those skilled in the art will appreciate that results are highly unreliable due to overwhelming noise within this range. The latter results from the chemical matrix that must be added to the samples to induce peptide and protein ionization.

Figure 2.1 illustrate SELDI mass spectrometry spectra for 4 selected samples from Petricoin et al. within the range extending from 800 Da to 3500 Da. Samples 1 & 2 are normal subjects and samples 3 & 4 are cancer subjects. The y axis is the normalized intensity using the method described in Petricoin et al. Compared to Figure 1 from the Micromass M@LDI-R instrument, these SELDI-MS spectra have considerably less resolution.

Figure 3 shows the estimated background for 4 previously selected samples. Due to the wide dynamic range of the intensity measurements, we take the logarithm of the intensities to reduce the numerical variation. After taking the log we estimate the background for each sample and subtract these background intensities. In terms of the raw intensities, we are actually dividing each sample by our estimated background. In this log scale plot, the decreasing trend of intensitywith increasing Hz is more obvious.

Figure 4 shows the reproducibility of spectra obtained from individual MALDI-MS laser shots. This plot compares the coefficient of variation for 130 selected peaks from the serum of one subject across 40 individual laser shots before/after taking the log transformation. We can clearly see that taking the log has substantially reduced the noise level.

Figures 5.1, 5.2 and 5.3 plot the mean intensities of manually processed samples vs. the mean intensities of robotically processed samples.

Figure 6 shows case/control median plots for 175 samples without any preprocessing. The first two panels are the median intensities across all cases/controls. The third panel shows the difference of case/control medians.

Figure 7 shows case/control median plots for 175 samples after all preprocessing.

The first two panels show the median intensities across all cases/controls. The third panel shows the difference of case/control medians.

Figure 8 shows the distribution of peaks for all samples at each point.

Figure 9 shows the ranking measures of selected peaks.

Figure 10 shows five-fold cross-validation estimation of Err (N, M) for the ovarian cancer data. The left panel is based on reflectron analyzer data only while the right panel is based on the reflectron + linear analyzer data-where the latter two spectra have been joined together.

Figure 11 shows classification error extrapolation for reflectron+linear analyzer data.

Figures 12 to 15 show local exploration of identified biomarkers.

Figure 16 is a schematic of the system employed in accordance with the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENT The detailed embodiment of the present invention is disclosed herein. It should be understood, however, that the disclosed embodiment is merely exemplary of the invention, which may be embodied in various forms. Therefore, the details disclosed herein are not to be interpreted as limited, but merely as the basis for the claims and as a basis for teaching one skilled in the art how to make and/or use the invention.

The present invention provides a method and system for the identification of biological characteristics. Briefly, the method is achieved by collecting data sets relating to individuals having known biological characteristics and analyzing the data sets to identify biomarkers potentially relating to selected biological state classes. Collection of the data set is achieved by the creation (or collection of previously created) of mass spectrometry spectra having perceived particular relevance. Thereafter, the data set is preprocessed through mass alignment, normalization, smoothing and peak identification. The step of classifying is preferably performed through application of a Random Forest algorithm that allows for optimization of the classifiers sensitivity and specificity.

With reference to Figure 16, the identification system 10 employed in accordance with the present invention maybe highly automated and generally Includes a mechanism for collecting data sets 12 relating to individuals having known biological characteristics, for example, ovarian cancer, and an analyzing (or classifying) assembly 14 for analyzing data sets to identify biomarkers potentially relating to selected biological state classes. As will be discussed below in greater detail, a variety of automated systems known to those skilled in the art may be employed in the practice of the present invention.

The mechanism for collecting 12 includes means for creating a data set of mass spectrometry spectra 16 and means for preprocessing of the data set 18. Preprocessing includes mass alignment, normalization, smoothing and peak identification.

In accordance with a preferred embodiment, the analyzing assembly 14 includes means for classifying through application of a Random Forest algorithm 20. The analyzing assembly also includes means for defining sensitivity and defining specificity.

More particularly, the present invention provides a comprehensive statistical, computational, and visualization approach to identifying the nAz values for naturally occurring forms of peptide and protein disease biomarkers from raw data collected from mass spectrometric instruments. Although the methodology has been developed based on MALDI-MS spectra, a similar methodology could also be used to analyze electrospray ionization (ESI) mass spectra. The latter might be produced by nanospray or liquid chromatography/MS approaches. Similarly, the methodologythat is described would also be suitable for analyzing spectra obtained from state-of-the-art instrumentation such as MALDI and/or ESI equipped Fourier Transform Ion Cyclotron Resonance (FTICR) mass spectrometers.

Mass spectrometric measurements are carried out in the gas phase on ionized samples. There are three basic components in all mass spectrometers. First an ion source ionizes the molecule of interest, e. g. peptides/proteins, then a mass analyzer differentiates the ions according to their mass-to-charge ratio and finally, a detector measures the abundance of ions. Sample ionization is the process of placing charges on neutral molecules. Among ionization methods, electrosprayionization (ESI) and MALDI are the two most commonly used techniques to volatize and ionize the proteins or peptides. ESI ionizes the samples out of a solution and MALDI sublimates and ionizes the samples out of a dry, crystalline matrix via laser pulses.

A mass analyzer is used to separate ions within a selected range of mass-to-charge ratios.

Ions are typically separated by magnetic fields, electric fields, or by the time it takes an ion to travel a fixed distance. There are four basic types of mass analyzer currentlyused in proteomics research: ion trap, time-of-flight (TOF), quadrupole, and Fourier transform ion cyclotron (FT- MS) analyzers. Among them, the TOF mass analyzer is one of the simples and is commonly used with MALDI. It is based on accelerating a set of ions to a detector with each ion having the same amount of energy. Because the ions have the same energy, yet different masses, they reach the detector at different times. Smaller ions reach the detector first because of their greater velocityand larger ions take longer time, thus the analyzer is called TOF and the mass is determined by the time required for each ion to travel from the source to the detector.

The ion detector allows a mass spectrometer to generate a signal current from incident ions by generating secondary electrons, which are further amplified. Alternatively, some detectors operate byinducing a current generated bya moving charge. Electron multipliers and scintillation counters are the most commonly used and they convert the kinetic energy of incident ions into a cascade of secondary electrons.

The relationship that allows the mass/charge (nzlz) ratio to be determined for an individual ion is : (1. 1) E ='h (mlz) v2 In this equation, E is the energy imparted to the charged ions as a result of the voltage that is applied by the instrument and v is the velocity of the ions down the flight path. Because all of the ions are exposed to the same electric field, all similarly charged ions will have similar energies. Therefore, based on the above equation, ions that have larger mass must have lower velocities and hence will require longer times to reach the detector, thus forming the basis for m/z determination by a mass spectrometer equipped with a TOF detector. A mass spectrum is created by recording electrical currents produced by different ions reaching the detector with different traveling times. The resulting data format is very simple: paired mass-to-charge ratio (m/z) versus intensities.

The present method and system employ many novel steps in data preprocessing and disease biomarker identification. As briefly mentioned above, data preprocessing includes background subtraction, spectrum alignment (registration), peak identification, normalization, and outlier detection. Disease biomarker identification in accordance with the present invention uses a customized Random Forest algorithm as disclosed by L.

Breiman. Breiman L. , RaN7nFo7est, Technical Report, Statistics Dept. UCB (2001). The algorithm is specially designed for the purpose of parallel computing, e. g. , on a 128 node IBM Beowulf cluster. The latter feature is critical for expansion of the dynamic range of the analyses by obtaining and analyzing multiple spectra/sample. The latter might be produced by LC/MS that is carried out either"off-line"or via a liquid chromatograph that is directly coupled to an ESI source of a mass spectrometer. Although a preferred embodiment is disclosed in accordance with the present disclosure, other algorithms are contemplated for searching for features showing distinct patterns among different classes (that is, those samples exhibiting specific biological characteristics) of samples. The present method is built on sound statistical principles and integrates efficient and powerful statistical tools to allow researchers to fully utilize information in the data sets for biomarker identification purposes.

In accordance with a preferred embodiment of the present invention, the present method and system is employed in the identification of peptide/protein disease biomarkers in sera from mass spectrometry data. The mass spectrometry data is preferably obtained from a mass spectrometer equipped with a matrix assisted laser desorption ionization (MALDI) source and time-of-flight linear and/or reflectron analyzer.

However, those skilled in the art will appreciate the underlying concepts are not limited to this specific application area. For example, the present method and system may be used to analyze multiple spectra per sample obtained from other types of mass spectrometers (for example, mass spectrometers equipped with liquid chromatographs and electrosprayion sources), to carry out comparative proteome profiling (for example, following tryptic digestion of serum), to analyze all other types of biological samples (for example, tissue and cell extracts), and to analyze data from other types of biomolecule profiling (for example, mass spectrometry-based lipid profiling data). In addition, the preprocessing procedures that have been developed can be applied to other types of experiments where curved data are generated, for example, time-course experiments in microarray studies. As such, it is contemplated that the biomarker identification algorithm of the present invention can be applied to extract useful features from virtually any type of data sets which have a large number of features. In addition, the integrated system can be easily modified for other biomedical applications.

The present method and system has been shown to outperform other existing methods. The present method and system employs a customized Random Forest algorithm having many unique features ideally suited to data sets generated from a wide range of genomic and proteomic studies, which usually have a very large number of features (attributes) but a relatively small number of samples. The underlying computer code employed in accordance with the present invention has been optimized for use on a parallel, cluster computer which will be essential as this biomarker discovery approach is applied to the analysis of multiple spectra/sample following LC fractionation. In this regard, the Random Forest approach has been found to be ideally suited for use on cluster computers which will provide the compute power needed to analyze tens of individual spectra from hundreds of samples in a reasonable time frame.

The present method and system also provides a simple methodology that allows application of proteome analysis to be used on a far wider range of mass spectrometric instrumentation than just a SELDI mass spectrometer. The present method and system refines statistical methods to address a range of important issues including background subtraction, peak identification, and normalization of spectra. The present method and system also introduces visualization tools and a new algorithmic approach to uncovering peptide and protein biomarkers of disease. Using previously published and newly acquired data on sera from control versus ovarian cancer patients, the present disclosure provides practical guidelines for using the underlying concepts of the present invention and suggests how they might be applied in the future to the far more daunting challenge of proteome profiling.

The experimental procedures employed in accordance with the present invention are outlined below. With regard to the collection of mass spectrometry data, and in accordance with a preferred embodiment of the present invention, it is collected in the following manner : Automated C-18 ZIPTIP desalting and spotting onto MALDI-MS target plates of serum and other * uiTs on a PA CKARD MA SSPREP sale handler After aliquoting 10 1 of each sample into a 96 well plate, each is acidified by the addition of 5p1 0. 1% TFA. The robot then picks up the first set of 4 C-18 ZIPTIPS (Waters Corporation), which are laboratory pipette tips, and washes them with 50% acetonitrile, 0.1% TFA (trifluoroacetic acid); followed by 0. 1% TFA. After repeatedly (8x) pulling each sample up into a C18 ZIPTIP and expelling it back into the original sample well, the C18 ZIPTIP is washed 5x with 20 nul 0. 1% TFA. Bound peptides/proteins are eluted from the C18 ZIPTIP with 10 1ll of 50% acetonitrile, 0.1% formic acid into a new 96 well plate. A 2 pl aliquot of each sample eluent is removed, mixed with 0. 5 ul alpha-cyano- 4-hydroxycinnainic acid matrix in 50% acetonitrile, 0.05% TFA containing an internal standard of 25 fmol bradykinin (M+H C12 mono-isotopic mass: 1060.569), and then subjected to automated MALDI-MS on a Micromass M@LDI-R or M@ALDI-L/R mass spectrometer.

Automated MALDI-MS data acquisition The M@LDI-L/R mass spectrometer automatically acquires data in positive ion detection over a mass range currently set at 800-3,500 Da using its reflectron analyzer and 3,450 to 28,000 Da using its linear analyzer. Although the mass range is adjustable, it is difficult to acquire meaningful data below about 800 Da due to interference from the matrix and with a reflectron analyzer, the ionization response drops off substantially as the mass range is increased above about 3,500. Hence, by also analyzing the sample in linear mode, the mass range may be extended to 28,000 Da (with alpha-cyano-4-hydroxy cinnamic acid matrix). Following acquisition of the reflectron and linear spectra they are joined together to form a continuous spectrum spanning from 800 to 28,000 Da. The mass of 28,000 Da is the upper mass limit for the alpha-cyano-4-hydroxy cinnamic acid matrix. This mass range could be extended up to >100,000 Da if the sample was re-spotted using a matrix suitable for large MW proteins, such as sinapinic acid.

Currently, the M@LDI-L/R sums 10 individual laser shots into one spectra with the laser operating at 10 Hz. The laser moves in a random walk around the target well, acquiring data from a maximum of 20 different locations within each 2 mm diameter well.

A spectra is considered"acceptable"if it has a signal that is >2% above background noise, less than 95% of saturation, and in the case of the reflectron spectrum, if there is at least one Hz detected between 1,125 Da and 3,500 Da. The M@LDI-L/R is programmed to retain up to 40 acceptable spectra, but if it sequentially acquires 4 unacceptable spectra, it will move to another location within the same target well. The instrument uses an incrementally increasing laser percentage to heat up the target spot to acquire acceptable spectra, while still having the lowest possible laser energy, which provides the best possible mass resolution. If the M) LDI-L/R acquires 20 acceptable spectra at one position, it will then move to another position in the same sample well, and will acquire another 20 acceptable spectra, unless interrupted by 4 unacceptable spectra. Once the M@LDI-L/R has shot (not acquired) 40 acceptable spectra, it will move to the next sample well. This means there can be a maximum of 40 acceptable spectra acquired for each sample, and that if at no point it acquires acceptable data, it will try up to 10 different locations within the same sample target well before moving on to the next sample. Typically, the resulting spectrum represents the average of 20-40 spectra. The expected mass resolution is 14,000 at M+H 2,465 and mass accuracy is better than 70 ppm. Each (averaged reflectron and linear) MALDI-MS spectrum is converted to a text file listing of 91,400 m/z versus intensity data points spanning the rnlz range from 800-3500 Da and nearly40, 000 data points spanning from 3500 Da to 28,000 Da which is then suitable for further analysis.

Additional information on both automated desalting of serum samples and MALDI- MS data acquisition can be found in Appendix A, which is attached hereto The data that results from MALDI-MS analysis has a very simple format consisting entirely of paired intensityversus mass/charge data points. Because MALDI-MS of peptides primarily produces singly charged species, the mass/charge ratio is usually equal to the mass. Figure 1 shows raw MALDI-MS spectra acquired as described above on four serum samples from ovarian cancer patients in the National Ovarian Cancer Early Detection Program clinic at Northwestern University. Perhaps the most apparent feature of these spectra is their diversity both with respect to the peptides that are present in each and their relative MALDI-MS response, which is indicated also by the variations in the intensity scales on the y-axis. This high level of diversity suggests that reasonably large numbers of samples will need to be analyzed to find commonalities that might be used to differentiate serum from ovarian cancer versus normal patients and that individual biomarkers are likely to have modest predictive value.

A less apparent challenge presented by the data in Figure 1 is that each reflectron spectrum is composed of 91,400 individual data points. This means that if the entire spectrum is used in the search for biomarkers, there will be a very large ratio of data points/samples. This presents unique challenges as will be described in more detail below.

Statistical issues in the analysis of mass spectrometry data can be broadly classified into three categories: preprocessing, peak identification, and biomarker identification. Data visualization is an important element in biomarker identification. Data preprocessing includes mass alignment, normalization, background subtraction, smoothing and peak identification. Appropriate normalization methods are needed to ensure that all samples contribute reasonably equally to the analysis.

Background subtraction removes noise, which actually accounts for most data points.

Moreover, the observed mass spectrometry intensity has a wide dynamic range (0 to 20,000 in the case of reflectron spectra). This further challenges statistical analysis of mass spectrometry data. Peak identification is important so that biomarker identification is focused on those regions of the spectra that result from ionization of peptides as opposed, for instance, to differences in baselines. Since each peptide that ionizes produces several data points/peak and with a reflectron analyzer, multiple isotope peaks, it is important that only one (that is, the best in terms of discriminating control from experimental samples) n ; lz versus intensity data point be chosen for each peptide biomarker.

Statistical approaches designed to analyze data sets that contain a much smaller number of features compared to the 91,400 m/z versus intensity data points that compose each of the spectra in Figure 1, cannot be applied to mass spectrometry-based biomarker discovery due to challenges that arise from the large data point/sample number ratio.

Instead, the present method and system employ techniques that are not compromised by this feature which is inherent to mass spectrometrydata sets. Although statistical methods are essential for preprocessing mass spectrometry spectra and for identifying biomarkers that can best discriminate large numbers of control from experimental samples, it is equally important that visualization tools be developed that can effectively identify possible anomalies in the data set and provide a final confirmation that the selected biomarkers appear to be reasonable and to derive from peptide ionization.

As discussed above, preprocessing of mass spectrometry data aids in the effectiveness of the present invention. In accordance with a preferred embodiment of the present invention, prior to identifying peaks and initiating the search for potential biomarkers, each raw MS data set is subjected to four sequential procedures (mass alignment, logarithmic transformation, background subtraction, and normalization) that are designed to optimize it for biomarkers based on a customized Random Forest algorithm as will be summarized below in detail.

Mass alignment. In an ideal experiment, all ions will have the same kinetic energyE and will travel through the exact same drift region length. However, some initial kinetic energy distribution will be present in the ion population and there will be slight spatial variations in the travel length from the target plate which will produce corresponding variations in the traveling time and thus the measured Hz ratio for ions with exactly the same mass. This problem is partially solved by using time delayed ion extraction (Randy M. Whittal and Liang Li, High- Resolution Matrix-Assisted Laser Desorption/Ionization in a Linear Time-of-Flight Mass Spectrometer, Anal. Chem 67,1950-54 (1995); Robert S. Brown and John J. Lennon, Mass Resolsion Improvement by Incorporation of Pulsed Ion Extraction in a Matrix-Assisted Laser Desorption/Ionization Linear TineqrFliitMass Spfser, Anal. Chem. 67,1998-2003 (1995) ) in MALDI-TOF, but <BR> <BR> <BR> as a side effect it also changes the linear relationship between yz and t2 (i. e. , v2= D2/t2where D is the distance traveled) in equation (1.1). A first order approximation can be used: (1.2) mlz=a+bt2 where a and b are constants for a given set of instrument conditions and are determined experimentally from flight times of ions of at least two known masses (calibrants). In practice, higher order approximations have been proposed to achieve higher accuracy. Johan Gobom, Martin Mueller, Volker Egelhofer, Dorothea Theiss, Hans Lehrach, and Eckhard Nordhoff, A Calibration Method That Simplifies and Improves Aocurate Determination of Peptide Molecular Masses by MALDI-TOF MS, Anal. Chem. 74,3915-3923 (2202). Even with the use of internal calibration the maximum observed intensity for an internal calibrant may not occur at exactly the same corresponding Hz value in all spectra.

For this reason, spectra can be further aligned based on the maximum observed intensity of the internal calibrant, after which there are still some problems with local peak shifting.

Useful statistical methods need to be developed to address this problem.

Although spectra obtained from the M@LDI-L/R instrument used in this study were internally calibrated by adding bradykinin to all samples, slight variations (that is, within the expected mass accuracy of <70 ppm) were seen in mass values for the same relative data points in different spectra. To circumvent this challenge, data points are numbered consecutively by assigning the observed mass measurement value that is closest to the expected MH+ for the C12 isotope of bradykinin, which is 1060.569, as data point zero.

Logarithmic transformation. Measured protein/peptide concentrations in samples like human serum have a vast dynamic range (more than 1010-fold) that spans from 35-50 mg/ml for serum albumin down to at least 0-5 pg/ml for interleukin 6. Anderson, NH and Anderson, NG, The human plasma proteome, Mi & Q Paoc 1,845-867 (2002).

Although mass aligned spectra of serum and other biological samples can be directly analyzed, the relatively large variations in the measured intensities are likely to make most statistical procedures unstable, thus making it more difficult to extract information from the MS dataset. In addition, the large magnitude of the intensities will make most numerical programs unstable.

Although mass aligned spectra can be directly analyzed, the relatively large variations in the measured intensities are likely to make most statistical procedures unstable, thus making it more difficult to extract information from the mass spectrometry data set. In addition, the large magnitude of the intensities will make most numerical programs unstable.

As a straightforward approach to minimize these challenges, we take the logarithms of the intensities to reduce the variation of the raw dataset. Therefore, the numerical variations in the intensities across the spectrum and all the samples are substantially reduced.

Background subtraction. Chemical and electronic noise produce a background intensity that typically decreases with increasing m/z values and that is present regardless of whether or not a sample has been deposited onto the target. To minimize the impact of noise and the overall downward sloping baseline trend, we estimate the background intensity level by assuming that nearby mass spectrometry points share common background information. This is achieved by using the Robust locally Weighted Regression and Smoothing Scatterplots (also known as'lowess') method to estimate local background levels by performing a robust linear regression using a sliding window across each spectrum.

Cleveland, W. S. Lowess : A program for smoothing scatterplots by robust locally weighted regression; The American Statistician 35,1981, 54. Although one skilled in the art could carry out such a procedure, it must be optimized for MS data by choosing the proper size window. Other approaches such as quantile regression and wavelet transformations are also being explored for their relative usefulness in estimating background levels and removing noise from MS data. Figure 3 illustrates the result of this background estimation method using lowess for several samples.

Smoothing. High frequencynoise is one contribution to the background that is apparent in MALDI-MS spectra. Smoothing functions can also be used to reduce high- frequency noise, thus minimizing noise spikes and aiding interpretation.

Normalization. To obviate differences in the overall level of intensities that are recorded for a given sample and that might result from experimental variables such as pipetting or uneven sample deposition/matrix crystallization on the target, each spectrum is linearly normalized to try to ensure that all samples contribute as equally as possible to the search for biomarkers. Since each data point in each spectrum is normalized with the same factor, this procedure does not change the observed peak-to-peak ratios in a spectrum; that is, both the raw and normalized spectra will have exactly the same overall m/z versus intensity profile. Normalization is accomplished by assuming there are n samples : (Xl, X2, ..., Xn), each having-100, 000 intensities, and that we would like to find n normalization factors: (fl, to make (Xilfl, X21p,..., X** as comparable to each other as possible. Those skilled in the art will readily appreciate the complete normalization process.

To estimate each factor we first calculate for each data point the overall median intensity, which is noted as Xn for that m/z value across all samples. For each spectrum we then fit the ordinary least square regression of Xm-Xj without intercept, denote the regression coefficient by 9, and we usef =9 as the normalization factor for each of the data points that together make up that sample's spectrum. We exclude those samples with cj >2 or cj <1/2 for further analysis.

Although several normalization approaches are possible, one straightforward approach is to determine a linear normalization factor that will minimize the summed difference between all observed intensities in an individual spectrum and the calculated median spectra for all of the samples. However, the validity of such approaches needs to be rigorously investigated.

Once the raw mass spectrometry data is preprocessed as described above, the spectra are analyzed for peak identification. Intensity measurements from current mass spectrometry technology tend to be quite noisy with approximately 80% of the data points in spectra like those in Figure 1 deriving from both electrical and chemical noise. Therefore, noise filtering is a necessary and indispensable step to allow biomarker identification to be concentrated on those data points that derive from peptide/protein ionization and that might represent useful biomarkers. Although the following procedure has been adopted in accordance with the currently preferred embodiment of the present invention for peak identification, other methods for peak identification and alignment are contemplated for use in accordance with the spirit of the present invention. In the present embodiment, the following three criteria are used to define peaks Noise Filtering. In accordance with a preferred embodiment of the present invention, we take advantage of our finding that approximately 80% of MALDI-MS data points acquired on serum samples result from noise and set a minimum intensity level that can serve as an effective and simple global noise filter. Hence, the assumption is made that only the top 20% of the observed intensities of each linearly normalized spectrum are likely to contain useful biomarkers (that is, only the top 20% of the observed intensities are likely to result from ionization of peptides).

We note that the 20% value is only an example. In practice, this parameter can be adjusted based on the quality of the spectra. That is, this represents a global criterion that be easily adjusted for different data sets and easily confirmed as being reasonable by plotting the top 20% of intensities for some of the higher intensity spectra obtained and confirming that no significant peaks have been filtered out as noise. Alternative approaches might rely on criteria based on local measures and treating different regions of the mass range differently. High-frequency noise filtering also mayimprove upon this global criterion.

Peak Test. The assumption is made that only data points in completely or partially resolved peaks (that is, data points in partially resolved peaks may represent the intensity sum of a useful biomarker superimposed on an unrelated, non-biomarker peptide ion) result from peptide ions and are likely to be useful. To pass this test, at least 3 out of 4 successive data point intensities before or after each candidate biomarker data point must show a progressive increase or decrease in background corrected, normalized peak intensity. The basic concept is to search for local maximum and that by putting some constraints on the data it is also possible to filter out some noise spikes. Additional work is being carried out to further improve the peak detection methodology. A few plots of high and low intensity spectra that are made before and after imposition of the peak test serve as a quick visual confirmation of the suggested stringency, which can be easily altered as needed for different types of data sets. To further narrow our focus to peaks that are found in a reasonable fraction of samples, we require that at least 10% of the cases or controls need to pass the peak test for any peak to be considered a useful biomarker. While the value of 10% constraint appears to work well for the serum samples used in the present study, this parameter may need to be adjusted for different data sets (e. g. for cell extracts and for data acquired with other MS sources).

Unique Peptide Ion Test. Following peak identification, it is important that multiple biomarkers that arise from the same peptide are eliminated as there is no benefit in having multiple biomarkers that all originate from different isotopes of the same peptide ion. To accomplish this objective we require that all potential biomarkers must have rnlz values that differ from each other by at least 3. 1. This criterion will thus eliminate multiple biomarkers that all derive from the monoisotopic [C12] and the first two higher isotopic peaks (containing, for instance, one and two C13 atoms respectively) in an envelope that derives from the same peptide. Since it is quite possible (for example, if there are incompletely resolved, unrelated peptide ions that overlap with the C12 isotope peak of a biomarker peptide ion) that the"best"isotopic representative of a biomarker ion is not the C12 isotope, we would not want to limit our search to only the monoisotopic ion. Given the potential for overlapping peptide ions, we also would not want to merge the isotope peaks and represent the biomarker as the sum of the component contributions of its individual isotopes. Rather, when multiple biomarkers are found that arise from a common peptide ion, we need to define statistical criteria for selecting the best biomarker for that peptide.

Our current strategy is to rank all biomarkers that appear to derive from the same peptide based on their ability to differentiate cases from controls and to then select the best one. In accordance with a preferred embodiment of the present invention, the rank is based on F-statistics for testing differences. However, those skilled in the art will certainly appreciate the other test statistics that could also be used for this purpose without departing from the spirit of the present invention.

Once the data sets are collected and processed, biomarker identification may then take place. As discussed above, and in accordance with a preferred embodiment of the present invention, a customized Random Forest program is used as a classifier in biomarker identification. The Random Forest algorithm in accordance with the present invention is used to identify approximately 20-40 biomarkers whose intensities can best discriminate all cases from control samples in a training set. As will be best appreciated from the following disclosure, biomarker selection is ultimately optimized by increasing the training set size until the ability of the resulting biomarkers to classify one or more testing sets is maximized.

If the resulting classification error is too high, the next logical step would be to fractionate the sample (e. g. , by liquid chromatography) and utilize a similar strategy to optimize the number of fractions that should be analyzed by MALDI-MS for each sample.

This customized Random Forest program employs appealing features in that it combines bagging with random feature selection. Bagging results in pooling multiple classifiers from perturbed versions of the original dataset to increase predictive accuracy.

For our data set, the number of nzlz versus intensity variables is large compared to the number of samples, so it is not surprising that each individual variable has small predictive power. Under these conditions it is unwise to just select a single or even a few"best" variables for classification. Using the random feature selection will increase our predictive accuracy. A side product of bagging is out-of-bag prediction for each sample, which provides a very accurate estimate of the relative importance of each variable (that is, biomarker) that is similar to cross-validation. Breiman, L. Random forests. MaJ LearrE, 45, 1 (2001), 5-32.

Enhanced accuracy of the classifier may be achieved by setting minimum importance values criteria for use of each biomarker, thus ultimately improving predictive ability. In addition, a minimum confidence level for classified samples may also be set in an effort to further improve the results. Those samples not meeting the minimum confidence level could then be re-analyzed multiple times with the resulting spectra being averaged which might then allow them to meet the minimum confidence level.

In particular, and in accordance with a preferred embodiment of the present invention, a Random Forest algorithm as disclosed byBreiman is utilized. Breiman, L.

Random forests. MadinLearni45, 1 (2001), 5-32. Random forest combines two powerful ideas in machine learning techniques: bagging and random feature selection. Bagging stands for bootstrap aggregating, which uses resampling to produce pseudo-replicates to improve predictive accuracy. By using random feature selections, we can significantly improve our predictive accuracy. It works as follows: (1) Sample with replacement to form N bootstrap samples {Bi... BN}.

(2) Use each sample Bt to construct a Tree classifier Tk to predict those samples that are not in Bt (called out-of-bag samples). These predictions are called out-of-bag estimators.

(3) Before using Tk to predict out-of-bag samples, if we randomly permute the value for one variable for these out-of-bag samples, intuitively the prediction error is going to increase and the amount of increase will reflect the importance of this variable.

(4) When constructing Tk, at each node splitting we first randomly select m variables, then we choose one best split from these mvariables.

(5) Final prediction is the average of out-of-bag estimators over all Bootstrap samples.

Currently we are exploring the use of weighted sampling at each split so that more informative features maybe sampled. This approach is highly compute intensive and requires the use of parallel computing.

The present method and system provides an effective visualization method appropriate for comparing large numbers of complex mass spectrometry datasets and the regions around selected biomarkers. In accordance with the application of the present method, it is believed that a plot can reveal critical underlying features of the dataset that might otherwise be missed and a plot also can serve as a visual control for a complex statistical analysis. Obviously, if one of the best biomarkers selected by an algorithm is not "visible"on an overall median difference plot comparing all case to all control samples, then it might be appropriate to further examine whythis particular m/z versus intensity data point was selected by the algorithm as a biomarker. In the ovarian cancer biomarker analysis that follows, several types of plots will be shown that provide effective visualization of MALDI-MS datasets.

Reproduability fMALDI-MS Spe-.

There are several steps in the overall procedure outlined in accordance with the present method that would be expected to have a certain level of variability that would manifest in the resulting mass spectrometry spectra as overall differences in intensity and/or differences in relative intensities of individual peaks. These steps include the robotic liquid handling, C-18 ZIPTIP deslating, spotting onto the MALDI target, and the actual data acquisition itself. We have examined the reproducibility of the last step by analyzing individual spectra obtained from the same spotted MALDI-MS target and we have examined the robotic processing steps bycomparing summed MALDI-MS spectra acquired on aliquots of the same sample that have been individually desalted manually and/or spotted by the MassPrep robot.

As will be discussed below in greater detail, the present method and system provides enhanced reproducibility improving efficacy. In particular, the present method and system provides for reproducibility of the whole process including ZIPTIP/spotting/data acquisition, reproducibility of spotting/data acquisition and reproducibility of individual spectra acquired on a sample and that are summed together to give the output.

It is further contemplated that the present method and system may be employed with the introduction of 10% Intensity peak, expansion of the training set from 24 to 48 etc., graphs of the impact of increasing the training set size and the number of biomarkers on the success rate at classifying 2x24 testing sets. The latter is perhaps the most important element as the graph of the size of the training set as a function of the success rate at classifying two known test sets (each of which contain approximately equal numbers of control and disease samples) provides a very facile means to determine how large the training set needs to be to obtain biomarkers that can optimally classify test samples. Once the training set size has been optimized (at the lowest number of samples that provides biomarkers with the highest success rate at classifying the"unknown"test set), then the number of biomarkers included can then be similarly optimized.

To increase the probability of detecting more peptides and to improve the accuracy of the intensity measurements, Micromass'MaLDITM systems automatically acquire up to 40 individual spectra on each target with the final reported intensity being the sum of these individual spectra. Each individual spectrum in turn is the summed ion intensity detected from 10 laser shots at a given position on the target. As a result of variation in automated sample aliquoting and desalting, deposition on the target, matrix crystallization, and ion detection; the overall intensity measurements between two different aliquots of the same sample often vary by at least 4-fold. To assess the extent of this variability that may result from acquiring multiple spectra from the same target, we examined the variability among the 40 individual spectra acquired from one target that had been robotically spotted with a serum sample from a control patient. Each reflectron spectrum contains 91,268 Hz versus intensity data points that cover the range extending from 800Da-3500 Da.. Based on the minimum intensity level test (that is, noise filtering) and the peak test for the summed intensities, 130 peaks were selected for analysis. For every peak, there are 40 intensity measurements from 40 spectra, thus we calculated the coefficient of variation and standard deviation for these 40 measurements before/after log-transformation. Hence, there are 130 standard deviation and coefficient of variations for these 130 peaks.

Basically, we want the standard deviation to be small so the intensity measured for each peak will be as accurate as possible. Standard deviation and mean are unit dependent while the coefficient of variation is independent of the units of measurement. We use the relative variation, i. e. , coefficient of variation, to measure the variation in the measurements taken for each peak with a smaller coefficient of variation resulting in a more accurate measurement. We can see from Figure 4 that taking log of the intensities significantly reduces the variation as measure by the coefficient of variation.

We have examined data from 4 robotically and 2 manually processed and spotted aliquot of 7 samples and 4 robotically and 1 manually processed aliquot of another sample.

In Figures 5.1, 5.2 and 5.3 we plot the mean intensities of manuallyprocessed samples vs. the mean intensities of robotically processed samples. In the plot we compare the log intensities (LI) and background-subtracted log intensities (BSL1), and we include a best fit diagonal line. We can see that overall they agree well after background subtraction.

For these 47 replicate samples, we further identified 49 peaks. In the following plot, we further compare manual vs. robotic procedures at these 49 points, and we also calculate the coefficient of variation at these 49 peaks for 4 robot measurements.

EXAMPLE 1 - Biomarker analysis of serum samples from owrian cancer versus control pattents.

The 95 ovarian cancer and 92 control serum samples used in our analysis were obtained from the National Ovarian Cancer Early Detection Program at Northwestern University Hospital and correspond with some of the same samples that were used previously by Petricoin et al. As described above with reference to the experimental procedures, all samples were desalted via adsorption/elution from C18 ZipTips and were then subjected to MALDI-MS on a Micromass M@LDI-R instrument (note that at the time this data was acquired the Micromass M@LDI-R instrument had not yet been upgraded to the linear/reflectron (L/R) version) with all procedures being highly automated. The detailed protocol can be found in Appendix.

This data set consists of mass spectrometry spectra that were obtained on serum samples from 95 patients with ovarian cancer and 92 normal patients. These spectra extend from 800 to 3500Da and were acquired with the reflectron analyzer of a Micromass M@LDI-R instrument. Twelve samples had poor spectra and they were excluded from further analysis.

We then preprocessed the raw data sets. Our first step is mass alignment ; the resulting dataset has 91254 m/z measurements. Figure 6 shows the overall case and control median log intensities based on these samples. Figure 7 shows the median intensity after preprocessing (background subtraction and normalization). For these normalized samples, we apply our peak identification procedure and find the peak distribution for each data point. Figure 8 shows the distribution of peaks for all samples at each point. It can be seen that the identified peaks are only found in a small proportion of the cases and controls.

There is not a single peak that is found in all cases or controls which confirms the need for multiple biomarkers.

For these identified peaks, we calculate the two-sample T-statistics, and rank them based on their absolute values. The top 3500 peaks are used in Random Forest analysis in accordance with the present invention. We can vary the number of peaks used in Random Forest analysis for different datasets. For our dataset, 3500 seems to lead to represent an optimum number.

We applied the Random Forest program to the normalized dataset with selected peaks and have an 8% error rate for 89 cancer samples, a similar 8% error rate for 86 normal samples and thus an overall 8% error rate. The error rate is based on out-of-bag estimation. It is important to point out that these numbers are somewhat misleading in that they are based on internal CV and under-estimate the true error rate. In our later analysis, we have applied CV with feature selection within each training set, and the error rate is higher, about 25%. We expect this error rate will be substantially decreased as we acquire and merge together both reflectron and linear spectra for each sample (thus extending the analysis range up to 28,000 Da) and as we begin to fractionate samples and analyze multiple spectra/sample.

The Random Forest algorithm also produces variable importance measures that reflect the relative importance of each variable for prediction. We can compare these measures for different peaks to the ranks of these peaks based on their T-statistics. Figure 9 plots the ranking measures of selected peaks based on T-statistics and the importance measures. We can see that while both measures will be able to capture a common set of variables, there do exist discrepancies between these two measures.

EXAMPLE 2 In accordance with a preferred embodiment, the principles outlined above were applied. In particular, ovarian cancer and control serum samples were obtained from the National Ovarian Cancer Early Detection Program at Northwestern University Hospital.

The Keck Laboratory then subjected these samples to automated desalting and MALDI-MS on a Micromass M@LDI-L/R instrument (as opposed to the Micromass M@LDI-R instrument used in Example 1) as described generally in Appendix A.

The WLDI-L/R mass spectrometer automatically acquires two sets of data in positive ion detection mode. The mass range acquired is dependent on the mass analyzer being used, with 700-3500 Da for reflectron and 3450-28000 Da for linear. This dataset consists of merged mass spectrometry spectra that extend from 700 to 28000 Da and that were obtained on serum samples from 93 patients with ovarian cancer and 77 normal patients.

As mentioned above, Random Forest combines two powerful features: Bootstrap to produce pseudo-replicates and random feature selection to improve prediction accuracy.

Breiman, L. Random Forests. Madvine Lfflming 45, 1 (2001), 5-32. Random Forest can also estimate the importance of features according to their contribution to the resulting classification. (For a more detailed description of the algorithm see Wu, B. , Abbott, T., Fishman, D. , McMurray, W., Mor, G. , Stone, K. , Ward, D., Williams, K. , and Zhao, H.

Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 19, 13 (2003a), 1636-1643, which is included as Appendix B. ) From Random Forest program we can get the posterior probability of belonging to each class for each sample. Based on these posterior probabilities we evaluate the sensitivity, specificity and classification errors.

We summarize our mass spectrometry dataset for n samples in a p by n + 1 matrix: (sur, X,) (mz, X1,... , X@) where p is the number of mHz ratios observed, Hz is a column vector denoting the measured nzlz ratios, and the xi are the corresponding intensities for the i-th sample. We use vector Y = (y@) to denote the sample cancer status. Our goal is to predict y1 based on the intensity profile X'i = (x1i, x2i,..., xpi). Assume that we have classes.

Random Forest classifier partitions the space X of protein intensity profiles into g disjoint subsets, Al,..., Ag, such that for a sample with intensity profile X = (x1, ..., xp.,) # Aj the predicted class is j.

Classifiers are built from observations with known classes, which comprise the learring set (LS) L = { (X1, y1), ..., (XnL, ynL)}. Classifiers can then be applied to a test set (TS) T = {X1,... , Ont}, to predict the class for each observation. If the true classes y are known, they can be compared with the predicted classes to estimate the error rate of the classifiers.

We denote the Random Forest classifier built from a learning set L by C (., L).

Given a new sample (X, y), we can represent C(X, L) by a g-element vector (Ci,..., Cg). If we want a hard-decision classifier, we will have Ck=l and Ck=O, that is, it predicts sample (X, y) to belong to class k. Or we can have a probability output, Pr (Q=l) =Pi [0, l] and #i=1,... , g Pi =1, that is, it predicts the probabilitythat sample (X, y) belongs to class k is Pk.

For the ovarian cancer data set considered in accordance with this example we only have two classes, cancer (y=1) and normal (y=2) samples. For two-class classification problems we can define sensitivity (0) and specificity (<). They are inherently related to classification errors. The relationship between sensitivityand 1-specificityis well known as ROC curve in medical research. Sensitivity is also known as true positive rate, which is the probability of classifying a sample as cancer when it actually derives from a patient who has the cancer, i. e. Pr (C (X, L) =l j y=l). Specificity is also known as the true negative rate, which is the probability of classifying a sample as normal when it is actually normal, i. e.

Pr (C (X L) =2 l y=2) If QX, L) is a hard-decision classifier, we can estimate sensitivity and specificity using sample proportions, The most commonly used classification error (Err) is estimated as where ni and n2 are sample size for cancer and normal groups. 1-0 is classification error for cancer group, and 1-ruz is classification error for normal group. If we have a very un- balanced sample set, i. e. ni » n2 or m » n2, we can see that the previous definition of Err will encourage classifying all samples into the group with the larger sample size. To avoid this problem we can use a balanced classification error definition This error definition assigns equal weights to two groups.

In case we have a probability output, we first select a threshold a and then define the hard-decision classifier as l if Pj ; >a Cf';, L) = 9 otherwise We can then estimate 0 and Err similarly as before Relationship between is the commonly used ROC curve. Minimum classification error can be estimated as Preprocessing is arguably the most important step in mass spectrometry data analysis to reduce the effects of noisy features and to appropriately interpret the mass spectrometry dataset. Before we submit the dataset to our final classifier, we carry out the following preprocessing steps: mass alignment, normalization, smoothing and peak identification.

These detailed preprocessing steps are discussed briefly in Wu, B., Williams, K. , and Zhao, H Statistical challenges in proteomics research in postgenomics era. Institute of Mathematical Statistics Series IMS Lecture Notes-Monograph Series, 2003b, submutted; whch is included herewith as Appendix C Since we did not have a true test set, cross-validation was utilized to provide a nearly unbiased estimate of the classification error. The idea of cross-validation is to randomly partition the original data into two parts: training set used to build the classifier and a testing set used to estimate the performance of the classifier. The commonlyused"leave-one-out" cross-validation approach has high variance. Ambroise, C, and MacLachlan, G. J. Selection bias in gene extraction on the basis of microarraygene-expression data. PNAS 99, 10 (2002), 6562-6566. M-fold cross-validation is recommended, whereby M is usuallytaken to be around 5,10. In our study we use 5-fold cross-validation to estimate classification errors.

It is important to carry out peak identification and biomarker selection inside each cross- validation to avoid selection bias and to obtain and unbiased classification error estimation.

It is obvious that Err depends on the underlying classifier, sample size N and the number of selected biomarkers M. In this study we fix the classifier to be RF, and evaluate the impacts of N and M on Err. Our strategy is to empirically model the functional relationship Err (N, M) for a grid of values of N, M. For mass spectrometry data the total number of features is usually very large, there are total p=130, 000 nzlz ratios for our ovarian cancer dataset which consists of one reflectron and one linear spectrum for each sample.

The total number of selected biomarkers is usually in the range of 10-100. In our study we evaluate Err forM ranging from 5 to 100. The total number of samples is usually very small compared to the total number of features. There are total n=170 samples in our current ovarian cancer data set. We need to extrapolate to estimate the impacts of N on Err. An inverse-power-law learning curve relationship between Err and N, Err (N) =ß0+ ß1N-a is approximately true for large sample size dataset (usually about tens of thousands of samples), a is the asymptotic classification error and (po, pi, i) are positive constants.

C Cortes, L. D. Jackel, S. A. Solla, V. Vapnik, and J. S. Denker. Learning Curves: Asymptotic Values and Rate of Convergence. A hnces in Nal Irfamsion Prtin « g Systx, 6 : 327-334, 1994.

Our current dataset has relatively very small sample size (n =170) compared to high- dimension feature space (p=130, 000 for datasets containing merged reflectron + linear analyzer spectra). Under this situation it is not appropriate to rely on the learning curve model to extrapolate to an infinite training sample size N=oo. But within a limited range we can still rely on this model to extrapolate the classification error to full sample size n=170.

To estimate parameters (a, Po, Pi), we need to obtain at least three observations. As discussed before we will use 5-fold cross-validation to estimate classification errors. We first use one of the groups as testing set, which will produce a training set of N= 170/5 F4=136 samples. We then use two, three and four of the groups as a testing set, which will give N=102,68, 34. For each N we will estimate classification errors with M=5, 6,..., 100 biomarkers. And based on these classification errors we can estimate the learning curve.

Figure 10 displays the 5-fold cross-validation classification error estimations for this ovarian cancer data set. After merging the linear analyzer data, the best classification error achieved drops from about 25% to 20% and the classification error estimation is also more stable. The large fluctuations in classification error estimations in the Reflectron data are probably due at least in part to the influence of noise. Overall we can clearly see the trend that a larger training set has smaller classification errors. And for a fixed training set, classification error drops significantly from 5 to 20 biomarkers and then it levels off at about 20-40 biomarkers for the combined Reflectron+Linear data. With 136 samples in the training set, we can achieve about 20% classification error. Next we will use a learning curve to extrapolate Err (170, M) for each M.

Figure 11 displays the estimated classification for total sample size M=170. We can see that there is a significant improvement when the sample size increases from 34 to 68 and then to 102. But there is not too much further improvement from 136 samples to 170 samples. Overall the classification error levels off after 20 to 40 biomarkers. And the optimal classification error we can achieve is about 19%.

One of the major current interests in obtaining mass spectrometry data on patient samples is in identifying important biomarkers to build molecular diagnosis and prognosis tools. As discussed in Wu et al. , the Random Forest program has some significant advantages over traditional T-statistic for biomarker identification in terms of minimizing classification errors. Here we apply Random Forest to our 170 ovarian cancer samples to rank important biomarkers. To guard against false positives, it is very important to explore the local behavior of the identified biomarkers. To explore the intensity of all samples in one figure will make the plot obscure. Instead we visually compare median, first and third quartile intensities of normal and cancer groups in one plot. In the following several biomarker exploration plots, qo. 25 is the first quartile intensity, qo. 5 the median intensity and qo. 7s the third quartile intensity. Referring to Figures 12-15, we can clearly see the difference between cancer and normal groups. But there is no single biomarker that can completely distinguish cancer from normal groups; there are considerable overlaps between the two groups. For some biomarkers the normal group has higher intensities, while the cancer group dominates at other biomarkers.

We estimate the unbiased classification error rates for the ovarian cancer datasets.

With reflectron data alone, we can achieve about 25% classification error. After expanding the mass range of mass spectrometry data with the use of a linear analyzer, the optimal classification error we can achieve with 170 samples is about 19% for the merged linear + reflectron spectra. While some other cancer studies using mass spectrometry data have reported nearly perfect classifications, they are usually based on internal CV that will produce serious under-estimations of the actual error, e. g. in our previous study, the optimal internal classification error is about 8% compared to the"real"classification error 25%. Wu et al. Another neglected aspect in most current studies is the lack of visualization tools to analyze the regions around the identified biomarkers and to verify that they might actually result from peptide ionization.

While the preferred embodiments have been shown and described, it will be understood that there is no intent to limit the invention by such disclosure, but rather, it is intended to cover all modifications and alternate constructions falling within the spirit and scope of the invention as defined in the appended claims.

Appendix A W. M. Keck Facility at Yale-MALDI-MS Peptide Disease Biomarker Screening /, I Search Keck Sites : . M . K E C IC F O U N D AT I O N BIOTECHNOL13GY RESOURCE LABORATORY i=i BtaTECHNaLaGY RESOURCE LABORATORY Keck Home Page > Protein CheMistM > MALDI-MS Peptide Disease Biomarker Screening i MALDI-MS PEPTIDE DISEASE BIOMARKER SCREENING __ A recent publication (Petricoin et al, 2002) suggests that naturally occurring peptide disease biomarkers that bind to C16 supports may be identified via comparative analysis of MALDI-MS spectra acquired from comparatively large numbers of disease versus control serum samples. Since preliminary data obtained in the Keck Laboratory confirms this approach has merit, a new service is being offered to bring this technology within reach of our users. This new Keck Laboratory service includes robotic desalting of the serum and other biological samples followed by automated MALDI-MS data acquisition-with the resulting data being made available initially as a hard copy. Serum or plasma samples (10 microliters each) should be submitted in a Marsh Bioproducts AB0800 PCR plate and should be ordered by columns. Specifically, samples should fill the wells Al, B1, Cl, etc., before proceeding to A2.. We suggest that experimental and control samples be placed in alternating order and that at least some samples be run in triplicate (widely dispersed amongst the sample wells) so that experimental variability may be evaluated. A minimum of 8 total samples (experimental and control) per plate is required for this service which Yale University begins by the robotic addition of 5 microliters 0. 1% TFA to each sample. After Yale University 9 Y P 301 BCMM repeatedly (8x) pulling each sample up into a C18 ZipTip and expelling it back into P. O. Box 9812 the original sample well, the C18 ZipTip is washed 5x with 20 microliters 0. 1% TFA. New Haven CT Bound peptides are then eluted from the C18 ZipTip with 10 microliters of 50% ossss-oaiz P P P P acetonitrile, 0. 1% formic acid into a new 96 well plate. 2 microliters of the eluent Contact us are then removed, mixed with 0. 5 microliters of alpha-cyano-4-hydroxycinnamic acid matrix in 50% acetonitrile, 0. 05% TFA containing an internal standard of 25 moles of bradykinin (M+H C12 monoisotopic mass is 1060. 569), and subjected to automated MALDI-MS on a Micromass M@LDI-R mass spectrometer. The internal bradyknin standard is then used as a"lock mass"to calibrate the spectrum. The MALDI-R mass spectrometer automatically acquires data in positive ion detection. The current mass range being acquired is 800-3, 500 Da. Although the latter mass range is adjustable, it is difficult to acquire meaningful data below about 800 Da due to interference from the matrix and, especially on the current reflectron httD ://info. med. vale. edu/wmkeck/Drochem/biomarker. htm (onl6) equipped instrument, the ionization response drops off substantially as the<BR> mass range is increased above about 2,500. Currently, the MALDI-R sums 10<BR> individual laser shots into one spectra with the laser operating at 20 Hz (i.e., 20<BR> shots/second). Thus, a new spectra is acquired every ½ second. The laser moves in<BR> a random walk around the target well, acquiring data from a maximum of 20<BR> different locations within each 2 mm diameter well. A spectra is considered<BR> "acceptable" if it has a signal of greater than 2% above background noise less than<BR> 95% of saturation, and if there is at least one m/z detected between 1,125 Da and<BR> 3,500 Da. The MALDI-R is programmed to accept a maximum of 40 acceptable<BR> spectra, but if it sequentially acquires 4 unacceptable spectra, it will move on to<BR> another location within the same target well. The instrument uses an incrementally<BR> increasing laser percentage to heat up the target spot to acquire acceptable<BR> spectra, while still having the lowest possible laser energy, which will provide the<BR> best possible mass resolution. If the MALDI acquires 20 acceptable spectra at one<BR> position, it will then move to another position in the same sample well, and will<BR> acquire another 20 acceptable spectra, unless interrupted by 4 unacceptable<BR> spectra. Once the MALDI-R has shot (not acquired) 40 acceptable spectra, it will<BR> move to the next sample well. This means there can be a maximum of 40<BR> acceptable spectra acquired for each sample, and that if at no point it acquires<BR> acceptable data, it will try up to 10 different locations within the same sample<BR> target well before moving on to the next sampe. Typically, the resulting hard copy<BR> spectrum will represent the average of 20 to 40 individually acquired spectra.<BR> <P>The expected mass resolution is 14,000 at M+H 2,465 and the mass accuracy is<BR> better than approximately +70 ppm. We recommend that reasonably large numbers<BR> (e.g., >20) of experimental (i.e., disease = case) and control samples from<BR> different patients be analyzed to "average out" biological diversity so that disease-<BR> specific peptide markers may be identified with higher confidence. Although not yet<BR> tested, we believe this service may be equally applicable to other biological fluids<BR> (e.g., amniotic fluid, urine, etc). Please especially note that samples will be<BR> identified only by the plate number (which is assigned by the Keck Laboratory), the<BR> plate name (which is assigned by the user and which may contain up to 8 letters),<BR> and the well identification (e.g, C1). This service will be further expanded by the<BR> Spring of 2003 when our current M@LDI-R instrument is replaced by a MALDI-L/R<BR> instrument which will enable the use of both linear and reflectron modes of<BR> operation - with the linear mode allowing the mass range analyzed to be extended<BR> to 250,000 (depending on the matrix used). Upon request, the entire (averaged)<BR> MALDI-MS spectrum can be provided as a text fil listing of approximately 91,400<BR> m/z versus intensity data points that is suitable for further analysis - including<BR> importing into software programs like Excel. Algorithms that may be suitable for identifying peptide disease markers and other analysis and visualization tools are<BR> now being designd and written by the Keck Biostatistics Resource, which is<BR> Directed by Dr. Hongyu Zhao, and all users of this service will have access to these<BR> programs as they become available.<BR> <P>Biomarker Submission Workbook (36 KB Excel Workbook)<BR> This workbook must be submitted as described on the form before plates of<BR> biomarker samples are submitted along with one of the forms below.<BR> <P>Sample submission forms for the Peptide Disease Marker Screening<BR> Service:<BR> # Yale (81 KB Acrobat PDF file)<BR> # Non-Yale (91 KB Acrobat PDF file) Medical Center Yale-New Haven Hsopital Yale University Home Copyright # 2003, Yale University, New Haven, Connecticut, USA. All rights reserved.<BR> <P>Comments or suggestions to site editor.<BR> <P>Last modified: 05-Dec-2003 (GB) Appendix B Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data Baolin Wu, Tom Abbott2, David Fishman5, Wålter McMurray2, Gil Mor3, Kathryn Stone2, David Ward4, Kenneth Williams2 and HongyuZhao * Department of Epidemiology and Public Health, 2HHMI Biopolymer/Keck Laboratory, 3Department of Obstetrics and Gynecology, 4Department of Genetics, Yale University School of Medicine, New Haven, CT, USA and 5Department of OB/GYN, Northwestem University School of Medicine, Chicago, IL, USA Received on December 18, 2001 ; revised on March 3, 2002 ; accepted on June 6, 2002 ABSTRACT Motivation: Novel methods, both molecular and statistical, are urgently needed to take advantage of recent advances in biotechnology and the human genome project for disease diagnosis and prognosis. Mass spectrometry (MS) holds great promise for biomarker identification and genome-wide protein profiling. it has been demonstrated in the literature that bio- markers can be identified to distinguish normal individuals from cancer patients using MS data. Such progress is espe- cially exciting for the detection of early-stage ovarian cancer patients. Although various statistical methods have been util- ized to identify biomarkers from MS data, there has been no systematic comparison among these approaches in their relative ability to analyze MS data.

Results : We compare the performance of several classes of statistical methods for the classification of cancer based on MS spectra. These methods include : linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbor classi- fier, bagging and boosting classification trees, support vector machine, and random forest (RF). The methods are applied to ovarian cancer and control serum samples from the National Ovarian Cancer Early Detection Program clinic at Northwest- ern University Hospital. We found that RF outperforms other methods in the analysis of MS data.

Contact: baolin. wu@yale. edu Supplementary information: http://bioinformatics. med. yale. edu/proteomics/BioSupp1. html 1 INTRODUCTION In the past several years, microarray technology has attracted tremendous interest as it provides the potential ability to monitor the expression of an entire genome on a single "To whom corresponclence shoulcl be addressed. chip. Thus, researcherscan generate a'snap shot'view of the expression level of thousands of genes simultaneously (Nature Genetics, 1999). However, microarray technology is inherently limited. It is directed at analyzing mRNA rather than the actual biological effector, which usually is the res- ulting protein molecule ; it is blind to the array of protein post-translational modifications (e. g., phosphorylation) which often modulate protein function; levels of mRNA expression often correlate poorly with the actual in vivo protein con- centration due to differential rates of mRNA translation and varying protein half-lives.

Proteins, which carry out and modulate the vast majority of chemical reactions which together constitute'life', are really our targets of interest. Proteomics is an integral part of the process of understanding biological systems, pursuing drug discovery, and uncovering disease mechanisms. Because of their importance and very high level of variability and com- plexity, the analysis of protein expression and protein : protein interactions is as potentially exciting as it is a challenging task in life science research (Science, 2001). Comparative profiting of protein extracts from normal versus experimental cells and tissues enables us to potentially discover novel proteins that play important roles in disease pathology, response to stimuli, and developmental regulation. However, to conduct massively parallel analysis of thousands of proteins, over a large number of samples, in a reproducible manner so that logical decisions can be made based on qualitative and quantitative differences in protein content is an extremely challenging endeavor.

Mass spectrometry (MS) is increasingly being used for rapid identification and characterization of protein populations. The relative ease of operation of matrix assisted laser desorption ionization (MALDI) coupled with time-of-flightdetection and its characteristic generation of (mostly) singly charged peptide and protein ions makes this MS platform the current method of choice for disease biomarker discovery. MALDI-MS uses a nitrogen UV laser (337 nm) to generate ions from high mass, non-volatile samples such as peptides and proteins. The key to this technique, which was discovered several years ago, is that in the presence of an energy absorbing matrix like a-cyano-4- hydroxy cinnamic acid (CHCA), large molecules like peptides ionize instead of decomposing. In this technique, purified or partially purified proteins are mixed with a crystal-forming matrix, placed on an inert metal target, and subjected to a pulsed laser beam to produce gas phase ions that traverse a field-free flight tube and then are separated according to their mass/charge ratio ('H/z). The key is that the time is propor- tional to the square root of m/z. Since MALDI has the inherent advantage that most ions are singly charged, the mass of the analyte usually is equal to m/z and each analyte usually pro- duces only a single ion type. By recording the time-of-flight, we can measure the mass of the peptide/protein ions. The res- ulting data format is very simple : paired mass/charge ratio versus intensities. (See http ://info. med. yale. edu/wmkeck/ prochem/procmald. htm for more information on MS.) MS data sets are increasingly used for protein profiling in cancer research. An important goal of this endeavor is the ability to predict cancer on the basis of peptide/protein intens- ities. The identification of phenotypic expression patterns that correlate strongly to a defined pathological condition might well represent a significant step towards early detection and/or the development of novel therapies in which these molecules might serve as clinical targets (Fung et al., 2000).

As MS is increasingly used for protein profiling, signific- ant challenges have arisen with regard to analyzing the data sets. These include peak identification and alignment, MS spectrum normalization, and data set visualization, among others. These pre-processing steps are arguably critical and we are currently evaluating them carefully. The final and most important step is the classification of disease status based on MS data. Recent publications on cancer classification using MS data sets have mainly focused on identifying biomarkers in serum to distinguish between cancer and normal samples.

Basically these studies first use approximate criteria to select a subset of variables acquired on a training set'of samples, stat- istical methods are then applied to this subset to identify the most important variables as biomarkers. Finally, the perform- ance of these biomarkers is determined based on their ability to classify samples. The statistical methods used to select biomarkers include T-statistics (Guoan etal., 2002), classific- ation methods such as trees (Bao-Ling et al., 2002), genetic algorithms and self-organizing-maps (SOM) (Petricoin et al., 2002), and artificial neural network (Ball et al., 2002). There is a rapidly growing literature on the use of MS to identify peptide and protein biomarkers Virtually all of these reports are based on MS data obtained via surface enhanced laser desorption ionization (SELDI) and the use of only a single methodology to identify the biomarkers. Here we utilize MALDI MS to obtain the data set and then use it to compare the performance of several well-I<nown classification methods Fig. 1. MALDI-MS sample plots. which can be reasonably easily adapted to the analysis of MS data sets. It is our ongoing endeavor to evaluate their relat- ive performance in analyzing MS data sets. Although it is a very important aspect of biomarker discovery, this report does not cover various options for biomarker selections. Rather, we will compare two criteria for biomarker selection and then use selected biomarkers to compare several classification methods. We review below some classification methods and assess their performance on an ovarian cancer MALDI-MS data set obtained by the Keck Laboratory at Yale University (http ://info. med. yale. edu/wmkeck/) as described in Section 4.

Figure I shows the MALDI-MS data from one cancer and one normal serum sample.

2 MATERIALS AND METHODS 2.1 Discriminant methods For our following discussion, we can summarize our MS data set for n samples in a p x (n + 1) matrix : (ItZ, X) = (mz, X1,..., Xn) where p is the number of rn/z ratios observed, riiz is a column vector denoting the measured in/z ratios, and the Xi are the corresponding intensities for the ith sample. We also have a vector Y = (y,.... y,,) to denote the sample cancer status. Our goal is to predict vu based on the intensity profile X _ (xi ;, x2 ;,..., xp ;). For our ovarian cancer data set, there are two classes, cancer or normal, and the class labels y ; can be defined as I or 2, respectively. Gener- ally we can have g classes, and the goal of statistical analysis is to use the class information to reveal the structures of the data. A predictor or classifier partitions the space X of protein intensity profiles into two disjoint subsets, A I and A2, such that for a sample with intensity profile X = (x1,..., Xi,) E A. i the predicted class is j.

Classifiers are built from observations with known classes, which comprise the learning set (LS) L = { (Xi, y1),...,(Xn1,yn1)). Clasifiers can then be applied to a test set (7S) T = (Xi,..., X".,. f, to predict the class for each observation. If the true classes in are known, they can be com- pared with the predicted classes to estimate the error rate of the classifiers.

We denote a classifier built from a learning set L by C (, L) ; the predicted class for an observation x is C (x, L). Below we briefly review several well-known discrimination methods which are compared in our study. Most of the methods dis- cussed below have also been compared in the context of using microarray data to distinguish various cancer types (Dudoit et al., 2002). General references on the topic of discriminant analysis include Mardia et al. 1979), McLachlan (1992), and Ripley (1996).

Linear discriminant analysis (LDA), quadratic discriminant analysis (QDA). LDA (linear discriminant analysis) was first described by Fisher (1936). It seeks a linearcombination xa of the sample intensity X = (xi,..., x,,) which has a maximal ratio of the separation of the class means to the within-class variance, that is, maximizing the ratio aTBa/aTWa, where W denotes the within-class covariance matrix, i. e. the covari- ance matrix of the variables centered on the class mean, and B denotes the between-classes covariance matrix. These two matrices can be calculated as follows. Let M be the n x p matrix of class means, and C be the n x g matrix of class indicator variables (so g =) <== case ;' is assigned to class j). Let x be the means of the variables over the whole sample, then the sample covariance matrices are Different denominators have been used in covariance matrices. Here we follow the notation in Venables and Ripley (2002). The criterion used in LDA is very intuitive. LDA is a non-parametric method that is also a special form of a max- imum likelihood discriminant rule for multivariate normal class densities with the same covariance matrix. An alterna- tive approach to discrimination is via probability models. Let #c denote the prior probabilities of the classes, and p (xic) the densities of distributions of the observations for class c. Then the posterior distribution of the classes after observing x is The allocation rule which makes the smallest expected number of errors chooses the class with maximal p (c#x) ; this is known as the Bayes rule. Now suppose the distribution for class c is multivariate normal with mean, and covariance #c. Then the Bayes rule minimizes <BR> <BR> <BR> <BR> <BR> #c = -2log(p(x#c)) - 2 log(#c)<BR> <BR> <BR> <BR> <BR> <BR> <BR> = (x - µc)#c-1(x - µc)T + log(##c#) - 2 log(#c) The difference between the Qe for two classes is a quad- ratic function of x, so the method is known as QDA and the boundaries of the decision regions are quadratic surfaces in the x space. LDA is a special case of QDA where classes have common covariance matrix. k-Nearest neighbor (KNN). KNN classifiers are based on finding the k nearest examples in some reference set, and tak- ing a majority vote among the classes of these k examples, or, equivalently, estimating the posterior probability p (cl. r) by the proportions of the classes among the k examples. We can measure'nearest'by Euclidean distance or by one minus correlation. Here we consider using k = 1, 3 under Euclidean distance.

Bagging, boosting classification trees. Constructing classification trees may be seen as a type of variable selection.

Possible interactions between variables are handled automa- tically, and so is monotonic transformation of the variables.

These issues are reduced to which variables to divide on, and how to achieve the split in building a clasification tree.

Specifically we construct trees by recursive splits of subsets of the samples into two child subsets, starting with all the samples. Each terminal node is assigned a class label and the resulting partition corresponds to a final classifier. There are several forms of trees. Here we use the CARTclassification and regression trees. For a detailed technical discussion of CART, see Breiman et al. (1983).

Aggregating classifiers could dramatically improve predict- ive accuracy (Breiman, 1996, 1998). In classification, the multiple classifiers are aggregated by majority votes, i. e. the final class is the one predicted by the majority of the predictors.

Breiman (1998) studied the bias and variance properties of the aggregated predictors. The key is the possible instability of the prediction method, i. e. whether small changes in the learning set result in large changes in the predictor. CART is an unstable classifier that can benefit from aggregation. Here we aggregate trees which are grown until they perfectly fit the data. The simplet form of bagging is using bootstrap to produce pseudo-replicates. In our study, we aggregated 50 bootstrap samples to produce a pool of classification trees.

This algorithm works in the following way (suppose our sample is S having n samples) : Algorithm 1, Bagging (I) Sample with replacement to form N bootstrap samples (bol.. Blvi.

(2) Use Bk to construct Tree classifier Tk, and predict S using Tk.

(3) Final prediction is un-weighted average.

Boosting was first proposed by Freund and Schapire (1995), and it was also called arcing and studied by Breiman (1998).

The basic idea is to adaptively resample the original data so that the weights are increased for those most frequently mis- classified samples. The final prediction is based on weighted or un-weighted voting. It is conjectured that boosting is a special form of RF (Breiman, 2001) (see below). Here we consider two special forms of Boosting : arc-x4 and arc-fs, following the descriptions of Breiman (1998).

Algorithm 2, Arc-fs details (I) At first step, initialize p I/it.

Let P(1) = {p1(1),..., pn(1)}.

(2) At kth step, using the current probabilities P ('), sample with replacement from sample S to get the training set Sk and construct tree classifier Tk using Sk.

(3) Run S down Tk and let d(i) = 1 if ith case is classified incorrectly, otherwise zero.

(4) Define Ex = Eipi(k), ßk = (1 - #k)/#k, update k + I step probabilities by If #k = 0, #k # 1/2, re-initialize pi(k+1) = 1/n.

(5) After K steps, (T1,..., TK} are combined using weighted voting with TA. having weight t log (ßk).

Algorithm 3, Arc-x4 details (I) Same as Arc-fs (2) Same as Arc-fs (3) Run S down tree classifier Tk and let m (i) be the num- ber of misclassifications the ith case by {T1,..., Tk}.

(4) Update k + I step probabilities defined by p''= 17 (1 + m(i)4)/#jpj(k) (1 + m(j)4) (5) After K steps, {T1,..., Tk} are combined by un-weighted voting.

Support vector machine (SVM). The observed nt/2 ratio for the i th subject Xi can be thought of as a point in RP.

An intuitive binary classifier would be to construct a hyper- plane separating cancer subjects from normal subjects in this IEC/'space. But for most problems, there is no hyper- plane which can successfully separate different classes. The idea of SVM is to map the data into a higher dimension space and separate them there. For technical details, please see Vapnik (1998) and Burges (1998). The algorithm used here is described at http ://www. csie. ntu. edu. tw/-cjlin/libsvm/ index. html Random forest (RF). RF (Breiman, 2001) combines two powerful ideas in machine learning techniques: bagging and random feature selection. Bagging, as described above, stands for bootstrap aggregating, which uses resampling to pro- duce pseudo-replicates to improve predictive accuracy. By using random feature selections, we can significantly improve our predictive accuracy. Here we use the RF program from (Breiman, 2001), and it works as follows.

Algorithm 4, RF (I) Sample with replacement to form N bootstrap samples {B,,.... BN} (2) Use each sample Bk to construct a Tree classifier 7 to predict those samples that are not in Bk (called out-of bag samples). These predictions are called out-of-bag estimators.

(3) Before using T to predict out-of bag samples, if we randomly permute the value for one variable for these out-of-bag samples, intuitively the prediction error is going to increase. And the amount of increase will reflect the importance of this variable.

(4) When constructing Tk, at each node splitting we first randomly select rn variables, then we choose one best split from these ni variables. <BR> <BR> <BR> <P>(5) Final prediction is theaverageof out-of bag estimators over all Bootstrap samples.

3 DATA SET AND PRE-PROCESSING 3.1 Data set We have obtained ovarian cancer and control serum samples from the National Ovarian Cancer Early Detection Program at Northwestern University Hospital. The Keck Laboratory then subjected these samples to automated desalt- ing and MALDI-MS on a Micromass M@LDI-R instru- ment http ://www. micromass. co. uk as described generally at http ://info. med. yale. edu/wmkeck/prochem/biomarker. htn.

This data set consists of MS spectra that extend from 800 to 3500 Da and that were obtained on serum samples from 47 patients with ovarian cancer and 44 normal patients. Based on ourevaluation, two of the normal spectra are of poorquality and are excluded in our analyses. Figure 2 shows the overall case and control median log intensities based on 89 samples.

3.2 Pre-processing Due to the noisy nature of the data set, pre-processing is an important step in the analysis of MS data. The raw intensities Fig. 2. Median log intensity for 89 samples.

Fig. 3. Media log intensity after pre-processing. have a wide dynamic range. Taking the log of the intensit- ies decrcases the magnitude and variation within this range.

Before we submit the data set to our classifiers, we have to carry out some pre-processing (e. g. background subtraction to remove the effect of chemical and electronic noise, peak identification, etc.) which will be described in a subsequent publication. Figure 3 shows the median intensities after all pre-processing.

3.3 Peptide/protein marker selection For some discriminant analysis methods discussed above, the number of features that can be handled has to be smaller than the number of observations, e. g. the LDA and QDA meth- ods. Therefore, we cannot use all the intensity values from an MS data set for these classification methods. Instead, we have to identify certain nt/z ratios as inputs to these methods, and it is apparent that this feature selection step is critical in the analysis of MS data and comparison of various methods.

To make the comparison as valid as possible, we feed the same set of m/z ratios to all classification methods and com- pare their performance on our data. This practice will likely penalize those methods that can utilize as many features as possible in classifications. In our analysis, we use two meth- ods to select nt/z ratios used in classification analysis. For the first method, we rank the variables, i. e. m/z ratios, based on normalized difference between two groups (cancer group and normal group), which is the T-statistic, and then we select variables based on the absolute values of the t-statistics. In our study, we evaluate the effects of selecting 15 and 25 markers.

In order to evaluate the effects of LDA and QDA, we must verify that there is a sufficient number of samples. So there is a practical limit on the number of markers that we can use.

For the second method of choosing variables in classification analysis, we use the by-product of the RF program. The RF program outputs a variable importance measure. This measure is derived from assessing the decrease in prediction accuracy after random permutation of each variable in the feature set.

The idea is that if we randomly permute the observed values of an important variable, this will result in substantially decreas- ing our ability to classify each individual in the sample set. In our analysis, we also select 15 and 25 markers from a custom- ized RF algorithm which will be described in a subsequent publication. We also compare marker selection based on RF and the normalized difference between groups.

3.4 Study design Here we want to compare the performance of the classifiers discussed above based on their prediction error rate. Since a test data set was not available, cross-validation within the original data set was utilized to provide a nearly unbiased estimate of the prediction error rate. Breiman and Spector (1992) demonstrated that leave-one-out cross-validation has high variance if the prediction rule is unstable, because the leave-one-out training sets are too similar to the full data set.

5-fold or 10-fold cross-validation displayed lower variance.

Efron and Tibershirani (1997) proposed a 0. 632+ bootstrap method, which is a bootstrap smoothing version of cross- validation and has less variation. We applied both methods to LDA, QDA and NN classifiers. We ran 100 cycles of 10-fold cross-validation and 0.632+ bootstrap error rate estimation.

In the 0.632+ rule, we used 100 bootstrap samples. In estimat- ing the 0.632+ error of QDA, bootstrap samples often caused the covariance matrix to be singular if we used 25 markers, Fig. 4. Error summary for T-statistics marker selection.

Fig. 5. Error summary for RF marker selection. so we calculated the 0.632+ error rate of QDA only for ) 5markers.

For bagging and boosting, we used a random split to parti- tion the observed data into a training set (59 samples) and a testing set (30 samples) to estimate error rate. We repeated this 100times, and the RF prediction error estimate is based on out- of-bag estimation, which we believe is reasonably accurate.

To assess the error rate variation, we repeated the whole pro- cedure ! 00 times, with each error estimate based on 100 trees.

These calculations were carried out for selected markers using RF and the normalized difference between groups.

4 RESULTS 4.1 Prediction error rates We use boxplots to summarize the error rates. Figure 4 sum- marizes the errors for using T-statistics to select markers and Figure 5 summarizes the errors for using RF to select markers.

In these plots, the postfix'cvl0'means estimating error using 10-fold cross-validation,'0. 632+'means estimating error using 0.632+ bootstrap method, NN I for k-nearest neighbor with k = 1, NN3 for k-nearest neighbor with k = 3.

First, for the estimates of error rates based on different methods (cross-validation or 0.632+ rule), we can see that the 0.632+ rule provides a more stable estimate of the error rate than 10-fold cross-validation for LDA, QDA, kNN, and SVM classifiers. The error results for bagging and boosting trees are highly variable. Although the variance of the error estimates for RF is not as small as those based on the 0. 632+ rule, it is certainly quite comparable and much less than those based on bagging and boosting.

As for error rate, RF consistently performs well among all the scenarios considered. When a total of 15 markers selected through t-statistics are used, SVM has the lowest error rate among all classifiers, whereas the error rate based on LDA is the second lowest one among all the classifiers. The error rate based on RF closely follows the top two methods. As the number of markers selected increases from 15 to 25, the relative advantage of LDA over RF no longer holds. SVM has the lowest error rate and RF has close performance. In Fig. 6. Variable ranking comparison. addition, error rates based on RF have consistent low vari- ation, which suggests that the error rate from RF is very reliable.

When the variables selected are derived from importance measures based on RF, it is not surprising that RF outper- forms all other methods. Based on these sets of variables, the relative performance LDA stays the same. But the QDA becomes worse when 25 markers are used, which is due to the unstable estimate of the covariance matrix in QDA. Because all variables, instead of a pre-selected subset of variables, can be utilized in RF as well as bagging and boosting methods considered here, the ability to incorporate many more vari- ables to build classifiers represents a distinct advantage over the methods that are limited by the number of variables that can be considered in an analysis. As a result, the error rate based on RF using the variables selected by RF has lower prediction error rate than the minimum error rate achieved using variables selected through T-statistics.

4.2 Choice of predictor variables LDA and QDA are not stable using a large number of variables. ln using bootstrap to estimate the error rate, the covariance matrix for the bootstrap samples was often singular.

4.3 Variables identified from T-statistics and RF Here we compare the variable selection based on T-statistics and RF program. Figure 6 ptots the ranking measures of selec- ted peaks based on T-statistics and the importance measures from RF. Wecan seethatboth measures will beabletocapture a common set of variables, i. e. the variables corresponding to the points in the upper tight region of this figure. How- ever. there do exist discrepancies between these two measures, resulting in different performance of various classifiers based on the selected variables.

5 DISCUSSION In this report, we have compared results obtained with several well-known classification methods to distinguish ovarian can- cer patients from normal individuals based on MS data obtained on serum samples. Overall, we have found that the RF approach both leads to an overall lower misclassification rate as well as to a more stable assessment of classification errors. Therefore, our preliminary analyses suggest that RF and methods similar in nature to RF may be more useful than other methods to classify samples basedon MS data. This con- clusion has been confirmed by applying these classification methods to. a completely different data set on autism which yielded similar results (data not shown). Compared to LDA and QDA methods, RF has the advantage of not requiring the number of variables used to be less than the number of sub- jects in the study, which is a clear advantage for the analysis of MS data as the number of Ml/Z versus intensity data points is very large. In addition, RP is able to handle interactions among variables. Although many methods have been com- pared in this report, there also are some additional methods, e. g. neural networks, that we have not yet compared. This is an ongoing endeavor, and we are in the process of evaluating these other methods as well.

The pre-processing of MS outputs is a very critical step in the overall analysis of MS data set. Peak identification, spectrum alignment, as well as normalization undoubtedly all affect the performance of classifications. Because the focus of this report is on comparing various classifiers, and we believe it is likely that the relative performance of these classifiers will not be diffci-cntially affected by pre-processing the data, we have not discussed in detail the specific steps we have taken to pre-process MS data. The effects of pre-processing on classification analysis and biomarker identification will be reported elsewhere.

In this report, we use T-statistics to pre-select a set of variables as inputs for various classifiers. There are some limitations to this approach as it does not take into account interactions among variables, and more importantly, it is not stable when sample sizes are relatively small. We could con- sider a more robust form of variation estimation, and utilize the global variation to improve the variance estimates. Variance shrinkage is a very good strategy to improve the estimation of variance (Long et al., 2001). As our current sample size is relatively small, we are considering a more robust approach to estimation of variations. Although NN, RF and other tree- based methods are able to analyze many variables, we still believe variable selection is a critical issue. For example, the mlz ratios corresponding to background levels should not be considered in classification analysis, and keeping these back- ground noises in the data will likely reduce the performance of any classifier. Therefore, in addition to data pre-processing, variable selection for classification analysis may represent another challenge for MS data analysis.

ACKNOWLEDGEMENTS We thank reviewers for their constructive comments. This research is supported by NIH grant NHLBI NOI-HV- 28186, R01 GM59507, RR015837, NCI-EDRN, NCI UOI CA-98-028, and DOE grant DE-FG02-02ER63462.

REFERENCES (1999). The chipping forecast. Supplement to Nature Genetics, 21.

(2001). Proteomics. Science, 294, 2074-2085.

Ball, G., Mian,S., Holding,F., Allibone, R. O., Lowe,J., Al.i,S., Li,G., McCardie. S., Ellis, I. O., Creaser, C. and Rees, R. C. (2002). An integrated approach utilizing artifical neural networks and seldi mass specuometry for the classification of human tumors and rapid identification of potential blomarkers. Bioinformatics, 18, 395-404.

Bao-Ling, A., Yinsheng, Q. , John, W. D., Michael, D. W., Mais, A. C., Lisa, H. C., John, O. S., PauI, F. S., Yutaka, Y., Ziding, F. and George, L. W. J. (2002). Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Res., 62, 3609-3614.

Breiman,L. (1996). Bagging predictors. Mach. Learning, 24, 123-140.

Breiman,L. (1998). Arcing classiers. Ann. Statistics, 26, 801-824.

Breiman,L. (2001). Randomforest. Technical Report Stat.

Dept. UCB. AO : Pt.-. upd. ttc. Breiman,L., Friedman,J.H., Olshen,R. A. and Stone,C. (1983).

Classification and Regression Trees. Brciman, L. and Spcctor. P. (1992). Submodel selection and eval- uation in regression: the x-random case. Int. Stat. Rev., 60, 291-319.

Burges. C. J. C. (1998. lune). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2 (2), 121-167.

Dudoit, S., Fridlyand, l. and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97 (457), 77-87.

Efron, B. and TibershiranisR. (1997). Improvements on cross- validation : the. 632+ bootstrap method. J. Am. Stat. Assoc., 92, 548-560. Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. An. Eugenics, 7, 179-188.

Freund, Y. and Schapire, R. (1995). A decision-theoretic generaliza- tion of on-line learning and an application to boosting. to appear.

J. Compul. Syst. Sci. AQ : Pl. AQ : Pl,.

Fung, E. T., Wright, G. L. J. and Dalmasso, E. A. (2000). Proteomic strategies for biomarker identification: progress and challenges.

Curr. Opin. Mol. Therap., 2, 643-650.

Guoan, C. , Tarek, G. G., Chiang-Ching, H., Dafydd, G. T., Kerby,A. S. , Jeremy, M. G. T., Sharon. L. R. K., David,E.M., Thomas,J.G., Mark, D. I., Mark, B. O., Samir, M. H. and David, G. B. (2002).

Proteomic analysis of lung adenocancinoma : identification of a highly expressed set of proteins in tumors. Clin. Cancer Res., 8 2298-2305.

Long, A. D., Mangalam, H. , Chan, B. Y.P., Tolleri,L., Hatfield, G. W. and Baldi, P. (2001). Genome expression profiling in Escherichia coli k 12 : improved statistical inference from DNA microarraydata AO : 1'1>, AQ : Pb. using analysis of variance and a bayesian statistical framework.

J. Biol. Chem. in-press.

Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. Academic Press, Inc. , San Diego.

McLachlan, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.

Petricoin, E. F., Ardekani,A.M., Hitt,B.A., Levine,P.J., Fusaro,V. A. , Stcinberg, S. M. , Mills, G. B., Simine, C., Fishman, D. A., Kohn, E. C. and Liotta, L. A. (2002). Use of proteomic patterns in serum to identify ovarian cancer. The Lancet, 359, 572-577.

Ripley, B. D. (1996). Pattes Recognition and Neural Networks, Cambridge University Press, Cambridge, New York.

Vapnik, V. (1998). Statistical Learning Theory. Wiley New York.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S, 4th edn. Springer, Berlin Appendix C Statistical Challenges in Proteomics Research in the Post-Genomic Era ABSTRACT Proteome, first coined at the first 2-DE (two-dimensional gel electrophoresis) meeting in Siena, Italy in 1994, was defined as the protein complement of the genome [1] and its study, proleo7lics. was initially directed at annotating functions of all expressed proteins. With rapid technical advances and changes in biological experimentation. proteomics today has evolved to a more ambitious goal : studying not only all the proteins expressed in any given cell, but also all of their protein isofbrms and modifications, the interactions between. them, their three dimensional structures and higher-order complexes, and for that matter, almost everything"post-genomic'' [2]. Therefore, it's not possible, nor is it our intention, to try to give a comprehensive review of statistical issues involved in thc whole proteomics field. Instead, we will focus on one emerging area of intense interest, biomarker discovery using Mass Spectrometric (MS) -based approaches to identifying peptides and proteins whose relative level of expressions can be used to classify individuals into either normal. or different disease categories. The very high potentia) of this approach to diagnose, classify and understand at the molecular level, guide treatment and to provide more accurate prognoses for a very targe range of human disease accounts for the recent and very rapidly increasing level of interest in this approach in the biomedical and statistical research community [3-7]. In this review, we will approach proteomics. and particularly, biomarker discovery, from an infolmatics perspective by identifying and providing some suggestions for addressing statistical challenges that need to be solved to better fulfill the promise of the new and evolving frontier of proteomics.

1. Proteomics Research There are broadly three current active areas in proteomics : two dimensional gel el. ecirophoresis (2-DE) and MS-based proteomc profiting, proteome-wide biochemical assays, and systematic structural biology and imaging techniques, as reviewed recently in a series of articles in Nature Insight [2,8-10], with the latter reviews focusing on technological developments and also extending to applications of proteomics and related issues [11, 12].

A primary driving force of proteomics is the increasing ability of MS to quantify the level of expression and to identity ever smaller amounts of protein from increasingly complex mixtures [2]. Traditionally MS has been coupled with 2-DE for identification of the protein (s) contained within selected (often visually) Coomassie Blue, silver or otherwise stained protein spots that have been quantified by densitometry. Although conventional 2-DE is the oldest and probably stil. l the most widely used approach for comparative protein profiting, it has several disadvantages including its very limited, approximately 10-fotd dynamic range ; inability to detect low abundant proteins (e. g. few if any proteins are detected that are expressed at less than 1,000 copies/cell); and the difficulty to identify the same spots in two or more gels and to carry out high throughput analyses [13, 14]. Recent developments in 2-DE include differentia) fluorescence 2-DE. which overcomes many of the challenges of conventional 2-DE analysis [15- 17], and isotope coded affinity tag and other multi-dimensional MS-based approaches for quantifying the relative level of expression of 500-1, 500 proteins in complex cell extracts [13, 18, 191. One major statistical challenge in 2-DE analysis is the identification of proteins from tandem-MS data through data base searches [20]. Better statistical and computational algorithms need to be developed to accommodate the ever-increasing data bases and to reduce both fatse- negative and false-positive rates.

Proteome-\vide biochemical arrays try to interrogate protein activity on a proteomic scale [2].

One approach is to generate sets of clones that express a representative of each protein of a proteome in a useFul format. followed by the analysis of these sets on a genome-wide basis [9].

Such studies enable genetic. biochemical and cell biological technotogies to be appned on a systematic) evel, leading to the assignment of biochemical activities, the construction of protein arrays, the identification of interactions, and the localization of proteins within cellular compartments [9]. The availability of such data poses statistical challenges to integrate data of various types at the genomics scale to address specific biological questions. For example. in the study of protein-protein interactions, data integration and visualization is essential due to high l : alse-positive and false-negative rates of each individual data set [21 1.

As described in [10]. the goal of structural biology is to integrate structural information gathered at multiple levels of the biological hierarchy - from atoms to cells - into a common framework. The goal is a comprehensive description of the multitude of interactions between molecular entities, which in turn is a prerequisite for the discovery of general structural principles that underlie all cellular processes. This area of research can benefit greatly from statistical thinking with protein structure predications from sequence data and other relevant information as an excellent example of the usefulness of statistical methods [22].

Proteomics has the potential to have a profoundly positive impact on biology and medicine especial clinica) diagnostics and drug discovery [H]. How to efficiently organize, evatuate, archive, exchange, and extract useful information from large proteomic datasets pose great challenges [12] that must be met if the promise of proteomics is to be fulfilled.

In the rest of this review, we focus on analyzing MS datasets for the purpose of peptide/protein biomarker discovery. We touch on issues of statistical problems and data visualization. related to this intriguing approach to proteome profiling and we attempt to provide some perspective (often gleaned from microan'ay studies of comparative mRNA expression) on this rapidly evolving area of research.

2. Biomarker Discovery using MS-based Proteomics The increasing use of MS to uncover naturally occurring forms of peptide and protein markers of disease holds enormous promise for bringing the power of MS to bear on the challenge of protein profiling of the large numbers of samples needed to obviate biological diversity. However, challenging statistical issues remain that have not yet been well addressed i. n existing studies. In the following we will review a range of important issues in analyzing MS data using a matrix assisted laser desorption ionization (MALDI) source and time-of-flight (TOF) mass analyzer as a sample proteomics platform.

2.1 Mass Spectrometer Mass spectrometric measurements are carried out in the gas phase on ionized samples. There are three basic components in all mass spectrometers. First an ion source ionizes the molecule of interest, e. g. peptides/proteins. then a mass analyzer differentiates the ions according to their mass-to-charge ratio and finally, a detector measures the abundance of ions.

Sample ionization is the process of placing charges on. neutral. molecules. Among ionization methods, electrospray ionization (ES !) [23] and MALDI [24] are the two most commonly used techniques to volatize and ionize the proteins or peptides [25, 26]. ESI ionizes the samples out of a solution and MA. LDI sublimates and ionizes the samples out of a dry, crystalline matrix via laser pulses.

A mass analyzer is used to separate ions within a selected range of mass-to-charge ratios. Ions are typically separated by magnetic fields, electric fields, or by the time it takes an ion to travel a fixed distance. There are four basic types of mass anaiyzer currendy used in proteomics research : ion trap, time-of-flight (TOF), quadrupole, and Fourier transform ion cyclotron (FT-MS) analyzers [25, 26]. Among them, the TOF mass analyzer is one of the simplest and is commonly used with MALDI. It is based on accelerating a set of ions to a detector with each ion having the same amount of energy. Because the ions have the same energy, yet different masses, they reach the detector at different times. Smaller ions reach the detector first because of their greater velocity and larger ions take longer time, thus the analyzer is called TOF and the mass is determined by the time required for each ion to travel from the source to the detector.

The ion detector allows a. mass spectrometer to generate a signa current from incident ions by generating secondary electrons, which are further amplified. Alternatively, some detectors operate by inducing a current generated by a moving charge. Electron multipliers and scintillation counters are the most commonly used and they convert the kinetic energy of incident ions into a cascade of secondary electrons [25, 26].

The relationship that allows the mass/charge (m/z) ratio to be determined for an individual ion is: (2. 1) E = ½(m/z)v2.

In this equation. E is the energy imparted. to the charged ions as a result of the voltage that is applied by the instrument and v is the velocity of the ions down the flight path. Because all of the ions are exposed to the same electric field, all similarly charged ions will have similar energies.

Therefore, based on the above equation, ions that have larger mass must have lower velocities and hence will require lon ; er times to reach the detector, thus forming the basis for m/z determination by a mass spectrometer equipped with a TOF detector. A mass spectrum is created by recording electrical currents produced by different ions reaching the detector with different traveling times. The resulting data format is very simple: paired mass-to-charge ratio ratio versus intensities. Figure 1 shows two sample spectra as described in [6].

Figure t : Sample MS Spectra (See http ://info.med.yale.edu/wmkeck/prochem/procmald.htm for more information on MS.) 2.2 Application of MS to Disease Biomarker Discovery Most recent disease biomarker studies have reiied on surface enhanced laser desorption ionization time-of-flight mass spectrometry (SELDI-TOF-MS) [4, 27]. The SEI. DI approach [27] uses a gold coated chip that has been modified with a chromatographic surface (e.g., anionic, cationic, hydrophobic, etc). After spotting a few microliters of serum or other biological samples. contaminants and salt (which often interfere with MALDI-MS) are removed by washing with water or other volatile solvent, and the target dried after adding a MALDI matrix solution like a-cyano-4-hydroxy-cinnamic acid. The relative ease of carrying out the"one-step" SELDI-TOF-MS approach and recent papers like that by Petricoin et al [3] on ovarian cancer have spurred interest in uncovering peptide/protein disease biomarkers that might, for instance. play an important role in early detection. In the Petricoin et al study [3] SELDI-MS analysis of serum from 50 control and 50 case samples from patients with ovarian cancer resulted in identifying 5 peptide biomarkers that ranged in size from 534 to 2 465 Da. The pattern formed by these markers was then used to correctly classify all 50 ovarian cancer samples in a masked set of serum samples from 116 patients who included 50 ovarian cancer patients and 66 unaffected women or those with non-malignant disorders. Of the latter samples, 63 were correctly recognized as not being from cancer patients-thus providing 100% sensitivity (50/50) for detecting cancer, 95% specificity (63/66) for detecting controls, and a positive predictive value of 94% (50/53) for this popu) ation. That is, if the 5 peptide"ovarian cancer"biomarker pattern was identified in the sample, there was a 94% probability that the patient indeed has ovarian cancer. Although similar promising results have been reported recently in other reasonably large-scale studies of serum samples from breast and prostrate cancer patients [5,7], we would like to raise two concerns about the Petricoin et al [3] study. The first is an issue that was raised by Rockville [28] and others and that is the very high positive predictive value (PPV) of 94% reported by Petricoin et al [3] applies only to their artificial population of 116 patients, 50 of whom had ovarian cancer. When their estimates of sensitivity (100%) and specificity (95%) are applied to an average population of post-menopausal women with an incidence of ovarian cancer of 50 per 100,000, the PPV is reduced to a clinically insignificant value of only 1% [28]. The second caution with regard to the Petricoin et al [3] study is that (as shown below) closer examination of the mass spectra around their"biomarkers"suggests strongly that the latter do not arise from biologically significant peptides.

3. Issues in Analyzing MS datasets An An inherent advantage of MALDI-MS is that it primarily produces sin£tly charred ions, i. e. in equation (2. 1) is usually equal to one. So we can use the measured traveling time t to directly calculate the mass of the ions. Data resulting from MALDI and other types of MS sources have a very simple format consisting entirely of paired m/versus intensity data points. The objective is simple : finding potential peptide/protein biomarkers to distinguish cases from controls and to enable classification of future samples. While sample size is usually in the magnitude of hundreds, the total number of measured data points is hundreds of thousands and it keeps increasing with rapid advances (e. g., Fourier transform ion cyclotron resonance mass analyzers) in this field, posing a substantial challenge for statistical analysis. Statistical issues in the analysis of MS data can be broadly classified into three categories: pre-processing. feature selection and sample classification, with data visualization being an important part of the overall analysis.

3.1 Pre-processing 3. 1. 1 Mass Alignment In an ideal experiment, all ions will have the same kinetic energy E and will travel through the exact same drift region length. However, some initial kinetic energy distribution wili be present in the ion population and there will be slight spatial variations in the travel length from the target plate which will produce corresponding variations in the traveling time and thus the measured /M/Z ratio for ions with exactly the same mass. This problem is partially solved by using time delayed ion extraction [29, 30J in MALDI-TOF, but as a side effect it also changes the linear relationship between m/z and t2 (i.e., v2 = D2/t2 where D is the distance traveled) in equation (2. 1). A first order approximation can be used : m/z = a + bt2, where et and b are constants for a given set of instrument conditions and are determined experimentally from flight times of ions of at least two known masses (calibrants). In practice. higher order approximations have been proposed to achieve higher accuracy [31]. Even with the use of internal calibration the maximum observed intensity tor an internal calibrant may not occur at exactly the same corresponding m/z value in all spectra. For this reason, spectra can be further aligned based on the maximum observed. intensity of the internal calibranl, after which there are still some problems with local peak shifting. Useful statistical methods need to be developed to address this problem.

3.1. 2 Wide Dynamic Range of Protein/Peptide Intensities Measured protein/peptide concentrations in samples like human serum have a vast dynamic range (more than 10")-fold) that spans from 35-50 mg/ml for serum albumin down to at least 0-5 pg/mt for interleukin 6 [32]. Although mass aligned spectra of serum and othel biolooical samples can be directly analyzed, the relatively large variations in the measured intensities are likely to make most statistical procedures unstable, thus making it more difficult to extract information from the MS dataset. In addition, the large magnitude of the intensities will make most numerical programs unstable.

3. 1.3 Background Noise Chemical and electronic noise produce background fluctuations that are apparent in Figure 1 where they produce a relatively"thick"baseline whicl i. s especially noticeable at lower míz.. I. t's important to remove this background before further analysis. Some local regression methods could be utilized here to estimate the background [33J.

3. 1. 4 Smoothing High frequency noise is one contribution to the background that is apparent in MALDI-MS spectra. Smoothing functions can be used to reduce high-frequency noise, thus minimizing noise spikes and aiding interpretation.

Figure 2: High Frequency Noise in MS 3. 1. t Normalization Normalization of the overall intensities of individual spectra can be used to help ensure that all samples contribute as equally as possible to the search for biomarkers. While normalization has been fully addressed in the context of microarray research [34], most MS-based proteomic analyses do not fully address this problem [3, 4, 5, 7]. Although several normalization approaches are possible. one straightforward approach is to determine a linear normalization factor that will minimize the summed difference between alt observed intensities in an individuat spectrum and the calculated median spectra tor all of the samples. However, the validity of such approaches needs to be rigorously investigated.

3. 1. 6 Peak Identification Intensity measurements from current MS technology tend to be quite noisy-with one source stating that approximately 80% of the data points in spectra derivin. from noise [33]. Therefore. noise tittering is a necessary and indispensable step to allow biomarker identification to focus on those data points that derive from peptide/protein ionization and that might represent useful markers. Some published analysis methods for MS data do not consider peak identification [3], while others only briefly mention this challenge [7].

In summary, pre-processing of raw MS data is arguably a critical step in data analysis.

Statistically sound methods need to be developed to restrict the search for biomarkers to only those regions of each spectrum which actually result from peptide and protein ionization and to better register and normalize among different spectra.

3. 2 Feature Selection and Sample Classification Traditional statistical methods for classification may not be optimal or even appropriate for biomarker identification using MS data because the data is very high-dimensional, with the number of features much larger than the number of samples. As described above, peak identification and other preprocessing procedures can help to reduce the high feature-to-sample ratio that is characteristic of MS data. Our goal is to then find a subset of these features which will provide minimum prediction error. To aid biological interpretation, the size of the selected biomarker subset is usually in the range of five to fifty. The ideal solution to this problem is a combinatorial approach to evaluate all possible feature combinations, however this is not feasible for MS data. Instead, several heuristic methods have been proposed which we believe provide suboptimal solutions. One commonly used method is a two-step approach. The first step involves dimension reduction-which requires using relatively crude criteria to select a smaller subset of "important"features for further analysis. During the second step the data sets that only include "important"features can then be subjected to analysis based on established statistical methods [71. Another direct combinatorial approach using Monte Carlo approximation is proposed in [3].

But upon closer inspection (see Figure 4). the identified subset of 5"biomarkers" [] likely results from background noise rather than from meaningful biological differences between sera from normal versus ovarian cancer patients. The two-step approach is likely to miss important <BR> <BR> <BR> <BR> interactions between features that may offer more accurate predictions, while results from the combinatorial approach are likely to be compromised by random noise due to the large number of feature combinations being examined. Novel statistical methods are needed to address the unique challenges posed by the ever-increasing size of MS datasets-with RandomForest based approaches appearing to show great promise in this regard [6].

Figure 4: SELDI-MS Spectra for randomly selected 8 samples Around 5 identified markers from Petricoin etal (2002) 3.3 Data Visualization There is a great need to develop efLective visualization methods appropriate for comparing large numbers of complex MS datasets and the regions around selected markers. Data visualization is also a challenge for genomics research where this important issue continues to be addressed [35].

Our opinion is that a plot can reveal critical underlying features of the dataset that might otherwise be missed and can also serve as a visual control for a complex statistical analysis.

Certainty, Figure 4 above, which clearly raises serious doubts about the biological validity upon which the Petricoin et al study [3] rests, provides a powerful justification for the complementarity and need for both statistical and visual approaches in biomarker discovery. In this regard, one approach that was used in [6] to quickly explore overall relationship between case and control group is the following median and corresponding difference plot: Figure 5. Median Exploration in MS This median plot immediately reveals potential differences between the control versus case groups-which should correlate well with the n/z identities of the biomarkers identified by statistical approaches.

The utilization of color in visualizing data has proved to be a very effective approach in genomics studies [36], which can also be used as a diagnostics tool in the context of biomarker discovery using MS data [37]. Figure 6 displays a plot of 50 pre-selected important features of the ovarian cancer dataset as described in [6]. In the plot, red dots are used to code positives vahjes, green dots for negative values and black dots for zero values, with their saturations further defining relative magnitudes. Figure 7 displays the same plot for 50 randomly selected features. Comparing these two plots. the value of those pre-selected features in Figure 6 is obvious and they are worth further checking.

Figure 6: pre-selected 50 improtant features Figure 7: random-selected 50 features It is also important to explore the local behavior of selected biomarkers to guard against ta) se positives, as illustrated in Figure 4.

Therefore, there is a great need for novel visualization approaches, many of which are likely to come from other fields that are also actively devising realistic approaches to deal with intbrmation overload.

4. Discussion The pending completion of the human and many other genomes as well as substantial recent progress in both MS instrumentation and methodologies for quantitative proteome profiting provides enormous opportunities for bringing the power of this nascent technology to bear on biological and biomedical research. It is perhaps worthwhile, however, to remind that the challenges (e. g. , virtually all mammalian proteins are subject at some time in their"life span"to a myriad possible array of posttranslational modifications which often play a critical role in protein function and which cannot be predicted from mRNA expression analysis) posed by proteomics far supersede those of mRNA expression analysis and the current capabilities of the latter technology far supersede those of proteomics. Hence, while it is now possible to routinely assay fol the relative level of expression of 30, 000 or more human genes at the mRNA level. the larrest number of human proteins profiled by the isotope coded affinity tag (ICAT)/mass spectrometric approach is the 491 proteins contained in microsomal fractions of naïve and in vitro differentiated human myeloid leukemia cells [18]. In this regard the disease biomarker approach, which concentrates on identifying on) y those peptides and proteins that are differentially expressed in large numbers of control versus disease samples, would seem to offer an attractive approach for bridging the currently large gap between the capabilities of proteome versus mRNA expression technology. Certainly, mRNA-based disease biomarker studies like that carried out on patients with breast cancer [16] provide important lessons for protein disease biomarker discovery. That is, to identify a gene expression signature that was strongly predictive of a poor prognosis required that DNA microarray analysis be carried out on primary breast tumors from 117 patients and that the"signature"represent the relative level of expression of 70 genes. Perhaps most importantly, it should be noted that these mRNA biomarkers were identified from direct analysis of primary breast tumors-not from that of a distant tissue.

Three implicit "lessons" from the breast cancer mRNA gene expression study for protein/peptide biomarker discovery are that it is likely that: 1) Relatively large numbers of control and disease samples will need to be analyzed to obviate biologica. l diversity.

2) Accurate classification of control versus disease samples will require the use of several biomarkers.

It is likely to be far easier to identify protein/peptide biomarkers in the cells and tissues in which they originate rather than a (ter they have been diluted substantially and mixed with the >10 m. illion protein sequences (i. e., including immunoglobulins) and the 10 order of magnitude range of protein concentrations expected in human plasma and serum.

Given the latter hypothesis and the perhaps l00-fold dynamic range of a single MALDI spectrum of unfi-actionated serum, we believe that most naturally occurring serum protein/peptide biomarkers of human disease will escape detection-as we believe has happened in the Petricoin et al study [19]. One obvious approach to substantially increase the likelihood of finding useful biomarkers is to look for them in their cell or tissue of origin rather than in serum and indeed, Yanagisawa et al [) 7] report finding useful markers for non-small-cell lung cancer by clii-eci MALDI-MS analysis of tissue sections. Ultimately, however. to reach far deeper into the serum (which is the most complex human proteome) and other proteomes will require substantial extension of the dynamic range of the MS analyses that : form the basis for biomarker discovery.

This, in turn, will require substantial fractionation of disease biomarker samples which will result in having to analyze many MS spectra/sample - thus multiplying manifold over the challenges outlined above in extracting usefu) biomarker information from an already large number of data points in even a single MS spectrum. While similar approaches may be used to reprocess and to analyze 100 (e. g. as might be produced by subjecting a sample of serum to reverse phase LC followed by automated spotting by an LC/MALDI spotter of the resuming ! 00 fractions onto a MALDI-MS target plate) as opposed to only a single MALDI-MS spectra'sample, the latter task may well require considerable optimization of the underlying program codes and perhaps, the use of a parallel computing solution.

As indicated at the beginning of this review. we have only focused on a very specific topic in proteomics research. Even for this narrowly defined area, there are many statistical challenges that need to be addressed through concerted efforts between statisticians, mass spectroscopists and biomedical researchers. We believe that proteomics is destined. to play a central role in advancing our knowledge of very complex biological systems, and that rigorous and powerful statistical methods need to be developed to fully utilize the large amounts of information arising from proteomics studies.

Acknowledgments. This project has been funded in part with Federal funds from the National Heart. Lung, and Blood Institute. National Institutes of Health, under contract No. NOl-HV- 28186, and the NSF grant DMS 0241160.

References I. M. R. Wilkins, I. C. Sanchez, A. A. Gooley, R. O. Appel, I. Humphrey-Smith, D. F. Hochstrasser and K. L. Williams (1995), Progress with Proteome Projects: Why All Proteins Expressed by a Genome Should be identified and How to Do It. Biotech. Gen. Eng. Rev. 13, 19-50 2. MikeTyers and Matthias Mann (2003). From genomics to proteomics. Nature Insight : Proteomics 422, 193-197 3. Pelricoin, E. F., Ardekani, A. M. , Hitt, B. A, Levine, P. J., Fusaro, V. A., Steinberg, S. M., Mills, G. B..

Simine, C. , Fishman, D. A., Kohn, E. C., and Liotta, L. A. (2002) Use ofproteomic patterns in serum to identify ovarian cancer. The Lancet 359, 572-77 4. Adam, B. L., Vlahou, A., Semmes, J. O., Wright,. Ir. G. L. (2001) Proteomic approaches to biomarker discovery in prostate and bladder cancers. Proteomics 1, 1264-1270 5. Li, J., Zhang, Z., Rosenzweig, J., Wang, Y. Y.. Chan, D. W. (2002) Proteomics and Bioinformatics Approaches for Identification of Serum Bioma. rkers to Detect Breast Cancer. Clinical chemistry 48:8, 1296-1304 6. Wu, B., Abbott, T., Fishman, D., McMurray, W., Mor, G., Stone, K., Ward, D., Williams, K. , and Zhao, H. (2003) Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics, 19, 1636-1643 7. Bao-Ling Adam, Yinsheng Qu, John W. Davis, Michael D. Ward, Mary Ann Clements, Lisa H.

Cazares, U.. lohn Semmes, Paul F. Schellhammer. Yuta. ka Yasui, Ziding Feng, and George L. Wright, . Ir (2002). Serum Protein Fingerprinting Coupled with a Pattern-matching Algorithm Distinguishes Prostate Cancer from Benign Prostate Hyperplasia and Healthy Men. Cancer Res. 62 : 3609-3614 8. Ruedi Aebersold and Matthias Mann (2003). Mass spectrometry-based proteomics. Nature Insight: Proteomics 422, 198-207 9. Eric Hizicky, Philippe I. H. Bastiaens, Heng Zhu, Michael Snyder and Stanley Fields (2003). Protein analysis on a proteomic scale. Nature Insight: Proteomics 422, 208-215 ! 0. Andrej Sati, Robert Gfaeser, Thomas Earnest and Wolfgang Baumeister (2003). From words to literature in structural proteomics. Nature Insight: Proteomics 422, 216-225 I I. Sam Hanash. Disease proteomics. Nature Insight: Proteomics 422, 226-232 1 Mark S. Boy, ski and Martin W. Mcintosh (2003). Biomedical informatics for proteomics. Nature Insight: Proteomics 422, 226-232 13. Gygi, SP, Rochon, Y, Franza, BR & Aebersold, R (1999). Correlation between protein and mRNA abundance in yeast.. kfol Cell Biol 19, 1720-30 14. Gygi, SP, Corthals, GL, Zhang, Y, Rochon, Y & Aebersotd, R (2000) Evaluation ofi two-dimeosional gel electrophoresis-based proteome analysis technology. Proc. Natl Acad Sci USA, 97, 9390-5 15. Tonge, R, Shaw, J, Middleton, B, Rowlinson, R, Rayner, S, Young, J, Pognan, F, Hawkins, E, Currie, I & Davison, M (2000) : Validation and development of fluorescence two-dimensional differential gel etectrophoresis proteomics technology. Proteomics 1, 377-96.

16. Gharbi,S, Gaffney, P, Yang, A, Zvelebil. MJ, Cramer, R, Waterfield, MD & Timms,.) JF (2002) Evaluation of two-dimensional differential gel electrophoresis for proteomic expression analysis of a model breast cancer cell system. Mol Cell Proteomics 1, 91-8 17. Zhou. G, Li, H, DeCamp, D, Chen, S, Shu, H, Gong, Y, Flaig, M, Gillespie, JW, Hu, N, Taylor, PR et al. (2002) 2D differential in-gel electrophoresis for the identification of esophageal scans cell cancer-specific protein markers. Mol. Cell Proteomics 1,117-24 IS. Han DK, Eng, J, Zhou, H & Aebersold, R (2001) Quantitative profiling of differentiation-induced microsoma) proteins using isotope-coded affinity tags and mass spectrometry. Nat Biotechnol 19, 946-5) 19. Washburn, MP, Ulaszek, R, Deciu, C, Schieltz, DM & Yates, JR, 3rd (2002): Analysis of quantitative proteomic data generated via multidimensional protein identification technology. Anal Chem 74, 1650-7 20..). «. Eng, A. L. McCormack and J. R. Yates, III (1994) An Approach to Correlate Tandem Mass Spectra. l Data of Peptides with Amino Acid Sequences in a Protein Database. J. Am. Soc. Mass Spectrom. 5, 976-989 <BR> <BR> <BR> <BR> 21. Kemmeren, P., van Berkum, N.L., Vilo, J., Bijma, T. , Donders, R., Brazma, A. , and Holstege, F. C. P.

(2002). Protein interaction verification and functional annotation by integrated analysis of genome- scale data. Mol. Cell 9, 1133-1143 W G Krebs. J Tsai, V Alexandrov, N Echols, J Junker, R Jansen, M Gestein. (2003). Studying Protein Flexibility in a Statistica. l Framework Tools and Databases for Analyzing Structures and Approaches for Mapping this onto Sequences. Methods in Enzymology (in press) 23. Fenn, J.B., Mann, M., Meng, C. K., Wong, S. F. & Whitehouse, C. M (1989). Electrospray ionization for the mass spectromelry of large biomolecules. Science 246, 64-71 4. Karas, M. & Hillenkamp, F (1988). La. ser desorption ionization of proteins with molecular mass exceeding 10000 daltons. Anal. Chem. 60, 2299-2301 25. Siuzdak G. (1996). Mass spectrometry for Biotechnology, Academic Press 6. Siuzdak G (2003). The Expanding Role of Mass Spectrometry in Biotechnology, MCC Press. San Diego 27. Issaq, HJ, Veenstra, TD. Conrads, TP & Felschow, D (2002): The SELDI-TOF MS approach to proteomics: protein profiling and biomarker identification. Biochem Biophys Res Commun 292, 587- 92 28. Rockhill, B (2002). Proteomic pattens in serum and identification of ovarian cancer. The Lancet 360, 169-170 29. Randy M. Whittal and Liang Li (1995). High-Resolution Matrix-Assisted Laser Desorptioolonization in a Lineal~ Time-of Flight Mass Spectrometer. Anal. Chem. 67, 1950-54 30. Robert S. Brown and John J. Lennon (1995). Mass Resolution Improvementt by Incorporation of Pulsed lon Extraction in a Matrix-Assisted Laser Desorption/Ionization Linear Time-of-Flight Mass Spectrometer. Anal. Chem. 67, 1998-2003 31. Johan Gobom, Martin Mueller, Volker Egelhofer, Dorothea Theiss, Hans Lehrach, and Eckhard Nordhoff (2002). A Calibration Method That Simplifies and Improves Accurate Determination of Peptide Molecular Masses by MALDI-TOF MS. Anal. Chem. 74, 3915-3923 Anderson, NH and Anderson, NG (2002): The human plasma proteome. Mol. & Cell. Proteomics 1, 845-867 33. Micromass#MASSLYNXTM 3.5 user guide 34. S. Dudoit. Y. H. Yang, P. Luu, D. M. Lin, V. Peng, J. Ngai, and T. P. Speed (2002). Normalization for cDNA microarray data: a robust composite method addressing single and Inultiple slide systematic variation. Nucleic Acids Research, Vol. 30, No. 4, el 5 35. Balls P (2002). Data visualization : picture this. Nature 418, 11-13 36. Eisen MB, Spellman PT, Brown PO and Botstein D (1998). Cluster analysis and disp ! ay ofgenome- wide expression patterns. Proc Nall Acad Sci USA 95, 14863-68 37. Kiyoshi Yanagisawa, Yu Shyr, Baogang J Xu, Pierre P Massion, Paul 1-l Larsen, Bill C White,. tohn R Robots, Mary Edgerton, Adriana Gonzalez, Sorena Nadaf, Jason H Moore, Richard M Caprioli and David I'Carbone (20 (13). Proteomic patterns of tumour subsets in non-small-cell lung cancer. Lancet 362, 433-439