

Title:
INTEGRATED SPECTRAL DATA PROCESSING, DATA MINING, AND MODELING SYSTEM FOR USE IN DIVERSE SCREENING AND BIOMARKER DISCOVERY APPLICATIONS
Document Type and Number:
WIPO Patent Application WO/2004/038602
Kind Code:
A1
Abstract:
An integrated, modular, automated computer software based system and method for drug discovery, biomarker discovery and drug screening, and other diverse applications, including proteomics and metabonomics. The system provides for automated processing of raw spectral data (10), data standardization, reduction of data to modeling form (14), and unsupervised and supervised model building, visualization, analysis and prediction (15). The system incorporates data visualization tools and enables the user to perform visual data mining, statistical analysis and feature extraction. The system fully integrates with other laboratory computer systems that may be present in the laboratory, including instrumentation control and raw data storage systems, laboratory information management systems, and off-the-shelf third party modeling and statistical analysis software (17).

Inventors:
BAKER J DAVID (US)
Application Number:
PCT/US2003/026346
Publication Date:
May 06, 2004
Filing Date:
August 22, 2003
Assignee:
WARNER LAMBERT CO (US)
BAKER J DAVID (US)
International Classes:
G06F19/00; G06F19/24; G06F19/28; (IPC1-7): G06F17/10
Foreign References:
US5339034A (1994-08-16)
Other References:
PORTER D.A. ET AL.: "A model fitting approach to baseline distortion in the frequency domain analysis of MR spectra", IEEE COLLOQUIUM ON TECHNICAL DEVELOPMENTS RELATING TO CLINICAL NMR IN THE UK, January 1991 (1991-01-01), pages 13/1 - 13/3, XP002974781
LI Y. ET AL.: "A high-resolution technique for multidimensional NMR spectroscopy", IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, vol. 45, no. 1, January 1998 (1998-01-01), pages 78 - 86, XP000727430
Attorney, Agent or Firm:
Fairhall, Thomas A. (300 South Wacker Drive Suite 320, Chicago, IL, US)
Claims:
CLAIMS

I claim:
1. An integrated spectral data processing, data mining, and modeling system for use in diverse screening and biomarker discovery applications, said system for use in conjunction with an analytical spectrographic instrument collecting data from a chemical or biological sample, the system comprising: a general purpose computer system; and a machine readable storage medium containing a set of instructions for said general purpose computer system, said instructions integrating the following four modules into an integrated spectral data processing system, comprising: (1) a module operating on raw data from files created by said analytical spectrographic instrument and storing raw processed data in a file; (2) a module operating on said raw processed data and containing instructions for providing data standardization of said raw data and storing standardized individualized spectral data in a file and/or a library of files; (3) a module operating on said standardized individualized spectral data and containing instructions for responsively reducing said individualized spectral data into a modeling form and storing said modeling form of said data in a file; (4) a module operative on said data reduced to modeling form and containing instructions providing a user of said system with tools for performing model building, visualization, analysis and/or prediction of said data; and a tracking database containing the results of said model building, visualization, analysis, and/or prediction of said data.
2. The system of claim 1, wherein said machine readable storage medium further comprises a set of instructions for exporting data reduced to modeling form to commercially available modeling and analysis software in a format compatible with said commercially available modeling and analysis software.
3. The system of claim 1, wherein said set of instructions further comprises a process refinement module operative on said processed raw data and responsively generating sample and processing parameters, and wherein said tracking database further comprises a file storing said sample and processing parameters.
4. The system of claim 1, wherein said system further comprises a visual data mining, statistical analysis and feature extraction module providing a user of said system with a set of one or more visual tools displayable on a user interface of said general purpose computer system for visualizing said standardized individual spectra.
5. The system of claim 4, wherein said tracking database further comprises a file storing visualization results from the use of said visual data mining, statistical analysis and feature extraction module.
6. The system of claim 1, wherein said system further comprises a laboratory information management system containing a database containing at least one of the following: outcomes and/or endpoints associated with said chemical or biological sample, and wherein said module (4) is provided with access to said laboratory information management system database.
7. The system of claim 1, wherein said module (1) comprises a first set of routines for automated processing of nuclear magnetic resonance spectroscopy data and a second set of routines for processing mass spectrometry data.
8. The system of claim 7, wherein said first set of routines comprises a routine converting nuclear magnetic resonance (NMR) data to frequency domain data and a phasing routine comprising an algorithm performing a deconvolution of arbitrary frequency phase shift and associated frequency dependent distortions in said frequency domain data.
9. The system of claim 7, wherein said first set of routines comprises a routine setting NMR processing parameters, a routine converting raw NMR data to a system format and storing formatted NMR data to a sample and processing parameters file in said tracking database, a routine converting nuclear magnetic resonance data to frequency domain data, a phasing routine comprising an algorithm performing a deconvolution of arbitrary frequency phase shift and associated frequency dependent distortions in said frequency domain data, a routine applying frequency domain corrections, and a save routine saving processed form of raw data in a processed raw data file.
10. The system of claim 7, wherein said second set of routines comprises a routine converting raw data to a system format and a linearization routine linearizing the mass axis of said spectroscopic data.
11. The system of claim 10, further comprising a routine applying corrections to spectroscopic data in which the mass axis has been linearized and a save routine saving processed mass spectroscopic data in a processed raw data file.
12. The system of claim 1, wherein said data standardization module comprises a set of instructions comprising a resampling/resolution matching algorithm correlating measured intensities in said raw data at discrete values.
13. The system of claim 12, wherein said data standardization module further comprises a routine applying baseline and/or smoothing corrections to said raw data after processing by said resampling/resolution matching routine.
14. The system of claim 12, wherein said data standardization module comprises a set of instructions applying intensity normalization to said raw data.
15. The system of claim 12, wherein said data standardization module further comprises one or more routines generating a library of standardized spectral data.
16. The system of claim 15, wherein said one or more routines comprises a routine grouping data and generating feature selection tables for inclusion in said library of standardized spectral data.
17. The system of claim 1, wherein said module (3) reducing said individualized spectral data to modeling form comprises a routine for selection of a modeling form; and at least one of the following routines depending on the selection of a modeling form: a data segmentation routine, a peak selection and deconvolution routine, a wavelet decomposition routine, and a component selection routine.
18. The system of claim 17, wherein said module (3) further comprises a routine applying fuzzy clustering methods reducing spectra into subgroups of memberships.
19. The system of claim 17, wherein said module (3) further comprises a multiway and evolving factor analysis routine for samples subjected to combination separation-spectroscopy techniques, or for samples that evolve in time with spectra subsequently taken at designated time intervals.
20. The system of claim 1, wherein said model building, visualization, analysis and prediction module (4) comprises a routine prompting a user to select a screening prediction option or model building option for said data reduced to modeling form, a first set of routines for screening prediction and a second set of routines for model building.
21. The system of claim 20, wherein said first set of routines for screening prediction comprises one or more routines supplying a classification or prediction model from a file storing one or more classification or prediction models, applying the stored classification or prediction model to said data reduced to modeling form, and saving the results of said application in said tracking database.
22. The system of claim 20, wherein said second set of routines for model building comprises a routine prompting a user to select either supervised or unsupervised model building techniques to be used for said data reduced to modeling form, a third set of routines for applying supervised model building algorithms to said data reduced to modeling form and a fourth set of routines applying unsupervised model building algorithms to said data reduced to modeling form.
23. The system of claim 22, wherein said third set of routines applying supervised model building algorithms to said data reduced to modeling form comprises one or more of the following routines: principal components regression, least squares regression, partial least squares regression, discriminant analysis, soft independent modeling by class analogy (SIMCA), hierarchical principal components analysis, neural network classification and prediction, K nearest neighbors analysis, Kernel methods for very large datasets generated for high-throughput screening applications, and hybrid methods, and wherein said one or more routines in said third set of routines are selectable by the user of said system.
24. The system of claim 22, wherein said fourth set of routines applying unsupervised model building algorithms to said data reduced to modeling form comprises one or more of the following routines: principal components analysis, hierarchical cluster analysis, self-organization mapping, nonlinear mapping, evolving factor, batch or curve resolution analysis, and hybrid methods, and wherein said one or more routines in said fourth set of routines are selectable by the user of said system.
25. The system of claim 4, wherein said visual data mining, statistical analysis and feature extraction module comprises a routine prompting a user to select a visualization mode, and wherein said one or more visualization tools comprise an outlier analysis visual tool, a stack plot visualization analysis tool, an image view tool, and a residual magnitude tool.
26. In an integrated spectral data processing system for use in conjunction with an analytical spectrographic instrument collecting spectrographic data from a chemical or biological sample and storing said data in one or more output files, a system for automated processing of raw spectral data, comprising: a general purpose computer system; and a machine readable storage medium containing a set of instructions for said general purpose computer system, said instructions comprising a first set of routines for automated processing of nuclear magnetic resonance spectroscopy data and a second set of routines for automated processing of mass spectrometry data, wherein said first set of routines comprises a routine converting nuclear magnetic resonance (NMR) data to frequency domain data and a phasing routine comprising an algorithm performing a deconvolution of arbitrary frequency phase shift and associated frequency dependent distortions in said frequency domain data, and a tracking database containing the sample processing data and processing parameters.
27. The system of claim 26, wherein said first set of routines further comprises a routine setting NMR processing parameters, a routine converting raw NMR data to a system format and storing formatted NMR data to a sample and processing parameters file in said tracking database, a routine applying frequency domain corrections, and a save routine saving processed form of raw data in a processed raw data file.
28. The system of claim 26, wherein said second set of routines comprises a routine converting raw mass spectroscopic data to a system format and a linearization routine linearizing the mass axis of said mass spectroscopic data.
29. The system of claim 28, further comprising a routine applying corrections to spectroscopic data in which the mass axis has been linearized and a save routine saving processed mass spectroscopic data in a processed raw data file.
30. In an integrated spectral data processing system for use in conjunction with an analytical spectrographic instrument collecting spectroscopic data from a chemical or biological sample, a system for data standardization of raw spectral data, comprising: a general purpose computer system; memory accessible to said general purpose computer system storing raw processed spectral data, and a machine readable storage medium containing a set of instructions for said general purpose computer system, said instructions comprising a resampling/resolution matching algorithm correlating measured intensities in said raw spectral data at discrete values, a routine applying baseline and/or smoothing corrections to said raw data after processing by said resampling/resolution matching routine, and a routine storing standardized individual spectral data in a file and/or a library; and one or more databases containing the sample data and processing parameters.
31. The system of claim 30, wherein said data standardization module further comprises a set of instructions applying intensity normalization to said raw data.
32. The system of claim 30 wherein said data standardization module further comprises one or more routines generating a library of standardized spectral data.
32. The system of claim 30, wherein said data standardization module further comprises one or more routines generating a library of standardized spectral data.
34. In an integrated spectral data processing system for use in conjunction with an analytical spectrographic instrument collecting data from a chemical or biological sample, a system for visualization and analysis of standardized individual spectra, comprising: a general purpose computer system; a memory storing standardized individual spectral data and sample and processing parameters, a machine readable storage medium containing a set of instructions for said general purpose computer system, said instructions comprising a routine prompting a user to select a visualization mode, and routines comprising one or more visualization tools, said visualization tools selected from the group of tools consisting of an outlier analysis visual tool, a stack plot visualization analysis tool, an image view tool, and a residual magnitude tool, and a routine for storing results of use of said tool in a visualization results database.
35. In an integrated spectral data processing system for use in conjunction with an analytical spectrographic instrument collecting data from a chemical or biological sample, a system for reduction of standardized spectral data to a modeling form, comprising: a general purpose computer system; a memory accessible to said general purpose computer system storing standardized spectral data; a machine readable storage medium containing a set of instructions for said general purpose computer system, said instructions comprising a routine for selection of a modeling form, and at least one of the following routines depending on the selection of modeling form: a data segmentation routine, a peak selection and deconvolution routine, a wavelet decomposition routine, and component selection routine; and a database storing said data reduced to modeling form.
36. The system of claim 35, further comprising a routine applying fuzzy clustering methods reducing spectra into subgroups of memberships.
37. The system of claim 35, further comprising a multiway and evolving factor analysis routine for samples subjected to combination separation-spectroscopy techniques, or for samples that evolve in time with spectra subsequently taken at designated time intervals.
38. In an integrated spectral data processing system for use in conjunction with an analytical spectrographic instrument collecting data from a chemical or biological sample, a system for model building, data visualization, data analysis and prediction, comprising: a general purpose computer system; a memory storing data reduced to a modeling form; a machine readable storage medium containing a set of instructions for said general purpose computer system, said instructions comprising a routine prompting a user to select a screening prediction option or model building option for said data reduced to modeling form, a first set of routines for screening prediction and a second set of routines for model building; wherein said routines screen said data or build a model from said data, and wherein the system further comprises a tracking database storing the results of the screening of said data and/or the models constructed by said routines for model building.
39. The system of claim 38, wherein said first set of routines comprises one or more routines supplying a classification or prediction model from a file storing one or more classification or prediction models, applying the stored classification or prediction model to said data reduced to modeling form, and saving the results of said application in a tracking database.
40. The system of claim 39, wherein said second set of routines comprises a routine prompting a user to select either supervised or unsupervised model building techniques to be used for said data reduced to modeling form, a third set of routines for applying supervised model building algorithms to said data reduced to modeling form and a fourth set of routines applying unsupervised model building algorithms to said data reduced to modeling form.
41. The system of claim 40, wherein said third set of routines applying supervised model building algorithms to said data reduced to modeling form comprises one or more of the following routines: principal components regression, least squares regression, partial least squares regression, discriminant analysis, soft independent modeling by class analogy (SIMCA), hierarchical principal components analysis, neural network classification and prediction, K nearest neighbors analysis, Kernel methods for very large datasets generated for high-throughput screening applications, and hybrid methods, and wherein said one or more routines in said third set of routines are selectable by the user of said system.
42. The system of claim 40, wherein said fourth set of routines applying unsupervised model building algorithms to said data reduced to modeling form comprises one or more of the following routines: principal components analysis, hierarchical cluster analysis, self-organization mapping, nonlinear mapping, evolving factor, batch or curve resolution analysis, and hybrid methods, and wherein said one or more routines in said fourth set of routines are selectable by the user of said system.
43. The system of claim 1, wherein said data standardization module further comprises a data reconstruction routine for reconstructing missing or corrupted data.
44. The system of claim 43, wherein said data reconstruction routine applies a predictive modeling algorithm to reconstruct said missing or corrupted data.
45. The system of claim 44, wherein said data reconstruction routine comprises: a) a first routine collecting standardized spectral data having a corrupted or missing spectral region; and b) a second routine applying a multivariate regression model to said standardized spectral data to predict said missing or corrupted data based on available data.
46. The system of claim 30, wherein said data standardization module further comprises a data reconstruction routine for reconstructing missing or corrupted data.
47. The system of claim 46, wherein said data reconstruction routine applies a predictive modeling algorithm to reconstruct said missing or corrupted data.
48. The system of claim 47, wherein said data reconstruction routine comprises: a) a first routine collecting standardized spectral data having a corrupted or missing spectral region; and b) a second routine applying a multivariate regression model to said standardized spectral data to predict said missing or corrupted data based on available data.
49. The system of claim 3, wherein said processing refinement module further comprises a calibration module for calibrating said data.
50. The system of claim 49, wherein said calibration module comprises a set of instructions: a) selecting, either automatically or using operator involvement, a group of bands to be calibrated; b) selecting, either automatically or using operator involvement, a reference spectrum for said group of bands; c) normalizing each band in said group of bands; d) selecting, either automatically or using operator involvement, a calibration error function to apply to said group of bands; and e) applying said calibration error function to said normalized bands to thereby calibrate said bands.
51. The system of claim 49, wherein said calibration module comprises a set of instructions: a) selecting, either automatically or using operator involvement, a group of bands to be calibrated; b) selecting, either automatically or using operator involvement, a reference spectrum for said group of bands; c) selecting peak positions for said reference spectrum and for each of said bands in said group of bands; and d) calculating a regression correction for each of said bands in said group of bands.
52. In an integrated spectral data processing, data mining, and modeling system for use in diverse screening and biomarker discovery applications, said system for use in conjunction with an analytical spectrographic instrument collecting data from a chemical or biological sample, a processing refinement system for performing a calibration of said data comprising: a general purpose computer system; and a machine readable storage medium containing a set of instructions for said general purpose computer system, said instructions comprising instructions: a) selecting, either automatically or using operator involvement, a group of bands in said data to be calibrated; b) selecting, either automatically or using operator involvement, a reference spectrum for said group of bands; c) normalizing each band in said group of bands; d) selecting, either automatically or using operator involvement, a calibration error function to apply to said group of bands; and e) applying said calibration error function to said normalized bands to thereby calibrate said bands.
53. In an integrated spectral data processing, data mining, and modeling system for use in diverse screening and biomarker discovery applications, said system for use in conjunction with an analytical spectrographic instrument collecting data from a chemical or biological sample, a processing refinement system performing a calibration of said data comprising: a general purpose computer system; and a machine readable storage medium containing a set of instructions for said general purpose computer system, said instructions comprising instructions: a) selecting, either automatically or using operator involvement, a group of bands to be calibrated; b) selecting, either automatically or using operator involvement, a reference spectrum for said group of bands; c) selecting peak positions for said reference spectrum and for each of said bands in said group of bands; and d) calculating a regression correction for each of said bands in said group of bands.
Description:
Integrated Spectral Data Processing, Data Mining, and Modeling System For Use in Diverse Screening and Biomarker Discovery Applications

NOTICE REGARDING COPYRIGHT

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

BACKGROUND OF THE INVENTION

A. Field of the Invention

This invention relates generally to the fields of new drug discovery, drug screening and biomarker discovery. More particularly, the invention relates to an integrated computer system including software that obtains raw data of chemical and biological samples from one or more analytical instruments. The inventive computer system and associated software integrates the entire process of data processing, standardizing the data, visualizing the data, reducing the data to modeling form, and analyzing, modeling, and screening the data. The analytical instrument(s) supplying data to the system will typically take the form of a spectrometer, such as, for example, a mass spectrometer or proton nuclear magnetic resonance (¹H-NMR) spectrometer.

B. Description of Related Art

In the last few years much attention has been devoted to the detection and validation of biomarkers for disease, toxicity, drug efficacy, mechanism of action, etc. The majority of these methods consist of generating, mining, and modeling large numbers of discrete measurements of chemical, biological, and physical attributes. While potentially very large in number, gene-chip based screening methods fall into this category. Recently, spectroscopy (e.g., nuclear magnetic resonance (NMR) and mass spectroscopy (MS), including Surface-Enhanced Laser Desorption and Ionization (SELDI)) has been used as a screening and profiling tool for biological systems, including metabonomics, i.e., the quantitative measurement of the multiparametric metabolic response of living systems to pathophysiological stimuli or genetic modification, and proteomics, i.e., the quantitative measurement of the production of proteins of an organism.

While spectroscopy does generate a large number of measurements, these measurements do not represent potentially independent discrete entities or variables. Spectral data usually represents a discrete sampling of a continuum. As such, the nature of the data is very different from a large collection of discrete measurements, which are potentially independent. Methods tailored for mining data sets of discrete variables are not necessarily appropriate for spectral data where the discrete points do not represent variables.

In addition, since spectra do represent a discrete sampling, the number and position of the measurements is somewhat arbitrary, provided the point spacing is adequate to describe the sharpest features. The result is that two spectral measurements of the same sample need not measure discrete values at the same point positions, yet the information content of the two spectra is nevertheless identical. Another aspect of spectral data is that spectral features (resonances, bands, masses, etc.) are broad and span multiple discrete measurements (see Figure 8).
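To make this concrete, here is a minimal sketch (illustrative Python with numpy; the Lorentzian band and the two sampling grids are assumptions, not data from this application): the same band sampled on two different grids is brought onto a common axis by interpolation, after which the two measurements agree point for point.

```python
import numpy as np

def lorentzian(x, center, width):
    """Typical theoretical line shape for a pure-component spectral feature."""
    return width**2 / ((x - center)**2 + width**2)

# The same underlying band, sampled on two different (but adequate) grids.
axis_a = np.linspace(0.0, 10.0, 501)    # 0.02-unit point spacing
axis_b = np.linspace(0.05, 10.0, 400)   # coarser spacing, offset start
spectrum_a = lorentzian(axis_a, 5.0, 0.3)
spectrum_b = lorentzian(axis_b, 5.0, 0.3)

# Resample both measurements onto a common reference axis; plain linear
# interpolation stands in here for a resampling/resolution-matching step.
common_axis = np.linspace(0.1, 9.9, 450)
resampled_a = np.interp(common_axis, axis_a, spectrum_a)
resampled_b = np.interp(common_axis, axis_b, spectrum_b)

# The two resampled spectra now agree point for point (up to a small
# interpolation error), reflecting their identical information content.
print(np.max(np.abs(resampled_a - resampled_b)))
```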

Spectral features usually have a known theoretical shape (e.g., Gaussian or Lorentzian) for pure components. Through the measurement process, the actual data gets convoluted due to the limitations (finite measurement windows, instrument tune, electronic response time) inherent in instrumental design. As illustrated in Figure 8, convolution usually has the effect of broadening spectral features. Replicate measurements may portray the same spectral band with different widths. Spectral bands are also subject to environmental effects such as temperature, pH, total solids, metal ion concentration, etc. These effects can change both band position and shape. Obviously, calibration defects in the measurement process can cause an apparent shift in band position. Other factors, such as noise, interference from other constituents, and differing backgrounds ("baseline"), also contribute to spectral features.
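The broadening effect can be sketched numerically as well (the band and instrument-response widths below are illustrative assumptions): convolving a Gaussian band with a Gaussian instrument response markedly widens the observed feature.

```python
import numpy as np

axis = np.linspace(-5.0, 5.0, 1001)
dx = axis[1] - axis[0]

def gaussian(x, sigma):
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

true_band = gaussian(axis, 0.2)              # ideal pure-component line shape
instrument_response = gaussian(axis, 0.4)    # finite-resolution instrument

# Convolving the true band with the instrument response gives the
# observed (broadened) band, as the text describes.
observed = np.convolve(true_band, instrument_response, mode="same") * dx

def fwhm(x, y):
    """Full width at half maximum of a single symmetric peak."""
    above = x[y >= y.max() / 2.0]
    return above[-1] - above[0]

print(fwhm(axis, true_band))   # ~0.47
print(fwhm(axis, observed))    # ~1.05, broadened by the measurement
```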

Unless these processes are taken into consideration in the processing and modeling of spectral data, erroneous conclusions can be obtained. It is clear that discrete measurements obtained from spectral data cannot be treated as variables, and appropriate tools suited for data mining spectral data are needed. The actual "variables" in spectroscopy are indeterminate prior to analysis, and the ultimate ability to assign variables is determined by the diversity in the samples as expressed in the measurement process.

While the application of spectral methods for biological screening is steadily growing, instrument software is designed for highly interactive manipulation of individual spectra. In addition, the parameters chosen for processing raw spectral data can have a large effect on the results obtained from subsequent modeling. There exists a need for an integrated system from raw data processing through modeling and visualization tools so that processing parameters can be optimized for particular modeling and screening methods.

While instrument software does not provide adequate data mining and modeling capability, software systems that perform data mining, modeling, and analysis are not suited to deal with spectral data. This also indicates the need for a more completely integrated system.

There is a growing trend to take a holistic approach to profiling systems based on combining data from multiple techniques. This also indicates a need to merge data from multiple sources into a common format for modeling.

The present invention was designed to meet the needs and overcome the shortcomings of the prior art described above. NMR based screening methods such as metabonomics, metabolite profiling, or ligand-binding assays are currently limited by tedious manual data manipulations between multiple software packages, scripts, and operating systems. Mining and analysis of mass spectral data used for proteomics (e.g., MS using SELDI or MALDI techniques) also suffers from integration and processing deficiencies. The present invention provides a complete, integrated, automated path for handling these types of screening and biomarker discovery paradigms. The system is modular and flexible, can be deployed for multiple data types, and can integrate and correlate data from multiple sources.

Previous software applications for processing spectral data have been centered on highly interactive, detailed investigations of a few samples. Software that can be operated in high-throughput mode for screening and data mining of spectral data has heretofore eluded the art. State of the art research, using the prior art software tools, has required awkward handoff of data between multiple software systems, often running on multiple operating systems. Such systems generally lacked the ability to reverse the process in order to mine the data in a different way. Moreover, the commercial developers of spectral software tools have been slow to grasp the significance of screening and the need for powerful modeling tools. Additionally, developers of modeling tools have been slow to grasp the need to integrate their applications with the path to raw data, and not just reduced data. The art, in other words, has lacked an ideal system which provides for seamless integration of the workflow from the processing of the raw data, through reduction to a data mining form, through modeling and data mining.

There is clearly a need for an integrated system that can manage and capture the information flow from raw instrument data to modeling and mining applications. In addition, there is a need for a software system providing for integration of spectroscopy specific processing into the information workflow, as such an integration allows for seamless mining of spectral data under on-demand processing conditions. Integration also allows the automation of processing and analysis. With automation, screening applications become possible.

A system meeting these objectives, for which there has been a long-felt but unsolved need in the art, is the subject of the present invention. Preferred aspects of the present invention recognize that the ability to query and mine spectral data is inherently linked to the path from raw data through modeling. The ability to change this path and generate new models for data mining is integrated into the present system. Because the present invention integrates these features, and provides for still others not found in prior software systems, it is believed to represent a significant contribution to the art.

In order to aid in understanding some of the features of the present invention, a brief discussion of unsupervised and supervised statistical methods of spectral data analysis in the current state of the art of drug discovery and toxicity screening is set forth initially. For example, from metabonomics studies of urine samples, spectroscopic data submitted to a computer for analysis typically is first passed through several processing steps designed to reduce variation due to the instrument, and reduce minor spectral variations due to sample-to-sample chemical and environmental differences. This is accomplished in part by reducing the spectra to a modeling form by integrating over small regions. The resultant vector of integrated spectral intensities contains between 200 and 250 integrated regions, depending on how many regions were excluded due to interfering resonances from vehicle or metabolites derived from the treatment. These regions are small enough that most can be associated with a particular compound, and large enough to reduce the initial dimensionality of the data for subsequent analysis. Absolute quantitation of metabolites in urine is highly variable from sample to sample and is dependent on factors such as thirst, which are less relevant than changes in metabolic profile. For this reason, the data is usually normalized to unit intensity (scaled to 100%). With normalization, the vector of integrated regions can be interpreted as a probability distribution of proton resonances. With this interpretation, subsequent analysis techniques are designed to look for perturbations in this distribution between treatments and controls and to correlate these differences with class membership (treatment type) or histo-pathological endpoints. In addition, techniques may look for changes in spectral distribution with respect to time. If the analysis identifies pattern changes that correlate with the desired endpoints, then an inference between the indigenous metabolites contributing to the observed changes and the endpoints can be proposed. As with any proposed relationship between inputs and measured outputs, it should always be kept in mind that the analysis results in correlations and not a direct measure of causality.

Care must be taken in the experimental design to reduce chance correlations.
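By way of a hedged illustration only (the bin width, excluded region, and random stand-in spectrum below are assumptions chosen to mirror the description, not values from this application), the integration and normalization steps look roughly like this:

```python
import numpy as np

rng = np.random.default_rng(0)
ppm = np.linspace(0.2, 10.0, 9800)             # chemical-shift axis
spectrum = np.abs(rng.normal(size=ppm.size))   # stand-in for a real spectrum

bin_width = 0.04                               # ppm per integrated region
edges = np.arange(0.2, 10.0 + bin_width, bin_width)
excluded = [(4.5, 5.0)]                        # e.g. an interfering water region

regions, centers = [], []
for lo, hi in zip(edges[:-1], edges[1:]):
    if any(lo < e_hi and hi > e_lo for e_lo, e_hi in excluded):
        continue                               # skip interfering regions
    mask = (ppm >= lo) & (ppm < hi)
    regions.append(spectrum[mask].sum())       # integrate over the small region
    centers.append(0.5 * (lo + hi))            # bin centers kept for labeling

vector = np.asarray(regions)
vector = 100.0 * vector / vector.sum()         # normalize: scale to 100%
print(len(vector), vector.sum())               # ~230 regions, summing to 100
```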

The numerical methodologies most commonly used for metabonomics data fall into the general category of factor analysis. As implied by the name, these methods seek to reduce the data to a few inherent factors (sometimes referred to as latent variables) that describe the data. Interpretation, class membership and model regression are subsequently derived from these factors. In its least refined sense, the most important factor is the mean and variance between groups. If all control animals generate identical spectra without variation, and all treated animals generate an identical spectrum different from the control, the analysis would reduce to two factors. These factors would be the means of the two groups and analysis could be obtained by a simple difference. With these two factors, classification would be possible. It should be noted, however, that these factors include data from all the original integrated intensities, so that interpretation of the results in terms of individual compounds is often difficult. The temptation to only look at select variables in the analysis as if they were independent is great but must be handled carefully. The normalization of the data creates a link between all the variables so that univariate interpretations become problematic. This concept is probably the most difficult to learn with respect to using factor based analysis tools.

Identical animals under identical conditions, however, do not generate identical spectra. The starting point is again the mean of the data. The mean is subtracted from the data as the most important factor. The subsequent analysis then asks the question: How many factors are needed to describe the remaining variation? With the answer to this question in hand, the remaining data can be described as (or decomposed to) a linear sum of the factors and factor quantities for each sample. The differences between the relative amounts of factors between samples become the new subject of correlation and analysis.

While the preceding discussion follows a logical progression to analyze the data, it does not describe how to generate factors. The technique most commonly used is Principal Component Analysis (PCA), a well-known technique in the art. Numerically speaking, this technique decomposes the data into factors (called loadings in PCA terminology) and quantities (referred to as scores). The factors are constrained to be "orthogonal" (i.e., each factor describes totally new information from the others), and ordered such that the first factor has the largest variance of scores (i.e., the first factor describes as much of the data as possible), and each subsequent factor describes less and less of the data. It is inferred that the first few factors describe variation that is important for analysis and that the last few factors describe noise. This interpretation will usually hold if the magnitude of the differences between the controls and treated samples is greater than the natural variation between like samples. With this assumption holding, analysis can be conducted with just the first few factors, which amounts to reducing the problem to a few dimensions. A visual inspection of plots of scores for samples will indicate groupings based on treatments. If the scores for two groups are segregated along a particular factor, this factor can be used for classification.
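A minimal numerical sketch of this decomposition, on an assumed synthetic data matrix; the singular value decomposition used here is a standard route to the loadings and scores described above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 230))          # 40 samples x 230 integrated regions

X_centered = X - X.mean(axis=0)         # the mean is removed first
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

loadings = Vt                           # orthogonal factors
scores = U * s                          # quantity of each factor per sample

# Factors are ordered by decreasing variance of their scores; the first
# few describe most of the data, the last few mostly noise.
variance_explained = s**2 / np.sum(s**2)
print(variance_explained[:3])

# In practice, plotting the first two columns of `scores` is what reveals
# groupings of samples by treatment.
```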

If all the groups (controls + treatments) are combined into a single dataset with a common mean, this type of analysis is referred to as unsupervised factor analysis. Since no assumption is made about groupings, significant variation between different treatments will contribute to factors with large variance and may lead to groupings by inspection of the PCA results (score plots).

As the magnitude of the effect of treatment becomes small with respect to the natural variation between like samples, more aggressive methods of data analysis become necessary. These methods are usually referred to as "supervised" in that they use known information about endpoints to find factors that correlate with groupings (recall that with PCA, only variance in the measured data was used to find factors). One of the most popular techniques in this category is called Partial Least Squares (PLS). This is a factor-based method sharing many aspects with PCA; however, the factors are chosen both to describe the variation in the data (like PCA) and to optimize the correlation with the known endpoints or groupings. With PCA, adding more factors describes more of the data but at some point describes noise. With PLS, adding more factors will give better and better correlations with the endpoints, but at some point overfitting will occur and the models will fail to adequately classify new samples not included in the model. For this reason, supervised methods are subjected to validation or cross validation methodologies to avoid overfitting. The most common technique is to successively leave portions of the training data out of the fitting and then predict the endpoints of the "left out" data as a function of the number of factors. Analysis of the results will determine the minimum number of factors needed to define the model robustly.
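The cross-validation strategy can be sketched as follows, here with scikit-learn's PLS implementation on synthetic data (the dataset, the five-fold split, and the R² scoring are illustrative assumptions, not part of this application):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 230))                        # data in modeling form
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=60)  # synthetic endpoint

for n_factors in range(1, 9):
    model = PLSRegression(n_components=n_factors)
    # Each fold in turn plays the role of the "left out" data.
    q2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(n_factors, round(q2, 3))

# Choose the smallest number of factors at which the cross-validated
# score plateaus; adding factors beyond that point fits noise.
```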

The result of supervised methods like PLS is models that can be applied to new or unknown samples. The models can be used to predict endpoints or generate class memberships. In addition, since these models are based on factors that were derived from the particular data structure used to build the models, a new question can be asked of the unknown sample data: Can the new data be adequately described with the same factors that were used to build the model? If the answer is yes, then it is assumed that the model predictions are reasonable. If the answer is no, the model predictions are suspect, since there is a new source of variation in the data that was not in the original model training data. The analysis of the new source of variation is often referred to as "Residual Analysis". Residual Analysis is a powerful feature of factor based modeling methods and offers a particular advantage over many "black box" based pattern recognition systems.
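A hedged sketch of residual analysis (synthetic data; keeping three factors is an arbitrary choice): a new sample is projected onto the retained model loadings, and the size of whatever the loadings cannot reconstruct flags variation that was absent from the training data.

```python
import numpy as np

rng = np.random.default_rng(3)
X_train = rng.normal(size=(40, 230))
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
loadings = Vt[:3]                        # keep the first 3 model factors

def residual_magnitude(sample):
    """Norm of the part of a sample the model factors cannot describe."""
    centered = sample - mean
    reconstructed = loadings.T @ (loadings @ centered)
    return np.linalg.norm(centered - reconstructed)

in_model = X_train[0]                              # sample like the training data
novel = X_train[0] + 5.0 * rng.normal(size=230)    # new source of variation

# The novel sample shows a much larger residual, so predictions for it
# would be treated as suspect.
print(residual_magnitude(in_model), residual_magnitude(novel))
```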

While PCA and PLS form the basis for much of the data analysis in practice, there are other popular analytical methods found in the literature and implemented in many commercial statistical software packages. Such packages include chemometrics analytical software such as Pirouette™ from Infometrix, Inc. of Woodinville, Washington, software packages from Process Analysis and Automation of the UK, and statistical software packages available from Umetrics of Umeå, Sweden.

The interested reader is directed to the published PCT patent application of Metabometrix, Inc., publication no. WO 02/052293 A1, for a background discussion of metabonomics and related methods for analysis of spectral data. See also Nicholson et al., 'Metabonomics': understanding the metabolic response of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data, Xenobiotica, Vol. 29, no. 11, pp. 1181-1189 (1999). Other background references include Lindon et al., Pattern recognition methods and applications in biomedical magnetic resonance, Prog. Nucl. Mag. Res. Spec., Vol. 39, pp. 1-40 (2001), and Nicholson et al., High resolution proton magnetic resonance spectroscopy of biological fluids, Prog. Nucl. Mag. Res. Spec., Vol. 21, pp. 449-501 (1989).

SUMMARY OF THE INVENTION

In a principal aspect, an integrated spectral data processing, data mining, and modeling system is provided for use in diverse screening and biomarker discovery applications. The system is designed for use in conjunction with an analytical spectrographic instrument that collects data from a chemical or biological sample, and stores output raw data in a file typically resident on an instrumentation computer networked with the inventive system.

The inventive system includes a general purpose computer system and a machine readable storage medium containing a set of instructions in the form of processing routines, described in detail below, for the general purpose computer system. These instructions integrate the following four software modules into an integrated spectral data processing system: (1) a module operating on raw data from files created by the analytical spectrographic instrument and storing raw processed data in a file; (2) a module operating on the raw processed data and containing instructions for providing data standardization of the raw data and storing standardized individualized spectral data in a file and/or a library of files; (3) a module operating on the standardized individualized spectral data and containing instructions for responsively reducing the individualized spectral data into a modeling form and storing the modeling form of the data in a file; and (4) a module operative on the data reduced to modeling form and containing instructions providing a user of the system with tools for performing model building, visualization, analysis and/or prediction of the data.

The system further includes a tracking database containing the results of the model building, visualization, analysis, and/or prediction of said data. The tracking database may store other information, including visualization results, sample and processing parameters, data reduced to modeling form, or libraries of standardized data.
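Purely as an architectural illustration, and assuming nothing about the application's actual implementation (every name below is hypothetical), the four modules and the tracking database might chain together along these lines:

```python
from dataclasses import dataclass, field

@dataclass
class TrackingDatabase:
    """Stand-in for the tracking database that captures stage results."""
    records: list = field(default_factory=list)

    def store(self, stage, payload):
        self.records.append((stage, payload))

def process_raw(raw_file):                 # module (1): raw instrument data
    return {"processed": raw_file}

def standardize(processed):                # module (2): data standardization
    return {"standardized": processed}

def reduce_to_modeling_form(standard):     # module (3): reduction to modeling form
    return {"modeling_form": standard}

def build_model(modeling_form, db):        # module (4): model building/prediction
    result = {"model": modeling_form}
    db.store("model_results", result)      # results land in the tracking database
    return result

db = TrackingDatabase()
outcome = build_model(
    reduce_to_modeling_form(standardize(process_raw("sample_001.fid"))), db)
print(outcome, db.records)
```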

The four modules thus handle the entire workflow process from automated processing and quality control of raw spectral data, to reduced modeling form, to modeling and statistical mining and visualization tools. What previously took days or even weeks of data manipulation can now be accomplished in a few hours, so that the time of the researcher is spent investigating the scientific issues of the studies, and not performing tedious data manipulations. The system is designed to be flexible and handle data from multiple spectrographic techniques (NMR, mass spectrometry, or others).

Optional modules that can be used in the system include a processing refinement module, an export module for formatting and exporting data to third party analysis software which may be optionally provided, and a visual data mining and feature extraction module providing the user visual tools for further analysis of the data.

In one possible embodiment, the system is entirely automated, in that the processing of raw data, data standardization, reduction to modeling form and model building or screening are performed automatically with no human involvement. In this example, the particular options for processing of the data are chosen in advance (or stored in a file or otherwise selected) and the processing proceeds by execution of the modules described herein serially, one after the other. In a more typical implementation, the user is provided with the opportunity to select analysis or model building modes as the processing proceeds, to loop back and perform different modes of analysis, or to change between model building or screening techniques and specific analytical routines to apply. The flexibility to change the mode of analysis, in a completely integrated system from raw data processing to model building, visualization and analysis, is a highly useful feature that gives the present system much more flexibility than those found in the prior art.

The above summary is an overview of the entire system. Inventive aspects are further presented in the individual modules, as will be appreciated from the following discussion.

BRIEF DESCRIPTION OF THE DRAWINGS

Presently preferred embodiments of the invention are described below in conjunction with the appended drawing figures, where like reference numerals refer to like elements in the various views, and wherein:

Figures 1A and 1B are a schematic representation of an integrated spectral processing and analysis system in accordance with a presently preferred embodiment of the invention, showing the principal software modules thereof, and the relationship of the inventive system to other systems and equipment typically found in a laboratory setting, including instrumentation control and data storage systems ("group A"), an optional laboratory information management system (LIMS, "group B"), and third party statistical software ("group C").

Figure 2 is a flow chart illustrating the software module identified as "automated processing of raw data" in Figure 1, showing the principal subroutines of the module and their relationship to other modules in the system of Figure 1.

Figure 3 is a flow chart illustrating the software module identified as "processing refinement" in Figure 1, showing the principal subroutines of the module and their relationship to other modules in the system of Figure 1.

Figure 4 is a flow chart illustrating the software module identified as "data standardization" in Figure 1, showing the principal subroutines of the module and their relationship to other modules in the system of Figure 1.

Figure 5 is a flow chart illustrating the software module identified as "visual data mining/statistical analysis/feature extraction" in Figure 1, showing the principal subroutines of the module and their relationship to other modules in the system of Figure 1.

Figure 6 is a flow chart illustrating the software module identified as "data reduction to modeling form" in Figure 1, showing the principal subroutines of the module and their relationship to other modules in the system of Figure 1.

Figure 7 is a flow chart illustrating the software module identified as "unsupervised and supervised model building, visualization, analysis and prediction" in Figure 1, showing the principal subroutines of the module and their relationship to other modules in the system of Figure 1.

Figure 8 is a graph showing a typical spectrographic measurement of a biological sample.

Figure 9 is a graph showing NMR spectra of a biological sample, including simultaneous full and expanded scales and an overlay of the integral of the spectroscopic measurement.

Figure 10 is a graph showing an outlier visualization tool for spectroscopic data.

Figure 11 is a graph showing a stack view and band analysis visualization tool.

Figure 12 is an image view of spectroscopic measurements.

Figure 13 is an illustration of a residual magnitude analysis visualization tool.

Figure 14 is an illustration of a pair-wise residual analysis visualization tool.

Figure 15 is an illustration of a group statistical analysis visualization tool.

Figure 16, upper portion, is a 3-component, three-dimensional principal components analysis (PCA) score plot for the first three principal components clusters of spectral data; the score plots are observed along the first component (pc1).

Figure 16, lower portion, is a graph showing the loading of the first PCA factor of the upper portion of Figure 16.

DESCRIPTION OF PREFERRED EMBODIMENTS

Overview

A primary objective of presently preferred embodiments of the invention is to provide a software-based processing system that integrates the entire process of processing, standardizing, visualizing, reducing, analyzing, modeling, and screening of chemical and biological samples that are analyzed by diverse spectroscopic or multidimensional analytical techniques. This integration is provided in a computer software-based integrated spectral data processing, data analysis and data mining system 100 in Figure 1, shown as "group D". In a typical implementation of the invention, the system 100 cooperates and interacts with other systems present in a state of the art drug discovery or screening laboratory, as described hereinafter, including a laboratory instrumentation control and data collection station 102 ("group A"), a laboratory information management system (LIMS) 104 ("group B"), and third party off-the-shelf statistical analysis software 106 ("group C").

The integrated spectral data processing, data analysis and data mining system 100 of Figure 1 consists of databases, files stored in memory, and computer software stored on a machine-readable medium executable by a computing device such as a general purpose computer system (not specifically shown in Figure 1). The computer system may take the form of a stand-alone general-purpose computer, a network server, or any other suitable platform, the details of which are not important. The functionality of the software modules comprising the system 100, and their relationship to each other and to the other elements 102, 104 and 106, are described in detail below in conjunction with the appended Figures. In a preferred embodiment, the system 100 is installed on a network server (or on distributed servers) such that it is accessible to multiple scientists simultaneously.

Referring now to Figure 1, the entire system in which the present inventive system 100 operates can be conceptualized in four groups. The first group (A) denotes the instrumentation control and data collection station 102, which includes a data collection device 1 (e.g., ¹H-NMR or mass spectrometer). The data collection device 1 comprises the analytical hardware itself and an associated instrument control and processing computer (not shown). This computer is usually on a network, so that the raw and derivative exported data files are accessible to the system 100 described here. The computer in the data collection device stores raw data in a native format file, indicated at 2, and/or alternatively in an exported data file, indicated at 3.

The second group B 104 denotes an optional Laboratory Information Management System (LIMS). If deployed, LIMS systems typically capture descriptive and workflow information about laboratory samples and simple analytical results, and store them as files in memory in the form of text and numbers. For biologically derived samples, information about the animal species, sex, age, histology, pathology, mortality, etc. may be available through a LIMS system. This information can be used directly for annotation of sample spectral data for screening and model building. This information is typically stored in the form of a sample/project tracking database 20 and an outcomes/endpoints/third party analysis database 21.

The third group C 106 is an optional statistical analysis software system for analysis of reduced data. The statistical analysis system includes an off-the-shelf software package indicated at 22. Spectral data, as collected, is not typically in a form amenable for analysis by commercial statistical software. The present system 100 is designed, however, to take advantage of the commercially available tools in the third party software package. The system 100 acts as the engine to process, standardize, and reduce the data to a form directly importable for analysis by commercial software 22.
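A small sketch of this export role (the file name and column labels are assumptions): the reduced samples-by-regions matrix is written out as a plain CSV table of the kind off-the-shelf statistical packages import directly.

```python
import csv
import numpy as np

# Stand-in for data already reduced to modeling form: 10 samples x 6 regions.
data = np.random.default_rng(4).normal(size=(10, 6))
sample_ids = [f"sample_{i:03d}" for i in range(10)]

# Write a flat table: one row per sample, one column per integrated region.
with open("reduced_for_export.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["sample_id"] + [f"region_{j}" for j in range(6)])
    for sid, row in zip(sample_ids, data):
        writer.writerow([sid] + [f"{v:.6f}" for v in row])
```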

The fourth group D constitutes the system, process, and associated applications for spectral analysis and screening applications, comprising the integrated spectral data processing, data analysis and data mining system 100 that is the subject of the invention. It compensates for the deficiencies of the instrument 102, LIMS 104, and statistical analysis tools 106 for spectroscopic or multidimensional analytical screening and modeling applications.

The system 100 is sufficient for analysis and data management in the absence of the LIMS 104 and commercial analysis software system 106 but can leverage these capabilities if they exist. The only external requirement of the system 100 is the availability of data from analytical instrumentation.

The system 100 includes the algorithms described below and the integration, automation, and storage of the intermediate and derived states of spectral data and results.

The integration allows the analyst to modify assumptions made about the optimal means to process and model the data for a given application. Without this integration, iteration through multiple procedures to find optimum processing and analysis conditions is prohibitively tedious. In addition, once optimum parameters for processing and analysis are determined, they can be automated for routine screening applications.

Having described the overview of the system 100 and its interrelationship with other optional systems, the individual modules and systems in Figure 1 will be described with more particularity next.

Group A: Instrument Control/Data Station (102)

Screening via spectroscopy begins with the collection of complex spectra and/or separation information on a group of samples from a suitable instrument 1. Spectroscopy (or spectrometry) can be a means of collecting information on complex mixtures in a single measurement process. As such, a carefully designed measurement protocol can become an extremely powerful screening tool. Sample groupings may be the result of designed experiments, as is often the case with animal models or ligand binding studies, and may include information about the time evolution of an effect. Samples may also be derived on the basis of availability, as with clinical sample banks. Due to the complexity of many biologically obtained samples, samples often undergo a fractionation or separation prior to the collection of spectral data. The separation step, if utilized, may be directly coupled to the spectrometer. These techniques are often referred to as multidimensional or hyphenated techniques. Quantitative or semi-quantitative information about the species in the samples may be collected as part of the separation process. Information about fractions may be combined into a multidimensional dataset, which may contain data from multiple spectroscopic techniques. The data collection process is often driven by instrument automation and includes multiple experiments per sample.

An important element of data collection for screening purposes is reproducibility. Experimental parameters that may result in differences in data properties must be captured in the measurement process. If a Laboratory Information Management System (LIMS, Group B) has been implemented, the sample list for automation is often generated as a query of the LIMS. Essential information about the identity of the samples and fractions, as well as pertinent acquisition parameters, if not stored in a LIMS, is usually incorporated into the data collection process and exported as part of the sample title or internal comment field. The sample identity is preserved in the resultant data files 2,3. The resultant data may be captured as part of a LIMS or an archive management system. Data files can take many forms, from instrument proprietary to public standards, and the details are not important. Often in the case of proprietary data forms, the instrument software provides exports to accessible forms in file 3. It is essential that the pertinent information about a sample's identity through all separation and analysis steps be accessible for subsequent analysis. This accessibility can be through a combination of information in the sample data files and a LIMS system.

While often well designed for automated data collection, most instrument software is optimized for highly interactive processing for individual spectra. This is appropriate for research applications in which detailed analysis of a few samples is desired. For screening applications, however, robust, reproducible, automated processing and analysis of large numbers of samples is awkward, tedious, and incomplete at best and impossible at worst using prior art instrumentation software.

Classes of spectrometer suitable for use in this invention include: NMR/NMRS, Mass Spectrometry (MS) (including Surface Enhanced Laser Desorption and Ionisation (SELDI) ), UV-VIS, CD, PE, Fluorescence, IR, NIR, X-ray, NOE, Microwave, and Raman.

Fractions can be generated by a variety of forms of chromatography, affinity capture, diffusion, electrophoresis, filtering, dialysis, and sizing methods.

Examples of hyphenated techniques include: Liquid Chromatography (LC) based LC-MSn, LC-NMR, LC-NMR-MS, and LC-UV(-MS), and Gas Chromatography (GC) based GC-IR and GC-MS.

Examples of sample information include: species, strain, animal, sex, age, diet, body fluid, tissue, treatment type or compound, disease, time point, fraction, protein target and ligands (for ligand binding), intermediates and target compound (combinatorial QC), pH, and information from auxiliary measurements used to characterize the sample not associated with screening endpoint such as clinical-chemistry measurements, metal ion concentration, etc. Such sample information, including the spectral data itself, is stored in the files 2,3.

The files 2,3 are available to the system 100, e.g., over a computer network.

Group B: LIMS System (104)

As research laboratories add spectroscopy-based screening methodology to their standard services, often LIMS systems are implemented to keep track of projects, samples, outcomes, and workflow. Such information is stored in a LIMS database 20. Tracking data within a single project can be straightforward, but a well-implemented LIMS system can be invaluable at managing and merging data between projects. This becomes important for following long term trends in sample analysis and for building prediction models for outcomes that span multiple projects.

LIMS systems may also be very useful at capturing outcomes or endpoints for samples and results from other measurements, storing them in a database 21. Database 21 can be separate from, or integrated with, the database 20. These outcomes can take the form of toxic response; histology; pathology; disease presence, onset, modification, and regression; metabolic pathway modification; phenotype; and mortality. In addition, sample annotations such as clinical chemistries and histories may be available through a LIMS system to correlate with endpoint and spectroscopy based screening data. LIMS systems may also manage summary results from statistical analyses in the database 21 correlating traditional laboratory measurements to outcomes.

While LIMS systems provide a means to track information about samples, outcomes, and traditional measurements, they are usually not adequate to deal with the multidimensional nature and form of spectroscopic data. Some systems provide a means to archive and catalogue spectroscopic data but provide no means to process, analyze, or build models with this data. The archive provides a means to recover raw instrument data but the native instrument software must be employed for subsequent analysis.

Traditional measurements include data from clinical chemistries and assays. As managed in a LIMS system, these data form a large collection of single-variable (univariate) measurements. While large in number, gene-chip data also falls into this category. Taken as a group, multiple-univariate statistics are often employed with these data in third party analysis software 22.

Group C: Statistical Software (22)

Statistical software for the analysis and correlation with endpoints of univariate and multiple-univariate data is readily available from many sources. Spectra, while multivariate, are not multi-univariate. As discussed, spectra are discrete samplings of distributions, and methods of analysis should not be biased toward methods oriented to the analysis of discrete variables. There are several companies specializing in chemometrics statistical analysis software, however, that do provide tools appropriate for the analysis and modeling of spectroscopic data. Having a data path into and out of these packages is therefore advantageous. These software packages usually start with data that has been uniformly processed for subsequent analysis and do not provide the tools to do this. While the utility of these programs is clear, integration with the data is difficult. In practice, it is often necessary or advantageous to apply different processing conditions to the raw instrument data for different modeling purposes; without integration, this is too tedious to accomplish.

Group D: Integrated Processing and Analysis System (100)

There is clearly a need for an integrated system that can manage and capture the information flow from raw instrument data to modeling and mining applications. In addition, the implementation and integration of spectroscopy-specific processing into the information workflow allows for seamless mining of spectral data under on-demand processing conditions. Integration also allows the automation of processing and analysis.

With automation, screening applications become possible. This needed system is the subject of the present invention. A presently preferred embodiment is shown as system 100 in Figure 1. An overview of the system 100 is given in this section with reference to Figure 1; subsequent sections of this document describe details of key modules and processes making up the system 100.

The process begins with the automated processing of the raw analytical/spectral data in a software module 4, shown in further detail in Figure 2 and described below. Part of the philosophy of the system 100 is to preserve maximal information content for analysis. As such, direct access to the raw, untouched data from the file 2 as collected by the instrument 1 is desirable. If the data format in file 2 is inaccessible, the most information-rich export of data is generated, stored in the data file 3, and accessed by module 4.

The system begins with the processing of the raw data in the module 4. For analysis of spectral data, reproducibility is key to interpretation of results. In addition to variability in the samples and instrumental factors, there are reproducibility issues with respect to data processing. For this reason, algorithms employed in module 4 must be robust and applied uniformly to the data. With manual processing, multiple analysts will generate slightly different results. For modeling and screening, reproducibility is more important than processing perfection. The algorithms in module 4 need to be adapted for robust automated application as compared to the interactive algorithms often implemented on analytical instrumentation. All available sample information, annotations, and outcomes should also be extracted from the sample files or available LIMS system. Information about the sample and key processing variables is stored in a system-tracking database 19. Key sample parameters are stored in sample information tables 6 contained within database 19. All subsequent processing, visualization, and analysis links back to the sample/spectra information stored in these tables 6. The tracking database 19 can be any form (flat tables, spreadsheets, relational database) as long as referential data integrity is maintained. The processing path of raw data processing module 4 is dependent on the type of data under analysis, as will be appreciated from the description of Figure 2, set forth below. The modularity of the system easily allows the incorporation of new data types as needed.

In addition to the sample tracking information, the output of the raw data analysis, the processed spectra, is sent to a data file 5. The content of this file 5 should capture the essential raw information from the native format (files 2,3) and provide the flexibility for subsequent reprocessing, standardization, and analysis.

A process refinement module 8 is provided as a manual intervention step in the event that it is necessary to review and refine the processing parameters. The process refinement module 8 is described further below in conjunction with Figure 3. From the processing information stored in the database 6, processing outliers can be spotted, and refinement can be driven from the database or data can be flagged as unsuitable for analysis.

Parameter updates are stored in the database. In some cases, processing parameters (e.g. excluded regions, baseline removal type, band alignments) are chosen for groups of spectra together as a visual process. This module 8 allows for a rapid review/edit of the processed data driven by the samples in the database. The processing refinement module preferably includes a calibration module comprising a set of instructions performing the following steps, described in more detail hereafter: selecting, either automatically or using operator involvement, a group of bands to be calibrated; selecting, either automatically or using operator involvement, a reference spectrum for the group of bands; normalizing each band in the group of bands; selecting, either automatically or using operator involvement, a calibration error function to apply to said group of bands; and applying the calibration error function to the normalized bands to thereby calibrate the bands.

The system 100 includes a data standardization module 7. This is a key module and concept for the system. The data standardization module is described further below in conjunction with Figure 4. Since spectra are discrete points of a continuum, sampling frequency may differ between spectra collected at different times. Part of the standardization performed by module 7 is calculating a uniform sampling distribution in a robust way that preserves all pertinent information content, including curvature and high order moments of the data. In addition to sampling distributions, all post processing corrections and normalizations are applied uniformly to all data of interest. The result is standardized spectral data that is sent to file 10 for subsequent analysis. An alternative result is a library of multiple annotated spectra. A library can be formatted for subsequent optimized searches of direct spectral information. A library can also be used for encapsulating information about reference spectra for easy retrieval and comparisons during mining activities. The standardized files and libraries include information about their processing and standardization histories.

The system includes a visualization, mining, and statistical analysis module 11. The module 11, described below in conjunction with Figure 5, uses standardized spectra and information about the samples, such as class groupings and outcomes (e.g. control, treated, mortality, etc.), and creates visual statistical reductions of the data depending on designated groupings. The module provides a feature whereby group spectral statistics can be compared between groups and to individual spectra. In addition, options include: stack views of the data for export into reports, banded views for easily spotting outliers, band quantification for direct correlation with outcomes, views that emphasize spectral differences, and direct and difference "image" views such as shown in Figure 12. Image views allow the data to be sorted by classification and have all data displayed together with sample identity on one axis, spectral sampling values on the other axis, and the intensity displayed as a color map. Regions of reproduced similarity can be spotted easily for the groupings. Selection of a reference spectrum or group average allows an image map to be displayed that is colored by the magnitude of sample differences with the reference.

Selection of a reference also allows a plotting of the magnitude of spectral residuals. This magnitude forms a distribution from which outliers can be spotted. If comparison is to a group control as a reference, outliers for treated samples indicate that there is a quantifiable difference for the treated samples. By linking sample spectra to library spectra, spectral identity for unknowns can be inferred. Results from statistical, residual, band quantification ("feature extraction"), and spectral identity ("screening hits") analyses are stored in a visualization results database 12.

While standardized spectra can be used directly for modeling, often it is expedient to reduce the standard data to another modeling form. A data reduction to modeling form module 13 is therefore included in the system 100. This module, described in further detail in Figure 6, is designed to reduce the data to features likely to correlate with an outcome. If spectral bands can be assigned and are resolvable (mathematically or in actuality separable from other bands), band quantities can be associated with specific entities in the sample. Often, spectral regions are segmented into small regions about the width of a typical spectral band, and only segment summaries are carried forward for analysis. In other cases, a spectrum may be characterized and stored depending on the presence or absence of a signal in selected regions. Other feature reduction methods include the use of wavelet coefficients, which can have the benefit of noise exclusion and insensitivity to small variations in peak positions. Other procedures that may be incorporated into module 13 might emphasize differences between relative band positions (e.g. autocorrelation), or sharp features over broad ones (e.g. magnitude of second derivative). The output of the modeling reduction from module 13 is stored in a modeling form database 14. Module 13 can also generate exported data formats that are stored in a file 17, in a format specifically appropriate for 3rd party analysis (Group C) software 22, as indicated in Figure 1.
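By way of illustration only, the segment-summary reduction described above might be sketched as follows (a minimal example assuming numpy; the function name and interface are hypothetical and do not represent the actual module 13 implementation):

    import numpy as np

    def bin_spectrum(x, y, seg_width):
        # Reduce a spectrum to per-segment areas; seg_width is chosen
        # to be roughly the width of a typical spectral band.
        edges = np.arange(x.min(), x.max() + seg_width, seg_width)
        areas = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (x >= lo) & (x < hi)
            areas.append(np.trapz(y[mask], x[mask]) if mask.any() else 0.0)
        return edges[:-1], np.array(areas)

Only the segment summaries (here, the areas) would then be carried forward to the modeling form database 14.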

While export for third party analysis is utilized as indicated by export data formats 17, the system 100 also provides its own internal modeling capability, carried out by module 15. The unsupervised and supervised model building, visualization, analysis and prediction module 15 is described below in conjunction with Figure 7. The module 15 generates several possible types of models, including predictive-mode control ("normal") models, classification models, and outcome models, and stores these models as files in the tracking database 19. The modeling capability provided by module 15 is necessary in order to provide a closed loop system for screening applications. As such, the ability to generate and use in a predictive mode control ("normal"), classification, and outcome models is necessary. Predictions/classifications of the samples can subsequently be stored in a model-screening results database 16. In addition, unsupervised pattern recognition can be used as a quality control method as well as for cluster analysis. Visualization and analysis tools are provided for the modeling activities. As a quality control tool, spectral data that is categorically different from members of the same class can be designated in the database 16 as bad data if needed. An additional element of providing a modeling module 15 is the ability to combine data from multiple data or experiment types in a single model. This capability is an additional benefit of integrating processing and analysis within the same system. As with other elements of the system, the software module 15 is modular in design, and any new model building and prediction algorithm that may be desired is easily implemented and integrated with the rest of the module 15 and the system as a whole.
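As a hedged sketch of the unsupervised quality-control use just described (principal component analysis via scikit-learn stands in for whatever pattern recognition method is configured; all names are illustrative):

    import numpy as np
    from sklearn.decomposition import PCA

    def pca_qc(X, n_components=3):
        # Rows of X are spectra reduced to modeling form. Samples far
        # from the score-space centroid are candidates for "bad data".
        scores = PCA(n_components=n_components).fit_transform(X)
        dist = np.linalg.norm(scores - scores.mean(axis=0), axis=1)
        return scores, dist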

Another useful feature of the system 100 is its ability to automate the entire process from raw data processing to modeling and visualization. The automated processing is indicated by the bold lines 23 in Figure 1. With the modular design, full parameter sets for all modules can be stored so that for a given "named" analysis, the system is initiated simply by selecting the directory (or location in files 2,3) where the raw data resides. Post automation, the processed and standardized data files 5, 9, 10, 6, 16 are available, the data has been reduced to model form, and a designated modeling process has been performed.

Usually for a first pass at the data, an unsupervised cluster analysis is designated. The analyst obtains cluster visualizations and a database is populated with the tracking and processing information, ready for refinement or other visual mining. For screening applications, a classification model can be applied and the results tabulated. The ability to spot outliers in the processing parameters, cluster analysis, etc. allows subsequent problem solving with the data.

Detailed Description of Modules

Module 4: Automated Processing of Raw Data (Figure 2)

Referring now to Figures 1 and 2, automated processing of raw data begins with the capture and storage of new raw data in the files (2,3). For new data, a decision is made by the analyst at step 4-01 for full automation or raw processing only. For cases where there is sufficient experience to define processing parameters suitable for the entire process, automation is chosen. Full automation includes a loading of an automation script and stored parameters for the process, indicated at step 4-02. Where full automation is not desired, the process flow proceeds to step 4-04. Preferably, the automated process is hard coded to avoid user modification, and the process can only be changed or generated by analysts with super-user status or a system administrator. Parameters are stored in data structures associated with the various modules (e.g. processing 4, data standardization 7, data reduction 13, and modeling 15). Automation is then driven, as indicated in step 4-03, by scripting and executing each module in succession.
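The scripted succession of modules at step 4-03 can be pictured with a small driver of the following shape (a sketch only; the stage names and parameter layout are assumptions, and each callable stands in for one of modules 4, 7, 13, and 15):

    def run_named_analysis(raw_location, params, modules):
        # params: stored parameter structures for a "named" analysis;
        # modules: callables for processing (4), standardization (7),
        # reduction (13), and modeling (15), executed in succession.
        data = modules["process"](raw_location, params["processing"])
        data = modules["standardize"](data, params["standardization"])
        data = modules["reduce"](data, params["reduction"])
        return modules["model"](data, params["modeling"])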

Separate entry points are necessary depending on the type of data under study. The processing of NMR raw data is quite different from mass spectrometry and requires different tools. Since most data types can converge at the standardization or model reduction stages, it is easy to maintain multiple entry points at this stage. Maintaining separate paths for different data types allows new data types to be added easily, as indicated at step 4-06. Only tools specific to the new type need be developed and/or incorporated.

For NMR data, the process proceeds to step 4-05, where parameters are chosen for automated processing of the raw data. This is accomplished with a manual entry form for the location of the data, the processing parameters, and control for database storage. After selection, the processing and tracking continue unattended for all spectra in the directory selected.

The first step in the processing of NMR spectra is to read the acquisition parameters (sweep width, total points, acquisition modes, reference offsets, instrument frequency) and free induction decay data (FID) from the native format as indicated at step 4-08. A record for database tracking is created in the sample/processing database 6 (Figure 1) and the raw data is reformatted to the system format and stored in the processed raw data database 5 (Figure 1).

The FID data is a complex time domain signal. In some cases, the acquisition creates distortion in the first few points of the FID. These can be corrected with linear-prediction methods. (See for example: Kay, S. M. et al., Spectrum Analysis - A Modern Perspective, Proc. of the IEEE, Vol. 69, pp. 1380-1419 (1981) and Kumaresan, R. et al., Estimating the Parameters of Exponentially Damped Sinusoids and Pole-Zero Modeling in Noise, IEEE Trans., Vol. ASSP-30, pp. 833-840 (1982), the contents of which are incorporated by reference herein). Bruker data is often subject to digital filtering such that the net result is a large first order phase distortion in the raw data that can be corrected. (See for example: Moskau, D., Application of Real Time Digital Filters in NMR Spectroscopy, Concepts in Mag. Res. (Mag. Res. Engineering), Vol. 15, pp. 164-176 (2002), the content of which is incorporated by reference herein). These distortions are corrected in module 4-09 before proceeding with further processing.

Processing flow proceeds to module 4-10, where the time domain data is subsequently subjected to window functions and truncations (as directed by processing parameters 4-05) to enhance signal to noise and/or resolution. (For an explanation of NMR terminology and data processing issues see: Ernst, R. R. et al., Principles of Nuclear Magnetic Resonance in One and Two Dimensions, Oxford University Press, NY, pp. 91-241 (1987), the content of which is incorporated by reference herein). Since window functions can have an effect on band quantification, window functions are often deferred to the data standardization stage, where multiple conditions can be easily tried to see the effect on modeling and data mining. The data is subsequently transformed via fast Fourier transform to the spectral or frequency domain. Fourier methods are used frequently in the current invention for resolution enhancement, denoising, derivatives, deconvolution, and frequency filters. For details on Fourier methods and their properties see, for example: Press, W. H., et al., Numerical Recipes, Cambridge University Press, NY, pp. 381-453 (1986) and Champeney, D. C., Fourier Transforms and Their Physical Applications, Academic Press, NY, pp. 1-88 (1973), the content of which is incorporated by reference herein.
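A minimal sketch of the window-and-transform step (assuming numpy; an exponential line-broadening window is shown as one common choice, not necessarily the one configured at step 4-05):

    import numpy as np

    def fid_to_spectrum(fid, dwell, lb=1.0):
        # Apply an exponential window (lb in Hz) to the complex FID,
        # then fast Fourier transform to the frequency domain.
        t = np.arange(fid.size) * dwell
        windowed = fid * np.exp(-np.pi * lb * t)
        return np.fft.fftshift(np.fft.fft(windowed))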

Often an internal reference/calibration standard is added to samples or there are persistent bands in groups of samples that can be used for calibration purposes. Reference calibration frequencies are selected during parameter set up. Reference bands are identified and calibration offsets are adjusted by module 4-11 as necessary and tracked in the database. For internal standards, full-width-half-height calculations are made and stored in the database 6 as a measure of quality control on instrument tuning. If needed, phasing can be added to select reference peak positions.

Biological samples are frequently in water. Although instrument parameters are chosen to minimize the effects of water, it still remains a strong signal. For dilute samples, data processing is difficult in the presence of a strong water signal. For these cases a time domain frequency filter can be applied to minimize the effect of water. (See for example: Marion, D. et al., Improved Solvent Suppression in One- and Two-Dimensional Spectra by Convolution of Time-Domain Data, J. of Mag. Res., Vol. 84, pp. 425-430 (1989), the content of which is incorporated by reference herein). This occurs in module 4-12. Filters implemented in module 4-12 should be sufficiently narrow as to not have quantitative consequences on neighboring bands, and should only be applied at this stage if necessary for subsequent processing.

Phasing of NMR spectra is one of the key algorithms implemented in the processing module 4, and is performed in module 4-13. The objective of phasing is to deconvolute the arbitrary frequency phase shift and associated frequency dependent distortions. There are many ways of devising an algorithm to accomplish phasing. Any phase algorithm can be easily implemented in the system described herein. In practice, algorithms that minimize the curvature of selected baseline regions work well. Baseline regions can be chosen during processing setup step 4-05. Phase parameters can be optimized by simplex or direct 2nd order models. (See for example: Press et al. (1986) pp. 289-293, and Deming, S. N. and Morgan, S. L., Simplex Optimization of Variables in Analytical Chemistry, Analytical Chemistry, Vol. 45, pp. 278-283 (1973), the content of which is incorporated by reference herein). Phase parameters are tracked in the system database for quality control purposes. Successive runs under identical instrument conditions should not have highly varying phase parameters, and the ability to monitor the parameters for a run is highly useful.
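One plausible rendering of such a phasing algorithm (a sketch consistent with the description above, assuming numpy/scipy; baseline_idx indexes the baseline regions chosen at step 4-05, and squared second differences stand in for the curvature measure):

    import numpy as np
    from scipy.optimize import minimize

    def autophase(spec, baseline_idx):
        # Find zero- and first-order phase (radians) minimizing the
        # curvature of the real part over designated baseline regions.
        f = np.linspace(-0.5, 0.5, spec.size)

        def cost(p):
            phased = (spec * np.exp(1j * (p[0] + p[1] * f))).real
            return np.sum(np.diff(phased[baseline_idx], n=2) ** 2)

        return minimize(cost, x0=[0.0, 0.0], method="Nelder-Mead").x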

Additional processing can be optionally added for NMR at this stage. One module is a module 4-14 performing frequency domain smoothing or baseline corrections. (See for example: Savitzky, A., and Golay, M. J. E., Smoothing and Differentiation of Data by Simplified Least Squares Procedures, Analytical Chemistry, Vol. 36, pp. 1627-1639 (1964), which is the standard reference for smoothing and differentiation and is referred to as the Savitzky-Golay method. Also see Goehner, R. P., Background Subtraction Subroutine for Spectral Data, Analytical Chemistry, Vol. 50, pp. 1223-1225 (1978), and Liu, J. and Koenig, J. L., A New Baseline Correction Algorithm Using Objective Criteria, Applied Spectroscopy, Vol. 41, pp. 447-449 (1987). The content of these references is incorporated by reference herein.) However, it is often more useful to apply these algorithms at the data standardization stage (module 7, Fig. 1, Fig. 4) so that different parameters can be iteratively applied if needed to an entire block of data for trial modeling runs.

The final step of processing NMR data is a storage step 4-15 of the NMR processed data in the system format or database 5. NMR FIDs and processed complex spectra are stored as complex vector quantities. All parameters are stored in data structures and substructures by category. Raw processing parameters are assigned and/or generated and are stored in a single data structure. Data load and save procedures select specific data blocks to save; e.g., raw FID and spectral generation parameters are saved when converting the raw data at step 4-08. Processed spectra and parameters are added to the file later via module 4-15. For rapid retrieval, processed spectra are saved in addition to the FID. While the processed spectra can be regenerated from the FID and the processing parameters, doing so (particularly if frequency filtering is applied) can result in unnecessary delay in data retrieval. Since the frequency axis is trivially computed from the processing and acquisition parameters, it is generated by data retrieval routines.

Mass spectral data (i.e. time-of-flight (MALDI and SELDI)) used to characterize biological mixtures are typically collected as signal vs. time. Process steps or modules 4-07, 4-16, 4-17, 4-18 and 4-19 are used with mass spectral data. Raw (binary) format mass spectral data is usually not accessible to third party programs. Fortunately, native software programs allow export of the spectral data to comma-separated-value (csv) or other spreadsheet forms which are easily parsed. Processing begins with the selection of the processing parameters 4-07. The parameters include the location of the data, tracking information, mass axis resolution factors described below, smoothing, and baseline correction parameters.

At step 4-16, raw mass spectral data is converted to the system format, and tracking of the data is set up in the database 6. Exported raw spectral data is easily read. The relationship of mass to time for time of flight (TOF) instruments is typically a simple polynomial (quadratic), and the polynomial parameters are chosen so that the masses of known calibration standards that span the scan area of interest are accurately represented. If the mass data in the export file represents points taken with a constant clock rate, the relationship between mass and TOF axis can be trivially calculated, and this calibration information stored with the data and in the tracking database 6. The ability to generate this pseudo-calibration allows for applying subsequent different calibration equations if needed.
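The quadratic mass/time relationship can be illustrated as follows (a sketch assuming numpy; t_cal and m_cal are hypothetical names for the flight times and known masses of calibration standards spanning the scan range):

    import numpy as np

    def fit_tof_calibration(t_cal, m_cal):
        # Least-squares quadratic fit: mass = a*t**2 + b*t + c
        return np.polyfit(t_cal, m_cal, 2)

    def tof_to_mass(t, coeffs):
        # Evaluate the calibration polynomial on a time axis
        return np.polyval(coeffs, t)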

Since TOF instruments collect data in time, the time points are usually evenly spaced. Analysis occurs with the mass axis, however, which after applying the non-linear calibration results in mass points being unevenly spaced. In order to leverage processing algorithms, which assume even spacing on the axis of interest, it is useful to create a linearized version of the mass axis. Accordingly, the automated processing module 4 includes a linearization module 4-17. Fourier interpolating the time axis by a factor chosen during parameter setup (a factor of 4 to 8 works well) can accomplish this without loss of intensity information. The desired mass axis resolution is chosen during parameter setup. The desired mass points are calculated from the inverse of the calibration equation and subsequently obtained by linearly interpolating the Fourier interpolated time axis.
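One way to realize the linearization just described (a sketch, not the exact module 4-17): zero-fill in the inverse-Fourier domain to densify the evenly spaced time axis without loss of intensity information, then interpolate onto the desired evenly spaced mass points:

    import numpy as np

    def linearize_mass_axis(y, t, mass_axis, cal, factor=8):
        # y: intensities on an evenly spaced time axis t;
        # cal: callable mapping time -> mass (the calibration equation).
        n = y.size
        # Fourier interpolate by zero-filling; rescale to keep amplitude
        dense = np.fft.irfft(np.fft.rfft(y), n * factor) * factor
        t_dense = np.linspace(t[0], t[-1], n * factor, endpoint=False)
        # linearly interpolate the densified curve onto the mass axis
        return np.interp(mass_axis, cal(t_dense), dense)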

Subsequent to linearizing the mass axis, standard smoothing and baseline algorithms can be applied in module 4-18. The default baseline correction involves calculating a convex shape around the base of the spectrum, using user-generated baseline pivots for standard segmented baseline subtraction, or both. The baseline points are stored in a data structure so that the process of baseline subtraction is reversible if different baseline correction parameters are desired in subsequent analysis.
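The convex-shape default admits a simple reading as the lower convex hull of the spectrum; a sketch of that reading (monotone-chain lower hull, evaluated at every mass point; illustrative only, not necessarily the exact algorithm of module 4-18):

    import numpy as np

    def convex_baseline(x, y):
        # Lower convex hull of (x, y); subtracting it from y gives a
        # convex baseline correction of the kind described above.
        hull = []
        for p in zip(x, y):
            while len(hull) >= 2 and (
                (hull[-1][0] - hull[-2][0]) * (p[1] - hull[-2][1])
                - (hull[-1][1] - hull[-2][1]) * (p[0] - hull[-2][0])
            ) <= 0:
                hull.pop()
            hull.append(p)
        hx, hy = zip(*hull)
        return np.interp(x, hx, hy)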

Post processing, the data and parameters are saved in the system format by module 4-19. The original time domain MS data is stored in addition to the processed and linearized mass axis. With the linearization of the mass axis, only a vector of intensities is needed for storage. The mass axis can be trivially calculated from the processing parameters. As with NMR or any other data type implemented, the data are stored in the system format file 5 and tracking is maintained in the system database 6.

Module 8: Processing Refinement (Figure 3)

Referring now to Figures 1 and 3, after the initial processing of the raw data in module 4 (Figure 2), processed raw files 5 are generated and stored, with sample identity and key processing parameters stored in the system database 6. Depending on the type of data (NMR vs. mass spectrometry), review and refinement requires a different path. Thus, a determination of data type is required in step 8-01. As other modules for different data types are added, specific review and refinement procedures can be incorporated, as indicated by step 8-08. The innovation needed for this module was the ability to rapidly sort out those spectra that might need attention or to find spectra that indicate problems with the samples or the spectrometer. The answer is to store key processing and quality parameters in the system database 6 and to subsequently drive additional automation and visual refinement by selection of records within the database, and not from the manual selection of individual files for inspection apart from the other spectra. The ability to summarize processing parameters and spot outliers easily is not only novel but also critical to screening protocols that utilize spectroscopy. Ultimately, these spectral results will be mined and/or modeled for significant differences between groups. It is imperative that these differences be based on actual chemical or biological phenomena, and not on artifacts of data collection or processing. As a consequence, outliers determined from processing need to be reviewed for processing error, sample error, or instrument error, prior to analyzing the data for relevant differences.

For NMR data, inspection of the processing and quality control parameters stored in the database 6 may indicate which sample spectra need to be reviewed. For NMR spectra that have an internal reference, several indicators of quality are automatically generated. The first is whether the reference band was found. Failure to do so obviously indicates a failure of some type. The second parameter is the offset where the band was found. Offsets that are outliers from the typical values found in an analysis run indicate that spurious resonances have occurred near the reference or that the reference algorithm found the wrong reference. In addition, the full-width-half-height is calculated for the reference. This is a measure of the instrument performance and tuning. An outlier in this metric (larger than typical for a run) indicates that the band resolution for this sample is not optimal. This could be due to contamination of the sample, failure of the instrument to find a lock signal, or a sub-optimal instrument shim. In addition, the database captures the phase adjustments found from automated phasing. Phase parameters that are outliers from the range of typical values flag sample signal or auto-phasing issues. By inspecting these quality parameters, a small subset of spectra that need to be reviewed can be determined.
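The database-driven outlier screen might look like the following (a sketch; a robust z-score is one reasonable choice, applied to any stored quality parameter such as reference offset, full-width-half-height, or phase values):

    import numpy as np

    def flag_outliers(values, cutoff=3.5):
        # Robust z-score (median/MAD) over a quality parameter pulled
        # from the tracking database; True marks spectra to review.
        v = np.asarray(values, dtype=float)
        med = np.median(v)
        mad = np.median(np.abs(v - med))
        if mad == 0:
            return np.zeros(v.size, dtype=bool)
        return np.abs(0.6745 * (v - med) / mad) > cutoff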

Samples to be reviewed are determined by inspection of the database processing parameter capture. Subsequent process refinements are driven from records chosen in the database. Step 8-02, therefore, is a determination of whether samples to be reviewed are obtained from the database or from manual editing. One option available is to process selected spectra using database values at step 8-03. Suspect spectra can have their processing values set to the group average prior to visual refinement. This feature can also be used if automated parameters for a sample type are not yet known or not yet optimal. A few spectra can be processed by hand and the resulting processing values can be applied to the remainder of the spectra automatically prior to visual inspection. The reprocessed spectra are saved to a memory file 5 storing processed raw data.

For NMR, phasing is the most problematic processing function. For this reason an optimized visualization tool for manual verification and correction has been designed. The tool is provided on the user interface of the computer system implementing the system 100.

The tool is shown in Figure 9 and indicated in the software flow chart at step 8-04. The visualization tool of Figure 9 simultaneously includes multiple views of the data useful for review. A full view of the spectra is overlaid with a scaled view showing details of the baseline. In addition, an integral trace is overlaid on the data. Zooming is also allowed.

When zoomed on spectral features, the integral trace maintains a full spectral display so that some measure of the state of the phasing of the full spectra is always available. Phase parameters are adjusted by sliders, features provided on the user interface. Parameters for automated phasing based on baseline regions are also available by activating appropriate icons. These automation parameters can be changed and the effect on phase inspected. If a set of baseline regions seems optimal for phasing, the remainder of the spectra in the list for inspection can use these parameters to finish the process in automation. These parameters can then be utilized for full automation the next time this type of sample/spectra is collected for analysis. In addition to phase, this same interface allows reference peak locations to be adjusted based on selecting the reference peak of interest. The automation algorithms for reference location finding are sufficiently robust, however, that adjustment of NMR reference calibration is only needed for extremely problematic data. The resulting processing parameters are stored in the system database 6 and the processed spectra files 5 are updated.

For TOF-MS data, a review of baseline and smoothing results may indicate that refinement is necessary. An apply corrections module 8-05 is therefore provided. These refinement and correction procedures are based on automation with chosen parameters.

Standard smoothing involves selection of window widths and function order, a tradeoff between signal-to-noise enhancement and resolution. Automated generation of spectral baselines is best for screening purposes. Reproducibility is more critical than baseline perfection, and utilizing a consistent set of parameters for all spectra in a set meets this criterion. After review, additional baseline regions may be selected and applied to all spectra.

In some cases, band positions in TOF-MS appear to change position between replicate samples. This is usually a result of instrument and sample preparation limitations, and not variation due to phenomena to be captured in modeling or analysis. This shift can be modeled as a calibration error. Ideally, internal standards could be incorporated into the samples to generate a calibration for each sample.

To overcome the calibration issues, an algorithm has been developed that utilizes persistent bands within a group of spectra under analysis to generate a secondary calibration for each sample. This algorithm is implemented in software in routine 8-06. This procedure could be accomplished manually at the instrument by calibrating selected sample bands in each sample, but this would be prohibitively tedious. For most cases, the secondary calibration supplements the instrument calibration with a simple offset and bias correction. This is consistent with the observation that most calibration errors are proportional to the mass. In extreme cases needing a non-linear correction, the secondary calibration could replace the instrument calibration. Data analysis could proceed without this correction, but tedious, error-prone, manual alignment of subsequent mining form results would then be required, rendering the potential for automated screening applications tentative at best.

The routine 8-06 begins by processing the data without recalibration to standardized form. Visualization tools provided in module 11 (see Figures 1 and 5) are used to inspect the data to find band ranges that are common to the majority of the samples and that span the mass ranges of interest. These are selected either automatically or with human involvement. These spectra or spectral ranges are subsequently recorded and stored. At the same time, the individual spectra are analyzed to find the spectrum that is closest to the mean of the group. This spectrum is recorded as the reference for the group, again either automatically or with human involvement. Calibration is initiated with a list of spectra to be calibrated, the reference spectrum, and the ranges to match.

Each band to be calibrated is then normalized, using a variety of options including smoothing, normalization, and derivative order. The spectra are first smoothed and differentiated. Differentiation emphasizes different features of bands: no derivative emphasizes general band shape, first derivatives emphasize band maxima, and second derivatives emphasize positions of sharp features. Except for cases where the signal to noise ratio is very strong, no derivatives are needed. The data are normalized by setting the minimum, mean, or median of the band to 0 and subsequently scaling to uniform range, area, or Euclidean norm. Setting the minimum value to 0 and using Euclidean norms works well in practice. Normalizing each band independently gives each band equal weight when solving for the new calibration.

To perform or solve the new calibration, a calibration error function is chosen and then applied to the normalized spectral data. The calibration error function can include a sum-of-squares difference between the reference spectra and the spectra to be calibrated or a sum of inner products between the normalized bands of the spectra. The sum of inner products (using Euclidean norms) would equal the number of bands used for calibration for identical spectra and is a useful choice. The calibration parameters are then subjected to optimization functions (e.g. simplex) to minimize the differences between the reference and the spectra to be calibrated. Diagnostics are collected on each individual spectral recalibration for the recalibration procedure (e.g. total calibration error) and for each band (band calibration error). These diagnostics can be used to reject or weight the bands differently for a second pass. This procedure can be augmented to use the initial group mean as the reference and, after recalibration, calculate a new group mean, iterating until a stable group mean is obtained.
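A sketch of this band-matching recalibration (an illustration of the inner-product variant, not the exact routine 8-06; the offset/bias parameterization and names are assumptions):

    import numpy as np
    from scipy.optimize import minimize

    def _norm_band(y):
        # Zero the band minimum, then scale to unit Euclidean norm.
        y = y - y.min()
        n = np.linalg.norm(y)
        return y / n if n else y

    def recalibrate(x, y, x_ref, y_ref, band_ranges):
        # Find offset a and bias b so the spectrum (x, y), read at
        # a + b*x, best matches the reference over persistent bands.
        def cost(p):
            total = 0.0
            for lo, hi in band_ranges:
                grid = np.linspace(lo, hi, 200)
                s = _norm_band(np.interp(grid, p[0] + p[1] * x, y))
                r = _norm_band(np.interp(grid, x_ref, y_ref))
                total += np.dot(s, r)  # 1.0 per band if bands identical
            return -total  # simplex minimizes, so negate similarity
        return minimize(cost, x0=[0.0, 1.0], method="Nelder-Mead").x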

An alternative to this procedure would include picking peak positions between the reference spectrum and the spectrum to be recalibrated and calculating a simple regression correction, as is done for instrument calibration procedures. Instrument calibration is used with known materials with known masses. When calibrating to persistent bands, however, a metric that maximizes band similarity works well without having to select peak positions.

Utilizing visualization that overlays multiple spectra simultaneously, results from internal calibration algorithms are displayed to the user for inspection, indicated at step 8-07. If bands of like substances need further refinement, additional or fewer bands for alignment can be selected and the process of steps 8-05, 8-06 and 8-07 repeated. For typical bias and offset corrections to the original instrument calibration, selection of four to ten persistent bands that span the range of interest is usually sufficient. Secondary calibration parameters are saved in the database 6, and processed spectra are saved to the file 5.

Module 7: Data Standardization (Figure 4)

Referring now to Figures 1 and 4, the automated processing and review modules 4 and 8 are designed to generate consistent, reproducible, processed raw spectra. For individual spectra the process would be complete. In order to compare information between large numbers of individual spectra, or in order to create group averages and variances for spectra, an additional processing step is needed to "normalize" the information contained between the various axes. Hence, the system 100 includes a data standardization module 7, shown in more detail in Figure 4.

While data-enhancing procedures have been allowed prior to this point, it is strongly preferable to apply only those algorithms that are necessary to process the data (e.g. solvent filtering NMR spectra for dilute samples so that phasing can be accomplished). Algorithms such as smoothing, baseline correction, frequency filtering, and window functions each have an effect on the data. By applying these algorithms as an additional step to produce standard data, multiple sets of parameters can be utilized to elucidate the most effective ones for subsequent analysis and modeling. This ability is a distinctive feature of this system.

The process 7 is initiated by a step 7-01 of entering the parameters to be used for the processing steps designated below. The samples are chosen from records in the sample database 6. The data standardization process 7 operates on data from the processed spectra data files 5.

The first processing step 7-02 is optional. If chosen, the processing applies digital filtering for interfering components. This typically applies to NMR data where solvent, buffer, or water signals are of large magnitude relative to the spectral bands of interest. A list of frequencies is chosen for filtering, an inverse-Fourier time domain spectrum is created, and the filters are applied to the time domain data. Parameters need to be optimized so that the filter is sufficiently wide to eliminate the unwanted signal but sufficiently sharp so that neighboring bands are not attenuated.

The next step 7-03 is to apply additional time-domain window functions, typically applied to NMR data for signal-to-noise or resolution enhancement. While usually applied to NMR data, any signal can be pseudo-inverted by inverse-Fourier algorithms and noise reduction windows applied, and this procedure is allowed by the system. This is an example of the benefit of linearizing the x-axis for MS data.

Spectra are multivariate, but not in the sense that they may have tens of thousands of spectral points. They are multivariate because they have multiple features (or intrinsic factors) that correlate with underlying compositions of chemical and biological entities. Prior to performing some form of feature or factor analysis, the actual multivariate dimensionality is indeterminate. The points along the x-axis represent discrete points of a continuum of values. The measured intensities at these discrete values are highly correlated due to the fact that features span many measured points. The measured discrete values are usually sufficiently spaced to adequately sample actual spectral features. In order to facilitate further analysis, it is necessary to match the spectral points without losing the interrelations between the points. This matching is performed by a resampling/resolution matching algorithm in routine 7-04. Methods that strictly interpolate points in the original spectrum may lose information about band maxima and curvature unless the point spacing in the initial spectrum is very dense relative to the spectral features. Two strategies can be utilized to minimize or eliminate information loss in the resolution matching procedure. Both start with an inverse fast-Fourier transform. For the first method, extending the inverse domain data by a large factor of points (8-16 is usually sufficient) and returning to the original domain precedes interpolation. The x-axis is now sufficiently dense to apply a simple interpolation or spline procedure without information loss. The second method involves a discrete Fourier integration for each point in the desired axis resolution. This method is slower but yields exact interpolated values.

The sample spectra would, after the resolution matching procedure, have identical sampling resolutions. Consistent baseline and/or smoothing can now be applied in step or routine 7-05. Sometimes it is advantageous to perform modeling utilizing spectral derivatives. For example, if modeling results are optimized when band positions are key to the results, then utilizing a first derivative will amplify this effect. If modeling results are optimized whereby sharp features are emphasized over diffuse features, then modeling with second derivative data will amplify this effect. Derivatives can be generated with a window function in the inverse-Fourier domain or with the same algorithms utilized for smoothing.

Occasionally, interferences occur in spectral regions, thereby masking useful information. Many of the modeling algorithms commonly used to analyze data can handle blocks of missing information. If these algorithms (e.g., PCA, PLS, as described later) are used (the majority case), then the interfering region can be regarded as missing data. Some modeling or visualization tools may not handle missing data, and some method to reconstruct the missing data may be needed. This feature is provided by a reconstruction module 7-06. Since the data is standardized to a uniform x-axis, a method to reconstruct the data based on predictive modeling has been designed to perform reconstruction, indicated at step 7-08. In addition, reconstruction models can be saved in a database 7-07 for routine use for cases where reconstruction is part of automated screening. The procedure of step 7-08 begins by collecting a set of spectral data that has been standardized but that does not have the corrupted or missing region. The spectral points in these spectra that correspond to the corrupted region are treated as the dependent variables to be modeled. The remaining spectral points are regarded as the independent variables. Multivariate regression methods such as principal-component regression or partial-least-squares regression can then be used to build models to predict the missing data based on available data from other spectral regions. For NMR data, this is particularly useful in that chemical species contributing NMR signals typically have multiple bands. If one band is in the independent block and the other in the dependent block, then the reconstruction procedure will be highly successful at full information reconstruction. In other cases, the fact that chemical species are correlated allows a reasonable reconstruction. One benefit of this method is that it is independent of the normalization of the data. The model database 7-07 consists of regression coefficients to reconstruct the masked region. Since this process would only be irregularly used, third party software could be used for the actual reconstruction modeling.
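A sketch of such a reconstruction model (scikit-learn's PLSRegression stands in for the partial-least-squares step; the index handling and names are illustrative):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    def fit_reconstructor(clean_spectra, masked_idx, n_components=5):
        # Dependent block: points in the masked region; independent
        # block: all remaining points, taken from uncorrupted spectra.
        keep = np.setdiff1d(np.arange(clean_spectra.shape[1]), masked_idx)
        model = PLSRegression(n_components=n_components)
        model.fit(clean_spectra[:, keep], clean_spectra[:, masked_idx])
        return model, keep

    def reconstruct(model, keep, corrupted_spectra):
        # Predict the masked region for spectra with the corrupted zone.
        return model.predict(corrupted_spectra[:, keep])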

The usual final step in data standardization is to apply intensity normalization in module 7-09. Intensity normalization is chosen to match the modeling to be performed. In cases where the absolute intensity can be correlated to modeling results or constituent concentrations, no normalization is chosen. In many cases, however, the absolute intensity is unreliable, or only relative intensities are important for model prediction. In these cases, normalization is usually applied. For situations where the ratio between known specific species is the key to modeling, normalization can consist of scaling a known band to unit intensity so that all other intensities are indicated as a ratio to the chosen band. In the most common case, normalization is chosen so that the total area of each spectrum is set to one. For this case, the spectrum is treated as a probability distribution. This is advantageous when modeling strategies look for changes in probability distribution of constituents. Other strategies such as Euclidean normalization and normalizing to the maximum peak are also provided for. Areas can be excluded from normalization but scaled and kept for analysis.
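The normalization options above reduce to a few lines (a sketch; the mode names are hypothetical):

    import numpy as np

    def normalize(x, y, mode="area"):
        if mode == "area":       # spectrum as a probability distribution
            return y / np.trapz(y, x)
        if mode == "euclidean":  # unit Euclidean norm
            return y / np.linalg.norm(y)
        if mode == "max":        # scale the maximum peak to unity
            return y / y.max()
        return y                 # "none": keep absolute intensities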

The final processed spectra are saved in a new standardized individual spectra file 10 (see also Fig. 4) of just the processed spectral data and the data structure of processing parameters for reference.

In some cases, screening involves comparison of trial screening data to known standards. For this case, it is advantageous to build an additional layer of structure on the standardized data. Hence, the processing flow proceeds to a generate libraries routine 7-10, where a determination is made as to whether libraries are to be generated. The module 7-11 groups data and generates feature selection tables for a library of standardized data. Information about the identity of the samples and database key indexes or features can be pulled from the sample database, as well as information about how the data should be grouped, and such information used to build a library or libraries, indicated at step 7-11.

Standardized libraries in the form of files 9 (see Figs. 1 and 4) are very similar to standardized individual files, except the spectral data is stored as a multidimensional structure of spectra. Database key values for the individual spectra are also stored in a table within the library file. Individual spectra files have a one-to-one relationship with the database records but this additional table is needed since one library file contains multiple records from the database. In addition, tables of peak positions can be incorporated in the library file as a quick reference to searching for spectra that contain peaks or peak searches based on unions of peaks.

Module 11: Visual Data Mining/Statistical Analysis/Feature Extraction (Figure 5)

Referring now to Figures 1 and 5, after data standardization in module 7, the data stored in the standardized individual spectra database 10 and the associated database records 6 are prepared for visual analysis in module 11-01. The module 11 deals with analysis by visualization techniques in the illustrated embodiment. Subsequent modules deal with data modeling but start with the same processed data. If annotations about the samples were not available during the initial data processing of the samples, they can be added at any time.

This preparation performed by module 11-01 includes generating sample lists, groupings associated with the sample, retrieving reference libraries, and generating renormalization and deresolution factors. Additionally, the user is prompted to input display and labeling options. Specifically, group membership (e.g. control, treatments, time points, etc.) or any categorical information around which statistical summaries may be useful can and should be associated with samples/spectra in the database. The visualization tool allows selection of database records, including which fields to use for group statistics and which fields indicate that reference libraries are available for a given sample. Additional parameters specific to visualization include data normalization options, visualization resolution, and spectra label options. The ability to reduce the resolution prior to visualization is provided to account for limited computing resources. For example, if individual spectra contain 32000 points, most computer video cards and monitors only display 1600 pixels easily, and for surveying large numbers of spectra, lowering the resolution facilitates computing performance. For detailed analysis of smaller numbers of spectra, full spectral resolution is available. Spectral resolution can be lowered by averaging over user-selected numbers of spectral points or by averaging over set x-axis resolution widths.

The final step prior to launching the visualization tools is the selection of the mode of visualization, indicated at step 11-02. Three modes are available depending on the type of questions being asked: stack, outlier, and statistical analysis. Each is discussed below.

The first mode, the outlier analysis tool 11-02, facilitates finding spectra that are different from the rest selected for display. An example of this tool is shown in Figure 10. This tool is most often used to spot "outliers". In this mode, all spectral traces have the same color except for the one selected. With this method, where spectra nearly overlay, bands will occur. Single spectra that fall outside of the dense banded areas are obviously different. The user can scroll through all spectra, highlighting each in turn, or use the mouse to point to specific traces that are separate from the majority. The record ID appears in the analysis window on the screen display. Figure 10 shows a single spectrum with ID=test101 that has a unique resonance at 1.5. Selecting this resonance highlights this spectrum. Outliers can subsequently be denoted in the database 6 with explanation and flags as to fitness for further analysis.

The second mode is the traditional stack view of spectral data, indicated in step 11-04 and shown in Figure 11. This method is useful for incorporation into reports and visually summarizing entire experiments. In addition, users can zoom select regions of spectra and, based on a current visual limit, request in step 11-05 that a summary be generated, indicated at 11-06 and stored in a database table 12 (see also Fig. 1). The summary includes the area for each spectrum in the selected region, the maximum value, and the location of the maximum value. Often, the area or maximum value for a band is proportional to a concentration for a constituent within the sample. Relative areas for multiple bands would be indicative of relative concentrations. The location of the band maximum can be a measure of environmental influences, such as pH, and can be used as an indirect means of determining this type of information. The user can zoom and summarize multiple regions in succession with the results automatically stored in the visualization results database 12 for later analysis. In addition, an interactive user interface is provided with this tool whereby the user can type or paste into a dialog box multiple regions of the spectra and populate the database tables automatically for routine analysis needing this feature.
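The per-region summary of step 11-06 amounts to the following (a sketch; names are illustrative):

    import numpy as np

    def summarize_region(x, y, lo, hi):
        # Area, maximum, and location of maximum over the zoomed region.
        m = (x >= lo) & (x <= hi)
        xr, yr = x[m], y[m]
        i = int(np.argmax(yr))
        return {"area": np.trapz(yr, xr), "max": yr[i], "max_at": xr[i]}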

The third mode of visualization is designed to perform detailed statistical comparisons between individual spectra and between groups. This mode is indicated by processing step 11-07. The groupings designated at the beginning of this activity are used to generate spectral means and variances for comparison between groups and between groups and individual spectra. Comparisons are selected by providing two scroll bars to select between groups for comparison and two scroll bars to select between individual spectra for comparison. See, for example, Figures 14 and 15. For comparison between individual spectra, one is designated as the reference and the second is designated for comparison. Any individual or group mean can be selected as the reference. Figures 14 and 15 are indicative of comparisons routinely provided and are described below. In addition, two auxiliary views are provided to visualize trends and differences between groups and individual spectra and to select which spectra to focus on for analysis.

Within the statistical visualization and analysis mode, several auxiliary viewing tools are provided. The first auxiliary view is designated as an image view of the data (11-08, Figure 12). If selected, this view displays all spectra simultaneously on edge (top-down), color coded according to intensity. The user can control the color scale to emphasize intense or minor spectral features, indicated by module 11-09. In addition, the user can zoom any portion of the image view and control the resolution displayed. As is illustrated in Figure 12, if the spectra are aligned by group membership, inspection of the bands in this view may indicate those regions that differentiate between groups. From this view the user can designate an individual spectrum to compare in the comparison tool. In addition, the image view can be generated relative to the individual spectrum designated as the "reference". If this view is selected, the image is generated as a difference map from the reference, and the color intensity indicates whether the difference is positive or negative relative to the reference. If the reference spectrum chosen is the mean for the control samples, then the difference image can be used to highlight bands that are more or less intense than the control for treatment groups.

The second auxiliary view, a residual magnitude view, indicated by processing step 11-10, is initiated by selecting a reference spectrum. The processing step 11-10 calculates the magnitude of the difference (sometimes referred to as the residual) between the reference and all the individual spectra. An example of this view is shown in Figure 13.

Figure 13 shows a typical result for two groups of data. Group 1 is designated as spectra from a control group. The mean of Group 1 is used as the reference. As can be seen, the magnitude of the residuals for Group 1 samples is fairly consistent. The Group 2 samples are different from the Group 1 samples, as is indicated by their larger residuals. The outlier spectrum illustrated in Figure 10 has a very large residual. Appended to the individual sample spectra are the group means. It can be seen at the right edge of Figure 13 that the residual for the Group 1 mean is 0, as it should be since this mean was designated as the reference, and the final residual is the mean of Group 2 relative to Group 1. This mode further features a tool for the user to query data points. The user can select a residual point via step 11-11, and this spectrum becomes the "comparison" individual spectrum in the compare mode. In addition, the user can export the residual information to the visualization results database 12. The database 12 keeps track of the magnitude as well as the spectrum used as the reference so that multiple references can be chosen and the results stored.
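
The residual magnitude calculation of step 11-10 amounts to the size of the difference between each spectrum and the reference; a hedged NumPy sketch (names are illustrative, not the system's own):

    import numpy as np

    def residual_magnitudes(spectra, reference):
        # Euclidean magnitude of (spectrum - reference) for each row.
        # reference might be, e.g., the mean of a control group.
        return np.linalg.norm(spectra - reference, axis=1)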

The image and residual plot auxiliary views are designed to visualize bulk differences between spectra. Spectra selected from these views are returned to a main pair-wise comparison tool 11-12 for detailed analysis. As already mentioned, the user can scroll through comparisons between all combinations of groups and individuals.

The software also includes a tool for pair-wise residual analysis, indicated at module 11-14 and shown in Figure 14. For comparison between individuals, the analyst is often screening for major differences or similarity. One spectrum is chosen as a reference, and the remainder can be scrolled for overlay with the reference, with the difference, designated as the residual, also viewable. A module 11-15 is provided to scale the data, either manually or to specific peaks, to highlight different band differences. The pattern associated with the difference can be indicative of how the sample should be classified. In some cases, it is possible to build reference libraries of samples of known class. If this is the case, the analyst can request that reference library information stored in database 9 (Fig. 1) be imported for comparison with this spectrum. The spectrum under question then moves into the reference slot, and the library spectra are available to scroll for comparison. If the analyst determines that a library spectrum is a match for the spectrum under study, this is denoted as a "hit". The flagging of individual spectra or groups of spectra is indicated by processing step 11-16. A summary of this information is stored in the system database 6.

The summary includes the spectrum used as a reference, the spectrum for comparison, and, if the comparison is a library spectrum, additional information about its original database record.

For groups of spectra, the average and variance for each group are calculated and appended to the block of individual spectra. The group averages are treated as additional individual spectra when comparing pair-wise between individuals. When comparing between groups, however, the variance information can be used to perform statistical analysis. A group statistics view module 11-13 is provided for this purpose, an example of which is shown in Figure 15. In this view, the analyst sees the mean and additional traces indicating the standard deviation around the mean. Scrolling through groups, the analyst can inspect which regions have large variance and which have minimal variance. The analyst can also select a difference-of-the-means significance value (p-value) and highlight regions that are different within the statistical confidence selected. The analyst can also request that a correlation coefficient be calculated between the groups, which would be a measure of classification power for spectral features. Spectral features indicating a significant difference of the means are flagged (as seen in Figure 15). Spectra can also be simultaneously flagged, using different markers, where correlation values are significant. The analyst can choose to store the group summaries in the database 12. If selected, group means, standard deviations, significance, and correlation values are stored.
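
One plausible form of the difference-of-the-means test, assuming NumPy and SciPy (the exact statistics used by the system are not specified, so this sketch is an assumption):

    import numpy as np
    from scipy import stats

    def group_difference(group_a, group_b, p_value=0.05):
        # Point-wise two-sample t-test between two groups of spectra
        # (rows = samples); returns means, standard deviations, and a
        # mask of the spectral points significant at the chosen p-value.
        t, p = stats.ttest_ind(group_a, group_b, axis=0)
        return {"mean_a": group_a.mean(axis=0),
                "std_a": group_a.std(axis=0, ddof=1),
                "mean_b": group_b.mean(axis=0),
                "std_b": group_b.std(axis=0, ddof=1),
                "significant": p < p_value}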

Finally, the analyst can compare groups to individual spectra in a module 11-17. By displaying group means and the associated standard deviations, the analyst can scroll through individual spectra and note individuals that are outliers to the group. If there are spectra that the analyst determines by comparison are important to note, a pair-wise summary can be flagged and noted via module 11-16 and stored in the visualization results database 12.

Module 13: Data Reduction to Modeling Form (Figure 6)

Referring now to Figures 1 and 6, a benefit of creating and having standardized spectra is the ability to visualize and model directly on the full spectral data. It is often expedient, however, to apply some variety of feature selection or other data reduction process prior to modeling. Supervised methods in particular, without mathematical constraints appropriate for the data under study, will select random spectral features or artifacts if needed to generate an apparent predictive model. While predictive with the training set, these models will not generate robust predictions for screening applications.

Data reduction to modeling form is therefore a means of distilling the data into essential features, and acts as a constraint on which spectral features are allowed for modeling.

The data reduction to modeling form module 13 is initiated by selecting which spectral records in the system database to reduce, indicated at step 13-01. The creation of standardized spectra is a prerequisite for this activity. Reduction methods and parameters are chosen, as well as the mode of output for the results, and the process is initiated for automated completion. In particular, the module 13-01 sets up sample lists, data reduction parameters, and storage and export options.

If desired, adjustments to the standardized spectra can be performed as a first step, via module 13-03. The most common adjustments performed by module 13-03 would be changing the normalization or selecting specific regions to use or exclude from the modeling form. Additional processing, such as smoothing or derivatives, could be performed if needed. The process flow then branches based on the reduction method chosen. With the modular design, any method useful for reduction can be incorporated into the system and added to the available options for reduction to modeling form. Some of the more common procedures are outlined below.

The system includes a data segmentation module 13-05. Data segmentation is very frequently chosen. This module consists simply of an algorithm integrating over consecutive regions of spectra. The list of integrations serves as the basis for modeling.

This method is based on the fact that most spectral features are much broader than the discrete resolution of the data. By choosing segment widths that are slightly smaller than the typical width of spectral features, the essential information is encoded while residual noise and spurious small features are minimized. If, during modeling, it is determined that information in particular segments is important for modeling or classification, these regions can subsequently be modeled with higher resolution segments or full resolution data.
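
A minimal sketch of such segmentation, assuming NumPy (the segment width is an illustrative choice, not a prescribed value):

    import numpy as np

    def segment(y, width=50):
        # Integrate over consecutive fixed-width segments; width is chosen
        # slightly smaller than the typical band width so the essential
        # information is kept while noise is averaged away.
        n = (len(y) // width) * width
        return y[:n].reshape(-1, width).sum(axis=1)

    # 32000 points with width=50 yield a 640-variable modeling vector.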

The system further includes a peak selection and deconvolution module 13-6. Peak selection or other deconvolution methods implemented in this module are also frequently used for reduction. Peaks, features, or resonances in spectral data generally have a theoretical shape (e.g. Gaussian or Lorentzian are common). These peaks, via their area or intensity, are subsequently associated with concentrations of chemical or biological constituents in the sample. In principle, it would be possible to find and correlate every spectral feature with a constituent concentration. In practice, with potentially thousands of contributing constituents in a biological sample, this procedure would be impractical. An intermediate solution is to find and quantify spectral features for known assigned entities. Methods of varying degrees of sophistication exist to perform this task, from least squares fitting of multiple overlapping peaks, to linear-predictive, maximum entropy, and other resolution enhancement algorithms. (See for example: Stephenson, D. S., Linear Prediction and Maximum Entropy Methods in NMR Spectroscopy, Progress in NMR Spect., Vol. 20, pp. 515-626 (1988), and Barkhuijsen, H. et al., Retrieval of Frequencies, Amplitudes, Damping Factors, and Phases from Time-Domain Signals Using a Linear Least-Squares Procedure, J. of Mag. Res., Vol. 61, pp. 465-481 (1985), the contents of which are incorporated by reference herein). The output of module 13-6 is a list of areas for peaks annotated with their assumed parent constituent. This reduced list of constituent-driven features is used for modeling.
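
As one hedged example of such least squares fitting, a single Lorentzian line can be fitted with SciPy (the line-shape parameterization and starting values are illustrative assumptions, not the system's method):

    import numpy as np
    from scipy.optimize import curve_fit

    def lorentzian(x, area, center, width):
        # Lorentzian line whose integral equals `area`.
        return (area / np.pi) * (width / ((x - center) ** 2 + width ** 2))

    x = np.linspace(0.0, 10.0, 2000)
    y = lorentzian(x, 2.0, 4.0, 0.05) + 0.01 * np.random.randn(x.size)

    # Least squares fit; the fitted area would then be associated with
    # the concentration of the assigned constituent.
    (area, center, width), _ = curve_fit(lorentzian, x, y, p0=[1.0, 4.1, 0.1])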

When theoretical relationships are well known between constituents and the measured signals, other procedures similar to peak selection are also possible. Thus, the system includes a component selection by simulation and comparison module 13-8.

Iterative refinement comparing predicted and measured features is used by this module to deconvolute constituent information. Some spectral types (e.g. NMR) have multiple peaks or bands associated with constituents in the sample. The positions and patterns associated with these bands can be simulated quite accurately from a set of parameters associated with the sample (e.g., chemical shifts and coupling constants) and the instrument (e.g., field strength). Automated refinement of the sample parameters against the actual measured data can generate a modeled spectrum that can be used to deconvolute the experimental data into constituent information. Spectra are modeled as a sum of constituent spectra. The coefficients for each constituent are subsequently used for modeling. For mass spectroscopy, the theoretical predictions of isotopic patterns for high-resolution data, and the presence of multiple charge states for constituents, can be exploited to associate multiple peaks with the same constituent and thereby reduce the spectra to pure constituent spectra before proceeding to modeling. (See for example: Brown, R. S. and Gilfrich, N. L., Maximum-Likelihood Restoration Data Processing Techniques Applied to Matrix-Assisted Laser Desorption Mass Spectra, Applied Spectroscopy, Vol. 47, pp. 103-110 (1993), and Ferrige, A. G. et al., Disentangling Electrospray Spectra with Maximum Entropy, Rapid Commun. Mass Spec., Vol. 6, pp. 707-711 (1992), the contents of which are incorporated by reference herein).

An alternative procedure that can be used in spectral reduction is wavelet decomposition. (See for example: Alsberg, B. K. et al., Tutorial: An Introduction to Wavelet Transforms for Chemometricians: A Time-Frequency Approach, Chemometrics and Intell. Lab. Systems, Vol. 37, pp. 215-239 (1997), Cai, C. and Harrington, P., Different Discrete Wavelet Transforms Applied to Denoising Analytical Data, J. Chem. Inf. Comput. Sci., Vol. 38, pp. 1161-1170 (1998), and Tan, H. and Brown, S., Wavelet Analysis Applied to Removing Non-constant, Varying Spectroscopic Background in Multivariate Calibration, J. of Chemometrics, Vol. 16, pp. 228-240 (2002), the contents of which are incorporated by reference herein). This procedure is provided in a wavelet decomposition module 13-7. Wavelets have been used for image and spectral compression and for noise reduction. The ability to reduce noise and compress data into a smaller set of representative features makes wavelets an attractive method to use. Wavelet coefficients from the decomposition are subsequently utilized for model building.
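
A sketch of wavelet-based reduction using the third-party PyWavelets package (the wavelet family, decomposition level, and number of retained coefficients are illustrative assumptions):

    import numpy as np
    import pywt  # PyWavelets

    def wavelet_features(y, wavelet="db4", level=6, keep=256):
        # Keep only the `keep` largest-magnitude wavelet coefficients;
        # noise and minor features fall away, compressing the spectrum.
        coeffs = np.concatenate(pywt.wavedec(y, wavelet, level=level))
        out = np.zeros_like(coeffs)
        top = np.argsort(np.abs(coeffs))[-keep:]
        out[top] = coeffs[top]
        return out

    y = np.sin(np.linspace(0, 20, 4096)) + 0.05 * np.random.randn(4096)
    features = wavelet_features(y)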

Many other methods, indicated by module 13-4, can be exploited to reduce data to simpler forms. Class memberships from fuzzy clustering methods can be used to reduce spectra into subgroups of memberships. For samples subjected to combination separation-spectroscopy techniques, or for samples that evolve in time with spectra taken at designated time intervals, multi-way and evolving factor analysis techniques can be used to determine the underlying number of constituent groups and the associated "spectral" concentration. Time-evolving data can also be distilled to a trajectory and used for modeling and classification. In some cases auto-correlation or cross-correlation is a useful reduction method. In general, any useful method for data distillation can be incorporated into the system and implemented in module 13-4. These methods can be used for both reduction and modeling and are discussed in the modeling context in the next section.

After processing by any one of the modules 13-4, 13-5, 13-6, 13-7, 13-8, the data to be used for modeling is supplied to an output module 13-9, which directs the data to the appropriate database or file (14, 17 or 15, see Figs. 1 and 6). In addition, collation may be used to combine data from multiple sources for analysis. Information about sample classification, additional measurements by other techniques (from database 21), information obtained by multiple spectroscopic techniques, and experimental outcomes associated with the samples can be bundled with the spectral reduction data for modeling. Data can be stored in database tables 14 within the system tracking database 19, exported for analysis by third party software platforms (see item 17, Figures 1 and 6), or passed directly into the model building and/or prediction module 15 within the system 100.

Module 15: Model Building/Visualization/Analysis/Prediction (Figure 7)

The importance of selecting appropriate data reduction methods has already been described. It is just as important to select appropriate modeling methods. As there are many existing modeling methods, the system 100 has been designed to easily leverage standard modeling and analysis tools that are commercially available. Methods that have been determined to be useful for screening (prediction or classification), amenable to automation, or useful for data quality control are preferably incorporated directly into the system 100 and integrated into the processing flow, as described herein. Exploratory development can be incorporated into the system or explored outside the system using exported data.

With reference to Figures 1 and 7, the model building module 15 starts with reduced data from module 13 (Fig. 6). If no reduction was performed, standardized spectra can be used directly for analysis. Both prediction and model building activities are supported in module 15. The user is prompted to select either screening prediction or model building at step 15-1. The prediction branch, module 15-2, presumes the existence of selected prediction models stored in memory 18, previously produced by the model building activities. In step 15-2, the stored classification or predictive model 18 is applied to the data reduced to modeling form 13 and the results are saved to the model/screening results file 16.

Two types of modeling and exploratory activities are available, and the user is prompted to select which one they want at step 15-3. The two categories of methods are referred to as supervised and unsupervised modeling.

Unsupervised modeling activities attempt to find intrinsic patterns (or clusters) in the data without knowledge of the known endpoints. The reduced spectral data may be augmented with additional information collected about the samples. The intrinsic patterns may correlate with known endpoints or suggest relationships between samples not previously known. In addition, individual samples that seem to have no relationship to the majority can be identified as outliers or problematic for investigation. Many techniques are available; a selection of which to use is made at the branch model type module 15-3. Often, the most useful technique is determined by trying many techniques, such as those indicated by the routines or algorithms shown at 15-14, 15-15, 15-16, 15-17 and 15-18 in Figure 7.

For unsupervised modeling, additional data transforms and scaling may be applied to the data prior to modeling. Normalization may be applied to the data to emphasize different factors. Normalization to unit area is chosen to emphasize differences in relative distributions. Normalization to a particular spectral feature (usually associated with a known entity) is chosen to emphasize relative expression of the entity of choice. Subsequent to normalization, most methods minimally subtract the mean of the data as an initial starting point so that differences between samples are modeled rather than the average.

While normalization and mean centering are typical modeling transforms for spectral data, other types of data transforms and scalings are in common practice. A common scaling is to scale each variable to unit variance to give each variable equal weighting in modeling. If multiple data types have been merged together, a variance-weighted scaling may be applied to blocks of data to give each block equal weight in modeling. If non-linear relationships between the populations represented in the data are suspected, a number of non-linear transforms can be applied, such as logarithmic or exponential transforms. The variables themselves can be non-linearized by augmenting the data with polynomial expansions of the initial variables (inverse, square root, square, cross-terms, etc.). These transforms are usually built directly into the modeling algorithms.
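
These transforms are standard chemometric practice; a minimal NumPy sketch of three of them (the function and mode names are illustrative only):

    import numpy as np

    def pretreat(X, mode="mean_center"):
        # X: one sample per row.
        if mode == "unit_area":      # emphasize relative distributions
            return X / X.sum(axis=1, keepdims=True)
        if mode == "mean_center":    # model differences, not the average
            return X - X.mean(axis=0)
        if mode == "autoscale":      # unit variance: equal variable weight
            return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
        raise ValueError(mode)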

The predominant method used for unsupervised modeling is Principal Component Analysis (PCA), module 15-14, shown in Figure 16. (For a better understanding of PCA see for example: Wold, S. et al., Principal Component Analysis, Chemo. and Intell. Lab. Sys., Vol. 2, pp. 37-52 (1987) and Dunteman, G. H., Principal Components Analysis, Sage Publications, Inc., California (1989), the contents of which are incorporated by reference herein). PCA is a bi-linear decomposition of a data matrix into a new orthogonal coordinate system that consists of linear combinations of the original variables (called loadings) and the new variables on the new coordinate system (called scores). The decomposition is accomplished such that the first component describes the maximum variation in the data matrix, and each subsequent component describes the maximum remaining information after removal of the previous factors from the data. This procedure usually reduces the data to its approximate rank in low dimensional space (i.e., 2 or 3 dimensions). The scores are coordinates on independent factors and are therefore themselves independent. Plots of scores in low dimensional space may indicate that the data forms sub-groupings.

Figure 16, upper portion, indicates a typical outcome for a PCA as implemented in the system. By inspection, clusters can be found in the reduced data. For this example, the clusters are delineated along principal component one (PC1). Inspection of the factor associated with PC1 shows that the spectral component around -3 x-axis units is the main spectral feature driving the clustering. If this spectral feature is a known chemical or biological entity, then an inference can be made that there exists differential expression of this entity. If the identity of this feature is not known, the pattern represented by the spectral data can still be used to classify the data. For data sets of higher dimensionality (e.g. hyphenated methods), tri-linear or N-linear data decompositions may be applied in a similar manner as PCA. (See for example: Bro, R., PARAFAC: Tutorial and Applications, Chemo. and Intell. Lab. Sys., Vol. 38, pp. 149-171 (1997), Harrington, P. et al., Two-Dimensional Correlation Analysis, Chemo. and Intell. Lab. Sys., Vol. 50, pp. 149-174 (2000), Wold, S. et al., Multi-Way Principal Components- and PLS-Analysis, J. of Chemometrics, Vol. 1, pp. 41-56 (1987), Manne, R. et al., Subwindow Factor Analysis, Chemo. and Intell. Lab. Sys., Vol. 45, pp. 171-176 (1999), and Windig, W. and Antalek, B., Resolving Nuclear Magnetic Resonance Data of Complex Mixtures by Three-way Methods: Examples of Chemical Solutions and the Human Brain, Chemo. and Intell. Lab. Sys., Vol. 46, pp. 207-219 (1999), the contents of which are incorporated by reference herein).
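
PCA is widely available off the shelf; a hedged sketch using scikit-learn (the data here are random placeholders, not results produced by the system):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 640))       # 40 samples of reduced spectra

    pca = PCA(n_components=3)
    scores = pca.fit_transform(X)        # coordinates on the new axes ("scores")
    loadings = pca.components_           # combinations of variables ("loadings")

    # Plotting scores[:, 0] against scores[:, 1] may reveal sub-groupings;
    # the PC1 loading shows which spectral features drive the separation.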

Partial unsupervised methods (similar to PCA) can also be applied to hyphenated or otherwise multidimensional reduced spectral data. These methods are implemented in module 15-18. Often a set of time-ordered spectra is generated for samples. Methods such as Evolving Factor Analysis, Batch Analysis, or Curve Resolution Analysis capitalize on the relationship of the spectral features in time to find factors that may be used for classification. (See for example: Schostack, K. J. and Malinowski, E. R., Investigation of Window Factor Analysis and Matrix Regression Analysis in Chromatography, Chemo. and Intell. Lab. Sys., Vol. 20, pp. 173-182 (1993), Vanslyke, S. J. and Wentzell, P. D., Limitations of Evolving Principal Component Innovation Analysis for Peak Purity Detection in Chromatography, Chemo. and Intell. Lab. Sys., Vol. 20, pp. 183-195 (1993), Liang, Y. and Kvalheim, O. M., Heuristic Evolving Latent Projections: Resolving Hyphenated Chromatographic Profiles by Component Stripping, Chemo. and Intell. Lab. Sys., Vol. 20, pp. 115-125 (1993), Lindberg, W. et al., Multivariate Resolution of Overlapping Peaks in Liquid Chromatography Using Diode Array Detection, Anal. Chem., Vol. 58, pp. 299-303 (1986), Shen, H. et al., Resolution of On-flow Liquid Chromatography Proton Nuclear Magnetic Resonance Using Canonical Correlation and Constrained Linear Regression, Chemo. and Intell. Lab. Sys., Vol. 62, pp. 61-78 (2002), Duchesne, C. and MacGregor, J. F., Multivariate Analysis and Optimization of Process Variable Trajectories for Batch Processes, Chemo. and Intell. Lab. Sys., Vol. 51, pp. 125-137 (2000), and Antti, H. et al., Batch Statistical Processing of 1H NMR-derived Urinary Spectral Data, J. of Chemometrics, Vol. 16, pp. 461-468 (2002), the contents of which are incorporated by reference herein). These methods are unsupervised in that they do not use information about sample classifications in the modeling, but they are supervised in that they attempt to model the data with time as a dependent variable. Samples can be characterized (mapped or clustered), independent of time, by their coordinates on the time-based factors.

Several methods attempt to reduce data into groups based on defined measures of similarity and dissimilarity. Various parameterized metrics allow the calculation of distances between objects ("samples") in multidimensional space ("reduced spectral data").

Distance metrics (subsequently used for clustering) can be tuned to emphasize objects that are most similar or most dissimilar. (A very common metric is known as the Mahalanobis Distance. See for example: Maesschalck, R. D. et al., Tutorial: The Mahalanobis Distance, Chemo. and Intell. Lab. Sys., Vol. 50, pp. 1-18 (2000), the content of which is incorporated by reference herein). Two older methods include Hierarchical Cluster Analysis (HCA), implemented in module 15-15, and Non-linear Mapping, implemented in module 15-17.

(See for example: Massart, D. L. et al., Data Handling in Science and Technology, Vol. 2: Chemometrics, a Textbook, Elsevier, NY, pp. 319-338 (1988), and Sammon, J. W., A Nonlinear Mapping for Data Structure Analysis, IEEE Trans. on Computers, Vol. C-18, pp. 401-409 (1969), the contents of which are incorporated by reference herein). In HCA, the data is progressively surveyed for the most similar objects, which are subsequently linked (clustered) and reduced to a new object representing the cluster, and the process repeated. The result is the familiar dendrogram, or tree representation, of the objects, their cluster relationships, and cluster distances. In Non-linear Mapping (module 15-17), a matrix of pair-wise distances is generated between all the objects in the original dimensional space. From a starting set of coordinates for each object in lower dimensional space (usually 2-3), a distance matrix is also calculated. The set of coordinates in low dimensional space is optimized to generate a distance matrix that best represents the multidimensional distances.

The assumption is that the major relationships between the samples can now be represented and visualized in the lower dimensional space. For cases where relationships between sample descriptors are particularly complex, Kohonen mapping (Neural Network based Self-Organizing Feature Mapping (SOFM)) is applied via module 15-16. (See for example: Zupan, J. et al., Tutorial: Kohonen and Counterpropagation Artificial Neural Networks in Analytical Chemistry, Chemo. and Intell. Lab. Sys., Vol. 38, pp. 1-23 (1997), the contents of which are incorporated by reference herein). SOFM has advantages over HCA in that objects are classified not only in relationship to their similarity to other objects but also by how their patterns relate to other objects in proximity (neighborhood). Fuzzy c-means clustering has also been applied as an unsupervised classification tool. (See for example: Adams, M. J., Chemometrics in Analytical Spectroscopy, The Royal Society of Chemistry, Cambridge, pp. 109-114 (1995), and Linusson, A. et al., Fuzzy Clustering of 627 Alcohols, Guided by a Strategy for Cluster Analysis of Chemical Compounds for Combinatorial Chemistry, Chemo. and Intell. Lab. Sys., Vol. 44, pp. 213-227 (1998), the contents of which are incorporated by reference herein). In this method, a predetermined number of clusters to be found in the data and a parameter that determines the fuzziness of class boundaries are chosen prior to the analysis. The method then finds the chosen number of centroids and calculates class membership for each sample relative to the cluster centers. Patterns of class membership can be used to classify sample similarity or dissimilarity.

In terms of diagnostics, PCA has many advantages over most methods. Since PCA models the variational structure of the data, samples can be characterized relative to their membership in this structure. Spectra can be characterized as outliers within and outside of the modeled variation. Outliers within the model occur when the factors adequately describe the spectral features of a sample but the scores along any particular factor fall outside the distribution of the majority of the samples. Outliers outside the model occur when there is a significant residual for sample spectra relative to the normal magnitude of spectral residuals. This indicates that additional factors would be necessary to describe such a sample. (Figure 13 illustrates the power of residual analysis in another context.) In addition, algorithms exist that allow PCA to proceed in the presence of missing data. (See for example: Arteaga, F. and Ferrer, A., Dealing with Missing Data in MSPC: Several Methods, Different Interpretations, Some Examples, J. of Chemometrics, Vol. 16, pp. 408-418 (2002), Grung, B. and Manne, R., Missing Values in Principal Component Analysis, Chemo. and Intell. Lab. Sys., Vol. 42, pp. 125-139 (1998), Nelson, P. et al., Missing Data Methods in PCA and PLS: Score Calculations With Incomplete Observations, Chemo. and Intell. Lab. Sys., Vol. 35, pp. 45-65 (1996), and Walczak, B. and Massart, D. L., Tutorial: Dealing with Missing Data: Part I & II, Chemo. and Intell. Lab. Sys., Vol. 58, pp. 15-42 (2001), the contents of which are incorporated by reference herein). The distance metric (self-mapping) methods described here do not have these advantages, but are nevertheless useful as data survey tools.
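
The two kinds of PCA outlier can be computed directly from scores and reconstruction residuals; a sketch assuming scikit-learn (decision thresholds are omitted and the function name is illustrative):

    import numpy as np
    from sklearn.decomposition import PCA

    def pca_outlier_diagnostics(X, n_components=3):
        # Score distance flags outliers *within* the modeled variation;
        # the residual sum of squares flags outliers *outside* the model.
        pca = PCA(n_components=n_components).fit(X)
        scores = pca.transform(X)
        score_dist = ((scores / scores.std(axis=0, ddof=1)) ** 2).sum(axis=1)
        residual = ((X - pca.inverse_transform(scores)) ** 2).sum(axis=1)
        return score_dist, residual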

The modularity of the system allows the integration of any other method useful for screening or analysis, indicated by module 15-19. In addition, the output of many unsupervised methods can be utilized as the input for other unsupervised methods or for supervised methods. (See for example: Wold, S. et al., Hierarchical Multiblock PLS and PC Models for Easier Model Interpretation and as an Alternative to Variable Selection, J. of Chemometrics, Vol. 10, pp. 463-482 (1996), Westerhuis, J. A. et al., Analysis of Multiblock and Hierarchical PCA and PLS Models, J. of Chemometrics, Vol. 12, pp. 301-321 (1998), and Janne, K. et al., Hierarchical Principal Component Analysis (PCA) and Projection to Latent Structure (PLS) Technique on Spectroscopic Data as a Data Pretreatment for Calibration, J. of Chemometrics, Vol. 15, pp. 203-213 (2001), the contents of which are incorporated by reference herein). Principal component scores often serve this purpose. In this case the beneficial diagnostics and data reduction provided by PCA are leveraged against the benefits of other pattern recognition approaches.

The results from the unsupervised analysis performed by modules 15-14 to 15-19 are stored in the model/screening results system database 16. The database 16 can contain scores, loadings, residuals from PCA analysis, class means, models, membership from fuzzy c-means clustering, etc.

Supervised modeling activities are implemented in routines 15-5 to 15-12. Supervised modeling uses known classes, properties, or outcomes to derive models that correlate the reduced spectral data with the known endpoints. Supervised models can be used for classification or property/outcome prediction. If not already associated with the sample data, these endpoints can be extracted from a third party database 21 and combined with the reduced data 15-4 for modeling. Additional measurements can be added to the reduced spectra to form an independent block of data. This set of data is referred to as the "training set". As with unsupervised modeling, many methods are available, and the user is prompted to make a selection at step 15-5. The ultimate best method may not be the method that is most predictive with the training data. The best model for deployment in a screening environment is often a compromise between predictive power and robustness. Robust methods are not sensitive to noise or random events and may provide diagnostics about the fitness of a test dataset for model prediction. For example, a good diagnostic would indicate that a trial spectrum for prediction is an extrapolation of the population of the data used for modeling. Since models are better for interpolation than extrapolation, this diagnostic would warn of potentially inaccurate predictions. Other diagnostics may suggest that the structure or variation represented by a trial spectrum is outside the variation patterns used in model building. This diagnostic would suggest the presence of additional factors for follow-up in the trial samples, or perhaps that a new population (class) of samples has been discovered.

As with unsupervised methods, the independent variables may undergo transformation, scaling, or non-linearization prior to supervised modeling. In addition, the output from some unsupervised methods can be used as inputs for supervised methods (e.g. Principal Component Analysis (scores), c-means clustering (class membership)). If the outputs from unsupervised methods are used for supervised modeling, these variables can also be subject to scaling, transformation, and non-linearization.

For supervised modeling, additional preprocessing methods may be employed to select or screen features in the reduced data used for modeling. A relatively new class of methods, known as "orthogonal projection methods", is designed to find patterns in the independent data that have no correlation with the predictive outcomes. (See for example: Wold, S. et al., Orthogonal Signal Correction of Near-Infrared Spectra, Chemo. and Intell. Lab. Sys., Vol. 44, pp. 175-185 (1998), Andersson, C. A., Direct Orthogonalization, Chemo. and Intell. Lab. Sys., Vol. 47, pp. 51-63 (1999), and Trygg, J. and Wold, S., Orthogonal Projections to Latent Structures (O-PLS), J. of Chemometrics, Vol. 16, pp. 119-128 (2002), the contents of which are incorporated by reference herein). These patterns (or factors) can then be projected out of the independent data prior to modeling. These methods are particularly advantageous when the signal that correlates with the dependent outcome is small relative to the spectral background. Methods that are based on variable selection, as employed by genetic algorithms and programming methods, are less attractive for spectral data since the actual variables may be indeterminate. These methods often do not have appropriate constraints and will converge on accidental or spurious elements in the data that happen to correlate with the outcomes but nevertheless fail to generate robust predictions in screening applications.

One of the simplest supervised classification methods is k-Nearest Neighbors (KNN, implemented in module 15-10). KNN analysis seeks to classify spectra by distance proximity to samples of known classifications. (See for example: Kowalski, B. R. and Bender, C. F., Pattern Recognition: A Powerful Approach to Interpreting Chemical Data, J. Am. Chem. Soc., Vol. 94, pp. 5632-5639 (1972), Kowalski, B. R. and Bender, C. F., The K-Nearest Neighbor Classification Rule (Pattern Recognition) Applied to Nuclear Magnetic Resonance Spectral Interpretation, Anal. Chem., Vol. 44, pp. 1405-1411 (1972), Alsberg, B. K. et al., Classification of Pyrolysis Mass Spectra by Fuzzy Multivariate Rule Induction - Comparison with Regression, K-nearest Neighbor, Neural, and Decision-tree Methods, Anal. Chim. Acta, Vol. 348, pp. 389-407 (1997), and Devroye, L. et al., A Probabilistic Theory of Pattern Recognition, Springer-Verlag, NY, pp. 61-90 (1996), the contents of which are incorporated by reference herein). The stored prediction model 18 is simply a database of representative spectra and their known classifications.
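
A hedged scikit-learn sketch of such a KNN classifier (random placeholder data; k and the class labels are illustrative choices):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(1)
    X_train = rng.normal(size=(60, 640))   # representative reduced spectra
    y_train = rng.integers(0, 2, size=60)  # their known classifications

    # The stored model is essentially the reference spectra plus labels;
    # new spectra are classified by their k nearest neighbors.
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    predicted = knn.predict(rng.normal(size=(5, 640)))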

A highly successful classification model is known as soft independent modeling by class analogy (SIMCA), implemented by module 15-18. (See for example: Massart, et al., 1988, pp. 385-414 and Wold, S., Pattern Recognition by Means of Disjoint Principal Component Models, Pattern Recognition, Vol. 8, pp. 127-139 (1976). For applications to Metabonomics see for example: Holmes, E. et al., Development of a Model for Classification of Toxin-Induced Lesions Using 1H NMR Spectroscopy of Urine Combined with Pattern Recognition, NMR in Biomedicine, Vol. 11, pp. 235-244 (1998) and Holmes, E. et al., Chemometric Models for Toxicity Classification Based on NMR Spectra of Biofluids, Chem. Res. Toxicol., Vol. 13, pp. 471-478 (2000). The contents of these references are incorporated by reference herein). SIMCA is very similar to PCA, but builds separate PCA models for each known class. In this way, only the variation that is intrinsic to the given class is modeled. Samples to be screened are classified by their membership in each proposed class. Membership can be assessed in a similar manner as with PCA diagnostics. Diagnostics can be generated for each class model. Classification may not be exclusive to a single model. In some cases it is sufficient to generate a single model for normal or control samples; classification is then simply based on single-class diagnostics as normal or abnormal. In some cases, residuals from abnormal spectra can be hierarchically classified. Stored models include the factors for each classification model.
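
The disjoint-PCA idea behind SIMCA can be sketched in a few lines (an assumption-laden outline using scikit-learn, not the system's implementation; class acceptance thresholds are omitted):

    import numpy as np
    from sklearn.decomposition import PCA

    class SimcaSketch:
        # One disjoint PCA model per known class; membership is judged
        # by each sample's residual against each class model.
        def fit(self, X, labels, n_components=2):
            self.models = {c: PCA(n_components).fit(X[labels == c])
                           for c in np.unique(labels)}
            return self

        def class_residuals(self, X):
            # Smaller residual = better membership; classification need
            # not be exclusive, and a large residual against every class
            # model may indicate a new class.
            return {c: ((X - m.inverse_transform(m.transform(X))) ** 2).sum(axis=1)
                    for c, m in self.models.items()}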

Module 15-11 implements families of methods known as "Kernel Methods", such as the Support Vector Machine, in which the model building for classification has been designed to be feasible for extremely large data sets. (See for example: Belousov, A. I. et al., Applicational Aspects of Support Vector Machines, J. of Chemometrics, Vol. 16, pp. 482-489 (2002), and Belousov, A. I. et al., A Flexible Classification Approach with Optimal Generalisation Performance: Support Vector Machines, Chemo. and Intell. Lab. Sys., Vol. 64, pp. 15-25 (2002), the contents of which are incorporated by reference herein). The models are built iteratively by making multiple passes through the data so that the data need not be stored in total in the computer's active memory. These methods have found use for very large datasets such as are generated for high-throughput screening (HTS) applications in drug discovery. While it is possible to incorporate these methods into the system, the benefits that are realized for applications such as HTS are often outweighed by the fact that the models act as "black box" predictors without the power of diagnostics from methods such as PCA and SIMCA. If needed, Kernel algorithms to generate PCA models can be implemented. (For example see: Wu, W. et al., The Kernel PCA Algorithms for Wide Data, Part I: Theory and Algorithms, Chemo. and Intell. Lab. Sys., Vol. 36, pp. 165-172 (1997) and Rosipal, R. and Trejo, L., Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space, J. of Machine Learning Research, Vol. 2, pp. 97-123 (2001), the contents of which are incorporated by reference herein). The size of the datasets used for building models, however, is rarely a limitation.

For property/outcome prediction, there exists a continuum of least squares methods (e.g. Partial Least Squares Regression, Classical Least Squares, Inverse Least Squares, Stepwise Regression, Principal Component Regression, Ridge Regression, etc.), implemented in modules 15-06 and 15-07. (See for example: Wold, S. et al., PLS-regression: a Basic Tool of Chemometrics, Chemo. and Intell. Lab. Sys., Vol. 58, pp. 109-130 (2001), Geladi, P. and Kowalski, B., Partial Least-Squares Regression: A Tutorial, Anal. Chim. Acta, Vol. 185, pp. 1-17 (1986), Beebe, K. R. and Kowalski, B. R., An Introduction to Multivariate Calibration and Analysis, Anal. Chem., Vol. 58, pp. 1007-1017 (1987), Deming, S. N. and Morgan, S. L., The Use of Linear Models and Matrix Least Squares in Clinical Chemistry, Clin. Chem., Vol. 25, pp. 840-854 (1979), Nelson, M. P. et al., Multivariate Optical Computation for Predictive Spectroscopy, Anal. Chem., Vol. 70, pp. 73-82 (1998), and Xie, Y. and Kalivas, J. H., Evaluation of Principal Component Selection Methods to Form a Global Prediction Model by Principal Component Regression, Anal. Chim. Acta, Vol. 348, pp. 19-27 (1997), the contents of which are incorporated by reference herein). These methods are based on the determination of a set of regression coefficients which, when applied as an inner product with the reduced modeling parameters, generate the desired prediction. These methods differ in how the coefficients are chosen.

PCA scores can be used to generate a least squares regression model known as Principal Component Regression (PCR). All the diagnostics of PCA are thereby coupled with property prediction. These methods are generally subjected to techniques (cross validation, jackknifing, or bootstrapping) that minimize over-fitting and generate diagnostics that indicate the predictability of the model. (See for example: Martens, H. et al., Analysis of Designed Experiments by Stabilised PLS Regression and Jack-knifing, Chemo. and Intell. Lab. Sys., Vol. 58, pp. 151-170 (2001), Wehrens, R. and Linden, V., Bootstrapping Principal Component Regression Models, J. of Chemometrics, Vol. 11, pp. 157-171 (1997), Wehrens, R. et al., The Bootstrap: a Tutorial, Chemo. and Intell. Lab. Sys., Vol. 54, pp. 35-52 (2000), and Davison, A. C. and Hinkley, D. V., Bootstrap Methods and Their Applications, Cambridge University Press, Cambridge, pp. 191-254, 256-325 (1997), the contents of which are incorporated by reference herein). In addition, outliers that may be disproportionately biasing the model can be eliminated. Many variations exist and, when coupled with weighting functions and non-linear modifications, constitute a large family. Of particular note, however, is Partial Least Squares Regression (PLS). This method generates factors that, like PCA, model the variation in the independent block of data. The additional constraint is that the factors are chosen such that they have maximum correlation with the dependent variable of interest. Additional factors are added to the model based on the criteria of increasing the variance modeled and improving the ability to predict the variable of interest without over-fitting. Even though these models generally generate outputs that are continuous, they have also been applied as classification models (e.g. PLS-Discriminant Analysis, PLS-DA) where the dependent variable is set to binary values corresponding to class membership. (See for example: Kemsley, E. K., Discriminant Analysis of High-dimensional Data: A Comparison of Principal Components Analysis and Partial Least Squares Data Reduction Methods, Chemo. and Intell. Lab. Sys., Vol. 33, pp. 47-61 (1996). For an example applied to Metabonomics see: Gavaghan, C. L. et al., An NMR-based Metabonomic Approach to Investigate the Biochemical Consequences of Genetic Strain Differences: Application to the C57BL10J and Alpk:ApfCD Mouse, FEBS Letters, Vol. 484, pp. 169-174 (2000). The contents of these references are incorporated by reference herein). If used for classification, sufficiently large training data sets must be utilized for modeling. If not, features that correlate with the binary output by chance will be selected by the modeling algorithms.

The stored prediction models include the factors generated by the models for diagnostics and the regression coefficients needed for prediction.
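
A hedged PLS-DA sketch using scikit-learn (random placeholder data; the 0.5 decision threshold is an illustrative convention, not a prescribed rule):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(2)
    X = rng.normal(size=(80, 640))    # training set of reduced spectra
    y = rng.integers(0, 2, size=80)   # binary class membership

    pls = PLSRegression(n_components=3).fit(X, y.astype(float))
    # The continuous PLS output is thresholded for classification; with
    # too few training samples the factors may fit chance correlations.
    predicted_class = (pls.predict(X).ravel() > 0.5).astype(int)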

While not providing the full range of diagnostics that linear methods provide (e.g. PCR, PLS), Neural Network (NN) approaches for both prediction (e.g. linear filters, backpropagation) and classification (e.g. perceptron, probabilistic NN (PNN)) can be useful when data relationships are known to be non-linear. A neural network classification and prediction module 15-9 is thus incorporated into the module 15-20. (See for example: Liu, Y. et al., Chemometric Data Analysis Using Artificial Neural Networks, Applied Spectroscopy, Vol. 47, pp. 12-23 (1993), and Bishop, C. M., Neural Networks for Pattern Recognition, Oxford University Press, NY, in particular pp. 385-439 (1995). For example applications to Metabonomics see: Holmes, E. et al., Metabonomic Characterization of Genetic Variations in Toxicological and Metabolic Responses Using Probabilistic Neural Networks, Chem. Res. Toxicol., Vol. 14, pp. 182-191 (2001). The contents of these references are incorporated by reference herein). NN architecture can also be constructed to correspond to Bayesian probabilities. Care must be taken in the selection of the appropriate network for the problem at hand and in the selection of the network architecture. Often, methods such as PCA are applied to the data to reduce the rank of the problem prior to generating neural network models. Networks can have a form similar to linear regression but exist in multiple layers of inputs and outputs and are therefore better at modeling non-linear relationships of unknown form. Network architecture, over-fitting, and prediction/outlier diagnostics are all problematic issues in the use of Neural Networks, so they are only indicated for critical-need cases where linear methods (or their non-linear variants) do not work. Stored prediction models include the network architecture, weights, and biases needed for prediction.

Additional hybrid, variable-selective, and other methods are available for modeling, and the modularity allows the implementation of any needed model via module 15-12. One method that has demonstrated positive results is called Classification and Regression Trees (CART) and is used for both classification and prediction, as the name implies. (See for example: Alsberg, B. K. et al., 1997, Devroye, L. et al., 1996, pp. 315-362, and Breiman, L. et al., Classification and Regression Trees, Chapman and Hall, New York (1984), the contents of which are incorporated by reference herein). The method selects splits in the data (nodes) based on variable criteria and grows trees based on node splits. Nodes and split criteria are chosen to generate the desired classification or prediction. Other variable-selective methods, such as genetic algorithms, attempt to select key variables to be used by a desired modeling form that best predict the desired outcome. These methods offer appeal because the apparent output is a minimal list of variables that correlate with the desired outcome. In practice, with spectral data, the actual variables are indeterminate, and application of these methods generates models that fail to validate against test samples. Without very large sets of data to minimize the high probability of random chance correlation with the classification output, these methods do not employ the proper constraints for the selection of real and not accidental correlation. The most robust models are not those that generate perfect output, but those that model the actual correlation in the data with the desired output. If there is no correlation between the output and the data, then the modeling procedure should be indicative of this fact.

Special case models can also be generated. Non-linear parametric models of known form can be generated with non-linear least squares procedures. (See for example: Frank, I. E., Tutorial: Modern Nonlinear Regression Methods, Chemo. and Intell. Lab. Sys., Vol. 27, pp. 1-9 (1995), the content of which is incorporated by reference herein). Time series and autoregressive models can be generated to identify system states. For example, biological systems can have states identifiable by analysis of the time evolution of chemical or biological species.

Predictive models stored in database 18 or applied in module 15-2 can be used for classification (e.g. toxic mechanism of action, disease type), the prediction of properties (e.g. disease progression), or the prediction of outcome (e.g. mortality). In practice, modeling methods that generate both predictions and diagnostics (e.g. Principal Components, SIMCA, Partial Least Squares) on the test spectra are more suitable than black-box approaches (e.g. Support Vector Machine, Classification and Regression Trees). The diagnostics and predictions are stored in the system database 16. The integration and modularity of the modeling process, indicated at 15-20, allows modeling to take place within the system 100 architecture or with third party software 22, as indicated in Fig. 1 and described previously.

As noted earlier, the software described herein preferably includes convenient user interface tools and menus that allow the user to navigate between modules in the system, such as to change from unsupervised to supervised model building in Figure 7, to switch from model building to screening in Figure 7, or to view the results in the tracking database or the data reduced to modeling form. These and other details of the user interface aspects of the software are considered to be within the level of ability of those skilled in the art of computer programming in the context of drug discovery and spectroscopic data analysis.

Typical Use Cases

Typical uses of the present inventive system are as follows:

1) Applications utilizing complex analytical data, and particularly spectral and hyphenated separation-spectral data, to analyze, detect, and/or screen for differentiation of samples or systems (chemical, physical, and/or biological) based on classification, outcome or property prediction, differentiated populations, or perturbation from normal.

2) Applications utilizing the union of multiple techniques and data sources to analyze, detect, and screen for differentiation of samples or systems based on classification, outcome or property prediction, differentiated populations, or perturbation from normal.

3) Analysis can be performed on systematic data obtained from designed experiments (controlled classes, treatments, or perturbations) or random or selected samplings of available populations. Outcomes or classifications may be known or unknown at the time of data analysis. Analysis may be based on constituent analysis (known and unknown) or on patterns of constituent levels or any combination thereof.

Samples can include raw, separated, fractionated, blended, and processed and derived forms of biological fluids, tissues, extracted, excreted, or expired matter, or chemical mixtures, solutions, and substances regardless of state.

Other examples of the use of the inventive system include:

1) NMR based (e.g. metabonomics, metabolomics, metabolite profiling) detection and/or identification of differentially expressed (or patterns of expressed) endogenous or drug metabolites or chemical types for phenotyping, disease modeling, biomarker discovery, toxic response, or target validation.

2) Mass Spectrometry based (e.g. Proteomics, SELDI™, MALDI, LC-MSn, FT-MS) detection of differentially expressed (or patterns of expressed) proteins or chemical types for phenotyping, disease modeling, biomarker discovery, toxic response, or target validation.

3) NMR based screening, detection, and deconvolution of ligand binding.

4) NMR and/or MS based verification of combinatorial chemistry synthesis.

Presently preferred and alternative embodiments of the invention have been described with particularity above. Alternatives to the specific processing steps and process flow may occur without departure from the scope of the invention. The true scope of the invention will be determined by reference to the appended claims.