METHODS FOR SPECTRAL ANALYSIS AND THEIR APPLICATIONS: RELIABILITY ASSESSMENT

Title:

METHODS FOR SPECTRAL ANALYSIS AND THEIR APPLICATIONS: RELIABILITY ASSESSMENT

Document Type and Number:

WIPO Patent Application WO/2002/099452

Kind Code:

A1

Abstract:

This invention pertains to chemometric methods for the analysis of chemical, biochemical, and biological data, for example, spectral data, for example, nuclear magnetic resonance (NMR) spectra and other types of spectra, and their applications. More particularly, the present invention pertains to a method for classifying a sample spectrum, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for a reference state; and, (b) classifying said sample spectrum as normal or abnormal, with respect to said reference state, on the basis of said at least one statistical property. The present invention also pertains to corresponding methods of classification (of spectra, samples, subjects, etc.), methods of identifying biomarkers and/or biomarker combinations; methods of analysis of an applied stimulus or condition; methods of diagnosis; etc.

Inventors:

NICHOLSON JEREMY KIRK (GB)
LINDON JOHN CHRISTOPHER (GB)
EBBELS TIMOTHY MARK DAVID (GB)
HOLMES ELAINE (GB)

Application Number:

PCT/GB2002/002758

Publication Date:

December 12, 2002

Filing Date:

May 31, 2002

Assignee:

METABOMETRIX LTD (GB)
NICHOLSON JEREMY KIRK (GB)
LINDON JOHN CHRISTOPHER (GB)
EBBELS TIMOTHY MARK DAVID (GB)
HOLMES ELAINE (GB)

International Classes:

G01R33/465; G01R33/46; (IPC1-7): G01R33/465

Domestic Patent References:

WO2001028412A1	2001-04-26
WO1993010468A1	1993-05-27

Foreign References:

GB2317703A	1998-04-01

Other References:

J.K. NICHOLSON, J.C. LINDON, E. HOLMES: "Metabonomics: ...", XENOBIOTICA, vol. 29, no. 11, 1999, pages 1181 - 1189, XP001021360
J.C. LINDON ET AL.: "NMR Spectroscopy of Biofluids", ANNUAL REPORTS ON NMR SPECTROSCOPY, vol. 38, 1999, pages 1 - 88, XP001055966

Attorney, Agent or Firm:

Brasnett, Adrian H. (Greater London WC2B 6HP, GB)

Claims:

CLAIMS

1.

A method for classifying a sample spectrum, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for a reference state; and, (b) classifying said sample spectrum as normal or abnormal, with respect to said reference state, on the basis of said at least one statistical property.

2.

A method for classifying a test sample, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for samples representing a reference state; and, (b) classifying said test sample as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for said test sample and said at least one statistical property.

3.

A method for classifying a test subject, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for samples from subjects representing a reference state; and, (b) classifying said test subject as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for a test sample from said test subject and said at least one statistical property.

4.	A method according to any one of claims 1 to 3, wherein said at least one statistical property is a plurality of statistical properties.

5.

A method for classifying a sample spectrum, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for a reference state; and, (b) classifying said sample spectrum as normal or abnormal, with respect to said reference state, on the basis of said statistical property.

6.

A method for classifying a test sample, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for samples representing a reference state; and, (b) classifying said test sample as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for said test sample and said statistical property.

7.

A method for classifying a test subject, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for samples from subjects representing a reference state; and, (b) classifying said test subject as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for a test sample from said test subject and said statistical property.

8.

* * *.

9.

A method for classifying a sample spectrum, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for a reference state; and, (b) classifying said sample spectrum as normal or abnormal, with respect to said reference state, on the basis of said at least one statistical property, with an associated confidence level.

10.

A method for classifying a test sample, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for samples representing a reference state; and, (b) classifying said test sample as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for said test sample and said at least one statistical property, with an associated confidence level.

11.

A method for classifying a test subject, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for samples from subjects representing a reference state; and, (b) classifying said test subject as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for a test sample from said test subject and said at least one statistical property, with an associated confidence level.

12.	A method according to any one of claims 8 to 10, wherein said at least one statistical property is a plurality of statistical properties.

13.

A method for classifying a sample spectrum, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for a reference state; and, (b) classifying said sample spectrum as normal or abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level.

14.

A method for classifying a test sample, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for samples representing a reference state; and, (b) classifying said test sample as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for said test sample and said statistical property, with an associated confidence level.

15.

A method for classifying a test subject, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for samples from subjects representing a reference state; and, (b) classifying said test subject as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for a test sample from said test subject and said statistical property, with an associated confidence level.

16.

A method for classifying a sample spectrum for a test sample, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for a reference state; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property; and, (b) classifying said sample spectrum as normal or abnormal, with respect to said reference state, on the basis of said at least one statistical property, with an associated confidence level, specifically on the basis of the extent to which said sample spectrum falls within said confidence interval spectrum.

17.

A method for classifying a test sample, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for samples representing a reference state; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property; and, (b) classifying said test sample as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for said test sample and said at least one statistical property, with an associated confidence level, specifically on the basis of the extent to which said sample spectrum falls within said confidence interval spectrum.

18.

A method for classifying a test subject, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for samples from subjects representing a reference state; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property; and, (b) classifying said test subject as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for a test sample from said test subject and said at least one statistical property, with an associated confidence level, specifically on the basis of the extent to which said sample spectrum falls within said confidence interval spectrum.

19.	A method according to any one of claims 15 to 17, wherein said at least one statistical property is a plurality of statistical properties.

20.	A method according to any one of claims 15 to 18, wherein said at least one confidence interval spectrum is a plurality of confidence interval spectra.

21.

A method for classifying a sample spectrum for a test sample, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for a reference state; including calculating a confidence interval spectrum associated with a predetermined confidence level for said statistical property; and, (b) classifying said sample spectrum as normal or abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis of the extent to which said sample spectrum falls within said confidence interval spectrum.

22.

A method for classifying a test sample, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for samples representing a reference state; including calculating a confidence interval spectrum associated with a predetermined confidence level for said statistical property; and, (b) classifying said test sample as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for said test sample and said statistical property, with an associated confidence level, specifically on the basis of the extent to which said sample spectrum falls within said confidence interval spectrum.

23.

A method for classifying a test subject, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for samples from subjects representing a reference state; including calculating a confidence interval spectrum associated with a predetermined confidence level for said statistical property; and, (b) classifying said test subject as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for a test sample from said test subject and said statistical property, with an associated confidence level, specifically on the basis of the extent to which said sample spectrum falls within said confidence interval spectrum.

24.

* * *.

25.

A method of classifying a set of sample spectra, said method comprising the steps of: (a) calculating at least one statistical property for each of a plurality of subsets of a set of equivalent spectra for a reference state, to yield a set of statistical properties; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property; (b) classifying said set of sample spectra as normal or abnormal, with respect to said reference state, on the basis of said statistical properties, with an associated confidence level, specifically on the basis of the extent to which one or more corresponding statistical properties of said set of sample spectra falls within a corresponding confidence interval spectrum.

26.	A method according to claim 23, wherein said at least one statistical property is a plurality of statistical properties.

27.	A method according to claim 23 or 24, wherein said at least one confidence interval spectrum is a plurality of confidence interval spectra.

28.

A method of classifying a set of sample spectra, said method comprising the steps of: (a) calculating a statistical property for each of a plurality of subsets of a set of equivalent spectra for a reference state, to yield a set of statistical properties; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for said statistical property; (b) classifying said set of sample spectra as normal or abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis of the extent to which a corresponding statistical property (e.g., mean) of said set of sample spectra falls within a corresponding confidence interval spectrum.

29.	A method of classifying a sample or a set of samples, comprising a method of classifying a set of sample spectra according to any one of claims 23 to 26, wherein said set of sample spectra are for said sample or said set of samples.

30.

A method of classifying a subject or a set of subjects, comprising a method of classifying a set of sample spectra according to any one of claims 23 to 26, wherein said set of sample spectra are for a sample or a set of samples, wherein said sample or set of samples are from said subject or said set of subjects.

31.

* * *.

32.

A method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for a reference state; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property; and, (b) classifying one or more experimental parameters derived from a sample spectrum as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such abnormal experimental parameters fall outside a corresponding confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said abnormal experimental parameters, with an associated confidence level.

33.

A method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for a reference state; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property; and, (b) classifying one or more spectral regions of a sample spectrum as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such abnormal spectral regions have signal intensities which fall outside a corresponding confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said abnormal spectral regions, with an associated confidence level.

34.	A method according to claim 29 or 30, wherein said at least one statistical property is a plurality of statistical properties.

35.	A method according to any one of claims 29 to 31, wherein said at least one confidence interval spectrum is a plurality of confidence interval spectra.

36.

A method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for a reference state; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for said statistical property; (b) classifying one or more experimental parameters derived from a sample spectrum as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such abnormal experimental parameters fall outside said confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said abnormal experimental parameters, with an associated confidence level.

37.

A method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for a reference state; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for said statistical property; (b) classifying one or more spectral regions of a sample spectrum as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such abnormal spectral regions have signal intensities which fall outside said confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said abnormal spectral regions, with an associated confidence level.

38.

* * *.

39.

A method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating at least one statistical property for each of a plurality of subsets of a set of equivalent spectra for a reference state, to yield a set of statistical properties; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property; (b) classifying one or more experimental parameters derived from a set of sample spectra as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that abnormal such spectral regions have one or more statistical properties which falls outside a corresponding confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said abnormal spectral regions, with an associated confidence level.

40.

A method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating at least one statistical property for each of a plurality of subsets of a set of equivalent spectra for a reference state, to yield a set of statistical properties; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property; (b) classifying one or more spectral regions of a set of sample spectra as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such spectral regions have one or more statistical properties which falls outside a corresponding confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said spectral regions, with an associated confidence level.

41.	A method according to claim 35 or 36, wherein said at least one statistical property is a plurality of statistical properties.

42.	A method according to any one of claims 35 to 37, wherein said at least one confidence interval spectrum is a plurality of confidence interval spectra.

43.

A method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating a statistical property for each of a plurality of subsets of a set of equivalent spectra for a reference state, to yield a set of statistical properties; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for said statistical property; (b) classifying one or more experimental parameters derived from a set of sample spectra as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such abnormal experimental parameters have a statistical property which falls outside a corresponding confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said abnormal spectral regions, with an associated confidence level.

44.

A method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating a statistical property for each of a plurality of subsets of a set of equivalent spectra for a reference state, to yield a set of statistical properties; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for said statistical property; (b) classifying one or more spectral regions of a set of sample spectra as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such spectral regions have a statistical property which falls outside a corresponding confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said spectral regions, with an associated confidence level.

45.

* * *.

46.	A method according to any one of claims 1 to 40, wherein said spectra are, or comprise, NMR spectra or NMR spectral data.

47.

* * *.

48.	A method according to any one of claims 1 to 41, wherein said reference state is a predetermined state defined by a suitable population representative of that predetermined state.

49.	A method according to any one of claims 1 to 41, wherein said reference state is defined by one or more control organisms.

50.	A method according to any one of claims 1 to 41, wherein said reference state is defined by one or more control organisms in a predose state.

51.

* * *.

52.	A method according to any one of claims 1 to 44, wherein said statistical property or statistical properties is/are selected from: mean; standard deviation; relative standard deviation; skewness; and, kurtosis.

53.	A method according to any one of claims 1 to 44, wherein said statistical property or statistical properties is/are selected from: standard deviation; relative standard deviation; skewness; and, kurtosis.

54.	A method according to any one of claims 1 to 44, wherein said statistical property or statistical properties is/are selected from: standard deviation and relative standard deviation.

55.	A method according to any one of claims 1 to 44, wherein said statistical property is mean.

56.	A method according to any one of claims 1 to 44, wherein said statistical property is standard deviation.

57.	A method according to any one of claims 1 to 44, wherein said statistical property is relative standard deviation.

58.	A method according to any one of claims 1 to 44, wherein said statistical property is skewness.

59.	A method according to any one of claims 1 to 44, wherein said statistical property is kurtosis.

60.

* * *.

61.	A method according to any one of claims 1 to 52, wherein said confidence level is 95%.

62.

* * *.

63.	A method according to any one of claims 1 to 53, wherein said "extent to which falls within" and said "extent to which falls outside" is determined by whether or not a predetermined fraction of data points fall outside a corresponding confidence interval spectrum.

64.	A method according to claim 54, wherein said predetermined fraction is 2%.

65.

* * *.

66.	A biomarker or biomarker combination identified by a method according to any one of claims 29 to 55.

67.	A biomarker or biomarker combination identified by a method according to any one of claims 29 to 55, for use in a method of classification.

68.	A method of classification which employs or relies upon one or more biomarkers or biomarker combinations identified by a method according to any one of claims 29 to 55.

69.	An assay for use in a method of classification, which assay relies upon one or more biomarkers or biomarker combinations identified by a method according to any one of claims 29 to 55.

70.	Use of an assay in a method of classification, which assay relies upon one or more biomarkers or biomarker combinations identified by a method according to any one of claims 29 to 55.

71.	A method of diagnosis employing one or more biomarkers or biomarker combinations identified by a method according to any one of claims 29 to 55.

72.	Use of one or more biomarkers or biomarker combinations identified by a method of classification according to any one of claims 1-28 and 41-55.

73.

* * *.

74.

A method of analysis of an applied stimulus or condition, which method employs: a method of classifying a spectrum according to any one of claims 1, 4, 5, 8, 11, 12, 15, 18, 19, 20, 23, 24, 25, 26, and 41-55; a method of classifying a sample, according to any one of claims 2, 4, 6, 9, 11, 13, 16, 18, 19, 21, 27, and 41-55; a method of classifying a subject according to any one of claims 3, 4, 7, 10, 11, 14, 17, 18, 19, 22, 28, and 41-55; or, a method of identifying a candidate biomarker or biomarker combination, according to any one of claims 29-55; wherein said sample spectrum or spectra are for a sample from an organism which has been subjected to said applied stimulus; and, wherein said set of equivalent spectra for a reference state comprises one or more control spectra for each of one or more samples from each of one or more organisms which have not been subjected to said applied stimulus.

75.	A method of diagnosis of an applied stimulus or condition, comprising a method of analysis of an applied stimulus or condition, according to claim 63.

76.	A method of therapeutic monitoring of a subject undergoing therapy, comprising a method of analysis of an applied stimulus or condition, according to claim 63.

77.	A method of evaluating drug therapy and/or drug efficacy, comprising a method of analysis of an applied stimulus or condition, according to claim 63.

78.	A method of detecting toxic side-effects of drug, comprising a method of analysis of an applied stimulus or condition, according to claim 63.

79.	A method of characterising and/or identifying a drug in overdose, comprising a method of analysis of an applied stimulus or condition, according to claim 63.

80.

* * *.

81.	A computer system or device operatively configured to implement a method according to any one of claims 1 to 68.

82.	Computer code suitable for implementing a method according to any one of claims 1 to 68 on a suitable computer system.

83.	A computer program comprising computer program means adapted to perform a method according to any one of claims 1 to 68 when said program is run on a computer.

84.	A computer program according to claim 71 embodied on a computer readable medium.

85.	A data carrier which carries computer code suitable for implementing a method according to any one of claims 1 to 68 on a suitable computer.
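
Illustrative sketch (not part of the claims). The classification steps recited above, namely statistical properties of a reference set, a confidence interval spectrum, and a fraction-outside test, might be sketched as follows. This is a minimal sketch under stated assumptions, not the applicants' implementation: the confidence interval spectrum is assumed to be a per-point empirical percentile interval over the reference spectra, the function names and the synthetic data are hypothetical, and the 95% level and 2% fraction are simply the values named in the dependent claims.

```python
import numpy as np


def reference_statistics(reference):
    """Per-point statistics over an (n_spectra, n_points) reference set."""
    mean = reference.mean(axis=0)
    sd = reference.std(axis=0, ddof=1)          # sample standard deviation
    centred = reference - mean
    m2 = (centred ** 2).mean(axis=0)            # second central moment
    # Points with zero mean or zero variance yield inf/nan rather than an error.
    with np.errstate(divide="ignore", invalid="ignore"):
        rsd = sd / np.abs(mean)                 # relative standard deviation
        skewness = (centred ** 3).mean(axis=0) / m2 ** 1.5
        kurtosis = (centred ** 4).mean(axis=0) / m2 ** 2 - 3.0  # excess kurtosis
    return {"mean": mean, "sd": sd, "rsd": rsd,
            "skewness": skewness, "kurtosis": kurtosis}


def confidence_interval_spectrum(reference, level=0.95):
    """ASSUMPTION: empirical percentile bounds at each data point."""
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(reference, [tail, 100.0 - tail], axis=0)
    return lower, upper


def classify_spectrum(sample, lower, upper, max_fraction_outside=0.02):
    """Label a spectrum by the fraction of points outside the interval."""
    outside = (sample < lower) | (sample > upper)
    fraction = float(outside.mean())
    label = "abnormal" if fraction > max_fraction_outside else "normal"
    # The indices of outlying points mark candidate abnormal spectral regions.
    return label, fraction, np.flatnonzero(outside)


# Synthetic demonstration data (hypothetical, not taken from the application):
# 41 reference spectra of 50 points each, spread evenly around a baseline.
baseline = np.linspace(1.0, 2.0, 50)
reference = baseline + np.linspace(-1.0, 1.0, 41)[:, None]
lower, upper = confidence_interval_spectrum(reference)

perturbed = baseline.copy()
perturbed[10:15] += 3.0   # five points (10%) pushed well above the interval

print(classify_spectrum(baseline, lower, upper)[0])
print(classify_spectrum(perturbed, lower, upper)[0])
```

The indices returned for an abnormal spectrum correspond to the spectral regions on which the biomarker-identification claims would rely. The interval construction is an illustrative choice and could be swapped for another, for example one based on the mean and standard deviation.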

Description:

METHODS FOR SPECTRAL ANALYSIS AND THEIR APPLICATIONS: RELIABILITY ASSESSMENT

RELATED APPLICATION

This application is related to (and where permitted by law, claims priority to) United States Provisional patent application USSN 60/295,636 filed 04 June 2001, the contents of which are incorporated herein by reference in their entirety.

TECHNICAL FIELD

This invention pertains generally to the field of metabonomics, and, more particularly, to chemometric methods for the analysis of chemical, biochemical, and biological data, for example, spectral data, for example, nuclear magnetic resonance (NMR) spectra and other types of spectra, and their applications, including, e.g., methods of classification (of spectra, samples, subjects, etc.), methods of identifying biomarkers and/or biomarker combinations; methods of analysis of an applied stimulus or condition; methods of diagnosis; etc.

BACKGROUND

Throughout this specification, including the claims which follow, unless the context requires otherwise, the word "comprise," and variations such as "comprises" and "comprising," will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps.

It must be noted that, as used in the specification and the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise.

Ranges are often expressed herein as from "about" one particular value, and/or to "about" another particular value. When such a range is expressed, another embodiment includes from the one particular value and/or to the other particular value.

Similarly, when values are expressed as approximations, by the use of the antecedent "about," it will be understood that the particular value forms another embodiment.

Biosystems

Biosystems can conveniently be viewed at several levels of bio-molecular organisation based on biochemistry, i.e., genetic and gene expression (genomic and transcriptomic), protein and signaling (proteomic) and metabolic control and regulation (metabonomic).

There are also important variations in cellular ionic regulation that relate to genetic, proteomic and metabolic activities, and these should also be studied systematically, even at the cellular and sub-cellular level, to complete the full description of the bio-molecular organisation of a bio-system.

Significant progress has been made in developing methods to determine and quantify the biochemical processes occurring in living systems. Such methods are valuable in the diagnosis, prognosis and treatment of disease, the development of drugs, for improving therapeutic regimes for current drugs, and the like.

Many diseases of the human or animal body (such as cancers, degenerative diseases, autoimmune diseases and the like) have an underlying basis in alterations in the expression of certain genes. The expressed gene products, proteins, mediate effects such as abnormal cell growth, cell death or inflammation. Some of these effects are caused directly by protein-protein interactions; others are caused by proteins acting on small molecules (e.g., "second messengers") which trigger effects including further gene expression.

Likewise, disease states caused by external agents such as viruses and bacteria provoke a multitude of complex responses in the infected host.

In a similar manner, the treatment of disease through the administration of drugs can result in a wide range of desired effects and unwanted side effects in a patient.

In recent years, it has been appreciated that the reaction of human and animal subjects to disease and treatments for them can vary according to the genomic makeup of an individual. This has led to the development of the field of "pharmacogenomics." A fuller understanding of how an individual's own genome reacts to a particular disease and/or drug treatment will allow the development of new therapies, as well as the refinement of existing ones.

At the genetic level, methods for examining gene expression in response to these types of events are often referred to as "genomic methods," and are concerned with the detection and quantification of the expression of an organism's genes, collectively referred to as its "genome," usually by detecting and/or quantifying genetic molecules, such as DNA and RNA. Genomic studies often exploit proprietary "gene chips," which are small disposable devices encoded with an array of genes that respond to extracted mRNAs produced by cells (see, for example, Klenk et al., 1997). Many genes can be placed on a chip array and patterns of gene expression, or changes therein, can be monitored rapidly, although at some considerable cost.

However, the biological consequences of gene expression, or altered gene expression following perturbation, are extremely complex. This has led to the development of "proteomic methods" which are concerned with the semi-quantitative measurement of the production of cellular proteins of an organism, collectively referred to as its "proteome" (see, for example, Geisow, 1998). Proteomic measurements utilise a variety of technologies, but all involve a protein separation method, e. g., 2D gel-electrophoresis, allied to a chemical characterisation method, usually some form of mass spectrometry.

At present, genomic methods have a high associated operational cost and proteomic methods require investment in expensive capital cost equipment and are labour intensive, but both have the potential to be powerful tools for studying biological response. The choice of method is still uncertain since careful studies have sometimes shown a low correlation between the pattern of gene expression and the pattern of protein expression, probably due to sampling for the two technologies at inappropriate time points. See, e. g., Gygi et al., 1999. Even in combination, genomic and proteomic methods still do not provide the range of information needed for understanding integrated cellular function in a living system, since they do not take account of the dynamic metabolic status of the whole organism.

For example, genomic and proteomic studies may implicate a particular gene or protein in a disease or a xenobiotic response because the level of expression is altered, but the change in gene or protein level may be transitory or may be counteracted downstream and as a result there may be no effect at the cellular and/or biochemical level.

Conversely, sampling tissue for genomic and proteomic studies at inappropriate time points may result in a relevant gene or protein being overlooked.

Gene-based prognosis has yet to become a clinical reality for any major prevalent disease, almost all of which have multi-gene modes of inheritance and significant environmental impact making it difficult to identify the gene panels responsible for susceptibility.

While genomic and proteomic methods may be useful aids, for example, in drug development, they do suffer from substantial limitations. For example, while genomic and proteomic methods may ultimately give profound insights into toxicological mechanisms and provide new surrogate biomarkers of disease, at present it is very difficult to relate genomic and proteomic findings to classical cellular or biochemical indices or endpoints. One simple reason for this is that with current technology and approach, the correlation of the time-response to drug exposure is difficult. Further difficulties arise with in vitro cell-based studies. These difficulties are particularly important for the many known cases where the metabolism of the compound is a prerequisite for a toxic effect and especially true where the target organ is not the site of primary metabolism. This is particularly true for pro-drugs, where some aspect of in situ chemical (e. g., enzymatic) modification is required for activity.

Metabonomics

A new "metabonomic" approach has been developed which is aimed at augmenting and complementing the information provided by genomics and proteomics. "Metabonomics" is conventionally defined as "the quantitative measurement of the multiparametric metabolic response of living systems to pathophysiological stimuli or genetic modification" (see, for example, Nicholson et al., 1999). This concept has arisen primarily from the application of ¹H NMR spectroscopy to study the metabolic composition of biofluids, cells, and tissues and from studies utilising pattern recognition (PR), expert systems and other chemoinformatic tools to interpret and classify complex NMR-generated metabolic data sets. Metabonomic methods have the potential, ultimately, to determine the entire dynamic metabolic make-up of an organism.

As outlined above, each level of bio-molecular organisation requires a series of analytical bio-technologies appropriate to the recovery of the individual types of bio-molecular data. Genomic, proteomic and metabonomic technologies by definition generate massive data sets which require appropriate multi-variate statistical tools (chemometrics, bio-informatics) for data mining and to extract useful biological information. These data exploration tools also allow the inter-relationships between multivariate data sets from the different technologies to be investigated; they facilitate dimension reduction and extraction of latent properties and allow multidimensional visualization.

This leads to the concept of "bionomics", the quantitative measurement and understanding of the integrated function (and dysfunction) of biological systems at all major levels of bio-molecular organisation. In the study of altered gene expression (known as transcriptomics), the variables are mRNA responses measured using gene chips; in proteomics, protein synthesis and associated post-translational modifications are typically measured using (mainly) gel-electrophoresis coupled to mass spectrometry. In both cases, thousands of variables can be measured and related to biological end-points using statistical methods. In metabolic (metabonomic) studies, only NMR (especially ¹H) and mass spectrometry have been used to provide this level of data density on bio-materials, although these data can be supplemented by conventional biochemical assays.

For in vivo mammalian studies, the ability to perform metabonomic studies on biofluids such as plasma, CSF and urine is very important because it gives integrated systems-based information on the whole organism. Furthermore, in clinical settings, for the full utilization of functional genomic knowledge in patient screening, diagnostics and prognostics, it is much more practical and ethically acceptable to analyze biofluid samples than to perform human tissue biopsies and measure gene responses.

A pathological condition or a xenobiotic may act at the pharmacological level only and hence may not affect gene regulation or expression directly. Alternatively, significant disease or toxicological effects may be completely unrelated to gene switching. For example, exposure to ethanol in vivo may cause many changes in gene expression but none of these events explains drunkenness. In cases such as these, genomic and proteomic methods are likely to be ineffective. However, all disease or drug-induced pathophysiological perturbations result in disturbances in the ratios and concentrations, binding or fluxes of endogenous biochemicals, either by direct chemical reaction or by binding to key enzymes or nucleic acids that control metabolism. If these disturbances are of sufficient magnitude, effects will result which will affect the efficient functioning of the whole organism. In body fluids, metabolites are in dynamic equilibrium with those inside cells and tissues and, consequently, abnormal cellular processes in tissues of the whole organism following a toxic insult or as a consequence of disease will be reflected in altered biofluid compositions.

Fluids secreted, excreted, or otherwise derived from an organism ("biofluids") provide a unique window into its biochemical status since the composition of a given biofluid is a consequence of the function of the cells that are intimately concerned with the fluid's manufacture and secretion. For example, the composition of a particular fluid (e. g., urine, blood plasma, milk, etc.) can carry biochemical information on details of organ function (or dysfunction), for example, as a result of xenobiotics, disease, and/or genetic modification. Similarly, the composition and condition of an organism's tissues are also indicators of the organism's biochemical status.

In general, a xenobiotic is a substance (e. g., compound, composition) which is administered to an organism, or to which the organism is exposed. In general, xenobiotics are chemical, biochemical or biological species (e. g., compounds) which are not normally present in that organism, or are normally present in that organism, but not at the level obtained following administration/exposure. Examples of xenobiotics include drugs, formulated medicines and their components (e. g., vaccines, immunological stimulants, inert carrier vehicles), infectious agents, pesticides, herbicides, substances present in foods (e. g. plant compounds administered to animals), and substances present in the environment.

In general, a disease state pertains to a deviation from the normal healthy state of the organism. Examples of disease states include, but are not limited to, bacterial, viral, and parasitic infections; cancer in all its forms; degenerative diseases (e. g., arthritis, multiple sclerosis); trauma (e. g., as a result of injury); organ failure (including diabetes); cardiovascular disease (e. g., atherosclerosis, thrombosis); and, inherited diseases caused by genetic composition (e. g., sickle-cell anaemia).

In general, a genetic modification pertains to alteration of the genetic composition of an organism. Examples of genetic modifications include, but are not limited to: the incorporation of a gene or genes into an organism from another species; increasing the number of copies of an existing gene or genes in an organism; removal of a gene or genes from an organism; and, rendering a gene or genes in an organism non-functional.

Biofluids often exhibit very subtle changes in metabolite profile in response to external stimuli. This is because the body's cellular systems attempt to maintain homeostasis (constancy of internal environment), for example, in the face of cytotoxic challenge.

One means of achieving this is to modulate the composition of biofluids. Hence, even when cellular homeostasis is maintained, subtle responses to disease or toxicity are expressed in altered biofluid composition. However, dietary, diurnal and hormonal variations may also influence biofluid compositions, and it is clearly important to differentiate these effects if correct biochemical inferences are to be drawn from their analysis.

Metabonomics offers a number of distinct advantages (over genomics and proteomics) in a clinical setting: firstly, it can often be performed on standard preparations (e. g., of serum, plasma, urine, etc.), circumventing the need for specialist preparations of cellular RNA and protein required for genomics and proteomics, respectively. Secondly, many of the risk factors already identified (e. g., levels of various lipids in blood) are small molecule metabolites which will contribute to the metabonomic dataset.

Application of NMR to Metabonomics

One of the most successful approaches to biofluid analysis has been the use of NMR spectroscopy (see, for example, Nicholson et al., 1989); similarly, intact tissues have been successfully analysed using magic-angle-spinning ¹H NMR spectroscopy (see, for example, Moka et al., 1998; Tomlins et al., 1998).

The NMR spectrum of a biofluid provides a metabolic fingerprint or profile of the organism from which the biofluid was obtained, and this metabolic fingerprint or profile is characteristically changed by a disease, toxic process, or genetic modification. For example, NMR spectra may be collected for various states of an organism (e. g., pre-dose and various times post-dose, for one or more xenobiotics, separately or in combination; healthy (control) and diseased animal; unmodified (control) and genetically modified animal).

For example, in the evaluation of undesired toxic side-effects of drugs, each compound or class of compound produces characteristic changes in the concentrations and patterns of endogenous metabolites in biofluids that provide information on the sites and basic mechanisms of the toxic process. NMR analysis of biofluids has successfully uncovered novel metabolic markers of organ-specific toxicity in the laboratory rat, and it is in this "exploratory" role that NMR as an analytical biochemistry technique excels.

However, the biomarker information in NMR spectra of biofluids is very subtle, as hundreds of compounds representing many pathways can often be measured simultaneously, and it is this overall metabonomic response to toxic insult that so well characterises the lesion.

Another important advantage of NMR-based metabonomics over genomics or proteomics is the intrinsic analytical accuracy of NMR spectroscopy. Reanalysis of the same sample by ¹H NMR spectroscopy results in a typical coefficient of variation for the measurement of peak intensities in a spectrum of less than 5% across the whole range of peaks. Thus if the appropriate experiments are undertaken, on average the value of each peak intensity will lie in the range 0.95 to 1.05 of the true value. In addition, it is possible using NMR spectroscopy to measure absolute amounts or concentrations of a number of analytes whereas using gene chip technology only fold changes can be determined. The best available accuracy achieved using gene chips is a two fold change (i. e., the value for each parameter lies in the range 0.50 to 2.00 fold of the "true" value), and proteomic technology is even less intrinsically accurate. A similar limitation also applies to proteomic studies.
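The coefficient of variation referred to above is the standard deviation of repeat measurements divided by their mean. The following is a minimal sketch only; the function name and the repeat intensities are hypothetical illustrative values, not data from this specification.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation of repeat measurements divided by their mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical repeat measurements of one peak intensity (arbitrary units).
repeat_intensities = [100.0, 103.0, 97.0, 101.0, 99.0]
cv = coefficient_of_variation(repeat_intensities)

# A CV below 0.05 corresponds to the "less than 5%" figure quoted in the text.
print(f"CV = {cv:.1%}")
```

For these illustrative values the mean is 100 and the sample standard deviation is about 2.24, so the CV is roughly 2.2%.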

Although technology is undoubtedly improving at a rapid rate, the gap between the intrinsic accuracies of NMR spectroscopy and gene chip technology is so wide that it will require a revolutionary rather than evolutionary improvement in gene expression quantification methodology before it can rival the accuracy of NMR spectroscopy.

The intrinsic accuracy of NMR provides a distinct advantage when applying pattern recognition techniques. The multivariate nature of the NMR data means that classification of samples is possible using a combination of descriptors even when one descriptor is not sufficient, because of the inherently low analytical variation in the data.

All biological fluids and tissues have their own characteristic physico-chemical properties, and these affect the types of NMR experiment that may be usefully employed. One major advantage of using NMR spectroscopy to study complex biomixtures is that measurements can often be made with minimal sample preparation (usually with only the addition of 5-10% D2O) and a detailed analytical profile can be obtained on the whole biological sample. Sample volumes are small, typically 0.3 to 0.5 mL for standard probes, and as low as 3 µL for microprobes. Acquisition of simple NMR spectra is rapid and efficient using flow-injection technology. It is usually necessary to suppress the water NMR resonance.

Many biofluids are not chemically stable and for this reason care should be taken in their collection and storage. For example, cell lysis in erythrocytes can easily occur. If a substantial amount of D2O has been added, then it is possible that certain ¹H NMR resonances will be lost by H/D exchange. Freeze-drying of biofluid samples also causes the loss of volatile components such as acetone. Biofluids are also very prone to microbiological contamination, especially fluids, such as urine, which are difficult to collect under sterile conditions. Many biofluids contain significant amounts of active enzymes, either normally or due to a disease state or organ damage, and these enzymes may alter the composition of the biofluid following sampling. Samples should be stored deep frozen to minimise the effects of such contamination. Sodium azide is usually added to urine at the collection point to act as an antimicrobial agent. Metal ions and/or chelating agents (e. g., EDTA) may be added to bind to endogenous metal ions (e. g., Ca2+, Mg2+ and Zn2+) and chelating agents (e. g., free amino acids, especially glutamate, cysteine, histidine and aspartate; citrate) to intentionally alter and/or enhance the NMR spectrum.

In all cases the analytical problem usually involves the detection of "trace" amounts of analytes in a very complex matrix of potential interferences. It is, therefore, critical to choose a suitable analytical technique for the particular class of analyte of interest in the particular biomatrix which could be, for example, a biofluid or a tissue. High resolution NMR spectroscopy (in particular ¹H NMR) appears to be particularly appropriate. The main advantages of using ¹H NMR spectroscopy in this area are the speed of the method (with spectra being obtained in 5 to 10 minutes), the requirement for minimal sample preparation, and the fact that it provides a non-selective detector for all metabolites in the biofluid regardless of their structural type, provided only that they are present above the detection limit of the NMR experiment and that they contain non-exchangeable hydrogen atoms. The speed advantage is of crucial importance in this area of work as the clinical condition of a patient may require rapid diagnosis, and can change very rapidly and so correspondingly rapid changes must be made to the therapy provided.

NMR studies of body fluids should ideally be performed at the highest magnetic field available to obtain maximal dispersion and sensitivity, and most ¹H NMR studies have been performed at 400 MHz or greater. With every new increase in available spectrometer frequency the number of resonances that can be resolved in a biofluid increases and although this has the effect of solving some assignment problems, it also poses new ones. Furthermore, there are still important problems of spectral interpretation that arise due to compartmentation and binding of small molecules in the organised macromolecular domains that exist in some biofluids such as blood plasma and bile. All this complexity need not reduce the diagnostic capabilities and potential of the technique, but demonstrates the problems of biological variation and the influence of variation on diagnostic certainty.

The information content of biofluid spectra is very high and the complete assignment of the ¹H NMR spectrum of most biofluids is usually not possible (even using 900 MHz NMR spectroscopy). However, the assignment problems vary considerably between biofluid types. Some fluids have near constant composition and concentrations and in these the majority of the NMR signals have been assigned. In contrast, urine composition can be very variable and there is enormous variation in the concentration range of NMR-detectable metabolites; consequently, complete analysis is much more difficult. Those metabolites present close to the limits of detection for 1-dimensional (1D) NMR spectroscopy (typically ca. 100 nM at 800 MHz) pose severe NMR spectral assignment problems. (In absolute terms, the detection limit may be ca. 4 nmol, e. g., 1 µg of a 250 g/mol compound in a 0.5 mL sample volume.) Even at the present level of technology in NMR, it is not yet possible to detect many important biochemical substances (e. g. hormones, some proteins, nucleic acids) in body fluids because of problems with sensitivity, line widths, dispersion and dynamic range and this area of research will continue to be technology-limited. In addition, the collection of NMR spectra of biofluids may be complicated by the relative water intensity, sample viscosity, protein content, lipid content, and low molecular weight peak overlap.

Usually in order to assign ¹H NMR spectra, comparison is made with spectra of authentic materials and/or by standard addition of an authentic reference standard to the sample. Additional confirmation of assignments is usually sought from the application of other NMR methods, including, for example, 2-dimensional (2D) NMR methods, particularly COSY (correlation spectroscopy), TOCSY (total correlation spectroscopy), inverse-detected heteronuclear correlation methods such as HMBC (heteronuclear multiple bond correlation), HSQC (heteronuclear single quantum coherence), and HMQC (heteronuclear multiple quantum coherence), 2D J-resolved (JRES) methods, spin-echo methods, relaxation editing, diffusion editing (e. g., both 1D NMR and 2D NMR such as diffusion-edited TOCSY), and multiple quantum filtering.

Detailed ¹H NMR spectroscopic data for a wide range of metabolites and biomolecules found in biofluids have been published (see, for example, Lindon et al., 1999) and supplementary information is available in several literature compilations of data (see, for example, Fan, 1996; Sze et al., 1994).

For example, the successful application of ¹H NMR spectroscopy of biofluids to study a variety of metabolic diseases and toxic processes has now been well established and many novel metabolic markers of organ-specific toxicity have been discovered (see, for example, Nicholson et al., 1989; Lindon et al., 1999). For example, NMR spectra of urine are identifiably altered in situations where damage has occurred to the kidney or liver. It has been shown that specific and identifiable changes can be observed which distinguish the organ that is the site of a toxic lesion. It is also possible to focus on particular parts of an organ, such as the cortex of the kidney, and even, in favourable cases, on very localised parts of the cortex.

It is also possible to deduce the biochemical mechanism of the xenobiotic toxicity, based on a biochemical interpretation of the changes in the urine. A wide range of toxins has now been investigated including mostly kidney toxins and liver toxins, but also testicular toxins, mitochondrial toxins and muscle toxins.

Pattern Recognition

However, a limiting factor in understanding the biochemical information from both 1D and 2D NMR spectra of tissues and biofluids is their complexity. The most efficient way to investigate these complex multiparametric data is to employ the 1D and 2D NMR metabonomic approach in combination with computer-based "pattern recognition" (PR) methods and expert systems. These statistical tools are similar to those currently being explored by workers in the fields of genomics and proteomics.

Pattern recognition (PR) methods can be used to reduce the complexity of data sets, to generate scientific hypotheses and to test hypotheses. In general, the use of pattern recognition algorithms allows the identification, and, with some methods, the interpretation of some non-random behaviour in a complex system which can be obscured by noise or random variations in the parameters defining the system. Also, the number of parameters used can be very large such that visualisation of the regularities, which for the human brain is best in no more than three dimensions, can be difficult. Usually the number of measured descriptors is much greater than three and so simple scatter plots cannot be used to visualise any similarity between samples.

Pattern recognition methods have been used widely to characterise many different types of problem ranging, for example, over linguistics, fingerprinting, chemistry and psychology. In the context of the methods described herein, pattern recognition is the use of multivariate statistics, both parametric and non-parametric, to analyse spectroscopic data, and hence to classify samples and to predict the value of some dependent variable based on a range of observed measurements. There are two main approaches. One set of methods is termed "unsupervised" and these simply reduce data complexity in a rational way and also produce display plots which can be interpreted by the human eye. The other approach is termed "supervised" whereby a training set of samples with known class or outcome is used to produce a mathematical model and this is then evaluated with independent validation data sets.

Unsupervised PR methods are used to analyse data without reference to any other independent knowledge, for example, without regard to the identity or nature of a xenobiotic or its mode of action. Examples of unsupervised pattern recognition methods include principal component analysis (PCA), hierarchical cluster analysis (HCA), and non-linear mapping (NLM).

One of the most useful and easily applied unsupervised PR techniques is principal components analysis (PCA) (see, for example, Kowalski et al., 1986). Principal components (PCs) are new variables created from linear combinations of the starting variables with appropriate weighting coefficients. The properties of these PCs are such that: (i) each PC is orthogonal to (uncorrelated with) all other PCs, and (ii) the first PC contains the largest part of the variance of the data set (information content) with subsequent PCs containing correspondingly smaller amounts of variance.

PCA, a dimension reduction technique, takes m objects or samples, each described by values in K dimensions (descriptor vectors), and extracts a set of eigenvectors, which are linear combinations of the descriptor vectors. The eigenvectors and eigenvalues are obtained by diagonalisation of the covariance matrix of the data. The eigenvectors can be thought of as a new set of orthogonal plotting axes, called principal components (PCs). The extraction of the systematic variations in the data is accomplished by projection and modelling of variance and covariance structure of the data matrix. The primary axis is a single eigenvector describing the largest variation in the data, and is termed principal component one (PC1). Subsequent PCs, ranked by decreasing eigenvalue, describe successively less variability. The variation in the data that has not been described by the PCs is called residual variance and signifies how well the model fits the data. The projections of the descriptor vectors onto the PCs are defined as scores, which reveal the relationships between the samples or objects. In a graphical representation (a "scores plot" or eigenvector projection), objects or samples having similar descriptor vectors will group together in clusters. Another graphical representation is called a loadings plot, and this connects the PCs to the individual descriptor vectors, and displays both the importance of each descriptor vector to the interpretation of a PC and the relationship among descriptor vectors in that PC. In fact, a loading value is simply the cosine of the angle which the original descriptor vector makes with the PC. Descriptor vectors which fall close to the origin in this plot carry little information in the PC, while descriptor vectors distant from the origin (high loading) are important in interpretation.

Thus a plot of the first two or three PC scores gives the "best" representation, in terms of information content, of the data set in two or three dimensions, respectively. A plot of the first two principal component scores, PC1 and PC2, provides the maximum information content of the data in two dimensions. Such PC maps can be used to visualise inherent clustering behaviour, for example, for drugs and toxins based on similarity of their metabonomic responses and hence mechanism of action. Of course, the clustering information might be in lower PCs and these have also to be examined.
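The PCA procedure described above (diagonalisation of the covariance matrix, ranking of eigenvectors by decreasing eigenvalue, and projection of the samples to obtain scores) may be sketched as follows. This is an illustrative sketch only, assuming NumPy is available; the function name `pca` and the synthetic two-group data are hypothetical and are not part of this specification.

```python
import numpy as np

def pca(data, n_components=2):
    """PCA by diagonalising the covariance matrix of mean-centred data.

    data: array of shape (m samples, K descriptors).
    Returns (scores, loadings, explained_variance_fraction).
    """
    centred = data - data.mean(axis=0)
    covariance = np.cov(centred, rowvar=False)            # K x K matrix
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]                 # decreasing eigenvalue
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    loadings = eigenvectors[:, :n_components]             # K x n_components
    scores = centred @ loadings                           # m x n_components
    explained = eigenvalues[:n_components] / eigenvalues.sum()
    return scores, loadings, explained

# Hypothetical data: 20 samples, 5 descriptors, two groups offset in descriptor 0.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, size=(10, 5))
group_b = rng.normal(0.0, 1.0, size=(10, 5)) + np.array([8.0, 0.0, 0.0, 0.0, 0.0])
spectra = np.vstack([group_a, group_b])

scores, loadings, explained = pca(spectra)
print("fraction of variance in PC1, PC2:", explained)
```

Plotting `scores[:, 0]` against `scores[:, 1]` would give a scores plot of the kind described, and the columns of `loadings` would supply a loadings plot.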

Hierarchical Cluster Analysis, another unsupervised pattern recognition method, permits the grouping of data points which are similar by virtue of being "near" to one another in some multidimensional space. Individual data points may be, for example, the signal intensities for particular assigned peaks in an NMR spectrum. A "similarity matrix," S, is constructed with elements $s_{ij} = 1 - r_{ij}/r_{ij}^{\max}$, where $r_{ij}$ is the interpoint distance between points i and j (e. g., Euclidean interpoint distance), and $r_{ij}^{\max}$ is the largest interpoint distance for all points. The most distant pair of points will have $s_{ij}$ equal to 0, since $r_{ij}$ then equals $r_{ij}^{\max}$. Conversely, the closest pair of points will have the largest $s_{ij}$. For two identical points, $s_{ij}$ is 1.
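By way of illustration only, the similarity matrix defined above may be computed as in the following sketch, which assumes Euclidean interpoint distances; the function name and the three example points are hypothetical.

```python
import math

def similarity_matrix(points):
    """Similarity = 1 - (interpoint distance / largest interpoint distance)."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    r_max = max(max(row) for row in dist)
    if r_max == 0:
        # All points coincide, so every pair is identical.
        return [[1.0] * n for _ in range(n)]
    return [[1.0 - d / r_max for d in row] for row in dist]

# Hypothetical points with interpoint distances 5, 5 and 10.
peaks = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
S = similarity_matrix(peaks)
```

Here the most distant pair (the first and third points) has similarity 0, the two adjacent pairs have similarity 0.5, and each point has similarity 1 with itself.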

The similarity matrix is scanned for the closest pair of points. The pair of points are reported with their separation distance, and then the two points are deleted and replaced with a single combined point. The process is then repeated iteratively until only one point remains. A number of different methods may be used to determine how two clusters will be joined, including the nearest neighbour method (also known as the single link method), the furthest neighbour method, and the centroid method (including centroid link, incremental link, median link, group average link, and flexible link variations).
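The iterative joining procedure may be sketched as follows, here using the nearest neighbour (single link) rule, which is only one of the joining methods listed above. The function name and the one-dimensional example points are hypothetical, and the sketch favours clarity over efficiency.

```python
import math

def single_link_merges(points):
    """Agglomerative clustering with the nearest neighbour (single link) rule.

    Returns the merges in order as (cluster_a, cluster_b, separation distance),
    where each cluster is a tuple of point indices.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: separation is that of the closest pair of members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Delete the two joined clusters and replace them with the combined one.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

points = [(0.0,), (1.0,), (5.0,), (7.0,)]
merges = single_link_merges(points)
for cluster_a, cluster_b, distance in merges:
    print(cluster_a, cluster_b, distance)
```

For these four points the reported separations are 1.0, 2.0 and 4.0; a list of this kind is what would be drawn as a dendrogram.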

The reported connectivities are then plotted as a dendrogram (a tree-like chart which allows visualisation of clustering), showing sample-sample connectivities versus increasing separation distance (or equivalently, versus decreasing similarity). The dendrogram has the property that the branch lengths are proportional to the distances between the various clusters and hence the length of the branches linking one sample to the next is a measure of their similarity. In this way, similar data points may be identified algorithmically.

Non-linear mapping (NLM) is a simple concept which involves calculation of the distances between all of the points in the original K dimensions. This is followed by construction of a map of points in 2 or 3 dimensions where the sample points are placed in random positions or at values determined by a prior principal components analysis.

The least squares criterion is used to move the sample points in the lower dimension map to fit the inter-point distances in the lower dimension space to those in the K dimensional space. Non-linear mapping is therefore an approximation to the true inter-point distances, but points close in K-dimensional space should also be close in 2 or 3 dimensional space (see, for example, Brown et al., 1996; Farrant et al., 1992).
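A minimal sketch of non-linear mapping along these lines is given below, assuming NumPy is available. It minimises a plain, unweighted sum of squared distance differences by fixed-step gradient descent from random starting positions; this is an assumption made for illustration, and published NLM variants may use weighted criteria and other optimisers. All function names, parameters and data are hypothetical.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distances between all rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(low, target):
    """Sum over pairs i<j of (map distance - original distance) squared."""
    upper = np.triu_indices(len(low), k=1)
    return float(((pairwise_distances(low)[upper] - target[upper]) ** 2).sum())

def nonlinear_map(data, n_dims=2, n_iter=500, step=0.01, seed=0):
    """Move randomly placed map points downhill on the least-squares stress."""
    rng = np.random.default_rng(seed)
    target = pairwise_distances(data)              # distances in K dimensions
    low = rng.normal(size=(len(data), n_dims))     # random starting positions
    for _ in range(n_iter):
        diff = low[:, None, :] - low[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, 1.0)                # avoid division by zero
        coeff = 2.0 * (dist - target) / dist
        np.fill_diagonal(coeff, 0.0)
        # Gradient of the stress with respect to each map point.
        gradient = (coeff[:, :, None] * diff).sum(axis=1)
        low = low - step * gradient
    return low

# Hypothetical data: 6 samples described in K = 4 dimensions.
data = np.random.default_rng(1).normal(size=(6, 4))
mapped = nonlinear_map(data)
print("final stress:", stress(mapped, pairwise_distances(data)))
```

Starting from PCA scores rather than random positions, as the text also permits, would only require replacing the line that initialises `low`.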

In this simple metabonomic approach, a sample from an animal treated with a compound of unknown toxicity is compared with a database of NMR-generated metabolic data from control and toxin-treated animals. By observing its position on the PR map relative to samples of known effect, the unknown toxin can often be classified.

The same approach can be used for human samples for classification according to disease. However, such data are often more complex, with time-related biochemical changes detected by NMR. Also, it is more rigorous to compare effects of xenobiotics in the original K-dimensional NMR metabonomic space.

Alternatively, and in order to develop automatic classification methods, it has proved efficient to use a "supervised" approach to NMR data analysis. Here, a "training set" of NMR metabonomic data is used to construct a statistical model that predicts correctly the "class" of each sample. This model is then tested with independent data (referred to as a test or validation set) to determine the robustness of the computer-based model. These models are sometimes termed "expert systems," but may be based on a range of different mathematical procedures. Supervised methods can use a data set with reduced dimensionality (for example, the first few principal components), but typically use unreduced data, with full dimensionality. In all cases the methods allow the quantitative description of the multivariate boundaries that characterise and separate each class, for example, each class of xenobiotic in terms of its metabolic effects. It is also possible to obtain confidence limits on any predictions, for example, a level of probability to be placed on the goodness of fit (see, for example, Kowalski et al., 1986). The robustness of the predictive models can also be checked using cross-validation, by leaving out selected samples from the analysis.
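Cross-validation by leaving out samples may be illustrated with the following sketch, in which each sample in turn is left out and its class is predicted from the remaining samples. A k-nearest-neighbour classifier is used purely as a simple stand-in for the supervised model; the function names, descriptor values and class labels are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the majority class among the k nearest training samples."""
    ranked = sorted(range(len(train_x)),
                    key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(x, y, k=3):
    """Leave each sample out in turn and predict its class from the rest."""
    correct = 0
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_x, rest_y, x[i], k) == y[i]:
            correct += 1
    return correct / len(x)

# Hypothetical training set: two descriptors per sample, two classes.
features = [(0.1, 0.2), (0.0, 0.4), (0.3, 0.1), (0.2, 0.3),
            (2.0, 2.1), (2.2, 1.9), (1.9, 2.3), (2.1, 2.0)]
labels = ["control"] * 4 + ["toxin"] * 4

accuracy = leave_one_out_accuracy(features, labels)
new_sample_class = knn_predict(features, labels, (1.8, 2.0))
print(accuracy, new_sample_class)
```

With such well separated illustrative classes every left-out sample is recovered correctly; real metabonomic data would not generally be expected to behave so cleanly.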

Expert systems may operate to generate a variety of useful outputs, for example, (i) classification of the sample as "normal" or "abnormal" (this is a useful tool in the control of spectrometer automation, e.g., using sequential flow injection NMR spectroscopy); (ii) classification of the target organ for toxicity and the site of action within the tissue (in certain cases, the mechanism of toxic action may also be classified); and, (iii) identification of the biomarkers of a pathological disease condition or toxic effect for the particular compound under study. For example, a sample can be classified as belonging to a single class of toxicity, to multiple classes of toxicity (more than one target organ), or to no class. The latter case would indicate deviation from normality (control) based on the training set model but having a dissimilar metabolic effect to any toxicity class modelled in the training set (unknown toxicity type). Under (ii), a system could also be generated to support decisions in clinical medicine (e.g., for efficacy of drugs) rather than toxicity.

Examples of supervised pattern recognition methods include the following: soft independent modelling of class analogy (SIMCA) (see, for example, Wold, 1976); partial least squares analysis (PLS) (see, for example, Wold, 1966; Joreskog, 1982; Frank, 1984; Bro, R., 1997); linear discriminant analysis (LDA) (see, for example, Nillson, 1965); K-nearest neighbour analysis (KNN) (see, for example, Brown et al., 1996); artificial neural networks (ANN) (see, for example, Wasserman, 1989; Anker et al., 1992; Hare, 1994); probabilistic neural networks (PNNs) (see, for example, Parzen, 1962; Bishop, 1995; Speckt, 1990; Broomhead et al., 1988; Patterson, 1996); rule induction (RI) (see, for example, Quinlan, 1986); and, Bayesian methods (see, for example, Bretthorst, 1990a, 1990b, 1988).

As the size of metabonomic databases increases together with improvements in rapid throughput of NMR samples (> 300 samples per day per spectrometer is now possible with the first generation of flow injection systems), more subtle expert systems may be necessary, for example, using techniques such as "fuzzy logic" which permit greater flexibility in decision boundaries.

Application to Metabonomics

Pattern recognition methods have been applied to the analysis of metabonomic data. See, for example, Lindon et al., 2001. A number of spectroscopic techniques have been used to generate the data, including NMR spectroscopy and mass spectrometry.

Pattern recognition analysis of such data sets has been successful in some cases. The successful studies include, for example, complex NMR data from biofluids (see, for example, Anthony et al., 1994; Anthony et al., 1995; Beckwith-Hall et al., 1998; Garland et al., 1990a; Garland et al., 1990b; Garland et al., 1991; Holmes et al., 1998a; Holmes et al., 1998b; Holmes et al., 1992; Holmes et al., 1994; Spraul et al., 1994; Tranter et al., 1999), conventional NMR spectra from tissue samples (Somorjai et al., 1995), magic-angle-spinning (MAS) NMR spectra of tissues (Garrod et al., 2001), in vivo NMR spectra (Morvan et al., 1990; Howells et al., 1993; Stoyanova et al., 1995; Kuesel et al., 1996; Confort-Gouny et al., 1992; Weber et al., 1998), wines (Martin et al., 1998, 1999) and plant tissues (Kopka et al., 2000).

Although the utility of the metabonomic approach is well established, its full potential has not yet been exploited. The metabolic variation is often subtle, and powerful analysis methods are required for detection of particular analytes, especially when the data (e.g., NMR spectra) are so complex. For example, the methods proposed to date are still not generally sufficient to achieve clinically useful diagnosis of disease.

New methods to extract useful metabolic information from biofluids are needed.

One aim of the present invention is to provide data analysis methods for the detection of such metabolic variations, as part of a metabonomic approach, which address one or more of the known problems, including those discussed herein.

SUMMARY OF THE INVENTION

One aspect of the present invention pertains to chemometric methods for the analysis of chemical, biochemical, and biological data, for example, spectral data, for example, nuclear magnetic resonance (NMR) spectra and other types of spectra.

One aspect of the present invention pertains to a method of classifying a spectrum, as described herein.

One aspect of the present invention pertains to a method of classifying a sample, as described herein.

One aspect of the present invention pertains to a method of classifying a subject as described herein.

One aspect of the present invention pertains to a method of identifying a candidate biomarker or biomarker combination, as described herein.

One aspect of the present invention pertains to a biomarker or biomarker combination identified by a method as described herein.

One aspect of the present invention pertains to a biomarker or biomarker combination identified by a method as described herein, for use in a method of classification.

One aspect of the present invention pertains to a method of classification which employs or relies upon one or more biomarkers or biomarker combinations identified by a method as described herein.

One aspect of the present invention pertains to use of one or more biomarkers or biomarker combinations identified by a method of classification as described herein.

One aspect of the present invention pertains to an assay for use in a method of classification, which assay relies upon one or more biomarkers or biomarker combinations identified by a method as described herein.

One aspect of the present invention pertains to use of an assay in a method of classification, which assay relies upon one or more biomarkers or biomarker combinations identified by a method as described herein.

One aspect of the present invention pertains to a method of diagnosis employing one or more biomarkers or biomarker combinations identified by a method as described herein.

One aspect of the present invention pertains to a method of diagnosis of an applied stimulus or condition, comprising a method of analysis of an applied stimulus, as described herein.

One aspect of the present invention pertains to a method of therapeutic monitoring of a subject undergoing therapy, comprising a method of analysis of an applied stimulus or condition, as described herein.

One aspect of the present invention pertains to a method of evaluating drug therapy and/or drug efficacy, comprising a method of analysis of an applied stimulus or condition, as described herein.

One aspect of the present invention pertains to a method of detecting toxic side-effects of a drug, comprising a method of analysis of an applied stimulus or condition, as described herein.

One aspect of the present invention pertains to a method of characterising and/or identifying a drug in overdose, comprising a method of analysis of an applied stimulus or condition, as described herein.

One aspect of the present invention pertains to a computer system or device, such as a computer or linked computers, operatively configured to implement a method as described herein; and related computer code, computer programs, data carriers carrying such code and programs, and the like.

These and other aspects of the present invention are described herein.

As will be appreciated by one of skill in the art, features and preferred embodiments of one aspect of the present invention will also pertain to other aspects of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a graph of "binned" signal intensity versus chemical shift (δ 0-10) for the 95% confidence interval high (A) and low (B) spectra, and a test spectrum (C), as described in Example 1.

Figure 2 is a graph of "binned" signal intensity versus chemical shift (δ 3.78-4.10) for the 95% confidence interval high (A) and low (B) spectra, and a test spectrum (Test 1) (C), as described in Example 1.

Figure 3 is a graph of mean signal intensity versus chemical shift (δ 3.78-4.10) for the 95% confidence interval high (A) and low (B) spectra, and a set of test spectra (C), as described in Example 2.

Figure 4 is a graph of standard deviation of signal intensity versus chemical shift (δ 3.78-4.10) for the 95% confidence interval high (A) and low (B) spectra, and a set of test spectra (C), as described in Example 2.

Figure 5 is a graph of relative standard deviation of signal intensity versus chemical shift (δ 3.78-4.10) for the 95% confidence interval high (A) and low (B) spectra, and a set of test spectra (C), as described in Example 2.

Figure 6 is a graph of skew of signal intensity versus chemical shift (δ 3.78-4.10) for the 95% confidence interval high (A) and low (B) spectra, and a set of test spectra (C), as described in Example 2.

Figure 7 is a graph of kurtosis of signal intensity versus chemical shift (δ 3.78-4.10) for the 95% confidence interval high (A) and low (B) spectra, and a set of test spectra (C), as described in Example 2.

DETAILED DESCRIPTION OF THE INVENTION

The present invention pertains generally to the field of chemometrics, metabonomics, and, more particularly, to methods for the analysis of biological data, particularly spectra.

The methods of the present invention are applicable to chemical, biochemical, and biological data, for example, spectra, and especially spectra generated using types of spectroscopy and spectrometry which are useful in chemical and biochemical (i.e., molecular) studies.

The methods described herein facilitate more powerful analysis of spectral data. For example, the methods of the present invention make possible the identification of spectral changes associated with an event of interest from a spectral background which is non-specific and/or irrelevant.

Reliability Assessment

Spectra often have features (e.g., peaks, noise spikes, baseline artefacts, etc.) which interfere with and/or reduce the power and/or accuracy of subsequent analysis. Some of these features are artefacts of the particular type of spectra, its method of acquisition, adventitious impurities, and the like. However, some of these spectral features are associated with chemical species not accidentally or unintentionally present in the sample under study, but instead reflect, for example, the effects of an applied stimulus.

For example, in metabonomic studies, a sample from an organism under study may show spectral evidence of a large number of metabolites. In general, these metabolites may be placed in one of three classes:

(A) Endogenous metabolites, the levels of which are significantly altered by the application of an applied stimulus. A single metabolite of this type is typically referred to as a biomarker. In a more complex case, where the levels of several, or more, metabolites are changed (whether increased or decreased), the group of metabolites are typically referred to as a biomarker combination. Biomarker combinations may include a time dependence, for example, levels of metabolite A up at 24 hours, and back to normal at 48 hours along with levels of metabolite B down at 24 hours and up at 72 hours. For example, an increase in taurine together with creatine levels in urine is a general marker for liver damage. In a more complex example, toxins which cause lesions in the S3 portion of the renal proximal tubule cause elevations of urinary glucose, amino acids and organic acids with decreases in tricarboxylic acid cycle intermediates.

(B) Endogenous metabolites, the levels of which are unaffected by application of the applied stimulus. Such metabolites are often referred to as "background signals."

(C) Metabolites, which appear in the sample and which arise from a xenobiotic itself or its metabolites. For example, paracetamol is seen in urine mainly as paracetamol sulfate and paracetamol glucuronide conjugates. In some cases unchanged paracetamol can also be seen. Of course, these metabolites will be present only if the applied stimulus includes a xenobiotic.

Many of the metabolites falling in class C are collectively referred to herein as "interfering signals." Such signals often provide little information about the organism's response to an applied stimulus, while dominating and interfering with the metabonomic description of the stimulated organism.

In contrast, metabolites falling in class A (biomarkers, biomarker combinations) are an indicator of the organism's response to an applied stimulus.

Whether or not a particular metabolite is, or is a candidate as, a biomarker or biomarker combination can often be determined from known data regarding the applied stimulus under study. For example, there may be a large body of public knowledge regarding the metabolism of a particular compound, or of compounds having a particular substructure.

Often, a biomarker or biomarker combination, and its associated spectral features, can be readily identified by eye by the skilled artisan from one or more of a range of types of NMR spectra. However, if new spectral features are observed which are not readily identified, the associated compounds giving rise to these features can be isolated and characterised using known methods, for example, by coupling liquid chromatography with NMR (e.g., HPLC-NMR) or mass spectrometry (e.g., HPLC-MS).

In all metabonomic work it is important to understand the natural variation in the population under study. Since the multivariate techniques used to analyse the data are statistical in nature, it is useful to have a statistical description of the control population in order to detect any departures from normality.

It is therefore desirable to be able to classify, statistically, a spectrum (and an associated sample, or organism, as appropriate), for example, as "normal" or "abnormal," with an associated statistical reliability. This may be especially useful, for example, when screening a potential control pool.

It is also desirable to be able to identify, statistically, a potential (candidate) biomarker or biomarker combination, and assign to it an associated statistical reliability that it is, in fact, a biomarker or biomarker combination.

These processes are referred to herein as "reliability assessment."

Assigning Statistical Reliability

In general, a statistical description of a control population is generated in order to detect (e.g., in a test sample) any departures from control behaviour; such departures are then assigned a statistical reliability. For example, the departure may be described as falling outside a proportion of a suitable control population, for example, falling outside 95% of a control population.
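A minimal sketch of "falling outside 95% of a control population" for a single measured value is given below. It assumes an empirical central interval with linearly interpolated percentiles, which is only one possible statistical description; the function names are hypothetical.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile of pre-sorted values, q in [0, 1]."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def central_interval(control_values, coverage=0.95):
    """Interval containing the central `coverage` fraction of the controls."""
    ordered = sorted(control_values)
    tail = (1.0 - coverage) / 2.0
    return percentile(ordered, tail), percentile(ordered, 1.0 - tail)


def departs_from_control(value, control_values, coverage=0.95):
    """True if the value lies outside the central interval of the controls."""
    low, high = central_interval(control_values, coverage)
    return value < low or value > high
```

With this choice, a value flagged at coverage 0.95 lies outside the range spanned by the central 95% of the control values.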

Thus, one aspect of the present invention pertains to a method of assigning a statistical reliability to a departure from normality, on the basis of a statistical description of a control population.

One aspect of the present invention pertains to a method of assigning a statistical reliability to a departure from normality of a subject, on the basis of a statistical description of subjects of a control population.

One aspect of the present invention pertains to a method of assigning a statistical reliability to a departure from normality of a test sample, on the basis of a statistical description of samples from a control population.

One aspect of the present invention pertains to a method of assigning a statistical reliability to a departure from normality of a test sample from a subject, on the basis of a statistical description of samples from subjects of a control population.

One aspect of the present invention pertains to a method of assigning a statistical reliability to a departure from normality of a sample spectrum for a test sample, on the basis of a statistical description of sample spectra for samples from a control population.

One aspect of the present invention pertains to a method of assigning a statistical reliability to a departure from normality of a sample spectrum for a test sample from a subject, on the basis of a statistical description of sample spectra for samples from subjects of a control population.

One aspect of the present invention pertains to a method of classifying a spectrum, as described herein.

One aspect of the present invention pertains to a method of classifying a sample by classifying a spectrum for said sample, wherein said method of classifying a spectrum is as described herein.

One aspect of the present invention pertains to a method of classifying a subject by classifying a spectrum for a sample from said subject, wherein said method of classifying a spectrum is as described herein.

Classifying with Statistical Reliability

Subjects, samples, spectra, etc. can be classified (e.g., as normal or abnormal) on the basis of a departure from normality, and in this case, classified with a statistical reliability. For example, classification may be based on a departure from normality described as falling outside a proportion of a suitable control population, for example, falling outside 95% of a control population, and so classification may be described as normal or abnormal with 95% confidence.

Thus, one aspect of the present invention pertains to a method of classifying with a statistical reliability, on the basis of deviation from a statistical description of a control population.

One aspect of the present invention pertains to a method of classifying a subject with a statistical reliability, on the basis of deviation from a statistical description of a control population.

One aspect of the present invention pertains to a method of classifying a test sample with a statistical reliability, on the basis of deviation from a statistical description of samples from a control population.

One aspect of the present invention pertains to a method of classifying a test sample from a subject with a statistical reliability, on the basis of a statistical description of samples from subjects of a control population.

One aspect of the present invention pertains to a method of classifying a sample spectrum for a test sample with a statistical reliability, on the basis of a statistical description of samples from a control population.

One aspect of the present invention pertains to a method of classifying a sample spectrum for a test sample from a subject with a statistical reliability, on the basis of a statistical description of samples from subjects of a control population.

Classification with respect to a Reference State

One aspect of the present invention pertains to a method for classifying a sample spectrum, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for a reference state; and, (b) classifying said sample spectrum as normal or abnormal, with respect to said reference state, on the basis of said at least one statistical property.

One aspect of the present invention pertains to a method for classifying a test sample, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for samples representing a reference state; and, (b) classifying said test sample as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for said test sample and said at least one statistical property.

One aspect of the present invention pertains to a method for classifying a test subject, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for samples from subjects representing a reference state; and, (b) classifying said test subject as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for a test sample from said test subject and said at least one statistical property.

In one embodiment, said at least one statistical property is a plurality of statistical properties.

One aspect of the present invention pertains to a method for classifying a sample spectrum, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for a reference state; and, (b) classifying said sample spectrum as normal or abnormal, with respect to said reference state, on the basis of said statistical property.

One aspect of the present invention pertains to a method for classifying a test sample, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for samples representing a reference state; and, (b) classifying said test sample as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for said test sample and said statistical property.

One aspect of the present invention pertains to a method for classifying a test subject, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for samples from subjects representing a reference state; and, (b) classifying said test subject as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for a test sample from said test subject and said statistical property.
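Steps (a) and (b) can be illustrated with a short sketch. It assumes, purely for illustration, that the statistical properties are the per-variable mean and standard deviation of the reference spectra, that limits are set at mean ± 1.96 standard deviations (a roughly 95% range only if the values are approximately Gaussian), and that a spectrum is called abnormal if any variable falls outside its limits. None of these choices is prescribed by the text, and the function names are hypothetical.

```python
from statistics import mean, stdev


def reference_bounds(reference_spectra, z=1.96):
    """Step (a): per-variable (low, high) limits from the reference set."""
    bounds = []
    for column in zip(*reference_spectra):
        centre, spread = mean(column), stdev(column)
        bounds.append((centre - z * spread, centre + z * spread))
    return bounds


def classify_spectrum(sample_spectrum, bounds):
    """Step (b): label the spectrum and list the variables outside limits."""
    outside = [
        j
        for j, (value, (low, high)) in enumerate(zip(sample_spectrum, bounds))
        if not low <= value <= high
    ]
    label = "abnormal" if outside else "normal"
    return label, outside
```

Returning the indices of the out-of-range variables alongside the label makes it possible to see which spectral regions drive an "abnormal" call.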

Classification, Classifying, and Classes

As discussed above, many aspects of the present invention pertain to methods of classifying things, for example, a spectrum, a sample, a subject, etc. In such methods, the thing is classified, that is, it is associated with an outcome, or, more specifically, it is assigned membership to a particular class (i.e., it is assigned class membership), and is said "to be of," "to belong to," "to be a member of," a particular class.

Classification is made (i.e., class membership is assigned) on the basis of particular criteria.

For example, as discussed below, where classification is made with respect to a reference state which defines normality, "normality" is one class, and "abnormality" is another class.

In another example, "presence of a predetermined condition" is one class, and "absence of a predetermined condition" is another class; in such cases, classification (i.e., assignment to one of these classes) is equivalent to diagnosis.

Controls, Normality, and Reference State

In the context of a "departure from normality," the state of normality is a reference state defined by a suitable control or controls, representing that reference state. It is with respect to this reference state (e.g., normality) that classification is made.

Suitable controls are usually selected on the basis of the organism (e.g., subject, patient) under study (test subject, study subject, etc.), and the nature of the study (e.g., type of sample, type of spectra, etc.). Usually, controls are selected to represent the state of "normality." However, controls may be selected to represent another state, if suitable. As described herein, deviations from normality (e.g., higher than normal, lower than normal) in test data, test samples, test subjects, etc. are used in classification, diagnosis, etc.

For example, in most cases, control subjects are the same species as the test subject and are chosen to be representative of the equivalent normal (e.g., healthy) organism.

A control population is a population of control subjects. If appropriate, control subjects may have characteristics in common (e.g., sex, ethnicity, age group, etc.) with the test subject. If appropriate, control subjects may have characteristics (e.g., age group, etc.) which differ from those of the test subject. For example, it may be desirable to choose healthy 20-year-olds of the same sex and ethnicity as the study subject as control subjects.

In most cases, control samples are taken from control subjects. Usually, control samples are of the same sample type (e.g., serum), and are collected and handled (e.g., treated, processed, stored) under the same or similar conditions, as the sample under study (e.g., test sample, study sample).

In most cases, control data (e.g., control values) are obtained from control samples which are taken from control subjects. Usually, control data (e.g., control data sets, control spectral data, control spectra, etc.) are of the same type (e.g., 1-D 1H NMR, etc.), and are collected and handled (e.g., recorded, processed) under the same or similar conditions (e.g., parameters), as the test data.

Similarly, the "reference" state is a pre-determined state defined by a suitable population representative of that pre-determined state.

In one embodiment, the reference state is that of control, e.g., as defined by one or more control organisms.

In one embodiment, the reference state is that of pre-dose, e.g., as defined by one or more pre-dose organisms, that is, prior to treatment or therapy, e.g., with a xenobiotic.

Equivalent Spectra

Many of the methods described herein involve forming a statistical description of a set of equivalent spectra, e.g., for a reference state. The term "equivalent spectra," as used herein, pertains to spectra for the same state, in this case, the reference state.

In one embodiment, the equivalent spectra are spectra for a single sample.

In one embodiment, the equivalent spectra are spectra for a number of samples from a single organism.

In one embodiment, the equivalent spectra are spectra for one sample from each of a number of organisms of the same type.

In one embodiment, the equivalent spectra are spectra for a number of samples from each of a number of organisms, all of the same type.

In each case, the samples are the same type.

In one embodiment, the set of equivalent spectra comprises at least 10 spectra.

In one embodiment, the set comprises at least 20 spectra.

In one embodiment, the set comprises at least 50 spectra.

In one embodiment, the set comprises at least 100 spectra.

In one embodiment, the set comprises at least 200 spectra.

In one embodiment, the set comprises at least 500 spectra.

In one embodiment, the set comprises at least 1000 spectra.

Statistical Properties

As discussed above, many aspects of the present invention pertain to methods which rely upon statistical properties, for example, as a basis (e.g., as criteria) for classification.

Examples of statistical properties include one or more of: mean; standard deviation; relative standard deviation; skewness; and, kurtosis; as well as other well known statistical properties.

In one embodiment, said at least one statistical property is/are selected from: mean; standard deviation; relative standard deviation; skewness; and, kurtosis.

In one embodiment, said at least one statistical property is/are selected from: standard deviation; relative standard deviation; skewness; and, kurtosis.

In one embodiment, said at least one statistical property is/are selected from: standard deviation and relative standard deviation.

In one embodiment, said at least one statistical property is mean.

In one embodiment, said at least one statistical property is standard deviation.

In one embodiment, said at least one statistical property is relative standard deviation.

In one embodiment, said at least one statistical property is skewness.

In one embodiment, said at least one statistical property is kurtosis.

Classifying with a Confidence Level

Once a statistical description of a reference state (e.g., control population) has been formed, it is possible to define confidence intervals (e.g., 95% confidence) for each spectral region, that is, intervals within which a certain fraction (e.g., 95%) of normal variation of the reference state is seen. Thus, for a test spectrum, any departure from normality of the value of an experimental parameter derived from spectral data (with respect to the reference state) can be detected, and its association with, for example, an applied stimulus or deviation from normality, can be made with a certain level of confidence.

For example, it may be possible to conclude that, in a test organism, a certain metabolite, indicated by (e.g., signal intensity at) one or more spectral regions, exhibits non-control behaviour at 95% confidence; that is, in only 5% of control organisms is the level of that metabolite seen to be more extreme (further from the mean) than the measured value in the test organism.
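The "more extreme (further from the mean)" criterion can be sketched directly as an empirical fraction. This is an illustrative reading of the sentence above, assuming a simple count over the control values with no distributional model; the function name is hypothetical.

```python
from statistics import mean


def fraction_more_extreme(test_value, control_values):
    """Fraction of controls lying further from the control mean than the test value."""
    centre = mean(control_values)
    test_deviation = abs(test_value - centre)
    more_extreme = sum(
        1 for value in control_values if abs(value - centre) > test_deviation
    )
    return more_extreme / len(control_values)
```

Under this reading, a fraction of 0.05 or less corresponds to non-control behaviour at 95% confidence for that spectral region.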

Thus, one aspect of the present invention pertains to a method for classifying a sample spectrum, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for a reference state; and, (b) classifying said sample spectrum as normal or abnormal, with respect to said reference state, on the basis of said at least one statistical property, with an associated confidence level.

One aspect of the present invention pertains to a method for classifying a test sample, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for samples representing a reference state; and, (b) classifying said test sample as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for said test sample and said at least one statistical property, with an associated confidence level.

One aspect of the present invention pertains to a method for classifying a test subject, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for samples from subjects representing a reference state; and, (b) classifying said test subject as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for a test sample from said test subject and said at least one statistical property, with an associated confidence level.

In one embodiment, said at least one statistical property is a plurality of statistical properties.

One aspect of the present invention pertains to a method for classifying a sample spectrum, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for a reference state; and, (b) classifying said sample spectrum as normal or abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level.

One aspect of the present invention pertains to a method for classifying a test sample, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for samples representing a reference state; and, (b) classifying said test sample as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for said test sample and said statistical property, with an associated confidence level.

One aspect of the present invention pertains to a method for classifying a test subject, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for samples from subjects representing a reference state; and, (b) classifying said test subject as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for a test sample from said test subject and said statistical property, with an associated confidence level.

Each spectrum is, to one degree or another, representative of the composition of the sample from which it was recorded. In general, a sample can be generalised as an n-dimensional object, where the coordinate along each of the axes or dimensions is the concentration of individual chemical or biochemical species.

Equivalently, the sample can be represented via its spectrum, also as an n-dimensional object, x, where the coordinate along each of the axes or dimensions (x_1, x_2, x_3, ... x_n) is an experimental parameter derived from spectral data, for example, the spectral intensity (or equivalent spectroscopic parameter) at each data point. For example, for a 1D NMR spectrum, each of x_1, x_2, x_3, etc. may represent signal intensity at a different chemical shift. It is not necessary to assign spectral features (e.g., peaks, features, lines) at this stage, since the spectrum is treated solely as a statistical object.

The spectra may or may not have been subjected to data compression, as described herein. In one preferred embodiment, the methods of the present invention are applied to spectra which have been compressed, for example, into buckets, segments, or bins.

Similarly, the spectra may or may not have been normalised. In one preferred embodiment, the spectra are normalised, preferably to unit total integrated intensity.
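By way of illustration only, the compression and normalisation steps might be sketched in Python as follows. The function names `bucket_spectrum` and `normalise_to_unit_total`, the use of NumPy, and the choice of equal-width buckets formed by summing adjacent data points are assumptions made for this sketch; the text above does not prescribe a particular bucketing scheme.

```python
import numpy as np

def bucket_spectrum(intensities, n_buckets):
    """Compress a spectrum by summing intensities in equal-width buckets.

    Assumption: trailing points that do not fill a whole bucket are dropped.
    """
    x = np.asarray(intensities, dtype=float)
    usable = (x.size // n_buckets) * n_buckets
    if usable == 0:
        raise ValueError("spectrum has fewer points than buckets")
    return x[:usable].reshape(n_buckets, -1).sum(axis=1)

def normalise_to_unit_total(intensities):
    """Scale a spectrum so that its total integrated intensity is 1."""
    x = np.asarray(intensities, dtype=float)
    total = x.sum()
    if total == 0:
        raise ValueError("cannot normalise a spectrum with zero total intensity")
    return x / total

# Hypothetical eight-point spectrum compressed into four buckets.
raw = np.array([1.0, 3.0, 2.0, 2.0, 0.0, 4.0, 5.0, 3.0])
bucketed = bucket_spectrum(raw, n_buckets=4)      # [4, 4, 4, 8]
normalised = normalise_to_unit_total(bucketed)    # [0.2, 0.2, 0.2, 0.4]
```

Bucketing before normalising, as here, is one possible ordering; the reverse ordering gives the same result when buckets are formed by summation and no points are dropped.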

An equivalent spectra set, X, may be formed from nx equivalent spectra, each of which is denoted xi (where i runs from 1 to nx) and each of which has descriptors xij (where j ranges from 1 to the total number of descriptors). Each row, i, corresponds to an individual spectrum, and each column, j, corresponds to an experimental parameter derived from spectral data, e.g., the signal intensity at a particular value of the spectroscopic parameter.

In this way, each column represents a set of, e.g., signal intensity values for a particular value of the spectroscopic parameter. The numbers in each column are then treated as a set, and the statistical properties of that set calculated. Examples of statistical properties include, but are not limited to, the mean, standard deviation, relative standard deviation, skewness, and kurtosis.

Statistical properties are calculated for the equivalent spectra set, for example, for each column of X, and the results tabulated (e.g., below the respective columns), so that, for each statistical property, there is a new row corresponding to that property. Graphically, each new row is also a spectrum, and can be considered as a "statistics spectrum." For example, a new "arithmetic mean" row, m, is obtained with descriptors mj. Graphically, this is the mean spectrum.

Similarly, a new "standard deviation" row, std, is obtained with descriptors stdj, and graphically, this is the standard deviation spectrum.

Large values in the standard deviation spectrum appear at those spectral windows (e. g., chemical shifts) where the standard deviation is large, and small values appear at those spectral windows (e. g., chemical shifts) where the standard deviation is small.

Similarly, a "relative standard deviation" spectrum, rstd, with descriptors rstdj; a "skewness" spectrum, sk, with descriptors skj; and a "kurtosis" spectrum, kt, with descriptors ktj, may be formed.
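The column-wise calculation of statistics spectra described above might be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `statistics_spectra`, the use of the sample standard deviation (divisor n - 1), and the moment-based definitions of skewness and (non-excess) kurtosis are assumptions, since the text does not fix particular estimators.

```python
import numpy as np

def statistics_spectra(X):
    """Compute one statistics spectrum per property for a spectra set X.

    X has one row per spectrum and one column per descriptor j.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)                  # descriptors mj
    std = X.std(axis=0, ddof=1)            # descriptors stdj (sample estimate)
    centred = X - mean
    m2 = (centred ** 2).mean(axis=0)       # second central moment
    m3 = (centred ** 3).mean(axis=0)       # third central moment
    m4 = (centred ** 4).mean(axis=0)       # fourth central moment
    # Columns with zero mean or zero variance yield inf or nan rather than raising.
    with np.errstate(divide="ignore", invalid="ignore"):
        rstd = std / mean                  # descriptors rstdj
        skew = m3 / m2 ** 1.5              # descriptors skj
        kurt = m4 / m2 ** 2                # descriptors ktj
    return {"m": mean, "std": std, "rstd": rstd, "sk": skew, "kt": kurt}

# Hypothetical set of three spectra with two descriptors each.
stats = statistics_spectra([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
```

Each entry of the returned dictionary has one value per descriptor and can therefore be plotted against the spectroscopic parameter in the same way as a data spectrum.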

Additional statistics spectra may be calculated, for example, using a cumulative distribution function. Cumulative distribution values are calculated for each column, and again tabulated below the columns. The cumulative distribution is defined by the relation dj(s) = f, where, for descriptor j, f is the fraction of numbers (i.e., signal intensity values) which are less than, or equal to, the threshold value s. Thus, f ranges from 0 to 1, and s is a signal intensity value.

For example, if there are 900 spectra, and, for descriptor j (i. e., a particular spectroscopic parameter), 45 of the 900 spectra have a signal intensity less than or equal to 0.6 units, then dj (0.6) = 45/900 = 0.05. That is, 5% of the spectra have a signal intensity of 0.6 or less, at that particular spectroscopic parameter, j.

In practice, the set of N numbers (the intensity values in a given column) is sorted in increasing order. The value N*f is calculated, and from the sorted set the N*f-th member identified, or where N*f is not a whole number, the value of the theoretical N*f-th member calculated, for example, by interpolation between the adjacent members. The value of this identified member is reported as s in d(s) = f. Various simple (e.g., linear, normal, etc.) and complex methods of interpolation are well known in the art.

For example, for the set (3,4,4,5,5,5,5,6,6,7), when f = 0.1 (i.e., 10%), N is 10, N*f is 1, the first number is 3, and d(3) = 0.1. When f = 0.9 (i.e., 90%), N is 10, N*f is 9, the 9th number is 6, and d(6) = 0.9. So, 10% of the numbers are less than or equal to 3, 90% of the numbers are less than or equal to 6, and the 80% confidence interval is defined as 3 < s ≤ 6.
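The cumulative distribution and its inverse lookup can be illustrated with a short Python sketch. The function names `cumulative_fraction` and `threshold_for_fraction` are hypothetical, and the rule used when N*f is not a whole number (a linear estimate between the two adjacent sorted members) is only one of the simple choices mentioned above. The example uses a ten-member set, consistent with N = 10, and the 900-spectra case described earlier.

```python
import math
import numpy as np

def cumulative_fraction(values, s):
    """Return d(s): the fraction of values less than or equal to s."""
    values = np.asarray(values, dtype=float)
    return np.count_nonzero(values <= s) / values.size

def threshold_for_fraction(values, f):
    """Return s such that d(s) = f, using the (N*f)-th sorted member (1-based)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n = ordered.size
    # Round to suppress floating-point noise in N*f, then clamp to 1..N.
    position = min(max(round(n * f, 9), 1.0), float(n))
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0.0:
        return float(ordered[lower - 1])
    # Assumed rule: linear estimate between the two adjacent sorted members.
    return float(ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1]))

values = [3, 4, 4, 5, 5, 5, 5, 6, 6, 7]
low = threshold_for_fraction(values, 0.1)    # 3.0, the 1st sorted member
high = threshold_for_fraction(values, 0.9)   # 6.0, the 9th sorted member

# 45 of 900 intensities at or below 0.6 units gives d(0.6) = 0.05.
column = np.concatenate([np.full(45, 0.5), np.full(855, 0.9)])
d = cumulative_fraction(column, 0.6)
```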

In this way, confidence limits may be calculated using the cumulative distribution function. The confidence interval is characterised by a "low" and "high" threshold (L and H, respectively) defined by the cumulative distribution. The specified fraction of members fall within the range from L to H; more specifically, the specified fraction of members are more than L and less than or equal to H.

For example, to determine a 95% confidence interval, L is calculated from the cumulative distribution function as d(L) = 0.025, and H is calculated as d(H) = 0.975.

Then, 95% of the members will fall in the range from L to H; more specifically, 95% of the members will be more than L and less than or equal to H, that is, L < s ≤ H.

Examples of confidence levels (e. g., predetermined confidence levels) include 99%, 98%, 97%, 96%, 95%, 90%, 85%, 80%, and 75%.

For the equivalent spectra set X, a particular confidence level is selected (e.g., 95%) and the relevant low and high thresholds calculated from the cumulative distribution function, for the values (e.g., spectral intensity, xij) at each descriptor j. The low thresholds, Lj, are tabulated, e.g., below the columns; graphically, this gives a "low threshold (95%)" spectrum. Similarly, the high thresholds, Hj, are tabulated, e.g., below the columns; graphically, this gives a "high threshold (95%)" spectrum. These two spectra may be displayed graphically to give a band (e.g., a "spectral intensity confidence interval (95%)" spectrum) in which, at each descriptor, 95% of the spectra fall. Such "band" spectra are referred to herein as "confidence interval spectra."

These statistics spectra, described above, may be used to assess test spectra.
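A confidence interval spectrum of this kind might be computed as in the following sketch. The function name `confidence_interval_spectrum` and the synthetic reference data are assumptions for illustration. The sketch also relies on NumPy's `np.quantile` with its default linear interpolation, which can differ slightly from the N*f-th member convention described above, particularly for small sets.

```python
import numpy as np

def confidence_interval_spectrum(X, level=0.95):
    """Return the low (Lj) and high (Hj) threshold spectra for a spectra set X."""
    X = np.asarray(X, dtype=float)
    tail = (1.0 - level) / 2.0             # e.g., 0.025 for a 95% interval
    # One quantile pair per column (descriptor j).
    low, high = np.quantile(X, [tail, 1.0 - tail], axis=0)
    return low, high

# Synthetic stand-in for 900 equivalent spectra with 5 descriptors each.
rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=0.1, size=(900, 5))
L, H = confidence_interval_spectrum(X, level=0.95)

# Fraction of spectra with L < s <= H at each descriptor (close to 0.95).
inside = ((X > L) & (X <= H)).mean(axis=0)
```

Plotting `L` and `H` against the spectroscopic parameter gives the band referred to above as a confidence interval spectrum.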

For example, classification may be made on the basis of the extent to which the sample spectrum falls within a confidence interval spectrum, for example, on the basis of whether or not the sample spectrum falls wholly, partially, or not at all, within a confidence interval spectrum.

For example, once a statistical description has been prepared for data from a (preferably large) group of control organisms, individual control organisms (whether a part of the control group, or outside the control group) can be compared to determine how similar or dissimilar they are from the control pool. In this way, highly variant, and therefore "abnormal," organisms can be identified, and their data removed from the pool.

Also, once a statistical description has been prepared for data from a (preferably large) group of control organisms, data from test organisms (e.g., which have been subjected to an applied stimulus) can be compared to help classify them as "normal" or "abnormal," with respect to the control organisms.

In a simple case, a 95% confidence interval spectrum for the mean is calculated, and a sample spectrum is classified on the basis of the extent to which it falls within that confidence interval spectrum. Corresponding classifications may be made on the basis of confidence interval spectra for other statistical properties, and the corresponding property for the sample spectrum.

Thus, one aspect of the present invention pertains to a method for classifying a sample spectrum for a test sample, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for a reference state; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property; and, (b) classifying said sample spectrum as normal or abnormal, with respect to said reference state, on the basis of said at least one statistical property, with an associated confidence level, specifically on the basis of the extent to which said sample spectrum falls within said confidence interval spectrum.

One aspect of the present invention pertains to a method for classifying a test sample, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for samples representing a reference state; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property; and, (b) classifying said test sample as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for said test sample and said at least one statistical property, with an associated confidence level, specifically on the basis of the extent to which said sample spectrum falls within said confidence interval spectrum.

One aspect of the present invention pertains to a method for classifying a test subject, said method comprising the steps of: (a) calculating at least one statistical property for a set of equivalent spectra for samples from subjects representing a reference state; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property; and, (b) classifying said test subject as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for a test sample from said test subject and said at least one statistical property, with an associated confidence level, specifically on the basis of the extent to which said sample spectrum falls within said confidence interval spectrum.

In one embodiment, said at least one statistical property is a plurality of statistical properties.

In one embodiment, said at least one confidence interval spectrum is a plurality of confidence interval spectra.

One aspect of the present invention pertains to a method for classifying a sample spectrum for a test sample, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for a reference state; including calculating a confidence interval spectrum associated with a predetermined confidence level for said statistical property; and, (b) classifying said sample spectrum as normal or abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis of the extent to which said sample spectrum falls within said confidence interval spectrum.

One aspect of the present invention pertains to a method for classifying a test sample, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for samples representing a reference state; including calculating a confidence interval spectrum associated with a predetermined confidence level for said statistical property; and, (b) classifying said test sample as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for said test sample and said statistical property, with an associated confidence level, specifically on the basis of the extent to which said sample spectrum falls within said confidence interval spectrum.

One aspect of the present invention pertains to a method for classifying a test subject, said method comprising the steps of: (a) calculating a statistical property for a set of equivalent spectra for samples from subjects representing a reference state; including calculating a confidence interval spectrum associated with a predetermined confidence level for said statistical property; and, (b) classifying said test subject as normal or abnormal, with respect to said reference state, on the basis of a sample spectrum for a test sample from said test subject and said statistical property, with an associated confidence level, specifically on the basis of the extent to which said sample spectrum falls within said confidence interval spectrum.

Comparing Pools

Also, separate pools can be compared. For example, a set of sample spectra may be compared to a set of equivalent spectra for a reference state, using, for example, the methods described below.

The set of sample spectra may be, for example, spectra for a single sample, spectra for a set of samples (e.g., one spectrum per sample, several spectra per sample), spectra for a set of samples from a set of organisms (e.g., one spectrum per sample, one sample per organism, etc.). The set of sample spectra may be, for example, for a test group, another control group, etc.

For example, it may be desirable to compare a new pool of organisms to an existing reference pool of organisms, to determine how they compare as a group, e. g., whether or not they are a typical pool that, in effect, could have been drawn from the reference pool. If not, then the new pool is deviant in some way, and it is possible to determine which spectra (and therefore which organisms) and which spectral regions therein (and therefore possible biomarkers) are responsible for this deviation.

To do so, additional statistics are calculated for the equivalent spectra set X, for example, by using a "bootstrap" method. For example, a number of sub-sets of X, denoted Xss,k, where k ranges from 1 to the number of sub-sets, nss, may be selected (e.g., at random). Typically, the number of spectra in each sub-set is the same.

Typically, the number of sub-sets, nss, is as large as is practical (e.g., as limited by the size of the set X, and available computing power). Many methods for selecting the sub-sets are known, including, for example, "sampling without replacement" and "sampling with replacement."

Statistics (e.g., mean, standard deviation, relative standard deviation, skewness, and kurtosis) are then calculated for each sub-set, and the resulting statistics spectra tabulated. This gives, for example, a new set of nss mean spectra, mk, denoted M; a new set of nss standard deviation spectra, stdk, denoted STD; a new set of nss relative standard deviation spectra, rstdk, denoted RSTD; a new set of nss skewness spectra, skk, denoted SK; and, a new set of nss kurtosis spectra, ktk, denoted KT.

Confidence limits (e.g., H and L) are then calculated using a cumulative distribution function for each of these new sets (e.g., M, STD, RSTD, SK, KT). For example, a 95% confidence limit may be chosen, and the H and L values for each set calculated (e.g., Hm and Lm; Hstd and Lstd; Hrstd and Lrstd; Hsk and Lsk; Hkt and Lkt).

Statistics spectra (e.g., mean, standard deviation, relative standard deviation, skewness, and kurtosis) are then calculated for a set of sample spectra. These statistics spectra are used to assess the set of sample spectra with respect to the set of equivalent spectra for a reference state, to determine, for example, how similar or dissimilar they are. For example, does the sample set mean fall within the confidence interval for the reference sub-set mean? In this way, a test pool of organisms can be qualified as "normal" or "abnormal."

This approach is useful, for example, when assessing a new pool of control organisms to determine whether or not it conforms with the existing pool of control organisms.

It is also useful when assessing a pool of test organisms which have been subjected to an applied stimulus, which can be compared to a control pool to help classify them as "normal" or "abnormal." Again, it is possible to determine which spectra (and therefore which organisms) and which spectral regions therein (and therefore possible biomarkers) are responsible for this deviation.
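The pool comparison described above might be sketched as follows for the mean spectrum. This is an illustrative sketch under stated assumptions: the function names `bootstrap_mean_interval` and `pool_mean_outside` are hypothetical, sub-sets are drawn by sampling with replacement, the thresholds are taken with NumPy's default quantile definition, and the reference and test pools are synthetic data in which one descriptor of the test pool has been deliberately shifted.

```python
import numpy as np

def bootstrap_mean_interval(X, subset_size, n_subsets=200, level=0.95, seed=0):
    """Return low and high threshold spectra for the means of random sub-sets of X."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    means = np.empty((n_subsets, X.shape[1]))   # the set M of mean spectra mk
    for k in range(n_subsets):
        # Assumed selection method: sampling with replacement.
        rows = rng.choice(X.shape[0], size=subset_size, replace=True)
        means[k] = X[rows].mean(axis=0)
    tail = (1.0 - level) / 2.0
    low, high = np.quantile(means, [tail, 1.0 - tail], axis=0)   # Lm and Hm
    return low, high

def pool_mean_outside(sample_set, low, high):
    """Flag descriptors where the sample-set mean is not in the range L < m <= H."""
    m = np.asarray(sample_set, dtype=float).mean(axis=0)
    return (m <= low) | (m > high)

rng = np.random.default_rng(1)
reference = rng.normal(1.0, 0.1, size=(500, 6))   # synthetic reference pool
test_pool = rng.normal(1.0, 0.1, size=(20, 6))    # synthetic test pool
test_pool[:, 3] += 0.5                            # deliberate shift at descriptor 3

# Sub-set size matches the size of the test pool.
low, high = bootstrap_mean_interval(reference, subset_size=20)
flags = pool_mean_outside(test_pool, low, high)   # flags[3] is True
```

The same pattern applies to the other statistics spectra (STD, RSTD, SK, KT) by replacing the mean with the corresponding statistic. The flagged descriptors indicate the spectral regions responsible for any deviation.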

Thus, one aspect of the present invention pertains to a method of classifying a set of sample spectra, said method comprising the steps of: (a) calculating a statistical property (e.g., mean) for each of a plurality of sub-sets (e.g., set1, set2, etc.) of a set of equivalent spectra for a reference state, to yield a set of statistical properties (e.g., mean1, mean2, etc.); including calculating at least one confidence interval spectrum associated with a predetermined confidence level for said statistical property (e.g., H and L for each of mean1, mean2, etc.); (b) classifying said set of sample spectra as normal or abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis of the extent to which a corresponding statistical property (e.g., mean) of said set of sample spectra falls within a corresponding confidence interval spectrum.

More generally, one aspect of the present invention pertains to a method of classifying a set of sample spectra, said method comprising the steps of: (a) calculating at least one statistical property (e. g., mean, std) for each of a plurality of sub-sets (e. g., set1, set2, etc.) of a set of equivalent spectra for a reference state, to yield a set of statistical properties (e. g., mean1, mean2, etc., std1, std2, etc.); including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property (e. g., H and L for each of mean1, mean2, etc., std1, std2, etc.); (b) classifying said set of sample spectra as normal or abnormal, with respect to said reference state, on the basis of said statistical properties, with an associated confidence level, specifically on the basis of the extent to which one or more corresponding statistical properties (e. g., mean, std) of said set of sample spectra falls within a corresponding confidence interval spectrum.

In one embodiment, said at least one statistical property is a plurality of statistical properties.

In one embodiment, said at least one confidence interval spectrum is a plurality of confidence interval spectra.

These methods may be employed in a corresponding method of classifying a sample, a set of samples, a subject, or a set of subjects, on the basis of an appropriate set of sample spectra.

One aspect of the present invention pertains to a method of classifying a sample or a set of samples, comprising a method of classifying a set of sample spectra as described herein, wherein said set of sample spectra are for said sample or said set of samples.

One aspect of the present invention pertains to a method of classifying a subject or a set of subjects, comprising a method of classifying a set of sample spectra as described herein, wherein said set of sample spectra are for a sample or a set of samples, wherein said sample or set of samples are from said subject or said set of subjects.

For example, these methods may be employed in a corresponding method of classifying a sample, for example, on the basis of a set of sample spectra for said sample.

For example, these methods may be employed in a corresponding method of classifying a set of samples, for example, on the basis of a set of sample spectra comprising a sample spectrum for each sample (i.e., one spectrum per sample); on the basis of a set of sample spectra comprising a plurality of sample spectra for each sample (i.e., many spectra per sample).

For example, these methods may be employed in a corresponding method of classifying a subject, for example, on the basis of a set of sample spectra for a sample from said subject (i. e., many spectra per sample, one sample); on the basis of a set of sample spectra comprising a sample spectrum for each of a plurality of samples from said subject (i. e., one spectrum per sample, many samples); on the basis of a set of sample spectra comprising a plurality of sample spectra for each of a plurality of samples from said subject (i. e., many spectra per sample, many samples).

For example, these methods may be employed in a corresponding method of classifying a set of subjects, for example, on the basis of a set of sample spectra comprising a sample spectrum for a sample from each of said subjects (i.e., one spectrum per sample, one sample per subject); on the basis of a set of sample spectra comprising a set of sample spectra for a sample from each of said subjects (i.e., many spectra per sample, one sample per subject); on the basis of a set of sample spectra comprising a sample spectrum for each of a set of samples from each of said subjects (i.e., one spectrum per sample, many samples per subject); on the basis of a set of sample spectra comprising a plurality of sample spectra for each of a set of samples from each of said subjects (i.e., many spectra per sample, many samples per subject).

Sets and Sub-Sets of Equivalent Spectra

In one embodiment, the set of sample spectra and each of the sub-sets of the set of equivalent spectra are the same size.

In one embodiment, the size of the sub-sets is 10 or more.

In one embodiment, the size of the sub-sets is 20 or more.

In one embodiment, the size of the sub-sets is 50 or more.

In one embodiment, the size of the sub-sets is 100 or more.

In one embodiment, the number of sub-sets is 10 or more.

In one embodiment, the number of sub-sets is 50 or more.

In one embodiment, the number of sub-sets is 100 or more.

In one embodiment, the number of sub-sets is 200 or more.

In one embodiment, the number of sub-sets is 1000 or more.

Classification: Falls Within/Falls Outside

As discussed above, a spectrum (whether a data spectrum or statistics spectrum) may be classified as normal or abnormal on the basis of the extent to which it falls within a confidence interval spectrum, for example, on the basis of whether or not it falls wholly, partially, or not at all, within a confidence interval spectrum.

For the avoidance of doubt, all cases may be classified as one of two mutually exclusive classes, "falls within" or "falls outside." Classification as "falls within" is equivalent to classification as "not falls outside" and classification as "falls outside" is equivalent to classification as "not falls within."

In general, said "extent to which falls within" and said "extent to which falls outside" are determined by whether or not more than a predetermined fraction of data points fall outside a corresponding confidence interval spectrum.

In one embodiment, a spectrum may be classified as abnormal if even one data point or only a few data points (e. g., more than a predetermined fraction of data points) fall outside the confidence interval spectrum. The requirements for uniformity and conformity are set, for example, by the predetermined fraction, which is selected according to the particular circumstances.

In one embodiment, the spectrum is classified as abnormal or normal on the basis of whether or not more than a predetermined fraction of data points of said sample spectrum fall outside said confidence interval spectrum.

In one embodiment, the spectrum is classified as abnormal on the basis that more than a predetermined fraction of data points of said sample spectrum fall outside said confidence interval spectrum.
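The "falls within"/"falls outside" decision based on a predetermined fraction might be sketched as follows. The function names `fraction_outside` and `classify_spectrum`, the default fraction of 5%, and the flat threshold spectra used in the example are assumptions for illustration only; the interval convention L < s ≤ H follows the definition given earlier.

```python
import numpy as np

def fraction_outside(spectrum, low, high):
    """Return the fraction of data points that are not in the range L < s <= H."""
    spectrum = np.asarray(spectrum, dtype=float)
    outside = (spectrum <= low) | (spectrum > high)
    return float(outside.mean())

def classify_spectrum(spectrum, low, high, max_fraction_outside=0.05):
    """Classify as abnormal if more than the predetermined fraction falls outside."""
    if fraction_outside(spectrum, low, high) > max_fraction_outside:
        return "abnormal"
    return "normal"

# Hypothetical flat confidence interval spectrum over ten descriptors.
low = np.full(10, 1.0)
high = np.full(10, 2.0)

conforming = np.full(10, 1.5)          # every point inside the band
deviant = conforming.copy()
deviant[:2] = 3.0                      # two of ten points above the band

result_a = classify_spectrum(conforming, low, high)   # "normal"
result_b = classify_spectrum(deviant, low, high)      # "abnormal"
```

Setting `max_fraction_outside` to zero corresponds to classifying a spectrum as abnormal if even one data point falls outside the confidence interval spectrum.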

For example, one or a very few data points falling outside the confidence interval may indicate candidates for biomarkers and biomarker combinations.

However, when screening candidates for controls, conformity requirements may be relaxed. For example, if a candidate control organism is rare or expensive, it may be desirable to relax the conformity criteria; in contrast, if a candidate control organism can be readily and cheaply replaced, then it may be desirable to tighten the conformity criteria.

Similarly, if strict conformity is not required because the expected changes in the test group are very large, it may be desirable to relax the conformity criteria; in contrast, if conformity is crucial in order to detect small or subtle changes in the test group, then it may be desirable to tighten the conformity criteria.

The downstream consequences may also play a role in deciding the conformity requirement. For example, the uniformity of the controls will affect the confidence with which a test sample is classified as "abnormal," which may ultimately determine, for example, whether or not a particular therapy or surgery is performed. In such cases, strict conformity may be desirable.

In any case, examples of such predetermined fractions include 0.1%, 0.5%, 1%, 2%, 3%, 5%, 8%, 10%, 15%, 20%, and 25%.

Identifying Biomarkers

Since the statistical description of the data is arranged by descriptor, j, it is possible to identify those descriptors which are responsible for the spectrum (e.g., and consequently, the sample, the organism) being classified as abnormal. For example, for a test organism subjected to an applied stimulus, peaks at these notable descriptors are strong candidates as biomarkers or biomarker combinations for the applied stimulus in question. These notable descriptors identify windows or spectral regions (e.g., NMR chemical shift ranges) of particular interest.

The cumulative distribution statistics, and in particular the confidence interval spectrum, may be used to classify a particular descriptor (e.g., a particular experimental parameter derived from spectral data), such as a window or spectral region (e.g., in which one or more peaks fall) as a biomarker or biomarker combination, or at least as a candidate as one, with a certain degree of confidence. For example, data from test organisms subjected to a particular applied stimulus may reveal one or more signals which fall outside the 95% confidence interval for control organisms. The underlying metabolite(s) responsible for this peak or peak region may then be classified as a biomarker or biomarker combination, or at least as a candidate as one, for that applied stimulus, with 95% confidence.
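Locating the descriptors responsible for an abnormal classification might be sketched as follows. The function name `candidate_regions`, the grouping of adjacent flagged descriptors into a single region, and the chemical shift values and thresholds in the example are all assumptions made for illustration.

```python
import numpy as np

def candidate_regions(spectrum, low, high, shifts):
    """Return (start, end) shift pairs for runs of points outside L < s <= H."""
    spectrum = np.asarray(spectrum, dtype=float)
    outside = (spectrum <= low) | (spectrum > high)
    regions = []
    start = None
    for j, flagged in enumerate(outside):
        if flagged and start is None:
            start = j                                  # a run of flagged points begins
        elif not flagged and start is not None:
            regions.append((shifts[start], shifts[j - 1]))
            start = None                               # the run has ended
    if start is not None:
        # The final run extends to the last descriptor.
        regions.append((shifts[start], shifts[len(outside) - 1]))
    return regions

# Hypothetical chemical shift axis (ppm) and flat 95% thresholds for controls.
shifts = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0]
low = np.full(6, 1.0)
high = np.full(6, 2.0)
test_spectrum = [1.5, 2.5, 2.6, 1.5, 0.5, 1.5]

regions = candidate_regions(test_spectrum, low, high, shifts)
# regions == [(8.0, 7.0), (5.0, 5.0)]
```

Each returned region identifies a window of the spectrum whose underlying metabolite(s) may be treated as candidates and then examined further by conventional methods.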

Thus, one aspect of the present invention pertains to a method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating a statistical property (e.g., mean) for a set of equivalent spectra for a reference state; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for said statistical property (e.g., H and L for mean); (b) classifying one or more experimental parameters derived from a sample spectrum as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such abnormal experimental parameters fall outside said confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said abnormal experimental parameters, with an associated confidence level.

One aspect of the present invention pertains to a method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating a statistical property (e. g., mean) for a set of equivalent spectra for a reference state; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for said statistical property (e. g., H and L for mean); (b) classifying one or more spectral regions of a sample spectrum as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such abnormal spectral regions have signal intensities which fall outside said confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said abnormal spectral regions, with an associated confidence level.

More generally, one aspect of the present invention pertains to a method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating at least one statistical property (e.g., mean, std, etc.) for a set of equivalent spectra for a reference state; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property (e.g., H and L for each of mean, std, etc.); (b) classifying one or more experimental parameters derived from a sample spectrum as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such abnormal experimental parameters fall outside a corresponding confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said abnormal experimental parameters, with an associated confidence level.

One aspect of the present invention pertains to a method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating at least one statistical property (e.g., mean, std, etc.) for a set of equivalent spectra for a reference state; including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property (e.g., H and L for each of mean, std, etc.); (b) classifying one or more spectral regions of a sample spectrum as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such abnormal spectral regions have signal intensities which fall outside a corresponding confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said abnormal spectral regions, with an associated confidence level.

In one embodiment, said at least one statistical property is a plurality of statistical properties.

In one embodiment, said at least one confidence interval spectrum is a plurality of confidence interval spectra.

Candidate biomarkers and/or biomarker combinations may be further examined using conventional methods to confirm that they are, in fact, biomarkers and/or biomarker combinations. For example, the identity of the chemical species may be confirmed, for example, using complementary spectroscopic and analytic techniques. Relevant metabolic pathways which involve the chemical species may be examined to determine their role, for example, in the applied stimulus under study.

Identifying Biomarkers Using Pools

Biomarkers and/or biomarker combinations may also be identified using pools of data.

Thus, one aspect of the present invention pertains to a method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating a statistical property (e.g., mean) for each of a plurality of sub-sets (e.g., set1, set2, etc.) of a set of equivalent spectra for a reference state, to yield a set of statistical properties (e.g., mean1, mean2, etc.); including calculating at least one confidence interval spectrum associated with a predetermined confidence level for said statistical property (e.g., H and L for each of mean1, mean2, etc.); (b) classifying one or more experimental parameters derived from a set of sample spectra as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such abnormal experimental parameters have a statistical property which falls outside a corresponding confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said abnormal experimental parameters, with an associated confidence level.

One aspect of the present invention pertains to a method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating a statistical property (e.g., mean) for each of a plurality of sub-sets (e.g., set1, set2, etc.) of a set of equivalent spectra for a reference state, to yield a set of statistical properties (e.g., mean1, mean2, etc.); including calculating at least one confidence interval spectrum associated with a predetermined confidence level for said statistical property (e.g., H and L for each of mean1, mean2, etc.); (b) classifying one or more spectral regions of a set of sample spectra as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such spectral regions have a statistical property which falls outside a corresponding confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said spectral regions, with an associated confidence level.

More generally, one aspect of the present invention pertains to a method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of:

(a) calculating at least one statistical property (e.g., mean, std) for each of a plurality of sub-sets (e.g., set1, set2, etc.) of a set of equivalent spectra for a reference state, to yield a set of statistical properties (e.g., mean1, mean2, etc., std1, std2, etc.); including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property (e.g., H and L for each of mean1, mean2, etc., std1, std2, etc.); (b) classifying one or more experimental parameters derived from a set of sample spectra as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such abnormal spectral regions have one or more statistical properties (e.g., mean, std) which fall outside a corresponding confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said abnormal spectral regions, with an associated confidence level.

One aspect of the present invention pertains to a method of identifying a candidate biomarker or biomarker combination, said method comprising the steps of: (a) calculating at least one statistical property (e. g., mean, std) for each of a plurality of sub-sets (e. g., set1, set2, etc.) of a set of equivalent spectra for a reference state, to yield a set of statistical properties (e. g., mean1, mean2, etc., std1, std2, etc.); including calculating at least one confidence interval spectrum associated with a predetermined confidence level for at least one of said at least one statistical property (e. g., H and L for each of mean1, mean2, etc., std1, std2, etc.); (b) classifying one or more spectral regions of a set of sample spectra as abnormal, with respect to said reference state, on the basis of said statistical property, with an associated confidence level, specifically on the basis that such spectral regions have one or more statistical properties (e. g., mean, std) which falls outside a corresponding confidence interval spectrum; and, (c) identifying said candidate biomarker or biomarker combination on the basis of said spectral regions, with an associated confidence level.

In one embodiment, said at least one statistical property is a plurality of statistical properties.

In one embodiment, said at least one confidence interval spectrum is a plurality of confidence interval spectra.
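
By way of illustration only, steps (a) to (c) above may be sketched as follows. The sketch rests on several assumptions which are not taken from the present disclosure: sub-sets are drawn at random without replacement, the confidence interval spectra (L and H) are taken as quantiles of the sub-set means at each spectral point, and the sub-set size equals the size of the sample set. The function names are hypothetical.

```python
import numpy as np

def confidence_interval_spectrum(reference, n_subsets=50, subset_size=10,
                                 level=0.95, seed=0):
    """Return (L, H), the lower and upper confidence interval spectra.

    reference: array of shape (n_spectra, n_points), the equivalent set
    for the reference state. Random sub-sets and quantile-based limits
    are assumptions made for this sketch.
    """
    rng = np.random.default_rng(seed)
    n_spectra, n_points = reference.shape
    means = np.empty((n_subsets, n_points))
    for k in range(n_subsets):
        idx = rng.choice(n_spectra, size=subset_size, replace=False)
        means[k] = reference[idx].mean(axis=0)  # mean1, mean2, etc.
    alpha = (1.0 - level) / 2.0
    low = np.quantile(means, alpha, axis=0)
    high = np.quantile(means, 1.0 - alpha, axis=0)
    return low, high

def abnormal_regions(samples, low, high):
    """Indices of spectral points whose sample-set mean lies outside [L, H]."""
    sample_mean = samples.mean(axis=0)
    return np.flatnonzero((sample_mean < low) | (sample_mean > high))

if __name__ == "__main__":
    # Synthetic data only: 40 reference spectra, 10 sample spectra, 20 points.
    rng = np.random.default_rng(1)
    reference = rng.normal(1.0, 0.05, size=(40, 20))
    samples = rng.normal(1.0, 0.05, size=(10, 20))
    samples[:, 7] += 1.0  # simulate one strongly altered spectral region
    low, high = confidence_interval_spectrum(reference)
    print(abnormal_regions(samples, low, high))
```

The indices returned correspond to spectral regions which would then be examined as candidate biomarkers. With a 95% level, a small number of unaltered regions may also be flagged by chance.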

Biomarkers and Their Use

As discussed above, one aspect of the present invention pertains to a method of identifying a candidate biomarker or biomarker combination for an applied stimulus or condition, as described herein. Additional aspects of the invention involve the biomarker or biomarker combination so identified.

Thus, one aspect of the present invention pertains to a biomarker or biomarker combination identified by a method as described herein.

One aspect of the present invention pertains to a biomarker or biomarker combination identified by a method as described herein, for use in a method of classification.

One aspect of the present invention pertains to a method of classification which employs or relies upon one or more biomarkers or biomarker combinations identified by a method as described herein.

One aspect of the present invention pertains to use of one or more biomarkers or biomarker combinations identified by a method of classification as described herein.

One aspect of the present invention pertains to an assay for use in a method of classification, which assay relies upon one or more biomarkers or biomarker combinations identified by a method as described herein.

One aspect of the present invention pertains to use of an assay in a method of classification, which assay relies upon one or more biomarkers or biomarker combinations identified by a method as described herein.

One aspect of the present invention pertains to a method of diagnosis employing one or more biomarkers or biomarker combinations identified by a method as described herein.

Analysis of an Applied Stimulus or Condition

The methods described herein, including, for example, methods of classifying a spectrum, methods of classifying a sample, methods of classifying a subject, and methods of identifying a candidate biomarker or biomarker combination, may be employed in a method of study or analysis of an applied stimulus or condition.

As an illustrative example, an analysis or study may involve collecting NMR spectra for blood serum samples from a number of healthy control subjects (control pool) and from a number of subjects diagnosed with a particular condition, for example, osteoporosis (disease pool). The spectra for the control pool may be taken as representative of normality, and may be used as an equivalent set for the reference state (of healthy subjects). Individual spectra from the disease pool may then be subjected to the methods described herein, for example, a method of classifying a spectrum, a sample, or a subject, in an effort to determine if, and preferably to confirm that, the test subject is "abnormal." Furthermore, once classified as abnormal, the underlying spectrum may be examined to identify the spectral region(s) which give rise to classification as abnormal, and the chemical species associated with the spectral region(s) identified as candidate biomarkers or biomarker combinations, preferably with an associated confidence level.

Also, sets of spectra from the disease pool may similarly be examined, to yield similar results.

Thus, one aspect of the present invention pertains to a method of analysis of an applied stimulus (e.g., a condition), which method employs: a method of classifying a spectrum, a method of classifying a sample, a method of classifying a subject, or a method of identifying a candidate biomarker or biomarker combination, as described herein; wherein said sample spectrum or spectra are for a sample from an organism which has been subjected to said applied stimulus (e.g., in which a condition is present); and, wherein said set of equivalent spectra for a reference state comprises one or more control spectra for each of one or more samples from each of one or more organisms which have not been subjected to said applied stimulus (e.g., in which a condition is absent).

Such methods may be used, for example, as a method of diagnosis; a method of therapeutic monitoring; a method of evaluating drug therapy and/or drug efficacy; a method of detecting toxic side-effects of a drug; or a method of characterising and/or identifying a drug in overdose.

Thus, one aspect of the present invention pertains to a method of diagnosis of an applied stimulus or condition, comprising a method of analysis of an applied stimulus or condition, as described herein.

One aspect of the present invention pertains to a method of therapeutic monitoring of a subject undergoing therapy, comprising a method of analysis of an applied stimulus or condition, as described herein.

One aspect of the present invention pertains to a method of evaluating drug therapy and/or drug efficacy, comprising a method of analysis of an applied stimulus or condition, as described herein.

One aspect of the present invention pertains to a method of detecting toxic side-effects of a drug, comprising a method of analysis of an applied stimulus or condition, as described herein.

One aspect of the present invention pertains to a method of characterising and/or identifying a drug in overdose, comprising a method of analysis of an applied stimulus or condition, as described herein.

Applied Stimulus and Condition

In the context of studies of organisms, the study may be in respect of an applied stimulus. The term "applied stimulus," as used herein, pertains to a stimulus under study which is applied to, or is present in, an organism(s) under study, and is not applied to, and is absent in, a control organism(s).

The applied stimulus may be referred to as a "condition," which condition is present in study organism(s), but absent in control organism(s).

As used herein, the term "condition" relates to a state which is, in at least one respect, distinct from the state of normality, as determined by a suitable control population.

A condition may be pathological (e. g., a disease) or physiological (e. g., phenotype, genotype, fasting, water load, exercise, hormonal cycles, e. g., oestrus, etc.).

Included among conditions is the state of "at risk of" a condition, "predisposition towards" a condition, and the like, again as compared to the state of normality, as determined by a suitable control population. In this way, osteoporosis, at risk of osteoporosis, and predisposition towards osteoporosis are all conditions (and are also conditions associated with osteoporosis). Where the condition is the state of "at risk of," "predisposition towards," and the like, a method of diagnosis may be considered to be a method of prognosis.

In this context, the phrases "at risk of," "predisposition towards," and the like, indicate a probability of being classified/diagnosed (or being able to be classified/diagnosed) with the predetermined condition which is greater (e.g., 1.5x, 2x, 5x, 10x, etc.) than for the corresponding control. Often, a time period (e.g., within the next 5 years, 10 years, 20 years, etc.) is associated with the probability. For example, a subject who is 2x more likely to be diagnosed with the predetermined condition within the next 5 years, as compared to a suitable control, is "at risk of" that condition.

Included among conditions is the degree of a condition, for example, the progress or phase of a disease, or a recovery therefrom. For example, each of different states in the progress of a disease, or in the recovery from a disease, are themselves conditions.

In this way, the degree of a condition may refer to how temporally advanced the condition is. Another example of a degree of a condition relates to its maximum severity (e.g., a disease can be classified as mild, moderate or severe). Yet another example of a degree of a condition relates to the nature of the condition (e.g., anatomical site, extent of tissue involvement, etc.).

Particular examples of applied stimuli (conditions) include, but are not limited to, a xenobiotic, a disease state, and a genetic modification.

The term "xenobiotic," as used herein, pertains to a substance (e.g., compound, composition) which is administered to an organism, or to which the organism is exposed. In general, xenobiotics are chemical, biochemical or biological species (e.g., compounds) which are not normally present in that organism, or are normally present in that organism, but not at the level obtained following administration. Examples of xenobiotics include drugs, formulated medicines and their components (e.g., vaccines, immunological stimulants, inert carrier vehicles), infectious agents, pesticides, herbicides, substances present in foods (e.g., plant compounds administered to animals), and substances present in the environment.

The term "disease state," as used herein, pertains to a deviation from the normal healthy state of the organism. Examples of disease states include, but are not limited to, bacterial, viral, and parasitic infections; cancer in all its forms; degenerative diseases (e.g., arthritis, multiple sclerosis); trauma (e.g., as a result of injury); organ failure (including diabetes); cardiovascular disease (e.g., atherosclerosis, thrombosis); and, inherited diseases caused by genetic composition (e.g., sickle-cell anaemia).

The term "genetic modification," as used herein, pertains to alteration of the genetic composition of an organism. Examples of genetic modifications include, but are not limited to: the incorporation of a gene or genes into an organism from another species; increasing the number of copies of an existing gene or genes in an organism; removal of a gene or genes from an organism; and, rendering a gene or genes in an organism non-functional.

Samples

As discussed above, many aspects of the present invention pertain to methods which involve a sample, e.g., a particular sample under study ("study sample"), a test sample, etc.

In general, a sample may be in any suitable form. For methods which involve spectra obtained or recorded for a sample, the sample may be in any form which is compatible with the particular type of spectroscopy, and therefore may be, as appropriate, homogeneous or heterogeneous, comprising one or a combination of, for example, a gas, a liquid, a liquid crystal, a gel, and a solid.

Samples which originate from an organism (e.g., subject, patient) may be in vivo; that is, not removed from or separated from the organism. Thus, in one embodiment, said sample is an in vivo sample. For example, the sample may be circulating blood, which is "probed" in situ, in vivo, for example, using NMR methods.

Samples which originate from an organism may be ex vivo; that is, removed from or separated from the organism (e. g., an ex vivo blood sample, an ex vivo urine sample).

Thus, in one embodiment, said sample is an ex vivo sample.

In one embodiment, said sample is an ex vivo blood or blood-derived sample.

In one embodiment, said sample is an ex vivo blood sample.

In one embodiment, said sample is an ex vivo plasma sample.

In one embodiment, said sample is an ex vivo serum sample.

In one embodiment, said sample is an ex vivo urine sample.

In one embodiment, said sample is removed from or separated from an/said organism, and is not returned to said organism (e. g., an ex vivo blood sample, an ex vivo urine sample).

In one embodiment, said sample is removed from or separated from an/said organism, and is returned to said organism (i.e., "in transit") (e.g., as with dialysis methods). Thus, in one embodiment, said sample is an ex vivo in transit sample.

Examples of samples include: a whole organism (living or dead, e.g., a living human); a part or parts of an organism (e.g., a tissue sample, an organ); a pathological tissue such as a tumour; a tissue homogenate (e.g., a liver microsome fraction); an extract prepared from an organism or a part of an organism (e.g., a tissue sample extract, such as perchloric acid extract); an infusion prepared from an organism or a part of an organism (e.g., tea, Chinese traditional herbal medicines); an in vitro tissue such as a spheroid; a suspension of a particular cell type (e.g., hepatocytes); an excretion, secretion, or emission from an organism (especially a fluid); material which is administered and collected (e.g., dialysis fluid); material which develops as a function of pathology (e.g., a cyst, blisters); and, supernatant from a cell culture.

Examples of fluid samples include blood plasma, blood serum, whole blood, urine, (gall bladder) bile, cerebrospinal fluid, milk, saliva, mucus, sweat, gastric juice, pancreatic juice, seminal fluid, prostatic fluid, seminal vesicle fluid, seminal plasma, amniotic fluid, foetal fluid, follicular fluid, synovial fluid, aqueous humour, ascitic fluid, cystic fluid, blister fluid, and cell suspensions; and extracts thereof.

Examples of tissue samples include liver, kidney, prostate, brain, gut, blood, blood cells, skeletal muscle, heart muscle, lymphoid, bone, cartilage, and reproductive tissues.

Still other examples of samples include air (e. g., exhaust), water (e. g., seawater, groundwater, wastewater, e. g., from factories), liquids from the food industry (e. g. juices, wines, beers, other alcoholic drinks, tea, milk), solid-like food samples (e. g. chocolate, pastes, fruit peel, fruit and vegetable flesh such as banana, leaves, meats, whether cooked or raw, etc.).

Organisms, Subjects, Patients

For samples which are, or are drawn from, an organism, the organism, in general, may be a prokaryote (e.g., bacteria) or a eukaryote (e.g., protoctista, fungi, plants, animals).

The organism may be an alga or a protozoan.

The organism may be a plant, an angiosperm, a dicotyledon, a monocotyledon, a gymnosperm, a conifer, a ginkgo, a cycad, a fern, a horsetail, a clubmoss, a liverwort, or a moss.

The organism may be a chordate, an invertebrate, an echinoderm (e. g., starfish, sea urchins, brittlestars), an arthropod, an annelid (segmented worms) (e. g., earthworms, lugworms, leeches), a mollusk (cephalopods (e. g., squids, octopi), pelecypods (e. g., oysters, mussels, clams), gastropods (e. g., snails, slugs)), a nematode (round worms), a platyhelminthes (flatworms) (e. g., planarians, flukes, tapeworms), a cnidaria (e. g., jelly fish, sea anemones, corals), or a porifera (e. g., sponges).

The organism may be an arthropod, an insect (e. g., beetles, butterflies, moths), a chilopoda (centipedes), a diplopoda (millipedes), a crustacean (e. g., shrimps, crabs, lobsters), or an arachnid (e. g., spiders, scorpions, mites).

The organism may be a chordate, a vertebrate, a mammal, a bird, a reptile (e. g., snakes, lizards, crocodiles), an amphibian (e. g., frogs, toads), a bony fish (e. g., salmon, plaice, eel, lungfish), a cartilaginous fish (e. g., sharks, rays), or a jawless fish (e. g., lampreys, hagfish).

The organism may be a mammal, a placental mammal, a marsupial (e. g., kangaroo, wombat), a monotreme (e. g., duckbilled platypus), a rodent (e. g., a guinea pig, a hamster, a rat, a mouse), murine (e. g., a mouse), avian (e. g., a bird), canine (e. g., a dog), feline (e. g., a cat), equine (e. g., a horse), porcine (e. g., a pig), ovine (e. g., a sheep), bovine (e. g., a cow), a primate, simian (e. g., a monkey or ape), a monkey (e. g., marmoset, baboon), an ape (e. g., gorilla, chimpanzee, orangutang, gibbon), or a human.

Furthermore, the organism may be any of its forms, for example, a spore, a seed, an egg, a larva, a pupa, or a foetus.

The subject (e.g., a human) may be characterised by one or more criteria, for example, sex, age (e.g., 40 years or more, 50 years or more, 60 years or more, etc.), ethnicity, medical history, lifestyle (e.g., smoker, non-smoker), hormonal status (e.g., pre-menopausal, post-menopausal), etc.

The term "population," as used herein, refers to a group of organisms (e.g., subjects, patients). If desired, a population (e.g., of humans) may be selected according to one or more of the criteria listed above.

Spectra

Typically, spectra used in the methods described herein are spectra obtained following acquisition, including the normal pre-processing associated with the particular type of spectrum (e.g., data processing, compression, baseline correction, signal averaging, Fourier transformation, etc.).

In one embodiment, the spectra are, or comprise, NMR spectra or NMR spectral data.

For example, a spectrum which comprises NMR spectral data may be a spectrum derived from NMR spectral data, e.g., by data processing, compression, baseline correction, signal averaging, Fourier transformation, etc. Furthermore, a spectrum which comprises NMR spectral data may be, for example, a composite spectrum.

For example, in one embodiment, the sample spectrum is, or comprises, a sample NMR spectrum or sample NMR spectral data; and, the set of equivalent spectra is a set of spectra, each of which is, or comprises, an NMR spectrum or NMR spectral data.

For example, in one embodiment, the set of sample spectra is a set of spectra, each of which is, or comprises, a sample NMR spectrum or sample NMR spectral data; and, the set of equivalent spectra is a set of spectra, each of which is, or comprises, an NMR spectrum or NMR spectral data.

Spectroscopy

Examples of the types of spectroscopy which give spectra suitable for the application of the methods of the present invention include, but are not limited to, the following: spectroscopies of all regions of the electromagnetic spectrum, including, for example, microwave spectroscopy; far infrared spectroscopy; infrared spectroscopy; Raman and resonance Raman spectroscopy; visible spectroscopy; ultraviolet spectroscopy; far ultraviolet (or vacuum ultraviolet) spectroscopy; x-ray spectroscopy; optical rotatory dispersion, circular dichroism (e.g., ultraviolet, visible and infrared); Mössbauer spectroscopy; atomic absorption and emission spectroscopy; ultraviolet fluorescence and phosphorescence spectroscopy; magnetic resonance, including nuclear magnetic resonance (NMR), electron paramagnetic resonance (EPR), and MRI (magnetic resonance imaging); and mass spectrometry, including variations of ionization methods (including electron impact, chemical ionisation, thermospray, electrospray, matrix assisted laser desorption ionization (MALDI), inductively coupled plasma) and detection methods (including sector detection, quadrupole detection, ion-trap, time-of-flight, and Fourier transform).

One particularly preferred class of spectroscopy is nuclear magnetic resonance (NMR).

Composite Spectra

Many of the methods described herein may also be applied to composite spectra, sets of composite spectra, etc. The term "composite spectrum," as used herein, pertains to a spectrum (or data vector) which comprises spectral data (e.g., NMR spectral data, e.g., an NMR spectrum) as well as at least one other datum or data vector. Examples of other data vectors include, e.g., one or more other NMR spectral data, e.g., NMR spectra, e.g., obtained for the same sample using a different NMR technique; other types of spectra, e.g., mass spectra, numerical representations of images, etc.; obtained for another sample, of the same sample type (e.g., blood, urine, tissue, tissue extract), but obtained from the subject at a different timepoint; obtained for another sample of different sample type (e.g., blood, urine, tissue, tissue extract) for the same subject; and the like.

Examples of other data include, e.g., one or more clinical parameters. Clinical parameters which are suitable for use in composite methods include, but are not limited to, the following: (a) established clinical parameters routinely measured in hospital clinical labs: age; sex; body mass index; height; weight; family history; medication history; cigarette smoking; alcohol intake; blood pressure; full blood cell count (FBCs); red blood cells; white blood cells; monocytes; lymphocytes; neutrophils; eosinophils; basophils; platelets; haematocrit; haemoglobin; mean corpuscular volume and related haemodilution indicators; fibrinogen; functional clotting parameters (thromboplastin and partial thromboplastin); electrolytes (sodium, potassium, calcium, phosphate); urea; creatinine; total protein; albumin; globulin; bilirubin; protein markers of liver function (alanine aminotransferase, alkaline phosphatase, gamma glutamyl transferase); glucose; HbA1c (a measure of glucose-haemoglobin conjugates used to monitor diabetes); lipoprotein profile; total cholesterol; LDL; HDL; triglycerides; blood group.

(b) established research parameters routinely measured in research laboratories but not usually measured in hospitals: hormonal status; testosterone; estrogen; progesterone; follicle stimulating hormone; inhibin; transforming growth factor-beta1; transforming growth factor-beta2; chemokines; MCP-1; eotaxin; plasminogen activator inhibitor-1; cystatin C.

(c) early-stage research parameters measured in one or a small number of specialist labs: antibodies to sRll; antibodies to blood group A antigen; antibodies to blood group B antigen; immunoglobulin (IgD) against alpha-gal; immunoglobulin (IgD) against penta-gal.
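
Purely as an illustration, a composite spectrum as defined above may be assembled by concatenating spectral data with other data vectors and individual data such as clinical parameters. The function name, argument names, and example values below are hypothetical, and the simple end-to-end concatenation is an assumption of this sketch rather than a requirement of the methods described herein.

```python
import numpy as np

def composite_spectrum(nmr_spectrum, other_vectors=(), other_data=()):
    """Build one data vector from a spectrum plus further vectors and scalars.

    nmr_spectrum: 1-D sequence of spectral intensities.
    other_vectors: further data vectors (e.g., a second spectrum).
    other_data: individual values (e.g., age, body mass index).
    """
    parts = [np.asarray(nmr_spectrum, dtype=float).ravel()]
    parts += [np.asarray(v, dtype=float).ravel() for v in other_vectors]
    parts.append(np.asarray(list(other_data), dtype=float))
    return np.concatenate(parts)

if __name__ == "__main__":
    # Made-up numbers: a 3-point spectrum, a 2-point second spectrum,
    # and two clinical parameters (age in years, body mass index).
    vec = composite_spectrum([0.1, 0.2, 0.3],
                             other_vectors=[[5.0, 6.0]],
                             other_data=[54, 27.5])
    print(vec.shape)
```

Because the components may be on very different numerical scales, some form of scaling before further analysis may be appropriate; that choice is left open here.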

NMR Spectroscopy

As discussed above, many aspects of the present invention pertain to methods which employ NMR spectra, or data obtained or derived from NMR spectra.

The principal nucleus studied in biomedical NMR spectroscopy is the proton or 1H nucleus. This is the most sensitive of all naturally occurring nuclei. The chemical shift range is about 10 ppm for organic molecules. In addition, 13C NMR spectroscopy using either the naturally abundant 1.1% 13C nuclei or employing isotopic enrichment is useful for identifying metabolites. The 13C chemical shift range is about 200 ppm. Other nuclei find special application. These include 15N (in natural abundance or enriched), 19F for studies of drug metabolism, and 31P for studies of endogenous phosphate biochemistry either in vitro or in vivo.

In order to obtain an NMR spectrum, it is necessary to define a "pulse program". At its simplest, this is application of a radio-frequency (RF) pulse followed by acquisition of a free induction decay (FID), a time-dependent oscillating, decaying voltage which is digitised in an analog-digital converter (ADC). At equilibrium, the nuclear spins are present in a number of quantum states and the RF pulse disturbs this equilibrium. The FID is the result of the spins returning towards the equilibrium state. It is necessary to choose the length of the pulse (usually a few microseconds) to give the optimum response.

This and other experimental parameters are chosen on the basis of knowledge and experience on the part of the spectroscopist. See, for example, T. D. W. Claridge, High-Resolution NMR Techniques in Organic Chemistry: A Practical Guide to Modern NMR for Chemists, Oxford University Press, 2000. These choices are based on the observation frequency to be used and the known properties of the nucleus under study (i.e., the expected chemical shift range will determine the spectral width, the desired peak resolution determines the number of data points, the relaxation times determine the recycle time between scans, etc.). The number of scans to be added is determined by the concentration of the analyte, the inherent sensitivity of the nucleus under study and its abundance (either natural or enhanced by isotopic enrichment).
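
The relationships mentioned above may be illustrated with a short calculation. This is a sketch under stated assumptions, not a prescription: it assumes N complex data points sampled with a dwell time of 1/(spectral width), so that the digital resolution is the spectral width divided by N, and it assumes uncorrelated noise, so that the signal-to-noise ratio grows with the square root of the number of scans. All function names are hypothetical.

```python
import math

def spectral_width_hz(shift_range_ppm, obs_freq_mhz):
    """Spectral width in Hz: 1 ppm corresponds to obs_freq_mhz Hz."""
    return shift_range_ppm * obs_freq_mhz

def complex_points_for_resolution(sw_hz, resolution_hz):
    """Number of complex points giving a digital resolution of resolution_hz."""
    return math.ceil(sw_hz / resolution_hz)

def acquisition_time_s(n_complex_points, sw_hz):
    """Acquisition time for N complex points sampled at dwell time 1/SW."""
    return n_complex_points / sw_hz

def scans_for_snr_gain(gain):
    """Scans needed to raise signal-to-noise by 'gain' (sqrt-of-scans rule)."""
    return math.ceil(gain ** 2)

if __name__ == "__main__":
    sw = spectral_width_hz(10.0, 600.0)          # 10 ppm range at 600 MHz
    n = complex_points_for_resolution(sw, 0.25)  # 0.25 Hz per point
    print(sw, n, acquisition_time_s(n, sw), scans_for_snr_gain(2.0))
```

Instrument software may count data points differently (for example, counting real and imaginary points separately), in which case the figures change by a factor of two.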

After data acquisition, a number of manipulations are possible. The FID can be multiplied by a mathematical function to improve the signal-to-noise ratio or reduce the peak line widths. The expert operator has choice over such parameters. The FID is then often extended with a number of zeros (zero filling) and then subjected to Fourier transformation.
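
As a minimal sketch of the manipulations just described, the following multiplies an FID by a window function, appends zeros, and applies a Fourier transformation. An exponential window (which improves signal-to-noise ratio at the cost of broader lines) is assumed here as one common choice; the function name, parameter names, and the synthetic FID are hypothetical.

```python
import numpy as np

def process_fid(fid, dwell_s, line_broadening_hz=1.0, zero_fill_factor=2):
    """Apodise, zero-fill and Fourier transform a complex FID.

    Returns (freqs_hz, spectrum), both ordered by increasing frequency.
    """
    n = fid.size
    t = np.arange(n) * dwell_s
    # Exponential multiplication: adds line_broadening_hz to each line width.
    window = np.exp(-np.pi * line_broadening_hz * t)
    padded = np.zeros(n * zero_fill_factor, dtype=complex)
    padded[:n] = fid * window                      # zeros fill the remainder
    spectrum = np.fft.fftshift(np.fft.fft(padded))
    freqs = np.fft.fftshift(np.fft.fftfreq(padded.size, d=dwell_s))
    return freqs, spectrum

if __name__ == "__main__":
    # Synthetic FID: one resonance at 100 Hz decaying with T2 = 0.1 s.
    dwell = 1.0e-3
    t = np.arange(1024) * dwell
    fid = np.exp(2j * np.pi * 100.0 * t) * np.exp(-t / 0.1)
    freqs, spectrum = process_fid(fid, dwell)
    print(freqs[np.argmax(spectrum.real)])
```

Other window functions (for example, ones chosen to narrow the lines) could be substituted for the exponential without changing the rest of the sketch.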

After this conversion from time-dependent data to frequency-dependent data, it is necessary to phase the spectrum so that all peaks appear upright; this is done using two parameters by visual inspection on screen (automatic routines are now available with reasonable success). At this point the spectrum baseline may still be curved. To remedy this, one defines points in the spectrum where no peaks appear and these are taken to be baseline. Usually, a polynomial function is fitted to these points (although other methods are available), and this function is subtracted from the spectrum to provide a flat baseline. This can also be done in an automatic fashion. Other manipulations are also possible. It is possible to extend the FID forwards or backwards by "linear prediction" to improve resolution or to remove so-called truncation artefacts which occur if data acquisition of a scan is stopped before the FID has decayed into the noise. All of these decisions are also applicable to 2- and 3-dimensional NMR spectroscopy.
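
The polynomial baseline correction described above may be sketched as follows. The sketch assumes that the peak-free points have already been chosen (here passed as a boolean mask) and that a low-order polynomial is adequate; the function name and the synthetic data are hypothetical.

```python
import numpy as np

def baseline_correct(x, y, baseline_mask, degree=3):
    """Fit a polynomial to peak-free points and subtract it from the spectrum.

    x: frequency (or ppm) axis; y: real spectrum intensities;
    baseline_mask: True where no peaks appear.
    """
    coeffs = np.polyfit(x[baseline_mask], y[baseline_mask], degree)
    baseline = np.polyval(coeffs, x)   # evaluate over the whole spectrum
    return y - baseline

if __name__ == "__main__":
    # Synthetic spectrum: one narrow peak on a curved baseline.
    x = np.linspace(0.0, 10.0, 501)
    curved = 0.5 + 0.1 * x - 0.02 * x ** 2
    peak = np.exp(-((x - 5.0) / 0.1) ** 2)
    mask = np.abs(x - 5.0) > 1.0       # points taken to be baseline
    corrected = baseline_correct(x, curved + peak, mask)
    print(corrected.max())
```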

An NMR spectrum consists of a series of digital data points with a y value (relating to signal strength) as a function of equally spaced x-values (frequency). These data point values run over the whole of the spectrum. Individual peaks in the spectrum are identified by the spectroscopist or automatically by software and the area under each peak is determined either by integration (summation of the y values of all points over the peak) or by curve fitting. A peak can be a single resonance or a multiplet of resonances corresponding to a single type of nucleus in a particular chemical environment (e.g., the two protons ortho to the carboxyl group in benzoic acid). Integration of the three-dimensional peak volumes in 2-dimensional NMR spectra is also possible. The intensity of a peak in an NMR spectrum is proportional to the number of nuclei giving rise to that peak (if the experiment is conducted under conditions where each successive accumulated free induction decay (FID) is taken starting at equilibrium). Also, the relative intensities of peaks from different analytes in the same sample are proportional to the concentrations of those analytes (again if equilibrium prevails at the start of each scan).
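
Integration by summation, as described above, may be illustrated as follows. The region limits are assumed to have been chosen by the spectroscopist or by software; the function names and the synthetic two-peak spectrum are hypothetical.

```python
import numpy as np

def integrate_region(ppm, intensity, lo, hi):
    """Peak area as the sum of y values over the points with lo <= ppm <= hi."""
    mask = (ppm >= lo) & (ppm <= hi)
    return intensity[mask].sum()

def relative_intensities(ppm, intensity, regions):
    """Areas of the given (lo, hi) regions, normalised to sum to one."""
    areas = np.array([integrate_region(ppm, intensity, lo, hi)
                      for lo, hi in regions])
    return areas / areas.sum()

if __name__ == "__main__":
    # Synthetic spectrum: two peaks of equal width, heights in the ratio 1:3.
    ppm = np.linspace(0.0, 10.0, 1001)
    intensity = (np.exp(-((ppm - 2.0) / 0.05) ** 2)
                 + 3.0 * np.exp(-((ppm - 7.0) / 0.05) ** 2))
    print(relative_intensities(ppm, intensity, [(1.5, 2.5), (6.5, 7.5)]))
```

Because the data points are equally spaced, the plain sum differs from the true area only by a constant factor (the point spacing), which cancels in relative intensities.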

Thus, the term "NMR spectral intensity," as used herein, pertains to some measure related to the NMR peak area, and may be absolute or relative. NMR spectral intensity may be, for example, a combination of a plurality of NMR spectral intensities, e.g., a linear combination of a plurality of NMR spectral intensities.

In the context of NMR spectral intensity, the term "NMR" refers to any type of NMR spectroscopy.

NMR spectroscopic techniques can be classified according to the number of frequency axes and these include 1D-, 2D-, and 3D-NMR. 1D spectra include, for example, single pulse; water-peak eliminated either by saturation or non-excitation; spin-echo, such as CPMG (i.e., edited on the basis of spin-spin relaxation); diffusion-edited; and selective excitation of specific spectral regions. 2D spectra include, for example, J-resolved (JRES); 1H-1H correlation methods, such as NOESY, COSY, TOCSY and variants thereof; heteronuclear correlation including direct detection methods, such as HETCOR, and inverse-detected methods, such as 1H-13C HMQC, HSQC, HMBC. 3D spectra include many variants, all of which are combinations of 2D methods, e.g., HMQC-TOCSY, NOESY-TOCSY, etc. All of these NMR spectroscopic techniques can also be combined with magic-angle-spinning (MAS) in order to study samples other than isotropic liquids, such as tissues, which are characterised by anisotropic composition.

Preferred nuclei include 1H and 13C. Preferred techniques for use in the present invention include water-peak eliminated, spin-echo such as CPMG, diffusion edited, JRES, COSY, TOCSY, HMQC, HSQC, and HMBC.

NMR analysis (especially of biofluids) is carried out at as high a field strength as is practical, according to availability (very high field machines are not widespread), cost (a 600 MHz instrument costs about £500,000 but a shielded 800 MHz instrument can cost more than £3,500,000, depending on the nature of accessory equipment purchased), and ability to accommodate the physical size of the instrument.

Maintenance/operational costs do not vary greatly and are small compared to the capital cost of the machine and the personnel costs.

Typically, the 1H observation frequency is from about 200 MHz to about 900 MHz, more typically from about 400 MHz to about 900 MHz, yet more typically from about 500 MHz to about 750 MHz. 1H observation frequencies of 500 and 600 MHz may be particularly preferred. Instruments with the following 1H observation frequencies are/were commercially available: 200, 250, 270 (discontinued), 300, 360 (discontinued), 400, 500, 600, 700, 750, 800, and 900 MHz.

Higher frequencies are used to obtain better signal-to-noise ratio and for greater spectral dispersion of resonances. This gives a better chance of identifying the molecules giving rise to the peaks. The benefit is not linear because, in addition to the better dispersion, the detailed spectral peaks can move from being "second-order," where analysis by inspection is not possible, towards "first-order," where it is. Both peak positions and intensities within multiplets change in a non-linear fashion as this progression occurs. Lower observation frequencies would be used where cost is an issue, but this is likely to lead to reduced effectiveness for classification and identification of biomarkers.
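
The effect of observation frequency on dispersion may be illustrated numerically. A chemical shift difference expressed in ppm corresponds to a separation in Hz that grows in proportion to the observation frequency, whereas a scalar coupling constant J (in Hz) does not depend on the field. The ratio of the two is a commonly used guide to how close a multiplet is to first-order appearance; the use of this ratio, the function names, and the example values are assumptions of this sketch and are not taken from the present disclosure.

```python
def shift_separation_hz(delta_ppm, obs_freq_mhz):
    """Separation in Hz of two resonances delta_ppm apart."""
    return delta_ppm * obs_freq_mhz

def dispersion_ratio(delta_ppm, coupling_hz, obs_freq_mhz):
    """Shift separation divided by the coupling constant J.

    Larger values indicate behaviour closer to first-order.
    """
    return shift_separation_hz(delta_ppm, obs_freq_mhz) / coupling_hz

if __name__ == "__main__":
    # Hypothetical pair of coupled protons: 0.05 ppm apart, J = 7 Hz.
    for freq in (200.0, 500.0, 800.0):
        print(freq, dispersion_ratio(0.05, 7.0, freq))
```

Although the ratio itself scales linearly with observation frequency, the appearance of the multiplet does not, which is consistent with the non-linear changes in peak positions and intensities noted above.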

NMR Spectroscopy: Sample Preparation

NMR spectra can be measured in solid, liquid, liquid crystal or gas states over a range of temperatures from 120 K to 420 K and outside this range with specialised equipment.

Typically, NMR analysis of biofluids is performed in the liquid state with a sample temperature of from about 274 K to about 328 K, but more typically from about 283 K to about 321 K. An example of a typical temperature is about 300 K.

Lower temperatures would be used to ensure that the biofluid did not suffer from any decomposition or show any effects of chemical or enzymatic reactions during the data acquisition. Higher temperatures may be used to improve detection of certain species.

For example, for plasma or serum, lipoproteins undergo a series of phase changes as the temperature is increased; in particular, the low density lipoprotein (LDL) peak intensities are rather temperature dependent, and the lines sharpen and broader, more difficult-to-detect components become visible as the lipoprotein becomes more "liquid."

Typically, biofluid samples are diluted with solvent prior to NMR analysis. This is done for a variety of reasons, including: to lessen solution viscosity, to control the pH of the solution, and to allow addition of reagents and reference materials.

An example of a typical dilution solvent is a solution of 0.9% by weight of sodium chloride in D2O. The D2O lessens the overall concentration of H2O and eases the technical requirements in the suppression of the solvent water NMR resonance, necessary for optimum detection of metabolite NMR signals. The deuterium nuclei of the D2O also provide an NMR signal for locking the magnetic field, enabling the exact co-registration of successive scans.

Depending on the available amount of the biofluid, typically, the dilution ratio is from about 1:50 to about 5:1 by volume, but more typically from about 1:20 to about 1:1 by volume. An example of a typical dilution ratio is 3:7 by volume (e.g., 150 µL sample, 350 µL solvent), typical for conventional 5 mm NMR tubes and for flow-injection NMR spectroscopy.

Typical sample volumes for NMR analysis are from about 50 µL (e.g., for microprobes) to about 2 mL. An example of a typical sample volume is about 500 µL.

NMR peak positions (chemical shifts) are measured relative to that of a known standard compound, usually added directly to the sample. For biofluids such as urine this is commonly a partially deuterated form of TSP, i.e., 3-trimethylsilyl-[2,2,3,3-2H4]-propionate sodium salt. For biofluids containing high levels of proteins, this substance is not suitable since it binds to proteins and shows a broadened NMR line. Added formate anion (e.g., as a salt) can be used in such cases, as for blood plasma.

NMR Spectroscopy: Manipulation of NMR Spectra

NMR spectra are typically acquired, and subsequently handled, in digitised form.

Conventional methods of spectral pre-processing of (digital) spectra are well known, and include, where applicable, signal averaging, Fourier transformation (and other transformation methods), phase correction, baseline correction, smoothing, and the like (see, for example, Lindon et al., 1980).

Modern spectroscopic methods often permit the collection of high or very high resolution spectra. In digital form, even a simple spectrum (e.g., signal versus spectroscopic parameter) may have many thousands, if not tens of thousands, of data points. It is often desirable to reduce or compress the data to give fewer data points, both for practical computing and to effect some degree of signal averaging to compensate for physical effects, such as pH variation, compartmentalisation, and the like. The resulting data may be referred to as "spectral data."

For example, a typical 1H NMR spectrum is recorded as signal intensity versus chemical shift (δ), which ranges from about δ 0 to δ 10. At a typical chemical shift resolution of about 10⁻⁴ to 10⁻³ ppm, the spectrum in digital form comprises about 10,000 to 100,000 data points. As discussed above, it is often desirable to compress this data, for example, by a factor of about 10 to 100, to about 1000 data points.

For example, in one approach, the chemical shift axis, δ, is "segmented" into "buckets" or "bins" of a specific length. For a 1-D 1H NMR spectrum which spans the range from δ 0 to δ 10, using a bucket length, Δδ, of 0.04 yields 250 buckets, for example, δ 10.0-9.96, δ 9.96-9.92, δ 9.92-9.88, etc., usually reported by their midpoint, for example, δ 9.98, δ 9.94, δ 9.90, etc. The signal intensity within a given bucket may be averaged or integrated, and the resulting value reported. In this way, a spectrum with, for example, 100,000 original data points can be compressed to an equivalent spectrum with, for example, 250 data points.
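The bucketing step described above can be sketched as follows. This is a minimal illustration using NumPy on synthetic data, not the procedure of any particular software package; the function name `bucket_spectrum` and the flat test spectrum are assumptions made for the example. Intensities are integrated (summed) per bucket, and buckets are returned in ascending δ order rather than the descending order in which NMR spectra are usually reported.

```python
import numpy as np

def bucket_spectrum(delta, intensity, lo=0.0, hi=10.0, width=0.04):
    """Integrate (sum) intensities into fixed-width chemical-shift buckets."""
    n_buckets = int(round((hi - lo) / width))      # (10 - 0) / 0.04 -> 250
    edges = np.linspace(lo, hi, n_buckets + 1)     # bucket boundaries
    # Sum the intensity of every data point that falls inside each bucket.
    totals, _ = np.histogram(delta, bins=edges, weights=intensity)
    midpoints = 0.5 * (edges[:-1] + edges[1:])     # buckets reported by midpoint
    return midpoints, totals

# Synthetic spectrum (illustrative only): 100,000 points over delta 0 to 10.
delta = (np.arange(100_000) + 0.5) * 1e-4
intensity = np.ones_like(delta)

midpoints, buckets = bucket_spectrum(delta, intensity)
print(len(buckets), midpoints[0], midpoints[-1])
```

Replacing the sum with a mean (dividing each total by the number of points in the bucket) gives the "averaged" variant mentioned in the text.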

A similar approach can be applied to 2-D spectra, 3-D spectra, and the like. For 2-D spectra, the "bucket" approach may be extended to a "patch." For 3-D spectra, the "bucket" approach may be extended to a "volume." For example, a 2-D 1H NMR spectrum which spans the range from δ 0 to δ 10 on both axes, using a patch of Δδ 0.1 × Δδ 0.1, yields 10,000 patches. In this way, a spectrum with perhaps 10⁸ original data points can be compressed to an equivalent spectrum of 10⁴ data points.
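The extension from buckets to patches can be sketched in the same way. The example below is an assumption-laden illustration: the function name `patch_spectrum` is hypothetical, and a 1000 × 1000 point array is used purely to keep memory use small, so that a patch of 0.1 × 0.1 in δ covers 10 × 10 points.

```python
import numpy as np

def patch_spectrum(spectrum2d, factor):
    """Sum a 2-D array over non-overlapping factor x factor patches."""
    rows, cols = spectrum2d.shape
    if rows % factor or cols % factor:
        raise ValueError("array dimensions must be multiples of the patch size")
    # Split each axis into (number of patches, points per patch) and
    # sum over the two "points per patch" axes.
    blocks = spectrum2d.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.sum(axis=(1, 3))

# Illustrative sizes only: delta 0 to 10 on both axes at 0.01 per point.
spectrum2d = np.ones((1000, 1000))
patches = patch_spectrum(spectrum2d, factor=10)
print(patches.shape, patches.size)
```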

In this context, the equivalent spectrum may be referred to as "a spectral data set," "a data set comprising spectral data," etc., comprising experimental parameters derived from spectral data.

Software for such processing of NMR spectra, for example AMIX (Analysis of MIXture, V 2.5, Bruker Analytik, Rheinstetten, Germany), is commercially available.

Often, certain spectral regions carry no real diagnostic information, or carry conflicting biochemical information, and it is often useful to remove these "redundant" regions before performing detailed analysis. In the simplest approach, the data points are deleted. In another simple approach, the data in the redundant regions are replaced with zero values.

For example, due to the dynamic range problem with water in comparison with other molecules, the water resonance (around δ 4.7) is suppressed. However, small variations in water suppression remain, and these variations can undesirably complicate analysis. Similarly, variations in water suppression may also affect the urea signal (around δ 6.0), by cross saturation. Therefore, it is often useful to delete certain spectral regions, for example, from about δ 4.5 to δ 6.0 (e.g., δ 4.52 to δ 6.00).
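Both ways of handling a redundant region (deleting the buckets, or replacing them with zeros) can be sketched as below. The function name `exclude_region` and the flat bucketed spectrum are assumptions made for illustration; buckets are selected by midpoint, so the δ 4.52 to δ 6.00 region corresponds to midpoints δ 4.54 through δ 5.98 when the bucket length is 0.04.

```python
import numpy as np

def exclude_region(midpoints, buckets, lo=4.52, hi=6.00, mode="delete"):
    """Remove or zero the buckets whose midpoints lie between lo and hi."""
    mask = (midpoints >= lo) & (midpoints <= hi)
    if mode == "delete":
        return midpoints[~mask], buckets[~mask]
    if mode == "zero":
        zeroed = buckets.copy()
        zeroed[mask] = 0.0
        return midpoints, zeroed
    raise ValueError("mode must be 'delete' or 'zero'")

# Illustrative bucketed spectrum: 250 buckets of length 0.04 over delta 0 to 10.
midpoints = 0.02 + 0.04 * np.arange(250)
buckets = np.ones(250)

kept_mid, kept = exclude_region(midpoints, buckets, mode="delete")
_, zeroed = exclude_region(midpoints, buckets, mode="zero")
print(len(kept), int(zeroed.sum()))
```

Deleting changes the number of descriptors, so the same region must be removed from every spectrum in a data set; zero-filling keeps the descriptor count fixed.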

Certain metabolites exhibit a strong degree of physiological variation (e.g., diurnal variation, dietary-related variation) that is unrelated to any pathophysiological process. Such variation may undesirably complicate analysis, and mask more relevant details. Therefore, it may be useful to delete the spectral regions associated with such compounds. However, it is often possible to isolate these effects in later (e.g., pattern recognition) analysis.

Xenobiotics (e.g., drugs) and their metabolites often give rise to large signals which do not directly correlate to the conditions (e.g., pathologies) which are induced by the xenobiotic. Therefore, it is often useful to delete the spectral regions associated with such compounds.

In general, NMR data is handled as a data matrix. Typically, each row in the matrix corresponds to an individual sample (often referred to as a "data vector"), and the entries in the columns are, for example, the spectral intensity of a particular data point, at a particular δ or Δδ (often referred to as "descriptors").

It is often useful to pre-process data, for example, by addressing missing data, translation, scaling, weighting, etc.

Multivariate projection methods, such as principal component analysis (PCA) and partial least squares analysis (PLS), are so-called scaling sensitive methods. By using prior knowledge and experience about the type of data studied, the quality of the data prior to multivariate modelling can be enhanced by scaling and/or weighting. Adequate scaling and/or weighting can reveal the important and interesting variation hidden within the data, and therefore make subsequent multivariate modelling more efficient. Scaling and weighting may be used to place the data in the correct metric, based on knowledge and experience of the studied system, and therefore reveal patterns already inherently present in the data.

If at all possible, missing data, for example, gaps in column values, should be avoided. However, if necessary, such missing data may be replaced or "filled" with, for example, the mean value of a column ("mean fill"); a random value ("random fill"); or a value based on a principal component analysis ("principal component fill"). Each of these different approaches will have a different effect on subsequent PR analysis.
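Of the fill strategies mentioned, mean fill is the simplest to illustrate. The sketch below assumes that missing entries are encoded as NaN in a samples × descriptors matrix; the function name `mean_fill` and the small matrix are hypothetical.

```python
import numpy as np

def mean_fill(X):
    """Replace NaN entries with the mean of the observed values in that column."""
    X = np.array(X, dtype=float)              # work on a copy
    col_means = np.nanmean(X, axis=0)         # column means, ignoring NaN
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

# Illustrative matrix: 3 samples (rows) x 2 descriptors (columns), two gaps.
X = [[1.0, np.nan],
     [3.0, 4.0],
     [np.nan, 8.0]]
filled = mean_fill(X)
print(filled)
```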

"Translation"of the descriptor coordinate axes can be useful. Examples of such translation include normalisation and mean centering.

"Normalisation"may be used to remove sample-to-sample variation. Many normalisation approaches are possible, and they can often be applied at any of several points in the analysis. Usually, normalisation is applied after redundant spectral regions have been removed. In one approach, each spectrum is normalised (scaled) by a factor of 1/A, where A is the sum of the absolute values of all of the descriptors for that spectrum. In this way, each data vector has the same length, specifically, 1. For example, if the sum of the absolute values of intensities for each bucket in a particular spectrum is 1067, then the intensity for each bucket for this particular spectrum is scaled by 1/1067.

"Mean centering"may be used to simplify interpretation. Usually, for each descriptor, the average value of that descriptor for all samples is subtracted. In this way, the mean of a descriptor coincides with the origin, and all descriptors are"centred"at zero. For example, if the average intensity at 5 10. 0-9.96, for all spectra, is 1.2 units, then the intensity at 5 10.0-9.96, for all spectra, is reduced by 1.2 units.

In "unit variance scaling," data can be scaled to equal variance. Usually, the value of each descriptor is scaled by 1/StDev, where StDev is the standard deviation for that descriptor for all samples. For example, if the standard deviation at δ 10.0-9.96, for all spectra, is 2.5 units, then the intensity at δ 10.0-9.96, for all spectra, is scaled by 1/2.5 or 0.4. Unit variance scaling may be used to reduce the impact of "noisy" data. For example, some metabolites in biofluids show a strong degree of physiological variation (e.g., diurnal variation, dietary-related variation) that is unrelated to any pathophysiological process. Without unit variance scaling, these noisy metabolites may dominate subsequent analysis.
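The normalisation, mean centering, and unit variance scaling operations described above can be sketched as follows. The function names and the small data matrix are assumptions made for illustration, and the use of the sample standard deviation (`ddof=1`) is a choice of this sketch, since the text does not specify which estimator is intended. Rows are spectra (data vectors) and columns are descriptors.

```python
import numpy as np

def normalise(X):
    """Scale each spectrum (row) by 1/A, where A is the sum of its absolute values."""
    X = np.asarray(X, dtype=float)
    return X / np.abs(X).sum(axis=1, keepdims=True)

def mean_centre(X):
    """Subtract, from each descriptor (column), its mean over all samples."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)

def unit_variance_scale(X):
    """Scale each descriptor (column) by 1/StDev (sample standard deviation)."""
    X = np.asarray(X, dtype=float)
    return X / X.std(axis=0, ddof=1)

# Illustrative matrix: 3 spectra x 3 buckets; the first row sums to 1067.
X = np.array([[1000.0, 60.0, 7.0],
              [ 500.0, 40.0, 3.0],
              [ 800.0, 20.0, 5.0]])

Xn = normalise(X)              # every row now sums to 1
Xc = mean_centre(X)            # every column now has mean 0
Xuv = unit_variance_scale(X)   # every column now has standard deviation 1
print(Xn[0, 0], 1000 / 1067)
```

Note that normalisation acts on rows while the other two act on columns. The sketch assumes no spectrum sums to zero and no descriptor is constant, since either would cause division by zero.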

"Pareto scaling"is, in some sense, intermediate between mean centering and unit variance scaling. In effect, smaller peaks in the spectra can influence the model to a higher degree than for the mean centered case. Also, the loadings are, in general, more interpretable than for unit variance based models. In pareto scaling, the value of each descriptor is scaled by 1/sqrt (StDev), where StDev is the standard deviation for that descriptor for all samples. In this way, each descriptor has a variance numerically equal to its initial standard deviation. The pareto scaling may be performed, for example, on raw data or mean centered data.

"Logarithmic scaling"may be used to assist interpretation when data have a positive skew and/or when data spans a large range, e. g., several orders of magnitude. Usually, for each descriptor, the value is replaced by the logarithm of that value. For example, the intensity at 5 10.0-9.96 is replaced the logarithm of the intensity at 5 10.0-9.96, for all spectra.

In "equal range scaling," each descriptor is divided by the range of that descriptor for all samples. In this way, all descriptors have the same range, that is, 1. For example, if, at δ 10.0-9.96, for all spectra, the largest value is 87 units and the smallest value is 1, then the range is 86 units, and the intensity at δ 10.0-9.96, for all spectra, is divided by 86 units. However, this method is sensitive to the presence of outlier points.
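Pareto, logarithmic, and equal range scaling can be sketched in the same style. As before, the function names and data are hypothetical; the sample standard deviation (`ddof=1`) and base-10 logarithm are assumptions of this sketch, as the text fixes neither.

```python
import numpy as np

def pareto_scale(X):
    """Scale each descriptor (column) by 1/sqrt(StDev)."""
    X = np.asarray(X, dtype=float)
    return X / np.sqrt(X.std(axis=0, ddof=1))

def log_scale(X):
    """Replace each value by its base-10 logarithm (values must be positive)."""
    return np.log10(np.asarray(X, dtype=float))

def equal_range_scale(X):
    """Divide each descriptor (column) by its range over all samples."""
    X = np.asarray(X, dtype=float)
    return X / (X.max(axis=0) - X.min(axis=0))

# Illustrative matrix: the first descriptor runs from 1 to 87 (range 86).
X = np.array([[ 1.0,   10.0],
              [44.0,  100.0],
              [87.0, 1000.0]])

Xp = pareto_scale(X)
Xl = log_scale(X)
Xr = equal_range_scale(X)

# After Pareto scaling, each variance equals the original standard deviation.
print(Xp.var(axis=0, ddof=1), X.std(axis=0, ddof=1))
```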

In "autoscaling," each descriptor is mean centred and unit variance scaled. This technique is very useful because each descriptor is then weighted equally and, in the case of NMR descriptors, large and small peaks are treated with equal emphasis. This can be important for metabolites present at very low, but still detectable, levels.

Several supervised methods of scaling data are also known. Some of these can provide a measure of the ability of a parameter (e.g., a descriptor) to discriminate between classes, and can be used to improve classification by stretching a separation.

For example, in "variance weighting," the variance weight of a single parameter (e.g., a descriptor) is calculated as the ratio of the inter-class variances to the sum of the intra-class variances. A large value means that this variable is discriminating between the classes. For example, if the samples are known to fall into two classes (e.g., a training set), it is possible to examine the mean and variance of each descriptor. If a descriptor has very different mean values and a small variance, then it will be good at separating the classes.
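One plausible reading of the variance weight can be sketched as below. The text does not give an explicit formula, so this sketch assumes that the inter-class variance is the variance of the class means and that the intra-class term is the sum of the per-class variances; the function name `variance_weights` and the training data are hypothetical.

```python
import numpy as np

def variance_weights(X, labels):
    """Per-descriptor ratio of between-class to summed within-class variance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    class_means = np.array([X[labels == c].mean(axis=0) for c in classes])
    class_vars = np.array([X[labels == c].var(axis=0, ddof=1) for c in classes])
    inter = class_means.var(axis=0, ddof=1)   # spread of the class means
    intra = class_vars.sum(axis=0)            # summed within-class variances
    return inter / intra

# Illustrative training set: descriptor 0 separates the classes, descriptor 1 does not.
X = np.array([[ 1.0, 1.0], [ 2.0, 5.0], [ 3.0, 9.0],    # class A
              [11.0, 2.0], [12.0, 5.0], [13.0, 8.0]])   # class B
labels = np.array(["A", "A", "A", "B", "B", "B"])

weights = variance_weights(X, labels)
print(weights)
```

Descriptors with large weights could then be emphasised (or low-weight descriptors discarded) before classification.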

"Feature weighting"is a more general description of variance weighting, where not only the mean and standard deviation of each descriptor is calculated, but other well known weighting factors, such as the Fisher weight, are used.

Spurious or irregular data in spectra ("outliers"), which are not representative, are preferably identified and removed. Common reasons for irregular data ("outliers") include poor phase correction, poor baseline correction, poor chemical shift referencing, poor water suppression, bacterial contamination, shifts in the pH of the biofluid, toxin- or disease-induced biochemical response, and idiosyncratic response to xenobiotics.

Outliers are identified in different ways depending on the method of analysis used. For example, when using principal component analysis (PCA), small numbers of samples lying far from the rest of the replicate group can be identified by eye as outliers. A more objective means of identification for PCA is to use Hotelling's T² test, which is the multivariate version of the well known Student's t test used in univariate statistics. For any given sample, the T² value can be calculated and compared with a standard value within which a chosen fraction (e.g., 95%) of the samples would normally lie. Samples with T² values substantially outside this limit can then be flagged as outliers.
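A minimal sketch of T²-based outlier flagging on PCA scores is given below. It is illustrative only: the function name `hotelling_t2`, the synthetic data, and the injected outlier are assumptions. For simplicity the limit used is 5.991, the 95% point of the chi-square distribution with 2 degrees of freedom, which is a large-sample approximation; an F-distribution-based limit is often used instead when the number of samples is small.

```python
import numpy as np

def hotelling_t2(X, n_components=2):
    """Hotelling's T^2 for each sample, computed from its leading PCA scores."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                        # mean centre each descriptor
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    score_var = scores.var(axis=0, ddof=1)         # variance of each score
    return (scores ** 2 / score_var).sum(axis=1)

# Synthetic data (illustrative only): 100 samples x 20 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
X[0] += 15.0                                       # hypothetical gross outlier

t2 = hotelling_t2(X, n_components=2)
limit = 5.991   # chi-square 95% point, 2 d.o.f. (large-sample approximation)
outliers = np.flatnonzero(t2 > limit)
print(outliers)
```

Because a gross outlier also inflates the score variances it is judged against, in practice the model is often refitted after removing confirmed outliers.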

Also, when using more sophisticated supervised methods, such as SIMCA or PNNs, a similar method is used. A confidence level (e.g., 95%) is selected and the region of multivariate space corresponding to confidence values above this limit is determined. This region can be displayed graphically in several different ways (for example, by plotting the critical T² ellipse on a PCA scores plot). Any samples falling outside the high confidence region are flagged as potential outliers. Naturally, such samples are investigated in detail to determine the causes of their outlying nature before removing them from the model.

Implementation

The methods of the present invention, or parts thereof, may be conveniently performed electronically, for example, using a suitably programmed computer system.

One aspect of the present invention pertains to a computer system or device, such as a computer or linked computers, operatively configured to implement a method of the present invention, as described herein.

One aspect of the present invention pertains to computer code suitable for implementing a method of the present invention, as described herein, on a suitable computer system.

One aspect of the present invention pertains to a computer program comprising computer program means adapted to perform a method according to the present invention, as described herein, when said program is run on a computer.

One aspect of the present invention pertains to a computer program, as described above, embodied on a computer readable medium.

One aspect of the present invention pertains to a data carrier which carries computer code suitable for implementing a method of the present invention, as described herein, on a suitable computer.

Computers may be linked, for example, internally (e.g., on the same circuit board, on different circuit boards which are part of the same unit), by cabling (e.g., networking, ethernet, internet), using wireless technology (e.g., radio, microwave, satellite link, cell-phone), etc., or by a combination thereof.

Examples of data carriers and computer readable media include chip media (e.g., ROM, RAM, flash memory (e.g., Memory Stick™, Compact Flash™, Smartmedia™)), magnetic disk media (e.g., floppy disks, hard drives), optical disk media (e.g., compact disks (CDs), digital versatile disks (DVDs), magneto-optical (MO) disks), and magnetic tape media.

Applications

The methods described herein can be used in the analysis of chemical, biochemical, and biological data.

The methods described herein provide powerful means for the diagnosis and prognosis of disease, for assisting medical practitioners in providing optimum therapy for disease, and for understanding the benefits and side-effects of xenobiotic compounds thereby aiding the drug development process.

Furthermore, the methods described herein can be applied in a non-medical setting, such as in post mortem examinations, forensic science, and the analysis of complex chemical mixtures other than mammalian cells or biofluids.

Examples of these and other applications of the methods described herein include, but are not limited to, the following:

Medical Diagnostic Applications

(a) Early detection of abnormality/problem. For example, the technique can be used to identify subjects suffering from cerebral edema immediately on arrival in the acute emergency department of a hospital. At present, when patients present with head trauma, it is difficult to tell whether cerebral edema will be a problem: as a result, it may not be possible to intervene until clinical symptoms of cerebral edema become evident, which may be too late to save the patient.

In a similar example, patients arriving at acute emergency departments can be screened for internal bleeding and organ rupture, to facilitate early surgical intervention.

In a third example, the methods described herein can be used to identify a clinically silent disease (e.g., low bone mineral density (e.g., osteoporosis); infection with Helicobacter pylori) prior to the onset of clinical symptoms (e.g., fracture; development of ulcers).

(b) Diagnosis (identification of disease), especially cheap, rapid, and non-invasive diagnosis. For example, the methods described herein can be used to replace treadmill exercise tests, echocardiograms, electrocardiograms, and invasive angiography as the collective method for the identification of coronary heart disease. Since the current tests for coronary heart disease are slow, expensive, and invasive (with associated morbidity and mortality), the methods described herein offer significant advantages.

(c) Differential diagnosis, e.g., classification of disease, severity of disease, etc., for example, the ability to distinguish patients with coronary artery disease affecting 1, 2, or all 3 coronary arteries (see example below); the ability to distinguish disease at different anatomical sites, e.g., in the left coronary artery versus the circumflex artery, or in the carotid arteries as opposed to the coronary arteries.

(d) Population targeting. A condition (e.g., coronary heart disease, osteoporosis) may be clinically silent for many years prior to an acute event (e.g., heart attack, bone fracture), which may have significant associated morbidity or mortality. Drugs may exist to help prevent the acute event (e.g., statins for heart disease, bisphosphonates for osteoporosis), but often they cannot be efficiently targeted at the population level. For a test to be useful for population screening, it must be cheap and non-invasive. The methods described herein are ideally suited to population screening. Screens for multiple diseases with a single blood sample (e.g., osteoporosis, heart disease, and cancer) further improve the cost/benefit ratio for screening.

(e) Classification, fingerprinting, and diagnosis of metabolic diseases (e.g., inborn errors of metabolism).

(f) Identifying, classifying, determining the progress of, and monitoring the treatment of, infectious diseases.

(g) Characterization and identification of drugs used in overdose. For example, a patient may be unconscious following an overdose and/or the nature of the drug taken in overdose may not be known. The methods described herein can be used to characterise the biological consequences of the overdose and to rapidly identify candidate agents, facilitating rapid intervention to reverse the effects. Thus an overdose of opioids could rapidly be countered with naloxone.

(h) Characterization and identification of poisons, and the metabolic or biological consequences of poisoning. Many victims of poisoning (e.g., children) are unaware of the nature of the substance they have taken. Furthermore, the subject may be unconscious or unable to communicate. The methods described herein can be used to characterise the biological consequences of the poisoning and to rapidly identify candidate poisons. This would facilitate administration of an appropriate antidote, which typically must be done as quickly as possible after exposure to (e.g., ingestion of) the toxic substance.

Medical Prognosis Applications

(a) Prognosis (prediction of future outcome), including, for example, analysis of "old" samples to effect retrospective prognosis. For example, a sample can be used to assess the risk of myocardial infarction among sufferers of angina, permitting a more aggressive therapeutic strategy to be applied to those at greatest risk of progressing to a heart attack.

(b) Risk assessment, to identify people at risk of suffering from a particular indication. The methods described herein can be used for population screening (as for diagnosis) but in this case to screen for the risk of developing a particular disease. Such an approach will be useful where an effective prophylaxis is known but must be applied prior to the development of the disease in order to be effective. For example, bisphosphonates are effective at preventing bone loss in osteoporosis but they do not increase pathologically low bone mineral density. Ideally, therefore, these drugs are applied prior to any bone loss occurring. This can only be done with a technique which facilitates prediction of future disease (prognosis). The methods described herein can be used to identify those people at high risk of losing bone mineral density in the future, so that prophylaxis may begin prior to disease inception.

(c) Antenatal screening for a wide range of disease susceptibilities. The methods described herein can be used to analyse blood or tissue drawn from a pre-term fetus (e.g., during chorionic villus sampling or amniocentesis) for the purposes of antenatal screening.

Aids to Therapeutic Intervention

(a) Therapeutic monitoring, e.g., to monitor the progress of treatment. For example, by making serial diagnostic tests, it will be possible to determine whether and to what extent the subject is returning to normal following initiation of a therapeutic regimen.

(b) Patient compliance, e.g., monitoring patient compliance with therapy. Patient compliance is often very poor, particularly with therapies that have significant side-effects. Patients often claim to comply with the therapeutic regimen, but this may not always be the case. The methods described herein permit the patient compliance to be monitored, both by directly measuring the drug concentration and also by examining biological consequences of the drug. Thus, the methods described herein offer significant advantages over existing methods of monitoring compliance (such as measuring plasma concentrations of the drug) since the patient may take the drug just prior to the investigation, while having failed to comply for previous weeks or months. By monitoring the biological consequences of therapy, it is possible to assess long-term compliance.

(c) Toxicology, including sophisticated monitoring of any adverse reactions suffered, e.g., on a patient-by-patient basis. This will facilitate investigation of idiosyncratic toxicity. Some patients may suffer real, clinically significant side-effects from a therapy which were not seen in the majority. Application of the methods described herein facilitates rapid identification of these rare, idiosyncratic toxicities so that the therapy can be discontinued or modified as appropriate. Such an approach allows the therapy to be tailored to the individual metabolism of each patient.

(d) The methods described herein can be used for "pharmacometabonomics," in analogy to pharmacogenomics, e.g., subjects could be divided into "responders" and "nonresponders" using the metabonomic profile as evidence of "response," and features of the metabonomic profile could then be used to target future patients who would likely respond to a particular therapeutic course. For example, patients given statins could be monitored using the methods described herein for beneficial changes in the subtle composition of the lipoproteins which are associated with coronary heart disease. On this basis, the patients could be categorised into "statin responsive" or "statin unresponsive". In a second stage, the methods described herein could be re-applied to the untreated metabonomic fingerprint to identify pattern elements which predict future responses to statins. Thus, the clinician would know whether or not other patients should be treated with statins, without having to wait weeks or months to assess the outcome.

Tools for Drug Development

(a) Clinical evaluations of drug therapy and efficacy. As for therapeutic monitoring, the methods described herein can be used as one end-point in clinical trials for efficacy of new therapies. The extent to which sequential diagnostic fingerprints move towards normal can be used as one measure of the efficacy of the candidate therapy.

(b) Detection of toxic side-effects of drugs and model compounds (e.g., in the drug development process and in clinical trials). For example, it will be possible to identify the major sites of toxic effects (e.g., liver, kidney, etc.) for new treatments during Phase I studies, as well as identifying idiosyncratic toxicities during later stage clinical trials.

(c) Improvement in the quality control of transgenic animal models of disease; aiding the design of transgenic models of disease. Transgenic models of various diseases have been useful for the preclinical development of new therapies. Although the transgenic model may recapitulate many of the phenotypic markers of the human disease, it is often unclear whether similar biochemical mechanisms underlie the resulting phenotype.

(d) Other animal models of disease. For example, injection of bovine type II collagen into mice has often been used as a model of rheumatoid arthritis, resulting in joint swelling and autoantibodies, but the mechanisms resulting in the phenotype have little in common with the human disease. As a result, therapies which are effective in the animal model may be ineffective in man. The methods described herein can be used to examine the metabolic and phenotypic consequences of gene manipulation or other interventions used to yield an animal model of disease, and to compare those with the metabolic and phenotypic changes characteristic of the disease in man, and thereby validate a range of animal models of human diseases.

(e) Searching for new biochemical markers of disease and/or tissue or organ damage. For example, the NMR bin around δ 3.22 was identified as being particularly associated with coronary heart disease (see examples below), and the associated species has been identified as a novel metabolic marker of coronary heart disease which may be amenable to therapeutic intervention.

Commercial and Other Non-Medical Applications

(a) Commercial classification for actuarial assessment, to address the commercial need for insurance companies to assess future risk of disease. Examples include the provision of health insurance and general life cover. This application is similar to prognostic assessment and risk assessment in population screening, except that the purpose is to provide accurate actuarial information.

(b) Clinical trial enrolment, to address the commercial need for the ability to select individuals suffering from, or at risk of suffering from, a particular condition for enrolment in clinical trials. For example, at present, to perform a clinical trial to assess efficacy of a drug intended to prevent heart disease it would be necessary to enrol at least 4,000 subjects and follow them for 4 years. If it were possible to select individuals who were suffering from heart disease, it is estimated that it would be possible to use 400 subjects followed for 2 years, reducing the cost by 25-fold or more.

(c) Characterization and identification of illicit drugs, and the metabolic or biological consequences of substance abuse. As for monitoring patient compliance with desired therapeutics, the methods described herein can be used to examine the metabolic consequences of illegal substance abuse, permitting confirmation of the use of the substance, even if none of the substance or its metabolites are present in the system at the time of investigation. This circumvents the ability to use proscribed substances chronically, but to temporarily suspend their use to avoid being identified. This application could be applied to identification of habitual users of illegal drugs (such as heroin, cocaine, amphetamines, etc.) for police use, or for monitoring use of banned substances in sports (e.g., to detect use of anabolic steroids among athletes, etc.).

(d) Application to pathology and post-mortem studies. For example, the methods described herein could be used to identify the proximate cause of death in a subject undergoing post-mortem examination.

(e) Application to forensic science. For example, the methods described herein can be used to identify the metabolic consequences of a range of actions on a subject (who may be either dead or alive at the time of the investigation). For example, the methods described herein can be applied to identify metabolic consequences of asphyxiation, poisoning, sexual arousal, or fear.

(f) Analysis of samples other than mammalian cells or biofluids. For example, the methods described herein can be applied to a panel of wines, classified by experts for their quality. By recognising patterns associated with good quality, the methods described herein can be used by wine manufacturers during the preparation of blends, as well as by wine purchasers to facilitate a rapid and independent assessment of the quality of a given wine.

(g) The methods described herein can also be used to identify (known or novel) genotypes and/or phenotypes, and to determine an organism's phenotype or genotype. This may assist with the choice of a suitable treatment or facilitate assessment of its relevance in a drug development process. For example, the generation of metabonomic data in panels of individuals with disease states, infected states, or undergoing treatment may indicate response profiles of groups of individuals which can be differentiated into two or more subgroups, indicating that an allelic genetic basis for response to the disease, state, or treatment exists. For example, a particular phenotype may not be susceptible to treatment with a certain drug, while another phenotype may be susceptible to treatment. Conversely, one phenotype might show toxicity because of a failure to metabolise and hence excrete a drug, which drug might be safe in another phenotype as it does not exhibit this effect. For example, metabonomic methods can be used to determine the acetylator status of an organism: there are two phenotypes, corresponding to "fast" and "slow" acetylation of drug metabolites. Phenotyping can be achieved on the basis of the urine alone (i.e., without dosing a xenobiotic), or on the basis of urine following dosing with a xenobiotic which has the potential for acetylation (e.g., galactosamine). Similar methods can also be used to determine other differences, such as other enzymatic polymorphisms, for example, cytochrome P450 polymorphism.

The methods described herein may also be used in studies of the biochemical consequences of genetic modification, for example, in "knock-out" animals, where one or more genes have been removed or made non-functional; in "knock-in" animals, where one or more genes have been incorporated from the same or a different species; and in animals where the number of copies of a gene has been increased (as in the model which results in the over-expression of the beta amyloid protein in mouse brains as a model for Alzheimer's disease). Genes can be transferred between bacterial, plant and animal species.

The combination of genomic, proteomic, and metabonomic data sets into comprehensive "bionomic" systems may permit an holistic evaluation of perturbed in vivo function.

The methods described herein may be used as an alternative or adjunct to other methods, e.g., the various genomic, pharmacogenomic, and proteomic methods.

EXAMPLES

The following examples are provided solely to illustrate the present invention and are not intended to limit the scope of the invention, as described herein.

The methods of the present invention have been exemplified in their application to NMR spectra. Nonetheless, the methods of the present invention are similarly applicable to other types of spectra, such as those discussed above.

Example 1

900 1H NMR spectra were collected for Han-Wistar and Sprague-Dawley rat urines using a 600 MHz Bruker DRX600 NMR spectrometer. The rats comprised several different groups, each of which was a control group in a different toxicology study.

Animals were examined to ensure they were healthy before being included, and were housed in metabolism cages with a standard day/night cycle and standard diet, with urine collected either once or twice daily.

The 95% confidence intervals were calculated; results for 9 spectral regions (e.g., δ 4.12-4.08 reported as δ 4.10) are shown in Table 1.

It may be expected that, for a given spectral region, 45 of the 900 spectra (5% of 900) fall outside the 95% confidence interval. Corresponding spectral regions for two such "outside" spectra (Test 1, Test 2) are included in Table 1.
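The interval calculation described above can be sketched as follows. This is a minimal illustration only, assuming that the 95% confidence interval for each spectral region is taken as the 2.5th and 97.5th percentiles of the control intensities (the text does not state which estimator was used). The function names `percentile` and `region_intervals` are hypothetical, and the intensities are synthetic stand-ins, not data from the study.

```python
import random

def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in 0..100) of a pre-sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def region_intervals(spectra, coverage=95.0):
    """Return (low, median, high) for every spectral region.

    spectra: list of spectra, each a list of integrated region intensities.
    """
    tail = (100.0 - coverage) / 2.0
    out = []
    for region in zip(*spectra):  # iterate over regions (columns)
        vals = sorted(region)
        out.append((percentile(vals, tail),
                    percentile(vals, 50.0),
                    percentile(vals, 100.0 - tail)))
    return out

# Synthetic stand-in for 900 control spectra with 9 regions (assumed values).
rng = random.Random(0)
controls = [[rng.lognormvariate(0.0, 0.2) for _ in range(9)] for _ in range(900)]
intervals = region_intervals(controls)

# Count the control spectra lying outside the interval in the first region;
# by construction this should be close to 5% of 900.
low, median, high = intervals[0]
outside = sum(1 for s in controls if s[0] < low or s[0] > high)
print(f"low={low:.3f} median={median:.3f} high={high:.3f} outside={outside}")
```

Percentiles are used here because they make no assumption about the shape of the intensity distribution; a parametric interval (for example, mean plus or minus a multiple of the standard deviation) would be an alternative under a normality assumption.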

Table 1 (Part A)

δ (ppm)                    4.10      4.06      4.02      3.98      3.94
Ctrl spectra (x 10^-2)
  Median                   0.63      1.71      0.83      2.56      1.47
  95% low                  0.51      1.33      0.69      1.22      1.17
  95% high                 0.85      2.37      1.08      3.73      2.23
Test spectra (x 10^-2)
  Test 1                   0.68      2.25      1.10 #    2.30      2.45 #
  Test 2                   0.49 &    1.67      0.71      0.91 &    1.14 &

Table 1 (Part B)

δ (ppm)                    3.90      3.86      3.82      3.78
Ctrl spectra (x 10^-2)
  Median                   1.91      2.08      1.86      2.38
  95% low                  1.51      1.55      1.49      1.80
  95% high                 2.85      3.21      2.92      3.74
Test spectra (x 10^-2)
  Test 1                   2.99 #    3.90 #    3.11 #    4.50 #
  Test 2                   1.42 &    1.68      1.49      1.93

# = above 95% high threshold.

& = below 95% low threshold.

The data (for Test 1) are also illustrated in Figures 1 and 2. Figure 1 illustrates the 95% confidence interval high (A) and low (B) spectra, and the Test 1 spectrum (C), for a wide range of chemical shift (δ 0-10). The data for a narrow range of chemical shift (δ 3.78-4.10) for Test 1 are also illustrated in Figure 2.

As can be seen from Figure 2, the Test 1 spectrum departs from 95% of the control population in the chemical shift range δ 3.78-3.94. Metabolites in this window may be candidates as biomarkers or part of a biomarker combination.
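The comparison summarised in Table 1 can be sketched in a few lines of Python. The thresholds and test values below are transcribed from Table 1 (in units of 10^-2); the Test 1 value at 4.06 ppm is read as 2.25. The names `flag_regions` and `is_abnormal`, and the rule that a spectrum counts as abnormal if any region lies outside its interval, are assumptions made for illustration rather than the method as claimed.

```python
# Values transcribed from Table 1 (units of 1e-2), ordered by chemical shift.
SHIFTS = [4.10, 4.06, 4.02, 3.98, 3.94, 3.90, 3.86, 3.82, 3.78]
LOW    = [0.51, 1.33, 0.69, 1.22, 1.17, 1.51, 1.55, 1.49, 1.80]
HIGH   = [0.85, 2.37, 1.08, 3.73, 2.23, 2.85, 3.21, 2.92, 3.74]
TEST_1 = [0.68, 2.25, 1.10, 2.30, 2.45, 2.99, 3.90, 3.11, 4.50]
TEST_2 = [0.49, 1.67, 0.71, 0.91, 1.14, 1.42, 1.68, 1.49, 1.93]

def flag_regions(spectrum, low=LOW, high=HIGH):
    """Mark each region '#' (above high), '&' (below low) or '' (inside)."""
    flags = []
    for value, lo, hi in zip(spectrum, low, high):
        if value > hi:
            flags.append("#")
        elif value < lo:
            flags.append("&")
        else:
            flags.append("")
    return flags

def is_abnormal(spectrum):
    """Assumed rule: abnormal if any region lies outside its interval."""
    return any(flag_regions(spectrum))

for name, spectrum in (("Test 1", TEST_1), ("Test 2", TEST_2)):
    marked = [f"{d:.2f}{f}" for d, f in zip(SHIFTS, flag_regions(spectrum)) if f]
    print(name, "outside at:", ", ".join(marked))
```

In practice the decision rule might instead require a minimum number of adjacent out-of-range regions, to reduce the chance of flagging a spectrum on a single marginal value.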

Example 2

1H NMR spectra for a first (control) pool of 450 Han-Wistar rats were collected, as described in Example 1. From this pool, 100 sub-sets of 45 spectra each were selected by sampling with replacement ("bootstrap"). Statistics (mean, standard deviation, relative standard deviation, skew, and kurtosis) were calculated for each sub-set.

95% confidence intervals (L and H) were then calculated for each of the resulting sets of statistics. Results for 9 spectral regions (e.g., δ 4.12-4.08 reported as δ 4.10) are shown in Table 2.

1H NMR spectra for a second (test) pool of 45 Sprague-Dawley rats were collected, as described in Example 1. Statistics (mean, standard deviation, relative standard deviation, skew, and kurtosis) were calculated for this test set. Again, results for 9 spectral regions (e.g., δ 4.12-4.08 reported as δ 4.10) are shown in Table 2.
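The bootstrap procedure of this example can be sketched as follows for a single spectral region. This is an illustration under stated assumptions: the intervals are taken as percentiles of the 100 bootstrap statistics, kurtosis is computed as excess kurtosis (suggested by the negative values in Table 2, but not stated in the text), and the intensities are synthetic. The names `moments` and `bootstrap_intervals` are hypothetical.

```python
import random
import statistics

def moments(values):
    """Mean, std dev, relative std dev, skewness and excess kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "std": sd,
        "rel_std": sd / mean,
        "skew": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,
    }

def bootstrap_intervals(pool, n_subsets=100, subset_size=45,
                        coverage=95.0, seed=0):
    """Percentile interval (L, H) of each statistic over resampled sub-sets."""
    rng = random.Random(seed)
    # random.Random.choices draws with replacement.
    stats = [moments(rng.choices(pool, k=subset_size)) for _ in range(n_subsets)]
    tail = (100.0 - coverage) / 200.0
    lo_idx = int(round(tail * (n_subsets - 1)))
    hi_idx = int(round((1.0 - tail) * (n_subsets - 1)))
    result = {}
    for key in stats[0]:
        vals = sorted(s[key] for s in stats)
        result[key] = (vals[lo_idx], vals[hi_idx])
    return result

# Synthetic intensities for one spectral region (assumed, not study data).
rng = random.Random(1)
control_pool = [rng.gauss(1.0, 0.1) for _ in range(450)]
test_pool = [rng.gauss(1.3, 0.1) for _ in range(45)]

intervals = bootstrap_intervals(control_pool)
test_stats = moments(test_pool)
for key, (lo, hi) in intervals.items():
    value = test_stats[key]
    flag = "#" if value > hi else "&" if value < lo else ""
    print(f"{key:9s} L={lo:8.3f} H={hi:8.3f} test={value:8.3f} {flag}")
```

Matching the sub-set size (45) to the size of the test pool means that the bootstrap intervals reflect the sampling variability expected of a statistic computed from 45 spectra, so the test statistics can be compared against them directly.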

Table 2 (Part A)

δ (ppm)                        4.10      4.06      4.02      3.98      3.94
Control population: 95% confidence intervals for 45-sample sub-sets
  Mean (x 10^-2)          L    0.63      1.53      0.77      2.31      1.35
                          H    0.69      1.62      0.83      2.68      1.46
  Std dev (x 10^-3)       L    0.68      1.12      0.42      4.95      1.39
                          H    1.30      2.34      1.84      7.19      2.25
  Rel std dev (x 10^-1)   L    1.08      0.71      0.55      1.89      0.98
                          H    1.88      1.53      2.30      3.00      1.62
  Skewness                L    0.51     -2.39     -0.42     -0.91     -0.08
                          H    2.20      0.99      5.56      0.14      1.77
  Kurtosis                L   -0.54     -0.88     -0.67     -1.28     -0.87
                          H    7.08     13.30     34.90      0.59      5.28
Test population (45 samples)
  Mean (x 10^-2)               0.64      1.90 #    0.88 #    2.48      1.62 #
  Std dev (x 10^-3)            0.98      2.66 #    1.12      7.75 #    3.04 #
  Rel std dev (x 10^-1)        0.93 &    1.40      1.27      3.12 #    1.87
  Skewness                     0.70      0.40      0.39      0.02      1.34
  Kurtosis                    -0.19     -0.04     -0.81 &   -1.48 &    2.20

Table 2 (Part B)

δ (ppm)                        3.90      3.86      3.82      3.78
Control population: 95% confidence intervals for 45-sample sub-sets
  Mean (x 10^-2)          L    1.71      1.79      1.65      2.10
                          H    1.84      1.97      1.77      2.32
  Std dev (x 10^-3)       L    1.53      2.15      1.46      2.65
                          H    2.73      3.58      2.35      4.40
  Rel std dev (x 10^-1)   L    0.87      1.19      0.86      1.20
                          H    1.49      1.89      1.34      1.96
  Skewness                L    0.28      0.35      0.10      0.28
                          H    1.97      1.50      1.40      1.55
  Kurtosis                L   -0.66     -0.86     -0.90     -0.79
                          H    5.94      3.12      2.95      3.20
Test population (45 samples)
  Mean (x 10^-2)               2.18 #    2.42 #    2.15 #    2.72 #
  Std dev (x 10^-3)            3.64      4.34      3.91 #    5.34
  Rel std dev (x 10^-1)        1.67 #    1.79      1.82 #    1.96
  Skewness                     0.58      1.27      1.07      1.27
  Kurtosis                     0.07      2.25      0.96      2.15

# = above 95% high threshold.

& = below 95% low threshold.

The data are also illustrated in Figures 3, 4, 5, 6, and 7, which illustrate the 95% confidence interval high (A) and low (B) spectra for each of the different statistical properties calculated, and the corresponding statistical property spectrum for the set of test spectra (C), all for a narrow range of chemical shift (δ 3.78-4.10).

As can be seen from Figures 3-7, the test population clearly departs from 95% of the control population, in respect of mean, standard deviation, and relative standard deviation, at various points in the chemical shift range shown. Again, metabolites in the corresponding windows may be candidates as biomarkers or part of a biomarker combination, for example, for the differences between the two populations.

* * *

The foregoing has described the principles, preferred embodiments, and modes of operation of the present invention. However, the invention should not be construed as limited to the particular embodiments discussed. Instead, the above-described embodiments should be regarded as illustrative rather than restrictive, and it should be appreciated that variations may be made in those embodiments by workers skilled in the art without departing from the scope of the present invention as defined by the appended claims.

REFERENCES

A number of patents and publications are cited above in order to more fully describe and disclose the invention and the state of the art to which the invention pertains. Full citations for these references are provided below. Each of these references is incorporated herein by reference in its entirety into the present disclosure, to the same extent as if each individual reference was specifically and individually indicated to be incorporated by reference.

Anker, L. S., and Jurs, P. C., 1992, "Prediction of C-13 nuclear magnetic resonance chemical shifts by artificial neural networks," Anal. Chem., Vol. 64, pp. 1157-1164.

Anthony, M. L. et al., 1994, "Pattern recognition classification of the site of nephrotoxicity based on metabolic data derived from proton nuclear magnetic resonance spectra of urine," Mol. Pharmacol., Vol. 46, pp. 199-211.

Anthony, M. L. et al., 1995, "Classification of toxin-induced changes in 1H NMR spectra of urine using an artificial neural network," J. Pharm. Biomed. Anal., Vol. 13, pp. 205-211.

Beckwith-Hall, B. M. et al., 1998, "Nuclear magnetic spectroscopic and principal components analysis investigations into biochemical effects of three model hepatotoxins," Chem. Res. Tox., Vol. 11, pp. 260-272.

Bishop, C., 1995, Neural Networks for Pattern Recognition, University Press, Oxford, England, pp. 164-193.

Bretthorst, G. L., 1990a, "Bayesian Analysis. 2. Signal-Detection and Model Selection," J. Magn. Reson., Vol. 88, pp. 552-570.

Bretthorst, G. L., 1990b, "Bayesian Analysis. 3. Applications to NMR Signal-Detection, Model Selection, and Parameter-Estimation," J. Magn. Reson., Vol. 88, pp. 571-595.

Bretthorst, G. L., Hung, C. C., Davignon, D. A., et al., 1988, "Bayesian-Analysis of Time-Domain Magnetic Resonance Signals," J. Magn. Reson., Vol. 79, pp. 369-376.

Bro, R., 1997, "PARAFAC. Tutorial and applications," in Chemometrics and Intelligent Laboratory Systems, Vol. 38, pp. 149-171.

Broomhead, D. S., and Lowe, D., 1988, "Multi-variable functional interpolation and adaptive networks," Complex Systems, Vol. 2, pp. 321-355.

Brown, T. R. and Stoyanova, R., 1996, "NMR spectral quantitation by principal-component analysis. 2. Determination of frequency and phase shifts," J. Magn. Reson., Series B, Vol. 112, pp. 32-43.

Claridge, T. D. W., 2000, High-Resolution NMR Techniques in Organic Chemistry: A Practical Guide to Modern NMR for Chemists, Oxford University Press.

Confort-Gouny, S., Vion-Dury, J., Nicoli, F., Dano, P., Gastaut, J.-L., and Cozzone, P. J., 1992, "Metabolic characterization of neurological diseases by proton localized nmr-spectroscopy of the human brain," Comptes Rendus de l'Academie des Sciences Serie III-Sciences de la Vie-Life Sciences, Vol. 315, pp. 287-293.

Fan, T. W.-M., 1996, "Metabolite profiling by one- and two-dimensional NMR analysis of complex mixtures," Prog. NMR Spectrosc., Vol. 28, pp. 161-219.

Farrant, R. D., et al., 1992, "An automatic data reduction and transfer method to aid pattern-recognition analysis and classification of NMR spectra," J. Pharm. Biomed. Anal., Vol. 10, pp. 141-144.

Frank, I. E., et al., 1984, "Prediction of product quality from spectral data using the partial least-squares method," J. Chem. Info. Comp., Vol. 24, pp. 20-24.

Garrod, S., Humpher, E., Connor, S. C., Connelly, J. C., Spraul, M., Nicholson, J. K., and Holmes, E., 2001, "High-resolution H-1 NMR and magic angle spinning NMR spectroscopic investigation of the biochemical effects of 2-bromoethanamine in intact renal and hepatic tissue," Magn. Reson. Med., Vol. 45, pp. 781-790.

Gartland, K. P. R. et al., 1990a, "A pattern recognition approach to the comparison of 1H NMR and clinical chemical data for classification of nephrotoxicity," J. Pharm. Biomed. Anal., Vol. 8, pp. 963-968.

Gartland, K. P. R. et al., 1990b, "Pattern recognition analysis of high resolution 1H NMR spectra of urine. A nonlinear mapping approach to the classification of toxicological data," NMR in Biomed., Vol. 3, pp. 166-172.

Gartland, K. P. R. et al., 1991, "The application of pattern recognition methods to the analysis and classification of toxicological data derived from proton NMR spectroscopy of urine," Mol. Pharmacol., Vol. 39, pp. 629-642.

Geisow, M. J., 1998, "Proteomics: One small step for a digital computer, one giant leap for humankind," Nature Biotechnology, Vol. 16, p. 206.

Gygi, S. P., Rochon, Y., Franza, B. R., and Aebersold, R., 1999, "Correlation between protein and mRNA abundance in yeast," Molecular and Cellular Biology, Vol. 19, pp. 1720-1730.

Hare, B. J., and Prestegard, J. H., 1994, "Application of neural networks to automated assignment of NMR spectra of proteins," J. Biomol. NMR, Vol. 4, pp. 35-46.

Holmes, E. et al., 1998a, "Development of a model for classification of toxin-induced lesions using 1H NMR spectroscopy of urine combined with pattern recognition," NMR in Biomed., Vol. 11, pp. 235-244.

Holmes, E. et al., 1998b, "The identification of novel biomarkers of renal toxicity using automatic data reduction techniques and PCA of proton NMR spectra of urine," Chemomet. & Intel. Lab Systems, Vol. 44, pp. 245-255.

Holmes, E., et al., 1992, "NMR spectroscopy and pattern recognition analysis of the biochemical processes associated with the progression and recovery from nephrotoxic lesions in the rat induced by mercury(II) chloride and 2-bromoethanamine," Mol. Pharmacol., Vol. 42, pp. 922-930.

Holmes, E., et al., 1994, "Automatic data reduction and pattern recognition methods for analysis of 1H NMR spectra of human urine from normal and pathological states," Anal. Biochem., Vol. 220, pp. 284-296.

Howells, S. L., Maxwell, R. J., Howe, F. A., Peet, A. C., Stubbs, M., Rodrigues, L. M., Robinson, S. P., Baluch, S., and Griffiths, J. R., 1993, "Pattern-recognition of P-31 magnetic-resonance spectroscopy tumor spectra obtained in-vivo," NMR Biomed., Vol. 6, pp. 237-241.

Joreskog, K. G., and Wold, H., 1982, Systems under Indirect Observation, North Holland, Amsterdam.

Klenk, H. P., et al., 1997, "The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus," Nature, Vol. 390, pp. 364-370.

Kopka, P. Dormann, T. Altmann, R. N. Trethewey and L. Willmitzer, 2000, "Metabolic profiling for plant functional genomics," Nature Biotechnology, Vol. 18, pp. 1157-1161.

Kowalski, B. R., Sharaf, M. and Illman D., Chemometrics (John Wiley & Sons, Chichester, 1986).

Kuesel, A. C., Stoyanova, R., Aiken, N. R., Li, C.-W., Szwergold, B. S., Shaller, C. and Brown, T. R., 1996, "Quantitation of resonances in biological P-31 NMR spectra via principal component analysis: Potential and limitations," NMR Biomed., Vol. 9, pp. 93-104.

Lindon, J. C., et al., 1980, "Digitisation and Data Processing in Fourier Transform NMR," Progress in NMR Spectroscopy, Vol. 14, pp. 27-66.

Lindon, J. C., et al., 1999, "NMR spectroscopy of biofluids," in Annual Reports on NMR Spectroscopy (Webb, G. A., ed.), Academic Press (London), Vol. 38, pp. 1-88.

Lindon, J. C., Holmes, E., and Nicholson, J. K., 2001, "Pattern recognition methods and applications in biomedical magnetic resonance," Progress in NMR Spectroscopy, Vol. 39, pp. 1-40.

Martin, G. J., 1998, "Recent advances in site-specific natural isotope fractionation studied by nuclear magnetic resonance," Isotopes in Environmental and Health Studies, Vol. 34, pp. 233-243.

Martin, M. L. and Martin, G. J., 1999, "Site-specific isotope effects and origin inference," Analysis, Vol. 27, pp. 209-213.

Moka, D., et al., 1998, "Biochemical classification of kidney carcinoma biopsy samples using magic angle spinning NMR spectroscopy," J. Pharm. Biomed. Anal., Vol. 17, pp. 125-132.

Nicholson, J. K. et al., 1989, "High resolution proton magnetic resonance spectroscopy of biological fluids," Prog. NMR Spectrosc., Vol. 21, pp. 449-501.

Nicholson, J. K., et al., 1999, "Metabonomics: understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data," Xenobiotica, Vol. 29, pp. 1181-1189.

Nilsson, N. J., 1965, Learning Machines, McGraw-Hill, New York.

Parzen, E., 1962, "On estimation of a probability density function and mode," Ann. Mathemat. Stat., Vol. 33, pp. 1065-1076.

Patterson, D., 1996, Artificial Neural Networks, Prentice Hall, Singapore.

Quinlan, J. R., 1986, "Induction of decision trees," Machine Learning, Vol. 1, pp. 81-106.

Somorjai, R. L., Nikulin, A. E., Pizzi, N., Jackson, D., Scarth, G., Dolenko, B., Gordon, H., Russell, P., Lean, C. L., Delbridge, L., Mountford, C. E., and Smith, I. C. P., 1995, "Computerized consensus diagnosis: a classification strategy for the robust analysis of MR spectra. 1. Application to H-1 spectra of thyroid neoplasms," Magn. Reson. Med., Vol. 33, pp. 257-263.

Specht, D. F., 1990, "Probabilistic Neural Networks," Neur. Networks, Vol. 3, pp. 109-118.

Spraul, M. et al., 1994, "Automatic reduction of NMR spectroscopic data for statistical and pattern recognition classification of samples," J. Pharm. Biomed. Anal., Vol. 12, pp. 1215-1225.

Stoyanova, R., Kuesel, A. C., and Brown, T. R., 1995, "Application of principal-component analysis for NMR spectral quantitation," J. Magn. Reson., Series A, Vol. 115, pp. 265-269.

Sze, D. Y., et al., 1994, "High-resolution proton NMR studies of lymphocyte extracts," Immunomethods, Vol. 4, pp. 113-126.

Tomlins, A. M. et al., 1998, "High resolution magic angle spinning 1H NMR analysis of intact prostatic hyperplastic and tumour tissues," Anal. Comm., Vol. 35, pp. 113-115.

Tranter, G. E., et al., 1999, "Metabonomic prediction of drug toxicity via probabilistic neural network analysis of NMR biofluid data," Abstr. 9th North American ISSX Meeting, Oct 24-28, 1999, p. 246.

Wasserman, P. D., 1989, Neural Computing: Theory and Practice, (Van Nostrand, ed.) Reinhold, New York, USA.

Weber, O. M., Duc, C. O., Meier, D., and Boesiger, P., 1998, "Heuristic optimization algorithms applied to the quantification of spectroscopic data," Magn. Reson. Med., Vol. 39, pp. 723-730.

Wold, H., 1966, in Multivariate Analysis (P. R. Krishnaiah, Ed.) Academic Press, New York.

Wold, S., 1976, "Pattern recognition by means of disjoint principal components models," Pattern Recog., Vol. 8, pp. 127-139.
