Title:
MOLECULAR MARKERS FOR CANCER PROGNOSIS
Document Type and Number:
WIPO Patent Application WO/2009/132928
Kind Code:
A3
Abstract:
The present invention is predicated on a method of identification of a panel of genes informative for the outcome of disease which can be combined into an algorithm for a prognostic or predictive test. The algorithm makes use of gene expression data from biological samples and classifies patients as having a high risk or low risk, e.g. in cancer patients a metastasis bad outcome or good outcome group. Reference patterns of gene expression are obtained for the high risk and low risk groups, respectively. A sample of an unknown patient is analyzed and classified as belonging to the high risk or low risk group, respectively, depending on correlation to the high risk reference pattern or low risk reference pattern.

Inventors:
STROPP UDO (DE)
VON TOERNE CHRISTIAN (DE)
GEHRMANN MATHIAS (DE)
KRONENWETT RALF (DE)
Application Number:
PCT/EP2009/054034
Publication Date:
December 23, 2009
Filing Date:
April 03, 2009
Assignee:
SIEMENS HEALTHCARE DIAGNOSTICS (DE)
STROPP UDO (DE)
VON TOERNE CHRISTIAN (DE)
GEHRMANN MATHIAS (DE)
KRONENWETT RALF (DE)
International Classes:
C12Q1/68
Domestic Patent References:
WO2005033699A2 2005-04-14
WO2007030611A2 2007-03-15
WO2006136314A1 2006-12-28
Foreign References:
US20030158399A1 2003-08-21
Other References:
GLINSKY G V ET AL: "Classification of human breast cancer using gene expression profiling as a component of the survival predictor algorithm", CLINICAL CANCER RESEARCH, THE AMERICAN ASSOCIATION FOR CANCER RESEARCH, US, vol. 10, no. 7, 1 April 2004 (2004-04-01), pages 2272 - 2283, XP002309961, ISSN: 1078-0432
SORLIE T ET AL: "Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF USA, NATIONAL ACADEMY OF SCIENCE, WASHINGTON, DC, US, vol. 98, no. 19, 11 September 2001 (2001-09-11), pages 10869 - 10874, XP002215483, ISSN: 0027-8424
VEER VAN 'T L J ET AL: "Gene expression profiling predicts clinical outcome of breast cancer", NATURE, NATURE PUBLISHING GROUP, LONDON, UK, vol. 415, no. 6871, 31 January 2002 (2002-01-31), pages 530 - 536, XP002259781, ISSN: 0028-0836
Attorney, Agent or Firm:
MAIER, Daniel (München, DE)
Claims:

Patent claims

1. A method for predicting an outcome of a patient suffering from or at risk of developing a neoplastic disease, said method comprising the steps of:

a) obtaining a biological sample from said patient; b) quantifiably determining the gene expression levels of a plurality of genes, thereby obtaining a pattern of expression levels of said plurality of genes; c) comparing the pattern of expression levels determined in (b) with at least a first reference pattern of expression levels indicative of a first outcome and belonging to a first outcome category and a second pattern of expression levels indicative of a second outcome and belonging to a second outcome category; and d) predicting an outcome of a patient from the comparison in step (c) by use of a mathematical function to determine the similarity of said pattern of expression levels with said first reference pattern and said second reference pattern wherein said outcome is predicted according to which of said first and second reference pattern bears greater similarity with said pattern of expression levels.

2. Method of claim 1, wherein said first reference pattern and second reference pattern is obtained by a method comprising the steps of: a) classifying a plurality of reference patients as belonging to the first outcome or second outcome category according to clinical data and/or a known outcome of disease; b) determining from biological samples from said reference patients the gene expression level for each gene of the plurality of genes from each reference patient; and c) determining an average level of gene expression for each gene for those patients belonging to the first outcome category and for those patients belonging to the second outcome category; wherein the first reference pattern comprises the average gene expression levels of patients belonging to the first outcome category and the second reference pattern comprises the average gene expression levels of patients belonging to the second outcome category.

3. Method of any of the preceding claims, wherein first outcome and second outcome are indicative of a high risk or low risk, respectively, of developing metastasis.

4. Method of any of the preceding claims, wherein the expression level is determined by

a) a hybridization based method; b) a PCR based method; c) determining the protein level, d) a method based on the electrochemical detection of particular molecules, and/or by e) an array based method.

5. Method of any of the preceding claims, characterized in that the expression level of at least one of the said ligands of is determined in formalin and/or paraffin fixed tissue samples.

6. Method of any of the preceding claims, wherein, after lysis, the samples are treated with silica-coated magnetic particles and a chaotropic salt, in order to purify the nucleic acids contained in said sample for further determination.

7. Method of any of the preceding claims, wherein the gene expression level of 3 to 15 genes is determined.

8. Method of any of the preceding claims, wherein the gene expression of at least one of the genes of table 3 is determined.

9. Method of any of the preceding claims, wherein the gene expression of at least one of the genes of table 3 is determined.

10. Method of claim 9, wherein an average gene expression level of a plurality of the genes is determined by using a group of probes containing probes specific for each of said plurality of genes wherein said group of probes yield one combined detectable signal indicative of said average gene expression level.

11. Method of any of the preceding claims, wherein a nucleic acid having a sequence of at least one of sequence protocol SEQ ID NO 1-180 or a fragment thereof is used as a probe to determine a gene expression level.

12. Method of any of the preceding claims, wherein predicting the outcome of a patient by classifying said pattern of expression levels of said patient as belonging to the first outcome or second outcome category is performed by using a mathematical discriminant function.

13. Method of any of the preceding claims, wherein predicting the outcome of a patient by classifying said pattern of expression levels of said patient as belonging to the first outcome or second outcome category comprises determining Pearson's correlation coefficient for said first reference pattern and said second reference pattern.

14. Method of any of the preceding claims, wherein the neoplastic disease is breast cancer.

15. Use of a sequence of any of SEQ ID NO 1 to 180 or a fragment thereof in the method according to any of claims 1 to 14.

Description:

Description

Molecular markers for cancer prognosis

The present invention relates to methods for prediction of an outcome of neoplastic disease or cancer. More specifically, the present invention relates to a method for the prediction of breast cancer.

Cancer is a genetically and clinically complex disease with multiple parameters determining outcome and suitable therapy of disease. It is common practice to classify patients into different stages, grades, classes of disease status and the like and to use such classification to predict disease outcome and for choice of therapy options. It is for example desirable to be able to predict a risk of recurrence of disease, risk of metastasis and the like.

The metastatic potential of primary tumors is the chief prognostic determinant of malignant disease. Therefore, predicting the risk of a patient developing metastasis is an important factor in predicting the outcome of disease and choosing an appropriate treatment.

As an example, breast cancer is the leading cause of death in women between the ages of 35 and 55. Worldwide, there are over 3 million women living with breast cancer. The OECD (Organization for Economic Cooperation & Development) estimates that, worldwide, 500,000 new cases of breast cancer are diagnosed each year. One out of ten women will face a diagnosis of breast cancer at some point during her lifetime. Breast cancer is the abnormal growth of cells that line the breast tissue ducts and lobules. It is classified by whether the cancer started in the ducts or the lobules, by whether the cells have invaded (grown or spread) through the duct or lobule, and by the way the cells appear under the microscope (tissue histology). It is not unusual for a single breast tumor to have a mixture of invasive and in situ cancer.

According to today's therapy guidelines and current medical practice, the selection of a specific therapeutic intervention is mainly based on histology, grading, staging and hormonal status of the patient. Many aspects of a patient's specific type of tumor are currently not assessed, which prevents true patient-tailored treatment. Another dilemma of today's breast cancer therapeutic regimens is the practice of significant over-treatment of patients; it is well known from past clinical trials that 70% of breast cancer patients with early stage disease do not need any treatment beyond surgery. While about 90% of all early stage cancer patients receive chemotherapy exposing them to significant treatment side effects, approximately 30% of patients with early stage breast cancer relapse. These types of problems are common to other forms of cancer as well. As such, there is a significant medical need to develop diagnostic assays that identify low risk patients for directed therapy. For patients with medium or high risk assessment, there is a need to pinpoint therapeutic regimens tailored to the specific cancer to assure optimal success. Predicting breast cancer metastasis and disease-free survival is a challenge for all pathologists and treating oncologists. A test that can predict such features would meet a significant medical and diagnostic need.

About 80% of all breast cancers diagnosed in the US and Europe are node-negative. What is needed are diagnostic tests and methods which can assess certain disease-related risks, e.g. risk of development of metastasis.

Technologies such as quantitative PCR, microarray analysis, and others allow the analysis of genome-wide expression patterns which provide new insight into gene regulation and are also a useful diagnostic tool because they allow the analysis of pathologic conditions at the level of gene expression. Quantitative reverse transcriptase PCR is currently the accepted standard for quantifying gene expression. It has the advantage of being a very sensitive method allowing the detection of even minute amounts of mRNA.

Microarray analysis is fast becoming a new standard for quantifying gene expression.

Curing breast cancer patients is still a challenge for the treating oncologist as the diagnosis relies in most cases on clinical data such as etiopathological and pathological data like age, menopausal status, hormonal status, grading, and general constitution of the patient, and some molecular markers like Her2/neu, p53, and some others. Unfortunately, until recently, there was no test on the market for prognosis or therapy prediction that gives the treating oncologist a more elaborate recommendation on whether and how to treat patients. Two assay systems are currently available for prognosis, Genomic Health's OncotypeDX and Agendia's Mammaprint assay. In 2007, the company Agendia received FDA approval for its Mammaprint microarray assay, which uses 70 informative genes and a set of housekeeping genes to predict the prognosis of breast cancer patients from fresh tissue (Glas A.M. et al., Converting a breast cancer microarray signature into a high-throughput diagnostic test, BMC Genomics. 2006 Oct 30; 7:278). The Genomic Health assay works with formalin-fixed and paraffin-embedded tumor tissue samples and uses 21 genes for the prognosis, presented as a risk score (Esteva FT et al. Prognostic role of a multigene reverse transcriptase-PCR assay in patients with node-negative breast cancer not receiving adjuvant systemic therapy. Clin Cancer Res 2005; 11: 3315-3319).

Both these assays use a high number of different markers to arrive at a result and require a high number of internal controls to ensure accurate results. What is needed is a simple and robust assay for prediction and/or prognosis of cancer.

Objective of the invention

It is an objective of the invention to provide a method for the prediction and/or prognosis of cancer relying on a limited number of markers.

It is a further objective of the invention to provide a kit for performing the method of the invention.

Definitions

The term "neoplastic disease", "neoplastic region", or "neoplastic tissue" refers to a tumorous tissue including carcinoma (e.g. carcinoma in situ, invasive carcinoma, metastasis carcinoma) and pre-malignant conditions, neomorphic changes independent of their histological origin, cancer, or cancerous disease.

The term "cancer" is not limited to any stage, grade, histomorphological feature, aggressivity, or malignancy of an affected tissue or cell aggregation. In particular, solid tumors, malignant lymphoma and all other types of cancerous tissue, malignancy and transformations associated therewith, lung cancer, ovarian cancer, cervix cancer, stomach cancer, pancreas cancer, prostate cancer, head and neck cancer, renal cell cancer, colon cancer or breast cancer are included. The terms "neoplastic lesion" or "neoplastic disease" or "neoplasm" or "cancer" are not limited to any tissue or cell type. They also include primary, secondary, or metastatic lesions of cancer patients, and also shall comprise lymph nodes affected by cancer cells or minimal residual disease cells either locally deposited or freely floating throughout the patient's body.

The term "predicting an outcome" of a disease, as used herein, is meant to include both a prediction of an outcome of a patient undergoing a given therapy and a prognosis of a patient who is not treated. The term "predicting an outcome" may, in particular, relate to the risk of a patient developing metastasis.

The term "prediction", as used herein, relates to an individual assessment of the malignancy of a tumor, or to the expected survival rate (DFS, disease free survival) of a patient, if the tumor is treated with a given therapy. In contrast thereto, the term "prognosis" relates to an individual assessment of the malignancy of a tumor, or to the expected survival rate (DFS, disease free survival) of a patient, if the tumor remains untreated.

A "discriminant function" is a function of a set of variables used to classify an object or event. A discriminant function thus allows classification of a patient, sample or event into a category or a plurality of categories according to data or parameters available from said patient, sample or event. Such classification is a standard instrument of statistical analysis well known to the skilled person. E.g. a patient may be classified as "high risk" or "low risk", "high probability of metastasis" or "low probability of metastasis", "in need of treatment" or "not in need of treatment" according to data obtained from said patient, sample or event. Classification is not limited to "high vs. low", but may be performed into a plurality of categories, grading or the like. Classification shall also be understood in a wider sense as a discriminating score, where e.g. a higher score represents a higher likelihood of distant metastasis, e.g. the (overall) risk of a distant metastasis. Examples for discriminant functions which allow a classification include, but are not limited to, functions defined by support vector machines (SVM), k-nearest neighbors (kNN), (naive) Bayes models, or piecewise defined functions such as, for example, in subgroup discovery, in decision trees, in logical analysis of data (LAD) and the like. In a wider sense, continuous score values of mathematical methods or algorithms, such as correlation coefficients, projections, support vector machine scores, other similarity-based methods and the like are examples for illustrative purpose.
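As an illustration of the wider sense described above, a continuous discriminating score can be turned into a classification by comparing it with one or more cut-offs. The following is a minimal Python sketch only; the function name `classify_by_score`, the cut-off values and the category labels are hypothetical and are not taken from the invention.

```python
from bisect import bisect_right

def classify_by_score(score, thresholds, labels):
    """Map a continuous discriminating score to one of several ordered categories.

    thresholds must be sorted in ascending order; labels needs exactly one
    more entry than thresholds. A score equal to a threshold falls into the
    higher category.
    """
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    return labels[bisect_right(thresholds, score)]

# Two categories with an illustrative cut-off of 0.0
print(classify_by_score(-0.3, [0.0], ["low risk", "high risk"]))

# Three categories with illustrative cut-offs
print(classify_by_score(0.5, [0.2, 0.8], ["low", "intermediate", "high"]))
```

With a single cut-off this reduces to a "high vs. low" classification; additional cut-offs give a grading into a plurality of categories.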

An "outcome" within the meaning of the present invention is a defined condition attained in the course of the disease. This disease outcome may e.g. be a clinical condition such as "recurrence of disease", "development of metastasis", "development of nodal metastasis", "development of distant metastasis", "survival", "death", a disease stage or grade or the like.

A "risk" is understood to be a probability of a subject or a patient to develop or arrive at a certain disease outcome. The term "risk" in the context of the present invention is not meant to carry any positive or negative connotation with regard to a patient's wellbeing but merely refers to a probability or likelihood of an occurrence or development of a given condition.

The term "clinical data" relates to the entirety of available data and information concerning the health status of a patient including, but not limited to, age, sex, weight, menopausal/hormonal status, etiopathology data, anamnesis data, data obtained by in vitro diagnostic methods such as blood or urine tests, data obtained by imaging methods, such as x-ray, computed tomography, MRI, PET, SPECT, ultrasound, electrophysiological data, genetic analysis, gene expression analysis, biopsy evaluation, intraoperative findings.

The term "etiopathology" relates to the course of a disease, that is its duration, its clinical symptoms, signs and parameters, and its outcome.

The term "anamnesis" relates to patient data gained by a physician or other healthcare professional by asking specific questions, either of the patient or of other people who know the person and can give suitable information (in this case, it is sometimes called heteroanamnesis), with the aim of obtaining information useful in formulating a diagnosis and providing medical care to the patient. This kind of information is called the symptoms, in contrast with clinical signs, which are ascertained by direct examination.

In the context of the present invention a "biological sample" is a sample which is derived from or has been in contact with a biological organism. Examples for biological samples are: cells, tissue, body fluids, lavage fluid, smear samples, biopsy specimens, blood, urine, saliva, sputum, plasma, serum, cell culture supernatant, and others.

A "biological molecule" within the meaning of the present invention is a molecule generated or produced by a biological organism or indirectly derived from a molecule generated by a biological organism, including, but not limited to, nucleic acids, protein, polypeptide, peptide, DNA, mRNA, cDNA, and so on.

A "probe" is a molecule or substance capable of specifically binding or interacting with a specific biological molecule. The term "primer", "primer pair" or "probe", shall have the ordinary meaning of these terms which is known to the person skilled in the art of molecular biology. In a preferred embodiment of the invention "primer", "primer pair" and "probes" refer to oligonucleotide or polynucleotide molecules with a sequence identical to, complementary to, homologues of, or homologous to regions of the target molecule or target sequence which is to be detected or quantified, such that the primer, primer pair or probe can specifically bind to the target molecule, e.g. target nucleic acid, RNA, DNA, cDNA, gene, transcript, peptide, polypeptide, or protein to be detected or quantified. As understood herein, a primer may in itself function as a probe. A "probe" as understood herein may also comprise e.g. a combination of primer pair and internal labeled probe, as is common in many commercially available qPCR methods.

A "gene" is a set of segments of nucleic acid that contains the information necessary to produce a functional RNA product. A "gene product" is a biological molecule produced through transcription or expression of a gene, e.g. an mRNA or the translated protein.

An "mRNA" is the transcribed product of a gene and shall have the ordinary meaning understood by a person skilled in the art. A "molecule derived from an mRNA" is a molecule which is chemically or enzymatically obtained from an mRNA template, such as cDNA.

The term "specifically binding" within the context of the present invention means a specific interaction between a probe and a biological molecule leading to a binding complex of probe and biological molecule, such as DNA-DNA binding, RNA-DNA binding, RNA-RNA binding, DNA-protein binding, protein-protein binding, RNA-protein binding, antibody-antigen binding, and so on.

The term "expression level" refers to a determined level of gene expression. This may be a determined level of gene expression compared to a reference gene (e.g. a housekeeping gene) or to a computed average expression value (e.g. in DNA chip analysis) or to another informative gene without the use of a reference sample. The expression level of a gene may be measured directly, e.g. by obtaining a signal wherein the signal strength is correlated to the amount of mRNA transcripts of that gene or it may be obtained indirectly at a protein level, e.g. by immunohistochemistry, CISH, ELISA or RIA methods. The expression level may also be obtained by way of a competitive reaction to a reference sample.
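One common way of expressing a level relative to a reference gene in qPCR is the delta-Ct calculation. The sketch below assumes that expression is measured as qPCR cycle-threshold (Ct) values and that the amount of product doubles in every cycle; neither assumption is stated in the text, and the function names and sample values are hypothetical.

```python
def delta_ct(ct_target, ct_reference):
    """Ct of the gene of interest minus Ct of the reference (e.g. housekeeping) gene."""
    return ct_target - ct_reference

def relative_expression(ct_target, ct_reference):
    """Expression relative to the reference gene, assuming a doubling per cycle."""
    return 2.0 ** (-delta_ct(ct_target, ct_reference))

# Hypothetical Ct values: the target is detected 2 cycles later than the reference
print(relative_expression(ct_target=26.0, ct_reference=24.0))  # 0.25
```

A lower Ct means more template, so a target that appears two cycles after the reference is estimated at one quarter of the reference level under these assumptions.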

A "reference pattern of expression levels", within the meaning of the invention shall be understood as being any pattern of expression levels that can be used for the comparison to another pattern of expression levels. In a preferred embodiment of the invention, a reference pattern of expression levels is, e.g., an average pattern of expression levels observed in a group of healthy or diseased individuals, serving as a reference group.

The term "complementary" or "sufficiently complementary" means a degree of complementarity which is - under given assay conditions - sufficient to allow the formation of a binding complex of a primer or probe to a target molecule. Assay conditions which have an influence on binding of probe to target include temperature, solution conditions, such as composition, pH, ion concentrations, etc. as is known to the skilled person.

The term "hybridization-based method", as used herein, refers to methods involving a process of combining complementary, single-stranded nucleic acids or nucleotide analogues into a single double stranded molecule. Nucleotides or nucleotide analogues will bind to their complement under normal conditions, so two perfectly complementary strands will bind to each other readily. In bioanalytics, very often labeled, single stranded probes are used in order to find complementary target sequences. If such sequences exist in the sample, the probes will hybridize to said sequences which can then be detected due to the label. Other hybridization based methods comprise microarray and/or biochip methods. Therein, probes are immobilized on a solid phase, which is then exposed to a sample. If complementary nucleic acids exist in the sample, these will hybridize to the probes and can thus be detected. Hybridization is dependent on target and probe (e.g. length of matching sequence, GC content) and hybridization conditions (temperature, solvent, pH, ion concentrations, presence of denaturing agents, etc.). A "hybridizing counterpart" of a nucleic acid is understood to mean a probe or capture sequence which under given assay conditions hybridizes to said nucleic acid and forms a binding complex with said nucleic acid. Normal conditions refers to temperature and solvent conditions and are understood to mean conditions under which a probe can hybridize to allelic variants of a nucleic acid but does not unspecifically bind to unrelated genes. These conditions are known to the skilled person and are e.g. described in "Molecular Cloning. A laboratory manual", Cold Spring Harbor Laboratory Press, 2nd ed., 1989. Normal conditions would be e.g. hybridization at 6 x sodium chloride/sodium citrate buffer (SSC) at about 45°C, followed by washing or rinsing with 2 x SSC at about 50°C, or e.g. conditions used in standard PCR protocols, such as annealing temperature of 40 to 60°C in standard PCR reaction mix or buffer.

The term "array" refers to an arrangement of addressable locations on a device, e.g. a chip device. The number of locations can range from several to at least hundreds or thousands. Each location represents an independent reaction site. Arrays include, but are not limited to nucleic acid arrays, protein arrays and antibody-arrays. A "nucleic acid array" refers to an array containing nucleic acid probes, such as oligonucleotides, polynucleotides or larger portions of genes. The nucleic acid on the array is preferably single stranded. A "microarray" refers to a biochip or biological chip, i.e. an array of regions having a density of discrete regions with immobilized probes of at least about 100/cm².

A "PCR-based method" refers to methods comprising a polymerase chain reaction (PCR). This is a method of exponentially amplifying nucleic acids, e.g. DNA or RNA, by enzymatic replication in vitro using one, two or more primers. For RNA amplification, a reverse transcription may be used as a first step. PCR-based methods comprise kinetic or quantitative PCR (qPCR), which is particularly suited for the analysis of expression levels.

The term "determining a protein level" refers to any method suitable for quantifying the amount, amount relative to a standard or concentration of a given protein in a sample. Commonly used methods to determine the amount of a given protein are e.g. immunohistochemistry, CISH, ELISA or RIA methods, etc.

The term "reacting" a probe with a biological molecule to form a binding complex herein means bringing probe and biological molecule into contact, for example, in liquid solution, for a time period and under conditions sufficient to form a binding complex.

The term "label" within the context of the present invention refers to any means which can yield or generate or lead to a detectable signal when a probe specifically binds a biological molecule to form a binding complex. This can be a label in the traditional sense, such as enzymatic label, fluorophore, chromophore, dye, radioactive label, luminescent label, gold label, and others. In a more general sense the term "label" herein is meant to encompass any means capable of detecting a binding complex and yielding a detectable signal, which can be detected, e.g. by sensors with optical detection, electrical detection, chemical detection, gravimetric detection (i.e. detecting a change in mass), and others. Further examples for labels specifically include labels commonly used in qPCR methods, such as the commonly used dyes FAM, VIC, TET, HEX, JOE, Texas Red, Yakima Yellow, quenchers like TAMRA, minor groove binder, dark quencher, and others, or probe indirect staining of PCR products by for example SYBR Green. Readout can be performed on hybridization platforms, like Affymetrix, Agilent, Illumina, Planar Wave Guides, Luminex, microarray devices with optical, magnetic, electrochemical, gravimetric detection systems, and others. A label can be directly attached to a probe or indirectly bound to a probe, e.g. by secondary antibody, by biotin-streptavidin interaction or the like.

The term "combined detectable signal" within the meaning of the present invention means a signal, which results, when at least two different biological molecules form a binding complex with their respective probes and one common label yields a detectable signal for either binding event.

Summary of the invention

The present invention is predicated on a method of identification of a panel of genes informative for the outcome of disease which can be combined into an algorithm for a prognostic or predictive test. The inventive method makes use of gene expression data from biological samples and classifies patients as having a first or second outcome, a high risk or low risk for a certain outcome, such as e.g. metastasis, bad outcome, or good outcome group.

In general terms, the invention is based on two separate steps:

Step one is a classifier training comprising:
a) Obtaining data in a patient cohort (e.g. data relating to a clinical outcome, gene expression data, other clinical data);
b) Determination of classes in said patients according to at least a first and a second outcome;
c) Selection of inputs (e.g. selection of features that will be used in an algorithm for classifying samples in step two, i.e. a subset of the said data, e.g. gene expression values);
d) Determination of an algorithmic measure of similarity;
e) Determination of class reference profiles, one each for said first outcome and said second outcome.

Step two is the classification of an unknown sample as belonging to a first outcome group or second outcome group comprising:
a) Obtaining the data needed, as in 1c), from said patient;
b) Determination of class similarity by using said data to compute similarity (1d) to said classes (1b);
c) Use of a mathematical discriminant function to obtain a classification of said sample from the said class similarities (2b).

In a preferred embodiment, the mathematical function is the selection of the class with the highest similarity to the unknown sample.
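Step two can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of the invention: Pearson's correlation coefficient is assumed as the measure of similarity, the class reference profiles are assumed to be already available from step one, and the names `pearson` and `classify` as well as all numeric values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("patterns must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant pattern")
    return cov / sqrt(vx * vy)

def classify(sample, class_profiles, similarity=pearson):
    """Return the class with the highest similarity, plus all similarities."""
    similarities = {name: similarity(sample, profile)
                    for name, profile in class_profiles.items()}
    return max(similarities, key=similarities.get), similarities

# Hypothetical class reference profiles for three genes
profiles = {"high risk": [8.0, 3.0, 6.0], "low risk": [4.0, 7.0, 5.0]}
label, sims = classify([7.5, 3.5, 6.5], profiles)
print(label, sims)
```

Passing the similarity measure as a parameter keeps the choice made in step one d) separate from the selection rule applied in step two c).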

In particular, the present invention relates to a method for predicting an outcome of a patient suffering from or at risk of developing a neoplastic disease, said method comprising the steps of:

a) obtaining a biological sample from said patient; b) quantifiably determining the gene expression levels of a plurality of genes, thereby obtaining a pattern of expression levels of said plurality of genes; c) comparing the pattern of expression levels determined in (b) with at least a first reference pattern of expression levels indicative of a first outcome and belonging to a first outcome category and a second pattern of expression levels indicative of a second outcome and belonging to a second outcome category; and d) predicting an outcome of a patient from the comparison in step (c) by use of a mathematical function to determine the similarity of said pattern of expression levels with said first reference pattern and said second reference pattern wherein said outcome is predicted according to which of said first and second reference pattern bears greater similarity with said pattern of expression levels.

According to an aspect of the present invention, said first reference pattern and second reference pattern is obtained by a method comprising the steps of:

a) classifying a plurality of reference patients as belonging to the first outcome or second outcome category according to clinical data and/or a known outcome of disease; b) determining from biological samples from said reference patients the gene expression level for each gene of the plurality of genes from each reference patient; and c) determining an average level of gene expression for each gene for those patients belonging to the first outcome category and for those patients belonging to the second outcome category; wherein the first reference pattern comprises the average gene expression levels of patients belonging to the first outcome category and the second reference pattern comprises the average gene expression levels of patients belonging to the second outcome category.
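The construction of the two reference patterns can be illustrated with a short Python sketch. It assumes that "average" means the arithmetic mean and that every patient's expression levels are listed in the same gene order; the function name `reference_patterns`, the patient identifiers and the values are hypothetical.

```python
def reference_patterns(expression, categories):
    """Average each gene's expression over the reference patients of each category.

    expression: dict mapping patient id -> list of expression levels
                (same gene order for every patient)
    categories: dict mapping patient id -> outcome category label
    """
    grouped = {}
    for patient, levels in expression.items():
        grouped.setdefault(categories[patient], []).append(levels)
    # zip(*rows) regroups the values gene by gene within one category
    return {
        category: [sum(gene) / len(gene) for gene in zip(*rows)]
        for category, rows in grouped.items()
    }

# Hypothetical data: four reference patients, three genes
expression = {"p1": [8.0, 3.0, 6.0], "p2": [9.0, 2.0, 7.0],
              "p3": [4.0, 7.0, 5.0], "p4": [5.0, 6.0, 4.0]}
categories = {"p1": "first", "p2": "first", "p3": "second", "p4": "second"}
patterns = reference_patterns(expression, categories)
print(patterns)
# {'first': [8.5, 2.5, 6.5], 'second': [4.5, 6.5, 4.5]}
```

Each resulting list is one reference pattern, with one average expression level per gene.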

According to an aspect of the present invention the categories of first outcome or second outcome are indicative of a high risk or low risk, respectively, of developing a metastasis.

According to a further aspect of the present invention the expression level is determined by

a) a hybridization based method; b) a PCR based method; c) determining the protein level, d) a method based on the electrochemical detection of particular molecules, and/or by e) an array-based method.

According to a further aspect of the present invention, the expression level of at least one of the said ligands is determined in formalin and/or paraffin fixed tissue samples.

According to a further aspect of the present invention, after lysis, the samples are treated with silica-coated magnetic particles and a chaotropic salt, in order to purify the nucleic acids contained in said sample for further determination.

According to a further aspect of the present invention, the gene expression level of 3 to 15 genes is determined, preferably 3 to 10, more preferably 3 to 6.

According to a further aspect of the present invention, the gene expression of at least 1, 3, or 3 to 15 of the genes of table 3 is determined.

According to a further aspect of the present invention, the gene expression of at least one of the genes of table 1 is determined.

According to a further aspect of the present invention, the gene expression of at least one of the genes of table 3 and the gene expression of at least one of the genes of table 1 is determined.

According to a further aspect of the present invention, gene expression of at least three genes is determined, said three genes being selected from the genes listed in table 3.

According to a further aspect of the present invention, an average gene expression level of a plurality of the genes is determined by using a group of probes containing probes specific for each of said plurality of genes wherein said group of probes yield one combined detectable signal indicative of said average gene expression level. According to a further aspect of the present invention said plurality of genes is selected from the genes listed in table 1.

According to a further aspect of the present invention, a nucleic acid having a sequence of at least one of sequence protocol SEQ ID NO 1-180 or a fragment thereof is used as a probe to determine a gene expression level.

According to a further aspect of the present invention, predicting the outcome of a patient by classifying said pattern of expression levels of said patient as belonging to the first outcome or second outcome category is performed by using a mathematical discriminant function.

According to a further aspect of the present invention, predicting the outcome of a patient by classifying said pattern of expression levels of said patient as belonging to the first outcome or second outcome category is performed by using Pearson's correlation coefficient. In statistics, the Pearson correlation coefficient (sometimes known as the Pearson product-moment correlation coefficient, PMCC) (r) is known as a common measure of the correlation between two variables X and Y. Pearson's correlation reflects the degree of linear relationship between two variables. It ranges from -1 to +1. A correlation of +1 means that there is a perfect positive linear relationship between variables. A correlation of -1 means that there is a perfect negative linear relationship between variables. A correlation of 0 means there is no linear relationship between the two variables.

According to a further aspect of the present invention, the neoplastic disease is breast cancer.

In the predictor training, the mean of the gene expression values (e.g. expressed as measured CT (cycle threshold) values in RT-PCR) is formed for all informative genes in both groups (first outcome and second outcome, or high and low risk group). In the test, a new, unclassified patient is measured for the same informative genes, which are compared against the first outcome and second outcome (or low risk and high risk) profiles by a distance metric induced by e.g. Pearson's correlation coefficient (distance(a, ref) = 1 - corr(a, ref)). The closer reference profile determines the classification: the new patient is identified as low risk if the correlation to the good outcome profile is higher, and as high risk otherwise.
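The test step described above can be sketched as follows, assuming the two reference profiles have already been formed by averaging. The function names and all numerical values are hypothetical; only the distance 1 - corr(a, ref) is taken from the text.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def distance(a, ref):
    """Distance induced by the correlation coefficient: 1 - corr(a, ref)."""
    return 1.0 - pearson(a, ref)

def classify(profile, ref_high, ref_low):
    """Assign the patient to the risk group whose reference profile is closer."""
    if distance(profile, ref_high) < distance(profile, ref_low):
        return "high risk"
    return "low risk"

# Hypothetical mean CT profiles (three genes) and one unclassified patient
ref_high = [25.0, 31.0, 27.0]
ref_low = [30.0, 26.0, 30.0]
patient = [26.0, 33.0, 28.0]
result = classify(patient, ref_high, ref_low)
```

In this toy example the patient's profile rises and falls across the three genes in the same way as the high risk profile, so the high risk profile is the closer one.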

While initially simple proximity under a simple correlation method was used to classify a patient as good or bad outcome, the method can deliberately be biased towards either side, thereby classifying more patients as "good" (or "bad", respectively) to achieve higher specificity (sensitivity) by sacrificing sensitivity (specificity, respectively). The panel consists of two or more marker genes; it also works for twenty-four marker genes (see example below), but fewer markers are desirable. Based on the same ideas, the method can be optimized (optimal case method) by identifying the above distance comparison as a difference of two covariance values and, by bilinearity of the covariance, as a linear optimization with one linear and one nonlinear constraint. Solving this optimization problem yields the best results, both in terms of accuracy and in terms of a minimal informative gene set, which is desirable.
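The deliberate bias can be sketched as a constant added to the difference of the two correlation coefficients. The sign convention used here (correlation to the bad outcome profile minus correlation to the good outcome profile, positive score meaning "bad") is an assumption chosen for this illustration, as are the function name `biased_call`, the parameter `theta` and all values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def biased_call(profile, ref_bad, ref_good, theta=0.0):
    """Classify as "bad" when corr(bad) - corr(good) + theta is positive.

    theta > 0 shifts close decisions towards "bad" (higher sensitivity),
    theta < 0 shifts them towards "good" (higher specificity).
    """
    score = pearson(profile, ref_bad) - pearson(profile, ref_good) + theta
    return "bad" if score > 0 else "good"

# Hypothetical reference profiles and a borderline patient (four genes)
ref_bad = [1.0, 2.0, 3.0, 4.0]
ref_good = [1.0, 3.0, 2.0, 4.0]
patient = [1.0, 3.0, 2.5, 4.0]

unbiased = biased_call(patient, ref_bad, ref_good)           # plain proximity
cautious = biased_call(patient, ref_bad, ref_good, theta=0.2)  # biased towards "bad"
```

The borderline patient correlates well with both profiles, so a modest bias is enough to change the call; patients far from the decision boundary are unaffected.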

A diagnostic test uses accessible features of a patient or subject to assess e.g. the likeliness of a patient having a certain condition, i.e. illness. These "features" can comprise clinical data, e.g. etiopathological data and anamnesis data and can be from a single source or multiple sources, they range from optical inspection, the use of simple instruments or imaging technologies (x-ray, computed tomography, PET, SPECT, MRI, ultrasound, etc.) to biochemical laboratory tests (e.g. on body fluids, tissue, feces) and other means. The basic idea always is to find certain characteristics that have frequently been observed in patients known to have a specific disease, a specific condition, or also a specific stage or grade of disease, and then comparing the patient with unknown disease status against these "references".

The approach of the algorithm described herein is comparable: Based on gene expression profiles measured from tumor tissue excised from a patient with an unknown risk, e.g. an unknown risk of developing a metastasis, the hypothesis is that this risk assessment can be done by comparing the genetic profile of this patient with reference profiles of patients known to be at high risk and of patients known to be at low risk.

Furthermore, it is envisioned that this decision can reliably be based on a small number of gene expression values.

Speaking more mathematically, a predictor is a mapping from observable before-the-fact features to either a continuous risk score (e.g. higher score means higher risk) or a risk class (e.g. first outcome or second outcome, "low risk", "intermediate risk", or "high risk"), thus predicting the likeliness of a future event based on presently available data. To obtain such a predictor, one first needs a set of features thought to be relevant for the given disease. Starting from a given retrospective dataset (patients that are known to have the disease or known not to have it, the set of features being available), one can then pick a model of the mapping, usually in terms of a mathematical model which may or may not incorporate prior knowledge from earlier experiments, literature, etc. Examples of mathematical models are linear regression, logistic regression, support vector machines (SVM), decision trees and fuzzy trees. The model will usually contain parameters that have yet to be determined. This is done in a general optimization process where parameters are chosen in a way that the model makes a precise prediction on the available training data. Special care has to be taken that the model generalizes well, that is, that the results will be valid for all patients and not just for those in the training set (failure to generalize is a well-known phenomenon called "overfitting"); this can be achieved e.g. by using a cross validation (CV) procedure.

The objective of the predictor described in this example is to predict the risk of a node-negative breast cancer patient to develop a metastasis within 10 years after the surgical removal of the tumor if no further therapy is given. By this, two outcome groups are defined, one with a metastasis within 10 years, and one group that is metastasis free for more than 10 years. These groups will be called "cases" and "controls", respectively, and in this example represent the first and second outcome, respectively.

The model that is chosen in this example is based on the hypothesis that all cases are "similar" to each other, so that it is possible to construct a "case reference profile" or "first outcome reference profile" from given training data. In the same way, it would be possible to construct a "control reference profile" or "second outcome reference profile" for the controls in the training set.

The mentioned "similarity" is a choice that can be made; preferably this choice should be compatible with the choice of the reference profiles.

One obvious example for such a choice would be the L2 distance measure (Euclidean norm). In this case, similarity of a profile $P_i$ of patient $i$ ($i = 1..N_P$ if there are $N_P$ patients) to the case reference profile $P_{ref,case}$ would be defined as

$$\mathrm{dist}(P_i, P_{ref,case}) := \|P_i - P_{ref,case}\|_2^2 = \sum_{j=1}^{N_G} (P_{i,j} - P_{ref,case,j})^2$$

($P_{i,j}$ being the expression of the j-th gene ($j = 1..N_G$) in patient $i$, and $N_G$ being the number of genes). The similarity to the control reference profile is defined accordingly.

One straightforward definition of the reference profile $P_{ref}$ would be the mean of all expression values of a gene over all cases, as will be shown in the following. Consider the following definition of $P_{ref,case}$:

$$P_{ref,case} = \arg\min_{P_{ref}} \sum_{i \in cases} \mathrm{dist}(P_i, P_{ref}) = \arg\min_{P_{ref}} \sum_{i \in cases} \sum_{j=1}^{N_G} (P_{i,j} - P_{ref,j})^2$$

Observe that the argument of arg min on the right-hand side of this equation is differentiable and unbounded in $P_{ref}$. Observe furthermore that the minimum exists and is unique. The unique minimum can be determined by computing the partial derivatives of the argument of the right-hand side with respect to $P_{ref,k}$, which necessarily vanish at the minimum. This yields for any $k = 1..N_G$

$$\frac{\partial}{\partial P_{ref,k}} \sum_{i \in cases} \sum_{j=1}^{N_G} (P_{i,j} - P_{ref,j})^2 = -2 \sum_{i \in cases} (P_{i,k} - P_{ref,k}) = 0$$

which is equivalent to

$$P_{ref,k} = \frac{1}{\#\mathrm{cases}} \sum_{i \in cases} P_{i,k}$$

where "#cases" denotes the number of cases. $P_{ref}$ is therefore the mean of all expression vectors over all cases.

This fact is known to any person skilled in the art in the context of k-means clustering approaches .
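The statement that the mean profile minimizes the summed squared Euclidean distance can be checked numerically. The following sketch uses hypothetical two-gene profiles and compares the cost at the mean with the cost at slightly shifted reference profiles; the function names are illustrative only.

```python
def sq_dist(profile, ref):
    """Squared Euclidean distance between a profile and a reference profile."""
    return sum((a - b) ** 2 for a, b in zip(profile, ref))

def total_cost(profiles, ref):
    """Sum of squared distances of all profiles to the reference profile."""
    return sum(sq_dist(p, ref) for p in profiles)

def mean_profile(profiles):
    """Gene-wise mean over all profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

# Hypothetical expression profiles of three cases (two genes each)
cases = [[24.0, 30.0], [26.0, 32.0], [28.0, 31.0]]

centroid = mean_profile(cases)
best = total_cost(cases, centroid)

# Cost at reference profiles shifted by +/- 0.5 in each gene
perturbed = [
    total_cost(cases, [centroid[0] + dx, centroid[1] + dy])
    for dx in (-0.5, 0.5)
    for dy in (-0.5, 0.5)
]
```

Every shifted reference profile yields a strictly larger cost than the mean profile, in line with the derivation above.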

The choice of the euclidean metric, while possible, is not necessarily an optimal one for markers using gene expression data. It has its definite advantages in purely mathematical areas and in terms of simple interpretability .

In the following example which is useful for the method of the invention, another similarity measure is proposed, this time based on Pearson's correlation coefficient. It also has some remarkable features and, as will be presented, it performs very well on a data set for breast cancer prognosis.

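Using the notation of the first example, define

$$\mathrm{dist}(P_i, P_{ref,case}) := 1 - \mathrm{corr}(P_i, P_{ref,case})$$

where corr(x,y) is Pearson's correlation coefficient between profiles x and y, well known to a person skilled in the art, defined by

$$\mathrm{corr}(x, y) = \frac{\sum_{j=1}^{N_G} (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_{j=1}^{N_G} (x_j - \bar{x})^2}\,\sqrt{\sum_{j=1}^{N_G} (y_j - \bar{y})^2}}$$

where

$$\bar{x} = \frac{1}{N_G} \sum_{j=1}^{N_G} x_j, \qquad \bar{y} = \frac{1}{N_G} \sum_{j=1}^{N_G} y_j$$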

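A predictor based on the most similar reference profile (of two given reference profiles) is given by

$$\mathrm{pred}(P_i) = \begin{cases} \text{case} & \text{if } \mathrm{dist}(P_i, P_{ref,case}) < \mathrm{dist}(P_i, P_{ref,control}) \\ \text{control} & \text{otherwise} \end{cases}$$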

Equivalently, one can compute the auxiliary quantity

$$\mathrm{diff}(P_i) := \mathrm{dist}(P_i, P_{ref,case}) - \mathrm{dist}(P_i, P_{ref,control}) \qquad (*)$$

and classify by

In addition, it is possible to bias the quantity diff(P_i) by adding a real constant θ (which will also be determined in training) towards classification as a case (for positive values of θ) or towards classification as a control (for negative values of θ), depending on the requirements.

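Adding θ and inserting the definition of distance in (*), we obtain

$$\mathrm{diff}^*(P_i, \theta) = (1 - \mathrm{corr}(P_i, P_{ref,case})) - (1 - \mathrm{corr}(P_i, P_{ref,control})) + \theta = \mathrm{corr}(P_i, P_{ref,control}) - \mathrm{corr}(P_i, P_{ref,case}) + \theta$$

and inserting the definition of the correlation coefficient, we have

$$\mathrm{diff}^*(P_i, \theta) = \frac{\sum_j (P_{i,j} - \bar{P}_i)(P_{ref,control,j} - \bar{P}_{ref,control})}{\sqrt{\sum_j (P_{i,j} - \bar{P}_i)^2}\,\sqrt{\sum_j (P_{ref,control,j} - \bar{P}_{ref,control})^2}} - \frac{\sum_j (P_{i,j} - \bar{P}_i)(P_{ref,case,j} - \bar{P}_{ref,case})}{\sqrt{\sum_j (P_{i,j} - \bar{P}_i)^2}\,\sqrt{\sum_j (P_{ref,case,j} - \bar{P}_{ref,case})^2}} + \theta$$

Collecting the terms that depend only on the reference profiles into a vector r with components

$$r_j := \frac{P_{ref,control,j} - \bar{P}_{ref,control}}{\sqrt{\sum_k (P_{ref,control,k} - \bar{P}_{ref,control})^2}} - \frac{P_{ref,case,j} - \bar{P}_{ref,case}}{\sqrt{\sum_k (P_{ref,case,k} - \bar{P}_{ref,case})^2}}$$

this can be written as

$$\mathrm{diff}^*(P_i, \theta) = \sum_j \frac{P_{i,j} - \bar{P}_i}{\sqrt{\sum_k (P_{i,k} - \bar{P}_i)^2}}\, r_j + \theta$$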

We observe that all parameters we seek to optimize (except for θ) are contained in r. Note that, by construction, r has zero mean. Scaling r by some positive scalar a yields an equivalent predictor if we replace θ by a·θ.

It follows that the identification of the reference profiles $P_{ref,case}$ and $P_{ref,control}$ is in fact the determination of a single vector r of the same dimension with the properties

$$\sum_j r_j = 0, \qquad \sum_j r_j^2 = 1. \qquad (**)$$

If pre-processing is applied to the gene expression profiles $P_i$ by

$$Q_{i,j} = \frac{P_{i,j} - \bar{P}_i}{\sqrt{\sum_k (P_{i,k} - \bar{P}_i)^2}}$$

we have

$$\mathrm{diff}^*(P_i, \theta) = \sum_j Q_{i,j}\, r_j + \theta$$
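The linear form can be checked against the difference of the two correlation coefficients. The sketch below follows the order of terms in the equations above (control minus case); the function names and the numerical profiles are hypothetical. The vector r built here has zero mean but is not rescaled to unit length, which only rescales θ.

```python
from math import sqrt

def standardize(v):
    """Center a vector and scale it to unit Euclidean norm (the pre-processing Q)."""
    m = sum(v) / len(v)
    centered = [a - m for a in v]
    norm = sqrt(sum(a * a for a in centered))
    return [a / norm for a in centered]

def pearson(x, y):
    """Pearson's correlation as the dot product of two standardized vectors."""
    return sum(a * b for a, b in zip(standardize(x), standardize(y)))

def diff_star(profile, r, theta=0.0):
    """Linear form: sum_j Q_j * r_j + theta."""
    q = standardize(profile)
    return sum(a * b for a, b in zip(q, r)) + theta

# Hypothetical reference profiles and one patient profile (three genes)
ref_case = [25.0, 31.0, 27.0]
ref_control = [30.0, 26.0, 30.0]
patient = [26.0, 33.0, 28.0]

# r = standardized control profile minus standardized case profile
r = [c - k for c, k in zip(standardize(ref_control), standardize(ref_case))]

lhs = diff_star(patient, r)
rhs = pearson(patient, ref_control) - pearson(patient, ref_case)
```

Both quantities agree up to rounding, which is why training can search directly for r and θ instead of two separate reference profiles.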

On given training data, the parameter vector r and the scalar value θ (theta) are determined such that diff is negative for controls and positive for cases. The constraints (**) may be implemented using penalization, e.g. by solving an optimization problem in which violations of the constraints are weighted by penalty parameters $\lambda_1, \lambda_2$ that are large numbers (e.g. $10^5$).

All genes listed in the following and also further genes not listed can be used in correlation algorithms described. All genes listed here can be used in further algorithms to predict prognosis of breast cancer patients. A correlation algorithm can be used to make therapy predictions for breast cancer patients. A correlation algorithm can also be used for prognosis and prediction of other cancers.

Examples

Most algorithms rely on many genes, measured by chip technology or PCR-based methods (16 to 70 or more different genes in currently available commercial assays), and on a complicated normalization of data (normalization against 5 to several hundred housekeeping genes) by a no less complicated algorithm that combines all data into a final score or risk prediction, e.g. the Agendia Mammaprint Assay (70 genes and hundreds of normalization genes) or the Genomic Health Oncotype Assay (16 genes and 5 normalization genes). The correlation algorithm presented here does not need normalization of gene expression values, since it is based on correlation, which is invariant under monotonically increasing affine transformations; adding the same constant to all expression values of a sample will therefore yield exactly the same prediction. It needs only a set of few genes measured by RT-PCR (at least two, preferably 3 to 15, more preferably 3 to 6). The RNA to be analyzed does not need to be adjusted to a certain concentration or CT value for one gene as long as all genes can be robustly measured (of three genes, for example, two should be located in the linear range of the PCR measurement (approx. 20 - 30 cycles, wherein typically 32 cycles = limit of detection), and one gene can be located at a cycle threshold (CT) between 30 and 40 or greater than 40).
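The invariance property can be checked numerically. The sketch below uses hypothetical CT values and shows that shifting all values of a sample by a constant, or applying any affine map with positive slope, leaves Pearson's correlation to a reference profile unchanged, whereas a negative slope flips its sign.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical CT values of one sample and of a reference profile (four genes)
ct = [24.0, 30.0, 27.5, 33.0]
ref = [25.0, 29.0, 28.0, 34.0]

base = pearson(ct, ref)
shifted = pearson([v + 2.5 for v in ct], ref)        # e.g. less input RNA
scaled = pearson([1.8 * v + 2.5 for v in ct], ref)   # positive slope
flipped = pearson([-v for v in ct], ref)             # negative slope
```

A uniform CT shift, as would result from a different amount of input RNA, therefore does not change which reference profile is closer.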

To obtain reference patterns, a collection of FFPE (formalin-fixed, paraffin-embedded) tumor samples of node-negative breast cancer patients with long-term follow-up data was used to prepare RNA and to measure the amount of RNA of several breast cancer informative genes by quantitative RT-PCR. The patients were then classified into two groups, a bad outcome group and a good outcome group (as defined above), and the model parameters were calculated (reference profiles by averaging in the simple case, differential profile in the optimal case). Once these have been obtained, each incoming patient sample can be correlated to these profiles. The advantage here lies in the extremely low number of genes used (as low as two to four genes) and the simplicity of the classification procedure while maintaining very good sensitivity and specificity. The method described here was generated using data from 190 samples. In a very conservative 50:50 cross validation procedure (with half of the samples being used for training while the remaining samples were used for testing, 2-fold cross validation), a combination of CXCL13 and UBE2C in conjunction with any housekeeping gene, e.g. GAPDH, or a housekeeping mixture always showed exquisite performance, with an average Youden Index (maximum of sensitivity + specificity - 1 in the ROC curve) of 0.4, which is highly significant. Slightly better results were obtained when the IGKC gene was also used as part of the same set of genes.
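The Youden Index as defined above (maximum of sensitivity + specificity - 1 over the ROC curve) can be computed from classifier scores as in the following sketch. The scores and labels are hypothetical and are not data from the study; the convention that a score at or above the threshold is called a case is an assumption of this illustration.

```python
def youden_index(scores, labels):
    """Maximum of sensitivity + specificity - 1 over all score thresholds.

    labels: 1 for cases, 0 for controls.
    A sample is called a case when its score is >= the threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best = 0.0  # threshold above all scores: sensitivity 0, specificity 1
    for t in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        tn = sum(1 for s, l in zip(scores, labels) if s < t and l == 0)
        best = max(best, tp / pos + tn / neg - 1.0)
    return best

# Hypothetical scores (e.g. differences of correlation coefficients)
scores = [0.30, 0.10, -0.05, -0.30, 0.20, -0.20, -0.10, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
j = youden_index(scores, labels)
```

A value of 0 corresponds to a classifier no better than chance and a value of 1 to perfect separation of cases and controls.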

Technically, the assay used in these examples relies on two core technologies: 1.) Isolation of total RNA from tumor tissue (or tissue suspected to contain tumor tissue) and 2.) quantification of mRNA by e.g. kinetic RT-PCR.

The assay results can be linked together by a mathematical algorithm computing the likely risk of getting metastasis as low, (intermediate) or high, which may be implemented in software.

In case of RNA analysis to determine an expression level, RNA may be isolated by any known suitable method, e.g. using commercially available RNA isolation kits.

According to a preferred embodiment of the present invention, after lysis the samples are treated with silica-coated magnetic particles and a chaotropic salt, in order to purify nucleic acids contained in the sample for further analysis. This method of RNA isolation has been shown to yield satisfactory results even when RNA is extracted from fixed tissue samples. This method which allows successful purification of mRNA out of fixed tissue samples is disclosed in WO 030058649, WO 2006136314A1 and DE10201084A1, the content of which is incorporated herein by reference.

This RNA extraction method comprises the use of magnetic particles coated with silica (silicon dioxide, SiO2). These highly pure magnetic particles coated with silica are used for isolating nucleic acids, including DNA and RNA, from cell and tissue samples; the particles may be retrieved from a sample matrix or sample solution by use of a magnetic field. These particles are particularly well-suited for the automatic purification of nucleic acids, mostly from biological samples for the purpose of detecting nucleic acids.

The selective binding of nucleic acids to the surface of these particles is due to the affinity of negatively charged nucleic acids to silica-containing media in the presence of chaotropic salts like guanidine isothiocyanate. The binding properties are for example described in EP 819696.

This approach is particularly useful for the purification of mRNA from formalin and/or paraffin fixed tissue samples. In contrast to most other approaches which may result in small fragments that are not well-suited for later gene expression analysis by PCR and/or hybridization technologies, this approach creates mRNA fragments which are large enough to allow such analysis.

This approach allows a highly specific determination of gene expression levels with one of the above-mentioned methods, in particular hybridization-based methods, PCR-based methods and/or array-based methods, even in formalin and/or paraffin-fixed tissue samples. It is therefore highly efficient in the context of the present invention, because it allows the use of fixed tissue samples, which are often readily available in tissue banks and connected to clinical databases, e.g. for follow-up studies to allow retrospective analysis.

Measurement of the expression level can be performed on the mRNA level by any suitable method, e.g. qPCR or gene expression array platforms, including, but not limited to, commercially available platforms, such as TaqMan®, Lightcycler®, Affymetrix, Illumina, Luminex, planar wave guide, electrochemical microarray chips, microarray chips with optical, magnetic, electrochemical or gravimetric detection systems and others or on a protein level by immunochemical techniques such as ELISA.

The term "planar waveguide" relates to methods, wherein the presence or amount of a target molecule is determined by using a planar wave guide detector which emits an evanescent field in order to detect the binding of a labeled target molecule, such as e.g. described in WO200113096-A1, the content of which is incorporated by reference herein.

The term "optical detection" relates to methods which detect the presence or amount of a target molecule through a change in optical properties, e.g. fluorescence, absorption, reflectance, chemiluminescence, as is well known in the art.

The term "magnetic detection" relates to methods which detect the presence or amount of a target molecule or label through a change in magnetic properties, e.g. through the use of XMR sensors, GMR sensors or the like.

The term "electrochemical detection" relates to methods which make use of an electrode system to which molecules, particularly biomolecules like proteins, nucleic acids, antigens, antibodies and the like, bind under creation of a detectable signal. Such methods are for example disclosed in WO0242759, WO200241992 and WO2002097413, the content of which is incorporated by reference herein. These detectors comprise a substrate with a planar surface and electrical detectors which may adopt, for example, the shape of interdigital electrodes or a two dimensional electrode array. These electrodes carry probe molecules, e.g. nucleic acid probes, capable of binding specifically to target molecules, e.g. target nucleic acid molecules and allowing a subsequent electrochemical detection of bound target nucleic acids.

The term "gravimetric detection" relates to methods which make use of a system that is able to detect changes in mass which occur e.g. when a probe binds its target.

In a representative example, quantitative reverse transcriptase PCR was performed according to the following protocol :

Primer/Probe Mix:

50 μl 100 μM Stock Solution Forward Primer

50 μl 100 μM Stock Solution Reverse Primer

25 μl 100 μM Stock Solution TaqMan Probe

bring to 1000 μl with water

10 μl Primer/Probe Mix (1:10) are lyophilized, 2.5 h RT

RT-PCR Assay set-up for 1 well:

3.1 μl Water

5.2 μl RT qPCR MasterMix (Invitrogen) with ROX dye

0.5 μl MgSO4 (to 5.5 mM final concentration)

1 μl Primer/Probe Mix dried

0.2 μl RT/Taq Mx (-RT: 0.08 μl Taq)

1 μl RNA (1:2)

Thermal Profile:

RT step

50 °C 30 min *

8 °C ca. 20 min *

95 °C 2 min

PCR cycles (repeated for 40 cycles)

95 °C 15 sec

60 °C 30 sec

It has also been found that one or several housekeeping genes can be used as part of the plurality of genes to be analyzed or measured in the method of the present invention. Housekeeping genes are known as control genes, which are selected because of their stable and constant expression in a wide variety of tissues or cells. In gene expression analysis, it is known to use internal controls for standardization or normalization purposes. For these purposes, so-called housekeeping genes are often used. Many housekeeping genes encode enzymes or structural RNAs, such as ribosomal RNAs, which perform essential metabolic functions in viable cells (hence the name "housekeeping") and are therefore constantly expressed. However, it has been found that expression of these genes may vary considerably, especially in tumor cells, and their expression level may also vary over time. Therefore, the expression level of housekeeping genes may be informative for disease status.

According to another aspect of the invention, rather than using a single housekeeping gene as one of the plurality of genes to be determined for gene expression, a plurality of housekeeping genes can be analyzed. This can be achieved e.g. by using a mix of probes for different housekeeping genes yielding one combined detectable signal.

In experiments to identify the best housekeeping gene mixtures to be used as normalization genes for breast cancer prognosis or prediction, it was surprisingly found that mixtures of primers and probes of different housekeepers perform similarly to separate measurements of the single genes by qPCR. For example, primers and probes were mixed for four housekeeping genes into one PCR well (ACTG1, CALM2, OAZ1 and PPIA). With one fourth of the concentration of each single reaction (1+1+1+1), it was surprisingly found that the mean of the cycle threshold values of the single reactions was nearly identical to the measured cycle threshold value of the mixed reaction.

Single housekeeping genes in the mixture can also be weighted by raising the relative amount of primers and probes compared to other genes in the mixture. Mastermix, enzymes, MgSO4 are present in standard concentrations or are adjusted for optimized primer and probe performances.

Experiments showed that using the housekeeping gene mixtures (HKM) listed below produced consistently superior results compared with using single housekeeping genes as internal reference (data not shown). In each HKM the same detectable label (a FAM/TAMRA labeled probe) was used, thus producing a combined detectable signal. This combined signal was more stable and thus better suited to be used as internal control.

To find an optimal housekeeping gene or mixture we propose a supervised approach that assigns each gene its optimal housekeeper or mixture.

In order to determine which housekeeping gene or housekeeping gene mixture is best suited for statistical analysis and allows the best separation of patients with respect to expression of a gene of interest (GOI), separation was assessed by using an algorithm, e.g. correlation coefficient, Fisher score or t-test.

For each gene, the (at least one further) housekeeper is selected such that subtracting its CT-value vector from the gene's CT-value vector maximizes the information content of that gene with respect to the output vector (here the output vector is a step function, '1' for positive patients and '0' for control patients). As a measure of the information content we use three criteria: correlation coefficient, Fisher score and t-test.
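The selection rule can be sketched for the correlation criterion alone (Fisher score and t-test are not shown). The function name `best_housekeeper`, the housekeeper labels `HK_A` and `HK_B` and all CT values are hypothetical; the sketch assumes the information content is measured as the absolute correlation between the housekeeper-corrected CT vector and the 0/1 output vector.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def best_housekeeper(goi_ct, housekeepers, outcome):
    """Pick the housekeeper whose subtraction maximizes |corr| with the outcome.

    goi_ct:       CT values of the gene of interest, one per patient
    housekeepers: dict mapping housekeeper name to its CT values
    outcome:      step function, 1 for positive patients, 0 for controls
    """
    def information(hk_ct):
        delta = [g - h for g, h in zip(goi_ct, hk_ct)]
        return abs(pearson(delta, outcome))

    name = max(housekeepers, key=lambda k: information(housekeepers[k]))
    return name, information(housekeepers[name])

# Hypothetical data: six patients, the first three positive
outcome = [1, 1, 1, 0, 0, 0]
goi_ct = [27.0, 25.0, 26.0, 29.0, 30.0, 28.0]
housekeepers = {
    "HK_A": [22.0, 20.0, 21.0, 21.0, 22.0, 20.0],  # tracks sample input
    "HK_B": [20.0, 22.0, 21.0, 22.0, 20.0, 21.0],  # unrelated variation
}
name, score = best_housekeeper(goi_ct, housekeepers, outcome)
```

In this toy data set `HK_A` carries the same sample-to-sample variation as the gene of interest, so subtracting it leaves a clean step between the two patient groups.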

The approach was tested under a 100 times 10-fold cross validation setting to account for the stability of housekeeper-assignments. Combined housekeeping gene mixtures, as listed below, are chosen much more frequently than individual housekeeping genes (data not shown) .

In particular, one or more of the following housekeeping genes can be used as one of the plurality of genes to be determined in the method of the invention.

Table 1: Housekeeping genes

Gene      Accession Number/identifier
RPL37A    NM_000998
GAPDH     NM_002046
ACTG1     NM_001614
CALM2     NM_001743
OAZ1      NM_004152
PPIA      NM_021130

Combinations of different housekeeping genes to be used in the method of the invention are referred to as housekeeping mixtures (HKM) . The mixture includes respective forward and reverse primer and qPCR-Probe. In particular, the following combinations of housekeeping genes can be used as elements of the plurality of genes to be determined in the method of the invention :

Table 2: Housekeeping gene mixtures

HKM = ACTG1+CALM2+OAZ1+PPIA

HKM-2 = CALM2+OAZ1+PPIA

HKMs = OAZ1+PPIA+GAPDH

HKMx = ACTG1+CALM2+OAZ1+PPIA+GAPDH+RPL37A

In a retrospective analysis, RNA from 360 FFPE node-negative breast cancer patient samples was prepared by automated magnetic bead technology as described above and analyzed by RT-PCR for a set of genes listed below. Patients were discretized into a high and a low risk category according to clinical data. Mean CT values were calculated for the low and high risk groups for each gene, and each of the 360 samples was correlated to these mean profiles. A closer correlation to the high risk group predicts for the respective patient a high risk of getting a metastasis within 10 years after surgical removal of the tumor. A closer correlation to the low risk group predicts for the respective patient a low risk of developing a metastasis within 10 years after surgical removal of the tumor. A diagnostic kit can be formed from these assays and the algorithm to identify, for example, low risk patients who probably would not need chemotherapy. In this case the algorithm could be adjusted to higher sensitivity, for example by subtracting a constant value from the correlation value of the low risk group, so that in very close decisions the patient would be classified as high risk and would therefore not be spared chemotherapy.

Table 3 Genes and Gene mixtures used for analysis in the inventive method

Table 4 shows the results of the analysis for 6 patients according to the method of the invention. RNA from FFPE node-negative samples was analyzed for gene expression of genes of table 3 and compared to the reference patterns of expression, i.e. average expression values for the high risk group and low risk group. Patients 1 to 3 had developed metastasis in the course of the disease ("classification metastasis = 1"), patients 4 to 6 did not develop metastasis ("classification metastasis = 0"). Pearson's correlation coefficients were calculated to determine whether patients match the high risk reference pattern of expression or the low risk reference pattern more closely.

Table 4: Analysis of patient samples with regard to risk of development of metastasis

Patients 1 to 3 had a consistently higher correlation coefficient for the high risk reference pattern of gene expression (positive Delta of [high] - [low]). Patients 4 to 6 had a consistently higher correlation coefficient for the low risk reference pattern of gene expression (negative Delta of [high] - [low]). Even when the correlation coefficient for the matching or "expected" reference profile was low (e.g. patient 5, where the coefficient for the matching "low risk" profile was only 0.776), the coefficient for the mismatched or unexpected profile was even lower (again, patient 5, where the coefficient for the mismatched "high risk" profile was 0.723). This demonstrates the robustness and reliability of the method presented here.

Example 1 :

As mentioned above, RNA from 360 FFPE node-negative breast cancer patient samples was prepared by an automated magnetic bead technology and analyzed by RT-PCR for a set of genes listed below. Samples were discretized into a high risk group developing metastasis within 10 years after surgical removal of the tumor and a low risk group developing no metastasis within 10 years after surgical removal of the tumor. Mean values were calculated for the low and high risk patient groups for each gene (low and high risk profiles). Patients to be classified were correlated to the values of the two risk profiles. A correlation can be computed for example by Pearson correlation, with a value between 0.7 and 1.0 meaning that a high correlation exists between the two sets of parameters (there is no correlation when the correlation coefficient is between 0 and 0.2, for example). The correlation algorithm presented here is not strictly dependent on the absolute correlation coefficients; it compares two correlation coefficients (one for the high risk group and another for the low risk group) and identifies the one that has the higher value (the one that is closer to 1). A higher correlation to the high risk profile (compared to the low risk profile) predicts for the respective patient a high risk of getting a metastasis within 10 years after surgical removal of the tumor. A higher correlation to the low risk profile predicts for the respective patient a low risk of getting a metastasis within 10 years after surgical removal of the tumor. A diagnostic kit can be formed from these assays and an algorithm to identify, for example, low risk patients. In this case the algorithm could be adjusted to higher sensitivity, for example by subtracting a constant value from the correlation value of the low risk group, so that in very close decisions the patient would be classified as high risk.

RT-PCR Assay set-up for 1 well:

3.1 μl Water

5.2 μl RT qPCR Master Mix (Invitrogen) with ROX dye

0.5 μl MgSO4 (5.5 mM final concentration in the assay)

1 μl Primer/Probe Mix dried

0.2 μl RT/Taq Mx (-RT: 0.08 μl Taq)

1 μl RNA (1:2)

Ad. 10 μl

Primer/Probe Mix:

50 μl 100 μM Stock Solution Upper Primer

50 μl 100 μM Stock Solution Lower Primer

25 μl 100 μM Stock Solution TaqMan Probe

bring to 1000 μl with water

10 μl PP (1:10) lyophilized, 2.5 h RT

Thermal Profile:

50 °C 30 min *

8 °C ca. 20 min *

95 °C 2 min

95 °C 15 sec

60 °C 30 sec

40 × cycles

Table 5 Tested genes used in the examples given in Figs 1 to 21, (names of primers are given in parentheses) :

Genes + Mixtures

CLEC2B (R120), CXCL13 (R109), DHRS2 (CAGMC198), ERBB2 (FPE_044), ESR1 (BC170), H2AFZ (R123), IGHG1 (R72), IGKC (R61), KCTD3 (CAGMC217), MLPH (R49), MMP1 (mav1), PGR (BC172), SOX4 (R124), TOP2A (R70), UBE2C (R65)

HKM=housekeeper mixtures :

HKM = 2-6 genes of the following list:

RPL37A (mup), GAPDH (FPE029), ACTG1 (R113), CALM2 (R117), OAZ1 (BC268), PPIA (R115)

HKM = ACTG1+CALM2+OAZ1+PPIA

HKM-2 = CALM2+OAZ1+PPIA

HKMs = OAZ1+PPIA+GAPDH

HKMx = ACTG1+CALM2+OAZ1+PPIA+GAPDH+RPL37A

Table 6: Sensitivity and Specificity of correlation algorithm with three different gene sets in patient cohort:

Table 7: Gene sets 1 to 3, as referred to in Table 6

The example of Table 6 shows the robustness of the correlation algorithm: three different gene sets show very similar sensitivities and specificities, which remain stable even in the verification cohort.

Example 2 :

Obtaining samples, RNA isolation and detection were performed as in Example 1. As in the above example, we chose the correlation coefficient as a measure of similarity, which takes values between (and including) -1 and 1. Values around or at 1 indicate greater similarity, values around or at zero indicate no similarity, and values around or at -1 indicate strong dissimilarity.

Patients are divided into two classes, e.g. patients that develop a distant metastasis within a given time frame of e.g. 10 years (denoted by "HR" as in "High Risk"), and those who do not develop a metastasis within this time frame (denoted consequently by "LR" as in "Low Risk"). An exhaustive search of combinations in a predefined set of potentially predictive genes (the list is given in Example 1) was performed, with combinations of at least three genes. For each gene combination, a cross validation procedure was used by randomly dividing the set of available patients into two disjoint classes, one denoted "training set", and the other "test set". In this case, 50% of the patients were assigned to the training set and the other 50% to the test set (2-fold cross validation).
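The enumeration of gene combinations and the random 50/50 split can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure as actually implemented (the computations in this example were carried out in Matlab): the function names, the optional `max_size` bound, the fixed random seed and the patient identifiers are all hypothetical.

```python
import random
from itertools import combinations


def gene_combinations(genes, min_size=3, max_size=None):
    """Yield every combination of at least min_size genes.

    max_size is a hypothetical upper bound added for illustration; the text
    only states that combinations contain at least three genes.
    """
    upper = len(genes) if max_size is None else max_size
    for size in range(min_size, upper + 1):
        yield from combinations(genes, size)


def random_half_split(patient_ids, rng):
    """Randomly divide patients into disjoint training and test sets of equal size."""
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


genes = ["CXCL13", "UBE2C", "GAPDH", "SOX4"]
combos = list(gene_combinations(genes))
print(len(combos))  # 5: four triples plus the single set of all four genes

rng = random.Random(0)  # seed fixed only to make the illustration repeatable
training_set, test_set = random_half_split(range(10), rng)
print(sorted(training_set), sorted(test_set))
```

In a full run, the split would be redrawn for every repetition and every gene combination, so each combination is evaluated on many different training and test sets.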

On the training set, the single reference profile is determined by solving the optimization (see previous chapter of this document for details) where violation of the two constraints was implemented as penalty terms. The result of the optimization is a vector r and a constant theta for each cross validation. Computations were carried out using the Matlab software (Version 2007b, The Mathworks, Cambridge, MA, USA) .

On the test set, sample scores were computed as the correlation of each sample to the reference profile. These scores were in turn used to compute receiver operator characteristic (ROC) curves, one for each cross validation run.

The procedure was repeated 100 times for each gene combination.

The ROC curves were subsequently used to compare different gene sets after all gene combinations and all cross validations had been computed. Gene sets were ranked by the following scores:

1) Youden's index averaged over all cross validations

2) Specificity at a sensitivity of 90% averaged over all cross validations
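Both ranking scores can be read off an empirical ROC curve. The following Python sketch shows one way to do this for a single cross validation run; averaging over runs is then a plain mean of the returned values. The function names and the toy scores are assumptions for illustration, samples are called high risk when the score exceeds the cutoff, and Youden's index is taken as the maximum of sensitivity + specificity - 1 over all cutoffs.

```python
def roc_points(scores, labels):
    """Return (cutoff, sensitivity, specificity) for every candidate cutoff.

    labels: 1 for high risk (HR), 0 for low risk (LR). A sample is called
    high risk when its score exceeds the cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for cutoff in [float("-inf")] + sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s > cutoff)
        true_neg = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= cutoff)
        points.append((cutoff, true_pos / positives, true_neg / negatives))
    return points


def youden_index(points):
    """Maximum of sensitivity + specificity - 1 over all cutoffs."""
    return max(sens + spec - 1.0 for _, sens, spec in points)


def specificity_at_sensitivity(points, min_sensitivity=0.90):
    """Best specificity among cutoffs that reach the required sensitivity."""
    return max(spec for _, sens, spec in points if sens >= min_sensitivity)


# Toy data for one cross validation run (illustration only).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
points = roc_points(scores, labels)
print(youden_index(points))
print(specificity_at_sensitivity(points))
```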

The optimal algorithms in both cases were chosen in terms of best scores, and in terms of lower complexity (fewer inputs being used) if scores were about the same (statistically indistinguishable).

The optimal algorithm in terms of Youden's index and low complexity consists of three genes only, namely CXCL13, UBE2C, and GAPDH. In this case, risk assessment of an unknown patient sample is done by

1) Measurement of said target genes, obtaining CT values CT(CXCL13), CT(UBE2C), CT(GAPDH)

2) Transformation of each CT value by z-score normalization across the three genes:

z(CT(CXCL13)) = (CT(CXCL13) - mean) / std

z(CT(UBE2C)) = (CT(UBE2C) - mean) / std

z(CT(GAPDH)) = (CT(GAPDH) - mean) / std

where "mean" denotes the mean of the three CT values, and "std" their standard deviation.

3) Computation of the score by

Score = 0.80 * z(CT(GAPDH)) - 0.26 * z(CT(CXCL13)) - 0.53 * z(CT(UBE2C))

The vector of the values (0.80, -0.26, -0.53) is the optimal choice on our data for the vector r, with each value rounded to two decimals.

4) Assessing the risk by comparing with the threshold value theta = 1.97. If the score exceeds this value, the patient is classified as "high risk", and as "low risk" if the score is smaller than this value. Again, the value of theta has been rounded to two decimals.
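Steps 1) to 4) can be sketched in Python as follows. This is a minimal illustration and not a validated implementation: the function names and the example CT values are made up, and the use of the sample standard deviation (divisor n - 1) for "std" is an assumption, since the text does not state which convention was used.

```python
from statistics import mean, stdev

# Rounded weights r and threshold theta as given in steps 3) and 4).
WEIGHTS = {"GAPDH": 0.80, "CXCL13": -0.26, "UBE2C": -0.53}
THETA = 1.97


def z_scores(ct_values):
    """z-score each CT value using the mean and std of the sample's own CT values.

    The sample standard deviation (n - 1) is an assumption.
    """
    values = list(ct_values.values())
    centre = mean(values)
    spread = stdev(values)
    return {gene: (ct - centre) / spread for gene, ct in ct_values.items()}


def risk_score(ct_values):
    """Weighted sum of the z-scored CT values (step 3)."""
    z = z_scores(ct_values)
    return sum(weight * z[gene] for gene, weight in WEIGHTS.items())


def classify(ct_values, theta=THETA):
    """Step 4: high risk if the score exceeds theta, otherwise low risk."""
    return "high risk" if risk_score(ct_values) > theta else "low risk"


# Made-up CT values for one sample (illustration only).
example_ct = {"GAPDH": 24.0, "CXCL13": 30.0, "UBE2C": 27.0}
print(round(risk_score(example_ct), 2))
print(classify(example_ct))
```

The scale of the score depends on the normalization. With per-sample z-scores of three values and the sample standard deviation, the magnitude of the score cannot exceed about 1.41, so the threshold as printed presumably refers to a differently scaled score; the sketch does not attempt to resolve this.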

The receiver operator characteristic (ROC) curve of the score of different combinations of genes is given in the following Figs. 1 - 21.

The receiver operator characteristic (ROC) curve of the score of only three genes in the optimal case algorithm is given in Fig. 1:

Fig. 1: CXCL13 (R109), UBE2C (R65), GAPDH (FPE029), wherein the terms in parentheses indicate the primer pairs that were used.

The aforementioned three genes may be replaced with alternative genes, yielding algorithms of still very good quality (coefficients rounded as above):

Fig. 2 The combination of (RPL37A, CXCL13, UBE2C) with the weights r = (0.80, -0.28, -0.52) and threshold theta = 2.07

Fig. 3 The combination of (HKM, CXCL13, UBE2C) with the weights r = (0.79, -0.22, -0.57) and threshold theta = 2.00

Other choices for the housekeeper are possible and yield good results. Furthermore, UBE2C can be replaced by any other gene expressed in proliferation, e.g. TOP2A.

Other classifiers with high Youden's index are:

Fig. 4 Combination of (CXCL13, IGHG1, MMP1) with weights r = (0.23, 0.57, -0.79) and threshold theta = 1.33

Fig. 5 Combination of (CXCL13, UBE2C, IGKC, SOX4) with weights r = (0.22, -0.32, -0.60, 0.70), and threshold theta = 2.14.

Fig. 6 Combination of (CXCL13, UBE2C, H2AFZ) with weights r = (-0.17, -0.61, 0.78) and threshold theta = 1.61

Fig. 7 Combination of (HKM, CXCL13, SOX4) with weights r = (0.82, -0.41, -0.41) with threshold theta = 1.90

Fig. 8 Combination of (HKM, CXCL13, UBE2C, GAPDH) with weights r = (0.76, -0.30, -0.56, 0.10) with threshold theta = 2.77

For the highest average specificity at 90% sensitivity, the following choices are optimal:

Fig. 9 Combination of (ESR1, IGHG1, TOP2A) with weights r = (0.29, 0.51, -0.81) and a threshold of theta = 0.92

Fig. 10 Combination of (CXCL13, MLPH , TOP2A) with weights r = (0.58, 0.20, -0.79) and a threshold of theta = 0.38

Fig. 11 Combination of (CXCL13, MLPH, SOX4, GAPDH) with weights r = (-0.42, 0.04, -0.42, 0.80) and a threshold of theta = 2.30

Fig. 12 Combination of (CXCL13, H2AFZ, GAPDH) with weights r = (-0.38, -0.43, 0.82) and a threshold of theta = 1.76

Fig. 13 Combination of (CXCL13, UBE2C, IGKC, MMP1, GAPDH) with weights r = (-0.05, -0.26, -0.23, -0.33, 0.88) with a threshold of theta = 3.61

Fig. 14 Combination of (HKM, ESRl, SOX4) with weights r = (0.81, -0.32, -0.49) and a threshold of theta = 1.99

Fig. 15 Combination of (HKM, CXCL13, MLPH, SOX4) with weights r = (0.54, 0.12, 0.16, -0.82) and a threshold theta = 0.94

Fig. 16 Combination of (HKM, CXCL13, H2AFZ) with weights r = (0.81, -0.34, -0.48) with threshold theta = 1.84

Fig. 17 Combination of (HKM, CXCL13, H2AFZ, SOX4) with weights r = (0.82, -0.48, -0.32, -0.01) and threshold theta = 2.91

Fig. 18 Combination of (HKM, CXCL13, H2AFZ, MLPH) with weights r = (0.84, -0.33, -0.41, -0.10) and threshold theta = 2.49

Fig. 19 Combination of (HKM, CXCL13, H2AFZ, IGKC, MMP1) with weights r = (0.55, 0.12, -0.82, 0.02, 0.13) with threshold theta = 0.62

Fig. 20 Combination of (HKM, CXCL13, UBE2C, IGKC, MLPH) with weights r = (0.83, -0.07, -0.38, -0.40, 0.02) and threshold theta = 3.47

Fig. 21 Combination of (HKM, CXCL13, UBE2C, H2AFZ, IGKC) with weights r = (0.88, -0.13, -0.23, -0.19, -0.34) and a cutoff of theta = 3.50

Further, reference profile values were determined for the set of genes consisting of HKM, UBE2C, CXCL13 with regard to the endpoints metastasis within 5 years, metastasis within 10 years, death after recurrence within 5 years, and death after recurrence within 10 years:

Endpoint: metastasis within 5 years: s = (0.793, -0.564, -0.229), theta = 1.910

Endpoint: metastasis within 10 years: s = (0.776, -0.608, -0.168), theta = 1.931

Endpoint: death after recurrence within 5 years: s = (0.808, -0.506, -0.302), theta = 1.839

Endpoint: death after recurrence within 10 years: s = (0.786, -0.584, -0.203), theta = 1.902

Table 8: Further genes that can be used in the method of the invention :

Gene      Primer ID  RefSeq ID
CA12      R99        NM_001218
CCND1     R18        NM_053056
CDC2      BC103      NM_001786
CHPT1     R138       NM_020244
CTSB      R128       NM_147780
CTSL1     R129       NM_001912
DCN       R88        NM_001920
EIF4B     BC283      NM_001417
FBXO28    FBXO28     NM_015176
GABRP     R134-2     NM_014211
IGFBP3    BC108      NM_000598
KIAA0101  R135       NM_014736
KRT17     R131-2     NM_000422
NAT1      R101       NM_000662
NEK2      NEK2       NM_002497
NPY1R     R133       NM_000909
NR2F2     R136       NM_021005
PCNA      R130       NM_002592
PDLIM5    R137       NM_006457
PRC1      PRC1       NM_003981
RACGAP1   R125-2     NM_013277
RGS5      R-RGS5     NM_003617
SFRP1     CAGMC227   NM_003012
VEGFa     SC016      NM_001025366

Table 9: PCR Probes used for the genes of table 3

Gene List No.  SEQ ID NO  Gene-ID   RefSeq ID       Probe ID      FAM 5' Sequence 3' TAMRA
1              1          ACTG1     NM_001614       R113-ACTG1    AGGCCCCCCTGAACCCCAAG
2              2          PPIA      NM_021130       R115-PPIA     TGGTTGGATGGCAAGCATGTGGTG
3              3          CALM2     NM_001743       R117-CALM2    TCGCGTCTCGGAAACCGGTAGC
4              4          OAZ1      NM_004152       BC268-RNA     TGCTTCCACAAGAACCGCGAGGA
5              5          GAPD      NM_002046       R15           AAGGTGAAGGTCGGAGTCAACGGATTTG
6              6          RPL37A    NM_000998       R16           TGGCTGGCGGTGCCTGGA
7              7          CLEC2B    NM_005127       R120-CLEC2B   AGTTTATGCCCCTATGATTGGATTGGTTTCC
8              8          Cxcl13    NM_006419       R109          TGGTCAGCAGCCTCTCTCCAGTCCA
9              9          DHRS2     NM_182908       CAGMC198      AGGCAGAGTCTGCCATTCTGCCAGAC
10             10         ERBB2     NM_004448       FPE_044       CAGATTGCCAAGGGGATGAGCTACCTG
11             11         ESR1      NM_000125       BC170         ATGCCCTTTTGCCGATGCA
12             12         H2AFZ     NM_002106 rev   R123          CCAGCCATTTCGAATTCCGCTGAA
13             13         IgHG1     BC067091        R72           TGACAAAACTCACACATGCCCACCG
14             14         IgKC      affy211645      R61           AGCAGCCTGCAGCCTGAAGATTTTGC
15             15         KCTD3     NM_016121       CAGMC217      CACCTGTTGATCGTCTCGCTCTCCAA
16             16         MLPH      NM_024101       R49           CCAAATGCAGACCCTTCAAGTGAGGC
17             17         MMP1      NM_002421       R-MMP1 mav1   AGAGAGTACAACTTACATCGTGTTGCGGCTCA
18             18         PGR       NM_000926       BC172         TTGATAGAAACGCTGTGAGCTCGA
19             19         SOX4      NM_003107 rev   R124          CGCCGCTCGATCTGCGACC
20             20         TOP2A     NM_001067       R70           CAGATCAGGACCAAGATGGTTCCCACAT
21             21         UBE2C     NM_007019       R65           TGAACACACATGCTGCCGAGCTCTG
22             22         CA12      NM_001218       R99           CGGCCCCAGTGAACGGTTCCA
23             23         CCND1     NM_053056 rev   R18           TCGCACTTCTGTTCCTCGCAGACCT
24             24         CDC2      NM_001786       BC103         TCGAAAATGTTAATCTATGATCCAGCCAAACGA
25             25         CHPT1     NM_020244       R138          CCACGGCCACCGAAGAGGCAC
26             26         CTSB      NM_147780       R128          CGGGCACAACTTCTACAACGTGGACAT
27             27         CTSL1     NM_001912       R129          AAGGCGATGCACAACAGATTATACGG
28             28         DCN       NM_001920       R88           TCTTTTCAGCAACCCGGTCCA
29             29         EIF4B     NM_001417       BC283         CCCACCACTTGTAGGGGACTGCT
30             30         FBXO28    NM_015176       R141          ACGACGAAATTAGCCAGCTCCGC
31             31         GABRP     NM_014211       R134-2        ACAGTTCCTTACAGCAGATGGCAGCCA
32             32         IGFBP3    NM_000598       BC108         TGGGCTGCTCTCCCGGAGGC
33             33         KIAA0101  NM_014736       R135          TCGAGCCCCCAGAAAGGTGCTTG
34             34         KRT17     NM_000422 rev   R131-2        CACCTCGCGGTTCAGTTCCTCTGT
35             35         NAT1      NM_000662       R101          CCTGGTTGCCGGCTGAAATAAC
36             36         NEK2      NM_002497       R140          TCCTGAACAAATGAATCGCATGTCCTACAA
37             37         NPY1R     NM_000909       R133          CATTCCCTTGAACTGAACAATCCTCTTTGGA
38             38         NR2F2     NM_021005       R136          CCTCAAGGCCATAGTCCTGTTCACCTCA
39             39         PCNA      NM_002592       R130          AAATACTAAAATGCGCCGGCAATGA
40             40         PDLIM5    NM_006457       R137          CATTGGACTTTGAGCCATTAGAACCATGAGC
41             41         PRC1      NM_003981       R139          TCTGCAGACATACTATGGACTCCTCCGCC
42             42         RACGAP1   NM_013277 rev   R125-2        ACTGAGAATCTCCACCCGGCGCA
43             43         RGS5      NM_003617       R-RGS         CCAGAAAACCTCGCTGGACGAGG
44             44         SFRP1     NM_003012       CAGMC         AAGCCCCAAGGCACAACGGTGT
45             45         VEGFa     NM_001025366    BC302/        CACCATGCAGATTATGCGGATCAAACCT

Table 10: 5' Primers used for the genes of table 3

Gene List No.  SEQ ID NO  Gene-ID   5' Primer ID        5' Sequence 3'
1              46         ACTG1     R113-ACTG1for       CTGGCACCACACCTTCTACAAC
2              47         PPIA      R115-PPIAfor        TTTCATCTGCACTGCCAAGACT
3              48         CALM2     R117-CALM2for       GAGCGAGCTGAGTGGTTGTG
4              49         OAZ1      BC268-RNAfor        CGAGCCGACCATGTCTTCAT
5              50         GAPD      R15for              GCCAGCCGAGCCACATC
6              51         RPL37A    R16for              TGTGGTTCCTGCATGAAGACA
7              52         CLEC2B    R120-CLEC2Bfor      TTATTACTCTGATAGTTAAACTAACTCGAGATTCTC
8              53         Cxcl13    R109for             CGACATCTCTGCTTCTCATGCT
9              54         DHRS2     CAGMC198for         GGTGTCTAGGTGATCATTTGGATCT
10             55         ERBB2     HER-2/neu FP 2607F  CCAGGACCTGCTGAACTGGT
11             56         ESR1      BC170for            GCCAAATTGTGTTTGATGGATTAA
12             57         H2AFZ     R123for             CGGAGTCCTTTCCAGCCTTA
13             58         IgHG1     R72for              GGACAAGAAAGTTGAGCCCAAA
14             59         IgKC      R61for              GATCTGGGACAGAATTCACTCTCA
15             60         KCTD3     CAGMC217for         GCTATTCCTCTGGAAATGACATAGGA
16             61         MLPH      R49for              TCGAGTGGCTGGGAAACTTG
17             62         MMP1      MMP1-FWmav1         AGATGAAAGGTGGACCAACAATTT
18             63         PGR       BC172for            AGCTCATCAAGGCAATTGGTTT
19             64         SOX4      R124for             GGCGACTGCTCCATGATCTT
20             65         TOP2A     R70for              CATTGAAGACGCTTCGTTATGG
21             66         UBE2C     R65for              CTTCTAGGAGAACCCAACATTGATAGT
22             67         CA12      R99for              TGCTCCTGCTGGTGATCTTAAA
23             68         CCND1     R18for              GAAGCGGTCCAGGTAGTTCATG
24             69         CDC2      BC103for            GAAGCCTAGCATCCCATGTCA
25             70         CHPT1     R138for             CGCTCGTGCTCATCTCCTACT
26             71         CTSB      R128for             GGTCAACTATGTCAACAAACGGAAT
27             72         CTSL1     R129for             AGAGGCACAGTGGACCAAGTG
28             73         DCN       R88for              AAGGCTTCTTATTCGGGTGTGA
29             74         EIF4B     BC283for            CTCGATCTCAGAGCTCAGACACA
30             75         FBXO28    R141for             CCATCGAGAACATCCTCAGCTT
31             76         GABRP     R134-2for           GCCTTGCTAGAATATGCAGTTGCT
32             77         IGFBP3    BC108for            AGGAGCTCACGCCCAGAGA
33             78         KIAA0101  R135for             AGTGTTCCAGGCACTTACAGAAAA
34             79         KRT17     R131-2for           CGAGGATTGGTTCTTCAGCAA
35             80         NAT1      R101for             TGGGAGGATTGCATTCAGTCT
36             81         NEK2      R140for             ATTTGTTGGCACACCTTATTACATGT
37             82         NPY1R     R133for             TATTCCCCATATTGGAATCCATTT
38             83         NR2F2     R136for             AGGCGCTGCACGTTGAC
39             84         PCNA      R130for             GGGCGTGAACCTCACCAGTA
40             85         PDLIM5    R137for             CGGACCCGAGCATATTTCAT
41             86         PRC1      R139for             CCCATATTTCCCGAAGGTGAT
42             87         RACGAP1   R125-2for           TCGCCAACTGGATAAATTGGA
43             88         RGS5      R-RGS5for           GGTGACCTTGTCATTCCGTACA
44             89         SFRP1     CAGMC227for         ACGTCTGCATCGCCATGA
45             90         VEGFa     BC302/SC016for      GCCCACTGAGGAGTCCAACA

Table 11: 3' Primers used for the genes of table 3

Gene List No.  SEQ ID NO  Gene-ID   3' Primer ID        5' Sequence 3'
1              91         ACTG1     R113-ACTG1rev       CAAACATAATCTGAGTCATCTTCTCTCTGT
2              92         PPIA      R115-PPIArev        TATTCATGCCTTCTTTCACTTTGC
3              93         CALM2     R117-CALM2rev       AGTCAGTTGGTCAGCCATGCT
4              94         OAZ1      BC268-RNArev        AAGCCCAAAAAGCTGAAGGTT
5              95         GAPD      R15rev              CCAGGCGCCCAATACG
6              96         RPL37A    R16rev              GTGACAGCGGAAGTGGTATTGTAC
7              97         CLEC2B    R120-CLEC2Brev      TTGAATTCCAATCTCCTTCTTCTTTAGA
8              98         Cxcl13    R109rev             AGCTTGTGTAATAGACCTCCAGAACA
9              99         DHRS2     CAGMC198rev         TGAGTAAGCCCCCAAATTGC
10             100        ERBB2     HER-2/neu RP 2678R  TGTACGAGCCGCACATCC
11             101        ESR1      BC170rev            GACAAAACCGAGTCACATCAGTAATAG
12             102        H2AFZ     R123rev             TCTCTGCCTTGCTTGCTTGA
13             103        IgHG1     R72rev              GGGTCCGGGAGATCATGAG
14             104        IgKC      R61rev              GCCGAACGTCCAAGGGTAA
15             105        KCTD3     CAGMC217rev         TGATGGGAACAACTTTCTGGATAA
16             106        MLPH      R49rev              AGATAGGGCACAGCCATTGC
17             107        MMP1      MMP1-REmav1         CCAAGAGAATGGCCGAGTTC
18             108        PGR       BC172rev            ACAAGATCATGCAAGTTATCAAGAAGTT
19             109        SOX4      R124rev             CACATCAAGCGACCCATGAA
20             110        TOP2A     R70rev              CCAGTTGTGATGGATAAAATTAATCAG
21             111        UBE2C     R65rev              GTTTCTTGCAGGTACTTCTTAAAAGCT
22             112        CA12      R99rev              CCCCATCAGGACCAAAATAAGTC
23             113        CCND1     R18rev              AGATCGTCGCCACCTGGAT
24             114        CDC2      BC103rev            CAGTGCCATTTTGCCAGAAA
25             115        CHPT1     R138rev             CCCAGTGCACATAAAAGGTATGTC
26             116        CTSB      R128rev             AAGGTACCACATAGCCTCTTCAAGTAG
27             117        CTSL1     R129rev             CTCTCCTCCATCCTTCTTCATTCA
28             118        DCN       R88rev              TGGATGGCTGTATCTCCCAGTA
29             119        EIF4B     BC283rev            GCATTCATCCCATCTACTTTATTTTCAT
30             120        FBXO28    R141rev             TCTCTGGCAGACCAAGTCCAT
31             121        GABRP     R134-2rev           ACTTCCTTTGTTGTCCCCCTATC
32             122        IGFBP3    BC108rev            AGCCTGACTTTGCCAGACCTT
33             123        KIAA0101  R135rev             CTCGATGAAACTGATGTCGAATTAGT
34             124        KRT17     R131-2rev           ACTCTGCACCAGCTCACTGTTG
35             125        NAT1      R101rev             TGCTTCTTCCTGGCTTGAATTC
36             126        NEK2      R140rev             AAGCAGCCCAATGACCAGATA
37             127        NPY1R     R133rev             CCACCCTTCCTTCTTTAATAAGCA
38             128        NR2F2     R136rev             CTGAGACTTTTCCTGCAAGCTTT
39             129        PCNA      R130rev             CTTCGGCCCTTAGTGTAATGATATC
40             130        PDLIM5    R137rev             AGGAGCTGGGCCAACCA
41             131        PRC1      R139rev             CCGCCATGAGGAGAAGTGA
42             132        RACGAP1   R125-2rev           GAATGTGCGGAATCTGTTTGAG
43             133        RGS5      R-RGS5rev           AAGTCCATAGTTGTTCTGCAGGAGTT
44             134        SFRP1     CAGMC227rev         TGTTCAATGATGGCCTCAGATT
45             135        VEGFa     BC302/SC016rev      TCCTATGTGCTGGCCTTGGT

Table 12: Amplicons obtained for the genes of table 3

Gene List No.  SEQ ID NO  Gene-ID   Amplicon
1              136        ACTG1     CTGGCACCACACCTTCTACAACGAGCTGCGCGTGGCCCCGGAGGAGCACCCAGTGCTGCTGACCGAGGCCCCCCTGAACCCCAAGGCCAACAGAGAGAAGATGACTCAGATTATGTTTG
2              137        PPIA      TTTCATCTGCACTGCCAAGACTGAGTGGTTGGATGGCAAGCATGTGGTGTTTGGCAAAGTGAAAGAAGGCATGAATA
3              138        CALM2     GAGCGAGCTGAGTGGTTGTGTGGTCGCGTCTCGGAAACCGGTAGCGCTTGCAGCATGGCTGACCAACTGACT
4              139        OAZ1      TTGCTTCCACAAGAACCGCGAGGACAGAGCCGCCTTGCTCCGAACCTTCAGCTTTTTGGGCTT
5              140        GAPD      GCCAGCCGAGCCACATCGCTCAGACACCATGGGGAAGGTGAAGGTCGGAGTCAACGGATTTGGTCGTATTGGGCGCCTGG
6              141        RPL37A    TGTGGTTCCTGCATGAAGACAGTGGCTGGCGGTGCCTGGACGTACAATACCACTTCCGCTGTCAC
7              142        CLEC2B    TTATTACTCTGATAGTTAAACTAACTCGAGATTCTCAGAGTTTATGCCCCTATGATTGGATTGGTTTCCAAAACAAATGCTATTATTTCTCTAAAGAAGAAGGAGATTGGAATTCAA
8              143        Cxcl13    TCGACATCTCTGCTTCTCATGCTGCTGGTCAGCAGCCTCTCTCCAGTCCAAGGTGTTCTGGAGGTCTATTACACAAGCT
9              144        DHRS2     GGTGTCTAGGTGATCATTTGGATCTGGAGGCAGAGTCTGCCATTCTGCCAGACTAGCAATTTGGGGGCTTACTCA
10             145        ERBB2     CCAGGACCTGCTGAACTGGTGTATGCAGATTGCCAAGGGGATGAGCTACCTGGAGGATGTGCGGCTCGTACA
11             146        ESR1      GCCAAATTGTGTTTGATGGATTAATATGCCCTTTTGCCGATGCATACTATTACTGATGTGACTCGGTTTTGTC
12             147        H2AFZ     CGGAGTCCTTTCCAGCCTTACCGCCAGCCATTTCGAATTCCGCTGAAGCTCAAGCAAGCAAGGCAGAGA
13             148        IgHG1     GGACAAGAAAGTTGAGCCCAAATCTTGTGACAAAACTCACACATGCCCACCGTGCCCAGCACCTGAACTCCTGGGGGGACCGTCAGTCTTCCTCTTCCCCCCAAAACCCAAGGACACCCTCATGATCTCCCGGACCC
14             149        IgKC      GATCTGGGACAGAATTCACTCTCACAATCAGCAGCCTGCAGCCTGAAGATTTTGCAACTTATTACTGTCTACAGCATAATAGTTACCCTTGGACGTTCGGC
15             150        KCTD3     GCTATTCCTCTGGAAATGACATAGGACCTTTTGGAGAGCGAGACGATCAACAGGTGTTTATCCAGAAAGTTGTTCCCATCA
16             151        MLPH      TCGAGTGGCTGGGAAACTTGGCAAGAGACCAGAGGACCCAAATGCAGACCCTTCAAGTGAGGCCAAGGCAATGGCTGTGCCCTATCT
17             152        MMP1      AGATGAAAGGTGGACCAACAATTTCAGAGAGTACAACTTACATCGTGTTGCGGCTCATGAACTCGGCCATTCTCTTGG
18             153        PGR       AGCTCATCAAGGCAATTGGTTTGAGGCAAAAAGGAGTTGTGTCGAGCTCACAGCGTTTCTATCAACTTACAAAACTTCTTGATAACTTGCATGATCTTGT
19             154        SOX4      GGCGACTGCTCCATGATCTTGCGCCGCTCGATCTGCGACCACACCATGAAGGCGTTCATGGGTCGCTTGATGTG
20             155        TOP2A     CATTGAAGACGCTTCGTTATGGGAAGATAATGATTATGACAGATCAGGACCAAGATGGTTCCCACATCAAAGGCTTGCTGATTAATTTTATCCATCACAACTGG
21             156        UBE2C     CTTCTAGGAGAACCCAACATTGATAGTCCCTTGAACACACATGCTGCCGAGCTCTGGAAAAACCCCACAGCTTTTAAGAAGTACCTGCAAGAAAC
22             157        CA12      TGCTCCTGCTGGTGATCTTAAAGGAACAGCCTTCCAGCCCGGCCCCAGTGAACGGTTCCAAGTGGACTTATTTTGGTCCTGATGGGG
23             158        CCND1     GAAGCGGTCCAGGTAGTTCATGGCCAGCGGGAAGACCTCCTCCTCGCACTTCTGTTCCTCGCAGACCTCCAGCATCCAGGTGGCGACGATCT
24             159        CDC2      GAAGCCTAGCATCCCATGTCAAAAACTTGGATGAAAATGGCTTGGATTTGCTCTCGAAAATGTTAATCTATGATCCAGCCAAACGAATTTCTGGCAAAATGGCACTG
25             160        CHPT1     CGCTCGTGCTCATCTCCTACTGTCCCACGGCCACCGAAGAGGCACCATACTGGACATACCTTTTATGTGCACTGGG
26             161        CTSB      GGTCAACTATGTCAACAAACGGAATACCACGTGGCAGGCCGGGCACAACTTCTACAACGTGGACATGAGCTACTTGAAGAGGCTATGTGGTACCTT
27             162        CTSL1     AGAGGCACAGTGGACCAAGTGGAAGGCGATGCACAACAGATTATACGGCATGAATGAAGAAGGATGGAGGAGAG
28             163        DCN       AAGGCTTCTTATTCGGGTGTGAGTCTTTTCAGCAACCCGGTCCAGTACTGGGAGATACAGCCATCCA
29             164        EIF4B     CTCGATCTCAGAGCTCAGACACAGAGCAGCAGTCCCCTACAAGTGGTGGGGGAAAAGTAGCTCCAGCTCAACCATCTGAGGAAGGACCAGGAAGGAAAGATGAAAATAAAGTAGATGGGATGAATGC
30             165        FBXO28    CCATCGAGAACATCCTCAGCTTTATGTCCTACGACGAAATTAGCCAGCTCCGCCTGGTTTGTAAAAGAATGGACTTGGTCTGCCAGAGA
31             166        GABRP     GCCTTGCTAGAATATGCAGTTGCTCACTACAGTTCCTTACAGCAGATGGCAGCCAAAGATAGGGGGACAACAAAGGAAGT
32             167        IGFBP3    AGGAGCTCACGCCCAGAGACTGGGCTGCTCTCCCGGAGGCCAAACCCAAGAAGGTCTGGCAAAGTCAGGCT
33             168        KIAA0101  AGTGTTCCAGGCACTTACAGAAAAGTGGTGGCTGCTCGAGCCCCCAGAAAGGTGCTTGGTTCTTCCACCTCTGCCACTAATTCGACATCAGTTTCATCGAG
34             169        KRT17     CGAGGATTGGTTCTTCAGCAAGACAGAGGAACTGAACCGCGAGGTGGCCACCAACAGTGAGCTGGTGCAGAGT
35             170        NAT1      TGGGAGGATTGCATTCAGTCTAGTTCCTGGTTGCCGGCTGAAATAACCTGAATTCAAGCCAGGAAGAAGCA
36             171        NEK2      ATTTGTTGGCACACCTTATTACATGTCTCCTGAACAAATGAATCGCATGTCCTACAATGAGAAATCAGATATCTGGTCATTGGGCTGCTT
37             172        NPY1R     TATTCCCCATATTGGAATCCATTTACCAAAATTATTCTGAATTCTTCATTCCCTTGAACTGAACAATCCTCTTTGGAATTTGTCTTTTTCGCTCCTGCTTATTAAAGAAGGAAGGGTGG
38             173        NR2F2     AGGCGCTGCACGTTGACTCAGCCGAGTACAGCTGCCTCAAGGCCATAGTCCTGTTCACCTCAGATGCCTGTGGTCTCTCTGATGTAGCCCATGTGGAAAGCTTGCAGGAAAAGTCTCAG
39             174        PCNA      GGGCGTGAACCTCACCAGTATGTCCAAAATACTAAAATGCGCCGGCAATGAAGATATCATTACACTAAGGGCCGAAG
40             175        PDLIM5    CGGACCCGAGCATATTTCATTTTCTGTCATTGGACTTTGAGCCATTAGAACCATGAGCAACTACAGTGTGTCACTGGTTGGCCCAGCTCCT
41             176        PRC1      CCCATATTTCCCGAAGGTGATTTAGGGCTTTCTGCAGACATACTATGGACTCCTCCGCCAGCACCTCACTTCTCCTCATGGCGG
42             177        RACGAP1   TCGCCAACTGGATAAATTGGACTTCATTTCCTTCACTGAGAATCTCCACCCGGCGCACAAGCTGCTCAAACAGATTCCGCACATTC
43             178        RGS5      GGTGACCTTGTCATTCCGTACAATGAGAAGCCAGAGAAACCAGCCAAGACCCAGAAAACCTCGCTGGACGAGGCCCTGCAGTGGCGTGATTCCCTGGACAAACTCCTGCAGAACAACTATGGACTT
44             179        SFRP1     ACGTCTGCATCGCCATGACGCCGCCCAATGCCACCGAAGCCTCCAAGCCCCAAGGCACAACGGTGTGTCCTCCCTGTGACAACGAGTTGAAATCTGAGGCCATCATTGAACA
45             180        VEGFa     GCCCACTGAGGAGTCCAACATCACCATGCAGATTATGCGGATCAAACCTCACCAAGGCCAGCACATAGGA

In summary, the present invention is predicated on a method of identification of a panel of genes informative for the outcome of disease which can be combined into an algorithm for a prognostic or predictive test. The algorithm makes use of gene expression data from biological samples and classifies patients as having a high risk or low risk, e.g. in cancer patients a metastasis bad outcome or good outcome group. Reference patterns of gene expression are obtained for the high risk and low risk groups, respectively. A sample of an unknown patient is analyzed and classified as belonging to the high risk or low risk group, respectively, depending on correlation or similarity to the high risk reference pattern or low risk reference pattern. The above examples are for illustrative purposes only. Many modifications and variations of the invention as defined by the scope of the claims are possible.