

Title:
YIN-YANG GENE EXPRESSION RATIO MODELS FOR GENERATION OF CLINICAL PROGNOSIS SIGNATURES FOR LUNG CANCER PATIENTS
Document Type and Number:
WIPO Patent Application WO/2016/011524
Kind Code:
A1
Abstract:
A method for selection of a set of Yin genes and Yang genes for use in development of cancer patient prognosis, comprising: (i) screening gene expression data sets collected from a plurality of cancer tissue samples of patients to identify a first plurality of genes expressed in cancerous cells; (ii) screening gene expression data sets collected from a plurality of healthy subjects or cancer adjacent normal tissue samples to identify a second plurality of genes expressed in healthy cells; and (iii) selecting a set of Yin genes and a set of selected Yang genes using pathway and gene functional analysis; (iv) building a Yin and Yang gene expression ratio (YMR) model as a cancer prognosis signature. Implementing prognosis signature software for a cancer patient by: (v) detecting the expression values of selected Yin genes and selected Yang genes; and (vi) calculating the YMR for predicting the patient's clinical outcomes.

Inventors:
XU WAYNE (CA)
Application Number:
PCT/CA2014/050704
Publication Date:
January 28, 2016
Filing Date:
July 24, 2014
Assignee:
UNIV MANITOBA (CA)
International Classes:
C04B40/06; C07H21/04; C12Q1/68; C40B30/04; G16B20/20; G16B25/10
Domestic Patent References:
WO2014176687A12014-11-06
Other References:
BHATTACHARJEE, A. ET AL.: "Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses.", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES USA., vol. 98, no. 24, 20 November 2001 (2001-11-20), pages 13790 - 13795, XP003011749, ISSN: 0027-8424, DOI: doi:10.1073/pnas.191502998
XU, W. ET AL.: "Yin Yang gene expression ratio signature for lung cancer prognosis.", PLOS ONE., vol. 8, no. 7, 17 July 2013 (2013-07-17), pages 1 - 11, ISSN: 1932-6203
Attorney, Agent or Firm:
POLONENKO, Daniel R. et al. (Box 30 Suite 2300 - 550 Burrard Street, Vancouver British Columbia V6C 2B5, CA)
Claims:
CLAIMS

1. A method for identification and selection of a set of Yin genes and Yang genes for use in development of a prognosis for a cancer patient, the method comprising:

screening gene expression data sets collected from tissue samples collected from a plurality of cancer patients to identify a first plurality of genes expressed in said patients' cancerous cells;

screening gene expression data sets collected from one of: (i) healthy tissue samples adjacent to cancerous cells collected from a plurality of cancer patients, or (ii) tissue samples collected from a plurality of healthy subjects, to identify a second plurality of genes expressed in said healthy subjects' cells;

using pathway functional analysis to identify and select: (iii) a set of Yin genes and (iv) a set of Yang genes;

building a Yin gene and Yang gene expression ratio (YMR) model for use as a cancer prognosis signature.

2. A method for preparation of a prognosis signature for a cancer patient, the method comprising the steps of:

collecting samples from the patient's cancerous tissues;

using qPCR, detecting expression in the patient's tissue samples of Yin genes and Yang genes selected by the method of claim 1;

using the YMR model to calculate a cancer prognosis signature for the patient.

3. A first set of gene probes for Yin genes and a second set of gene probes for Yang genes for use in preparation of a prognosis signature for a lung cancer patient wherein:

the first set of gene probes comprises the 31 nucleotide sequences set forth in the listings of SEQ ID NO:2, SEQ ID NO:5, SEQ ID NO:9, SEQ ID NO:10, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO: 16, SEQ ID NO: 17, SEQ ID NO:23, SEQ ID NO:28, SEQ ID NO:29, SEQ ID NO:31, SEQ ID NO:32, SEQ ID NO:34, SEQ ID NO:35, SEQ ID NO:36, SEQ ID NO:38, SEQ ID NO:41, SEQ ID NO:43, SEQ ID NO:44, SEQ ID NO:45, SEQ ID NO:49, SEQ ID NO:50, SEQ ID NO:52, SEQ ID NO:58, SEQ ID NO:61, SEQ ID NO:62, SEQ ID NO:63, SEQ ID NO:64, SEQ ID NO:67, SEQ ID NO:69; and

the second set of gene probes comprises the 32 nucleotide sequences set forth in the listings of SEQ ID NO:73, SEQ ID NO:77, SEQ ID NO:79, SEQ ID NO:83, SEQ ID NO:90, SEQ ID NO: 101, SEQ ID NO:105, SEQ ID NO: 112, SEQ ID NO: 117, SEQ ID NO: 120, SEQ ID NO: 122, SEQ ID NO:124, SEQ ID NO: 125, SEQ ID NO: 130, SEQ ID NO: 131, SEQ ID NO: 134, SEQ ID NO:136, SEQ ID NO: 140, SEQ ID NO: 141, SEQ ID NO: 142, SEQ ID NO: 143, SEQ ID NO:148, SEQ ID NO: 149, SEQ ID NO: 155, SEQ ID NO: 161, SEQ ID NO: 164, SEQ ID NO: 165.

4. A first set of gene probes for Yin genes and a second set of gene probes for Yang genes according to claim 3, wherein:

the first set of gene probes comprises the 4 nucleotide sequences set forth in the listings of SEQ ID NO: 16, SEQ ID NO:28, SEQ ID NO:31, SEQ ID NO:61 and

the second set of gene probes comprises the nucleotide sequences set forth in the listings of SEQ ID NO:94, SEQ ID NO: 120, SEQ ID NO: 125, SEQ ID NO: 136, SEQ ID NO: 161, SEQ ID NO: 171.

5. Use of the first set of gene probes and the second set of gene probes of claim 3, for the development of a prognosis signature for a lung cancer patient.

Description:
TITLE: YIN-YANG GENE EXPRESSION RATIO MODELS FOR GENERATION OF CLINICAL PROGNOSIS SIGNATURES FOR LUNG CANCER PATIENTS

TECHNICAL FIELD

The present disclosure pertains to development of prognoses for lung cancer patients. More particularly, the present disclosure pertains to systems, methods and tools for assessment of gene expression in lung cancer cells.

BACKGROUND

Lung cancer is the leading cause of cancer-related deaths in North America. While there has been a decrease in lung cancer deaths among men due to a reduction in tobacco use over the past 50 years, it still accounts for 29% of all male cancer deaths in 2010 (Jemal et al, 2010, Cancer statistics 2010. CA Cancer J. Clin. 60(5):277-300). The 5-year overall survival rate for lung cancer is as low as 16% and has not significantly improved over the past 30 years (Jemal et al, 2010). Non-small cell lung cancer (NSCLC) is the most common lung cancer category accounting for 85%-90% of annual cases. About 25% to 30% of NSCLC patients present with early stage I disease and receive surgical intervention. However, more than 20% of these patients relapse within five years (Rinewalt et al, 2012, Development of a serum biomarker panel predicting recurrence in stage I non-small cell lung cancer patients. J. Thorac. Cardiovasc. Surg. 144: 1344-51). Adjuvant therapy has improved survival of a subset of patients with stage II and III disease. However, it is not known which patients are more likely to relapse and benefit more from additional therapies.

To improve clinical outcomes, researchers have invested much effort into identifying lung cancer biomarkers that would allow clinicians to make an early diagnosis and to predict disease course and treatment effect. Genome-wide expression profiling using microarray techniques has identified potential gene signatures to classify patients into different survival outcome cohorts. All of these previously reported models were built by learning the correlation coefficients between gene expression and patients' survival time from training data sets, and they require that new test data sets be normalized to the training data. Consequently, these signatures have low reproducibility and are impractical in a clinical setting, and there is little evidence that any of the reported gene expression signatures are ready for clinical application.

SUMMARY

The exemplary embodiments of the present disclosure pertain to methods for developing biomarker signatures for use in providing prognoses for cancer patients. The biomarker signatures are based on a Yin Yang hypothesis that the imbalance of two opposing effects in cancer cells determines a patient's prognosis. Specifically, the expression values of selected Yin genes and Yang genes are extracted from a patient's microarray expression data and then calculated as a mean ratio of the Yin (Y) gene expression and the Yang (y) gene expression. This mean ratio is referred to herein as the "YMR signature".
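By way of a non-limiting illustration, the calculation described above may be sketched in Python as follows. The sketch assumes that the YMR is the arithmetic mean of the Yin gene expression values divided by the arithmetic mean of the Yang gene expression values, both on a linear (not log) scale. The function name `ymr`, the gene labels, and the expression values are hypothetical and are not taken from the present disclosure.

```python
from statistics import mean


def ymr(expression, yin_genes, yang_genes):
    """Return mean Yin expression divided by mean Yang expression.

    `expression` maps a gene label to a linear-scale expression value
    for a single patient sample (assumed format, for illustration only).
    """
    yin_mean = mean(expression[g] for g in yin_genes)
    yang_mean = mean(expression[g] for g in yang_genes)
    return yin_mean / yang_mean


# Hypothetical expression values for one patient sample.
patient = {"YIN_A": 8.0, "YIN_B": 4.0, "YANG_A": 2.0, "YANG_B": 4.0}
score = ymr(patient, ["YIN_A", "YIN_B"], ["YANG_A", "YANG_B"])
print(score)
```

Because both means are taken within the same sample, a ratio of this kind needs no coefficients learned from a training data set, which is the property the disclosure relies on.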

One embodiment of the present disclosure is a method for developing a Yin-Yang set of genes for use in the development of a prognosis for a cancer patient. One step of the method pertains to the identification and selection of a group of genes that function predominantly in normal cells, and terming these genes the "Yang genes". Another step of the method pertains to the identification and selection of a group of genes that function predominantly in selected cancerous cells, and terming these genes the "Yin genes". The ratio of the expression levels of these two groups of genes indicates whether the cells are functioning normally or, alternatively, in a cancer-mode modality. Cancer prognosis is a complex, multi-dimensional process. The essence of the Yin-Yang theory is that it reduces the multi-dimensional complications to two simple opposing dimensions. The extent of the two-dimensional (Yin and Yang) imbalance would indicate the severity or progression of the cancer and thus predict the patient's survival time.

An aspect of the present disclosure pertains to a model for determination of predictive signature models for cancer patients. The model is a mathematical formula with coefficients obtained from training data. This model produces a risk score for each patient that enables stratification of the patient into a high-risk group or a low-risk group. According to another aspect, there are four steps for development of a cancer prognosis signature. The first step is identifying and selecting genes that can be used for assessing a cancer prognosis. The second step is using the selected genes to build a cancer prognosis signature model. The third step is validating the cancer prognosis signature with independent data sets. The fourth step is confirming the clinical relevance of the cancer prognosis signature. According to another aspect, the cancer prognosis signature may be optimized by adding two additional steps between steps three and four, wherein the first additional step comprises optimizing the signature around thirty or fewer Yin genes and Yang genes that produce the highest performance in all data sets. The second additional step comprises comparing the optimized signature with previously reported lung cancer signatures.

The YMR signatures disclosed herein, and the methods disclosed herein for deriving the YMR signatures, contrast with all previous signature models that are based on significant data training, and provide a more precise insight into the biology of cancer development. The YMR signatures may be used for the development of qRT-PCR kits for detection of Yin gene expression and Yang gene expression in tissue samples and/or cell samples collected from a patient. The calculated YMR risk score can assist clinical therapy decisions at different disease stages. The YMR approach may also have potential in drug development, by modulating the expression of selected Yin genes and Yang genes, or by altering the expression of other target genes, so that a lower YMR can be achieved. In addition to lung cancer, the YMR approach to biomarker discovery can also be used for preparation of prognoses for other types of cancers and/or diseases.

DESCRIPTION OF THE DRAWINGS

The present disclosure will be described in conjunction with reference to the following drawings in which:

Fig. 1 is a schematic flowchart illustrating exemplary steps taken to identify and validate the methods for producing the YMR signature model disclosed herein;

Fig. 2 is a schematic flowchart illustrating use of the YMR model disclosed herein for predicting a lung cancer patient's survival or recurrence-free time period;

Fig. 3 is a schematic flowchart illustrating a process for optimization of a Yin and Yang gene list for use in the YMR model disclosed herein;

Fig. 4 is a schematic illustration of 2-D Euclidean clustering analysis for identification of candidate Yin genes, showing a complete linkage setting for both the genes (12,625 genes on HG-U95av2) and the 100 samples of the Bhattacharjee data set (2001, Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98(24):13790-13795). The region was selected where the genes were upregulated in normal samples but downregulated in almost all different types of lung cancers. The region where genes were downregulated in only one or a few cancer types was not selected;

Fig. 5 is a schematic illustration of an expanded section from the 2-D Euclidean clustering analysis shown in Fig. 4;

Fig. 6 is a schematic illustration of 2-D Euclidean clustering analysis for identification of candidate Yang genes, showing a complete linkage setting for both the genes (12,625 genes on HG-U95av2) and the 100 samples of the Bhattacharjee data set. The region was selected where the genes were upregulated in normal samples but downregulated in almost all different types of lung cancers. The region where genes were downregulated in only one or a few cancer types was not selected;

Fig. 7 is a schematic illustration of an expanded section from the 2-D Euclidean clustering analysis shown in Fig. 4;

Fig. 8 shows a combination of the cluster analyses from Figs. 1 and 3 for identification and selection of Yin genes and Yang genes of particular relevance in lung cancer. The probe sets are shown in the vertical dendrogram and DNA sample data are shown in the horizontal dendrogram. The expression indices of all 12,625 probe sets of the 100 sample data sets were summarized with the RMA algorithm and then further normalized by itemwise Z-normalization. 74 up-regulated genes (bottom half rows) and 108 (top half rows) down-regulated genes in cancer tissues were selected from the 2D clustering regions. The preselected 74 and 108 probe sets were displayed by re-clustering;

Fig. 9 is a 3-D depiction of an IPA analysis of Yin gene probe sets. The Molecular Mechanisms of Cancer canonical pathway is highlighted by green lines;

Fig. 10 is a schematic illustration of the selection of Yin genes (bottom portion) and Yang genes (top portion) using functional analysis. The genes highlighted by the same color are in the same interaction network;

Fig. 11 is a 3-D depiction of an IPA analysis of Yang gene probe sets. The RAR Activation pathway and the Hepatic Stellate Cell Activation pathway are highlighted by green lines;

Fig. 12 is a chart showing boxplots of the distributions of the mean ratios (YMR) of the expressions of Yin genes (Y) relative to the Yang genes (y) in normal lung samples and lung cancer samples. The YMR were derived from microarray gene expression data sets described in Table S7 (six normal lung data sets and seven different lung cancer type data sets);

Figs. 13A-13D are charts showing validation of YMR in four data sets using Kaplan-Meier estimates of the survivor function: Fig. 13A is a chart showing a recurrence-free time function curve (low risk n=60; high risk n=65) of the adenocarcinoma patients from Bhattacharjee et al. (2001, Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98(24):13790-13795); Fig. 13B is a chart showing an overall survival time function curve of the adenocarcinoma patients (low risk n=27; high risk n=31) from Bild et al. (2006, Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439(19):353-357); Fig. 13C is a chart showing probability of patient survival prognoses when identified as "low risk" (n=248) or "high risk" (n=194) from the DCC project (data was downloaded from the NCI caArray database, https://array.nci.nih.gov/caarray/project/details.action?project.id=182); Fig. 13D is a chart showing RNA-seq samples from "low risk" (n=121) and "high risk" (n=137) from data downloaded from the TCGA Data Portal (http://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp). In all four charts, low YMR scores (in green) correspond to the highest predicted survival probability, while high YMR scores (in red) correspond to the greatest predicted risk;

Figs. 14A-14D are charts showing random group gene expression ratios for 500 groups of 31 genes and 500 groups of 32 genes randomly picked from the 12,625 genes among 125 adenocarcinomas of the Bhattacharjee data set: Fig. 14A shows a histogram of 500 p-values of random group ratios as a continuous variable, Fig. 14B shows a histogram of 500 hazard ratios (HR) of random group ratios as a continuous variable, Fig. 14C shows a histogram of 500 p-values of random group ratios as a dichotomous (ratio >2.0) variable, and Fig. 14D shows a histogram of 500 hazard ratios (HR) of random group ratios as a dichotomous (ratio >2.0) variable. The stratification of these 500 random ratios (>2.0) was tested by the Cox proportional hazard ratio model;

Figs. 15A-15B are charts showing the effects of dropping Yin genes on continuous and dichotomous YMR and gYMR signatures developed from 442 samples from the DCC data set. In Fig. 15A, "orig" is the original set of 31 Yin genes, followed by the results of dropping one gene at a time, dropping two genes ("24_10", i.e. HIST1H4J, 214463_x; CDC25A, 204696_s), and dropping three genes (24-10-7, i.e. HIST1H4J; CDC25A; and IGFBP5, 203425_s). These three genes were chosen because they showed the best performance in gYMR after they were dropped. Fig. 15B shows the effects on HR using the same genes as in Fig. 15A;

Figs. 16A-16B are charts showing the effects of dropping Yin genes on continuous and dichotomous YMR and gYMR signatures developed from 442 samples from the DCC data set. In Fig. 16A, "orig" is the original set of 31 Yin genes, followed by the results of dropping one gene at a time, dropping two genes ("24_10", i.e. HIST1H4J, 214463_x; CDC25A, 204696_s), and dropping three genes (24-10-7, i.e. HIST1H4J; CDC25A; and IGFBP5, 203425_s). These three genes were chosen because they showed the best performance in gYMR after they were dropped. Fig. 16B shows the effects on HR using the same genes as in Fig. 16A;

Figs. 17A-17B are charts showing the effects on HR of dropping Yang genes on continuous and dichotomous YMR and gYMR signatures developed from 442 samples from the DCC data set. Fig. 17A shows the effects on YMR, wherein "orig" is the original set of 31 Yin genes, followed by the results of dropping one gene at a time, dropping two genes ("24_10", i.e. HIST1H4J, 214463_x; CDC25A, 204696_s), and dropping three genes (24-10-7, i.e. HIST1H4J; CDC25A; and IGFBP5, 203425_s). Fig. 17B shows the effects on gYMR using the same genes as in Fig. 17A;

Figs. 18A-18D are charts showing comparisons of YMR to the 15-gene signature, wherein Fig. 18A shows the 15-gene signature (Zhu et al, 2010, Prognostic and Predictive Gene Signature for Adjuvant Chemotherapy in Resected Non-Small-Cell Lung cancer. J. Clin. Oncol. 28(29):4417-4424) for the DCC sample data (low=231; high=211), Fig. 18B shows the 15-gene signature for the Bild data (low=35; high=23), Fig. 18C shows YMR for the same DCC sample data (low=248; high=195), and Fig. 18D shows YMR for the same Bild data (low=27; high=31);

Figs. 19A-19F are charts showing Kaplan-Meier estimates of the survivor function of the gYMR signature in different groups of patients from the DCC data set: Fig. 19A is a chart showing YMR signatures for "low-risk" patients (n=122) and "high risk" patients (n=177) at Stage I only; Fig. 19B is a chart showing YMR signatures for "low-risk" patients (n=13) and "high risk" patients (n=28) after receiving chemotherapy at Stage I; Fig. 19C is a chart showing YMR signatures for "low-risk" patients (n=79) and "high risk" patients (n=95) who did not receive chemotherapy during Stage I; Fig. 19D is a chart showing YMR signatures for "low-risk" patients (n=63) and "high risk" patients (n=78) assessed during Stage II and Stage III; Fig. 19E is a chart showing YMR signatures for "low-risk" patients (n=24) and "high risk" patients (n=23) who received chemotherapy during Stage II and Stage III; and Fig. 19F is a chart showing YMR signatures for "low-risk" patients (n=27) and "high risk" patients (n=31) who did not receive chemotherapy during Stage I or Stage III. Low gYMR scores (in green) correspond to the highest predicted survival probability and high gYMR scores (in red) correspond to the greatest predicted risk;

Figs. 20A-20C are charts showing Kaplan-Meier estimates of the survivor function of patients with or without chemotherapy after diagnosis, wherein Fig. 20A shows all Stage I patient samples from the DCC project, Fig. 20B shows low YMR Stage II and Stage III patients, and Fig. 20C shows high YMR Stage II and Stage III patients;

Figs. 21A-21C show the optimization of YMR signature sizes using a multiple permutation process (MPP), wherein Fig. 21A shows the occurrences of signature tests that have p-values < 0.05, Fig. 21B shows the YMR signatures' 75th percentile p-value of 1000 random data sets that are less than 0.5, and Fig. 21C shows the YMR signatures' 85th percentile p-value of 1000 random data sets that are less than 0.5. The Y-axis represents the proportion of p-values less than 0.05, while the X-axis represents signature sizes ranging from the smallest (2-2) to the largest (231-32). The maximum difference between the number of Yin genes and the number of Yang genes in each signature is 2;

Fig. 22 shows the optimization of YMR signature genes using MPP. Bars denoted by a "*" underneath are Yang genes while the remaining bars are the Yin genes;

Figs. 23(A), 23(B) show the Kaplan-Meier survival curves for the Bhattacharjee data (23(A)) and the DCC data (23(B));

Figs. 24(A), 24(B) show the Kaplan-Meier survival curves for the TCGA RNAseq data (24(A)) and the MTAB_923 data (24(B));

Figs. 25(A), 25(B) show the Kaplan-Meier survival curves for the GSE42127 data (25(A)) and the GSE41271 data (25(B));

Figs. 26(A), 26(B) show the Kaplan-Meier survival curves for the GSE31210 data (26(A)) and the GSE14814 data (26(B));

Figs. 27(A), 27(B) show the Kaplan-Meier survival curves for the GSE13213 data (27(A)) and the GSE11969 data (27(B));

Figs. 28(A), 28(B) show the Kaplan-Meier survival curves for the total combined data of 1,664 patient samples (28(A)) and data for all Stage I patients (28(B));

Figs. 29(A)-29(D) show the Kaplan-Meier survival curves for different groups of patients from the 1,664 patient samples, wherein Fig. 29(A) shows the patient cohorts who underwent treatment after diagnosis, Fig. 29(B) shows the patient cohorts who did not undergo treatment after diagnosis, Fig. 29(C) shows all Stage II patients, and Fig. 29(D) shows all Stage III/IV patients; and

Fig. 30 is a schematic illustration of the network connection of ten selected Yin genes and Yang genes. The Yin genes are colored red and the Yang genes are colored green. The Yin and Yang genes are connected by two canonical pathway components that are colored yellow.

DETAILED DESCRIPTION

The exemplary embodiments of the present disclosure pertain to systems, methods and tools for identifying suitable sets of genes that express in a "Yin" and "Yang" manner in lung cancer cells. Specifically, the expression values of selected Yin genes and Yang genes are extracted from a patient's microarray expression data and then calculated as a mean ratio of the Yin (Y) gene expression and the Yang (y) gene expression. This mean ratio is referred to herein as the "YMR signature". It is well known that variations in gene expression determine phenotype changes and that these expression variations can be caused by factors such as DNA mutations or epigenetics. Gene expression can be used as a surrogate measurement of cancer disease phenotype. Expression variation can be correlated to disease aggressiveness, which can be used to determine patient prognosis. Gene expression-based methods have already been used to guide cancer therapy for breast cancer, where the ONCOTYPE DX® test (http://www.oncotypedx.com/; ONCOTYPE DX is a registered trademark of Genomic Health Inc., Redwood City, CA, USA) helps define patients most likely to benefit from adjuvant chemotherapy. Many studies have used gene expression signatures for lung cancer prognosis prediction. Generally, there are four steps in the development of a prognostic signature:

(i) find the genes that can be used for the signature development; (ii) use the genes to build a signature model; (iii) validate the signature with independent data sets; (iv) confirm the clinical relevance of the signature.

Most previous work on development of lung cancer prognoses used Cox regression analysis to select those genes whose expressions correlate to survival time. Examples include: (i) Lu et al, 2006, A gene expression signature predicts survival of patients with stage I non-small cell lung cancer. PLoS Med. 3(12):e467; (ii) Lu et al, 2012, Gene-Expression Signature Predicts Postoperative Recurrence in Stage I Non-Small Cell Lung Cancer Patients. PLoS ONE 7(1):e30880; (iii) Shedden et al, 2008, Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nature Medicine 14(8):822-827; and (iv) Chen et al, 2007, A five-gene signature and clinical outcome in non-small-cell lung cancer. N. Engl. J. Med. 356(1):11-20.

This feature selection approach ranks the performance of individual genes and selects the top-ranked features. It considers neither the relationships between genes nor the redundancy of information. Other studies preclustered or grouped genes and then selected the clusters or "metagenes" by Cox regression. Examples include: (i) Raponi et al, 2006, Gene expression signatures for predicting prognosis of squamous cell and adenocarcinomas of the lung. Cancer Res. 66(15):7466-7472; (ii) Shedden et al, 2008, Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nature Medicine 14(8):822-827; and (iii) Zhu et al, 2010, Prognostic and Predictive Gene Signature for Adjuvant Chemotherapy in Resected Non-Small-Cell Lung cancer. J. Clin. Oncol. 28(29):4417-4424.

Two-group supervised statistical analysis has also been used to select genes that were differentially expressed between high-risk and low-risk patients. Examples include: (i) Gordon et al, 2002, Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res. 62:4963-7; and (ii) Wan et al., 2010, Hybrid models identified a 12-gene signature for lung cancer prognosis and chemoresponse prediction. PLoS One 5(8):e12222. A weighted gene co-expression network analysis (WGCNA) approach has been used to identify survival-related expression modules and gene signatures in lung adenocarcinoma. The genes enriched in the biological process of the cell cycle were selected (Li et al, 2013, Network-based approach identified cell cycle genes as predictor of overall survival in lung adenocarcinoma patients. Lung Cancer, doi: 10.1016/j).

If the same idea (gene association with survival time) were used for gene selection, the selected signature gene lists would be expected to be similar across different studies. However, published signature genes showed little overlap. For example, among 327 genes reported from 8 studies, only 5 genes overlapped in two of five data sets (Fig. 1; Roepman et al., 2009, An immune response enriched 72-gene prognostic profile for early-stage non-small-cell lung cancer. Clin. Cancer Res. 15(1):284-290).

One reason is biological intratumor heterogeneity, likely caused by the diversity of tumor microenvironments and cell populations. Ein-Dor et al. (2006, Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci. USA 103:5923-5928) demonstrated that, because of biological heterogeneity, thousands of samples are required to identify robust and reproducible gene subsets for most tumor types. The other reason is that these gene selections were influenced by variations in sample collection, sample size, data processing, and microarray platform. Kratz et al (2012, A practical molecular assay to predict survival in resected non-squamous, non-small-cell lung cancer: development and international validation studies. Lancet 379(9818):823-32) pooled 217 genes identified by 6 previously published microarray and PCR-based studies of prognosis in early stage lung cancer, even though the genes did not all overlap among these studies, and selected 11 genes by a feature selection approach. Although most studies focus on the prediction performance of molecular signatures, the biological relevance of the genes in a signature is also important. For example, the nuclear receptors can define prognosis markers for lung cancer that are probably of functional relevance and might be potential therapeutic targets. All 11 genes selected by Kratz et al (2012) are intricately related to molecular lung cancer pathways. The signature genes selected by Lu et al (2012) are related to cell adhesion, apoptosis and cell proliferation. Some previous signatures implicated immune-response genes. Evidence suggests that a subpopulation of cells within the tumor (so-called cancer stem cells) plays an important part in metastasis and treatment resistance, and that lung cancers with increased numbers of such cells have a worse prognosis.

It should be understood that a predictive signature model is not just a gene list. It is a mathematical formula with coefficients obtained from training data. This model produces a risk score for each patient, and the patients are then stratified into high-risk or low-risk groups. Most models calculate the patient risk score by multiplying the Cox proportional hazards coefficient of each signature gene by that gene's expression value and combining the products. Some models use, as the patient risk score, the computed probability of a patient falling into the low-risk or high-risk class. In building the signature model, the gene expression values and the patients' survival times in the training data set are used to calculate the coefficients, and the same coefficients are supposed to be used to calculate risk scores from new patients' data. However, substantial gene expression variations exist within individual subjects. Some genes associated with other aggressive diseases may be present in a subject's tumor. Similarly, a subject might die from a secondary clinical condition. In these instances, a correlation between gene expression in cancer and subject survival does not exist, and such data should not be included in the training set.
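For illustration only, a conventional coefficient-based model of the kind described above may be sketched in Python as follows. The sketch assumes that the risk score is the sum, over the signature genes, of each coefficient multiplied by the corresponding expression value, and that patients are stratified at the median training score. The coefficients, gene labels, expression values, and the median cutoff are hypothetical assumptions and do not reproduce any published signature.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Sum of coefficient * expression over the signature genes."""
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


def stratify(scores, cutoff):
    """Label each score 'high' if it exceeds the cutoff, otherwise 'low'."""
    return ["high" if s > cutoff else "low" for s in scores]


# Hypothetical coefficients, as if learned from a training data set.
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25}

# Hypothetical training samples (gene label -> expression value).
training = [
    {"GENE_A": 2.0, "GENE_B": 4.0},
    {"GENE_A": 6.0, "GENE_B": 4.0},
    {"GENE_A": 4.0, "GENE_B": 8.0},
]

scores = [risk_score(sample, coefficients) for sample in training]
cutoff = median(scores)  # assumed cutoff choice, for illustration
groups = stratify(scores, cutoff)
print(scores, cutoff, groups)
```

In a sketch of this kind, both the coefficients and the cutoff are tied to the training data, so expression values from a new sample must be on a comparable scale before the score is meaningful.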

Sometimes a simpler method is superior to a complex model. For example, the ratio of two-gene expression within the same individual patients has been reported for biomarker signature development in lung cancer diagnosis and prognosis as well as for breast cancer prognosis. The single two-gene ratio or geometric mean of several two-gene ratios was selected between the treatment failures and the treatment responders from the training data samples.
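As a hedged illustration of the two-gene ratio approach mentioned above, the geometric mean of several within-sample two-gene ratios may be sketched in Python as follows. The gene labels, the pairing of genes, and the expression values are hypothetical, and the sketch does not reproduce the gene pairs of any cited study.

```python
import math


def geometric_mean_ratio(expression, gene_pairs):
    """Geometric mean of expression[numerator] / expression[denominator].

    `gene_pairs` is a list of (numerator_gene, denominator_gene) labels;
    all expression values are assumed to be positive and on a linear scale.
    """
    log_ratios = [
        math.log(expression[num] / expression[den]) for num, den in gene_pairs
    ]
    return math.exp(sum(log_ratios) / len(log_ratios))


# Hypothetical expression values for one patient sample.
sample = {"G1": 8.0, "G2": 2.0, "G3": 5.0, "G4": 5.0}
pairs = [("G1", "G2"), ("G3", "G4")]  # ratios 4.0 and 1.0
value = geometric_mean_ratio(sample, pairs)
print(round(value, 6))
```

Because every ratio is formed within a single sample, a scaling factor that affects all genes in that sample equally cancels out of the result.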

The first clinical study of microarray data as a predictor of benefit from chemotherapy in NSCLC was conducted in the JBR.10 study (Winton et al, 2005, Vinorelbine plus Cisplatin vs. Observation in resected nonsmall-cell lung cancer. N. Engl. J. Med. 352:2589- 97). 482 patients with completely resected stages IB and II NSCLC were randomly assigned to receive four cycles of adjuvant cisplatin plus vinorelbine or observation alone. A 15-gene signature stratified high-risk group, treatment with vinorelbine plus cisplatin conferred significant survival benefit compared with observation alone (p=0.0005), whereas in the low- risk group, patients who received the same treatment had shorter survival compared with observation alone (p=0.0133). Staunton et al. (2001, Chemosensitivity prediction by transcriptional profiling. Proc. Natl. Acad. Sci. USA 98: 10787-92) used DNA microarrays to measure gene expression in the NCI-60 panel and compared untreated gene expression profile of each cell line. They were able to predict drug sensitivity in an independent test set of cell lines. Potti et al. (2006, Genomic signatures to guide the use of chemotherapeutics . Nat Med.12: 1294-300) used molecular profiles from cell lines to establish sensitivity to chemotherapy. The usefulness of this approach is that one tumor sample can be interrogated for response to many agents on the basis of cell-line derived signatures. Using a panel of 17 NSCLC cell lines, Beane et al. found a significant association between docetaxel resistance and PI3-kinase inhibitor (LY-294002) sensitivity, suggesting its use as a second-line therapy (2009, Clinical impact of highthroughput gene expression studies in lung cancer. J. Thorac. Oncol. 4: 109-18). It is well known that there are significant problems with the validation and reproducibility of the above-noted proposed methods and tools for developing prognoses for lung cancer patients. 
Many studies tested and validated prognostic signatures using small numbers of cohorts with methods that are not applicable to clinical situations. For example, Shedden et al. (2008) were unsuccessful in their attempts to validate the signatures reported by Chen et al. (2007). Subramanian et al. (2010, Gene Expression-Based Prognostic Signatures in Lung Cancer: Ready for Clinical Use? J. Natl. Cancer Inst. 102(7): 464-474) assessed the prognostic signatures reported by Chen et al. (2007) and Lau et al. (2007, Three-gene prognostic classifier for early-stage non-small-cell lung cancer. J. Clin. Oncol. 25(35): 5562-5569) for Stage IA and Stage IB samples using the data of Shedden et al. (2008). In each case, the signatures did not demonstrate statistically significant differences in outcome among the predicted risk groups. Hou et al. (2010, Gene Expression-Based Classification of Non-Small Cell Lung Carcinomas and Survival Prediction. PLoS ONE 5(4): e10312) validated 14 previously reported signatures using two independent data sets; none of the signatures exhibited significant risk group stratification for either data set. Recently, Lu et al. (2012) built a 51-gene signature using the coefficients of a training data set. When this signature was validated in 4 other independent data sets, the gene coefficients were retrained for each data set, and therefore these were not actually independent data validations.

In a clinical setting, the ideal prediction model should be applicable to any single patient by providing an informative risk score for that patient. A limitation of all previous prediction models is that the signature gene expression values of new samples have to be comparable to those of the training sample data in terms of data preprocessing, analysis platform, and data normalization. For example, Shedden et al. (2008) normalized the entire training and testing data sets together. This is not practical for clinical use. Although statistical methods can deal with signatures comprising a large number of genes, the number of genes determines the feasibility and cost of assay development and of use in clinical practice. Reducing the number of genes in a signature while achieving similar prediction performance is crucial in the development of a practical assay. More practical would be the use of a small number of genes measured by qRT-PCR, even though these qRT-PCR data also need to be normalized before the same models can be applied.

The exemplary embodiments of the present disclosure pertain to use of the Yin Yang paradigm for development of cancer biomarker signatures. The Yin Yang imbalance indicates illness and the relationship between Yin and Yang forms the general basis for all diagnoses and treatment protocols in Chinese medicine. The core of Yin and Yang theory is the "global" effects of the perturbation. An important aspect of the present disclosure is the realization that one group of genes functions predominantly in normal lung cells, these genes being termed herein the "Yang genes", whereas another group of genes functions predominantly in lung cancer cells, these genes being termed herein the "Yin genes". The ratio of expression levels of these two groups of genes would indicate the status of the cells as functioning normally or, alternatively, functioning in a cancer-mode modality. Lung cancer prognosis is a multi-dimensional complex process. The essence of the Yin Yang theory is that it simplifies the multi-dimensional complications into two simple opposing dimensions. The extent of the two-dimensional (Yin and Yang) imbalance would indicate the severity or progression of the cancer and thus predict the patient's survival time.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. In order that the invention herein described may be fully understood, the following terms and definitions are provided herein.

The word "comprise" or variations such as "comprises" or "comprising" will be understood to imply the inclusion of a stated integer or groups of integers but not the exclusion of any other integer or group of integers.

The term "Yin gene" as used herein refers to a group of genes whose expressions and functions dominate in cancerous lung cells.

The term "Yang gene" as used herein refers to a group of genes whose expressions and functions dominate in normal lung cells.

The term "YMR" as used herein refers to a mean ratio of the concurrent expression of Yin genes and Yang genes in a patient's lung cells.

The term "signature" as used herein means a group of genes whose expression status distinguishes high risk from low risk patients.

The term "abrogate" as used herein means to suppress and/or interfere with and/or prevent and/or eliminate.

The term "effective amount" as used herein means an amount effective, at dosages and for periods of time necessary to achieve the desired results (e.g. the modulation of collagen synthesis). Effective amounts of a molecule may vary according to factors such as the disease state, age, sex, weight of the animal. Dosage regimes may be adjusted to provide the optimum therapeutic response. For example, several divided doses may be administered daily or the dose may be proportionally reduced as indicated by the exigencies of the therapeutic situation. The term "subject" as used herein includes all members of the animal kingdom, and specifically includes humans.

The term "a cell" includes a single cell as well as a plurality or population of cells. Administering an agent to a cell includes both in vitro and in vivo administrations.

The term "about" or "approximately" means within 20%, preferably within 10%, and more preferably within 5% of a given value or range.

The term "nucleic acid" refers to a polymeric compound comprised of covalently linked subunits called nucleotides. Nucleic acid includes polyribonucleic acid (RNA) and poly deoxyribonucleic acid (DNA), both of which may be single-stranded or double-stranded. DNA includes cDNA, genomic DNA, synthetic DNA, and semisynthetic DNA. The term "gene" refers to an assembly of nucleotides that encode a polypeptide, and includes cDNA and genomic DNA nucleic acids.

The term "recombinant DNA molecule" refers to a DNA molecule that has undergone a molecular biological manipulation. The term "vector" refers to any means for the transfer of a nucleic acid into a host cell. A vector may be a replicon to which another DNA segment may be attached so as to bring about the replication of the attached segment. A "replicon" is any genetic element (e.g., plasmid, phage, cosmid, chromosome, virus) that functions as an autonomous unit of DNA replication in vivo, i.e., capable of replication under its own control. The term "vector" includes plasmids, liposomes, electrically charged lipids (cytofectins), DNA-protein complexes, and biopolymers. In addition to a nucleic acid, a vector may also contain one or more regulatory regions, and/or selectable markers useful in selecting, measuring, and monitoring nucleic acid transfer results (transfer to which tissues, duration of expression, etc.).

The term "cloning vector" refers to a replicon, such as plasmid, phage or cosmid, to which another DNA segment may be attached so as to bring about the replication of the attached segment. Cloning vectors may be capable of replication in one cell type, and expression in another ("shuttle vector"). A cell has been "transfected" by exogenous or heterologous DNA when such DNA has been introduced inside the cell. A cell has been "transformed" by exogenous or heterologous DNA when the transfected DNA effects a phenotypic change. The transforming DNA can be integrated (covalently linked) into chromosomal DNA making up the genome of the cell. The term "nucleic acid molecule" refers to the phosphate ester polymeric form of ribonucleosides (adenosine, guanosine, uridine or cytidine; "RNA molecules") or deoxyribonucleosides (deoxyadenosine, deoxyguanosine, deoxythymidine, or deoxycytidine; "DNA molecules"), or any phosphoester analogs thereof, such as phosphorothioates and thioesters, in either single-stranded form, or a double-stranded helix. Double-stranded DNA-DNA, DNA-RNA and RNA-RNA helices are possible. The term nucleic acid molecule, and in particular DNA or RNA molecule, refers only to the primary and secondary structure of the molecule, and does not limit it to any particular tertiary forms.

Modification of a genetic and/or chemical nature is understood to mean any mutation, substitution, deletion, addition and/or modification of one or more residues. Such derivatives may be generated for various purposes, such as in particular that of enhancing its production levels, that of increasing and/or modifying its activity, or that of conferring new pharmacokinetic and/or biological properties on it. Among the derivatives resulting from an addition, there may be mentioned, for example, the chimeric nucleic acid sequences comprising an additional heterologous part linked to one end, for example of the hybrid construct type consisting of a cDNA with which one or more introns would be associated. Likewise, for the purposes of the invention, the claimed nucleic acids may comprise promoter, activating or regulatory sequences, and the like.

The term "promoter sequence" refers to a DNA regulatory region capable of binding RNA polymerase in a cell and initiating transcription of a downstream (3' direction) coding sequence. For purposes of defining the present invention, the promoter sequence is bounded at its 3' terminus by the transcription initiation site and extends upstream (5' direction) to include the minimum number of bases or elements necessary to initiate transcription at levels detectable above background.

A coding sequence is "under the control" of transcriptional and translational control sequences in a cell when RNA polymerase transcribes the coding sequence into mRNA, which is then trans-RNA spliced (if the coding sequence contains introns) and translated into the protein encoded by the coding sequence.

The term "homologous" in all its grammatical forms and spelling variations refers to the relationship between proteins that possess a "common evolutionary origin," including homologous proteins from different species. Such proteins (and their encoding genes) have sequence homology, as reflected by their high degree of sequence similarity. This homology is greater than about 75%, greater than about 80%, greater than about 85%. In some cases the homology will be greater than about 90% to 95% or 98%.

"Amino acid sequence homology" is understood to include both amino acid sequence identity and similarity. Homologous sequences share identical and/or similar amino acid residues, where similar residues are conservative substitutions for, or "allowed point mutations" of, corresponding amino acid residues in an aligned reference sequence. Thus, a candidate polypeptide sequence that shares 70% amino acid homology with a reference sequence is one in which any 70% of the aligned residues are either identical to, or are conservative substitutions of, the corresponding residues in a reference sequence. The term "polypeptide" refers to a polymeric compound comprised of covalently linked amino acid residues. Amino acids are classified into seven groups on the basis of the side chain R: (1) aliphatic side chains, (2) side chains containing a hydroxylic (OH) group, (3) side chains containing sulfur atoms, (4) side chains containing an acidic or amide group, (5) side chains containing a basic group, (6) side chains containing an aromatic ring, and (7) proline, an imino acid in which the side chain is fused to the amino group. A polypeptide of the invention preferably comprises at least about 14 amino acids.

The term "protein" refers to a polypeptide which plays a structural or functional role in a living cell.

The term "corresponding to" is used herein to refer to similar or homologous sequences, whether the exact position is identical or different from the molecule to which the similarity or homology is measured. A nucleic acid or amino acid sequence alignment may include spaces. Thus, the term "corresponding to" refers to the sequence similarity, and not the numbering of the amino acid residues or nucleotide bases.

The term "derivative" refers to a product comprising, for example, modifications at the level of the primary structure, such as deletions of one or more residues, substitutions of one or more residues, and/or modifications at the level of one or more residues. The number of residues affected by the modifications may be, for example, from 1, 2 or 3 to 10, 20, or 30 residues. The term derivative also comprises the molecules comprising additional internal or terminal parts, of a peptide nature or otherwise. They may be in particular active parts, markers, amino acids, such as methionine at position -1. The term derivative also comprises the molecules comprising modifications at the level of the tertiary structure (N-terminal end, and the like). The term derivative also comprises sequences homologous to the sequence considered, derived from other cellular sources, and in particular from cells of human origin, or from other organisms, and possessing activity of the same type or of substantially similar type. Such homologous sequences may be obtained by hybridization experiments. The hybridizations may be performed based on nucleic acid libraries, using, as probe, the native sequence or a fragment thereof, under conventional stringency conditions or preferably under high stringency conditions.

The steps that may be used to develop a YMR signature model for developing a prognosis for early-stage lung cancer patients are illustrated in Fig. 1. The first step is to identify a set of suitable Yin genes and Yang genes isolated from lung cancer tissue samples by comparing the total gene complements extracted from lung cancer patients' tissue samples with a database comprising gene sets isolated from normal lung samples from healthy subjects, and gene sets isolated from lung cancer samples collected from patients of mixed tumour stages with different survival times. A suitable exemplary data set is shown in Table 1. The second step comprises identification of the Yin gene candidates and Yang gene candidates, for example up to about 30 or 40 or 50 each of candidate Yin genes and candidate Yang genes, and then calculating the mean ratio of the concurrent expression of Yin genes and Yang genes in cancerous lung tissue samples, i.e., the "YMR signature". The third step comprises validation of the YMR signature by comparisons to YMR signatures calculated for other lung cancer patients using other reported models. The last step is translation of the YMR model and signature into clinical assays.

Fig. 2 sets out in more detail the methods used for the first three steps for developing the YMR signature model, wherein (i) the expression values of selected Yin genes and Yang genes for each patient sample are extracted from the expression data sets and then correlated with clinical overall survival (OS) or relapse-free (RF) information for each sample, (ii) the geometric mean of all Yin gene expression and the geometric mean of all Yang gene expression are computed, and then the ratio of these geometric means (YMR) is calculated for each patient sample, (iii) the Cox regression of the continuous YMR values is then evaluated, and (iv) the YMR signature value is used as a risk score to stratify patients into low-risk and high-risk groups. Algorithms may be used to model the balance of Yin genes and Yang genes for prediction of a patient's survival or recurrence-free time. The expression values of the selected Yin genes and Yang genes may be extracted from microarray expression data or RNA-Seq data, after which the geometric means of Yin gene and Yang gene expression, and their ratio (YMR), may be computed for each patient. The Cox regression of the continuous covariate YMR is preferably evaluated by univariate analysis. The YMR score as a continuous covariate tests whether each unit increase in YMR results in proportional scaling of the hazard rate. This YMR signature model does not integrate any coefficients between gene expression and patients' survival time. A median YMR cutoff score or, alternatively, graphical diagnostic plots may be used to find optimal YMR signature cutoff scores for stratifying patients into low-risk or high-risk groups.
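The YMR calculation and median-cutoff stratification described above can be sketched as follows. This is a minimal illustration only, assuming strictly positive expression values; the function names, the patient identifiers, and the expression values are hypothetical and are not taken from the disclosure.

```python
import math
from statistics import median


def geometric_mean(values):
    """Geometric mean of strictly positive expression values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("expression values must be non-empty and positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def ymr(yin_expr, yang_expr):
    """Ratio of the Yin geometric mean to the Yang geometric mean."""
    return geometric_mean(yin_expr) / geometric_mean(yang_expr)


def stratify_by_median(scores):
    """Label each patient high-risk if the YMR exceeds the median YMR."""
    cutoff = median(scores.values())
    return {
        pid: ("high-risk" if score > cutoff else "low-risk")
        for pid, score in scores.items()
    }


# Hypothetical expression values for three patients (illustration only).
samples = {
    "P1": {"yin": [8.0, 2.0], "yang": [1.0, 4.0]},
    "P2": {"yin": [1.0, 4.0], "yang": [4.0, 4.0]},
    "P3": {"yin": [3.0, 3.0], "yang": [3.0, 3.0]},
}
scores = {pid: ymr(s["yin"], s["yang"]) for pid, s in samples.items()}
groups = stratify_by_median(scores)
```

Because the score is a ratio computed within a single sample, no coefficient fitted on a training cohort enters the calculation, which is the property the paragraph above emphasizes.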

An additional optional fourth step may be incorporated in an exemplary method used to develop a YMR signature model as illustrated in Fig. 1, wherein the YMR signature is optimized around an aggregate of about 30 or fewer Yin genes and Yang genes that produce the highest performance in the data sets assessed in steps 1 and 2, generally following the flowchart outlined in Fig. 3. The first step is to extract the previously defined Yin genes and Yang genes from expression data sets. The second step is to pick a set of 1 million combinations of a random number of Yin genes and a random number of Yang genes, and then to test each combination of Yin genes and Yang genes against the set of 1 million randomly permuted expression data sets with clinical information, wherein each data set comprises 100 to 200 samples. Only those gene lists that have a Cox regression rank p-value less than 0.05 are retained for further assessment. Then, a determination is made as to which Yin and Yang gene set sizes produce the highest number of p-values less than 0.05, after which a set of 1 million Yin gene and Yang gene lists of the defined sizes is permuted and each combination in this list is tested against the set of 1 million randomly permuted expression data sets with clinical information. The genes are selected based on their occurrences in these gene lists. The selected "best" genes are further assessed against the set of 1 million permuted data sets. Finally, the lowest p-value with a hazard ratio (HR) greater than 2 is determined and used to define the YMR signature. An optional fifth step (Fig. 1) comprises comparing the defined YMR signature with previously reported lung cancer signatures.
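The random-combination search and occurrence counting described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the disclosed implementation: the Cox regression test is replaced by a hypothetical stand-in function (`toy_p_value`), the gene labels are made up, and only 200 combinations are drawn instead of 1 million.

```python
import random
from collections import Counter


def sample_gene_lists(yin_genes, yang_genes, n_lists, rng):
    """Draw random-size subsets of the Yin and Yang candidate genes."""
    lists = []
    for _ in range(n_lists):
        yin = rng.sample(yin_genes, rng.randint(1, len(yin_genes)))
        yang = rng.sample(yang_genes, rng.randint(1, len(yang_genes)))
        lists.append((tuple(sorted(yin)), tuple(sorted(yang))))
    return lists


def count_gene_occurrences(gene_lists, p_value_fn, alpha=0.05):
    """Count how often each gene appears in lists whose p-value is below alpha."""
    counts = Counter()
    for yin, yang in gene_lists:
        if p_value_fn(yin, yang) < alpha:
            counts.update(yin)
            counts.update(yang)
    return counts


def toy_p_value(yin, yang):
    """Hypothetical stand-in for a Cox regression p-value (assumption)."""
    return 0.01 if "Y1" in yin else 0.5


rng = random.Random(0)
yin_pool = ["Y1", "Y2", "Y3", "Y4"]    # hypothetical Yin gene labels
yang_pool = ["y1", "y2", "y3", "y4"]   # hypothetical Yang gene labels
gene_lists = sample_gene_lists(yin_pool, yang_pool, 200, rng)
counts = count_gene_occurrences(gene_lists, toy_p_value)
```

In practice the stand-in would be replaced by a survival model fitted on each permuted data set; genes with the highest counts would then be carried forward as the "best" genes.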

The following examples are provided to enable a better understanding of the disclosure described herein.

EXAMPLES

Example 1:

1.1 Lung cancer patient sample data

This work was focused on adenocarcinoma as it is a more common lung cancer and the gene expression data with associated clinical information is more readily available. The sample data from Bhattacharjee et al. (2001, Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad Sci. USA 98(24): 13790-13795) has been described previously. It consists of 203 lung cancer patient samples including 139 adenocarcinomas, 20 pulmonary carcinoids, 21 squamous cell carcinomas, 6 small-cell lung cancers, and 17 normal lung tissue samples from adjacent sections. Among the 139 adenocarcinomas, 125 patient samples were associated with clinical follow up information of survival time and recurrence. The sample data from Bild et al. (2006, Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439(19): 353-357) contains 58 primary adenocarcinomas collected through the Duke Lung Cancer Prognostic Laboratory. These samples were collected during follow up with patients for durations of 1 to 6 years after their cancer diagnoses. The National Cancer Institute Director's Challenge Consortium (DCC) for the Molecular Classification of Lung Adenocarcinoma samples consists of 442 adenocarcinomas with patients' clinical information accessible from:

https://array.nci.nih.gov/caarray/project/details.action?project.id=182. These samples were collected and processed in 4 independent institutions: (i) Canada/Dana-Farber Cancer Institute (CAN/DF), (ii) University of Michigan Cancer Center (UM), (iii) H.L. Moffitt Cancer Center (HLM), and (iv) Memorial Sloan-Kettering Cancer Center (MSK). Stages I, II and III adenocarcinomas were collected, with approximately 60% of samples from Stage I tumors. None of the patients received preoperative chemotherapy or radiation and at least 2 years of follow-up information was available. 288 lung adenocarcinoma (LUAD) samples from The Cancer Genome Atlas (TCGA) Project have comprehensive clinical information accessible from: http://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp. Excluded from the data set were 20 patients whose survival times are not available, and 9 living patients with follow-up periods of less than 2 days.

1.2 Gene expression data

The gene expression of the Bhattacharjee samples was detected by AFFYMETRIX ® HU_U95Av2 GENECHIP ® (AFFYMETRIX and GENECHIP are registered trademarks of Affymetrix Inc., Santa Clara, CA, USA). The raw hybridization intensity data files (CEL) were downloaded from http://www.broadinstitute.org/mpr/lung/. The gene expression indexes were processed with the MAS5.0 algorithm by using the EXPRESSIONIST ® Refiner module (GeneData Inc, San Francisco, CA, USA; EXPRESSIONIST is a registered trademark of GeneData AG, Basel, Switzerland). No further normalization was done within each data set, in order to keep each individual sample independent in gene biomarker detection. The exception was the clustering analysis for identification of differentially expressed genes, for which Robust Multi-array Average (RMA) derived and normalized expression measurements were calculated from the raw CEL files. The gene expression of the Bild samples was detected by AFFYMETRIX ® HU_U133plus2 GENECHIP ® and the signal intensity was calculated by the MAS5.0 algorithm. The data set was downloaded from the NCBI GEO database (GSE3141). The DCC raw HG U133A CEL files were downloaded from the NCI caArray database (https://array.nci.nih.gov/caarray/project/details.action?project.id=182). The MAS5.0 algorithm was used for gene expression summarization. No normalization or prefiltering was applied to samples or genes. The RNA-seq data for 259 samples were downloaded from the TCGA Data Portal (http://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp). The gene expression RPKM (reads per kilobase per million mapped reads) values were extracted from the sample files.
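For reference, the RPKM unit mentioned above follows the standard definition: read count scaled by gene length in kilobases and by sequencing depth in millions of mapped reads. The sketch below illustrates that definition only; in this work the values were extracted from the sample files rather than recomputed, and the function name and numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2,000 bp long, 10 million mapped reads.
example = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
```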

1.3 Identification and selection of signature genes

The expression indexes were summarized by RMA ® algorithm (RMA is a registered trademark of the Risk Management Association, Philadelphia, PA, USA) and further normalized by itemwise Z-normalization using GENEDATA ® Analyst module (GeneData, Inc, San Francisco, CA, USA; GENEDATA is a registered trademark of GeneData AG, Basel, Switzerland). 2-D hierarchical Euclidean L2 distance clustering with complete linkage setting for both genes and samples was performed to explore the differentially expressed biomarker genes in lung tumors. Up-regulated and down-regulated genes in cancer tissues were selected from 2D clustering. Genes that were expressed higher in normal lung tissues than in lung cancer cells were called "Yang" gene candidates; conversely, genes expressed higher in lung cancer cells than in normal lung tissues were called "Yin" gene candidates. These two gene lists were inputted into IPA9.0 (Ingenuity Systems Inc., Redwood City, CA, USA) for interaction network and pathway analysis. The networks were built by direct interactions. The networks with significant scores were selected for further analysis.
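The Z-normalization step can be sketched as below. This is an illustration under an assumption: "itemwise" is taken here to mean that each gene (row) is centered and scaled across samples, and the sample standard deviation is used. The commercial module's exact behavior is not described in the text, and the gene labels and values are hypothetical.

```python
from statistics import mean, stdev


def z_normalize_rows(matrix):
    """Center and scale each gene's values across samples (assumed itemwise)."""
    normalized = {}
    for gene, values in matrix.items():
        mu = mean(values)
        sd = stdev(values)  # sample standard deviation (assumption)
        if sd == 0:
            # A gene with constant expression carries no contrast.
            normalized[gene] = [0.0] * len(values)
        else:
            normalized[gene] = [(v - mu) / sd for v in values]
    return normalized


# Hypothetical expression matrix: gene -> values across three samples.
z = z_normalize_rows({"G1": [1.0, 2.0, 3.0], "G2": [5.0, 5.0, 5.0]})
```

After this scaling, genes with very different absolute intensities become comparable in the Euclidean distance used for clustering.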

1.4 Gene signature classifier development

The expression values of the selected Yin genes and Yang genes were extracted from published microarray expression data. Initially, the Yin (Y) and Yang (y) expression mean ratios (YMR) were calculated as a signature classifier for each sample. In the HG-U95A platform of the Bhattacharjee data set, 31 and 32 probe sets were used to extract the Yin gene expression and Yang gene expression values, respectively. To extract Yin and Yang genes from different platforms, the best-matched probe sets were downloaded from Affymetrix (http://www.affymetrix.com) and gene symbols were used to match the gene IDs. For non-Affymetrix platforms, gene symbols were used for gene IDs. For multiple IDs within the same gene symbol, an average value was used. In the HG-133plus2 platform of the Bild data set, the expression values of 62 genes were computed as averages over multiple probe sets, since only one best-matched probe set was available (matching HG-U95A probe set 39651_at, RECQL4 gene). In the HG-133A platform of the DCC data set, the expression of 22 Yin genes was derived from 22 best-matched probe sets, 3 genes matched single probe sets, and the expression of 6 genes was the average expression of multiple probe sets; the expression of 29 Yang genes was derived from best-matched probe sets, and that of 2 genes from multiple probe sets. The patient risk scores were derived from the YMR values. Patients were divided into high-risk and low-risk prognostic groups using YMR cutoff values. Since a 2-fold difference is often chosen as an arbitrary value in a two-group comparison, a 2-fold difference in the Yin value over the Yang value was defined as a cutoff and was then adjusted by either: (i) a normal sample mean YMR, or (ii) a cancer sample mean YMR. If the normal lung sample YMR is significantly less than 1.0 (for example, in the TCGA RNA-seq data), the YMR cutoff is adjusted to be lower than 2.0. If a normal sample mean YMR is not available for a particular data set (for example, the DCC and Bild data sets), a cutoff value is selected that is close to the mean YMR of the lung cancer data set, since many studies use the mean risk score to stratify patients. This arbitrary YMR cutoff value is only used for the YMR signature validation. In future, a universal YMR cutoff value may be selected for results from a clinically relevant platform such as qPCR.
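The rule that multiple probe sets for the same gene symbol are averaged can be sketched as follows. The probe set to gene symbol pairs are taken from Table 1, but the intensity values and the function name are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from statistics import mean


def collapse_to_gene_symbols(probe_values, probe_to_symbol):
    """Average the values of all probe sets that map to the same gene symbol."""
    by_symbol = defaultdict(list)
    for probe, value in probe_values.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is not None:  # skip probe sets without an annotation
            by_symbol[symbol].append(value)
    return {symbol: mean(values) for symbol, values in by_symbol.items()}


# Probe set annotations as listed in Table 1.
probe_to_symbol = {
    "1678_g_at": "IGFBP5",
    "1601_s_at": "IGFBP5",
    "40532_at": "BIRC5",
}
# Hypothetical intensities for one sample (illustration only).
probe_values = {"1678_g_at": 100.0, "1601_s_at": 200.0, "40532_at": 80.0}
gene_values = collapse_to_gene_symbols(probe_values, probe_to_symbol)
```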

The YMR was compared to a geometric Mean of Yin and Yang Ratio (gYMR). The effects of dropping genes from the 31 Yin and 32 Yang gene lists on associations with clinical outcome were assessed to determine optimal sizes of gene sets. The significance of the YMR signature was assessed by comparing the YMR to the ratio of randomly picked groups of genes of identical group size.

1.5 Statistical analysis

To evaluate the performance of the YMR signature, each YMR was assessed as a dichotomous covariate or a continuous covariate in a Cox proportional hazards model, with 5 to 6 years of overall survival or recurrence-free survival as the outcome variable. The estimated hazard ratio, 95% confidence interval and p-value allowed us to directly compare the performance of the YMR covariate with other clinical variables. Kaplan-Meier product-limit methods and log-rank tests were used to estimate and test differences in probability of survival between low- and high-risk patient groups. The survivor function was plotted for each subgroup. All statistical analyses were performed using PARTEK ® software, version 6.3 (PARTEK is a registered trademark of Partek Inc., St. Louis, MO, USA) or the R statistical package survcomp.
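The Kaplan-Meier product-limit estimate mentioned above can be illustrated with a short, self-contained sketch. This is not the PARTEK or survcomp implementation; the function name and the follow-up data are hypothetical, and censoring is encoded as `False` in the event list.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate at each distinct observed event time.

    times  : follow-up time for each patient
    events : True if death/recurrence was observed, False if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Process all patients sharing the same follow-up time together.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Hypothetical follow-up times (months) and event indicators.
curve = kaplan_meier([6, 12, 12, 24, 36], [True, True, False, True, False])
```

Computing one such curve for the low-risk group and one for the high-risk group gives the survivor functions that the log-rank test then compares.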

1.6 Validation

In order to validate that the YMR is less than 1.0 in normal lung tissues and greater than 1.0 in lung cancer tissue samples, YMRs were measured in new independent data sets. These data sets were processed by different platforms including Affymetrix GENECHIP ® HG-U95, HG-133A, HG-133plus2, ILLUMINA ® beadChip (ILLUMINA is a registered trademark of Continental Resources Inc., Bedford, MA, USA), and two-channel arrays. The YMRs were calculated from these data sets either with or without data normalization, depending on the original data status.

To validate the YMR signature for lung cancer prognosis, four independent data sets were used: (i) the Bhattacharjee data set of 125 adenocarcinoma samples on the HG_U95Av2 platform, (ii) the Bild data set of 58 adenocarcinoma samples on the HG-133Plus2 platform, (iii) 442 DCC sample files on the HG-133A platform, and (iv) 259 TCGA samples on the RNA-seq platform. These are well-defined patient samples with clinical information. For analyses in this study, survival or recurrence-free outcomes were compared between high-risk YMR patients (i.e. YMR greater than 2.0 or an adjusted cutoff) and low-risk YMR patients (YMR less than or equal to 2.0 or an adjusted cutoff). The YMR score stratification within the same stages and in response to treatment was tested in the following groups of the DCC patients, respectively: Stage I; Stage II; Stage III; received chemotherapy; no chemotherapy; chemotherapy on stage I; chemotherapy on stage II & III; no chemotherapy on stage I; no chemotherapy on stage II & III.
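The cutoff rule and the subgroup testing described above can be sketched as follows. The record layout (`id`, `ymr`, `stage`, `chemo`), the function name, and the patient values are hypothetical; only the rule itself (high-risk if YMR is greater than the cutoff, low-risk otherwise) comes from the text.

```python
def split_by_ymr(patients, cutoff=2.0, subgroup=lambda p: True):
    """Return (high_risk_ids, low_risk_ids) for patients in a subgroup."""
    selected = [p for p in patients if subgroup(p)]
    high = [p["id"] for p in selected if p["ymr"] > cutoff]
    low = [p["id"] for p in selected if p["ymr"] <= cutoff]
    return high, low


# Hypothetical patient records (illustration only).
patients = [
    {"id": "A", "ymr": 2.5, "stage": "I", "chemo": False},
    {"id": "B", "ymr": 2.0, "stage": "I", "chemo": False},
    {"id": "C", "ymr": 3.1, "stage": "II", "chemo": True},
    {"id": "D", "ymr": 1.2, "stage": "III", "chemo": True},
]

# Whole cohort, then the "no chemotherapy on stage I" subgroup.
all_high, all_low = split_by_ymr(patients)
s1_high, s1_low = split_by_ymr(
    patients, subgroup=lambda p: p["stage"] == "I" and not p["chemo"]
)
```

An adjusted cutoff, as discussed in section 1.4, would simply be passed as the `cutoff` argument.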

RESULTS

1.7 Identification of candidate lung cancer biomarker genes

Normal lung samples from healthy subjects were compared with the lung cancer samples collected from patients of mixed tumour stages with different survival times to identify and select gene groups for signature development. The differential gene expression in 17 normal lung tissue samples and 83 samples from a variety of lung cancer types was examined using unsupervised clustering analysis of microarray data from Bhattacharjee et al. For the 2D clustering, regions where the genes were down-regulated in normal samples but up-regulated in almost all types of lung cancers were selected (Fig. 4). The region where genes were up-regulated in only one or a few cancer types was not selected. 70 probe sets were identified in this region (Fig. 5, Table 1). We also identified a region where genes were up-regulated in normal samples but down-regulated in almost all types of lung cancers (Fig. 6). The region where genes were down-regulated in only one or a few cancer types was not selected. 98 probe sets were identified in this region (Fig. 7, Table 2).

Table 1:

Probe set | Gene symbol | Gene name
36839_at | CDC6 | cell division cycle 6 homolog (S. cerevisiae)
37238_s_at | PKMYT1 | protein kinase, membrane associated tyrosine/threonine 1
37267_at | THOP1 | thimet oligopeptidase 1
40794_at | KLK3 | kallikrein-related peptidase 3
40866_at | NIPSNAP1 | nipsnap homolog 1 (C. elegans)
41149_at | LOC81691 | exonuclease NEF-sp
32213_at | POP7 | processing of precursor 7, ribonuclease P/MRP subunit (S. cerevisiae)
33935_at | CACYBP | calcyclin binding protein
34341_at | PPAT | phosphoribosyl pyrophosphate amidotransferase
35800_at | PAFAH1B3 | platelet-activating factor acetylhydrolase 1b, catalytic subunit 3 (29kDa)
38107_at | UNC119 | unc-119 homolog (C. elegans)
38373_g_at | CXorf40A /// CXorf40B | chromosome X open reading frame 40A /// chromosome X open reading frame 40B
38401_s_at | LSM14A | LSM14A, SCD6 homolog A (S. cerevisiae)
39909_g_at | TAF6L | TAF6-like RNA polymerase II, p300/CBP-associated factor (PCAF)-associated factor, 65kDa
40263_at | ZFPL1 | zinc finger protein-like 1
40528_at | LHX2 | LIM homeobox 2
40532_at | BIRC5 | baculoviral IAP repeat-containing 5
40597_g_at | TCOF1 | Treacher Collins-Franceschetti syndrome 1
40891_f_at | LAGE3 | L antigen family, member 3
41851_at | CCDC85B | Coiled-coil domain containing 85B
32559_s_at | LSM4 | LSM4 homolog, U6 small nuclear RNA associated (S. cerevisiae)
33203_s_at | FOXD1 | forkhead box D1
1738_at | CDC25A | cell division cycle 25 homolog A (S. pombe)
1678_g_at | IGFBP5 | insulin-like growth factor binding protein 5
1601_s_at | IGFBP5 | insulin-like growth factor binding protein 5
1539_at | NRAS | neuroblastoma RAS viral (v-ras) oncogene homolog
1133_at | EN2 | engrailed homeobox 2
966_at | RAD54L | RAD54-like (S. cerevisiae)
895_at | MIF | macrophage migration inhibitory factor (glycosylation-inhibiting factor)
762_f_at | HIST1H4I | histone cluster 1, H4i
696_at | HOXD8 | homeobox D8
651_at | RPA3 | replication protein A3, 14kDa
480_at | PKMYT1 | protein kinase, membrane associated tyrosine/threonine 1
374_f_at | DDT /// DDTL | D-dopachrome tautomerase /// D-dopachrome tautomerase-like
152_f_at | HIST2H4A /// HIST2H4B | histone cluster 2, H4a /// histone cluster 2, H4b

Table 2:

SEQ ID NO: Probe set Gene symbol Gene name

71 31506_s_at DEFA1 /// DEFA1B /// DEFA3 defensin, alpha 1 /// defensin, alpha 1B /// defensin, alpha 3, neutrophil-specific

72 33690_at ... ...

73 34174_s_at LPHN2 latrophilin 2

74 34604_at SLC6A4 solute carrier family 6 (neurotransmitter transporter, serotonin), member 4

75 35079_at CNTN6 contactin 6

76 35606_at HDC histidine decarboxylase

77 36377_at IL18R1 interleukin 18 receptor 1

78 32370_at GZMH granzyme H (cathepsin G-like 2, protein h-CCPX)

79 32904_at PRF1 perforin 1 (pore forming protein)

80 32971_at FAM189A2 family with sequence similarity 189, member A2

81 33462_at P2RY14 purinergic receptor P2Y, G-protein coupled, 14

82 34940_at ... ...

83 34950_at ZNF423 zinc finger protein 423

84 34965_at CST7 cystatin F (leukocystatin)

85 35964_at MATN3 matrilin 3

86 36258_at PRKG1 protein kinase, cGMP-dependent, type I

87 36275_at SEMA6A sema domain, transmembrane domain (TM), and cytoplasmic domain, (semaphorin) 6A

88 37121_at NKG7 natural killer cell group 7 sequence

89 37154_at PCDH17 protocadherin 17

90 37841_at BCHE butyrylcholinesterase

91 39208_i_at PPBP pro-platelet basic protein (chemokine (C-X-C motif) ligand 7)

92 39279_at BMP6 bone morphogenetic protein 6

93 39325_at LEFTY2 left-right determination factor 2

94 39577_at SOSTDC1 sclerostin domain containing 1

95 39634_at SLIT2 slit homolog 2 (Drosophila)

96 39673_i_at ECM2 extracellular matrix protein 2, female organ and adipocyte specific

97 39674_r_at ECM2 extracellular matrix protein 2, female organ and adipocyte specific

98 40034_r_at SCARF1 scavenger receptor class F, member 1

99 40322_at IL1RL1 interleukin 1 receptor-like 1

100 40374_at ANKRD1 ankyrin repeat domain 1 (cardiac muscle)

101 40398_s_at MEOX2 mesenchyme homeobox 2

102 40665_at FMO3 flavin containing monooxygenase 3

103 40739_at CA4 carbonic anhydrase IV

104 41013_at C10orf72 chromosome 10 open reading frame 72

105 41030_at FOXJ1 forkhead box J1

106 41644_at SASH1 SAM and SH3 domain containing 1

107 31892_at PTPRM protein tyrosine phosphatase, receptor type, M

108 32740_at RAB11FIP2 RAB11 family interacting protein 2 (class I)

109 33328_at HEG1 HEG homolog 1 (zebrafish)

110 33766_at VIPR1 vasoactive intestinal peptide receptor 1

111 34203_at CNN1 calponin 1, basic, smooth muscle

112 34267_r_at LEPR leptin receptor

113 35234_at RECK reversion-inducing-cysteine-rich protein with kazal motifs

114 35985_at AKAP2 /// PALM2-AKAP2 A kinase (PRKA) anchor protein 2 /// PALM2-AKAP2 readthrough

115 36061_at SEMA5A sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A

116 36915_at CTSO cathepsin O

117 37194_at GATA2 GATA binding protein 2

118 37251_s_at GPM6B glycoprotein M6B

119 37253_at PIP5K1B phosphatidylinositol-4-phosphate 5-kinase, type I, beta

120 37536_at CD83 CD83 molecule

121 37958_at TMEM47 transmembrane protein 47

122 38315_at ALDH1A2 aldehyde dehydrogenase 1 family, member A2

123 39031_at COX7A1 cytochrome c oxidase subunit VIIa polypeptide 1 (muscle)

124 39048_at NOTCH4 Notch homolog 4 (Drosophila)

125 39085_at TNNC1 troponin C type 1 (slow)

126 39356_at NEDD4L neural precursor cell expressed, developmentally down-regulated 4-like

127 39400_at TBC1D2B TBC1 domain family, member 2B

128 39750_at ... ...

129 40434_at PODXL podocalyxin-like

130 40480_s_at FYN FYN oncogene related to SRC, FGR, YES

131 40763_at MEIS1 Meis homeobox 1

132 41151_at INPP5K inositol polyphosphate-5-phosphatase K

133 32208_at KIAA0355 KIAA0355

134 32838_at MYH10 myosin, heavy chain 10, non-muscle

135 35344_at LMO7 LIM domain 7

136 35828_at CRIP2 cysteine-rich protein 2

137 36577_at FERMT2 fermitin family homolog 2 (Drosophila)

138 36627_at SPARCL1 SPARC-like 1 (hevin)

139 36939_at GPM6A glycoprotein M6A

140 37407_s_at MYH11 myosin, heavy chain 11, smooth muscle

141 37710_at MEF2C myocyte enhancer factor 2C

142 37718_at SNRK SNF related kinase

143 38734_at PLN phospholamban

144 38747_at CD34 CD34 molecule

145 38748_at ADARB1 adenosine deaminase, RNA-specific, B1 (RED1 homolog rat)

146 39452_s_at SPTBN1 spectrin, beta, non-erythrocytic 1

147 39541_at HEG1 HEG homolog 1 (zebrafish)

148 39544_at SYNM synemin, intermediate filament protein

149 40231_at SMAD6 SMAD family member 6

150 40971_at ANKS1A ankyrin repeat and sterile alpha motif domain containing 1A

151 40994_at GRK5 G protein-coupled receptor kinase 5

152 41549_s_at AP1S2 adaptor-related protein complex 1, sigma 2 subunit

153 41837_at C14orf132 chromosome 14 open reading frame 132

154 32593_at RFTN1 raftlin, lipid raft linker 1

155 1595_at TEK TEK tyrosine kinase, endothelial

156 1389_at MME membrane metallo-endopeptidase

157 1135_at GRK5 G protein-coupled receptor kinase 5

158 994_at PTPRM protein tyrosine phosphatase, receptor type, M

159 995_g_at PTPRM protein tyrosine phosphatase, receptor type, M

160 914_g_at ERG v-ets erythroblastosis virus E26 oncogene homolog (avian)

161 873_at HOXA5 homeobox A5

162 770_at GPX3 glutathione peroxidase 3 (plasma)

163 758_at PTGIR prostaglandin I2 (prostacyclin) receptor (IP)

164 610_at ADRB2 adrenergic, beta-2-, receptor, surface

165 560_s_at TAL1 T-cell acute lymphocytic leukemia 1

166 538_at CD34 CD34 molecule

167 340_at MATN3 matrilin 3

168 210_at PLCB2 phospholipase C, beta 2

By comparing gene expression in various lung cancer cell types against that of normal lung cells, common Yin genes and Yang genes among the different cancers could be identified (Fig. 8). Gene clustering, rather than a group statistical test, not only detects the expression patterns but also indicates, to some extent, the gene interactions within the same pattern. The gene expression pattern obtained by clustering would be more tolerant of variations derived from sample collection and data processing than differentially expressed genes based on holistic two-group statistics. Some genes would not appear in the differential gene list because of large variations in a few samples, but may show a similar overall expression pattern.

Yin genes and Yang genes showed little overlap with previously reported lung cancer prognostic signature genes. However, many Yin genes reported here were found in previous studies relating to the development of lung cancer or cancers of other tissue types, such as GRIN2D, GAST, AMH, TCF3, EXOSC2, GRM1, CDT1, RecQL4, CSTF2, FCGR2B, RNASEH2A, CDC6, CACYBP, BIRC5, CDC25, NRAS, EN2, and MIF. Typical oncogenes were not found. Accordingly, it appears that progression genes play more important roles than tumor initiation genes in determining lung cancer prognoses.

Pathway and interaction network analyses of these 70 genes allowed the selection of two main networks, related to tumor morphology (Table 3, network significance score of 42) and DNA replication (Table 4, network significance score of 30).

Table 3:

Table 4:

These networks participate in the canonical Molecular Mechanisms of Cancer pathway (Figs. 9, 10). These networks contain 31 genes whose gene symbol names matched the Affymetrix U95 AV2 probe set identifiers. We selected these 31 genes as Yin genes.

Table 5: Yin genes

SEQ ID NO: HG U95A probe set Gene Symbol Gene Title

2 34027_f_at HIST1H4J /// HIST1H4K histone cluster 1, H4j /// histone cluster 1, H4k

5 31711_at GRIN2D glutamate receptor, ionotropic, N-methyl D-aspartate 2D

9 34552_at GAST gastrin

10 35084_at AMH anti-Mullerian hormone

14 32874_at TCF3 transcription factor 3 (E2A immunoglobulin enhancer binding factors E12/E47)

15 32975_g_at EXOSC2 exosome component 2

16 33510_s_at GRM1 glutamate receptor, metabotropic 1

17 34510_at CDT1 chromatin licensing and DNA replication factor 1

23 37432_g_at PIAS2 protein inhibitor of activated STAT, 2

28 39651_at RECQL4 RecQ protein-like 4

29 40334_at CSTF2 cleavage stimulation factor, 3' pre-RNA, subunit 2, 64kDa

31 1601_s_at IGFBP5 insulin-like growth factor binding protein 5

32 41623_s_at FZR1 fizzy/cell division cycle 20 related 1 (Drosophila)

34 34664_at FCGR2B Fc fragment of IgG, low affinity IIb, receptor (CD32)

35 35141_at RNASEH2A ribonuclease H2, subunit A

36 36839_at CDC6 cell division cycle 6 homolog (S. cerevisiae)

38 37267_at THOP1 thimet oligopeptidase 1

41 41149_at LOC81691 exonuclease NEF-sp

43 33935_at CACYBP calcyclin binding protein

44 34341_at PPAT phosphoribosyl pyrophosphate amidotransferase

45 35800_at PAFAH1B3 platelet-activating factor acetylhydrolase 1b, catalytic subunit 3 (29kDa)

49 39909_g_at TAF6L TAF6-like RNA polymerase II, p300/CBP-associated factor (PCAF)-associated factor, 65kDa

50 40264_g_at ZFPL1 zinc finger protein-like 1

52 40532_at BIRC5 baculoviral IAP repeat-containing 5

58 1738_at CDC25A cell division cycle 25 homolog A (S. pombe)

61 1539_at NRAS neuroblastoma RAS viral (v-ras) oncogene homolog

62 1133_at EN2 engrailed homeobox 2

63 966_at RAD54L RAD54-like (S. cerevisiae)

64 895_at MIF macrophage migration inhibitory factor (glycosylation-inhibiting factor)

67 652_g_at RPA3 replication protein A3, 14kDa

69 374_f_at DDT /// DDTL D-dopachrome tautomerase /// D-dopachrome tautomerase-like

The 108 down-regulated genes constituted two main networks related to maintenance (network significant score of 63) and cellular development (network significant score of 23) processes. The RAR Activation pathway and the Hepatic Stellate Cell Activation pathway (Fig. 11) invoked by Yang genes exert a wide variety of effects on tissue homeostasis, cell proliferation, differentiation, and apoptosis. There is evidence that lung tissue harbors Hepatic Stellate-like cells, vitamin-A-storing lung cells. Assessment of focus genes retrieved from the networks that involved cell maintenance and cellular development process revealed two groups of genes. These two groups (Tables 6, 7) contain 43 genes resulting in 32 unique genes. We defined these 32 genes as Yang gene candidates for signature development (Table 8).

Table 6:

SEQ ID NO: Gene symbol Gene name

164 ADRB2 adrenergic, beta-2-, receptor, surface

122 ALDH1A2 aldehyde dehydrogenase 1 family, member A2

90 BCHE butyrylcholinesterase

92 BMP6 bone morphogenetic protein 6

120 CD83 CD83 molecule

136 CRIP2 cysteine-rich protein 2

105 FOXJ1 forkhead box J1

130 FYN FYN oncogene related to SRC, FGR, YES

117 GATA2 GATA binding protein 2

161 HOXA5 homeobox A5

77 IL18R1 interleukin 18 receptor 1

99 IL1RL1 interleukin 1 receptor-like 1

112 LEPR leptin receptor

141 MEF2C myocyte enhancer factor 2C

131 MEIS1 Meis homeobox 1

101 MEOX2 mesenchyme homeobox 2

140 MYH11 myosin, heavy chain 11, smooth muscle

124 NOTCH4 Notch homolog 4 (Drosophila)

79 PRF1 perforin 1 (pore forming protein)

95 SLIT2 slit homolog 2 (Drosophila)

149 SMAD6 SMAD family member 6

142 SNRK SNF related kinase

94 SOSTDC1 sclerostin domain containing 1

148 SYNM synemin, intermediate filament protein

165 TAL1 T-cell acute lymphocytic leukemia 1

155 TEK TEK tyrosine kinase, endothelial

83 ZNF423 zinc finger protein 423

Table 7:

Table 8: Yang genes

SEQ ID NO: HG U95A Probe set Gene Symbol Gene Title

73 34174_s_at LPHN2 latrophilin 2

77 36377_at IL18R1 interleukin 18 receptor 1

79 32904_at PRF1 perforin 1 (pore forming protein)

83 34950_at ZNF423 zinc finger protein 423

90 37841_at BCHE butyrylcholinesterase

91 39209_r_at PPBP pro-platelet basic protein (chemokine (C-X-C motif) ligand 7)

92 1733_at BMP6 bone morphogenetic protein 6

94 39577_at SOSTDC1 sclerostin domain containing 1

95 39634_at SLIT2 slit homolog 2 (Drosophila)

99 40322_at IL1RL1 interleukin 1 receptor-like 1

101 40398_s_at MEOX2 mesenchyme homeobox 2

105 41030_at FOXJ1 forkhead box J1

112 34267_r_at LEPR leptin receptor

117 37194_at GATA2 GATA binding protein 2

120 37536_at CD83 CD83 molecule

122 38315_at ALDH1A2 aldehyde dehydrogenase 1 family, member A2

124 39048_at NOTCH4 Notch homolog 4 (Drosophila)

125 39085_at TNNC1 troponin C type 1 (slow)

130 2039_s_at FYN FYN oncogene related to SRC, FGR, YES

131 40763_at MEIS1 Meis homeobox 1

134 40900_at MYH10 myosin, heavy chain 10, non-muscle

136 35828_at CRIP2 cysteine-rich protein 2

140 774_g_at MYH11 myosin, heavy chain 11, smooth muscle

141 37710_at MEF2C myocyte enhancer factor 2C

142 481_at SNRK SNF related kinase

143 38734_at PLN phospholamban

148 39544_at SYNM synemin, intermediate filament protein

149 40231_at SMAD6 SMAD family member 6

155 1595_at TEK TEK tyrosine kinase, endothelial

161 873_at HOXA5 homeobox A5

164 610_at ADRB2 adrenergic, beta-2-, receptor, surface

165 560_s_at TAL1 T-cell acute lymphocytic leukemia 1

1.8 Gene signatures for lung cancer

The signature models disclosed herein are based on computation of the YMR as the patient risk score. The YMR represents a simple combination or interaction effect of the Yin genes and Yang genes. The ratio indicates the Yin and Yang balance status in lung cells, that is, which group of genes is more active than the other and the extent of this difference. In normal lung cells, Yang is greater than Yin. Cancer phenotypes have higher YMR scores, which are associated with higher-risk disease. First, the hypothesis that the YMR is less than 1.0 in normal lung tissues and greater than 1.0 in lung cancer tissues was validated. Several independent sample data sets from different platforms and with different preprocessing were assessed (Table 9). YMRs were less than 1.0 in all normal lung data sets (Fig. 12). The YMRs of 12 different normal human tissue types in one data set were also measured. The data were preprocessed by MAS5.0, and the quantile-normalized data were downloaded from the NCBI GEO database (GSE803). The YMR of each sample was calculated directly from the mean values of the 31 Yin genes and the 32 Yang genes (Table 10). The YMRs were less than 1.0 in normal lung, as well as in other normal tissues such as the heart, spleen, skeletal muscle, and prostate, but greater than 1.0 in other tissues such as the liver. This result suggests that the Yin and Yang gene expression profiles are tissue-type specific. In the 83 samples of various lung cancer types from which Yin and Yang genes were identified via differential gene expression analysis, all samples had a YMR greater than 1.0. The YMRs greater than 1.0 in other independent lung cancer sample data sets are also shown in Fig. 12.
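The YMR calculation described above can be sketched in a few lines. This is a minimal illustration, assuming that the YMR of a sample is the arithmetic mean of its Yin gene expression values divided by the arithmetic mean of its Yang gene expression values on a linear (non-log) scale. The function name `ymr`, the three-gene lists, and the toy expression values are hypothetical and chosen only for the example.

```python
from statistics import mean


def ymr(expression, yin_genes, yang_genes):
    """Return mean(Yin) / mean(Yang) for one sample.

    `expression` maps gene symbol -> linear-scale expression value.
    Genes missing from the sample are skipped (an assumption of this sketch).
    """
    yin = [expression[g] for g in yin_genes if g in expression]
    yang = [expression[g] for g in yang_genes if g in expression]
    if not yin or not yang:
        raise ValueError("sample lacks Yin or Yang gene measurements")
    yang_mean = mean(yang)
    if yang_mean <= 0:
        raise ValueError("mean Yang expression must be positive")
    return mean(yin) / yang_mean


# Hypothetical values for illustration only; not taken from any data set.
toy_sample = {"BIRC5": 300.0, "CDC6": 200.0, "MIF": 400.0,
              "TEK": 150.0, "GATA2": 250.0, "SLIT2": 200.0}
toy_yin = ["BIRC5", "CDC6", "MIF"]
toy_yang = ["TEK", "GATA2", "SLIT2"]

score = ymr(toy_sample, toy_yin, toy_yang)  # mean Yin 300 / mean Yang 200
print(f"YMR = {score:.2f}")
```

In this toy case the score is above 1.0, meaning the Yin group is on average more highly expressed than the Yang group in that sample.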

Table 9:

Ref. Accession Sample type Sample# Platform Preprocess Normalization

Ref. 1 GSE803 normal tissues 24 HG-U95A MAS5.0 Quantile

Ref. 2 GSE2193 normal lung 3 2-color DNA Ratio linear global

Ref. 3 GSE16538 normal lung 6 HG-133plus2 MAS5.0 None

Ref. 4 GSE17558 normal & tumor 16 Illumina Average Quantile

Ref. 5 GSE10072 normal & tumor 107 HG-133A RMA Quantile

Ref. 6 caArray lung cancer 233 HG-95A RMA Quantile

Ref. 7 GSE3141 lung cancer 58 HG-133plus2 MAS5.0 median

Ref. 8 caArray lung cancer 443 HG-133A MAS5.0 None

Ref. 9 LUAD adenocarcinoma 259 RNA-seq RPKM global

Ref. 1: Yanai et al., 2005, Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics 21(5):650-9.

Ref. 2: Shyamsundar et al., 2005, A DNA microarray survey of gene expression in normal human tissues. Genome Biol. 6(3):R22.

Ref. 3: Crouser et al., 2009, Gene expression profiling identifies MMP-12 and ADAMDEC1 as potential pathogenic mediators of pulmonary sarcoidosis. Am. J. Respir. Crit. Care Med. 179(10):929-38.

Ref. 4: April et al., 2009, Whole-genome gene expression profiling of formalin-fixed, paraffin-embedded tissue samples. PLoS One 4(12):e8162.

Ref. 5: Landi et al., 2008, Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS One 3(2):e1651.

Ref. 6: Bhattacharjee et al., 2001, Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98(24):13790-13795.

Ref. 7: Bild et al., 2006, Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439(19):353-357.

Ref. 8: Shedden et al., 2008, Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nature Medicine 14(8):822-827.

Ref. 9: TCGA data portal: http://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp

Table 10:

Tissue type YMR

boneMarrow 1.35

boneMarrow 1.57

liver 2.46

liver 2.78

heart 0.24

heart 0.24

spleen 0.74

spleen 0.87

lung 0.37

lung 0.38

kidney 1.39

kidney 1.26

skeletalMuscle 0.22

skeletalMuscle 0.21

thymus 1.58

thymus 1.72

brain 1.17

brain 1.33

spinalcord 0.98

spinalcord 1.08

prostate 0.84

prostate 0.88

pancreas 2.07

pancreas 2.04

1.9 YMR signature predicts survival outcomes

The YMR was evaluated for prognosis in four data sets in which patient clinical information was available. First, the YMR model was validated for the risk outcome of the Bhattacharjee data set from which the model was built. Since the patients' survival time or recurrence-free survival time information was not used in the modeling, this data set could serve as an independent data set. Treated as a continuous variable in a proportional hazards model, an increased YMR was associated with poorer outcomes in the 6-year recurrence-free rate (p=0.044, HR=1.96) (Table 11).

Table 11:

data set Bhattacharjee Bild DCC RNAseq

data size 125 58 442 258

mean YMR 2.23 1.65 1.85 2.24

normal sample mean YMR 0.91 NA NA 0.38

continuous variable

HR 1.96 1.67 1.8 1.87

dichotomous variable

YMR cutoff >2.0 >1.4 >1.8 >1.8

low risk 60 27 248 121

high risk 65 31 194 137

HR 2.7 2.72 2.63 2.73

The YMR was then examined as a dichotomous variable to stratify patients into high- and low-risk groups. Since the normal lung samples from the same data set show a mean YMR of 0.91 and the 125 adenocarcinomas have a mean YMR of 2.23, a YMR cutoff of 2.0 was defined. The 125 adenocarcinoma patients were grouped into high-risk (YMR>2.0, n=65) and low-risk (YMR<2.0, n=60) groups. As seen in Figs. 13A-13D, the YMR significantly stratified the high-recurrence and low-recurrence risk groups (p=0.013, HR=2.7). Previous studies have reported significant p-values for their gene signatures. This is to be expected, as those signatures were developed from the patients' survival times and then used again to predict survival time. As subsequently demonstrated, the problem with these approaches is their low reproducibility for new independent data sets. By contrast, the YMR approach is not trained on a specific data set and would be expected to work for any data set. As a control, 500 pairs of random gene groups, with the same group sizes as the Yin and Yang gene sets, were picked from the 12,625 genes of the HU-95av2 platform and tested with the same ratio cutoff (> 2.0) as the YMR. The 500 p-values have a mean of 0.75 (sd=0.32) (Figs. 14A-14D). Four p-values from these random tests were very low (0, 0, 0, and 1E-18, respectively); however, their HRs are 1.0 or close to 1.0, so these groups cannot stratify risk groups.
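The dichotomous stratification and the random gene-group control described above can be sketched as follows. This is only an illustration under stated assumptions: patients at or below the cutoff are placed in the low-risk group, the two random groups in each pair are drawn without overlap, and the survival test applied to each pair is not reproduced here. The helper names `stratify` and `random_gene_group_pairs`, the patient IDs, and the placeholder probe names are all hypothetical.

```python
import random


def stratify(ymr_scores, cutoff):
    """Split patient IDs into (high_risk, low_risk) by a YMR cutoff."""
    high = sorted(p for p, s in ymr_scores.items() if s > cutoff)
    low = sorted(p for p, s in ymr_scores.items() if s <= cutoff)
    return high, low


def random_gene_group_pairs(all_genes, n_yin, n_yang, n_pairs, seed=0):
    """Draw n_pairs of disjoint random gene groups of the given sizes."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        picked = rng.sample(all_genes, n_yin + n_yang)
        pairs.append((picked[:n_yin], picked[n_yin:]))
    return pairs


# Hypothetical patient scores, for illustration only.
toy_scores = {"patient_a": 2.5, "patient_b": 1.2, "patient_c": 2.0}
high_risk, low_risk = stratify(toy_scores, cutoff=2.0)

# Placeholder probe names standing in for the 12,625 platform genes.
platform = [f"probe_{i}" for i in range(12625)]
controls = random_gene_group_pairs(platform, n_yin=31, n_yang=32, n_pairs=500)
print(high_risk, low_risk, len(controls))
```

Each random pair would then be scored with the same ratio and cutoff as the real Yin and Yang lists, so that its p-value and HR can be compared with those of the YMR.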

The YMR was then evaluated on a large independent DCC data set. These data were collected and processed at four different institutions and contained pathological data and clinical information describing the severity of the disease at surgery and the clinical course of the disease after sampling. These 442 patients were grouped by YMR into high-risk (YMR>1.8, n=194) and low-risk (YMR<=1.8, n=248) subjects, since the mean YMR is 1.85. As seen in Fig. 10C and in Table 11, the survival outcomes of these two groups were significantly different (p=0.004, HR=2.63). Similarly, a YMR cutoff of 1.4 was used for the Bild data set, since the mean YMR of the 58 adenocarcinomas is 1.6. The YMR significantly stratified (p=0.019, HR=2.72) this independent data set into high-risk (YMR>1.4, n=31) and low-risk (YMR<=1.4, n=27) groups (Fig. 10B). The YMR was also calculated using RNA-seq data of 259 TCGA samples. The continuous YMR scores were significantly associated with the survival rate (p=0.007, HR=1.87) (Table 11). The dichotomous YMR signature significantly stratified the high-risk (n=137) and low-risk (n=121) groups (p=0.007, HR=2.73) (Fig. 10D and Table 11).

The geometric mean of the Yin and Yang gene expression ratio (gYMR) was calculated and tested for its association with poor outcome, both as a continuous variable and as a dichotomous variable. As seen in Table 12, the continuous gYMR does not work for the Bhattacharjee and Bild data, and the dichotomous gYMR does not work for the Bhattacharjee data either. The YMR is robust in four data sets. The continuous YMR did not show an association with clinical outcome in the Bild data set of the HG-133plus2 platform (p=0.49), probably due to the small data size. However, the dichotomous YMR (cutoff > 1.4) significantly stratifies patients' risk in this data set (p=0.02, HR=2.72) (Table 11).
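To make the difference between the two scores concrete, the sketch below contrasts an arithmetic-mean ratio with a geometric-mean ratio. The text does not give an explicit formula for the gYMR, so the definition used here (geometric mean of the Yin values divided by geometric mean of the Yang values) is an assumption, as are the function names and the placeholder gene names and values.

```python
from statistics import geometric_mean, mean


def ymr(expression, yin_genes, yang_genes):
    """Arithmetic-mean ratio: mean(Yin) / mean(Yang)."""
    yin = [expression[g] for g in yin_genes]
    yang = [expression[g] for g in yang_genes]
    return mean(yin) / mean(yang)


def gymr(expression, yin_genes, yang_genes):
    """Assumed gYMR: geometric_mean(Yin) / geometric_mean(Yang).

    Requires strictly positive expression values.
    """
    yin = [expression[g] for g in yin_genes]
    yang = [expression[g] for g in yang_genes]
    return geometric_mean(yin) / geometric_mean(yang)


# Hypothetical values: one Yin gene is far more abundant than the other.
toy = {"yin_a": 100.0, "yin_b": 1600.0, "yang_a": 100.0, "yang_b": 400.0}
arith = ymr(toy, ["yin_a", "yin_b"], ["yang_a", "yang_b"])
geom = gymr(toy, ["yin_a", "yin_b"], ["yang_a", "yang_b"])
print(f"YMR = {arith:.2f}, gYMR = {geom:.2f}")
```

Under this assumed definition the geometric version is less dominated by a single highly expressed gene, which is one plausible reason the two scores can behave differently across data sets.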

Table 12:

mean YMR 1.55 1.35 0.96 1.66

normal sample mean YMR NA NA NA 0.13

continuous variable

log Rank-p 0.46 0.15 0.0001 0.055

HR 1.19 2.02 1.93 2.04

dichotomous variable

YMR cutoff >1.2 >1.0 >0.8 >1.2

high risk 70 41 268 138

log Rank-p 0.64 0.017 0.0001 0.007

HR 2.8 3.28 2. So 2.74

The effect of dropping genes from the Yin and Yang gene lists was assessed using the DCC data set. Dropping one Yin gene (217871_s_at, gene MIF) significantly improved the p-value of the YMR, but its HR decreased at the same time (Figs. 15A, 15B). Dropping a Yin gene affected the p-value of the gYMR but did not affect the HR (Figs. 16A, 16B). Dropping one Yang gene at a time did not affect the p-value of either the YMR or the gYMR (data not shown), nor the HR of the YMR or the gYMR (Figs. 17A, 17B). Dropping three Yin genes (HIST1H4J, CDC25A, and IGFBP5) yielded the best performance of the gYMR for the DCC data (Figs. 16A, 16B, Table 13), but dropping the same genes did not improve the performance of the gYMR in the other three data sets (Table 13).

Table 13: YMR covariate and multivariate analysis using a direct method*

Name Estimate Std Error W (Wald Chi Square) p-value (W) Hazard Ratio

Covariate

YMR 0.39 0.09 20.32 <1.0E-05 1.47

Multivariate

YMR 0.28 0.10 7.54 0.006 1.32

chemo: yes 0.13 0.18 0.54 0.463 1.14

chemo: unknown 0.21 0.25 0.71 0.401 1.23

smoker: yes 0.27 0.27 1.01 0.316 1.31

smoker: unknown 0.16 0.34 0.22 0.641 1.17

sex: male -0.09 0.15 0.34 0.557 0.92

age: >=60 years old -0.41 0.17 5.61 0.018 0.66

stage III 1.34 0.20 44.78 <1.0E-10 3.82

stage II 0.65 0.18 13.41 <1.0E-3 1.91

differentiate: poor 0.27 0.25 1.22 0.270 1.32

differentiate: medium 0.10 0.24 0.17 0.679 1.10

Chemotherapy was a category variable (no chemotherapy group as reference);

Smoking history was a category variable (no smoking group as reference);

Sex was a binary variable (0 for female as reference);

Age was a binary variable (0 for <60 years old as reference);

Tumor stage was a category variable (stage I as reference);

Differentiation (well, medium, poor) was a category variable (well as reference).

These results indicate that the Yin and Yang gene lists could be further optimized to a smaller size by removing one to three genes. However, this optimization is constrained by the survival times of the data set tested, similar to the limitations of the data training approach. It appears that about 30 Yin and 30 Yang genes would ensure a representation of the whole Yin and Yang effects of cancer cells and a consistent performance across different data sets. Smaller gene lists may maintain or improve performance for one data set, but may not work well for other data sets.
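The gene-dropping assessment can be pictured as a leave-one-gene-out loop over the Yin list. The sketch below only recomputes the ratio after removing each Yin gene in turn; the p-value and HR evaluation reported above needs survival data and is not reproduced. It assumes the arithmetic-mean ratio, and the function names, gene subset, and expression values are hypothetical.

```python
from statistics import mean


def ymr_without(expression, yin_genes, yang_genes, dropped):
    """YMR recomputed with one gene removed from either list."""
    yin = [expression[g] for g in yin_genes if g != dropped]
    yang = [expression[g] for g in yang_genes if g != dropped]
    return mean(yin) / mean(yang)


def leave_one_yin_out(expression, yin_genes, yang_genes):
    """Map each Yin gene to the YMR obtained when that gene is dropped."""
    return {g: ymr_without(expression, yin_genes, yang_genes, g)
            for g in yin_genes}


# Hypothetical values for illustration only.
toy_sample = {"MIF": 900.0, "BIRC5": 300.0, "CDC6": 300.0,
              "TEK": 250.0, "GATA2": 250.0}
toy_yin = ["MIF", "BIRC5", "CDC6"]
toy_yang = ["TEK", "GATA2"]

full_score = ymr_without(toy_sample, toy_yin, toy_yang, dropped=None)
dropped_scores = leave_one_yin_out(toy_sample, toy_yin, toy_yang)
for gene, value in dropped_scores.items():
    print(f"without {gene}: YMR = {value:.2f} (full list: {full_score:.2f})")
```

In a real assessment, each reduced list would be rescored across all patients of a data set and the survival association re-tested, which is where the dependence on that data set's survival times enters.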

1.10 Comparison of YMR with previously reported signatures

Several aspects of the YMR model were compared to those of previously reported signatures. As summarized in Table 14, the YMR model offers advantages in reproducibility and practicality. The prognostic performance of the YMR model was compared to a recently reported 15-gene signature (Table 14). This signature was claimed to be superior to many other previously reported lung cancer prognostic signatures, based on testing the same data set with all the other signatures. We used the same DCC data set and the Bild adenocarcinoma data from a different platform (U133plus2) for this comparison. As seen in Fig. 18A, the 15-gene signature significantly stratified the DCC samples (p=0.011, HR=2.68), but not the Bild samples (Fig. 18B, p=0.6). However, the YMR model not only stratified the DCC samples into high-risk and low-risk groups more significantly (Fig. 18C, p=0.004, HR=2.63) than the 15-gene signature, but also separated the Bild samples into high- and low-risk groups (Fig. 18D, p=0.019, HR=2.72), which the 15-gene signature could not. The other two data sets (NLCI, Agilent 44k; JBR 10, RT-qPCR) that were used in the Zhu et al. study (Table 14) were not assessed because these two platforms do not contain enough YMR signature genes. The 15-gene signature works best for squamous cell lung carcinomas among all five data sets, but the YMR did not work for these data (data not shown), probably due to the difference in tumor biology between squamous cell lung carcinoma and adenocarcinoma.

Table 14:

Ref. 1: Gordon et al., 2002, Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res. 62:4963-4967.

Ref. 2: Raponi et al., 2006, Gene expression signatures for predicting prognosis of squamous cell and adenocarcinomas of the lung. Cancer Res. 66(15):7466-7472.

Ref. 3: Chen et al., 2007, A five-gene signature and clinical outcome in non-small-cell lung cancer. N. Engl. J. Med. 356(1):11-20.

Ref. 4: Shedden et al., 2008, Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nature Medicine 14(8):822-827.

Ref. 5: Zhu et al., 2010, Prognostic and Predictive Gene Signature for Adjuvant Chemotherapy in Resected Non-Small-Cell Lung cancer. J. Clinical Oncology 28(29):4417-4424.

Ref. 6: Wan et al., 2010, Hybrid models identified a 12-gene signature for lung cancer prognosis and chemoresponse prediction. PLoS One 5(8):e12222.

Ref. 7: Lu et al., 2012, Gene-Expression Signature Predicts Postoperative Recurrence in Stage I Non-Small Cell Lung Cancer Patients. PLoS ONE 7(1):e30880.

1.11 Analysis of YMR and clinical covariates

The YMR was evaluated together with clinical covariates in lung cancer prognosis. The 442 DCC samples showed a greater than 50% survival rate within 5 years (Fig. 10C), which is biased, given that the 5-year overall survival rate for lung cancer is as low as 16% and has not significantly improved over the past 30 years. The direct method was used and only looked at the stage II/III patients within 72 months of follow-up time. Not surprisingly, disease stage was the most important risk factor. After disease stage, however, the YMR model was the second most important covariate (HR=1.32, p=0.006, Table 13). The gYMR signature using the actuarial method showed a similar result (HR=1.67, p=0.004, Table 15).

Table 15: gYMR covariate and multivariate analysis using an actuarial method*

Name coef Std Error Hazard Ratio (HR) Lower 0.95 HR Upper 0.95 HR Z p-value (z)

Covariate

YMR 0.64 0.15 1.9 1.41 2.56 4.2 2.00E-05

Multivariate

YMR 0.51 0.18 1.67 1.18 2.35 2.89 0.004

chemo: yes 0.21 0.18 1.23 0.87 1.74 1.18 0.24

smoker: yes 0.30 0.28 1.35 0.78 2.36 1.07 0.29

sex: male 0.10 0.17 1.10 0.79 1.53 0.57 0.57

age: >=70 years 0.45 0.18 1.57 1.11 2.22 2.56 0.01

stage 0.89 0.17 2.44 1.74 3.42 5.16 3.00E-07

differentiate: poor 0.09 0.18 1.09 0.77 1.54 0.50 0.62

* YMR was gYMR with three genes dropped.

Chemo was a category variable (no chemotherapy group as reference);

Smoker was a category variable (no smoking group as reference);

Sex was a binary variable (0 for female as reference);

Age was a binary variable (0 for <70 years old as reference);

Tumor stage was a category variable (stage I as reference);

Differentiation was a category variable (well differentiation as reference).

The YMR model stratified 299 stage I DCC patients into high-risk and low-risk groups (p=0.048) and 141 stage II/III patients into high- and low-risk groups (p=0.042). The gYMR risk score showed more significant stratification for stage I patients (Fig. 19A, p=0.012). Unexpectedly, in the whole data set (Figs. 19A-19D), stage I patients who received chemotherapy showed even poorer outcomes than those who did not (Figs. 20A-20C). This could result from bias in the selection of patients for treatment. For those early-stage patients who did not receive chemotherapy, the gYMR risk score was even more significant in predicting prognosis (Fig. 16C, p=0.004). The gYMR score also predicted prognosis among stage II & III patients (p=0.016) (Fig. 16D). These results show that patients with a low YMR score have a good prognosis regardless of disease stage, and that chemotherapy can improve outcomes for high-YMR stage II & III patients.

1.12 DISCUSSION

This disclosure pertains to a new survival prediction signature for lung cancer called "YMR". This YMR signature was built from a cancer biology hypothesis, in contrast to previously reported models that are based on survival time training (Table 14). The YMR value of an individual patient can provide valuable biomarker information relevant to lung cancer prognosis and therapeutic decision-making. In a clinical setting, the ideal prediction model should be applicable to any single patient by providing an informative risk score for that patient. The major shortcoming of all previous prediction models is that the signature gene expression values of new samples have to be comparable to those of the training sample data in terms of data preprocessing, analysis platform, and data normalization. For example, Shedden et al. (2008, Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nature Medicine 14(8):822-827) normalized the entire training and testing data sets together. This is not practical for clinical use. Additionally, global normalization may remove some inter-site differences. Even though using a small number of genes by qRT-PCR would be more practical, qRT-PCR data also need to be normalized before the same models can be applied.

Determination of YMR signatures as disclosed herein not only simplifies the modeling but also avoids the data normalization preprocessing step, since the ratio for each patient is comparable. The YMR is computed from genes measured within the same individual; therefore, it works for a single patient sample. The YMR works across different data analysis platforms and different data preprocessing methods. Further, lung cancer prognosis with the YMR could be improved by optimizing the Yin and Yang gene lists and the number of genes in the YMR calculation.
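One way to see why a within-sample ratio can sidestep cross-sample normalization is that a uniform multiplicative scaling of a sample's values cancels out of the ratio. The sketch below demonstrates this for the arithmetic-mean ratio on linear-scale data; it is an illustration of that single property only and does not cover log-transformed values or additive offsets. The function name, the scaling factor, and the toy values are hypothetical.

```python
import math
from statistics import mean


def ymr(expression, yin_genes, yang_genes):
    """Arithmetic-mean ratio: mean(Yin) / mean(Yang) for one sample."""
    yin = [expression[g] for g in yin_genes]
    yang = [expression[g] for g in yang_genes]
    return mean(yin) / mean(yang)


# Hypothetical linear-scale values for one sample.
toy_sample = {"BIRC5": 320.0, "MIF": 480.0, "TEK": 150.0, "GATA2": 250.0}
toy_yin = ["BIRC5", "MIF"]
toy_yang = ["TEK", "GATA2"]

# Simulate a different overall signal scale (e.g. another scanner setting).
scale_factor = 3.7
rescaled = {gene: value * scale_factor for gene, value in toy_sample.items()}

original_score = ymr(toy_sample, toy_yin, toy_yang)
rescaled_score = ymr(rescaled, toy_yin, toy_yang)
print(math.isclose(original_score, rescaled_score))
```

Gene-specific platform effects, such as probe affinity differences, do not cancel in this way, so the sketch should not be read as a proof of platform independence.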

The ratio of two-gene expression within an individual patient has been reported as a biomarker signature development in lung cancer diagnosis and prognosis as well as for breast cancer prognosis. The single two-gene ratio or geometric mean of several two-gene ratios was selected between the treatment failures and the treatment responders from the training data samples. The single two-gene ratio works well for cancer cell type classification or diagnosis; for example, between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA), but it may not be able to reflect the complex tumor progression process for prognosis. In some cases, there could be substantial variation of the two genes among different samples. Therefore many new studies in recent years still cling to the Cox regression modeling to build the prognostic signatures. Most of these models applied the gene expression value to the Cox proportional coefficient of each signature gene and combined them as the patient risk scores. Some models computed the probability of a patient falling into the low-risk or high-risk class as the patient risk scores. However, there are difficulties in using overall survival as an endpoint in prognostic modeling in cancer. The expression variations of the same gene among individual subjects are substantial. Some genes associated with other aggressive diseases may be present in a subject's tumor. Similarly, a subject might develop and succumb to some other clinical condition shortly after diagnosis. In these instances, a correlation between gene expression and subject survival is lacking. The complex models could learn the expression variable as well as other variations precisely, which would result in low reproducibility if used for a different data set. 
Instead, we return to the two-group gene expression ratio approach but select these two groups of genes using differences between normal lung cells and lung cancer cells that represent the whole Yin and Yang effects of the cell, and not simply based on survival times. With the advent of microarray technology, groups of differentially expressed genes (DEGs) were chosen between normal tissue samples and cancer samples. There have been no prior disclosures that selected the DEGs between normal and cancer samples for cancer prognostic signature development. Rather, previous disclosures selected genes that differ between patients of long and short survival time, or genes that correlate with survival time (Table 3). In those publications, Cox regression analysis of all genes against the survival time of all patients resulted in a proportional hazard rate for each gene. The top genes in the list, pre-clustered genes, or metagenes were used as signature genes. Other studies selected genes that were differentially expressed between high-risk and low-risk patients who were simply grouped by survival time. If the same idea (gene association with survival time) were used in gene selection, then the selected signature gene lists would be expected to be similar across studies. Often, however, the published signatures showed little overlap in the genes identified as significant predictors of outcome. Thus, there is a strong possibility that gene selections were influenced by variations in sample collection, sample size, data processing, and microarray platform. The YMR signatures disclosed herein do not use survival time as a parameter for gene selection and use a gene clustering approach instead of group statistics. It is therefore not unexpected that the gene list does not overlap previously reported lung cancer signature genes, as the YMR signature development approach is quite different.

A useful prognostic signature should not only predict the patient's prognosis, but should also help clinical therapeutic decision making. Even though surgery alone is a standard treatment for early stage lung cancer, more than 20% of stage I patients will relapse. This portion of patients might benefit from chemotherapy. For late stage lung cancer patients, after complete resection of tumors, a good prognostic signal could spare the patient from chemotherapy or support a recommendation for less intensive therapy. The YMR signature disclosed herein is a diagnostic tool for use in clinical therapeutic decision making for different stages of lung cancer. For stage I patients with a high YMR, a carefully chosen therapy regimen is recommended. Chemotherapy can improve outcomes for stage II and III patients with a high YMR.

Example 2:

2.1 Optimization of YMR signature development with combinations of ten Yin and Yang genes

The goal of this study was to narrow down the 31 Yin genes and 32 Yang genes disclosed in Example 1 to the smallest set required for development of reliable YMR signatures while retaining or increasing the prognostic performance of the YMR signature. A smaller Yin gene / Yang gene list will reduce the clinical costs required to generate the YMR signatures and will be more practical for routine PCR-based detection protocols. In this study, the YMR signature development was optimized on data sets from a variety of platforms using a multiple permutation process (MPP) as illustrated in Fig. 3. Data from 741 samples were used in this optimization; specifically, (i) 300 samples downloaded from the DCC U133A NCI caArray database (https://array.nci.nih.gov/caarray/project/details.action?project.id=182); (ii) 260 samples downloaded from the TCGA RNAseq Data Portal (http://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp); (iii) 56 samples from the Bild U133 plus2 data set (Bild et al, 2006); and (iv) 125 samples from the Bhattacharjee U95A data set (Bhattacharjee et al, 2001). First, the gene list size was optimized by permutation of 10,000 YMR signatures of different sizes, followed by testing each signature against 1000 random data sets of 200 samples each. As expected, based on the results disclosed in Example 1, the signature with all 31 Yin and 32 Yang genes had the highest occurrence of tests with a p-value less than 0.05 among the 100,000,000 data set tests (Fig. 21(A)-(C)). Very few signatures consisting of 2 Yin and 2 Yang genes had p-values less than 0.05. However, when the 75th percentile of the 1000 p-values of each permuted signature was checked, none of the signatures containing 27 or more Yin or Yang genes had a 75th percentile p-value less than 0.05, whereas some signatures containing fewer than 24 Yin or 24 Yang genes did (Fig. 22).
This occurred because the full YMR signature, which uses the same 31 Yin and 32 Yang genes in each of the 1000 data sets, has a stable performance, whereas the signatures with fewer than 31 Yin or 32 Yang genes used combinations of different genes. Some of these gene combinations produced a high performance that held for the majority of the 1000 data sets. These small signatures are of primary interest. When the signatures with an even higher percentile (85th percentile) p-value less than 0.05 were checked, the highest proportion of signatures with an 85th percentile p<0.05 tended to fall in the range of 4-6 Yin and 4-6 Yang genes (including the 4-4, 4-5, 4-6, 5-4, 5-5, 5-6, 6-4, 6-5 and 6-6 combinations) (Figs. 23(A)-(D)). Accordingly, Yin and Yang gene list sizes in the range of 4 to 6, i.e. small but retaining good performance, were selected for further analysis.
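The percentile-based screening described above can be sketched in Python as follows. This is an illustrative outline only, not the MPP implementation used in the study: the function names, the nearest-rank percentile rule, the toy gene pools, and in particular `toy_p_value` (a stand-in for a real survival test on a 200-sample subset) are all assumptions.

```python
import math
import random


def percentile(values, q):
    """Nearest-rank q-th percentile (0 < q <= 100) of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]


def screen_signatures(yin_pool, yang_pool, sizes, n_signatures, subsets,
                      p_value_fn, q=75, alpha=0.05, seed=0):
    """Draw random signatures; keep those whose q-th percentile p-value < alpha."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_signatures):
        n_yin, n_yang = rng.choice(sizes)
        signature = (tuple(rng.sample(yin_pool, n_yin)),
                     tuple(rng.sample(yang_pool, n_yang)))
        # One p-value per random data subset for this signature.
        p_values = [p_value_fn(signature, subset) for subset in subsets]
        if percentile(p_values, q) < alpha:
            kept.append(signature)
    return kept


# Hypothetical stand-in for a survival test: signatures containing the
# Yin gene "Y1" always "pass", all others never do.
def toy_p_value(signature, subset):
    return 0.01 if "Y1" in signature[0] else 0.50


yin_pool = [f"Y{i}" for i in range(1, 9)]    # toy Yin gene labels
yang_pool = [f"G{i}" for i in range(1, 9)]   # toy Yang gene labels
sizes = [(a, b) for a in (4, 5, 6) for b in (4, 5, 6)]
subsets = list(range(20))  # placeholders for random 200-sample subsets
kept = screen_signatures(yin_pool, yang_pool, sizes, 50, subsets, toy_p_value)
```

Requiring the 75th (or 85th, or 90th) percentile of the p-values to fall below 0.05 means that at least that fraction of the random data subsets must show a significant result, which is a stricter criterion than counting significant tests alone.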

Second, 1 million signatures containing combinations of 4-6 Yin genes and 4-6 Yang genes were generated. Each signature was tested against 1000 data sets, each of which contained 200 random samples. Those signatures with a 90th percentile p-value less than 0.05 were selected for further assessment, and the genes in these signatures were ranked by number of occurrences. Interestingly, the Yang gene TNNC1 (troponin C type 1 (slow), a calcium binding protein) occurred in almost every successful signature test, i.e. 90th percentile p-value less than 0.05 (Fig. 22). The top 10 Yin genes and top 10 Yang genes were selected for further optimization and were used to generate 1 million signatures of 4-6 Yin genes and 4-6 Yang genes. Each signature was tested in 1,000 data sets of 200 random samples each. Signatures with the lowest p-values and a hazard ratio (HR) greater than 1.1 were retained. The top ten best performing YMR signatures were ranked by the number of p-values < 0.05, the median p-value, the 90th percentile p-value, and the number of genes in the signature (Table 16).

Table 16:

| Rank | Yin# | Yin genes | Yang# | Yang genes |
|---|---|---|---|---|
| 1 | 4 | NRAS; RECQL4; IGFBP5; GRM1 | 6 | GATA2; CD83; SOSTDC1; CRIP2; TNNC1; MYH10 |
| 2 | 4 | GRM1; RECQL4; NRAS; IGFBP5 | 6 | HOXA5; TNNC1; SOSTDC1; CRIP2; CD83; GATA2 |
| 3 | 4 | CDC6; NRAS; RAD54L; GRM1 | 6 | SOSTDC1; CD83; TNNC1; HOXA5; CRIP2; GATA2 |
| 4 | 5 | IGFBP5; AMH; NRAS; RECQL4; GRM1 | 6 | SOSTDC1; TNNC1; CD83; HOXA5; ALDH1A2; GATA2 |
| 5 | 6 | RAD54L; IGFBP5; GRM1; CDC6; NRAS; AMH | 6 | CRIP2; GATA2; TNNC1; CD83; HOXA5; SOSTDC1 |
| 6 | 4 | IGFBP5; CDC6; GRM1; NRAS | 6 | GATA2; CRIP2; SOSTDC1; CD83; MYH10; HOXA5 |
| 7 | 6 | NRAS; GRM1; IGFBP5; CDT1; RAD54L; PPAT | 6 | FOXJ1; TNNC1; SOSTDC1; CRIP2; CD83; HOXA5 |
| 8 | 6 | GRM1; CDT1; PPAT; IGFBP5; CDC6; NRAS | 6 | GATA2; HOXA5; CD83; CRIP2; SOSTDC1; FOXJ1 |
| 9 | 4 | NRAS; GRM1; RAD54L; IGFBP5 | 6 | CRIP2; GATA2; SOSTDC1; MYH10; CD83; TNNC1 |
| 10 | 5 | IGFBP5; PPAT; GRM1; NRAS; RAD54L | 6 | SOSTDC1; TNNC1; MYH10; CRIP2; GATA2; ALDH1A2 |
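Ranking genes by number of occurrences, as described for the successful signatures, can be illustrated with a short tally. The sketch below counts only the Yang gene lists of the ten signatures in the table above, so the counts are not the occurrence counts over the full set of successful signatures; the variable names are illustrative.

```python
from collections import Counter

# Yang gene lists of the ten listed signatures, ranks 1 to 10,
# transcribed from the table above.
yang_lists = [
    "GATA2 CD83 SOSTDC1 CRIP2 TNNC1 MYH10",
    "HOXA5 TNNC1 SOSTDC1 CRIP2 CD83 GATA2",
    "SOSTDC1 CD83 TNNC1 HOXA5 CRIP2 GATA2",
    "SOSTDC1 TNNC1 CD83 HOXA5 ALDH1A2 GATA2",
    "CRIP2 GATA2 TNNC1 CD83 HOXA5 SOSTDC1",
    "GATA2 CRIP2 SOSTDC1 CD83 MYH10 HOXA5",
    "FOXJ1 TNNC1 SOSTDC1 CRIP2 CD83 HOXA5",
    "GATA2 HOXA5 CD83 CRIP2 SOSTDC1 FOXJ1",
    "CRIP2 GATA2 SOSTDC1 MYH10 CD83 TNNC1",
    "SOSTDC1 TNNC1 MYH10 CRIP2 GATA2 ALDH1A2",
]

# Count how many of the ten signatures contain each Yang gene.
counts = Counter(gene for row in yang_lists for gene in row.split())

# Genes ordered from most to least frequent.
ranking = counts.most_common()
for gene, n in ranking:
    print(f"{gene}: {n}")
```

Within these ten signatures, HOXA5 appears in seven and MYH10 in four.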

The top two signatures exhibited very similar performance and differ in only one Yang gene (MYH10 versus HOXA5). Since HOXA5 occurred more frequently than MYH10 in the 4,398 signatures that have a p-value less than 0.05 (Fig. 22), and since HOXA5 acts directly downstream of the retinoic acid receptor (RAR) activation pathway that was noted in Example 1 as one of the main Yang effects, the signature of 4 Yin genes (GRM1, IGFBP5, NRAS, RECQL4) and 6 Yang genes (CRIP2, CD83, GATA2, HOXA5, SOSTDC1, TNNC1) was chosen for further testing. This signature showed prognostic significance in 994 of 1,000 data sets, with a median p-value of 1.30e-05 and a 90th percentile p-value of 0.002. Each of these data sets consisted of 200 randomly picked samples. The clinical potential of this signature is high since it was validated in 1,000 different data sets. It is worth noting that all of the top 10 signatures show similar performance, and therefore all 10 signatures are useful. They share the main components of biological processes or pathways. Therefore, any one gene from a particular biological process or pathway group could substitute for any other within the group, since each would be expected to exhibit the same biological effect in a signature.

2.2 A large real data set validation

The 4 data sets used for optimization in section 2.1 above, as well as 7 additional new data sets, were used to assess whether the YMR of the small gene list retains its performance for each data set:

(i) Okayama et al, 2012, Gene expression data for pathological stage I-II lung adenocarcinomas, PLoS One 7(9):e43923;

(ii) Tomida et al, 2009, Relapse-related molecular signature in lung adenocarcinomas identifies patients with dismal prognosis. J. Clin. Oncol. 27(17):2793-2799;

(iii) Tang et al., 2013, A 12-Gene Set Predicts Survival Benefits from Adjuvant Chemotherapy in Non-Small-Cell Lung Cancer Patients. Clin. Cancer Res. 19(6):1577-1586;

(iv) Bild et al, 2006, Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439(19):353-357;

(v) Fouret et al, 2012, A Comparative and Integrative Approach Identifies ATPase Family, AAA Domain Containing 2 as a Likely Driver of Cell Proliferation in Lung Adenocarcinoma, Clin Cancer Res 18(20):5606-5616;

(vi) Zhu et al, 2010, Prognostic and predictive gene signature for adjuvant chemotherapy in resected non-small-cell lung cancer. J. Clin. Oncol. 28(29):4417-4424;

(vii) Matsuyama et al, 2011, Proteasomal non-catalytic subunit PSMD2 as a potential therapeutic target in association with various clinicopathologic features in lung adenocarcinomas. Mol. Carcinog. 50(4):301-309;

(viii) Sato et al., 2013, Human lung epithelial cells progressed to malignancy through specific oncogenic manipulations. Mol. Cancer Res. 11(6):638-650;

(ix) TCGA Data Portal, http://tcga-data.nci.nih.gov/tcga/tcgaDownload.jsp;

(x) Shedden et al, 2008, Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nature Medicine 14(8): 822-827;

(xi) Bhattacharjee et al, 2001, Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad Sci. USA 98(24): 13790-13795.

The optimized YMR significantly stratified patients into high-risk and low-risk groups (Figs. 23-27). The data sets from these various platforms were combined to make up a total of 1,664 samples. This large data set contains 909 samples from the 7 new data sets, 613 samples used in the optimization studies, and an additional 142 DCC samples that were not used in the optimization studies. Among the 741 cases used in optimization, 128 (56 Bild data samples and 72 TCGA RNAseq samples) were not included in the large data set because these cohorts lacked tumor stage information. As summarized in Table 16, among these patients the median age at diagnosis was 63.2 years, 47% were male, 58% were smokers, and 54% were Stage I. The patient cohort included approximately 17% who received therapy, 48% who did not receive therapy after diagnosis, and 35% whose treatment information was unknown. EGFR/KRAS mutation status was known for 526 patients and p53 mutation status was known for 207 patients.

Table 16:

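| Characteristic | Category | Number of patients |
|---|---|---|
| All patients | | 1664 |
| Age | >=65 | 750 |
| | <65 | 914 |
| Sex | male | 784 |
| | female | 880 |
| Smoking | never | 379 |
| | ever | 972 |
| | unknown | 313 |
| Treatment | no | 807 |
| | yes | 290 |
| | unknown | 577 |
| KRAS/EGFR mutation | negative | 218 |
| | positive | 308 |
| | unknown | 1138 |
| p53 mutation | negative | 140 |
| | positive | 67 |
| | unknown | 1457 |
| Stage | I | 906 |
| | II | 485 |
| | III/IV | 255 |
| | unknown | 18 |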

As shown in Fig. 28(A) and Table 17, the YMR signature significantly stratified all 1,664 patients, across all stages, into high-risk and low-risk survival groups (p=1.55e-08, HR=2.74). To date, this is the largest sample validation for a lung cancer prognostic signature.

Table 17:

| Univariate analysis | Hazard ratio (95% CI) | Log-rank test |
|---|---|---|
| YMR | 2.74 (1.92-3.89) | 1.55E-08 |
| Multivariate analysis | Hazard ratio (95% CI) | Wald test |
| YMR | 1.66 (1.39-1.98) | 2.77E-08 |
| age >=65 years | 1.47 (1.25-1.74) | 5.10E-06 |
| sex | 1.12 (0.94-1.33) | 0.212 |
| ever smoking | 1.21 (0.95-1.56) | 0.13 |
| treatment applied | 1.30 (1.02-1.66) | 0.031 |
| KRAS/EGFR mutation positive | 0.86 (0.62-1.21) | 0.393 |
| p53 mutation positive | 1.16 (0.72-1.87) | 0.554 |
| stage II | 1.66 (1.35-2.03) | 1.07E-06 |
| stage III/IV | 3.37 (2.72-4.19) | < 2e-16 |
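As a rough consistency check on Table 17, a Wald p-value can be back-calculated from a hazard ratio and its 95% confidence interval. The sketch below assumes the interval is symmetric on the log scale (log HR ± 1.96 standard errors), which is the usual construction for Cox models but is not stated in the source; the function name is illustrative. Because the tabulated values are rounded, only approximate agreement with the reported p-values should be expected.

```python
import math


def wald_p_from_ci(hr, lower, upper, z_crit=1.959964):
    """Approximate Wald z and two-sided p-value from an HR and its 95% CI.

    Assumes the CI was built as exp(log(HR) +/- z_crit * SE).
    """
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p


# Multivariate rows of Table 17 (rounded values as tabulated).
z_ymr, p_ymr = wald_p_from_ci(1.66, 1.39, 1.98)   # reported p: 2.77E-08
z_age, p_age = wald_p_from_ci(1.47, 1.25, 1.74)   # reported p: 5.10E-06

print(f"YMR: z = {z_ymr:.2f}, p = {p_ymr:.2e}")
print(f"age: z = {z_age:.2f}, p = {p_age:.2e}")
```

Both back-calculated values fall in the same order of magnitude as the tabulated Wald test results, which is as close as two-decimal rounding of the interval bounds allows.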

As a univariate variable, the YMR signature significantly stratified Stage I patients into high-risk and low-risk groups (p-value of 3.5e-05, HR of 2.7 (1.7-4.2)) (Fig. 28(B)). The YMR also stratified stage II patients (p-value of 0.046, HR of 2.8 (1.01-7.9)) and stage III patients (p-value of 0.004, HR of 2.7 (1.37-5.37)) (Figs. 29(A)-(D)). The YMR signature significantly stratified patients who underwent either chemotherapy or radiotherapy treatment into high-risk and low-risk groups (p-value of 0.03, HR of 2.9 (1.1-7.6)). Similarly, the YMR signature stratified patients who did not receive treatment (p-value of 0.02, HR of 2.7 (1.16-6.26)).
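The kind of stratification test reported above can be sketched with a two-group log-rank test written from its textbook definition. This is a minimal illustration, not the analysis code used in the study: the function name, the YMR values, and the survival times are hypothetical, and the cut-off of 1.0 is taken from the relapse analysis in this Example.

```python
import math


def logrank_test(times, events, high_risk):
    """Two-group log-rank test; returns (chi-square, two-sided p-value).

    times: follow-up time per patient; events: 1 = death, 0 = censored;
    high_risk: True if the patient is in the high-risk group.
    """
    o_minus_e = 0.0
    variance = 0.0
    for t in sorted({u for u, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)                      # at risk
        n_high = sum(1 for u, g in zip(times, high_risk) if u >= t and g)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        d_high = sum(1 for u, e, g in zip(times, events, high_risk)
                     if u == t and e and g)
        o_minus_e += d_high - d * n_high / n
        if n > 1:
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = o_minus_e ** 2 / variance
    p = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 degree of freedom
    return chi2, p


# Hypothetical cohort: per-patient YMR, survival time (months), event flag.
ymr_values = [1.8, 1.5, 1.3, 1.2, 1.1, 0.9, 0.8, 0.7, 0.6, 0.5]
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
events = [1] * 10
high_risk = [v > 1.0 for v in ymr_values]  # assumed cut-off of 1.0

chi2, p = logrank_test(times, events, high_risk)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

In this toy cohort every high-YMR patient dies before every low-YMR patient, so the test is significant even with ten patients; real cohorts require censoring-aware data and, in practice, an established survival analysis package.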

On multi-variate analysis, the YMR signature remained significant amongst other factors (Table 17).

YMR (p = 2.77e-08, HR = 1.66) was second to stage in multivariate analysis taking into account the presence of the other predictors, i.e. age, sex, smoking status, stage, treatment, and oncogene mutation status. Age of 65 years or older carried a significantly higher risk of death than age below 65 years (p-value of 5.10e-06, HR of 1.47 (1.25-1.74)), as expected. Cancer stage was the most significant predictor of death. Unexpectedly, the patients who bore at least one common oncogene mutation (KRAS/EGFR) or a mutation of the tumor suppressor gene p53 did not show higher risk than the patients not carrying these mutations.

The YMR signature was assessed for prediction of the risk of relapse of stage I patients using another independent data set of 123 stage I adenocarcinoma samples, of which 28 (22.8%) had relapse (Chitale et al, 2009, An integrated genomic analysis of lung cancer reveals loss of DUSP4 in EGFR-mutant tumors. Oncogene 28:2773-2783). Twenty-one patients who had relapse within a median of 26 months after resection were in the high-risk group that had a YMR greater than 1.0. Patients with a YMR greater than 1.0 had an increased risk of relapse (31.3%), whereas patients with a YMR less than or equal to 1.0 had a decreased risk of relapse (12.5%). The odds ratio is 3.2 (95% CI: 1.24-8.23; p=0.016).

2.3 Gene function and pathways

As shown in Table 3, the 4 Yin genes are oncogenes or are involved in tumorigenesis in various organs, whereas the 6 Yang genes are tumor suppressors or are involved in apoptosis or other anti-tumor processes. The IPA analysis resulted in a network involved in cancer with a p score of 25 (i.e. p=10^-25). We rebuilt this network by removing the genes other than these ten genes and adding new connections from the literature. The two pathways discussed in Example 1, i.e., the molecular mechanisms of cancer pathway and the RAR pathway, are the core of this Yin and Yang signature.

Table 18:

| SEQ ID NO: | Gene symbol |
|---|---|
| Yin genes | |
| 16 | GRM1 |
| 28 | RECQL4 |
| 61 | NRAS |
| 31 | IGFBP5 |
| Yang genes | |
| 161 | HOXA5 |
| 125 | TNNC1 |
| 94 | SOSTDC1 |
| 136 | CRIP2 |
| 120 | CD83 |
| 171 | GATA2 |

2.4 Discussion

A problem of low reproducibility is often associated with biomarker signature studies, since signatures that are significant in the original reports often are not significant in new data sets. This is caused mostly by the signature development methods themselves, i.e. the signature models were fitted too closely to one data set (the training set) rather than developed using independent data sets (Subramanian et al., 2010). Another problem is that these gene expression signature models are not applicable to an individual patient because they require the patient's data to be highly similar to the original training data. For example, Shedden et al, (2008) normalized the entire training and testing data sets together. Therefore a new approach must be developed to address these problems. The Yin-Yang ratio model (YMR) disclosed herein is computed from the Yin and Yang gene expression of the same individual alone, without borrowing any reference value from a training group; therefore, it works for a single patient sample. The YMR works across different data analysis platforms and different data preprocessing methods because the ratios are independently comparable among individuals, which minimizes the variations. To our knowledge, this report is the largest data set validation in lung cancer biomarker studies. Significantly, this is the first report of combining data from different platforms as a whole. This combination of different platforms was only possible due to our ratio model. This success also suggests that the YMR will work for qPCR detection, which will be part of our future analysis. It is worth noting that the large data set of 1664 samples contained the 613 samples used for optimization. However, these 613 samples differ from traditional "training data" in that no coefficient or constant value derived from the training data is transferred into the YMR signature, as is common in most other trained models published to date.
This signature, which was developed using expression data from all stages, showed significance in predicting the relapse risk of stage I patients. This would aid clinical decisions on the treatment of stage I patients after resection. The signature can be further refined by using more stage I patient data for Yin and Yang gene identification, and the relapse prediction can be further improved by adding other clinical and molecular variables.
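The relapse odds ratio reported in section 2.2 can be approximately reproduced from a 2x2 table. In the sketch below, the relapse counts (21 and 7) come from the reported 28 relapses, while the non-relapse counts (46 and 49) are inferred from the reported relapse rates of 31.3% and 12.5% in a cohort of 123 and are not stated in the source. The Woolf (log) method for the confidence interval is likewise an assumption.

```python
import math

# 2x2 table for the 123 stage I patients (non-relapse counts are inferred).
relapse_high, no_relapse_high = 21, 46   # YMR > 1.0: 21/67 = 31.3%
relapse_low, no_relapse_low = 7, 49      # YMR <= 1.0: 7/56 = 12.5%

odds_ratio = (relapse_high * no_relapse_low) / (no_relapse_high * relapse_low)

# Woolf standard error of log(OR) and a 95% confidence interval.
se = math.sqrt(1 / relapse_high + 1 / no_relapse_high
               + 1 / relapse_low + 1 / no_relapse_low)
log_or = math.log(odds_ratio)
ci_low = math.exp(log_or - 1.959964 * se)
ci_high = math.exp(log_or + 1.959964 * se)

# Two-sided Wald p-value for log(OR) = 0.
p_value = math.erfc(abs(log_or / se) / math.sqrt(2))

print(f"OR = {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), p = {p_value:.3f}")
```

Under these assumptions the calculation gives an odds ratio of about 3.2 with a p-value of about 0.016, and a confidence interval that matches the reported 1.24-8.23 to within rounding of the upper bound.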

It is disclosed in Example 1 that the identified 63 Yin and Yang genes were involved in the canonical Molecular Mechanisms of Cancer pathway and the retinoic acid receptor (RAR) activation pathway. The refined 4 Yin and 6 Yang genes disclosed in this Example are either directly or indirectly related to these two pathways. The ratio of the expression of these genes represents the balance of these pathways, thereby reflecting the biological balance of the Yin and Yang effects within the tumor cells and consequently the risk of either cancer progression or cancer suppression. As shown in Fig. 30, the network that was generated using the 4 Yin and 6 Yang genes included the p38MAPK, JNK, AKT, MAPK, and ERK1/2 gene products, all with established roles in regulation of cancer development and progression.

Among the 4 Yin genes, GRM1 (SEQ ID NO:16) is an oncogene in epithelial cells. GRM1 can activate ERK1/2 (Mann et al, 2006, Stimulation of oncogenic metabotropic glutamate receptor 1 in melanoma cells activates ERK1/2 via PKC. Cellular signalling 18:1279-1286). ERK1/2 in turn can activate the c-JUN and c-FOS transcription factors, which regulate genes that function in the cell cycle and oncogenesis. GRM1 (SEQ ID NO:16) was found to play a role in the regulation of cell proliferation and tumor growth of breast cancer and was suggested as a potential new molecular target for anti-angiogenic therapy of breast cancer (Speyer et al., 2014, Metabotropic Glutamate Receptor-1 as a Novel Target for the Antiangiogenic Treatment of Breast Cancer. PloS one 9:e88830) and renal cell carcinoma (RCC) (Martino et al, 2013, Metabotropic glutamate receptor 1 (Grm1) is an oncogene in epithelial cells. Oncogene 32:4366-4376). RECQL4 (SEQ ID NO:28) was found to be highly expressed in human prostate tumor tissues. Transient and stable suppression of RECQL4 (SEQ ID NO:28) by small interfering RNA and short hairpin RNA vectors drastically reduced the growth and survival of metastatic prostate cancer cells, indicating that RECQL4 (SEQ ID NO:28) could play critical roles in prostate cancer progression (Su et al, Human RecQL4 helicase plays critical roles in prostate carcinogenesis. Cancer research 70:9207-9217). RECQL4 (SEQ ID NO:28) was also overexpressed in breast cancer cells and may play a critical role in human breast tumor progression (Fang et al, 2013, RecQL4 Helicase Amplification Is Involved in Human Breast Tumorigenesis. PloS one 8:e69600). IGFBP5 (SEQ ID NO:31) is an important member of the insulin-like growth factor system, which is critical for both normal cell physiology and tumorigenesis. IGFBP5 (SEQ ID NO:31) was found to be more highly expressed in breast cancer than in adjacent normal tissues (Pekonen et al, 1992, Insulin-like growth factor binding proteins in human breast cancer tissue.
Cancer research 52:5204-5207). IGFBP5 (SEQ ID NO:31) overexpression has also been found to be a poor prognostic factor in patients with urothelial carcinomas of the upper urinary tract and urinary bladder (Gopal et al, 2013, SOSTDC1 down-regulation of expression involves CpG methylation and is a potential prognostic marker in gastric cancer. doi:10.1016/j.cancergen.2013.04.005). The Yang gene HOXA5 (SEQ ID NO:161) is a transcription factor whose expression is lost in more than 60% of breast carcinomas (Chen et al, 2004, HOXA5-induced apoptosis in breast cancer cells is mediated by caspases 2 and 8. Molec. Cell. Biol. 24:924-935). HOXA5 (SEQ ID NO:161) acts directly downstream of retinoic acid receptor β and contributes to retinoic acid-induced apoptosis, growth inhibition and chemopreventive effects, and induction of HOXA5 (SEQ ID NO:161) expression leads to cell death with features typical of apoptosis (Chen et al, 2007). SOSTDC1 (SEQ ID NO:94) has been reported to be down-regulated in various cancers (Gopal et al, 2013). Expression of SOSTDC1 (SEQ ID NO:94) in gastric tumors increased the probability of both overall and disease-free survival, and it is consequently a potential prognostic factor and tumor suppressor in gastric cancer (SEQ ID NO:94). CRIP2 (SEQ ID NO:136) is a candidate tumor-suppressor gene, capable of functionally suppressing tumor formation. It acts as a repressor of NF-κB-mediated proangiogenic cytokine transcription to suppress tumorigenesis and angiogenesis (Cheung et al, 2011, Cysteine-rich intestinal protein 2 (CRIP2) acts as a repressor of NF-κB-mediated proangiogenic cytokine transcription to suppress tumorigenesis and angiogenesis. PNAS 108:8390-8395). Down-regulation of NF-κB leads to positive feedback of the RAR pathway (Figs. 23-27).
Over-expression of CRIP2 (SEQ ID NO:136) induces apoptosis through induction of active caspase 3 and 9 proteins (Lo et al., The LIM domain protein, CRIP2, promotes apoptosis in esophageal squamous cell carcinoma. Cancer letters 316:39-45). GATA2 (SEQ ID NO:171) is critical for organ development, is associated with progression of various cancer types, and was found to associate with RAR. This association is mediated by the zinc fingers of GATA2 (SEQ ID NO:171) and the DNA-binding domain of RAR (Tsuzuki et al, 2004, Cross talk between retinoic acid signaling and transcription factor GATA-2. Molec. Cell. Biol. 24:6824-6836). Decreased expression of GATA2 (SEQ ID NO:171) was associated with poor prognosis of HCC following resection (Li et al, 2014, Decreased Expression of GATA2 Promoted Proliferation, Migration and Invasion of HepG2 In Vitro and Correlated with Poor Prognosis of Hepatocellular Carcinoma. PloS one 9:e87505). The repression of GATA2 (SEQ ID NO:171) in human and mouse lung tumors occurs via an epigenetic mechanism, since its promoter was unmethylated in normal lung but frequently methylated in lung tumors and NSCLC cell lines (Tessema et al, 2014, GATA2 is Epigenetically Repressed in Human and Mouse Lung Tumors and Is Not Requisite for Survival of KRAS Mutant Lung Cancer. J. Thorac. Oncol. 9:784-793). Human CD83 is a marker molecule for mature dendritic cells (DC) that play a key role in inducing and maintaining antitumor immunity. DC antigen-presenting function may be lost or inefficient in the tumor environment. Novel or improved therapeutic approaches could be designed to allow proper functioning of DCs in patients with cancer (Ma et al, 2013, Dendritic cells in the cancer microenvironment. J. Cancer 4:36). TNNC1 (SEQ ID NO:125), the troponin C type 1 (slow) gene, encodes troponin C, a central calcium regulatory protein of striated muscle contraction.
This was the most frequently occurring gene in the optimized Yang gene lists, suggesting that this gene could be a tumor suppressor in lung cancer, similar to MYOD, another muscle gene, in brain cancer (Dey et al., 2013, MyoD is a tumor suppressor gene in medulloblastoma. Cancer Res. 73:6828-6837). A potential mechanism for the tumor-suppressor function of TNNC1 (SEQ ID NO:125) could be via a calcium regulatory function, since TNNC1 (SEQ ID NO:125) binds calcium ions that are involved in apoptotic signaling (Pinton et al, 2008, Calcium and apoptosis: ER-mitochondria Ca2+ transfer in the control of apoptosis. Oncogene 27:6407-6418).

The combination of data from different platforms as a whole provides confidence that this YMR will also work using qPCR detection. The 4 Yin genes and 6 Yang genes, together with 3 housekeeping genes, are feasible for development as a qPCR assay for clinical use. This Yin and Yang gene signature provides prognostic and potentially predictive information for patients of all stages. In particular, those patients whose tumors have a low YMR have better treatment outcomes than those patients who have a higher YMR.