Title:
PROSTATE CANCER BIOMARKERS TO PREDICT RECURRENCE AND METASTATIC POTENTIAL
Document Type and Number:
WIPO Patent Application WO/2010/056993
Kind Code:
A2
Abstract:
Described herein are methods for predicting the recurrence, progression, and metastatic potential of a prostate cancer in a subject. For example, the method comprises detecting in a sample from a subject one or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION. The method can further comprise detecting in a sample from a subject one or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221. An increase or decrease in one or more biomarkers as compared to a standard indicates a recurrent, progressive, or metastatic prostate cancer.

Inventors:
MORENO CARLOS (US)
OSUNKOYA ADEBOYE (US)
ZHOU WEI (US)
LEYLAND-JONES BRIAN (US)
LONG QI (US)
JOHNSON BRENT A (US)
Application Number:
PCT/US2009/064384
Publication Date:
May 20, 2010
Filing Date:
November 13, 2009
Assignee:
UNIV EMORY (US)
MORENO CARLOS (US)
OSUNKOYA ADEBOYE (US)
ZHOU WEI (US)
LEYLAND-JONES BRIAN (US)
LONG QI (US)
JOHNSON BRENT A (US)
International Classes:
C12Q1/68; C12N15/12; G16B25/10; G16B40/00; G16B40/30
Domestic Patent References:
WO2007093657A22007-08-23
Other References:
DONG, X. ET AL.: 'FOXO1A is a Candidate for the 13q14 Tumor Suppressor Gene Inhibiting Androgen Receptor Signaling in Prostate Cancer' CANCER RES(2006) vol. 61, 15 July 2006,
DALLAS, P.B. ET AL.: 'Aberrant over-expression of a forkhead family member, FOXO1A, in a brain tumor cell line' BMC CANCER(2007) vol. 7, no. 67, 19 April 2007,
Attorney, Agent or Firm:
MCKEON, Tina Williams et al. (P.O. Box 10223300 RBC Plaz, Minneapolis Minnesota, US)
Claims:
WHAT IS CLAIMED IS:

1. A method of predicting the recurrence, progression, and metastatic potential of a prostate cancer in a subject, the method comprising detecting in a sample from the subject one or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION, an increase or decrease in one or more of these biomarkers as compared to a standard indicating a recurrent, progressive, or metastatic prostate cancer.

2. The method of claim 1, wherein the sample comprises prostate tumor tissue.

3. The method of claim 1, wherein the detecting step comprises detecting mRNA levels of the biomarker.

4. The method of claim 3, wherein the RNA detection comprises reverse-transcription polymerase chain reaction (RT-PCR) assay; quantitative real-time-PCR (qRT-PCR); Northern analysis; microarray analysis; and cDNA-mediated annealing, selection, extension, and ligation (DASL®) assay.

5. The method of claim 4, wherein the RNA detection comprises the cDNA-mediated annealing, selection, extension, and ligation (DASL®) assay.

6. The method of claim 1, wherein multiple biomarkers are detected and wherein the detection comprises identifying an RNA expression pattern.

7. The method of claim 1, wherein the detected biomarkers comprise two or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION.

8. The method of claim 1, wherein the detected biomarkers comprise three or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION.

9. The method of claim 1, wherein the detected biomarkers comprise four or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION.

10. The method of claim 1, wherein the detected biomarkers comprise five or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION.

11. The method of claim 1, wherein the detected biomarkers comprise six or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION.

12. The method of claim 1, wherein the detected biomarkers comprise seven or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION.

13. The method of claim 1, wherein the detected biomarkers comprise eight or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION.

14. The method of claim 1, wherein the detected biomarkers comprise nine or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION.

15. The method of claim 1, wherein the detected biomarkers comprise ten or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION.

16. The method of claim 1, wherein the detected biomarkers comprise eleven or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION.

17. The method of claim 1, wherein the detected biomarkers comprise twelve or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION.

18. The method of claim 1, wherein the detected biomarkers comprise thirteen or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION.

19. The method of any one of the claims 1-18, further comprising detecting one or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

20. The method of any one of the claims 1-18, further comprising detecting two or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

21. The method of any one of the claims 1-18, further comprising detecting three or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

22. The method of any one of the claims 1-18, further comprising detecting four or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

23. The method of any one of the claims 1-18, further comprising detecting five or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

24. A method of predicting the recurrence, progression, and metastatic potential of a prostate cancer in a subject, the method comprising detecting in a sample from a subject one or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221, an increase or decrease in one or more of these biomarkers as compared to a standard indicating a recurrent, progressive, or metastatic prostate cancer.

25. The method of claim 24, wherein the detected biomarkers comprise two or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

26. The method of claim 24, wherein the detected biomarkers comprise three or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

27. The method of claim 24, wherein the detected biomarkers comprise four or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

28. The method of claim 24, wherein the detected biomarkers comprise five or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

29. A method of treating a subject with prostate cancer comprising modifying a treatment regimen of the subject based on the results of the method of any one of the claims 1-28.

30. The method of claim 29, wherein the treatment regimen is modified to be aggressive based on an increase in one or more biomarkers selected from the group consisting of CLNS1A, XPO1, LETMD1, RAD23B, TMPRSS2_ETV1 FUSION, ABCC3, APC, CHES1, FRZB, HSPG2, miR-103, miR-339, miR-183, and miR-182 as compared to a standard, and a decrease in one or more biomarkers selected from the group consisting of FOXO1A, SOX9, PTGDS, EDNRA, miR-136, and miR-221 as compared to a standard.

31. A method of predicting recurrence potential of a disease, the method comprising:

(a) detecting gene expression profiles in subjects with the disease;

(b) detecting sets of clinical variables associated with the disease;

(c) parametrically modeling the gene expression profiles and non-parametrically modeling the sets of clinical variables; and

(d) selecting gene expression profiles and sets of clinical variables consistent with a selected recurrence potential, wherein the selection step comprises Lasso type estimation.

32. The method of claim 31, wherein the Lasso type estimation comprises a partly linear accelerated failure time model.

33. The method of claim 31, wherein the disease is cancer.

34. The method of claim 33, wherein the cancer is prostate cancer.

35. The method of claim 34, wherein the gene expression profile is limited to a subset of genes.

36. The method of claim 35, wherein the subset of genes comprises one or more genes selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, TMPRSS2_ETV1 FUSION, miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

37. A computer system for predicting recurrence potential of a disease, the computer system comprising:

(a) a memory on which is stored a database comprising: i.) a plurality of gene expression profiles for the disease, wherein each gene expression profile comprises a plurality of values, each value representing the expression level of a gene; ii.) a plurality of sets of clinical variables associated with the disease; and iii.) a descriptor associated with recurrence potential of the disease, wherein the descriptor is based on a combination of the gene expression profiles and the sets of clinical variables; and

(b) a processor having computer-executable code for effecting the following steps: i.) parametrically modeling the gene expression profiles and non-parametrically modeling the sets of clinical variables; ii.) selecting gene expression profiles and sets of clinical variables consistent with a reference recurrence potential; and iii.) outputting the descriptor stating the recurrence potential for the disease based on the combination of the gene expression profile and the set of clinical variables.

38. The computer system of claim 37, wherein the disease is a cancer.

39. The computer system of claim 38, wherein the cancer is prostate cancer.

40. The computer system of claim 39, wherein the gene expression profile is limited to a subset of genes.

41. The method of claim 40, wherein the subset of genes comprises one or more genes selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, TMPRSS2_ETV1 FUSION, miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

42. A kit comprising (i) primers to detect the expression of biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION, and (ii) primers to detect the expression of biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

43. An array consisting of probes to one or more of the biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, TMPRSS2_ETV1 FUSION, miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

Description:
PROSTATE CANCER BIOMARKERS TO PREDICT RECURRENCE AND METASTATIC POTENTIAL

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the benefit of U.S. Provisional Application No. 61/114,658, filed on November 14, 2008.

STATEMENT REGARDING FEDERALLY FUNDED RESEARCH

The invention was made with government support under Grant Nos. R01CA106826 and K22CA96560 from the National Institutes of Health. The government has certain rights in this invention.

BACKGROUND

Prostate cancer is the most commonly diagnosed noncutaneous neoplasm and second most common cause of cancer-related mortality in Western men. One of the important challenges in current prostate cancer research is to develop effective methods to determine whether a patient is likely to progress to the aggressive, metastatic disease in order to aid clinicians in deciding the appropriate course of treatment. The current standard for pathological evaluation of the status of prostate cancer patients is the Gleason score. The Gleason score is calculated based on summing the grades of glandular architecture of the two most prevalent histological components of the tumor. However, it is currently very difficult to predict the outcome of patients based solely on the Gleason score.
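As a minimal sketch of the scoring rule just described, the Gleason score can be written as the sum of two pattern grades. The function name and the assumed grade range of 1 to 5 are illustrative and are not taken from this application.

```python
def gleason_score(primary_grade: int, secondary_grade: int) -> int:
    """Sum the grades of the two most prevalent histological patterns.

    The 1-5 grade range is an assumption made for this illustration.
    """
    for grade in (primary_grade, secondary_grade):
        if not 1 <= grade <= 5:
            raise ValueError("grades are assumed to range from 1 to 5")
    return primary_grade + secondary_grade


# Two different grade combinations collapse to the same score.
print(gleason_score(3, 4))  # 7
print(gleason_score(4, 3))  # 7
```

Because distinct grade combinations map to the same sum, the score by itself discards information about the tumor, which is consistent with the difficulty noted above.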

In medical studies, it is of substantial interest to conduct feature selection using high dimensional biomarker data such as gene expression data, when the outcome of interest may be censored, e.g., censored time to the development or recurrence of a disease. Subsequently, these selected features can be used to predict the risk of developing a disease. In the presence of clinical variables that have been established as the risk factors of a disease, it is preferred to use a feature selection procedure that also adjusts for these clinical variables.

SUMMARY

Provided are methods of predicting the recurrence, progression, and/or metastatic potential of a prostate cancer in a subject. Specifically, the methods comprise selecting a subject at risk of recurrence, progression or metastasis of prostate cancer, and detecting in a sample from the subject one or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION to create a biomarker profile. An increase or decrease in one or more of the biomarkers as compared to a standard indicates a prostate cancer that is prone to recur, progress, and/or metastasize. The sample can, for example, comprise prostate tumor tissue. The method further comprises detecting one or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

Also provided are methods of predicting the recurrence, progression, and/or metastatic potential of a prostate cancer in a subject, the methods comprising selecting a subject at risk of recurrence, progression, or metastasis of prostate cancer, and detecting in a sample from a subject one or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221 to create a biomarker profile. An increase or decrease in one or more of the biomarkers as compared to a standard indicates a prostate cancer that is prone to recur, progress, and/or metastasize.

Also provided are methods of treating a subject with prostate cancer comprising modifying the treatment regimen of the subject based on the results of the method of predicting the recurrence, progression, and/or metastatic potential of a prostate cancer in a subject. The treatment regimen is modified to be aggressive based on an increase in one or more biomarkers selected from the group consisting of CLNS1A, XPO1, LETMD1, RAD23B, TMPRSS2_ETV1 FUSION, ABCC3, APC, CHES1, FRZB, HSPG2, miR-103, miR-339, miR-183, and miR-182 as compared to a standard, and a decrease in one or more biomarkers selected from a group consisting of FOXO1A, SOX9, PTGDS, EDNRA, miR-136, and miR-221 as compared to a standard.

Also provided are kits comprising primers to detect the expression of biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION, and primers to detect the expression of biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

Also provided are methods of predicting recurrence potential of a disease. The methods comprise detecting gene expression profiles in subjects with the disease; detecting sets of clinical variables associated with the disease in subjects with the disease; parametrically modeling the gene expression profile and non-parametrically modeling the set of clinical variables; and selecting gene expression profiles and clinical variables consistent with a selected recurrence potential, wherein the selection step comprises Lasso type estimation.

Further provided are computer systems for predicting recurrence potential of a disease. The computer systems comprise (a) a memory on which is stored a database comprising (i) a plurality of gene expression profiles for the disease, wherein each gene expression profile comprises a plurality of values, each value representing the expression level of a gene; (ii) a plurality of sets of clinical variables associated with the disease; and (iii) a descriptor associated with recurrence potential of the disease, wherein the descriptor is based on a combination of the gene expression profiles and the sets of clinical variables; and (b) a processor having computer-executable code for effecting the following steps (i) parametrically modeling the gene expression profiles and non-parametrically modeling the sets of clinical variables; (ii) selecting gene expression profiles and sets of clinical variables consistent with a reference recurrence potential; and (iii) outputting the descriptor stating the recurrence potential for the disease based on the combination of the gene expression profile and the set of clinical variables.

DESCRIPTION OF DRAWINGS

Figure 1 shows a Kaplan-Meier plot for the prediction of the recurrence, progression, and/or metastatic potential of prostate cancer based on the differential expression of the FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, and TMPRSS2_ETV1 FUSION protein coding genes in samples collected from 71 patients. (p=0.001).

Figure 2 shows a Kaplan-Meier plot for the prediction of the recurrence, progression, and/or metastatic potential of prostate cancer based on the differential expression of the FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, and TMPRSS2_ETV1 FUSION protein coding genes in samples collected from 46 patients with a Gleason score of 7. (p=0.022).

Figure 3 shows a Kaplan-Meier plot for the prediction of the recurrence, progression, and/or metastatic potential of prostate cancer based on the differential expression of the miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221 genes in samples collected from 71 patients. (p=0.001).

Figure 4 shows a Kaplan-Meier plot for the prediction of the recurrence, progression, and/or metastatic potential of prostate cancer based on the differential expression of the miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221 genes in samples collected from 46 patients with a Gleason score of 7. (p=0.032).

Figure 5 shows a Kaplan-Meier plot for the prediction of recurrence, progression, and/or metastatic potential of prostate cancer based on the differential expression of ABCC3, APC, CHES1, EDNRA, FRZB, and HSPG2 protein coding genes in samples collected from 71 patients. (p=6.24 × 10^-5).

Figure 6 shows a graph demonstrating the estimated effect of PSA level on prediction of the recurrence, progression, and/or metastatic potential of prostate cancer.

Figure 7 shows a graph demonstrating the cumulative distribution function (CDF) of the p-values for evaluating the prediction performance. A: the Lasso-PL using all data; B: the Lasso-L using all data; C: the usual AFT model using all data; D: the partly linear AFT model using only the clinical variables; E: the linear AFT model using only the clinical variables.
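For readers unfamiliar with the Kaplan-Meier plots referenced in Figures 1-5, the following is a minimal sketch of the standard product-limit estimator for right-censored follow-up data. The function name and input layout are illustrative assumptions; the application does not describe how its curves were computed.

```python
def kaplan_meier(times, events):
    """Return [(time, recurrence_free_probability)] at each observed event time.

    times:  follow-up time for each subject
    events: 1 if recurrence was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == current_time:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / at_risk
            curve.append((current_time, survival))
        # Both events and censored subjects leave the risk set.
        at_risk -= n_leaving
    return curve


print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# [(1, 0.75), (3, 0.375), (4, 0.0)]
```

In a figure of this kind, one such curve would typically be drawn per predicted risk group, with a test comparing the groups supplying the reported p-value.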

DETAILED DESCRIPTION

Described herein are methods for predicting the recurrence, progression, and/or metastatic potential of a prostate cancer in a subject. The methods comprise selecting a subject at risk of recurrence, progression, or metastasis of prostate cancer, and detecting in a sample from a subject one or more biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION to create a biomarker profile. An increase or decrease in one or more of the biomarkers as compared to a standard indicates a prostate cancer that is prone to recur, progress, and/or metastasize. Optionally, the sample comprises prostate tumor tissue.

Optionally, multiple biomarkers are detected. Detection can comprise identifying an RNA expression pattern. An increase in one or more of the biomarkers selected from the group consisting of CLNS1A, XPO1, LETMD1, RAD23B, TMPRSS2_ETV1 FUSION, ABCC3, APC, CHES1, FRZB, and HSPG2 as compared to a standard indicates a prostate cancer that is prone to recur, progress, and/or metastasize. A decrease in one or more of the biomarkers selected from the group consisting of FOXO1A, SOX9, EDNRA, and PTGDS as compared to a standard indicates a prostate cancer that is prone to recur, progress, and/or metastasize.
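The direction-of-change rule in the preceding paragraph can be sketched as a simple lookup. The function name and the input format (a mapping from marker name to an observed direction) are hypothetical and chosen only for illustration; the marker sets follow the lists given in this description.

```python
# Markers whose increase is described as indicating recurrence-prone cancer.
INCREASED = {
    "CLNS1A", "XPO1", "LETMD1", "RAD23B", "TMPRSS2_ETV1 FUSION",
    "ABCC3", "APC", "CHES1", "FRZB", "HSPG2",
}
# Markers whose decrease is described as indicating recurrence-prone cancer.
DECREASED = {"FOXO1A", "SOX9", "EDNRA", "PTGDS"}


def concordant_markers(changes):
    """Return the markers whose observed change matches the described direction.

    changes: dict of marker name -> "increase", "decrease", or "unchanged"
             (a hypothetical input format for this sketch).
    """
    hits = []
    for marker, direction in changes.items():
        if marker in INCREASED and direction == "increase":
            hits.append(marker)
        elif marker in DECREASED and direction == "decrease":
            hits.append(marker)
    return sorted(hits)


observed = {"XPO1": "increase", "SOX9": "decrease", "APC": "decrease"}
print(concordant_markers(observed))  # ['SOX9', 'XPO1']
```

How many concordant markers are needed before a cancer is treated as recurrence-prone is not fixed by this sketch; the text requires only one or more.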

Optionally, the detected biomarkers comprise two or more, three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, ten or more, eleven or more, twelve or more, thirteen or more, or all fourteen biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION. For example, the detected biomarkers can comprise FOXO1A and SOX9. Alternatively, the detected biomarkers can comprise PTGDS and ABCC3. For example, the detected biomarkers can comprise SOX9, CLNS1A, and TMPRSS2_ETV1 FUSION. Alternatively, the detected biomarkers can comprise PTGDS, XPO1, and CHES1. For example, the detected biomarkers can comprise FOXO1A, PTGDS, XPO1, and RAD23B. Alternatively, the detected biomarkers can comprise CLNS1A, LETMD1, RAD23B, and EDNRA. For example, the selected biomarkers can comprise FOXO1A, SOX9, CLNS1A, PTGDS, and LETMD1. Alternatively, the selected biomarkers can comprise FOXO1A, CLNS1A, PTGDS, XPO1, and FRZB. For example, the selected biomarkers can comprise FOXO1A, CLNS1A, PTGDS, XPO1, LETMD1, and RAD23B. For example, the selected biomarkers can comprise SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, and TMPRSS2_ETV1 FUSION. For example, the selected biomarkers can comprise FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION. For example, the selected biomarkers can comprise FOXO1A, SOX9, CLNS1A, XPO1, RAD23B, ABCC3, EDNRA, FRZB, and TMPRSS2_ETV1 FUSION. For example, the selected biomarkers can comprise FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, and APC. For example, the selected biomarkers can comprise FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, and CHES1. For example, the selected biomarkers can comprise FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, EDNRA, HSPG2, and TMPRSS2_ETV1 FUSION. For example, the selected biomarkers can comprise FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, and HSPG2. Optionally, the selected biomarkers comprise biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION.

Optionally, the method further comprises detecting one or more, two or more, three or more, four or more, five or more, or all six biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221. For example, the selected biomarker can comprise miR-339. Alternatively, the selected biomarker can comprise miR-182. For example, the selected biomarkers can comprise miR-103 and miR-339. Alternatively, the selected biomarkers can comprise miR-136 and miR-221. For example, the selected biomarkers can comprise miR-103, miR-183, and miR-182. Alternatively, the selected biomarkers can comprise miR-339, miR-136, and miR-221. For example, the selected biomarkers can comprise miR-103, miR-339, miR-136, and miR-221. Alternatively, the selected biomarkers can comprise miR-103, miR-183, miR-182, and miR-221. For example, the selected biomarkers can comprise miR-103, miR-339, miR-183, miR-182, and miR-136. Optionally, the method further comprises detecting biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

Also provided are methods of predicting the recurrence, progression, and/or metastatic potential of a prostate cancer in a subject. The methods comprise selecting a subject at risk of recurrence, progression, or metastasis of prostate cancer, and detecting in a sample from a subject one or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221 to create a biomarker profile. An increase or decrease in one or more biomarkers as compared to a standard indicates a prostate cancer that is prone to recur, progress, and/or metastasize. Optionally, the sample can comprise prostate tumor tissue. Optionally, multiple biomarkers are detected. Detection can comprise identifying an RNA expression pattern. An increase in one or more biomarkers selected from the group consisting of miR-103, miR-339, miR-183, and miR-182 as compared to a standard indicates a prostate cancer that is prone to recur, progress, and/or metastasize. A decrease in one or more of the biomarkers selected from miR-136 and miR-221 as compared to a control indicates a prostate cancer that is prone to recur, progress, and/or metastasize.

Optionally, the detected biomarkers comprise two or more, three or more, four or more, five or more, or all six biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221. For example, the detected biomarkers can comprise miR-136 and miR-221. Alternatively, the detected biomarkers can comprise miR-103 and miR-182. For example, the detected biomarkers can comprise miR-103, miR-339, and miR-183. Alternatively, the detected biomarkers can comprise miR-339, miR-136, and miR-221. For example, the detected biomarkers can comprise miR-103, miR-339, miR-183, and miR-182. Alternatively, the detected biomarkers can comprise miR-183, miR-182, miR-136, and miR-221. For example, the detected biomarkers can comprise miR-339, miR-183, miR-182, miR-136, and miR-221. Optionally, the detected biomarkers comprise biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

Optionally, the detecting step comprises detecting mRNA levels of the biomarker. The mRNA detection can, for example, comprise reverse-transcription polymerase chain reaction (RT-PCR), quantitative real-time PCR (qRT-PCR), Northern analysis, microarray analysis, and cDNA-mediated annealing, selection, extension, and ligation (DASL®) assay (Illumina, Inc.; San Diego, CA). Preferably, the RNA detection comprises the cDNA-mediated annealing, selection, extension, and ligation (DASL®) assay (Illumina, Inc.).

Optionally, the detecting step comprises detecting miRNA levels of the biomarker. The miRNA detection can, for example, comprise miRNA chip analysis, Northern analysis, RNase protection assay, in situ hybridization, miRNA expression profiling panels designed for the DASL® assay (Illumina, Inc.), or a modified reverse transcription quantitative real-time polymerase chain reaction assay (qRT-PCR). Preferably the miRNA detection comprises the miRNA expression profiling panels designed for the DASL® assay (Illumina, Inc.).

Optionally, the detecting step comprises detecting mRNA and miRNA levels of the biomarker. The analytical techniques used to determine mRNA and miRNA expression are known. See, e.g., Sambrook et al., Molecular Cloning: A Laboratory Manual, 3rd Ed., Cold Spring Harbor Press, Cold Spring Harbor, NY (2001); Yin et al., Trends Biotechnol. 26:70-6 (2008); Wang and Cheng, Methods Mol. Biol. 414:183-90 (2008); Einat, Methods Mol. Biol. 342:139-57 (2006).

Comparing the mRNA or miRNA biomarker content with a biomarker standard includes comparing mRNA or miRNA content from the subject with the mRNA or miRNA content of a biomarker standard. Such comparisons can be comparisons of the presence, absence, relative abundance, or combination thereof of specific mRNA or miRNA molecules in the sample and the standard. Many of the analytical techniques discussed above can be used alone or in combination to provide information about the mRNA or miRNA content (including presence, absence, and/or relative abundance information) for comparison to a biomarker standard. For example, the DASL® assay can be used to establish an mRNA or miRNA profile for a sample from a subject and the abundances of specific identified molecules can be compared to the abundances of the same molecules in the biomarker standard.
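One way to picture a relative-abundance comparison against a standard is a log2 ratio with a cutoff. This is only a sketch under stated assumptions: the function name, the dictionary inputs, and the two-fold threshold are illustrative and are not specified in this application, and presence/absence calls (zero abundance) are outside its scope.

```python
import math


def compare_to_standard(sample, standard, min_abs_log2_ratio=1.0):
    """Label each marker as increase, decrease, or unchanged versus a standard.

    sample, standard: dict of marker name -> positive abundance value.
    min_abs_log2_ratio: illustrative cutoff (1.0 corresponds to a two-fold change).
    """
    calls = {}
    for marker, level in sample.items():
        if marker not in standard:
            continue  # no reference value, so no call is made
        log2_ratio = math.log2(level / standard[marker])
        if log2_ratio >= min_abs_log2_ratio:
            calls[marker] = "increase"
        elif log2_ratio <= -min_abs_log2_ratio:
            calls[marker] = "decrease"
        else:
            calls[marker] = "unchanged"
    return calls


sample = {"XPO1": 8.0, "SOX9": 1.0, "APC": 3.0}
standard = {"XPO1": 2.0, "SOX9": 4.0, "APC": 3.0}
print(compare_to_standard(sample, standard))
# {'XPO1': 'increase', 'SOX9': 'decrease', 'APC': 'unchanged'}
```

In practice the cutoff would depend on the assay and on which biomarker standard is used.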

Optionally, the detecting step comprises detecting the protein expression levels of the protein-coding gene biomarkers. The protein-coding gene biomarkers can comprise FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION. The protein detection can, for example, comprise an assay selected from the group consisting of Western blot, enzyme-linked immunosorbent assay (ELISA), enzyme immunoassay (EIA), radioimmunoassay (RIA), immunohistochemistry, and protein array. The analytical techniques used to determine protein expression are known. See, e.g., Sambrook et al., Molecular Cloning: A Laboratory Manual, 3rd Ed., Cold Spring Harbor Press, Cold Spring Harbor, NY (2001).

Biomarker standards can be predetermined, determined concurrently, or determined after a sample is obtained from the subject. Biomarker standards for use with the methods described herein can, for example, include data from samples from subjects without prostate cancer, data from samples from subjects with prostate cancer that is not a progressive, recurrent, and/or metastatic prostate cancer, and data from samples from subjects with prostate cancer that is a progressive, recurrent, and/or metastatic prostate cancer. Comparisons can be made to multiple biomarker standards. The standards can be run in the same assay or can be known standards from a previous assay.

Also provided herein are methods of treating a subject with prostate cancer. The methods comprise modifying a treatment regimen of the subject based on the results of any of the methods of predicting the recurrence, progression, and metastatic potential of a prostate cancer in a subject. Optionally, the treatment regimen is modified to be aggressive based on an increase in one or more biomarkers selected from the group consisting of CLNS1A, XPO1, LETMD1, RAD23B, TMPRSS2_ETV1 FUSION, ABCC3, APC, CHES1, FRZB, HSPG2, miR-103, miR-339, miR-183 and miR-182 as compared to a standard. Optionally, the treatment regimen is modified to be aggressive based on a decrease in one or more biomarkers selected from the group consisting of FOXO1A, SOX9, PTGDS, EDNRA, miR-136, and miR-221 as compared to a standard. Optionally, the treatment regimen is modified to be aggressive based on a combination of an increase in one or more biomarkers selected from the group consisting of CLNS1A, XPO1, LETMD1, RAD23B, TMPRSS2_ETV1 FUSION, ABCC3, APC, CHES1, FRZB, HSPG2, miR-103, miR-339, miR-183, and miR-182 and a decrease in one or more biomarkers selected from the group consisting of FOXO1A, SOX9, PTGDS, EDNRA, miR-136, and miR-221 as compared to a standard.

Also provided are methods of predicting recurrence potential of a disease. The methods comprise detecting gene expression profiles in subjects with the disease; detecting sets of clinical variables associated with the disease in the subjects with the disease; parametrically modeling the gene expression profiles and non-parametrically modeling the sets of clinical variables; and selecting gene expression profiles and sets of clinical variables consistent with a selected recurrence potential, wherein the selection step comprises Lasso type estimation. Optionally, the Lasso type estimation comprises a partly linear accelerated failure time model. Optionally, the disease is cancer. The cancer can, for example, be prostate cancer. Optionally, the gene expression profile is limited to a subset of genes. The subset of genes can, for example, comprise one or more genes selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, TMPRSS2_ETV1 FUSION, miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

As used herein, parametrically modeling refers to modeling a family of distributions which can be described using a finite number of parameters. An example of a parametric model for outcome T and predictors Z is: $T_i = \theta^{\top} Z_i + \varepsilon_i$, where $\theta$ is a $d$-dimensional vector of unknown parameters and $d$ is finite.

As used herein, non-parametrically modeling refers to modeling where the interpretation does not depend on fitting any parametrized distributions. Non-parametric models are widely used for studying populations that take on a ranked order (e.g., the Gleason score associated with prostate cancer). An example of a non-parametric model for outcome T and predictors X is: $T_i = \phi(X_i) + \varepsilon_i$, where φ is an unknown function to be estimated (i.e., an infinite dimensional parameter).

As used herein, a lasso type estimation refers to the minimization of a convex loss function subject to L1-norm constraints on a finite number of unknown parameters in the loss function. An example of a partly linear model is the following expression: $T_i = \phi(X_i) + \theta^{T} Z_i + \varepsilon_i$, where T is a lifetime variable of interest, possibly right-censored, and the unknown parameters are as described above in the context of parametric models and non-parametric models.

Also provided are computer systems for predicting recurrence potential of a disease. The computer systems comprise (a) a memory on which is stored a database comprising (i) a plurality of gene expression profiles for the disease, wherein each gene expression profile comprises a plurality of values, each value representing the expression level of a gene; (ii) a plurality of sets of clinical variables associated with the disease; and (iii) a descriptor associated with recurrence potential of the disease, wherein the descriptor is based on a combination of the gene expression profile and the set of clinical variables; and (b) a processor having computer-executable code for effecting the following steps: (i) parametrically modeling the gene expression profiles and non-parametrically modeling the sets of clinical variables; (ii) selecting gene expression profiles and sets of clinical variables consistent with a reference recurrence potential; and (iii) outputting the descriptor stating the recurrence potential for the disease based on the combination of the gene expression profile and the set of clinical variables. Optionally, the disease is a cancer. The cancer can, for example, be prostate cancer. Optionally, the gene expression profile is limited to a subset of genes. The subset of genes can, for example, comprise one or more genes selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, TMPRSS2_ETV1 FUSION, miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221.

Also provided are kits comprising primers to detect the expression of biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, and TMPRSS2_ETV1, and primers to detect the expression of biomarkers selected from the group consisting of miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221. Optionally, directions to use the primers provided in the kit to predict the progression and metastatic potential of prostate cancer in a subject, materials needed to obtain RNA in a sample from a subject, containers for the primers, or reaction vessels are included in the kit.

Also provided are arrays consisting of probes to one or more of the biomarkers selected from the group consisting of FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, TMPRSS2_ETV1 FUSION, miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221. The arrays provided herein can be a DNA microarray, an RNA microarray, a miRNA microarray, or an antibody array. Arrays are known in the art. See, e.g., Dufva, Methods Mol. Biol. 529:1-22 (2009); Plomin and Schalkwyk, Dev. Sci. 10(1):19-23 (2007); Kopf and Zharhary, Int. J. Biochem. Cell Biol. 39(7-8):1305-17 (2007); Haab, Curr. Opin. Biotechnol. 17(4):415-21 (2006); Thomson et al., Nat. Methods 1:47-53 (2004).

As used herein, subject can be a vertebrate, more specifically a mammal (e.g., a human, horse, cat, dog, cow, pig, sheep, goat, mouse, rabbit, rat, and guinea pig), birds, reptiles, amphibians, fish, and any other animal. The term does not denote a particular age. Thus, adult and newborn subjects are intended to be covered. As used herein, patient or subject may be used interchangeably and can refer to a subject afflicted with a disease or disorder (e.g., prostate cancer). The term patient or subject includes human and veterinary subjects.

As used herein, a subject at risk for recurrence, progression, or metastasis of prostate cancer refers to a subject who currently has prostate cancer, a subject who previously has had prostate cancer, or a subject at risk of developing prostate cancer. A subject at risk of developing prostate cancer can be genetically predisposed to prostate cancer, e.g., can have a family history of prostate cancer or a mutation in a gene that causes prostate cancer. Alternatively, a subject at risk of developing prostate cancer can show early signs or symptoms of prostate cancer, such as hyperplasia. A subject currently with prostate cancer has one or more of the symptoms of the disease and may have been diagnosed with prostate cancer.

As used herein, the terms treatment, treat, or treating refer to a method of reducing the effects of a disease or condition (e.g., prostate cancer) or symptom of the disease or condition. Thus, in the disclosed method, treatment can refer to a 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% reduction in the severity of an established disease or condition or symptom of the disease or condition. For example, a method of treating a disease is considered to be a treatment if there is a 10% reduction in one or more symptoms of the disease in a subject as compared to a control. Thus, the reduction can be a 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% or any percent reduction between 10 and 100% as compared to native or control levels. It is understood that treatment does not necessarily refer to a cure or complete ablation of the disease, condition, or symptoms of the disease or condition.

Disclosed are materials, compositions, and components that can be used for, can be used in conjunction with, can be used in preparation for, or are products of the disclosed methods and compositions. These and other materials are disclosed herein, and it is understood that when combinations, subsets, interactions, groups, etc. of these materials are disclosed that while specific reference of each various individual and collective combinations and permutations of these compounds may not be explicitly disclosed, each is specifically contemplated and described herein. For example, if a method is disclosed and discussed and a number of modifications that can be made to a number of molecules including the method are discussed, each and every combination and permutation of the method, and the modifications that are possible are specifically contemplated unless specifically indicated to the contrary. Likewise, any subset or combination of these is also specifically contemplated and disclosed. This concept applies to all aspects of this disclosure including, but not limited to, steps in methods using the disclosed compositions. Thus, if there are a variety of additional steps that can be performed, it is understood that each of these additional steps can be performed with any specific method steps or combination of method steps of the disclosed methods, and that each such combination or subset of combinations is specifically contemplated and should be considered disclosed.

Publications cited herein and the material for which they are cited are hereby specifically incorporated by reference in their entireties.

Examples

General Methods

RNA Isolation

RNA is isolated from formalin-fixed paraffin-embedded (FFPE) tissue according to the methods described in Abramovitz et al., Biotechniques 44(3):417-23 (2008). In brief, three 5 μm sections per block were cut and placed into a 1.5 mL sterile microfuge tube. The tissue section was deparaffinized with 100% xylene for 3 minutes at 50°C. The tissue section was centrifuged, washed twice with ethanol, and allowed to air dry. The tissue section was digested with Proteinase K for 24 hours at 50°C. RNA was isolated using an Ambion RecoverAll Kit (Ambion; Austin, TX).

cDNA-mediated annealing, selection, extension, and ligation assay (DASL® assay)

Upon the completion of RNA isolation, the isolated RNA is used in the DASL® assay. The DASL® assay is performed according to the protocols supplied by the manufacturer (Illumina, Inc.; San Diego, CA). The primer sequences for the fourteen biomarker genes are shown in Table 1. The probe sequences for the fourteen biomarker genes are shown in Table 2.

Table 1: DASL® assay Primer Sequences for Fourteen Biomarker Genes

Gene                  Primer Sequences

FOXO1A                5'-ACTTCGTCAGTAACGGACGTCCTAGGAGAAGAGCTGCATCCA-3' (SEQ ID NO:1)
                      5'-GAGTCGAGGTCATATCGTGTCCTAGGAGAAGAGCTGCATCCA-3' (SEQ ID NO:2)

SOX9                  5'-ACTTCGTCAGTAACGGACGCTCCTACCCGCCCATCACCC-3' (SEQ ID NO:3)
                      5'-GAGTCGAGGTCATATCGTGCTCCTACCCGCCCATCACCC-3' (SEQ ID NO:4)

CLNS1A                5'-ACTTCGTCAGTAACGGACGGAGAGAACTTGGTGCCTCTTCC-3' (SEQ ID NO:5)
                      5'-GAGTCGAGGTCATATCGTGGAGAGAACTTGGTGCCTCTTCC-3' (SEQ ID NO:6)

PTGDS                 5'-ACTTCGTCAGTAACGGACGCGAACCCAGACCCCCAGG-3' (SEQ ID NO:7)
                      5'-GAGTCGAGGTCATATCGTGCGAACCCAGACCCCCAGG-3' (SEQ ID NO:8)

XPO1                  5'-ACTTCGTCAGTAACGGACGCCAGCAAAGAATGGCTCAAGAA-3' (SEQ ID NO:9)
                      5'-GAGTCGAGGTCATATCGTGCCAGCAAAGAATGGCTCAAGAA-3' (SEQ ID NO:10)

LETMD1                5'-ACTTCGTCAGTAACGGACGTCACCTTTCTCCAAAGGCAGATG-3' (SEQ ID NO:11)
                      5'-GAGTCGAGGTCATATCGTGTCACCTTTCTCCAAAGGCAGATG-3' (SEQ ID NO:12)

RAD23B                5'-ACTTCGTCAGTAACGGACAATCCTTCCTTGCTTCCAGCG-3' (SEQ ID NO:13)
                      5'-GAGTCGAGGTCATATCGTAATCCTTCCTTGCTTCCAGCG-3' (SEQ ID NO:14)

TMPRSS2_ETV1 FUSION   5'-ACTTCGTCAGTAACGGACAGCGCGGCACTCAGGTACCT-3' (SEQ ID NO:15)
                      5'-ACTTCGTCAGTAACGGACAGCGCGGCACTCAGGTACCT-3' (SEQ ID NO:16)

ABCC3                 5'-ACTTCGTCAGTAACGGACATGTTCCTGTGCTCCATGATGC-3' (SEQ ID NO:17)
                      5'-GAGTCGAGGTCATATCGTATGTTCCTGTGCTCCATGATGC-3' (SEQ ID NO:18)
                      5'-GTCGCTGATCTTACAACACTATTACATGCCTATTGACGTGAGGCGGTCTGCCTATAGTGAGTC-3' (SEQ ID NO:19)

APC                   5'-ACTTCGTCAGTAACGGACGTCCCTGGAGTAAAACTGCGGTC-3' (SEQ ID NO:20)
                      5'-GAGTCGAGGTCATATCGTGTCCCTGGAGTAAAACTGCGGTC-3' (SEQ ID NO:21)
                      5'-AAAATGTCCCTCCGTTCTTATCTAGATCGCAAAAGTGTCTCGGAAGTCTGCCTATAGTGAGTC-3' (SEQ ID NO:22)

CHES1                 5'-ACTTCGTCAGTAACGGACGGGTTTCTCCAAGGCCCTTCA-3' (SEQ ID NO:23)
                      5'-GAGTCGAGGTCATATCGTGGGTTTCTCCAAGGCCCTTCA-3' (SEQ ID NO:24)
                      5'-GAAGACGATGACCTCGACTTCATACGCGAATTGATAGAAGCTCGGTCTGCCTATAGTGAGTC-3' (SEQ ID NO:25)

EDNRA                 5'-ACTTCGTCAGTAACGGACTGCAACTCTGCTCAGGATCATTT-3' (SEQ ID NO:26)
                      5'-GAGTCGAGGTCATATCGTTGCAACTCTGCTCAGGATCATTT-3' (SEQ ID NO:27)
                      5'-CCAGAACAAATGTATGAGGAATTCACTCAAGGCCGTTAGCTGTGGTCTGCCTATAGTGAGTC-3' (SEQ ID NO:28)

FRZB                  5'-ACTTCGTCAGTAACGGACGGAAGCTTCGTCATCTTGGACTCAG-3' (SEQ ID NO:29)
                      5'-GAGTCGAGGTCATATCGTGGAAGCTTCGTCATCTTGGACTCAG-3' (SEQ ID NO:30)
                      5'-AAAAGTGATTCTAGCAATAGTGATTTTACTGCGCTCCTAATTGGCACCGTCTGCCTATAGTGAGTC-3' (SEQ ID NO:31)

HSPG2                 5'-ACTTCGTCAGTAACGGACCCAAATGCGCTGGACACATT-3' (SEQ ID NO:32)
                      5'-GAGTCGAGGTCATATCGTCCAAATGCGCTGGACACATT-3' (SEQ ID NO:33)
                      5'-GTACCTTTCTGATGATGAGGACGGAACAGCTTACGACTTTGCGGGTCTGCCTATAGTGAGTC-3' (SEQ ID NO:34)

Table 2: Probe Sequences for Detection of Fourteen Biomarker Genes in DASL® assay

To compute the predictive fourteen-gene score, DASL® signal levels are quantile normalized across the array, and then Z-score normalized across the samples. (Z-score = (signal - average(signal))/stdev(signal)). Once the predictive scores are computed, samples are separated based on whether they are greater or less than the median score. If a sample has a score greater than the median, the subject is predicted to not have recurrence. If the score is less than the median, the subject is predicted to have recurrence. For this predictive score, the higher the score, the less likely the subject is to have recurrence.

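The predictive fourteen-gene score can be calculated using the following formula:

$$
\begin{aligned}
\text{FOURTEEN GENE SCORE} ={} & (C_{\mathrm{FOXO1A}} \times \mathrm{FOXO1A}_{\mathrm{Zscore}}) + (C_{\mathrm{SOX9}} \times \mathrm{SOX9}_{\mathrm{Zscore}}) \\
& + (C_{\mathrm{CLNS1A}} \times \mathrm{CLNS1A}_{\mathrm{Zscore}}) + (C_{\mathrm{PTGDS}} \times \mathrm{PTGDS}_{\mathrm{Zscore}}) \\
& + (C_{\mathrm{XPO1}} \times \mathrm{XPO1}_{\mathrm{Zscore}}) + (C_{\mathrm{LETMD1}} \times \mathrm{LETMD1}_{\mathrm{Zscore}}) \\
& + (C_{\mathrm{RAD23B}} \times \mathrm{RAD23B}_{\mathrm{Zscore}}) + (C_{\mathrm{TMPRSS2\_ETV1\ FUSION}} \times \mathrm{TMPRSS2\_ETV1\ FUSION}_{\mathrm{Zscore}}) \\
& + (C_{\mathrm{ABCC3}} \times \mathrm{ABCC3}_{\mathrm{Zscore}}) + (C_{\mathrm{APC}} \times \mathrm{APC}_{\mathrm{Zscore}}) \\
& + (C_{\mathrm{CHES1}} \times \mathrm{CHES1}_{\mathrm{Zscore}}) + (C_{\mathrm{EDNRA}} \times \mathrm{EDNRA}_{\mathrm{Zscore}}) \\
& + (C_{\mathrm{FRZB}} \times \mathrm{FRZB}_{\mathrm{Zscore}}) + (C_{\mathrm{HSPG2}} \times \mathrm{HSPG2}_{\mathrm{Zscore}}).
\end{aligned}
$$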

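The coefficients for the predictive fourteen-gene score are as follows: $C_{\mathrm{FOXO1A}}$ = 0.687, $C_{\mathrm{SOX9}}$ = 0.351, $C_{\mathrm{CLNS1A}}$ = 0.112, $C_{\mathrm{PTGDS}}$ = 0.058, $C_{\mathrm{XPO1}}$ = -0.208, $C_{\mathrm{LETMD1}}$ = -0.019, $C_{\mathrm{RAD23B}}$ = -0.065, $C_{\mathrm{TMPRSS2\_ETV1\ FUSION}}$ = -0.168, $C_{\mathrm{ABCC3}}$ = -0.202, $C_{\mathrm{APC}}$ = -0.128, $C_{\mathrm{FRZB}}$ = 0.310, $C_{\mathrm{HSPG2}}$ = -0.048, $C_{\mathrm{EDNRA}}$ = 0.539, and $C_{\mathrm{CHES1}}$ = -0.143.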
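The scoring rule described above can be sketched in Python as follows. This is a minimal illustration, not the inventors' implementation: the function names are hypothetical, the input signals are assumed to be quantile normalized already, the sample standard deviation is assumed for stdev(signal), and a score exactly equal to the median (a case the text does not address) is treated here as predicting no recurrence.

```python
from statistics import mean, median, stdev

# Coefficients for the fourteen-gene score, as listed in the text.
COEFFICIENTS = {
    "FOXO1A": 0.687, "SOX9": 0.351, "CLNS1A": 0.112, "PTGDS": 0.058,
    "XPO1": -0.208, "LETMD1": -0.019, "RAD23B": -0.065,
    "TMPRSS2_ETV1 FUSION": -0.168, "ABCC3": -0.202, "APC": -0.128,
    "FRZB": 0.310, "HSPG2": -0.048, "EDNRA": 0.539, "CHES1": -0.143,
}


def z_scores(values):
    """Z-score one gene's signals across samples: (signal - average) / stdev.

    Assumption: the sample standard deviation is used.
    """
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]


def fourteen_gene_scores(signals):
    """Return one score per sample.

    `signals` maps each gene name to a list of quantile-normalized
    signal levels, one entry per sample (hypothetical input layout).
    """
    z = {gene: z_scores(signals[gene]) for gene in COEFFICIENTS}
    n_samples = len(next(iter(z.values())))
    return [
        sum(COEFFICIENTS[gene] * z[gene][i] for gene in COEFFICIENTS)
        for i in range(n_samples)
    ]


def predict_recurrence(scores):
    """True means predicted recurrence (score below the median score)."""
    cutoff = median(scores)
    return [score < cutoff for score in scores]
```

Because higher scores indicate lower risk, the comparison in `predict_recurrence` flags only scores strictly below the median.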

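The coefficients for the predictive seven-gene score are as follows: $C_{\mathrm{FOXO1A}}$ = 0.625, $C_{\mathrm{SOX9}}$ = 0.253, $C_{\mathrm{CLNS1A}}$ = 0.0, $C_{\mathrm{PTGDS}}$ = 0.056, $C_{\mathrm{XPO1}}$ = -0.092, $C_{\mathrm{LETMD1}}$ = -0.140, $C_{\mathrm{RAD23B}}$ = -0.045, and $C_{\mathrm{TMPRSS2\_ETV1\ FUSION}}$ = -0.137.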

miRNA expression profiling

The isolated RNA is additionally used in the Illumina Human Version 2 MicroRNA Expression Profiling kit (Illumina, Inc.; San Diego, CA) in conjunction with the DASL® assay. The miRNA expression profiling is performed according to the manufacturer's protocol. The mature miRNA sequences for the six miRNA biomarkers are shown in Table 3. The probe sequences for the six miRNA biomarkers are shown in Table 4.

Table 3: Mature miRNA Sequences for Six miRNA Biomarkers

Gene          Mature miRNA sequence

Hsa-miR-103   5'-AGCAGCATTGTACAGGGCTATGA-3' (SEQ ID NO:49)
Hsa-miR-339   5'-TCCCTGTCCTCCAGGAGCTCA-3' (SEQ ID NO:50)
Hsa-miR-183   5'-TATGGCACTGGTAGAATTCACTG-3' (SEQ ID NO:51)
Hsa-miR-182   5'-TTTGGCAATGGTAGAACTCACA-3' (SEQ ID NO:52)
Hsa-miR-136   5'-AGCTACATTGTCTGCTGGGTTTC-3' (SEQ ID NO:53)
Hsa-miR-221   5'-ACTCCATTTGTTTTGATGATGGA-3' (SEQ ID NO:54)

Table 4: Probe Sequences for Detection of Six miRNA Biomarker Genes in DASL® assay

Gene          Probe Sequence

Hsa-miR-103   5'-ACTTCGTCAGTAACGGACTCCAGTAGCGACTAGCCCGTCAGCAGCATTGTACAGGGCTA-3' (SEQ ID NO:55)
Hsa-miR-339   5'-ACTTCGTCAGTAACGGACTATACCGGCCTAAGCACTCGCACCCTGTCCTCCAGGAGCT-3' (SEQ ID NO:56)
Hsa-miR-183   5'-ACTTCGTCAGTAACGGACAATGTTGACCCGGATCTCGTCCATGGCACTGGTAGAATTCA-3' (SEQ ID NO:57)
Hsa-miR-182   5'-ACTTCGTCAGTAACGGACACTAGCCCTCGCATAGCTTGCGTTTGGCAATGGTAGAACTC-3' (SEQ ID NO:58)
Hsa-miR-136   5'-ACTTCGTCAGTAACGGACGCGCAATTCCCTCGATCTTACGCTACATTGTCTGCTGGGT-3' (SEQ ID NO:59)
Hsa-miR-221   5'-ACTTCGTCAGTAACGGACGTAGGTCCCGGACGTAATCACCACTCCATTTGTTTTGATGAT-3' (SEQ ID NO:60)

To compute a predictive miRNA score, DASL® signal levels are quantile normalized across the array, and then Z-score normalized across the samples (Z-score = (signal - average(signal))/stdev(signal)). The more positive the predictive score, the more likely the subject will recur. The more negative the score, the less likely the patient will recur.

The predictive six miRNA gene score can be calculated using the following formula:

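$$
\text{SIX miRNA SCORE} = \text{miR-103}_{\mathrm{Zscore}} + \text{miR-339}_{\mathrm{Zscore}} + \text{miR-183}_{\mathrm{Zscore}} + \text{miR-182}_{\mathrm{Zscore}} - \text{miR-136}_{\mathrm{Zscore}} - \text{miR-221}_{\mathrm{Zscore}}.
$$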
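As a brief illustration of the six-miRNA formula, the sketch below sums the Z-scores with a sign of +1 for the four miRNAs that enter positively and -1 for the two that enter negatively. The function name and input layout are hypothetical, the signals are assumed to be quantile normalized beforehand, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

# Sign of each miRNA in the six-miRNA score, following the formula in the text.
MIRNA_SIGNS = {
    "miR-103": 1, "miR-339": 1, "miR-183": 1, "miR-182": 1,
    "miR-136": -1, "miR-221": -1,
}


def six_mirna_scores(signals):
    """Return one score per sample; a more positive score suggests higher recurrence risk.

    `signals` maps each miRNA name to a list of quantile-normalized
    signal levels, one per sample (hypothetical input layout).
    """
    z = {}
    for name in MIRNA_SIGNS:
        values = signals[name]
        mu, sd = mean(values), stdev(values)  # assumption: sample stdev
        z[name] = [(v - mu) / sd for v in values]
    n_samples = len(next(iter(z.values())))
    return [
        sum(sign * z[name][i] for name, sign in MIRNA_SIGNS.items())
        for i in range(n_samples)
    ]
```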

Example 1: Identification of biomarker predictors for the progression and metastatic potential of prostate cancer.

A highly predictive set of 520 genes was determined through analysis of multiple publicly available gene expression datasets (Dhanasekaran et al., Nature 412:822-6 (2001); Lapointe et al., Proc. Natl. Acad. Sci. USA 101:811-6 (2004); LaTulippe et al., Cancer Res. 62:4499-506 (2002); Varambally et al., Cancer Cell 8:393-406 (2005)), datasets from gene expression profiling analysis of 58 prostate cancer patient samples (Liu et al., Cancer Res. 66:4011-9 (2006)), and genes involved in prostate cancer progression based on state of the art understanding of the disease (Tomlins et al., Science 310:644-8 (2005); Varambally et al., Cancer Cell 8:393-406 (2005)). The predictive set of 520 genes was optimized for performance in the cDNA-mediated annealing, selection, extension, and ligation (DASL®) assay (Illumina, Inc.; San Diego, CA). The DASL® assay is based upon multiplexed reverse transcription-polymerase chain reaction (RT-PCR) applied in a microarray format and enables the quantitation of expression of up to 1536 probes using RNA isolated from archived formalin-fixed paraffin-embedded (FFPE) tumor tissue samples in a high throughput format (Bibikova et al., Am. J. Pathol. 165:1799-807 (2004); Fan et al., Genome Res. 14:878-85 (2004)). RNA was isolated from 71 patient samples with definitive clinical outcomes and was analyzed using the DASL® assay. Based on the data from the 71 patients, a subset of fourteen protein-encoding genes was found to be capable of separating Gleason 7 subjects with and without recurrence, and thus these genes were found to be good predictors of recurrent, progressive, or metastatic prostate cancers. The fourteen protein-encoding genes included: FOXO1A, SOX9, CLNS1A, PTGDS, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, EDNRA, FRZB, HSPG2, and the TMPRSS2_ETV1 FUSION. The expression of CLNS1A, XPO1, LETMD1, RAD23B, ABCC3, APC, CHES1, FRZB, HSPG2, and TMPRSS2_ETV1 FUSION was increased in recurrent, progressive, or metastatic prostate cancers, while the expression of FOXO1A, SOX9, EDNRA, and PTGDS was decreased in recurrent, progressive, or metastatic prostate cancers.

Additionally, based on data obtained from the 71 patients using the MicroRNA Expression Profiling Panels (Illumina, Inc.; San Diego, CA) designed for the DASL® assay, six miRNA genes were found to be good predictors of recurrent, progressive, or metastatic prostate cancers. The six miRNA genes included: miR-103, miR-339, miR-183, miR-182, miR-136, and miR-221. The expression of miR-103, miR-339, miR-183, and miR-182 was increased in recurrent, progressive, or metastatic prostate cancers, while the expression of miR-136 and miR-221 was decreased in recurrent, progressive, or metastatic prostate cancers.

Example 2: Determination of novel partly linear accelerated failure time (AFT) model.

Feature selection in AFT

The accelerated failure time (AFT) model is an important tool for the analysis of censored outcome data (Cox and Oakes, Analysis of Survival Data, Chapman and Hall, London, England (1984); Kalbfleisch and Prentice, The Statistical Analysis of Failure Time Data, John Wiley, New York, NY (2002)). Classic AFT models assume that the covariate effects on the logarithm of the time-to-event are linear, in which case standard rank-based techniques for estimation and inference could be used (Jin et al., Biometrika 90:341-53 (2003)), and its extension for lasso-type regularized variable selection could be considered (Cai et al., Biometrics, In press, 2009).

Regarding these variable selection procedures, there are two unsatisfying aspects. First, it is assumed that the clinical effects are linear. Second, an unsupervised implementation of the regularized variable selection procedure can inadvertently remove clinical variables that are known to be scientifically relevant and can be measured easily in practice. While the second limitation can be addressed by tweaking the underlying estimation scheme, the first limitation remains. Alternatively, many authors ignore important clinical covariate effects when selecting important gene features; here, those important clinical covariate effects were considered. Both concerns were addressed herein in the context of AFT models.

Partly linear models

Linear regression functions are insufficient in many applications, and it is more desirable to allow for more general covariate effects. The nonparametric modeling of covariate effects is less restrictive than the parametric approach and thus is less likely to distort the underlying relationship between the failure time and the covariate. However, new challenges arise when including nonlinear covariate effects in regression models. For instance, it is known that nonparametric regression methods may encounter the so-called curse of dimensionality problem when the dimension of covariates is high, e.g., when a large number of gene biomarkers are used. The partly linear model of Engle et al. provides a useful compromise to model the effect of some covariates nonparametrically and the rest parametrically (Engle et al., J. Am. Stat. Assoc. 81:310-20 (1986)). The partly linear model is a ubiquitous concept in the statistics literature and an important tool in modern semiparametric regression (Hardle et al., Partially Linear Models, Springer, New York, NY (2000); Ruppert et al., Semiparametric Regression, Cambridge University Press, New York, NY (2002)). Specifically, for the i-th subject, let $T_i$ be a univariate endpoint of interest, and let $Z_i$ (d × 1) and $X_i$ (q × 1) denote features of interest (e.g., gene expression levels) and clinical variables, respectively. Then one partly linear model is:

$$T_i = \phi\left(X_i^{(1)}, \ldots, X_i^{(q)}\right) + \upsilon_1 Z_i^{(1)} + \cdots + \upsilon_d Z_i^{(d)} + \varepsilon_i, \qquad (1)$$

where $\upsilon = (\upsilon_1, \ldots, \upsilon_d)^T$ is a parameter vector of interest, $\phi$ is an unspecified function, and the errors $\varepsilon_i$ are i.i.d. and follow an arbitrary distribution function $F_\varepsilon$. Special cases of this model have been used in varied applications across many disciplines including econometrics, engineering, biostatistics, and epidemiology (Hardle et al., Partially Linear Models, Springer, New York, NY (2000)). Estimation and inference for when the outcomes $T_i$ may be right-censored were considered herein, in which case the observed data are $\{(\tilde{T}_i, \delta_i, Z_i, X_i)\}_{i=1}^{n}$, where $\tilde{T}_i = \min(T_i, C_i)$, $\delta_i = 1(T_i < C_i)$, $C_i$ is a random censoring event, $Z_i = (Z_i^{(1)}, \ldots, Z_i^{(d)})^T$, and $X_i = (X_i^{(1)}, \ldots, X_i^{(q)})^T$. In the context of survival analysis, $T_i$ is the log-transformed survival time, and Model (1) is referred to as a partly linear AFT model.

In the absence of censoring, the nonparametric function φ in Model (1) can be estimated using kernel methods, especially when q = 1 (Hardle et al., Partially Linear Models, Springer, New York, NY (2000) and references therein), and smoothing spline methods (Engle et al., J. Am. Stat. Assoc. 81:310-20 (1986); Heckman, J. Royal Stat. Soc. Series B 48:244-8 (1986); Green and Silverman, Nonparametric Regression and Generalized Linear Models: a Roughness Penalty Approach, Chapman and Hall, New York, NY (1994)). To extend the partly linear models in the context of AFT models, one approach is to extend the basic weighting scheme of Koul et al. (Koul et al., Annals of Stat. 9:1276-88 (1981)). This approach treats censoring like other missing data problems (Tsiatis, Ann. Statist. 18:354-72 (1990)) and inversely weights the uncensored observations by the probability of being uncensored, yielding so-called inverse-probability weighted (IPW) estimators. A close cousin of the IPW methodology is censoring unbiased transformations (Fan and Gijbels, Local Polynomial Modeling and Its Applications, Chapter 5 and references therein, Chapman and Hall, New York, NY (1996)), which effectively aim to replace each censored outcome with a suitable surrogate. After the transformation is made, complete-data estimation procedures can be applied. Both IPW kernel-type estimators and censoring unbiased transformations in the partly linear model have been studied for AFT models (Liang and Zhou, Comm. Statist. Theory Method 27:2895-907 (1998); Wang and Li, J. Multivariate Anal. 83:469-86 (2002)).
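The inverse-probability weighting idea mentioned above can be illustrated with a short Python sketch. It assumes, as is common for IPW estimators but not stated in the text, that the probability of being uncensored is estimated with a Kaplan-Meier product-limit estimator of the censoring distribution, evaluated just before each observed time. The function names and the toy data in the usage note are hypothetical.

```python
def censoring_survival_before(times, delta, t):
    """Kaplan-Meier estimate of G(t-) = P(C >= t).

    Censored observations (delta == 0) are treated as the "events"
    of the censoring distribution.
    """
    g = 1.0
    for u in sorted(set(times)):
        if u >= t:
            break
        at_risk = sum(1 for s in times if s >= u)
        censored = sum(1 for s, d in zip(times, delta) if s == u and d == 0)
        g *= 1.0 - censored / at_risk
    return g


def ipw_weights(times, delta):
    """Weight delta_i / G(T_i-) for each observation; censored ones get 0."""
    weights = []
    for t, d in zip(times, delta):
        if d == 1:
            weights.append(1.0 / censoring_survival_before(times, delta, t))
        else:
            weights.append(0.0)
    return weights
```

For example, with observed times `[1.0, 2.0, 3.0, 4.0]` and indicators `[1, 0, 1, 1]`, the two uncensored observations that follow the censored one are up-weighted to compensate for it.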

A general penalized loss function for partly linear AFT models is considered herein:

$$\min_{\upsilon,\ \phi \in \Phi} \; L_n(\phi, \upsilon) + \lambda J(\phi), \qquad (2)$$

where $L_n$ is the loss function for observed data and $J(\phi)$ imposes some type of penalty on the complexity of φ. The approach will be to replace $L_n$ with the Gehan loss function (Gehan, Biometrika 52:203-23 (1965); Jin et al., Biometrika 90:341-53 (2003)) and model φ using penalized regression splines. Variable selection and building predictive scores, as well as estimation for the regression parameter υ, are described herein. The insight into the optimization procedure used to minimize the penalized loss function (2) is due, in part, to Koenker et al., who noted that the optimization problem in quantile smoothing splines can be solved by $L_1$-type linear programming techniques. Subsequently, an interior point algorithm was proposed for the problem (Koenker et al., Biometrika 81:673-80 (1994)). Li et al. build on this idea to propose an entirely different path-finding algorithm to replace the interior point algorithm of Koenker et al. (Li et al., J. Am. Stat. Assoc. 102:255-68 (2007); Koenker et al., Biometrika 81:673-80 (1994)). In a related paper, Li and Zhu adopt a similar approach for lasso-type variable selection in quantile regression (Li and Zhu, J. Comp. Graph. Stat. 17:163-185 (2008); Tibshirani, J. Roy. Stat. Soc., Ser. B 58:267-88 (1996)). Following the work of Koenker et al. (1994), it can be readily shown that when $J(\phi)$ is taken as an $L_1$ norm, as discussed in the penalized regression spline literature, there is a close connection between the loss function (2) and the lasso-type problem; specifically, the optimization problem of (2) is essentially an $L_1$ loss plus $L_1$ penalty problem. This statement is true regardless of the algorithm, and this relation was exploited in the approach to the optimization problem. Once the basic spline framework is adopted, it was shown that the estimator can be generalized through additive models for q > 1 and variable selection in the parametric component. The additive structure of the nonparametric components is adopted to further alleviate the curse of dimensionality when q > 1 (Hastie and Tibshirani, Generalized Additive Models, Chapman and Hall, New York, NY (1990)).

Regression splines in partly linear AFT model

A simplified case for the partly linear model was first considered, where $X_i$ is assumed to be univariate, i.e., $q = 1$ and $\mathbf{X}_i \equiv X_i$, and then Model (1) reduces to:

$$T_i = \phi(X_i) + \upsilon_1 Z_i^{(1)} + \cdots + \upsilon_d Z_i^{(d)} + \varepsilon_i. \qquad (3)$$

The model in (3) agrees with the model considered by Chen et al. (Chen et al., Statistica Sinica 15:767-79 (2005)). Let $B(x) = \{B_1(x), \ldots, B_M(x)\}^T$, $M < n$, be a set of linearly independent piecewise polynomial basis functions. The piecewise polynomial model asserts that $\phi(x) = B(x)^T \beta$ for some $\beta \in \mathbb{R}^M$, and $E(\varepsilon_i) = \alpha$.

Popular bases are B-splines, natural splines, and the truncated power series basis (Ruppert et al., Semiparametric Regression, Cambridge University Press, New York, NY (2002)). The truncated power series basis of degree p without the intercept term was chosen, that is, $B(x) = \{x, \ldots, x^p, (x - \kappa_1)_+^p, \ldots, (x - \kappa_r)_+^p\}^T$, where $(\kappa_1, \ldots, \kappa_r)$ denotes the knots, $r > 1$ is the number of knots, and $(u)_+ = u\,1(u > 0)$, and hence $M = p + r$.
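A minimal Python sketch of the truncated power series basis without an intercept follows. The function names are hypothetical. The knot helper uses `statistics.quantiles` with its default method as one possible reading of "equally spaced percentiles"; the text does not specify the percentile definition, so that choice is an assumption.

```python
from statistics import quantiles


def truncated_power_basis(x, knots, degree=3):
    """Evaluate B(x) = {x, ..., x^p, (x - k_1)_+^p, ..., (x - k_r)_+^p}.

    The returned list has length M = p + r, with no intercept term.
    """
    powers = [x ** k for k in range(1, degree + 1)]
    # (u)_+ = u * 1(u > 0), then raised to the p-th power.
    hinges = [max(x - kappa, 0.0) ** degree for kappa in knots]
    return powers + hinges


def equally_spaced_percentile_knots(xs, r):
    """Return r knots at equally spaced sample quantiles of xs (assumed definition)."""
    return quantiles(xs, n=r + 1)
```

With `degree=3` this gives the cubic case, and each knot contributes one coefficient that controls the jump in the third derivative at that knot.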

Throughout, equally spaced percentiles were used as knots and p = 3 was set, i.e., cubic splines were used, unless otherwise noted. Then define $\hat{\theta}_{RS} \equiv (\hat{\beta}, \hat{\upsilon}) = \arg\min_{\beta, \upsilon} L_n(\beta, \upsilon)$, where

$$L_n(\beta, \upsilon) = n^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_i \,(e_i - e_j)^{-} \qquad (4)$$

with $e_i = \tilde{T}_i - \beta^T B(X_i) - \upsilon^T Z_i$ and $(a)^{-} = \max(-a, 0)$. Because the model (3) has been "linearized," existing rank-based estimation techniques can be applied for the usual AFT models. In particular, it is noted that the minimizer of $L_n(\beta, \upsilon)$ is also the minimizer of

$$\sum_{i=1}^{n} \sum_{j=1}^{n} \delta_i \,|e_i - e_j| + \left| \zeta - (\beta^T, \upsilon^T) \sum_{k=1}^{n} \sum_{l=1}^{n} \delta_k D_{lk} \right|$$

for a large constant ζ, where $D_{lk} = \{B(X_l)^T, Z_l^T\}^T - \{B(X_k)^T, Z_k^T\}^T$ (Jin et al., Biometrika 90:341-53 (2003)). Evidently, the minimizer of the new loss function may be viewed as the solution to the least absolute deviation (lad) regression of a pseudo response vector $V = (V_1, \ldots, V_S)^T$ (S × 1) on a pseudo design matrix $W = (W_1, \ldots, W_S)^T$ (S × (M + d)). It can readily be shown that the pseudo response vector V is of the form $\{\delta_i(\tilde{T}_i - \tilde{T}_j), \ldots, \zeta\}^T$ and the pseudo design matrix W is of the form $\{\delta_i D_{ij}, \ldots, \sum_{k} \sum_{l} \delta_k D_{lk}\}^T$, where $\delta_i(\tilde{T}_i - \tilde{T}_j)$ and $\delta_i D_{ij}^T$ go through all i and j with $\delta_i = 1$. Without loss of generality, write

$$\hat{\theta}_{RS} = \arg\min_{\theta} \sum_{s=1}^{S} \left| V_s - \theta^T W_s \right|. \qquad (5)$$

The fact that $\hat{\theta}_{RS}$ can be written as the lad regression estimate facilitates the estimation techniques for the model of interest presented below. The estimator $\hat{\theta}_{RS}$ is a regression spline estimator where, for fixed knots, its root-n consistency and asymptotic normality can be established by extending previous work for linear AFT models (Tsiatis, Ann. Stat. 18:354-72 (1990); Jin et al., Biometrika 90:341-53 (2003)).
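The relationship between the Gehan-type loss and the lad formulation can be checked numerically. The sketch below is illustrative only: it assumes the pseudo response and pseudo design matrix take the form given in Jin et al. (2003), with rows for every pair (i, j) having δ_i = 1 plus one final row carrying the large constant ζ. All function names are hypothetical, and each row of `features` is assumed to hold the stacked vector of B(X_i) and Z_i. The code evaluates both objectives at a given θ; it does not solve the lad regression.

```python
def residuals(theta, times, features):
    """e_i = T_i - theta' f_i, where f_i stacks B(X_i) and Z_i."""
    return [t - sum(a * b for a, b in zip(theta, f)) for t, f in zip(times, features)]


def gehan_loss(theta, times, delta, features):
    """n^-2 * sum_i sum_j delta_i * (e_i - e_j)^-, with (a)^- = max(-a, 0)."""
    e = residuals(theta, times, features)
    n = len(times)
    total = sum(delta[i] * max(e[j] - e[i], 0.0) for i in range(n) for j in range(n))
    return total / n ** 2


def lad_pseudo_data(times, delta, features, zeta=1e6):
    """Build the pseudo response V and pseudo design matrix W (assumed form)."""
    n, dim = len(times), len(features[0])
    V, W = [], []
    last_row = [0.0] * dim  # accumulates sum_k sum_l delta_k (f_l - f_k)
    for i in range(n):
        if delta[i] != 1:
            continue
        for j in range(n):
            V.append(times[i] - times[j])
            W.append([features[i][k] - features[j][k] for k in range(dim)])
            for k in range(dim):
                last_row[k] += features[j][k] - features[i][k]
    V.append(zeta)
    W.append(last_row)
    return V, W


def lad_objective(theta, V, W):
    """Sum of absolute deviations |V_s - theta' W_s|."""
    return sum(abs(v - sum(a * b for a, b in zip(theta, row))) for v, row in zip(V, W))
```

Under these assumptions, and provided ζ is large enough, the lad objective equals 2n² times the Gehan loss plus a constant that does not depend on θ, so both are minimized by the same θ.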

Penalized Regression Splines in Partly Linear AFT Models

When regression splines are used to model nonparametric covariate effects, it is crucial to choose the optimal number and location of knots $(\kappa_1, \ldots, \kappa_r)$. It is well known that too many knots may lead to overfitting whereas too few knots may not be sufficient to capture the non-linear effect. To conduct knot selection, various sophisticated procedures have been developed and can be used to improve the performance of the regression spline models. Among these procedures, the so-called penalized regression spline method (Eilers and Marx, Stat. Science 11:89-121 (1996); Ruppert and Carroll, Unpublished Technical Report (1997)) is particularly attractive because it is simpler to implement and often enjoys better performance. For penalized regression spline models, a large number of knots can be included and overfitting is avoided through regularization, e.g., an $L_1$ type of penalty on the regression coefficients. This approach was adopted and the penalized regression spline estimator for the partly linear AFT model was considered:

$$\hat{\theta}_{PRS}(\gamma) = \arg\min_{\beta, \upsilon} \left\{ L_n(\beta, \upsilon) + \gamma \sum_{m=p+1}^{M} |\beta_m| \right\}, \qquad (6)$$

where M = p + r and γ is a regularization parameter on the jumps in the pth derivative and is used to achieve the goal of knot selection. Using the lad loss function in (5) and a standard data augmentation technique for regularized lad regression, the penalized estimate may be found easily. Namely, define $V^{*} = (V^T, 0_r^T)^T$, $W^{*} = [W^T, (0_{r \times p}, D_r, 0_{r \times d})^T]^T$, and $D_r = \gamma I_r$, where $0_r$ is an r-vector of zeros, $0_{r \times p}$ ($0_{r \times d}$) is an r × p (r × d) matrix of zeros, and $I_r$ is an r-dimensional identity matrix. Then, $\hat{\theta}_{PRS}(\gamma)$ is found through the lad regression of $V^{*}$ on $W^{*}$. A penalized regression spline with $L_1$ penalty corresponds to a Bayesian model with double exponential or Laplace priors and is known to be able to accommodate large jumps (Ruppert and Carroll, Unpublished Technical Report (1997)).
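The data augmentation step can be sketched in Python as below. The sketch appends r pseudo observations with response 0 and design rows (0, γI_r, 0), so that the ordinary lad objective on the augmented data equals the original lad objective plus γ times the sum of the absolute values of the r knot coefficients. Function names are hypothetical, and the coefficient vector is assumed to be ordered as p polynomial terms, then r knot terms, then d linear terms.

```python
def lad_objective(theta, V, W):
    """Sum of absolute deviations |V_s - theta' W_s|."""
    return sum(abs(v - sum(a * b for a, b in zip(theta, row))) for v, row in zip(V, W))


def augment_for_knot_penalty(V, W, p, r, d, gamma):
    """Return (V*, W*) with r extra rows that encode the L1 penalty on knot terms.

    Assumed column order of W: p polynomial terms, r knot terms, d linear terms.
    """
    V_star = list(V) + [0.0] * r
    W_star = [list(row) for row in W]
    for m in range(r):
        row = [0.0] * (p + r + d)
        row[p + m] = gamma  # m-th row of (0_{r x p}, gamma * I_r, 0_{r x d})
        W_star.append(row)
    return V_star, W_star
```

Any routine that solves an unpenalized lad regression can then be applied to the augmented data without modification, which is the practical appeal of this construction.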

Variable Selection in Partly Linear AFT Models

Finally, variable selection was considered for Z = (Z T i,..., Z τ n ) τ (gene expression data) in partly linear AFT model (1) by extending the penalized regression spline estimator $> RS(Y) . Let λ = (λ ls ..., λd) be covariate-dependent regularization parameters and consider the minimizer to the convex loss function Ai

$$\min_{\beta, \upsilon} \left\{ L_n(\beta, \upsilon) + \gamma \sum_{m=p+1}^{M} |\beta_m| + \sum_{j=1}^{d} \lambda_j |\upsilon_j| \right\} \qquad (7)$$

The same data augmentation scheme used for regression splines and penalized regression splines applies to the lasso-type estimator (7) as well. Define the pseudo response vector $V^{\dagger} = (V^T, 0_{r+d}^T)^T$ and the pseudo design matrix

For fixed γ and λ, the estimate is computed as the lad regression estimate of $V^{\dagger}$ on $W^{\dagger}$. To select γ and λ, two approaches were proposed, namely cross validation (CV) and generalized cross validation (GCV). The K-fold CV approach chooses the values of γ and λ that minimize the Gehan loss function (4). The GCV approach chooses the values of γ and λ that minimize the criterion $L_n(\hat{\beta}, \hat{\upsilon}) / (1 - d_{\gamma,\lambda}/n)^2$, where $n$ is the number of observations and $d_{\gamma,\lambda}$ is the number of nonzero estimated coefficients for the basis functions B(x) and the linear predictors Z, that is, the number of nonzero estimates in $(\hat{\beta}, \hat{\upsilon})$. Note that $d_{\gamma,\lambda}$ depends on γ and λ.
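As an illustration of GCV-based tuning, the sketch below evaluates the criterion (loss divided by $(1 - d/n)^2$) over a grid of candidate fits and keeps the best one. It treats smaller values of the criterion as better, which is an assumption of this sketch, and the candidate fits, function names, and numbers are hypothetical stand-ins for models already fitted at each (γ, λ) pair.

```python
def gcv(loss, n_nonzero, n):
    """GCV criterion: loss / (1 - d/n)^2, with d the number of nonzero estimates."""
    if n_nonzero >= n:
        raise ValueError("need fewer nonzero coefficients than observations")
    return loss / (1.0 - n_nonzero / n) ** 2


def select_by_gcv(candidates, n):
    """candidates: list of (gamma, lam, loss, n_nonzero) from fitted models."""
    return min(candidates, key=lambda c: gcv(c[2], c[3], n))


# Hypothetical fits on n = 125 observations.
fits = [
    (0.1, 0.1, 40.0, 12),   # small penalty: low loss, many nonzero terms
    (0.5, 0.5, 42.0, 6),
    (2.0, 2.0, 55.0, 2),    # large penalty: high loss, few nonzero terms
]
best = select_by_gcv(fits, n=125)
```

The denominator inflates the loss of fits that keep many nonzero coefficients, so the criterion trades goodness of fit against model size without refitting on held-out folds.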

Extension to Additive Partly Linear AFT Models

In the case where q is greater than 1 in the partly linear model (1), it is well known that estimation is difficult because of the curse of dimensionality, even when q is moderate and there is no censoring. For the partly linear AFT model presented herein, an additive structure was proposed for φ to alleviate the problem, namely, an additive partly linear AFT model,

$$\varphi(X_i) = \sum_{j=1}^{q} \varphi_j(X_{ij}) \qquad (8)$$

where the $\varphi_j$ ($j = 1, \ldots, q$) are unknown functions and are estimated through regression splines. Penalized regression splines can be used for the additive partly linear model to conduct knot selection for each nonparametric effect $\varphi_j(X_j)$ ($j = 1, \ldots, q$). The variable selection for Z can also be easily extended to this additive partly linear AFT model. When q is large and it is also of interest to conduct feature selection among the q nonparametrically modeled effects, one can modify the regularization term for β in the loss functions (6) and (7); specifically, one can regularize all of β, i.e., $\gamma \sum_{m=1}^{M} |\beta_m|$, as opposed to regularizing only the terms $|\beta_m|$ that correspond to the set of jumps in the $p$th derivative. Similarly, the data augmentation scheme used to obtain the parameter estimates for these models can be modified.
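To make the additive construction concrete, the following sketch builds one design-matrix row by concatenating a spline basis for each nonparametric covariate. It assumes a truncated power basis of degree p (polynomial terms followed by one jump term per knot), which is consistent with the references to jumps in the $p$th derivative but is not stated explicitly here; the function names and knot values are hypothetical.

```python
def truncated_power_basis(x, knots, p):
    """Basis values x, ..., x^p, (x - k_1)_+^p, ..., (x - k_r)_+^p."""
    poly = [x ** k for k in range(1, p + 1)]
    jumps = [max(x - k, 0.0) ** p for k in knots]
    return poly + jumps


def additive_design_row(xs, knots_per_covariate, p):
    """Concatenate the bases of the q covariates into one design row."""
    row = []
    for x, knots in zip(xs, knots_per_covariate):
        row.extend(truncated_power_basis(x, knots, p))
    return row


# One covariate, quadratic basis with knots at 1 and 3 (M = p + r = 4 columns).
single = truncated_power_basis(2.0, [1.0, 3.0], 2)

# Two covariates, linear bases with different knot sets.
combined = additive_design_row([2.0, -1.0], [[1.0, 3.0], [0.0]], 1)
```

Each covariate contributes its own block of columns, so knot selection (penalizing only the jump columns) or feature selection among the nonparametric effects (penalizing whole blocks) amounts to choosing which columns receive the penalty rows in the augmented design.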

Example 3: Simulation Studies.

Multiple simulation studies were conducted to evaluate the operating characteristics of the methods in comparison with several other methods. All calculations were conducted in R and the models described herein were fit using the algorithms proposed above, which utilize the quantreg package in R.

Estimation

A case of a single covariate $Z_i$ and a single covariate $X_i$ in (1) was considered first, with the focus on the estimates of the regression coefficient υ and its sampling variance. In this setup, no feature selection was involved. To facilitate comparisons, the simulation study details were adapted from those given by Chen et al. (Chen et al., Statistica Sinica 15:767-79 (2005)). It was assumed that the partly linear model (1) holds with υ = 1 and $\varepsilon_i \sim N(0, \sigma^2)$ with $\sigma^2 = 1$, mutually independent of $(X_i, Z_i)$. The random variable $X_i$ was correlated with $Z_i$ through the regression relation $X_i = 0.25 Z_i + U_i$, where $U_i$ is Un(-5, 5) and completely independent of all other random variables. As in Chen et al., linear and quadratic effects were considered, $\varphi(X_i) = 2X_i$ and $\varphi(X_i) = X_i^2$, respectively. Finally, censoring random variables were simulated according to the rule $C_i = \varphi(X_i) + Z_i \upsilon + U_i^*$, where $U_i^*$ follows the uniform distribution Un(0, τ) with τ = 1.6. As a result, the proportion of censored outcomes ranges from 20% to 30%.

The estimator, the partly linear AFT model (PL-AFT) with r knots (r = 2 and 4), which was fit using the loss function (6), was compared to the stratified estimator of Chen et al. (Chen et al., Statistica Sinica 15:767-79 (2005)) ($S_K$-AFT), where K denotes the number of strata; the usual AFT model with both $X_i$ and $Z_i$ modeled parametrically (AFT); and an AFT model with the true φ plugged in (AFT-φ). Two sample sizes, n = 50 and n = 100, were considered. The estimators using the CV and GCV methods gave similar results, so the results using the GCV method, which was significantly faster than the CV method, were reported. Table 5 summarizes the mean bias, the standard deviation (SD), and the mean squared error (MSE) of the estimate of υ over 200 Monte Carlo data sets.

In all cases, the PL-AFT estimator of υ outperforms the other estimators in terms of MSE, and its performance is comparable with that of the estimator using the true φ. The number of knots has little impact on the performance of the PL-AFT estimator. The usual AFT estimator of υ exhibits the largest bias and MSE when φ is not linear, which indicates that it is important to adjust for the nonlinear effect of X even when one is interested only in the effect of Z. While the stratification step in the $S_K$-AFT method results in improved performance, it still underperforms the PL-AFT estimator. In addition, its performance can vary considerably as the number of strata changes, and it is not obvious how to choose the number of strata in practice.

Table 5: Simulation results for parameter estimation (υ) where d = 1 and υ = 1. PL-AFT, the partly linear AFT model with r knots; $S_K$-AFT, the stratified AFT estimator with K strata; AFT, the usual AFT model with both $X_i$ and $Z_i$ modeled parametrically; and AFT-φ, the AFT model with the true φ plugged in.

Feature Selection and Prediction

In the second set of simulation studies, the regression function still consisted of a nonlinear effect of a single covariate $X_i$, but the dimension of the linear effects via $Z_i$ was increased. The focus was on the simultaneous estimation and feature selection problem for $Z_i$ as well as the prediction performance when using both $Z_i$ and $X_i$. The true regression coefficients were set at υ = (1, 1, 0, 0, 0, 1, 0, 0)', which corresponds to a strong signal or effect size, and υ = (0.5, 0.5, 0, 0, 0, 0.5, 0, 0)', which corresponds to a weak signal or effect size. The error $\varepsilon_i$ follows $N(0, \sigma^2)$ with $\sigma^2 = 1$ and is mutually independent of $(X_i, Z_i)$. The predictors $Z_i$ followed a standard normal distribution with the correlation between the $j$th and $k$th components of $Z_i$ equal to ρ, and ρ = 0, 0.5, and 0.9 were considered in the simulation. The random variable $X_i$ was correlated with $Z_i$ through the relation $X_i = 0.5 Z_{1i} + 0.5 Z_{2i} + 0.5 Z_{3i} + U_i$, where $U_i$ is Un(-1, 1) and completely independent of all other random variables. The nonlinear effect $\varphi(X_i) = (0.2 X_i + 0.5 X_i^2 + 0.15 X_i^3) I(X_i > 0) + (0.05 X_i) I(X_i < 0)$ was considered, where I() is the indicator function. This setup mimics a practical setting where there is little change in the outcome variable when the clinical variable (X) is less than a threshold level (X = 0), but as X increases past the threshold level, the outcome variable increases at a considerably higher rate. The censoring random variable was simulated according to the rule $C_i = \varphi(X_i) + Z_i^T \upsilon + U_i^*$, where $U_i^*$ follows the uniform distribution Un(0, τ) with τ = 6. The resulting proportion of censoring ranges from 20% to 30%.
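The data-generating mechanism described above can be sketched as follows. This is an illustrative sketch under several assumptions that are not confirmed by the text: the exponents 2 and 3 in the nonlinear effect, an equicorrelated construction for Z (every pair of components has correlation ρ), and a log-scale failure time of the form φ(X) + Z'υ + ε for model (1). The function and variable names are hypothetical.

```python
import math
import random


def phi(x):
    """Threshold-type nonlinear effect; the exponents 2 and 3 are assumed."""
    if x > 0:
        return 0.2 * x + 0.5 * x ** 2 + 0.15 * x ** 3
    if x < 0:
        return 0.05 * x
    return 0.0


def simulate(n, upsilon, rho, tau=6.0, seed=0):
    """Return a list of (observed_time, event_indicator, x, z) tuples."""
    rng = random.Random(seed)
    d = len(upsilon)
    data = []
    for _ in range(n):
        # Equicorrelated standard normals: corr(Z_j, Z_k) = rho for j != k.
        shared = rng.gauss(0.0, 1.0)
        z = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
             for _ in range(d)]
        x = 0.5 * (z[0] + z[1] + z[2]) + rng.uniform(-1.0, 1.0)
        linear = sum(u * zj for u, zj in zip(upsilon, z))
        t = phi(x) + linear + rng.gauss(0.0, 1.0)      # log-scale failure time
        c = phi(x) + linear + rng.uniform(0.0, tau)    # censoring time
        data.append((min(t, c), int(t <= c), x, z))
    return data


# Strong-signal setting with moderate correlation.
train = simulate(125, upsilon=[1, 1, 0, 0, 0, 1, 0, 0], rho=0.5)
test_set = simulate(1250, upsilon=[1, 1, 0, 0, 0, 1, 0, 0], rho=0.5, seed=1)
```

Each record stores the observed time min(T, C) together with an event indicator, which is the form of input that rank-based AFT estimation works with.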

Five models were compared: (1) the Lasso partly linear AFT model (Lasso-PL) with r = 6, which was fit using the loss function (7); (2) the Lasso stratified model (Lasso-$S_K$), where K denotes the number of strata, which was an extension combining the stratified model in Chen et al. (Chen et al., Statistica Sinica 15:767-79 (2005)) with Lasso; (3) the Lasso linear AFT model assuming a linear effect for both $X_i$ and $Z_i$ (Lasso-L); (4) the usual AFT parametric model assuming a linear effect for both, without regularization (AFT); and (5) the so-called oracle partly linear model (Oracle) with $\upsilon_3$, $\upsilon_4$, $\upsilon_5$, $\upsilon_7$, and $\upsilon_8$ fixed at 0 and r = 6 for the penalized splines. The oracle model, while unavailable in practice, may serve as the optimal benchmark for the purpose of comparisons. In each instance of lasso, the GCV method was used to tune the regularization parameters, λ and γ.

In each simulation run, a training data set of size n = 125 was generated to estimate the parameters of interest, and a testing data set of size 10n was generated to evaluate the prediction performance of the partly linear model. To evaluate the performance of parameter estimation, the sum squared error (SSE) of the estimate of υ, defined as $(\hat{\upsilon} - \upsilon)^T (\hat{\upsilon} - \upsilon)$, was monitored. To evaluate the performance of model selection, the proportion of correct zeroes, $P_C \equiv \sum_{j=1}^{d} I(\hat{\upsilon}_j = 0) I(\upsilon_j = 0) / \sum_{j=1}^{d} I(\upsilon_j = 0)$, and the proportion of incorrect zeroes, $P_I \equiv \sum_{j=1}^{d} I(\hat{\upsilon}_j = 0) I(\upsilon_j \neq 0) / \sum_{j=1}^{d} I(\upsilon_j \neq 0)$, were monitored. To assess the prediction performance of the partly linear model, two mean squared prediction errors, $\mathrm{MSPE}_1 = \sum_{j=1}^{10n} [\hat{\varphi}(X_j) - \varphi(X_j) + (\hat{\upsilon} - \upsilon)^T Z_j]^2$ and $\mathrm{MSPE}_2 = \sum_{j=1}^{10n} [(\hat{\upsilon} - \upsilon)^T Z_j]^2$, were computed, where j runs through the observations in the testing data set. $\mathrm{MSPE}_1$ is the squared prediction error using both the nonparametric and parametric components in (1), and $\mathrm{MSPE}_2$ is the squared prediction error using only the parametric components in (1), both of which are of potential interest in practice. Note that the stratified Lasso model does not provide an estimate of the effect of X; therefore $\mathrm{MSPE}_1$ is not applicable for Lasso-$S_K$. For each simulation setting, these measures were averaged over 200 Monte Carlo data sets.
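The evaluation measures can be written compactly in code. The sketch below is one plausible reading of the definitions above: an estimate is counted as zero when its absolute value falls below a small tolerance (the tolerance is an assumption, since numerical lad solutions are rarely exactly zero), and the prediction errors are summed over the test rows without further scaling. All names and example numbers are hypothetical.

```python
def sse(est, true):
    """Sum squared error of the coefficient estimates."""
    return sum((e - t) ** 2 for e, t in zip(est, true))


def zero_proportions(est, true, tol=1e-8):
    """Return (P_C, P_I): shares of true zeros and true nonzeros set to zero."""
    est_zero = [abs(e) <= tol for e in est]
    n_zero = sum(1 for t in true if t == 0)
    n_nonzero = len(true) - n_zero
    correct = sum(1 for ez, t in zip(est_zero, true) if ez and t == 0)
    incorrect = sum(1 for ez, t in zip(est_zero, true) if ez and t != 0)
    p_c = correct / n_zero if n_zero else float("nan")
    p_i = incorrect / n_nonzero if n_nonzero else float("nan")
    return p_c, p_i


def mspe(est, true, Z, X=None, phi_hat=None, phi_true=None):
    """Squared prediction error over test rows.

    With X and both phi functions supplied this corresponds to MSPE_1;
    with only Z it corresponds to MSPE_2.
    """
    total = 0.0
    for j, z in enumerate(Z):
        err = sum((e - t) * zj for e, t, zj in zip(est, true, z))
        if phi_hat is not None:
            err += phi_hat(X[j]) - phi_true(X[j])
        total += err ** 2
    return total


true = [1, 1, 0, 0, 0, 1, 0, 0]
est = [0.9, 1.1, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0]
p_c, p_i = zero_proportions(est, true)
Z_test = [[1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0]]
mspe2 = mspe(est, true, Z_test)
```

In this toy case four of the five true zeros are recovered and one of the three true signals is wrongly dropped, which is the kind of trade-off that $P_C$ and $P_I$ are meant to expose.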

The simulation results are summarized in Table 6. For the Lasso stratified model, K = 2, 4, 8, 10, and 25 were considered. In all cases, K = 4 provides the best results; hence Table 6 only presents the results for K = 2, 4, and 8.

Table 6: Simulation results for feature selection and prediction where n = 125. Lasso-PL, the Lasso partly linear AFT model with r = 6 knots; Lasso-$S_K$, the Lasso stratified model with K strata; Lasso-L, the Lasso linear AFT model assuming a linear effect for both $X_i$ and $Z_i$; AFT, the usual AFT parametric model assuming a linear effect for both $X_i$ and $Z_i$ without regularization; Oracle, the so-called oracle partly linear model with zero coefficients being set to 0 and r = 6 for the penalized spline. $P_C$, the proportion of correct zero estimates; $P_I$, the proportion of incorrect zero estimates.

The simulations showed that the performance of the usual rank-based AFT estimator was not satisfactory in terms of either prediction or feature selection. In all cases, the Lasso partly linear AFT estimator exhibited substantially smaller SSE, $\mathrm{MSPE}_1$, and $\mathrm{MSPE}_2$ than the other Lasso estimators, and its $\mathrm{MSPE}_1$ was comparable to that of the Oracle estimator, whereas the Lasso stratified estimators exhibited the worst performance. In terms of feature selection, both the method described herein and the Lasso linear AFT method correctly identified the majority of the zero regression coefficients ($P_C$); the method described herein outperformed the Lasso linear AFT method when ρ = 0 or 0.5, and their performances were comparable when the correlation became extremely high (ρ = 0.9). Note that as ρ increases, so does the correlation between X and Z. By comparison, the Lasso stratified estimators identified less than 30% of the true zeros in some cases and roughly half of the true zeros in the remaining cases. When there was no correlation and the signal was strong, all Lasso estimators successfully avoided setting nonzero coefficients to zero ($P_I$ = 0). However, as the correlation got stronger and the signal became weaker, $P_I$ increased for all estimators; in particular, $P_I$ became appreciable for the Lasso linear AFT estimator when ρ = 0.9, whereas it remained moderate for the estimator described herein as well as for some of the Lasso stratified estimators.

To summarize, the lasso partly linear AFT model achieved better performance in all three areas: estimation, feature selection, and prediction. While the lasso stratified estimator performed reasonably well in estimation, its performance in feature selection and prediction was not satisfactory. When the effect of X is nonlinear, the performance of the Lasso linear AFT model deteriorates, and the deterioration can be substantial when prediction is of interest.

Example 4: Prostate Cancer Study using novel partly linear accelerated failure time (AFT) model.

The methods described above were used to analyze data from a prostate cancer study. Data from 83 patients were used in this analysis. The outcome of interest is time to prostate cancer recurrence, which begins on the surgery date of the prostatectomy and is subject to censoring; the observed survival time ranges from 2 months to 160 months and the censoring rate is 62.6%. In the data analysis, the log-transformed survival time was used to fit AFT models. Gene expression data were measured using 1536 probes from samples collected at baseline, i.e., right after the surgery. In addition, two clinical variables, PSA (prostate specific antigen) and total Gleason score, are of particular interest in this study and were measured for all subjects at baseline. The total Gleason score takes only integer values from 5 to 9, and 89% of patients have a total Gleason score of either 6 or 7; combining this with suggestions from the investigators, the total Gleason score was dichotomized as > 6 or not. Before the data analysis, all gene expression measurements were preprocessed by the investigators and then standardized to have mean 0 and unit standard deviation. In the literature, it is of interest to study both the gene data and probe data (Nakagawa et al., (2008)), and it can be argued that in some cases it is more important to examine the probe data. Cox PH models were first fit for each individual probe, and the probes were ranked according to their score test statistics from the largest (j = 1) to the smallest (j = 1536). Subsequently, feature selection was conducted while adjusting for the nonlinear effect of PSA. To fit the models of interest, the top d = 25 probes were selected to fit the lasso partly linear AFT model.

Feature Selection

Four models were used to conduct feature selection: the Lasso-PL with r = 8, the Lasso-$S_K$, the Lasso-L, and the usual linear AFT without regularization. In the Lasso-PL model, $X_i$ in model (3) is PSA, which is modeled using penalized splines, and Z includes both 1536 probes and the binary clinical variable Gleason score. Similarly, in the Lasso stratified model, the stratification is based on PSA.

The results are summarized in Table 7. A linear effect of PSA was included in the Lasso linear AFT model and was estimated to be nonzero (-0.132), which further justifies the inclusion of PSA in the final model. On the other hand, the clinical variable total Gleason score was not selected by any of the methods. Figure 5 shows the estimated effect of PSA using the Lasso partly linear AFT model, and it is evident that its effect is nonlinear. Specifically, the primary endpoint initially decreases as the PSA value increases and then starts to increase slightly at about PSA = 11. After further examination of the data, the mean of PSA was found to be 8.20 (SD = 4.13) with a range of 0-32; most patients had PSA values ranging from 0-15.2, but five had PSA values between 18-32, all of which had censored outcomes. As a result, the tail was suspected to be an artifact of the data. Additional analysis was conducted with the 5 outliers removed. While the results for feature selection remained the same, the estimate of φ became flatter towards the right tail of the curve, which indicates that the effect of PSA levels off after PSA becomes greater than 11.

Table 7: Feature selection for the prostate cancer study, n = 83.

As to feature selection, the majority of the top 25 probes were not selected by the methods considered herein. The results were similar across methods, but the methods did select somewhat different sets of probes. Of the stratified Lasso estimators, Lasso-$S_4$ seemed to provide the results most consistent with the Lasso-PL method, which agrees with the finding in the simulations that Lasso-$S_4$ achieved better performance than the other K values. Among the probes picked by the Lasso-PL method, probes 1, 7, 15, and 17 were also selected by Lasso-$S_4$ and Lasso-L, whereas probe 2 was not selected by the Lasso-$S_4$ method and probe 10 was not selected by the Lasso-L method. In addition, the Lasso-L method also selected probes 9 and 13. This observation agreed with the simulation results, i.e., when the correlation was moderate, the Lasso-L method tended to select a larger number of incorrect features; the difference between the Lasso-PL method and the Lasso-L method was attributed to the nonlinear effects of PSA.

A sensitivity analysis was conducted to examine the impact of d, in which the feature selection procedures were repeated using different numbers of top probes (d = 20, 30, and 35). The results of the feature selection are summarized in Table 8, which uses the set of probes selected for d = 25 as the reference set. When d = 20, 25, and 30, the Lasso-PL model selected exactly the same subset of probes; when d = 35, the Lasso-PL model dropped probe 2 and selected probe 32. For all values of d, the estimated nonlinear effect of PSA using the Lasso-PL model was almost identical. While the Lasso-$S_4$ and Lasso-$S_8$ models selected different sets of features compared to the Lasso-PL model, they were also insensitive to the value of d. By comparison, the Lasso-L and Lasso-$S_2$ models seemed to be considerably more sensitive to the value of d. In particular, when d = 20, the Lasso-L method selected a significantly larger number of probes, which seemed to indicate that the impact of the misspecified linear effect of PSA was substantial in terms of the feature selection, especially when a small number of probes were used.

Table 8: Sensitivity analysis for feature selection in the prostate cancer study using the top d probes. The set of probes selected by Lasso-PL with d = 25 is considered the reference set, {1, 2, 7, 10, 15, 17}. +, probes not in the reference set that are selected by a method; -, probes in the reference set that are not selected by a method.

Prediction Performance

To internally evaluate the prediction performance of the models of interest, the approach used in Cai et al. (Cai et al., Biometrics, In Press (2009)) was followed. The data were randomly split into a training sample (70%) and a validation sample (30%). The models were fit using the training sample and were then used to predict the risk of failure for subjects in the validation sample. The subjects were classified as high or low risk based on whether the predicted risk exceeded the median risk. Subsequently, a log-rank test was conducted in the validation sample comparing the two risk groups. This procedure was repeated 1000 times, and the prediction performance was evaluated using the resulting p-values. The models used in the data analysis were compared, namely, the Lasso-PL with r = 8, the Lasso-L, and the usual linear AFT, all of which use both clinical variables and gene expression data. Note that the Lasso-$S_K$ cannot be used for prediction. Two other models that use only the clinical variables were also considered, namely, a partly linear AFT model that models PSA nonparametrically and a linear AFT model.
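The repeated split-and-test procedure can be sketched as follows. This is a simplified illustration and not the code used in the study: `fit` and `predict` are hypothetical hooks standing in for any of the models above, the median is taken over the predicted risks in the validation sample (an assumption, since the text does not say which median is used), and the log-rank p-value uses the usual chi-square approximation with one degree of freedom.

```python
import math
import random
import statistics


def logrank_pvalue(times, events, groups):
    """Two-group log-rank test; groups are 0/1 and events are 1 for failure."""
    obs_minus_exp = 0.0
    var = 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = [(g, ti, e) for ti, e, g in zip(times, events, groups) if ti >= t]
        n = len(at_risk)
        n1 = sum(1 for g, _, _ in at_risk if g == 1)
        deaths = sum(1 for _, ti, e in at_risk if ti == t and e)
        d1 = sum(1 for g, ti, e in at_risk if ti == t and e and g == 1)
        obs_minus_exp += d1 - deaths * n1 / n
        if n > 1:
            var += deaths * (n1 / n) * (1 - n1 / n) * (n - deaths) / (n - 1)
    if var == 0:
        return 1.0
    chi2 = obs_minus_exp ** 2 / var
    return math.erfc(math.sqrt(chi2 / 2.0))   # upper tail of chi-square, 1 df


def one_split_pvalue(records, fit, predict, rng, train_frac=0.7):
    """records: list of (time, event, covariates); returns one log-rank p-value."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(train_frac * len(shuffled))
    train, valid = shuffled[:cut], shuffled[cut:]
    model = fit(train)
    risks = [predict(model, cov) for _, _, cov in valid]
    cutoff = statistics.median(risks)
    groups = [int(r > cutoff) for r in risks]   # 1 = high risk
    times = [t for t, _, _ in valid]
    events = [e for _, e, _ in valid]
    return logrank_pvalue(times, events, groups)


# Toy data and trivial hooks, for illustration only.
rng = random.Random(0)
records = [(float(i + 1), 1, float(-i)) for i in range(20)]
pvals = [one_split_pvalue(records, lambda train: None, lambda m, cov: cov, rng)
         for _ in range(50)]
share_significant = sum(p < 0.05 for p in pvals) / len(pvals)
```

Summarizing the p-values by the share falling below 0.05, or by their empirical distribution function, gives a single picture of how reliably a model separates high-risk from low-risk subjects across random splits.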

The results were visualized using the cumulative distribution function of the p-values from the log-rank tests in Figure 6. In Figure 6, the larger the area under the curve, the better the performance of the method. In addition, the proportion of p-values less than 0.05 was computed for predictions based on three models that use all data, namely, the Lasso-PL with r = 8 (45.3%), the Lasso-L (43.0%), and the usual linear AFT (12.6%), as well as two models that use only clinical variables, namely, a partly linear AFT model (26.3%) and a linear AFT model (19.2%). The results show that it is important to correctly specify the nonlinear effect of PSA when prediction is of interest. In the absence of gene expression data, the partly linear AFT model (D in Figure 6) achieved considerably better performance than the linear AFT model (E in Figure 6). After adjusting for gene expression data, while the prediction performance of the lasso partly linear AFT model (A in Figure 6) was still slightly better than that of the lasso linear AFT model (B in Figure 6), the improvement diminished. A possible explanation is that the gene expression data were potentially correlated with the PSA level, and consequently the addition of gene expression data was able to offset the impact of the misspecified linear effect of PSA, especially when the prediction performance was evaluated based on the dichotomized risk scores.

The results provide the answer to the research question of interest, whether the addition of gene expression data improved the prediction performance of the resultant risk scores. If the appropriate models were used (e.g., the Lasso partly linear AFT model), the prediction performance improved substantially when the gene expression data were added (A in Figure 6) compared to the AFT models without using the gene expression data (D and E in Figure 6). However, if an inappropriate model was used, the gain in prediction performance was not realized. Specifically, the linear AFT model without regularization that used both clinical and gene expression data (C in Figure 6) underperformed the AFT models that used only clinical variables (D and E in Figure 6).