

Title:
DIAGNOSTIC METHOD
Document Type and Number:
WIPO Patent Application WO/2020/079407
Kind Code:
A1
Abstract:
The disclosure relates to a method for diagnosis of prostate cancer (PCa) by determining methylation of selected genomic sites within the human genome; probes useful in determining methylation and kits comprising the probes and components useful in conducting the method.

Inventors:
PELLACANI DAVIDE (CA)
MAITLAND NORMAN (GB)
Application Number:
PCT/GB2019/052914
Publication Date:
April 23, 2020
Filing Date:
October 14, 2019
Assignee:
UNIV YORK (GB)
International Classes:
C12Q1/6886
Domestic Patent References:
WO2016102674A1 2016-06-30
WO2013185779A2 2013-12-19
WO2017143296A2 2017-08-24
WO2014160829A2 2014-10-02
WO2012138609A2 2012-10-11
WO2013140161A1 2013-09-26
WO2008143896A1 2008-11-27
Foreign References:
US20140303002A1 2014-10-09
Other References:
MASSIE CE, MILLS IG, LYNCH AG: "The importance of DNA methylation in prostate cancer development", J STEROID BIOCHEM MOL BIOL., vol. 166, February 2017 (2017-02-01), pages 1 - 15, XP029865488, DOI: 10.1016/j.jsbmb.2016.04.009
ZHANG D, PARK D, ZHONG Y, LU Y, RYCAJ K, GONG S ET AL.: "Stem cell and neurogenic gene-expression profiles link prostate basal cells to aggressive prostate cancer", NAT COMMUN., vol. 7, 29 February 2016 (2016-02-29), pages 10798
PELLACANI D, KESTORAS D, DROOP A, FRAME FM, BERRY PA, LAWRENCE MG ET AL.: "DNA hypermethylation in prostate cancer is a consequence of aberrant epithelial differentiation and hyperproliferation", CELL DEATH DIFFER, vol. 21, no. 5, May 2014 (2014-05-01), pages 761 - 73
FRAME FM, PELLACANI D, COLLINS AT, MAITLAND NJ: "Harvesting Human Prostate Tissue Material and Culturing Primary Prostate Epithelial Cells", METHODS MOL BIOL., vol. 1443, 2016, pages 181 - 201
AKALIN A, KORMAKSSON M, LI S, GARRETT-BAKELMAN FE, FIGUEROA ME, MELNICK A ET AL.: "methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles", GENOME BIOL., vol. 13, no. 10, 3 October 2012 (2012-10-03), pages R87, XP021108527, DOI: 10.1186/gb-2012-13-10-r87
POHL A, BEATO M: "bwtool: a tool for bigWig files", BIOINFORMATICS, vol. 30, no. 11, 1 June 2014 (2014-06-01), pages 1618 - 9
MCLEAN CY, BRISTOR D, HILLER M, CLARKE SL, SCHAAR BT, LOWE CB ET AL.: "GREAT improves functional interpretation of cis-regulatory regions", NAT BIOTECHNOL., vol. 28, no. 5, May 2010 (2010-05-01), pages 495 - 501
CANCER GENOME ATLAS RESEARCH NETWORK: "The Molecular Taxonomy of Primary Prostate Cancer", CELL, vol. 163, no. 4, 5 November 2015 (2015-11-05), pages 1011 - 25
YU YP, DING Y, CHEN R, LIAO SG, REN B-G, MICHALOPOULOS A ET AL.: "Whole-genome methylation sequencing reveals distinct impact of differential methylations on gene transcription in prostate cancer", AM J PATHOL., vol. 183, no. 6, December 2013 (2013-12-01), pages 1960 - 70
HEINTZMAN ND, HON GC, HAWKINS RD, KHERADPOUR P, STARK A, HARP LF ET AL.: "Histone modifications at human enhancers reflect global cell-type-specific gene expression", NATURE, vol. 459, no. 7243, 7 May 2009 (2009-05-07), pages 108 - 12, XP055308084, DOI: 10.1038/nature07829
HEINZ S, ROMANOSKI CE, BENNER C, GLASS CK: "The selection and function of cell type-specific enhancers", NAT REV MOL CELL BIOL., vol. 16, no. 3, March 2015 (2015-03-01), pages 144 - 54
MUNDBJERG K, CHOPRA S, ALEMOZAFFAR M, DUYMICH C, LAKSHMINARASIMHAN R, NICHOLS PW ET AL.: "Identifying aggressive prostate cancer foci using a DNA methylation classifier", GENOME BIOL. BIOMED CENTRAL, vol. 18, no. 1, 12 January 2017 (2017-01-12), pages 3
TANG Y, JIANG S, GU Y, LI W, MO Z, HUANG Y ET AL.: "Promoter DNA methylation analysis reveals a combined diagnosis of CpG-based biomarker for prostate cancer", ONCOTARGET. IMPACT JOURNALS, vol. 8, no. 35, 29 August 2017 (2017-08-29), pages 58199 - 209
XI Y, LI W: "BSMAP: whole genome bisulfite sequence MAPping program", BMC BIOINFORMATICS, vol. 10, 2009, pages 232, XP021055678, DOI: 10.1186/1471-2105-10-232
LI H, HANDSAKER B, WYSOKER A, FENNELL T, RUAN J, HOMER N ET AL.: "The Sequence Alignment/Map format and SAMtools", BIOINFORMATICS, vol. 25, no. 16, 15 August 2009 (2009-08-15), pages 2078 - 9, XP055229864, DOI: 10.1093/bioinformatics/btp352
WANG H-Q, TUOMINEN LK, TSAI C-J: "SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures", BIOINFORMATICS, vol. 27, no. 2, 15 January 2011 (2011-01-15), pages 225 - 31
QUINLAN AR, HALL IM: "BEDTools: a flexible suite of utilities for comparing genomic features", BIOINFORMATICS, vol. 26, no. 6, 15 March 2010 (2010-03-15), pages 841 - 2, XP055307411, DOI: 10.1093/bioinformatics/btq033
THURMAN RE, RYNES E, HUMBERT R, VIERSTRA J, MAURANO MT, HAUGEN E ET AL.: "The accessible chromatin landscape of the human genome", NATURE, vol. 489, no. 7414, 29 August 2012 (2012-08-29), pages 75 - 82, XP055269699, DOI: 10.1038/nature11232
ZHAO S, GEYBELS MS, LEONARDSON A, RUBICZ R, KOLB S, YAN Q ET AL.: "Epigenome-Wide Tumor DNA Methylation Profiling Identifies Novel Prognostic Biomarkers of Metastatic-Lethal Progression in Men Diagnosed with Clinically Localized Prostate Cancer", CLIN CANCER RES. AMERICAN ASSOCIATION FOR CANCER RESEARCH, vol. 23, no. 1, 1 January 2017 (2017-01-01), pages 311 - 9
GEYBELS MS, ZHAO S, WONG C-J, BIBIKOVA M, KLOTZLE B, WU M ET AL.: "Epigenomic profiling of DNA methylation in paired prostate cancer versus adjacent benign tissue", PROSTATE, vol. 75, no. 16, December 2015 (2015-12-01), pages 1941 - 50, XP055644568, DOI: 10.1002/pros.23093
GEYBELS MS, WRIGHT JL, BIBIKOVA M, KLOTZLE B, FAN J-B, ZHAO S ET AL.: "Epigenetic signature of Gleason score and prostate cancer recurrence after radical prostatectomy", CLIN EPIGENETICS. BIOMED CENTRAL, vol. 8, no. 1, 2016, pages 97
Attorney, Agent or Firm:
SYMBIOSIS IP LIMITED (GB)
Claims:
Claims

1. A diagnostic method to determine the methylation status of genomic regions isolated from a human subject that has prostate cancer comprising the steps:

i) obtaining an isolated biological sample from a subject and extracting genomic DNA to provide an isolated sample of genomic DNA;

ii) determining the methylation status of one or more CpG dinucleotides of said genomic DNA comprising or consisting of (a) the nucleotide sequence set forth in SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, SEQ ID NO: 29, SEQ ID NO: 30, SEQ ID NO: 31, SEQ ID NO: 32, SEQ ID NO: 33, SEQ ID NO: 34, SEQ ID NO: 35, SEQ ID NO: 36, SEQ ID NO: 37, SEQ ID NO: 38, SEQ ID NO: 39, SEQ ID NO: 40, SEQ ID NO: 41, SEQ ID NO: 42, SEQ ID NO: 43, SEQ ID NO: 44, SEQ ID NO: 45, SEQ ID NO: 46, SEQ ID NO: 47 and SEQ ID NO: 48 and wherein said region can be extended 250bp upstream or downstream of said region, or (b) a polymorphic sequence variant that has at least 90% sequence identity over the full length of the nucleotide sequences set forth in SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, SEQ ID NO: 29, SEQ ID NO: 30, SEQ ID NO: 31, SEQ ID NO: 32, SEQ ID NO: 33, SEQ ID NO: 34, SEQ ID NO: 35, SEQ ID NO: 36, SEQ ID NO: 37, SEQ ID NO: 38, SEQ ID NO: 39, SEQ ID NO: 40, SEQ ID NO: 41, SEQ ID NO: 42, SEQ ID NO: 43, SEQ ID NO: 44, SEQ ID NO: 45, SEQ ID NO: 46, SEQ ID NO: 47 and SEQ ID NO: 48; wherein said region can be extended 250bp upstream or downstream of said region, as a measure of whether the prostate cancer is aggressive or non-aggressive.

2. The method according to claim 1 wherein said genomic DNA region comprises or consists of genomic regions set forth in SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, SEQ ID NO: 29, SEQ ID NO: 30, SEQ ID NO: 31, SEQ ID NO: 32, SEQ ID NO: 33, SEQ ID NO: 34, SEQ ID NO: 35, SEQ ID NO: 36, SEQ ID NO: 37, SEQ ID NO: 38, SEQ ID NO: 39, SEQ ID NO: 40, SEQ ID NO: 41, SEQ ID NO: 42, SEQ ID NO: 43, SEQ ID NO: 44, SEQ ID NO: 45, SEQ ID NO: 46, SEQ ID NO: 47 and SEQ ID NO: 48.

3. The method according to claim 1 or 2 wherein said methylation status of genomic DNA is hypermethylation wherein hypermethylation is typically associated with down regulation of the gene associated with the genomic region.

4. The method of claim 3 wherein said genomic DNA region is hypermethylated in one or more CpG dinucleotides in genomic regions set forth in SEQ ID NO: 26, SEQ ID NO: 28, SEQ ID NO: 29, SEQ ID NO: 31, SEQ ID NO: 33, SEQ ID NO: 34, SEQ ID NO: 36, SEQ ID NO: 37, SEQ ID NO: 38, SEQ ID NO: 40, SEQ ID NO: 42, SEQ ID NO: 46 and SEQ ID NO: 48.

5. The method according to claim 1 or 2 wherein said methylation status of genomic DNA is hypomethylation wherein hypomethylation is typically associated with up-regulation of the gene associated with the genomic region.

6. The method of claim 5 wherein said genomic DNA region is hypomethylated in one or more CpG dinucleotides in genomic regions set forth in SEQ ID NO: 27, SEQ ID NO: 30, SEQ ID NO: 32, SEQ ID NO: 35, SEQ ID NO: 39, SEQ ID NO: 41, SEQ ID NO: 43, SEQ ID NO: 44, SEQ ID NO: 45 and SEQ ID NO: 47.

7. A diagnostic method to determine the methylation status of genomic regions of a human subject having prostate cancer comprising the steps:

i) obtaining a biological sample from a subject and extracting genomic DNA to provide an isolated sample comprising genomic DNA;

ii) determining the methylation status of genomic DNA by interrogating the methylation status associated with the probes cg00066748, cg01285435, cg03122624, cg03731646, cg04301614, cg04804717, cg05977462, cg06740893, cg07076509, cg09433131, cg12010198, cg15830431, cg17737967, cg18020065, cg18825594, cg20595634, cg20811788, cg21155609, cg21265647, cg21914290, cg22755142, cg23986375 and cg24348240;

as a measure of whether the prostate cancer is aggressive or non-aggressive.

8. A diagnostic method for determining if a subject has aggressive or non-aggressive prostate cancer comprising:

i) providing an isolated biological sample to be tested and preparing cDNA;

ii) forming a preparation comprising said cDNA and oligonucleotide primer pairs adapted to anneal to a nucleic acid molecule comprising a nucleotide sequence encoding: TDRP, VPS28, EFEMP1, LOC10013165, C4orf19, UNCX, CUX2, MIR4632, BARX1, KCNB2, ZNF251, STK3, MAN2B1, RASA3, KLF11, NKX6-2, MAPK11, FAM167B, SLITRK1, PRRX1, MYCNOS, C8orf31 and SH3BGRL3, a thermostable DNA polymerase, deoxynucleotide triphosphates and co-factors;

iii) providing polymerase chain reaction conditions sufficient to amplify all or part of said nucleic acid molecule(s);

iv) analyzing the amplified products of said polymerase chain reaction for the presence and/or level of said nucleic acid molecule encoding said polypeptide as a measure of gene expression; and

v) comparing the amplified product with a normal matched control.

9. The diagnostic method according to any one of claims 1 to 8 wherein said biological sample is selected from the group consisting of: urine, seminal fluid, blood, lymph fluid, or prostate tissue.

10. The diagnostic method according to claim 9 wherein the isolated biological sample is a tissue biopsy.

11. The diagnostic method according to claim 10 wherein said biopsy is a prostate tissue biopsy.

12. A kit comprising primers or probes designed to the nucleic acids encoding polypeptides selected from the group consisting of: TDRP, VPS28, EFEMP1, LOC10013165, C4orf19, UNCX, CUX2, MIR4632, BARX1, KCNB2, ZNF251, STK3, MAN2B1, RASA3, KLF11, NKX6-2, MAPK11, FAM167B, SLITRK1, PRRX1, MYCNOS, C8orf31 and SH3BGRL3.

13. A kit according to claim 12 wherein said kit further comprises DNA polymerase, deoxynucleotide triphosphates and co-factors.

Description:
DIAGNOSTIC METHOD

Field of the Disclosure

The disclosure relates to a method for diagnosis of prostate adenocarcinoma (PCa) by determining methylation of selected genomic sites within the human genome; probes useful in determining methylation and kits comprising the probes and components useful in conducting the method are also disclosed.

Background to the Disclosure

The prostate gland is the major accessory organ of the male reproductive tract and is the most common site of cancer in men. The two main pathologies of the gland are: benign prostatic hyperplasia, which is a non-malignant condition that is common with age; and prostate adenocarcinoma (PCa), which is the second most common cause of death in European men after lung cancer and is increasingly prevalent in our ageing Western society. Symptoms include blood in the semen or the urine, and frequent pain or stiffness in the lower back, hips or upper thigh. PCa may be primary (i.e. located in the organ of origin) or secondary (i.e. tumours which form in other organs due to the ability of cancerous cells to move and invade other tissues via the circulatory system). The diagnosis of PCa is based on a combination of tests including digital rectal examination, serum PSA and prostate biopsy. This approach can lead to the overdiagnosis of indolent cancer and delays in the diagnosis of significant disease. The latter are of most clinical concern.

PCa can vary from relatively harmless to extremely aggressive. Some prostate cancers are slow growing and cause few clinical symptoms. Aggressive PCas spread rapidly to the lymph nodes and other organs, especially bone. It is known that the growth of PCa can be inhibited by blocking the supply of male hormones such as testosterone. However, PCa eventually becomes independent of male sex hormones (i.e. the cells become androgen-independent prostate cancer cells). These cells are linked with aggressive, malignant prostate cancer. All male mammals have a prostate gland but only humans and dogs are known to naturally develop prostate cancer.

Metastatic prostate cancers predominantly spread to the bone and are treated by blocking androgen production by the adrenal glands and testis. This treatment is only effective for a short period of time as the metastatic lesions become androgen independent and grow uncontrollably. The presence of androgen-independent prostate cancer cells means that this treatment regimen is no longer effective and further intervention is required to control the progress of the disease. A similar response is seen to chemotherapeutic and radiotherapy treatments. As a result, metastatic prostate cancer remains an incurable disease by current treatment strategies and early diagnosis is clearly desirable.

Methylation of cytosine (C) nucleotides in CpG dinucleotide sequences (CpG sites; G = guanosine) of human DNA is a known phenomenon and is correlated with specific changes in gene expression. Moreover, abnormal methylation (hypermethylation and/or hypomethylation) of specific genomic areas is present in virtually all cancer types. Genomic regions with altered methylation states in test samples compared to control samples are commonly referred to as “differentially methylated regions” (DMRs). For example, WO2016/102674 discloses a sixteen gene methylation signature associated with the risk of developing aggressive prostate cancer, including methylation of regulatory regions of genes such as GSTP1, SFRP2, IGFBP3 and IGFBP7. WO2014/160829 discloses the methylation of eight genes and their association with prostate cancer, including determination of the methylation status of one or more genes including CAV1, EVX1, MCF2L and FGF1. WO2013/185779 discloses a six gene methylation signature comprising genes HAPLN3, AOX1 and GAS6. Further examples of methylation signatures allegedly determinant of prostate cancer or the aggressiveness of prostate cancer are disclosed in WO2017/143296, WO2012/138609, WO2013/140161 and WO2008/143896.
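As an illustration of the terms used above, the following minimal Python sketch locates CpG sites in a sequence and computes a mean methylation difference between a test and a control sample. The function names, the example sequence and the methylation fractions are hypothetical and are not taken from the disclosure; the difference measure is a simplification and not the statistical method used to call DMRs.

```python
def find_cpg_sites(sequence: str) -> list[int]:
    """Return the 0-based position of the C in every CpG dinucleotide."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def mean_methylation_difference(test: list[float], control: list[float]) -> float:
    """Mean of (test - control) methylation fractions over matched CpG sites.

    A positive value suggests hypermethylation in the test sample and a
    negative value suggests hypomethylation.
    """
    if len(test) != len(control) or not test:
        raise ValueError("test and control must be non-empty and of equal length")
    return sum(t - c for t, c in zip(test, control)) / len(test)


# Hypothetical sequence and per-site methylation fractions (illustration only).
example_sequence = "ACGTTCGGCATCG"
cpg_positions = find_cpg_sites(example_sequence)
difference = mean_methylation_difference([0.9, 0.8, 0.7], [0.2, 0.3, 0.1])
```

A real DMR call would also require a statistical test across replicates and a minimum number of CpG sites per region; both are omitted here.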

This disclosure relates to a comparison of genome wide methylation profiles obtained separately from epithelial cells with luminal and basal phenotypes, isolated with a high purity from patient-matched normal and cancer biopsy samples. Comparative analyses of these profiles showed that a major proportion of the methylation differences between normal basal and luminal cells were conserved in their malignant counterparts. This disclosure also makes it possible to identify, for the first time, regions specifically altered in the luminal fraction of PCa. The hypermethylated DMRs in this group were associated with genes involved in metabolic processes.

We disclose a set of CpG sites. These genomic regions were consistently altered in both tumour phenotypes in the PCa samples and can discriminate normal and PCa samples in prostate cancer analysed in bulk and presented in The Cancer Genome Atlas (TCGA) dataset. The new logistic model constructed from these regions makes use of only 17 probes to distinguish normal and PCa samples with similar specificity and sensitivity to previously developed, non-overlapping models [35, 36], and will be useful in the context of the low mutagenic burdens seen in most hormone-naive prostate cancers.

We disclose that many DNA methylation changes commonly associated with PCa cells are explained by a predominant luminal phenotype of the treatment-naive PCa population and are neither cancer-specific nor likely to contain driver events. Importantly however, we disclose the identity of a class of PCa-specific DNA methylation changes that are specific to cancer luminal cells and that can distinguish normal from cancer samples. The changes common to basal and luminal cancer cells are able to distinguish PCa efficiently from normal samples. This novel set of cancer-specific changes clearly demonstrates the potential of profiling normal and cancer cell subpopulations in identifying signatures that may contain previously unrecognized driver events in the development and progression of PCa.

Statements of Invention

We disclose methods of diagnosis of prostate cancer that measure methylation of prostate specific genomic biomarkers associated with upstream or downstream regulatory regions of genes which, when aberrantly methylated, may result in gene dysregulation that is either causally related to cancer initiation or the result of cancer initiation; and kits for testing methylation of said biomarkers comprising said probes.

The disclosure also relates to the analysis of expression of genes associated with the differential methylation of regulatory regions by determining the expression level of one or more genes selected from the group consisting of: FOXI2, IRS2, KCNC2, C3orf22, LOC10192826, HOXC12, VWA5B1, P2RY1, GLUD2, DDOST, BRF1, GABRB3, PLAGL1, L3MBTL1 and SC5D wherein the expression level is a measure of whether the subject has prostate cancer.

Alternatively, the disclosure also relates to the analysis of expression of genes associated with the differential methylation of regulatory regions by determining the expression level of one or more genes selected from the group consisting of: TDRP, KCNQ1DN, CLEC18A, GATA3, MIR2467, MIR6792 and ADCY1 wherein the expression level is a measure of whether the subject has prostate cancer.

Furthermore, the disclosure relates to the analysis of expression of genes associated with the differential methylation of regulatory regions by determining the expression level of one or more genes selected from the group consisting of: TDRP, VPS28, EFEMP1, LOC10013165, C4orf19, UNCX, CUX2, MIR4632, BARX1, KCNB2, ZNF251, STK3, MAN2B1, RASA3, KLF11, NKX6-2, MAPK11, FAM167B, SLITRK1, PRRX1, MYCNOS, C8orf31 or SH3BGRL3 to determine whether a subject diagnosed with prostate cancer has aggressive or non-aggressive prostate cancer.

According to an aspect of the invention there is provided a diagnostic method to determine the methylation status of one or more genomic regions isolated from a human subject that is suspected of having prostate cancer comprising the steps:

i) obtaining an isolated biological sample from a subject and extracting genomic DNA to provide an isolated sample of genomic DNA; and

ii) determining the methylation status of one or more CpG dinucleotides of said genomic DNA comprising or consisting of one or more genomic nucleotide sequences selected from the group consisting of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO: 16 and SEQ ID NO: 17 and wherein said region can be extended 250bp upstream or downstream of said region; or a polymorphic sequence variant that has at least 90% sequence identity over the full length to the recited nucleotide sequences set forth in SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO: 16 and SEQ ID NO: 17, and wherein said region can be extended 250bp upstream or downstream of said region; as a measure of the likelihood said subject has prostate cancer.
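The two quantitative conditions in step ii), the 250bp extension and the 90% identity threshold, can be sketched as follows. This is a minimal illustration under stated assumptions: coordinates are taken to be 1-based and inclusive, the two sequences are assumed to be already aligned and of equal length, and the function names and example values are hypothetical rather than part of the disclosure.

```python
def extend_region(start: int, end: int, upstream: int = 250,
                  downstream: int = 250) -> tuple[int, int]:
    """Extend a 1-based inclusive interval, clamping the start at position 1."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return max(1, start - upstream), end + downstream


def percent_identity(reference: str, variant: str) -> float:
    """Percent identity over the full length of two pre-aligned sequences."""
    if len(reference) != len(variant) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(reference.upper(), variant.upper()))
    return 100.0 * matches / len(reference)


def meets_identity_threshold(reference: str, variant: str,
                             threshold: float = 90.0) -> bool:
    """True if the variant has at least `threshold` percent identity."""
    return percent_identity(reference, variant) >= threshold


# Hypothetical coordinates and sequences (illustration only).
extended = extend_region(1000, 1200)
identity = percent_identity("ACGTACGTAC", "ACGTACGTAT")
```

Variants that differ in length from the reference would first need a pairwise alignment, which this sketch does not perform.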

To obtain these signatures, DNA methylation patterns were first analysed in four different sub-populations of prostate cells: Normal Basal (NB), Normal Luminal (NL), Cancer Basal (CB) and Cancer Luminal (CL), using a technique called Reduced Representation Bisulphite Sequencing (RRBS). DMRs were then calculated for each pair-wise comparison between populations (e.g. CL vs NL). A logistic regression model was then constructed on the probes of the TCGA dataset overlapping the genomic locations of specific sets of DMRs and used to identify statistically relevant probes that could distinguish cancerous (Cancer) vs non-cancerous (Normal) samples, and samples confined within the prostate capsule (Organ Confined - non-aggressive) vs those that had already escaped outside of the prostate (Extraprostatic - aggressive).
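The logistic regression step described above can be illustrated with a minimal sketch in plain Python. It is not the model built on the TCGA dataset: the methylation values are simulated, the three probes are hypothetical, and the fitting routine is a basic batch gradient descent chosen only to keep the example self-contained.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights: list[float], bias: float, row: list[float]) -> float:
    """Predicted probability that a sample belongs to class 1 (Cancer)."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, row)))


def fit_logistic(X: list[list[float]], y: list[int],
                 lr: float = 0.5, epochs: int = 2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for row, label in zip(X, y):
            err = predict(weights, bias, row) - label
            grad_b += err
            for j, v in enumerate(row):
                grad_w[j] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def simulate(rng: random.Random, n: int, means: list[float]) -> list[list[float]]:
    """Simulated methylation fractions in [0, 1] for n samples (not real data)."""
    return [[min(1.0, max(0.0, rng.gauss(m, 0.05))) for m in means]
            for _ in range(n)]


rng = random.Random(0)
# Hypothetical probes: one hypermethylated, one hypomethylated, one unchanged.
normal = simulate(rng, 20, [0.2, 0.8, 0.3])
cancer = simulate(rng, 20, [0.7, 0.3, 0.3])
X, y = normal + cancer, [0] * 20 + [1] * 20
weights, bias = fit_logistic(X, y)
accuracy = sum((predict(weights, bias, r) >= 0.5) == bool(label)
               for r, label in zip(X, y)) / len(X)
```

In this sketch the sign of each fitted weight mirrors the direction of the methylation change: a positive weight for the probe that gains methylation in the simulated cancer samples and a negative weight for the probe that loses it.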

Methods to determine methylation status of genomic DNA are well known to the skilled person, for example bisulphite sequencing or related techniques such as methylation specific PCR, microarray analysis of bisulphite converted DNA and pyrosequencing methylation assay.

In a preferred method of the invention the polymorphic sequence variant has at least 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% sequence identity over the full length to said nucleotide sequence.
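Bisulphite sequencing, named above among the methods for determining methylation status, relies on the fact that bisulphite treatment converts unmethylated cytosine to uracil (read as thymine after amplification) while methylated cytosine is left unchanged. The following minimal sketch simulates that conversion on a single forward-strand read and calls the status of each CpG. The function names, the reference sequence and the methylated positions are hypothetical; real pipelines also handle both strands, alignment and incomplete conversion.

```python
def bisulphite_convert(sequence: str, methylated_positions: set[int]) -> str:
    """Simulate conversion: unmethylated C becomes T, methylated C stays C."""
    converted = []
    for i, base in enumerate(sequence.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_cpg_methylation(reference: str, converted_read: str) -> dict[int, bool]:
    """Map each CpG position in the reference to True (methylated) or False.

    The read is assumed to be aligned to the reference without gaps.
    Positions where the read shows neither C nor T are skipped.
    """
    ref, read = reference.upper(), converted_read.upper()
    if len(ref) != len(read):
        raise ValueError("read must be aligned to the full reference")
    calls = {}
    for i in range(len(ref) - 1):
        if ref[i:i + 2] == "CG" and read[i] in ("C", "T"):
            calls[i] = read[i] == "C"
    return calls


# Hypothetical reference with CpG sites at positions 2 and 6 (0-based).
reference = "TACGGACGTTCA"
read = bisulphite_convert(reference, methylated_positions={2})
calls = call_cpg_methylation(reference, read)
```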

In a preferred method of the invention said genomic DNA region comprises one or more of SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO: 16 and SEQ ID NO: 17.

In a preferred method of the invention the method determines the methylation status of one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen or seventeen genomic regions set forth in SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO: 16 or SEQ ID NO: 17.

In a preferred method of the invention the method determines the methylation status of each of the genomic regions set forth in SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO: 16 and SEQ ID NO: 17.

In a preferred method of the invention methylation status of genomic DNA is hypermethylation wherein hypermethylation is typically associated with down regulation of the gene associated with the genomic region.

Preferably hypermethylation is of CpGs in one, more or all of the genomic regions as set forth in SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 11, SEQ ID NO: 14, SEQ ID NO: 16 and SEQ ID NO: 17.

In a preferred method of the invention methylation status of genomic DNA is hypomethylation wherein hypomethylation is typically associated with up-regulation of the gene associated with the genomic region.

Preferably hypomethylation is of CpG in one, more or all of the nucleotide sequence(s) as set forth in SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 10, SEQ ID NO: 12, SEQ ID NO: 13 and SEQ ID NO: 15.
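The hypermethylated and hypomethylated groupings recited above for SEQ ID NO: 1 to 17 can be encoded as a simple lookup, for example to check whether an observed methylation change in a region runs in the recited direction. In this sketch the SEQ ID NO groupings are taken from the text; the function names, the methylation difference values and the zero threshold are hypothetical.

```python
# Groupings as recited in the text for SEQ ID NO: 1 to 17.
HYPERMETHYLATED = frozenset({3, 4, 5, 8, 9, 11, 14, 16, 17})
HYPOMETHYLATED = frozenset({1, 2, 6, 7, 10, 12, 13, 15})


def expected_direction(seq_id_no: int) -> str:
    """Return 'hyper' or 'hypo' for a region of the 17-region signature."""
    if seq_id_no in HYPERMETHYLATED:
        return "hyper"
    if seq_id_no in HYPOMETHYLATED:
        return "hypo"
    raise ValueError(f"SEQ ID NO: {seq_id_no} is not in the 17-region signature")


def matches_expected(seq_id_no: int, delta: float, threshold: float = 0.0) -> bool:
    """True if the observed change (test minus control) has the recited sign.

    `threshold` is a hypothetical minimum absolute difference.
    """
    if expected_direction(seq_id_no) == "hyper":
        return delta > threshold
    return delta < -threshold


# Hypothetical observed differences for two regions (illustration only).
region_3_ok = matches_expected(3, 0.25)
region_1_ok = matches_expected(1, -0.20)
```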

In a preferred method of the invention the biological sample is analysed for expression of one or more genes selected from the group consisting of FOXI2, IRS2, KCNC2, C3orf22, LOC101928266, HOXC12, VWA5B1, P2RY1, GLUD2, DDOST, BRF1, GABRB3, PLAGL1, L3MBTL and SC5D and analysed for the differential methylation status of one or more genomic regions set forth in SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 3, SEQ ID NO: 4, SEQ ID NO: 5, SEQ ID NO: 6, SEQ ID NO: 7, SEQ ID NO: 8, SEQ ID NO: 9, SEQ ID NO: 10, SEQ ID NO: 11, SEQ ID NO: 12, SEQ ID NO: 13, SEQ ID NO: 14, SEQ ID NO: 15, SEQ ID NO: 16 and SEQ ID NO: 17.

According to an alternative aspect of the invention there is provided a diagnostic method to determine the methylation status of one or more genomic regions isolated from a human subject that is suspected of having prostate cancer comprising the steps:

i) obtaining an isolated biological sample from a subject and extracting genomic DNA to provide an isolated sample of genomic DNA;

ii) determining the methylation status of one or more CpG dinucleotides of said genomic DNA comprising or consisting of one or more genomic nucleotide sequences selected from the group consisting of SEQ ID NO: 18, SEQ ID NO: 19, SEQ ID NO: 20, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID NO: 23, SEQ ID NO: 24 and SEQ ID NO: 25, and wherein said region can be extended 250bp upstream or downstream of said region, or a polymorphic sequence variant that has at least 90% sequence identity over the full length to the recited nucleotide sequences set forth in SEQ ID NO: 18, SEQ ID NO: 19, SEQ ID NO: 20, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID NO: 23, SEQ ID NO: 24 and SEQ ID NO: 25 and wherein said region can be extended 250bp upstream or downstream of said region, as a measure of the likelihood said subject has prostate cancer.

In a preferred method of the invention said genomic DNA region comprises or consists of SEQ ID NO: 18, SEQ ID NO: 19, SEQ ID NO: 20, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID NO: 23, SEQ ID NO: 24 and SEQ ID NO: 25.

In a preferred method of the invention the method determines the methylation status of one, two, three, four, five, six, seven or eight probes corresponding to the genomic regions set forth in SEQ ID NO: 18, SEQ ID NO: 19, SEQ ID NO: 20, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID NO: 23, SEQ ID NO: 24 and SEQ ID NO: 25.

In a preferred method of the invention the method determines the methylation status of each of the genomic regions set forth in SEQ ID NO: 18, SEQ ID NO: 19, SEQ ID NO: 20, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID NO: 23, SEQ ID NO: 24 and SEQ ID NO: 25.

In a preferred method of the invention methylation status of genomic DNA is hypermethylation wherein hypermethylation is typically associated with down regulation of the gene associated with the genomic region.

Preferably hypermethylation is of CpG represented in one, more or all of the genomic regions as set forth in SEQ ID NO: 18, SEQ ID NO: 19, SEQ ID NO: 20, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID NO: 23 and SEQ ID NO: 25.

In a preferred method of the invention methylation status of genomic DNA is hypomethylation wherein hypomethylation is typically associated with up regulation of the gene associated with the genomic region.

Preferably hypomethylation is of CpG represented in the genomic region set forth in SEQ ID NO: 24.

In a preferred method of the invention said biological sample is analysed for expression of one or more genes selected from the group consisting of TDRP, KCNQ1DN, CLEC18A, GATA3, MIR2467, MIR6792 and ADCY1 and analysed for the differential methylation status of one or more genomic regions set forth in SEQ ID NO: 18, SEQ ID NO: 19, SEQ ID NO: 20, SEQ ID NO: 21, SEQ ID NO: 22, SEQ ID NO: 23, SEQ ID NO: 24 and SEQ ID NO: 25.

According to an aspect of the invention there is provided a diagnostic method to determine the methylation status of one or more genomic regions isolated from a human subject that has prostate cancer comprising the steps:

i) obtaining an isolated biological sample from a subject and extracting genomic DNA to provide an isolated sample of genomic DNA;

ii) determining the methylation status of one or more CpG dinucleotides of said genomic DNA comprising or consisting of one or more genomic nucleotide sequences selected from the group consisting of SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, SEQ ID NO: 29, SEQ ID NO: 30, SEQ ID NO: 31, SEQ ID NO: 32, SEQ ID NO: 33, SEQ ID NO: 34, SEQ ID NO: 35, SEQ ID NO: 36, SEQ ID NO: 37, SEQ ID NO: 38, SEQ ID NO: 39, SEQ ID NO: 40, SEQ ID NO: 41, SEQ ID NO: 42, SEQ ID NO: 43, SEQ ID NO: 44, SEQ ID NO: 45, SEQ ID NO: 46, SEQ ID NO: 47 and SEQ ID NO: 48 and wherein said region can be extended 250bp upstream or downstream of said region, or a polymorphic sequence variant that has at least 90% sequence identity over the full length to the recited nucleotide sequences set forth in SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, SEQ ID NO: 29, SEQ ID NO: 30, SEQ ID NO: 31, SEQ ID NO: 32, SEQ ID NO: 33, SEQ ID NO: 34, SEQ ID NO: 35, SEQ ID NO: 36, SEQ ID NO: 37, SEQ ID NO: 38, SEQ ID NO: 39, SEQ ID NO: 40, SEQ ID NO: 41, SEQ ID NO: 42, SEQ ID NO: 43, SEQ ID NO: 44, SEQ ID NO: 45, SEQ ID NO: 46, SEQ ID NO: 47 and SEQ ID NO: 48 and wherein said region can be extended 250bp upstream or downstream of said region; as a measure of whether the prostate cancer is aggressive or non-aggressive.

Diagnostic methods include in this context methods which predict the likelihood of future patient outcomes e.g. whether the cancer is aggressive or non-aggressive.

In a preferred method of the invention said genomic DNA region comprises or consists of genomic regions set forth in SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, SEQ ID NO: 29, SEQ ID NO: 30, SEQ ID NO: 31, SEQ ID NO: 32, SEQ ID NO: 33, SEQ ID NO: 34, SEQ ID NO: 35, SEQ ID NO: 36, SEQ ID NO: 37, SEQ ID NO: 38, SEQ ID NO: 39, SEQ ID NO: 40, SEQ ID NO: 41, SEQ ID NO: 42, SEQ ID NO: 43, SEQ ID NO: 44, SEQ ID NO: 45, SEQ ID NO: 46, SEQ ID NO: 47 and SEQ ID NO: 48.

In a preferred method of the invention the method determines the methylation status of one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty one, twenty two or twenty three genomic regions set forth in SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, SEQ ID NO: 29, SEQ ID NO: 30, SEQ ID NO: 31, SEQ ID NO: 32, SEQ ID NO: 33, SEQ ID NO: 34, SEQ ID NO: 35, SEQ ID NO: 36, SEQ ID NO: 37, SEQ ID NO: 38, SEQ ID NO: 39, SEQ ID NO: 40, SEQ ID NO: 41, SEQ ID NO: 42, SEQ ID NO: 43, SEQ ID NO: 44, SEQ ID NO: 45, SEQ ID NO: 46, SEQ ID NO: 47 and SEQ ID NO: 48.

In a preferred method of the invention the method determines the methylation status of one or more CpG dinucleotides of said genomic DNA consisting of SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, SEQ ID NO: 29, SEQ ID NO: 30, SEQ ID NO: 31, SEQ ID NO: 32, SEQ ID NO: 33, SEQ ID NO: 34, SEQ ID NO: 35, SEQ ID NO: 36, SEQ ID NO: 37, SEQ ID NO: 38, SEQ ID NO: 39, SEQ ID NO: 40, SEQ ID NO: 41, SEQ ID NO: 42, SEQ ID NO: 43, SEQ ID NO: 44, SEQ ID NO: 45, SEQ ID NO: 46, SEQ ID NO: 47 and SEQ ID NO: 48.

In a preferred method of the invention the method determines the methylation status of each of the genomic regions set forth in SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, SEQ ID NO: 29, SEQ ID NO: 30, SEQ ID NO: 31, SEQ ID NO: 32, SEQ ID NO: 33, SEQ ID NO: 34, SEQ ID NO: 35, SEQ ID NO: 36, SEQ ID NO: 37, SEQ ID NO: 38, SEQ ID NO: 39, SEQ ID NO: 40, SEQ ID NO: 41, SEQ ID NO: 42, SEQ ID NO: 43, SEQ ID NO: 44, SEQ ID NO: 45, SEQ ID NO: 46, SEQ ID NO: 47 and SEQ ID NO: 48.

In a preferred method of the invention methylation status of genomic DNA is hypermethylation wherein hypermethylation is typically associated with down regulation of the gene associated with the genomic region.

Preferably hypermethylation is of CpG represented in one, more or all of the genomic regions as set forth in SEQ ID NO: 26, SEQ ID NO: 28, SEQ ID NO: 29, SEQ ID NO: 31, SEQ ID NO: 33, SEQ ID NO: 34, SEQ ID NO: 36, SEQ ID NO: 37, SEQ ID NO: 38, SEQ ID NO: 40, SEQ ID NO: 42, SEQ ID NO: 46 and SEQ ID NO: 48.

In a preferred method of the invention methylation status of genomic DNA is hypomethylation wherein hypomethylation is typically associated with up-regulation of the gene associated with the genomic region. Preferably hypomethylation is of CpG represented in one, more or all of the genomic regions set forth in SEQ ID NO: 27, SEQ ID NO: 30, SEQ ID NO: 32, SEQ ID NO: 35, SEQ ID NO: 39, SEQ ID NO: 41, SEQ ID NO: 43, SEQ ID NO: 44, SEQ ID NO: 45 and SEQ ID NO: 47.
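As an illustration only, the distinction between hypermethylation and hypomethylation can be sketched as a sign test on a methylation difference between a test sample and a reference. The sketch below borrows the cutoffs stated in the Materials and Methods for differentially methylated regions (q-value <0.05 and absolute methylation difference >10%); the function name, argument names and example values are hypothetical and are not taken from the specification.

```python
def classify_region(meth_diff_pct, q_value, min_abs_diff=10.0, max_q=0.05):
    """Label a genomic region from its methylation difference.

    meth_diff_pct: test minus reference methylation, in percentage points.
    q_value: multiple-testing corrected p-value for the comparison.
    """
    # A region is only called when both cutoffs are met.
    if q_value >= max_q or abs(meth_diff_pct) <= min_abs_diff:
        return "unchanged"
    return "hypermethylated" if meth_diff_pct > 0 else "hypomethylated"


# Hypothetical values for illustration only.
examples = {
    "region_a": (25.0, 0.001),
    "region_b": (-18.5, 0.01),
    "region_c": (4.0, 0.2),
}
labels = {name: classify_region(diff, q) for name, (diff, q) in examples.items()}
print(labels)
```

Under these assumptions a region is labelled only when both cutoffs are met; anything else is reported as unchanged rather than forced into one of the two classes.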

In a preferred method of the invention the biological sample is analysed for expression of one or more genes selected from the group consisting of TDRP, VPS28, EFEMP1, LOC100131655, C4orf19, UNCX, CUX2, MIR4632, BARX1, KCNB2, ZNF251, STK3, MAN2B1, RASA3, KLF11, NKX6-2, MAPK11, FAM167B, SLITRK1, PRRX1, MYCNOS, C8orf31 and SH3BGRL3 and analysed for differential methylation status of one or more genomic regions set forth in SEQ ID NO: 26, SEQ ID NO: 27, SEQ ID NO: 28, SEQ ID NO: 29, SEQ ID NO: 30, SEQ ID NO: 31, SEQ ID NO: 32, SEQ ID NO: 33, SEQ ID NO: 34, SEQ ID NO: 35, SEQ ID NO: 36, SEQ ID NO: 37, SEQ ID NO: 38, SEQ ID NO: 39, SEQ ID NO: 40, SEQ ID NO: 41, SEQ ID NO: 42, SEQ ID NO: 43, SEQ ID NO: 44, SEQ ID NO: 45, SEQ ID NO: 46, SEQ ID NO: 47 and SEQ ID NO: 48.

In a preferred method of the invention the expression of each gene is correlated with the methylation status of each genomic region.
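The specification does not state how expression is correlated with methylation status. One common choice, shown here purely as an assumption, is the Pearson correlation coefficient computed across samples for a single gene and region pair. The function name and the sample values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-sample values: fraction methylated at one region and
# log-scale expression of the associated gene in the same samples.
methylation = [0.10, 0.25, 0.40, 0.60, 0.85]
expression = [9.1, 8.0, 6.9, 5.2, 3.0]
r = pearson_r(methylation, expression)
print(round(r, 3))
```

A strongly negative coefficient in such data would be consistent with the statement above that hypermethylation is typically associated with down regulation of the associated gene.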

According to an aspect of the invention there is provided a diagnostic method to determine the methylation status of one or more genomic regions of a human subject that is suspected of having prostate cancer comprising the steps:

i) obtaining a biological sample from a subject and extracting genomic DNA to provide an isolated sample comprising genomic DNA;

ii) determining the methylation status of at least one region of genomic DNA by interrogating the methylation status associated with one or more probes selected from the group consisting of cg02523640, cg02315096, cg06563089, cg04527018, cg10116893, cg06729806, cg22333412, cg03293976, cg03356806, cg13233461, cg24876897, cg01153451, cg14859324, cg10007452, cg09541000, cg15628253 and cg21745537; as a measure of the likelihood that said subject has prostate cancer.

In a preferred method of the invention said extracted genomic DNA is contacted with one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen or seventeen probes selected from the group consisting of cg02523640, cg02315096, cg06563089, cg04527018, cg10116893, cg06729806, cg22333412, cg03293976, cg03356806, cg13233461, cg24876897, cg01153451, cg14859324, cg10007452, cg09541000, cg15628253 and cg21745537.

Typically, the genomic DNA is analysed using the Infinium HumanMethylation450K BeadChip microarray platform, interrogating any of the probes mentioned above.

In a preferred method of the invention said extracted genomic DNA is contacted with each probe set forth in: cg02523640, cg02315096, cg06563089, cg04527018, cg10116893, cg06729806, cg22333412, cg03293976, cg03356806, cg13233461, cg24876897, cg01153451, cg14859324, cg10007452, cg09541000, cg15628253 and cg21745537.

In a preferred method of the invention the biological sample is analysed for expression of one or more genes selected from the group consisting of FOXI2, IRS2, KCNC2, C3orf22, LOC10192826, HOXC12, VWA5B1, P2RY1, GLUD2, DDOST, BRF1, GABRB3, PLAGL1, L3MBTL1 and SC5D and analysed for the differential methylation status of one or more genomic regions as determined by one or more probes selected from the group consisting of: cg02523640, cg02315096, cg06563089, cg04527018, cg10116893, cg06729806, cg22333412, cg03293976, cg03356806, cg13233461, cg24876897, cg01153451, cg14859324, cg10007452, cg09541000, cg15628253 and cg21745537.

In a preferred method of the invention the expression of each gene is correlated with the methylation status of each genomic region as determined by each of said probes.

According to an aspect of the invention there is provided a diagnostic method to determine the methylation status of one or more genomic regions of a human subject that is suspected of having prostate cancer comprising the steps:

i) obtaining a biological sample from a subject and extracting genomic DNA to provide an isolated sample comprising genomic DNA;

ii) determining the methylation status of at least one region of genomic DNA by interrogating the methylation status associated with one or more methylated probes selected from the group consisting of: cg00066748, cg08376310, cg15192750, cg15267232, cg17124583, cg19125791, cg23044391, cg26459372; as a measure of the likelihood that said subject has prostate cancer.

In a preferred method of the invention said extracted genomic DNA is contacted with one, two, three, four, five, six, seven or eight probes selected from the group consisting of cg00066748, cg08376310, cg15192750, cg15267232, cg17124583, cg19125791, cg23044391, cg26459372.

In a preferred method of the invention said extracted genomic DNA is contacted with each probe set forth as: cg00066748, cg08376310, cg15192750, cg15267232, cg17124583, cg19125791, cg23044391, cg26459372.

In a preferred method of the invention the biological sample is analysed for expression of one or more genes selected from the group consisting of TDRP, KCNQ1DN, CLEC18A, GATA3, MIR2467, MIR6792, ADCY1 and analysed for the differential methylation status of one or more genomic regions as determined by one or more probes selected from the group consisting of: cg00066748, cg08376310, cg15192750, cg15267232, cg17124583, cg19125791, cg23044391, cg26459372.

According to an aspect of the invention there is provided a diagnostic method to determine the methylation status of one or more genomic regions of a human subject having prostate cancer comprising the steps:

i) obtaining a biological sample from a subject and extracting genomic DNA to provide an isolated sample comprising genomic DNA;

ii) determining the methylation status of at least one region of genomic DNA by interrogating the methylation status associated with one or more probes selected from the group consisting of: cg00066748, cg01285435, cg03122624, cg03731646, cg04301614, cg04804717, cg05977462, cg06740893, cg07076509, cg09433131, cg12010198, cg15830431, cg17737967, cg18020065, cg18825594, cg20595634, cg20811788, cg21155609, cg21265647, cg21914290, cg22755142, cg23986375, cg24348240; as a measure of whether the prostate cancer is aggressive or non-aggressive.

In a preferred method of the invention said extracted genomic DNA is contacted with one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty one, twenty two or twenty three probes selected from the group consisting of cg00066748, cg01285435, cg03122624, cg03731646, cg04301614, cg04804717, cg05977462, cg06740893, cg07076509, cg09433131, cg12010198, cg15830431, cg17737967, cg18020065, cg18825594, cg20595634, cg20811788, cg21155609, cg21265647, cg21914290, cg22755142, cg23986375, cg24348240.

In a preferred method of the invention said extracted genomic DNA is contacted with each probe set forth as: cg00066748, cg01285435, cg03122624, cg03731646, cg04301614, cg04804717, cg05977462, cg06740893, cg07076509, cg09433131, cg12010198, cg15830431, cg17737967, cg18020065, cg18825594, cg20595634, cg20811788, cg21155609, cg21265647, cg21914290, cg22755142, cg23986375, cg24348240.

In a preferred method of the invention the biological sample is analysed for expression of one or more genes selected from the group consisting of TDRP, VPS28, EFEMP1, LOC10013165, C4orf19, UNCX, CUX2, MIR4632, BARX1, KCNB2, ZNF251, STK3, MAN2B1, RASA3, KLF11, NKX6-2, MAPK11, FAM167B, SLITRK1, PRRX1, MYCNOS, C8orf31 and SH3BGRL3, and analysed for the differential methylation status of one or more genomic regions as determined by one or more probes selected from the group consisting of: cg00066748, cg01285435, cg03122624, cg03731646, cg04301614, cg04804717, cg05977462, cg06740893, cg07076509, cg09433131, cg12010198, cg15830431, cg17737967, cg18020065, cg18825594, cg20595634, cg20811788, cg21155609, cg21265647, cg21914290, cg22755142, cg23986375, cg24348240.

In a preferred method of the invention said biological sample is selected from the group consisting of: urine, seminal fluid, blood, lymph fluid, or prostate tissue.

Preferably the isolated biological sample is a tissue biopsy. Preferably said biopsy is a prostate tissue biopsy.

According to an aspect of the invention there is provided a diagnostic method for determining if a subject has or is susceptible to prostate cancer comprising:

i) providing an isolated biological sample to be tested and preparing cDNA;

ii) forming a preparation comprising said cDNA and oligonucleotide primer pairs adapted to anneal to a nucleic acid molecule comprising a nucleotide sequence encoding one or more polypeptides selected from the group consisting of: FOXI2, IRS2, KCNC2, C3orf22, LOC10192826, HOXC12, VWA5B1, P2RY1, GLUD2, DDOST, BRF1, GABRB3, PLAGL1, L3MBTL1 and SC5D, a thermostable DNA polymerase, deoxynucleotide triphosphates and co-factors;

iii) providing polymerase chain reaction conditions sufficient to amplify all or part of said nucleic acid molecule(s);

iv) analyzing the amplified products of said polymerase chain reaction for the presence and/or level of said nucleic acid molecule encoding said polypeptide as a measure of gene expression; and

v) comparing the amplified product with a normal matched control.

According to an aspect of the invention there is provided a diagnostic method for determining if a subject has or is susceptible to prostate cancer comprising:

i) providing an isolated biological sample to be tested and preparing cDNA;

ii) forming a preparation comprising said cDNA and oligonucleotide primer pairs adapted to anneal to a nucleic acid molecule comprising a nucleotide sequence encoding one or more polypeptides selected from the group consisting of: TDRP, KCNQ1DN, CLEC18A, GATA3, MIR2467, MIR6792, ADCY1, a thermostable DNA polymerase, deoxynucleotide triphosphates and co-factors;

iii) providing polymerase chain reaction conditions sufficient to amplify all or part of said nucleic acid molecule(s);

iv) analyzing the amplified products of said polymerase chain reaction for the presence and/or level of said nucleic acid molecule encoding said polypeptide as a measure of gene expression; and

v) comparing the amplified product with a normal matched control.

According to an aspect of the invention there is provided a method for determining if a subject has aggressive or non-aggressive prostate cancer comprising:

i) providing an isolated biological sample to be tested and preparing cDNA;

ii) forming a preparation comprising said cDNA and oligonucleotide primer pairs adapted to anneal to a nucleic acid molecule comprising a nucleotide sequence encoding one or more polypeptides selected from the group consisting of: TDRP, VPS28, EFEMP1, LOC10013165, C4orf19, UNCX, CUX2, MIR4632, BARX1, KCNB2, ZNF251, STK3, MAN2B1, RASA3, KLF11, NKX6-2, MAPK11, FAM167B, SLITRK1, PRRX1, MYCNOS, C8orf31 and SH3BGRL3, a thermostable DNA polymerase, deoxynucleotide triphosphates and co-factors;

iii) providing polymerase chain reaction conditions sufficient to amplify all or part of said nucleic acid molecule(s);

iv) analyzing the amplified products of said polymerase chain reaction for the presence and/or level of said nucleic acid molecule encoding said polypeptide as a measure of gene expression; and

v) comparing the amplified product with a normal matched control.

Any primers/probes designed to the nucleic acids encoding polypeptides selected from the group consisting of: TDRP, VPS28, EFEMP1, LOC10013165, C4orf19, UNCX, CUX2, MIR4632, BARX1, KCNB2, ZNF251, STK3, MAN2B1, RASA3, KLF11, NKX6-2, MAPK11, FAM167B, SLITRK1, PRRX1, MYCNOS, C8orf31 and SH3BGRL3; or TDRP, KCNQ1DN, CLEC18A, GATA3, MIR2467, MIR6792 and ADCY1; or FOXI2, IRS2, KCNC2, C3orf22, LOC10192826, HOXC12, VWA5B1, P2RY1, GLUD2, DDOST, BRF1, GABRB3, PLAGL1, L3MBTL1 and SC5D are part of a kit for use in the methods described above. The kit can further comprise a DNA polymerase, deoxynucleotide triphosphates and co-factors.

Methods to analyse gene expression are known in the art, for example quantitative PCR, also referred to as real time PCR. Alternatively, gene expression can be assessed in situ on isolated biological samples. In situ detection can also be via PCR amplification of mRNA expressed from target genes; this can also be a real time assay.
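By way of a minimal sketch, real time PCR data are often reduced to a relative expression value using the comparative Ct (ΔΔCt) calculation, in which the target gene is normalised to a housekeeping gene (GAPDH is the one named in the figure legends) and then to a control sample. The specification does not prescribe this calculation; the function name, the assumption of 100% amplification efficiency and the Ct values below are illustrative only.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene in a sample relative to a control.

    Uses the comparative Ct calculation, assuming the amplicon doubles
    every cycle (100% amplification efficiency).
    """
    # Normalise the target gene to the housekeeping gene in each sample.
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_control = ct_target_control - ct_ref_control
    # Compare the test sample with the normal matched control.
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values: the target crosses threshold two cycles later
# in the test sample than in the control, with equal housekeeping Ct.
fold = relative_expression(26.0, 18.0, 24.0, 18.0)
print(fold)  # 0.25, i.e. four-fold lower expression than the control
```

A fold change below 1 indicates lower expression in the test sample than in the control, and a value above 1 indicates higher expression.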

Throughout the description and claims of this specification, the words “comprise” and “contain” and variations of the words, for example “comprising” and “comprises”, mean “including but not limited to”, and are not intended to (and do not) exclude other moieties, additives, components, integers or steps. “Consisting essentially” means having the essential integers but including integers which do not materially affect the function of the essential integers.

Throughout the description and claims of this specification, the singular encompasses the plural unless the context otherwise requires. Where the indefinite article is used, the specification is to be understood as contemplating plurality as well as singularity, unless the context requires otherwise.

Features, integers, characteristics, compounds, chemical moieties or groups described in conjunction with an aspect, embodiment or example of the invention are to be understood to be applicable to any other aspect, embodiment or example described herein unless incompatible therewith.

An embodiment of the invention will now be described by example only and with reference to the following figures:

Fig. 1: Identification of DMRs between prostate cancer cell populations. (A) Representative FACS profiles of a cell suspension prepared from core needle biopsies of a radical prostatectomy sample. (B) Heatmap showing scaled methylation values of the top 1% most variable regions (100 bp bins) in the samples analysed. Hierarchical clustering is based on Euclidean distance of the unscaled values and complete linkage. (C) Diagram showing all pairwise comparisons carried out. (D) Number of DMRs found in each comparison. (E) Overlap of DMRs with CpG islands, shores (2 kb flanking islands) or shelves (2 kb flanking shores). P-values from hypergeometric test against all regions. E = enriched, D = depleted. (F) Distribution of distances of DMRs to the closest TSS. Grey box indicates ±5 kb from a TSS. Purple lines: hypermethylated DMRs, orange lines: hypomethylated DMRs, grey line: all regions. (G) Proportion of DMRs proximal or distal to TSSs. P-values from hypergeometric test against all regions. E = enriched, D = depleted;

Fig. 2: Hypermethylated distal DMRs have features of enhancers. (A) Average plots of evolutionary conservation scores of the distal DMRs in each set. Purple lines: hypermethylated DMRs; orange lines: hypomethylated DMRs, gray line: all regions. P-values are from bootstrapping analysis. (B) Proportion of distal DMRs overlapping with DHSs (identified by ENCODE). P-values from hypergeometric test against all regions. E = enriched, D = depleted. (C) Overlap of distal DMRs with ChIP-seq derived TFBSs (identified by ENCODE). P-values are from hypergeometric tests against all regions. E = enriched, D = depleted. (D) Overlap of each set of distal DMRs with repetitive elements (UCSC repeatMask), SINEs, LINEs and LTRs. P-values from hypergeometric tests against all regions. E = enriched, D = depleted. (E) Number of GO terms enriched by each set of DMRs. GO terms identified using GREAT (FDR<0.05 and at least 3 genes in the set);

Fig. 3: Shared phenotype-specific DMRs. (A) Overlap between the DMRs identified in the NL-NB and CL-CB comparisons. P-values derived from Fisher's exact test. (B) Heatmap showing scaled methylation values of the DMRs identified in the NL-NB (left) or CL-CB (right) comparisons. Hierarchical clustering is based on Euclidean distances of the unscaled values and complete linkage. (C) TFBSs enriched in the hypermethylated (purple) or hypomethylated (orange) DMRs common between the NL-NB and CL-CB comparisons. Left panel: analysis performed using HOMER findMotifs, p-values from binomial test. Right panel: enrichment of ENCODE defined TFBSs, p-values from hypergeometric test against all regions. (D) Frequently hyper- or hypomethylated genes in PCa 1 that were also hypermethylated (purple) or hypomethylated (orange) in the NL-NB and CL-CB comparisons. (E-F) Genome browser plots of the promoter regions of GSTP1 (E) and CCDC8 (F). Grey squares are the bins analysed. Lines and shaded areas represent mean ±SEM of each category (NB=light blue, NL=light red, CB=dark blue, CL=dark red). DMRs are shown on top: hypermethylated=purple, hypomethylated=orange;

Fig. 4: Aberrant methylation in CL. (A) Frequently hyper- or hypomethylated genes in PCa 1 that are also hypermethylated (purple) or hypomethylated (orange) in the CL-CB and CL-NL comparisons. (B) Overlap between the DMRs identified in the CL-CB and CL-NL comparisons. P-values derived from Fisher's exact test. (C) Clustering of the gene ontologies (biological process) enriched in DMRs common between the CL-CB and CL-NL comparisons based on information similarity. Each circle shows an individual GO term enriched in regions hypermethylated (purple), hypomethylated (orange) or both (green); the size of the circles is proportional to the enrichment p-value. The 2 main clusters of GO terms determined by k-means are highlighted (light blue and pink), and named after the most frequent terms. (D) Heatmap showing scaled methylation values (β-values) of probes overlapping the DMRs common to the CL-CB and CL-NL comparisons in the PCa samples (magenta) and matched normal samples (green) within the TCGA dataset. Hierarchical clustering based on Euclidean distances of the unscaled values and complete linkage. The dark green and gray clusters were generated by cutting the tree at the first bifurcation. (E) Heatmap showing scaled methylation values (β-values) of probes overlapping the DMRs common to the CL-CB and CL-NL comparisons in the PCa samples (matched normal samples not included) of the TCGA dataset. Hierarchical clustering based on Euclidean distance of the unscaled values and complete linkage. The dark green and gray clusters are generated by cutting the tree at the first bifurcation;

Fig. 5: PCa-specific DMRs shared between CB and CL. (A) Overlap between the DMRs identified in the CL-NL and CB-NB comparisons. P-values derived from Fisher's exact test. (B) Genome browser views of KCNC2 promoter (top) and RHCG exon 2 (bottom). Grey squares are the bins analysed. Lines and shaded areas represent mean ±SEM of each category (NB=light blue, NL=light red, CB=dark blue, CL=dark red). DMRs are shown on top: hypermethylated=purple, hypomethylated=orange. (C) Heatmap showing scaled methylation values of probes overlapping the DMRs common between CL-CB and CB-NB in the matched normal and cancer samples within the TCGA dataset. Hierarchical clustering based on Euclidean distances of the unscaled values and complete linkage. The dark green and gray clusters were generated by cutting the tree at the first 2 bifurcations. (D) Selection of a 17-probe signature distinguishing normal and PCa samples applying LASSO regression on a logistic model of the training dataset (70% of the TCGA samples). Lines show the changes in coefficients in relation to different lambdas. The vertical dashed line shows the optimal lambda identified using cross-validation. (E) Receiver-operating characteristic curve generated by applying the optimal logistic model to the test dataset (30% of the TCGA samples). (F) Heatmap showing scaled methylation values of the 17-probe signature in the test dataset (30% of the TCGA samples). The bar plot on the left side shows the final coefficients for each probe in the model, and the bar plot on top shows the logistic probability generated for each sample (Green: normal samples, magenta: cancer samples);

Fig. 6: Cancer and extraprostatic specific signatures. (A) Selection of an 8-probe signature distinguishing normal and PCa samples applying LASSO regression on a logistic model of the training dataset (70% of the TCGA samples) composed of the probes presented in Fig. 4D. Lines show the changes in coefficients in relation to different lambdas. The vertical dashed line shows the optimal lambda identified using cross-validation. (B) Receiver-operating characteristic curve generated by applying the optimal logistic model generated in A to the test dataset (30% of the TCGA samples). (C) Heatmap showing scaled methylation values of the 8-probe signature generated in A in the test dataset (30% of the TCGA samples). The bar plot on the left side shows the final coefficients for each probe in the model, and the bar plot on top shows the logistic probability generated for each sample (Green: normal samples, magenta: cancer samples). (D) Selection of a 23-probe signature distinguishing organ confined and extraprostatic PCa samples applying LASSO regression on a logistic model of the training dataset (70% of the TCGA samples) composed of the probes presented in Fig. 4E. Lines show the changes in coefficients in relation to different lambdas. The vertical dashed line shows the optimal lambda identified using cross-validation. (E) Receiver-operating characteristic curve generated by applying the optimal logistic model generated in D to the test dataset (30% of the TCGA samples). (F) Heatmap showing scaled methylation values of the 23-probe signature generated in D in the test dataset (30% of the TCGA samples). The bar plot on the left side shows the final coefficients for each probe in the model, and the bar plot on top shows the logistic probability generated for each sample (Light blue: organ confined, orange: extraprostatic).

Supplementary Fig. 1: Validation of the FACS sorting strategy for basal and luminal cells. (A) Immunofluorescence for pan-cytokeratin (PanCK), cytokeratin 5 (KRT5), cytokeratin 8 (KRT8) and androgen receptor (AR) on cytospin preparations of sorted EpCAM+CD49f+CD24- and EpCAM+CD49f-CD24+ cells. The pie charts show the proportion of negative (white), dim (grey) and bright (black) cells for each marker. At least 100 cells per population per marker were counted. For AR, the fraction of dim or positive cells with cytoplasmic (yellow) or nuclear (blue) localization is also shown. One representative image is shown for each marker in the green channel, and DAPI counterstain in blue. Scale bars = 10 µm. (B) qRT-PCR analysis performed on FACS sorted EpCAM+CD49f+CD24- and EpCAM+CD49f-CD24+ cells from one matched tumor directed and contralateral biopsy. Bars show mean ±SEM of the log10(ΔΔCt values) using GAPDH as the housekeeping gene and the NB population as the reference sample. (C) Table of the pre-operation clinical features of the donors for the samples used for the DNA methylation analysis (PSA measured in ng/ml);

Supplementary Fig. 2: Identification of differentially methylated CpGs between prostate cancer cell populations. (A) Heatmap showing scaled methylation values of the top 1% most variable CpGs in the samples analysed. Hierarchical clustering is based on Euclidean distance of the unscaled values and complete linkage. (B) Number of differentially methylated CpGs found in each comparison. (C) Distribution of distances of differentially methylated CpGs to the closest TSS. Grey box indicates ±5 kb from a TSS. Purple lines: hypermethylated, orange lines: hypomethylated, gray line: all CpGs. (D) Overlap of differentially methylated CpGs with CpG islands, shores (2 kb flanking islands) or shelves (2 kb flanking shores). P-values from hypergeometric test against all CpGs analysed. E = enriched, D = depleted. (E) Heatmap showing scaled methylation values of the differentially methylated CpGs identified in the NL-NB (left) or CL-CB (right) comparisons. Hierarchical clustering is based on Euclidean distances of the unscaled values and complete linkage.

Supplementary Fig. 3: Proximal DMRs are associated with differential expression. (A) Dot-plot showing for each differentially expressed gene associated with a proximal (<5 kb from TSS) DMR in NL-NB differential methylation (x axis) and differential expression (y axis) in similarly selected cell populations. Each dot represents one gene/DMR association; purple dots: hypermethylated and downregulated genes; orange dots: hypomethylated and upregulated genes; blue line: least squares linear fit. (B) Top: Venn diagrams showing the overlap of the DMRs obtained in the NL-NB and NL-CB comparisons (p-values from Fisher's exact test). Bottom: dot-plot showing the methylation difference of the DMRs identified in the NL-NB (green), NL-CB (red) comparisons, or both (Black). (C) Top: Venn diagrams showing the overlap of the DMRs obtained in the CL-CB and CL-NB comparisons. Bottom: dot-plot showing the methylation difference of the DMRs identified in the CL-CB (green), CL-NB (red) comparisons, or both (Black).

Supplementary Fig. 4: Gene ontology enrichment analysis. Clustering of the gene ontologies (biological process) enriched in DMRs identified in the NL-NB (left), CL-CB (middle) and CL-NL (right) comparisons based on information similarity. Each circle shows an individual GO term; the size of the circles is proportional to the enrichment p-value. The 3 main clusters of GO terms determined by k-means are highlighted.

Supplementary Fig. 5: Phenotype-specific distal DMRs are highly enriched in enhancer features. (A) Distribution of distances of DMRs common between the NL-NB and CL-CB comparisons to the closest TSS. Grey box indicates ±5 kb from TSS. Purple line: hypermethylated DMRs, orange line: hypomethylated DMRs, gray line: all regions. (B) Average plots of evolutionary conservation scores of the distal DMRs common between the NL-NB and CL-CB comparisons. Purple line: hypermethylated DMRs, orange line: hypomethylated DMRs, gray line: all regions. P-values from bootstrapping analysis. (C) Proportion of distal DMRs common between the NL-NB and CL-CB comparisons that overlapped with DHSs (identified by ENCODE). P-values from hypergeometric test against all regions. E = enriched, D = depleted. (D) Overlap of distal DMRs common between the NL-NB and CL-CB comparisons with ChIP-seq-derived TFBSs (identified by ENCODE). P-values from hypergeometric test against all regions. E = enriched, D = depleted. (E) Top 15 gene ontologies enriched in hypermethylated and hypomethylated DMRs common between NL-NB and CL-CB. P-values from hypergeometric test (FDR<0.05 and at least 3 genes in the set). (F) Clustering of the gene ontologies (biological process) enriched in DMRs common between the NL-NB and CL-CB comparisons based on information similarity. Each circle shows an individual GO term, the size of the circles is proportional to the enrichment p-value. The 3 main clusters of GO terms determined by k-means are highlighted.

Supplementary Fig. 6: Aberrant methylation in luminal cells from PCa samples. (A) Distribution of distances of the DMRs common between the CL-CB and CL-NL comparisons to the closest TSS. Grey box indicates ±5 kb from TSS. Purple line: hypermethylated DMRs, orange line: hypomethylated DMRs, gray line: all regions. (B) Average plots of evolutionary conservation scores of the distal DMRs common between the CL-CB and CL-NL comparisons. Purple line: hypermethylated DMRs, orange line: hypomethylated DMRs, gray line: all regions. P-values from bootstrapping analysis. (C) Proportion of distal DMRs common between the CL-CB and CL-NL comparisons that overlapped with DHSs (identified by ENCODE). P-values from hypergeometric test against all regions. E = enriched, D = depleted. (D) Overlap of distal DMRs common between the CL-CB and CL-NL comparisons with ChIP-seq-derived TFBSs (identified by ENCODE). P-values from hypergeometric test against all regions. E = enriched, D = depleted. (E) Overlap of each set of distal DMRs common between the CL-CB and CL-NL comparisons with repetitive elements (UCSC repeatMask), SINEs, LINEs and LTRs. P-values from hypergeometric test against all regions. E = enriched, D = depleted. (F) TFBSs enriched in the DMRs common between the CL-CB and CL-NL comparisons. Top panel: enrichment of ENCODE defined TFBSs, p-values from hypergeometric tests against all regions. Bottom panel: analysis performed using HOMER findMotifs, p-values from binomial tests. (G) Heatmap showing scaled methylation values of probes in the TCGA dataset (all samples) overlapping the DMRs common between the CL-CB and CL-NL comparisons. Hierarchical clustering based on Euclidean distance of the unscaled values and complete linkage. The dark green and gray clusters are generated by cutting the tree at the first 2 bifurcations.

Supplementary Fig. 7: PCa-specific DMRs shared by both basal and luminal subsets. (A) Heatmap showing scaled methylation values of the DMRs common between the CL-CB and CB-NB comparisons. Hierarchical clustering is based on Euclidean distance of the unscaled values and complete linkage. (B) Heatmap showing scaled methylation values of probes in the TCGA dataset (all samples) overlapping the DMRs common between the CL-CB and CB-NB comparisons. Hierarchical clustering based on Euclidean distance of the unscaled values and complete linkage. The dark green and gray clusters are generated by cutting the tree at the first 3 bifurcations.

Table S1: Quality metrics for all RRBS libraries generated.

Table 1: Methylation signature SEQ ID NO 1-17

Table 2: Methylation signature SEQ ID NO 18-25

Table 3: Methylation signature SEQ ID NO 26-48

Key Table 1-3

Table S1

Materials and Methods

Tissue processing

Prostate tissues were obtained from patients undergoing radical prostatectomy at Castle Hill Hospital (Cottingham, UK) with informed patient consent and approval from the NRES Committee Yorkshire & The Humber (LREC Number 07/H1304/121). Tissues were sampled immediately after surgery. For radical prostatectomies, three core needle biopsies were taken from four different sites (left base, left apex, right base, right apex) and were directed by previous pathology, imaging and palpation. Tissues were transported in RPMI-1640 with 5% FCS and 100 U/ml antibiotic/antimycotic solution at 4°C and processed immediately upon arrival. PCa diagnosis was confirmed by histological examination of the whole prostate. Tissues were disaggregated as previously described 13, and all reagents were supplemented with 10 nM R1881 to better preserve the viability of luminal cells.

Fluorescence activated cell sorting (FACS) and characterization of cell populations:

Single-cell suspensions were labelled with Lineage Cell Depletion Kit (human) and CD31 MicroBead Kit (Miltenyi Biotec) and Lin+/CD31+ cells depleted twice using MACS LS Columns (Miltenyi Biotec). Lin-/CD31- cells were then labelled with EpCAM-APC, CD49f-FITC and CD24-PE (Miltenyi Biotec) and DAPI, and EpCAM+/CD49f+/CD24- and EpCAM+/CD49f-/CD24+ cells were sorted at >95% purity using a MoFlo (Beckman Coulter) cell sorter. Sorted populations were characterized by immunofluorescence and qRT-PCR as previously described 18.

Reduced Representation Bisulphite Sequencing (RRBS):

DNA was extracted from FACS-sorted populations using phenol/chloroform extraction and ethanol precipitation. DNA was quantified using a NanoDrop 1000 Spectrophotometer (Thermo Fisher Scientific) and shipped to Zymo Research for RRBS analysis. Bisulphite conversion, library preparation, sequencing, and initial bioinformatics analyses were performed by Zymo Research following the Methyl-MiniSeq pipeline.

Sequence data processing and methylation calls

Fastq files were trimmed using Trim Galore! v0.4.1 (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) with the following parameters: --fastqc --illumina --paired --rrbs --non_directional. Trimmed sequences were aligned to the human genome (hg19 downloaded from UCSC, 08-Mar-2009 version) using bsmap v2.90 14 and the following parameters: -m 0 -x 1000 -n 1 -p 8 -S 1. The resulting bam files were sorted and indexed using samtools v0.1.19 15, and methylation and coverage calls for each CpG site calculated using the methratio.py script in the bsmap software (Supplementary Table 1). Methylation calls were then filtered for low (<3) and high (>99.95%) read coverage and merged in non-overlapping genomic bins of 100 bp using the methylKit package v0.99.2 16 within R v3.3.1 to increase comparability between samples. All subsequent analyses were carried out using only those genomic bins covered in all samples, with the exception of the results presented in Supplementary Fig. 2, which were generated using single CpG information.
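The binning step can be illustrated with a minimal Python sketch. It is not the methylKit implementation: the input layout (chromosome, position, methylated reads, total reads), the function name and the 0-based bin boundaries are assumptions made for illustration, and only the low-coverage filter is shown (the high-coverage filter is omitted).

```python
from collections import defaultdict

BIN_SIZE = 100     # bp, bin width stated in the text
MIN_COVERAGE = 3   # CpG sites with fewer reads are discarded

def bin_cpg_calls(calls, bin_size=BIN_SIZE, min_coverage=MIN_COVERAGE):
    """Pool per-CpG calls into non-overlapping genomic bins.

    `calls` is an iterable of (chrom, position, methylated_reads, total_reads),
    a hypothetical layout. Returns {(chrom, bin_start): percent methylation}.
    """
    totals = defaultdict(lambda: [0, 0])
    for chrom, pos, meth, total in calls:
        if total < min_coverage:
            continue  # low-coverage site, filtered out
        bin_start = (pos // bin_size) * bin_size  # assumed 0-based bin origin
        totals[(chrom, bin_start)][0] += meth
        totals[(chrom, bin_start)][1] += total
    return {key: 100.0 * m / t for key, (m, t) in totals.items()}

# Toy data, not taken from the study
example_calls = [
    ("chr1", 1005, 8, 10),
    ("chr1", 1050, 2, 10),
    ("chr1", 1120, 5, 5),
    ("chr1", 1130, 1, 2),  # coverage below 3, dropped
]
binned = bin_cpg_calls(example_calls)
```

Pooling read counts per bin, rather than averaging per-CpG percentages, weights each CpG by its coverage; restricting to bins present in every sample would then be an intersection of the returned keys across samples.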

Identification of differentially methylated regions (DMRs)

DMRs were calculated using methylKit 4, with all pairwise comparisons between the four cell populations carried out and similar populations from different donors defined as biological replicates. The patient of origin was used as a categorical covariate to account for the strong inter-donor variability seen. All p-values were generated using a logistic regression model and corrected for multiple testing using the SLIM method 16. DMRs were defined as those genomic bins with q-values <0.05 and absolute methylation difference >10% in each pairwise comparison.

Characterization of DMRs

All genomic features were downloaded from the UCSC Table browser (genome.ucsc.edu) for the hg19 genome. Gene models: “refGene” (RefSeq Genes), CpG Islands: “cpgIslandExt”, Evolutionary conservation: “phastCons100way”, DNase hypersensitivity sites (DHSs): “wgEncodeRegDnaseClusteredV3”, transcription factor binding sites (TFBSs): “wgEncodeRegTfbsClusteredV3”, repeats: “rmsk” (RepeatMasker). Overlaps and distances of DMRs to other genomic features were calculated using BEDtools v2.26.0 17, and significance of enrichments or depletions was calculated using custom R scripts. All p-values <10^-300 were approximated to 10^-300 to avoid reaching the minimum value for a floating-point number (2.2x10^-308). Average conservation signals around DMRs were calculated using bwtool v1.0 5. P-values were calculated using a bootstrapping approach comparing the average conservation of the distal DMRs with the average of an equal number of randomly selected, non-overlapping, distal genomic bins, 1000 times. Gene ontology (GO) analysis was performed using GREAT v3.0 6, using all covered genomic bins as background and the default “Basal plus extension” association rules. Results were filtered to include only GO categories with a Benjamini-Hochberg corrected (FDR) hypergeometric test p-value <0.05 and ≥3 genes with associated regions. K-means clustering of GO categories (biological processes only) was based on information similarity values calculated using the GOSim package within R v3.3.1. Promoters frequently altered in PCa were downloaded from the review by Massie et al., 2017 1. Only promoters reported by ≥3 studies were considered frequently altered. Genome browser plots were generated using the package Sushi within R v3.3.1 and custom scripts.
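The bootstrapping comparison and the p-value floor described above can be sketched in Python as follows. This is an illustration only: the custom R scripts are not reproduced in this document, so the function names, the one-sided direction of the test and the (hits + 1)/(n + 1) estimator are assumptions.

```python
import random

P_VALUE_FLOOR = 1e-300  # smaller p-values are reported as this value

def floor_p_value(p, floor=P_VALUE_FLOOR):
    """Clamp very small p-values so they stay representable as floats."""
    return max(p, floor)

def bootstrap_p_value(observed_mean, background_scores, sample_size,
                      n_iter=1000, seed=0):
    """One-sided bootstrap p-value for an observed average conservation score.

    Draws `sample_size` background bins without replacement, `n_iter` times,
    and counts how often the random average reaches the observed average.
    The +1 correction (an assumption) avoids reporting a p-value of zero.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_iter):
        draw = rng.sample(background_scores, sample_size)
        if sum(draw) / sample_size >= observed_mean:
            hits += 1
    return (hits + 1) / (n_iter + 1)

# Toy background in which every bin has the same low conservation score
background = [0.1] * 500
p_enriched = bootstrap_p_value(0.5, background, sample_size=20)
p_not_enriched = bootstrap_p_value(0.0, background, sample_size=20)
```

With 1000 iterations the smallest p-value this estimator can return is 1/1001, so the floor matters only for the analytical (hypergeometric or Fisher's exact) tests.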

TCGA data analysis

Illumina Infinium HumanMethylation450 data generated within The Cancer Genome Atlas (TCGA) consortium 7 was downloaded (pre-processed Level 3 data only) from the NCI Genomic Data Commons website using the provided GDC Data Transfer Tool (data downloaded on 7th Dec 2016). Clinical data was downloaded from firebrowse.org (8th Dec 2016). The presence of evident batch effects was excluded by visualizing the data on TCGA Batch Effects (http://bioinformatics.mdanderson.org/tcgambatch/). A data matrix containing the beta values for each sample was generated using custom scripts. Probes were mapped to hg19 using the positions officially reported by Illumina. Overlap of array probes with DMRs was carried out using BEDtools v2.26.0. Hierarchical clustering was based on Euclidean distances of unscaled beta-values. Logistic model training using least absolute shrinkage and selection operator (LASSO) regression was performed using the glmnet package within R v3.3.1 on a random selection of 70% of the samples. 200 lambda values ranging from e^-7 to e^2 were tested and 10-fold cross validation performed. The lambda with the minimum mean cross-validated error was selected and resulted in 17 probes with non-zero coefficients. The optimal model was then tested on the remaining 30% of samples and receiver operator curve and area under the curve (AUC) calculated using the ROCR package.
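The hold-out evaluation can be sketched in plain Python. The sketch below covers only the random 70/30 split and a rank-based AUC; it does not fit the LASSO model and is not the glmnet or ROCR code used in the analysis. Function names, the seed and the toy scores are assumptions.

```python
import random

def split_train_test(sample_ids, train_fraction=0.7, seed=0):
    """Randomly assign samples to a training and a test set."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    `labels` uses 1 for cancer and 0 for normal; ties count as half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example, not TCGA data
train_ids, test_ids = split_train_test([f"sample{i}" for i in range(10)])
auc = roc_auc([1, 1, 1, 0, 0], [0.9, 0.8, 0.3, 0.4, 0.1])
```

The pairwise definition is equivalent to the area under the ROC curve and is adequate for a few hundred samples.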

Calculation of probability of a sample containing prostate cancer (Model 1 - related to Table 1).

P_prostate cancer = 1 / (1 + exp(-(sum(coefficients_probe * βval_probe) + 23.4781567769938)))

wherein

P_prostate cancer = calculated probability of a single sample of being a prostate cancer as opposed to a normal prostate sample

coefficients_probe = coefficients of the logistic models for each probe

βval_probe OR βval_cgxxxxxxx = beta values of each individual probe as determined by the Illumina Infinium HumanMethylation450K Bead Chip
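The Model 1 calculation can be sketched in Python as below. Only the constant term 23.4781567769938 comes from the formula; the probe identifiers, coefficients and beta values in the example are hypothetical placeholders, since the actual per-probe coefficients are those listed in Table 1.

```python
import math

MODEL1_INTERCEPT = 23.4781567769938  # constant term of the Model 1 formula

def prostate_cancer_probability(coefficients, beta_values,
                                intercept=MODEL1_INTERCEPT):
    """Logistic probability that a sample is prostate cancer.

    `coefficients` and `beta_values` are dicts keyed by probe ID.
    """
    missing = set(coefficients) - set(beta_values)
    if missing:
        raise KeyError(f"no beta value for probes: {sorted(missing)}")
    linear = sum(coefficients[p] * beta_values[p] for p in coefficients)
    return 1.0 / (1.0 + math.exp(-(linear + intercept)))

# Hypothetical probe IDs and coefficients, for illustration only
hypothetical_coefficients = {"cg_example_1": -20.0, "cg_example_2": -10.0}
hypothetical_betas = {"cg_example_1": 0.9, "cg_example_2": 0.5}
probability = prostate_cancer_probability(hypothetical_coefficients,
                                          hypothetical_betas)
```

A sample would then be called cancer when the returned probability reaches the cut-off chosen for the model.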

Calculation of probability of a sample containing prostate cancer (Model 2 - related to Table 2).

wherein

P_prostate cancer = calculated probability of a single sample of being a prostate cancer as opposed to a normal prostate sample

coefficients_probe = coefficients of the logistic models for each probe

βval_probe OR βval_cgxxxxxxx = beta values of each individual probe as determined by the Illumina Infinium HumanMethylation450K Bead Chip

Calculation of probability of a sample containing aggressive (extraprostatic) prostate cancer (Model 3 - related to Table 3).

wherein

P_aggressive prostate cancer = calculated probability of a single sample of being a prostate cancer that has escaped the prostate (extraprostatic) as opposed to being still contained within the prostate capsule (organ confined)

coefficients_probe = coefficients of the logistic models for each probe

βval_probe OR βval_cgxxxxxxx = beta values of each individual probe as determined by the Illumina Infinium HumanMethylation450K Bead Chip

Example 1

Phenotypically defined prostate cells from patient-matched normal and PCa samples show donor-specific DNA methylation profiles

Matched tumour-directed (cancer) and contralateral (normal) core needle biopsies (1 or 2 per site) were obtained from 4 treatment-naive prostate cancer patients undergoing radical prostatectomies. These samples were then enzymatically dissociated and labeled with antibodies against EpCAM, CD49f and CD24 to enable the prospective isolation of luminal (EpCAM+CD49f-CD24+) and basal (EpCAM+CD49f+CD24-) cells at >95% purity (Fig. 1A). EpCAM+CD49f+CD24- cells expressed higher levels of molecular markers associated with basal cells and lower levels of luminal markers compared to EpCAM+CD49f-CD24+ cells from the same biopsy, both at the mRNA and protein level (Supplementary Fig. 1A-B). For convenience, we named the paired subsets as follows: Cancer Luminal (CL) EpCAM+CD49f-CD24+ cells purified from tumour-directed biopsies; Cancer Basal (CB) EpCAM+CD49f+CD24- cells purified from tumour-directed biopsies; Normal Luminal (NL) EpCAM+CD49f-CD24+ cells from contralateral biopsies; Normal Basal (NB) EpCAM+CD49f+CD24- cells purified from contralateral biopsies. This yielded 4 CL and CB populations, and 3 matched NL and NB populations, as in one prostate the palpable tumour extended to most of the prostate and it was not possible to obtain a contralateral “normal” tissue biopsy (Supplementary Fig. 1C). DNA obtained from each of these isolates was then subjected to Reduced Representation Bisulphite Sequencing (RRBS). On average, this generated information on the DNA methylation status of >8.9x10^6 cytosines within CpG sites per sample (range 8x10^6 - 9.6x10^6, with an average coverage of 7.5 reads, Supplementary Table 1). The data was processed as described in Materials and Methods and binned into 100 bp genomic regions to maximize the comparability between samples (932,905 bins covering 4.1x10^6 CpGs in all samples).
Unsupervised hierarchical clustering of the top 1% most variable regions (bins) across all samples showed clustering primarily according to the patient of origin, rather than the subset analyzed (Fig. 1B). This indicates a high donor-determined variation in CpG methylation, consistent with previous reports of similarly accrued data 8.
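Selecting the most variable bins before clustering can be sketched as follows. The text does not state which variability measure was used, so ranking by variance across samples is an assumption, as are the function name and the toy data.

```python
from statistics import pvariance

def most_variable_bins(methylation, fraction=0.01):
    """Return the IDs of the most variable bins across samples.

    `methylation` maps a bin ID to its list of per-sample methylation values.
    Bins are ranked by population variance (an assumed measure).
    """
    n_keep = max(1, round(fraction * len(methylation)))
    ranked = sorted(methylation,
                    key=lambda b: pvariance(methylation[b]),
                    reverse=True)
    return ranked[:n_keep]

# Toy data: 99 invariant bins and one bin that differs between samples
toy = {f"bin{i}": [50.0, 50.0, 50.0, 50.0] for i in range(99)}
toy["bin_variable"] = [0.0, 100.0, 0.0, 100.0]
top_bins = most_variable_bins(toy)
```

The selected bins would then be passed to the hierarchical clustering step.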

Example 2

Distinct DNA methylation profiles in basal and luminal cells

We then calculated DMRs for all pairwise comparisons between the 4 sorted populations (Fig. 1C). Among these, the comparison between CB and NB cells (CB-NB comparison) produced the smallest number of DMRs. In contrast, a large number of DMRs were seen when either normal or cancer luminal cells were compared with either source of basal cells (i.e., NL-NB, NL-CB, CL-NB and CL-CB, Fig. 1D). Of the DMRs revealed in these latter comparisons, ~2/3 were hypermethylated in luminal cells, which correlates with the higher levels of DNMT3a seen in these cells 3. We also calculated differential methylation on single CpGs (prior to the 100 bp binning) with very similar results. Moreover, integration of the DMRs identified in NL-NB proximal (±5 kb) to annotated transcriptional start sites (TSSs) with RNA-seq data of similarly purified cells 2 showed the expected inverse correlation (Supplementary Fig. 3A).

We also found an extensive overlap in the DMRs obtained from both the NL-NB and NL-CB comparisons, and also from the CL-NB and CL-CB comparisons (Supplementary Fig. 3B-C). Accordingly, we focussed our subsequent analyses on comparisons of NL-NB and CL-CB, where cells from the same biopsy could be compared directly.

Characterization of the genomic features of the DMRs thus identified showed that >50% of them fell outside of CpG islands, shores or shelves (Fig. 1E), and >70% were >5 kb away from any annotated TSSs (Fig. 1F-G). These features were particularly pronounced (highly significant hypergeometric test) for the hypomethylated DMRs identified in the comparisons of NL-NB, CL-CB and CL-NL. Because hypermethylated and hypomethylated DMRs might be anticipated to differ in their genomic context, their impact on the biological properties of basal and luminal cells could also be different.

Example 3

Distal hypermethylated DMRs are enriched in enhancer features

Given that most of the DMRs identified were outside CpG islands and far from TSSs, we asked whether they might affect distal regulatory elements (enhancers). We therefore examined three genomic characteristics of such elements: evolutionary conservation 9, open chromatin shown by hypersensitivity to DNase I 18, and presence of TFBSs 10. Distal hypermethylated DMRs in each comparison were enriched for evolutionarily conserved sequences (Fig. 2A, bootstrapped p-value) and overlapped significantly with both DHSs and ChIP-seq-defined TFBSs (identified within the ENCODE project, Fig. 2B-C, hypergeometric test). Distal hypomethylated DMRs generally scored lower than the hypermethylated counterparts for each metric measured. DMRs hypomethylated in the CL-CB and CL-NL comparisons showed the weakest enrichments. However, all distal hypomethylated DMRs had high overlaps with genomic repetitive elements (Fig. 2D). Specifically, LINE and LTR elements, but not SINE elements, were significantly enriched in the distal CL hypomethylated regions.

GO enrichment analysis (Fig. 2E, Supplementary Fig. 4) showed that hypermethylated DMRs in NL-NB were enriched for more than 500 terms, many of which were linked to prostate development or epithelial stem cell regulation, while hypomethylated DMRs in the same comparison were enriched for terms related to androgen receptor signalling and response to cytokines. In the CL-CB comparison, hypermethylated DMRs were also enriched for more than 500 terms, 311 of which were also identified in the NL-NB comparison, suggesting a high functional overlap in hypermethylated regions in luminal cells from both normal and cancer samples. In the CL-NL comparison, hypermethylated DMRs were enriched in terms related to cell adhesion, while hypomethylated DMRs were enriched in terms related to epithelial morphogenesis. These results indicate that several pathways fundamental to the establishment and maintenance of the normal prostate epithelium are altered in cancer cells with a luminal phenotype.

Example 4

Phenotype-specific DMRs are shared in normal and cancerous prostate tissues

As suggested by the GO enrichment analyses, we found a 28% overlap in all the DMRs identified from the NL-NB and the CL-CB comparisons (3852/13816, Fisher's exact test p-value < 10^-300, Fig. 3A). Hierarchical clustering of all samples based on both sets of DMRs separated them by phenotype (Fig. 3B), reinforcing the presence of a strong phenotypic signature independent of disease state. These shared DMRs were enriched in features characteristic of enhancers (Supplementary Fig. 5A-D) and linked to GO terms related to prostate development, regulation of epithelial stem cells and androgen receptor signalling (Supplementary Fig. 5E-F). Moreover, hypermethylated DMRs were highly enriched for TFBSs of TP63, TP53 and NF1, and hypomethylated DMRs for FOXA1, p65-NFkB and GATA3 (Fig. 3C), all well-known regulators of basal and luminal epithelial cells, respectively. Interestingly, 26 of the 168 genes described as frequently differentially methylated in PCa 1 showed hyper- or hypomethylated DMRs within 5 kb of their TSSs in both the NL-NB and CL-CB comparisons (Fig. 3D). These included the frequently hypermethylated genes GSTP1 and CCDC8 (Fig. 3E-F).

In summary, these analyses identified a large set of phenotype-specific and disease-independent DMRs, both hyper- and hypomethylated, which contained many binding sites for TFs with known regulatory roles in the normal prostate.

Example 5

CL hypermethylate PRC2 target sites and hypomethylate repetitive elements

A second group of genes frequently hypermethylated in PCa were found hypermethylated in both the CL-CB and CL-NL comparisons (Fig. 4A), but not in the NL-NB comparison. These might be expected to reflect a PCa-specific methylation signature. DMRs identified in the CL-CB and CL-NL comparisons showed that many were shared (1472 DMRs, Fisher's exact test p-value < 10^-300, Fig. 4B) with very few also different between NL and NB cells (106 DMRs). 65% of these CL-specific hypermethylated DMRs were distal to TSSs and were again highly enriched for enhancer features, but significantly depleted in repetitive elements (Supplementary Fig. 6A-E). These regions were associated with GO terms related to metabolic processes, cell proliferation and epithelial development (Fig. 4C) and showed a high enrichment of DNA sequences potentially bound by EZH2 and SUZ12, two main members of the PRC2 complex (Supplementary Fig. 6F). On the other hand, distal hypomethylated DMRs were not enriched for any feature of putative regulatory regions, but significantly overlapped with LINE and LTR elements.

Since the CL subset represents the majority of the cells in untreated PCa samples, we hypothesized that aberrant methylation of these DMRs would be measurable even when whole tissue homogenates are analysed. We therefore interrogated the DNA methylation array dataset for PCa made available by the TCGA consortium, which consists of 50 PCa samples with matched normal counterparts, 452 additional PCa samples without normal counterparts, and 1 metastatic PCa sample 7. 255 array probes overlap these 1472 DMRs. Hierarchical clustering of the 50 matched normal and PCa samples showed an almost perfect subdivision based on the malignancy status of the samples (TPR = 0.92, TNR = 0.92, Chi-squared test p-value = 2.4x10^-16, Fig. 4D). The same analysis carried out on all 553 samples produced similar results, with one cluster highly enriched in normal samples (Chi-squared test p-value = 1.7x10^-39, Supplementary Fig. 6G). This clustering also appeared to divide the PCa samples into two main groups, according to their differences from the normal samples. Exclusive analysis of the cancer samples confirmed this clustering pattern (Fig. 4E) and showed one cluster to be significantly enriched for samples with extra-prostatic extensions (pT3 or pT4 in TNM classification, Chi-squared test p-value < 0.005) in the absence of significant differences in Gleason score (Chi-squared test p-value >0.1).

Overall, these results indicate that phenotypic luminal PCa cells possess an aberrant methylation signature characterized by hypermethylation of putative regulatory sequences involved in tissue development, and hypomethylation of LINE and LTR repetitive elements. This signature was also able to distinguish cancer samples from normal, and organ-confined from extra-prostatic disease.

Example 6

Identification of PCa-specific, phenotype-independent DMRs

Comparisons of the DMRs in the CL-NL and CB-NB pairs showed a small but significant overlap of both hyper- and hypomethylated DMRs in each (189 DMRs in total, Fig. 5A). These common DMRs were able to cluster all samples according to their disease state in a phenotype-independent manner (Supplementary Fig. 7A). Notably, they included DMRs close to many genes previously implicated in prostate cancer (e.g., NEAT1, MTOR, RHCG, KCNC2, WT1, HOXC12, KMT2B, Fig. 5B). To determine whether these DMRs would be altered in an independent dataset, we applied the same analysis to the TCGA dataset, where 66 array probes overlapped these 189 DMRs. Hierarchical clustering of the 50 matched normal and PCa samples produced a single cluster containing 46/50 normal samples and 10/50 PCa samples (TPR = 0.8, TNR = 0.92, Chi-squared test p-value = 1.8x10^-12, Fig. 5C). Application of the same analysis to all samples in the TCGA database produced similar results: one cluster was highly enriched in normal samples (TPR = 0.87, TNR = 0.74, Chi-squared test p-value = 8.3x10^-26, Supplementary Fig. 7B), indicating that at least some of these DMRs are frequently altered in PCa.
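The TPR and TNR reported for the matched samples follow directly from the cluster composition, as the short Python sketch below shows. Treating the cluster that holds most normal samples as the "normal" call is an interpretation of the text, and the function name is illustrative.

```python
def tpr_tnr(cancer_in_normal_cluster, total_cancer,
            normal_in_normal_cluster, total_normal):
    """Sensitivity and specificity when one cluster is read as the normal call.

    Cancer samples outside the normal cluster count as true positives;
    normal samples inside it count as true negatives.
    """
    tpr = (total_cancer - cancer_in_normal_cluster) / total_cancer
    tnr = normal_in_normal_cluster / total_normal
    return tpr, tnr

# Counts from the text: the cluster holds 46/50 normal and 10/50 PCa samples
tpr, tnr = tpr_tnr(cancer_in_normal_cluster=10, total_cancer=50,
                   normal_in_normal_cluster=46, total_normal=50)
```

This gives 40/50 for the TPR and 46/50 for the TNR, matching the reported 0.8 and 0.92.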

To select the probes most strongly associated with disease state (i.e., PCa vs normal), we trained a logistic model using LASSO regression on 70% of the TCGA samples and selected a 17-probe signature (Fig. 5D). We then tested this model on the remaining 30% of the dataset. This resulted in an AUC of 0.92 (TPR = 0.9, TNR = 0.94, Fisher's exact test p-value = 2.82x10^-12 at the selected cut-off of 0.8, Fig. 5E-F, Table 1). The 17-probe signature also included sequences proximal to several genes with recognized importance in PCa (e.g., PLAGL1/HYMAI, HOXC12, KCNC2), but was completely non-overlapping with other similar signatures recently developed for PCa 19-21, 11, 12.

Example 7

To identify two further signatures, indicative of prostate cancer and of cancer aggressiveness respectively, we again trained logistic models using LASSO regression on the 255 array probes of the TCGA dataset overlapping the 1472 DMRs found in the comparisons CL-NL and CL-CB (Figure 4).

We trained a first model using a random selection of 70% of all samples categorized as normal or PCa, and selected the 8 probes most associated with disease state. We then tested this model on the remaining 30% of the samples. This resulted in an AUC of 0.96 (TPR = 0.93, TNR = 1, Fisher's exact test p-value = 1.06x10^-12 at the selected cut-off of 0.78, Fig. 6A-C).

We trained a second model using a random selection of 70% of only the PCa samples categorized into organ confined or extraprostatic. This classification was derived from the official TNM classification given by TCGA: Extraprostatic = pT3 or pT4 / Organ Confined = pT2. The LASSO regression returned an optimal model based on 23 probes most associated with disease aggressiveness. We then tested this model on the remaining 30% of the samples. This resulted in an AUC of 0.74 (TPR = 0.64, TNR = 0.78, Fisher's exact test p-value = 4.23x10^-7 at the selected cut-off of 0.63, Fig. 6D-F).
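The labelling rule and the cut-off for the aggressiveness model can be sketched in Python as below. The mapping pT3/pT4 to extraprostatic and pT2 to organ confined, and the cut-off of 0.63, come from the text; matching sub-stages by prefix, the handling of other stages, and calling a sample positive when its probability is at or above the cut-off are assumptions.

```python
AGGRESSIVE_CUTOFF = 0.63  # cut-off selected for the 23-probe model

def tnm_category(pathological_t):
    """Map a pathological T stage to the class label used for training.

    Sub-stages (e.g. "pT3a") are matched by prefix, which is an assumption.
    Returns None for stages outside pT2-pT4.
    """
    stage = pathological_t.strip().lower()
    if stage.startswith(("pt3", "pt4")):
        return "extraprostatic"
    if stage.startswith("pt2"):
        return "organ confined"
    return None

def is_called_extraprostatic(probability, cutoff=AGGRESSIVE_CUTOFF):
    """Apply the cut-off to a model probability (>= is an assumed convention)."""
    return probability >= cutoff

labels = [tnm_category(s) for s in ("pT2c", "pT3a", "pT4", "pTX")]
calls = [is_called_extraprostatic(p) for p in (0.70, 0.50)]
```

Samples mapped to None would be excluded before the 70/30 split.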

References

1. Massie CE, Mills IG, Lynch AG. The importance of DNA methylation in prostate cancer development. J Steroid Biochem Mol Biol. 2017 Feb;166:1-15.

2. Zhang D, Park D, Zhong Y, Lu Y, Rycaj K, Gong S, et al. Stem cell and neurogenic gene-expression profiles link prostate basal cells to aggressive prostate cancer. Nat Commun. 2016 Feb 29;7:10798.

3. Pellacani D, Kestoras D, Droop A, Frame FM, Berry PA, Lawrence MG, et al. DNA hypermethylation in prostate cancer is a consequence of aberrant epithelial differentiation and hyperproliferation. Cell Death Differ. 2014 May;21(5):761-73.

4. Akalin A, Kormaksson M, Li S, Garrett-Bakelman FE, Figueroa ME, Melnick A, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol. 2012 Oct 3;13(10):R87.

5. Pohl A, Beato M. bwtool: a tool for bigWig files. Bioinformatics. 2014 Jun 1;30(11):1618-9.

6. McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, et al. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol. 2010 May;28(5):495-501.

7. Cancer Genome Atlas Research Network. The Molecular Taxonomy of Primary Prostate Cancer. Cell. 2015 Nov 5;163(4):1011-25.

8. Yu YP, Ding Y, Chen R, Liao SG, Ren B-G, Michalopoulos A, et al. Whole-genome methylation sequencing reveals distinct impact of differential methylations on gene transcription in prostate cancer. Am J Pathol. 2013 Dec;183(6):1960-70.

9. Heintzman ND, Hon GC, Hawkins RD, Kheradpour P, Stark A, Harp LF, et al. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature. 2009 May 7;459(7243):108-12.

10. Heinz S, Romanoski CE, Benner C, Glass CK. The selection and function of cell type-specific enhancers. Nat Rev Mol Cell Biol. 2015 Mar;16(3):144-54.

11. Mundbjerg K, Chopra S, Alemozaffar M, Duymich C, Lakshminarasimhan R, Nichols PW, et al. Identifying aggressive prostate cancer foci using a DNA methylation classifier. Genome Biol. BioMed Central; 2017 Jan 12;18(1):3.

12. Tang Y, Jiang S, Gu Y, Li W, Mo Z, Huang Y, et al. Promoter DNA methylation analysis reveals a combined diagnosis of CpG-based biomarker for prostate cancer. Oncotarget. Impact Journals; 2017 Aug 29;8(35):58199-209.

13. Frame FM, Pellacani D, Collins AT, Maitland NJ. Harvesting Human Prostate Tissue Material and Culturing Primary Prostate Epithelial Cells. Methods Mol Biol. 2016;1443:181-201.

14. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics. 2009;10:232.

15. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9.

16. Wang H-Q, Tuominen LK, Tsai C-J. SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics. 2011 Jan 15;27(2):225-31.

17. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2.

18. Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, et al. The accessible chromatin landscape of the human genome. Nature. 2012 Aug 29;489(7414):75-82.

19. Zhao S, Geybels MS, Leonardson A, Rubicz R, Kolb S, Yan Q, et al. Epigenome-Wide Tumor DNA Methylation Profiling Identifies Novel Prognostic Biomarkers of Metastatic-Lethal Progression in Men Diagnosed with Clinically Localized Prostate Cancer. Clin Cancer Res. American Association for Cancer Research; 2017 Jan 1;23(1):311-9.

20. Geybels MS, Zhao S, Wong C-J, Bibikova M, Klotzle B, Wu M, et al. Epigenomic profiling of DNA methylation in paired prostate cancer versus adjacent benign tissue. Prostate. 2015 Dec;75(16):1941-50.

21. Geybels MS, Wright JL, Bibikova M, Klotzle B, Fan J-B, Zhao S, et al. Epigenetic signature of Gleason score and prostate cancer recurrence after radical prostatectomy. Clin Epigenetics. BioMed Central; 2016;8(1):97.