Title:
BIOMARKERS FOR DISEASE DETECTION
Document Type and Number:
WIPO Patent Application WO/2020/194057
Kind Code:
A1
Abstract:
Provided herein are methods and kits for analyzing a sample from a subject, such as identifying a sample as benign or malignant for a disease or condition, such as cancer. The methods as described herein may enrich for an epigenetic modification in one or more fragments of a sample and sequence the enriched portion to obtain a result. The result may be input into a trained algorithm to classify the sample as benign or malignant for the disease or condition based at least in part on the presence or absence of the epigenetic modification in the one or more fragments.

Inventors:
WALKER NICOLAS (GB)
PROUTSKI VITALI (GB)
HOWELL KATE (GB)
MORGANELLA SANDRO (GB)
VASQUEZ LOUELLA (GB)
Application Number:
PCT/IB2020/000234
Publication Date:
October 01, 2020
Filing Date:
March 20, 2020
Assignee:
CAMBRIDGE EPIGENETIX LTD (GB)
International Classes:
C12Q1/6869; C12Q1/6806; C12Q1/6886
Domestic Patent References:
WO2016008451A1 2016-01-21
WO2019035100A2 2019-02-21
WO2019006269A1 2019-01-03
WO2017176630A1 2017-10-12
WO2014011928A1 2014-01-16
Other References:
BRINKMAN A B ET AL: "Whole-genome DNA methylation profiling using MethylCap-seq", METHODS, ACADEMIC PRESS, NL, vol. 52, no. 3, 11 June 2010 (2010-06-11), pages 232 - 236, XP027456648, ISSN: 1046-2023, [retrieved on 20100611], DOI: 10.1016/J.YMETH.2010.06.012
BRUTLAG ET AL., COMP. APP. BIOSCI., vol. 6, 1990, pages 237 - 245
Claims:
CLAIMS

WHAT IS CLAIMED IS:

1. A method comprising:

(a) enriching a cell-free DNA sample obtained from a subject for one or more fragments comprising an epigenetic modification; and

(b) quantifying a plurality of cell-free DNA sequencing reads corresponding to at least a portion of a region of a genome of said subject and a plurality of genomic DNA sequencing reads corresponding to said region of said genome,

wherein said plurality of cell-free DNA sequencing reads are produced by sequencing said one or more fragments, and wherein said plurality of genomic DNA sequencing reads are (i) produced by sequencing at least a portion of genomic DNA from a genomic DNA sample obtained from said subject or (ii) based on a reference.

2. The method of claim 1, wherein said region of said genome comprises a gene body, an intergenic region, an enhancer, or any combination thereof.

3. The method of claim 1, wherein said region of said genome comprises a base.

4. The method of claim 1, wherein said region of said genome comprises a plurality of regions in said genome.

5. The method of any preceding claim, further comprising identifying a tissue of origin of at least a portion of said cell-free DNA sample.

6. The method of claim 5, wherein said identifying of said tissue of origin is based at least in part on said plurality of genomic DNA sequencing reads.

7. The method of any preceding claim, further comprising: measuring a presence or an absence of a peak of said cell-free DNA sequencing reads at said region of said genome, wherein said peak of said cell-free DNA sequencing reads comprises at least about 50% overlap with a peak of said genomic DNA sequencing reads at said region of said genome.

8. The method of claim 7, wherein said overlap is at least about 60%.

9. The method of claim 7, wherein said presence or said absence of said peak comprises a level of said peak.

10. The method of claim 7, further comprising measuring said presence or said absence of said peak of said cell-free DNA sequencing reads in a reference.

11. The method of claim 10, wherein said reference comprises DNA obtained from a healthy subject.

12. The method of any preceding claim, wherein said sequencing comprises nanopore sequencing.

13. The method of any preceding claim, wherein said sequencing comprises high throughput sequencing.

14. The method of any preceding claim, wherein said cell-free DNA sample is obtained from a blood sample.

15. The method of claim 14, wherein said cell-free DNA sample comprises fetal DNA, maternal DNA, or a combination thereof.

16. The method of any preceding claim, wherein said genomic DNA sample is obtained from a tissue sample.

17. The method of claim 16, wherein said tissue sample comprises a tissue biopsy, a buccal swab, a fine needle aspirate, or any combination thereof.

18. The method of any preceding claim, wherein said cell-free DNA sample and said genomic DNA sample are obtained from a same sample.

19. The method of any preceding claim, wherein said subject has or is suspected of having a disease or condition.

20. The method of claim 19, wherein said subject has or is suspected of having a cancer.

21. The method of any preceding claim, further comprising identifying a presence or an absence of a disease or a condition in said subject.

22. The method of claim 21, wherein said presence or said absence of said disease or a condition is based at least in part on said presence or said absence of said peak of said cell-free DNA sequencing reads.

23. The method of claim 21, wherein said disease or condition comprises a cancer.

24. The method of claim 23, wherein said cancer comprises a colorectal cancer, a breast cancer, a lung cancer, a prostate cancer, or any combination thereof.

25. The method of claim 23, wherein said identifying comprises distinguishing a cancer type from a plurality of said cancer types.

26. The method of claim 25, wherein said distinguishing comprises a specificity of at least about 80%.

27. The method of claim 25, wherein said distinguishing comprises a sensitivity of at least about 80%.

28. The method of claim 21, further comprising identifying a stage of said disease or condition.

29. The method of claim 28, wherein said disease or condition comprises a cancer, and wherein said stage comprises stage I, stage II, stage III, or stage IV.

30. The method of any preceding claim, wherein said epigenetic modification comprises a plurality of epigenetic modifications.

31. The method of any preceding claim, wherein said epigenetic modification comprises a methylated base, a hydroxymethylated base, a formylated base, a carboxylated base, or any combination thereof.

32. The method of claim 31, wherein said epigenetic modification comprises said hydroxymethylated base.

33. The method of claim 32, wherein said hydroxymethylated base comprises a 5-hydroxymethylated cytosine.

34. The method of any preceding claim, wherein said enriching comprises (a) hybridizing a substantially complementary strand to a fragment of said one or more fragments comprising said epigenetic modification and (b) amplifying said substantially complementary strand in a reaction in which said fragment is substantially not present.

35. The method of claim 34, wherein said hybridizing is performed before said enriching of (a).

36. The method of claim 34, further comprising separating said substantially complementary strand from said fragment.

37. The method of claim 34, wherein said substantially complementary strand hybridizes to said fragment under stringent hybridization conditions.

38. The method of any preceding claim, wherein said quantifying of (b) comprises employing a machine learning algorithm.

39. The method of claim 38, wherein said machine learning algorithm classifies said cell-free DNA sample from said subject as benign or malignant for a cancer at greater than about 80% sensitivity.

40. The method of claim 38, wherein said machine learning algorithm classifies said cell-free DNA sample from said subject as benign or malignant for a cancer at greater than about 80% specificity.

41. The method of claim 38, wherein said machine learning algorithm is trained using a training set of samples.

42. The method of claim 41, wherein said training set of samples comprises cell-free DNA samples.

43. The method of claim 41, wherein said training set of samples comprises cell-free DNA samples and genomic DNA samples.

44. The method of claim 41, wherein said training set of samples comprises a sample having a sequence comprising a CpG island.

45. The method of claim 41, wherein said training set of samples comprises a combination of malignant samples and benign samples.

46. The method of claim 41, wherein when said machine learning algorithm identifies said cell-free DNA sample as benign for said disease or condition, assaying a second sample from said subject to monitor a change over time.

47. The method of any preceding claim, further comprising selecting a treatment for said subject.

48. The method of any preceding claim, wherein said subject is suspected of having a disease or condition.

49. The method of any preceding claim, wherein said subject is asymptomatic for a disease or condition.

50. The method of any preceding claim, wherein said subject has not previously been diagnosed with a disease or condition.

51. The method of any preceding claim, wherein said subject has cancer.

52. The method of any preceding claim, wherein said cell-free DNA sample is identified as benign for a disease or condition in an absence of said subject having a further medical procedure.

53. The method of claim 52, wherein said further medical procedure comprises: obtaining a biopsy from said subject, performing an imaging scan of said subject, or a combination thereof.

54. The method of any preceding claim, wherein the method is performed in an absence of a screening procedure.

55. The method of claim 54, wherein said screening procedure comprises a genetic mutation screen, a colonoscopy, an assay performed on a stool sample provided by said subject, a sigmoidoscopy, or any combination thereof, a mammograph, an MRI, a breast self-examination, prostate specific antigen (PSA) assay, rectal examination, a bronchial lavage, chest x-ray, CT scan, or any combination thereof.

Description:
BIOMARKERS FOR DISEASE DETECTION

CROSS-REFERENCE

[0001] This application claims the benefit of U.S. Provisional Application No. 62/822,249, filed March 22, 2019, which is incorporated by reference herein in its entirety.

SUMMARY

[0002] The methods and kits as described herein may provide identification of samples from a subject as benign or malignant for a cancer. This method may be an improvement in the field of analyzing samples from a subject.

[0003] An aspect of the present disclosure provides a method, such as a method of preparing a sample for a diagnostic analysis. The method can comprise: (a) enriching a cell-free DNA sample obtained from a subject for one or more fragments comprising an epigenetic modification; and (b) quantifying a plurality of cell-free DNA sequencing reads corresponding to at least a portion of a region of a genome of the subject and a plurality of genomic DNA sequencing reads corresponding to the region of the genome, wherein the plurality of cell-free DNA sequencing reads are produced by sequencing the one or more fragments, and wherein the plurality of genomic DNA sequencing reads are (i) produced by sequencing at least a portion of genomic DNA from a genomic DNA sample obtained from the subject or (ii) based on a reference. In some embodiments, the region of the genome can comprise a gene body, an intergenic region, an enhancer, or any combination thereof. In some embodiments, the region of the genome comprises a base. In some embodiments, the region of the genome comprises a plurality of regions in the genome. In some embodiments, the method can further comprise identifying a tissue of origin of at least a portion of the cell-free DNA sample. In some embodiments, the identifying of the tissue of origin is based at least in part on the plurality of genomic DNA sequencing reads. In some embodiments, the method can further comprise: measuring a presence or an absence of a peak of the cell-free DNA sequencing reads at the region of the genome, wherein the peak of the cell-free DNA sequencing reads comprises at least about 50% overlap with a peak of the genomic DNA sequencing reads at the region of the genome. In some embodiments, the overlap is at least about 60%. In some embodiments, the presence or the absence of the peak comprises a level of the peak. In some embodiments, the method can further comprise measuring the presence or the absence of the peak of the cell-free DNA sequencing reads in a reference. In some embodiments, the reference comprises DNA obtained from a healthy subject. In some embodiments, the sequencing comprises nanopore sequencing. In some embodiments, the sequencing comprises high throughput sequencing.
In some embodiments, the cell-free DNA sample is obtained from a blood sample. In some embodiments, the cell-free DNA sample comprises fetal DNA, maternal DNA, or a combination thereof. In some embodiments, the genomic DNA sample is obtained from a tissue sample. In some embodiments, the tissue sample comprises a tissue biopsy, a buccal swab, a fine needle aspirate, or any combination thereof. In some embodiments, the cell-free DNA sample and the genomic DNA sample are obtained from a same sample. In some embodiments, the subject has or is suspected of having a disease or condition. In some embodiments, the subject has or is suspected of having a cancer. In some embodiments, the method can comprise identifying a presence or an absence of a disease or a condition in the subject. In some embodiments, the presence or the absence of the disease or a condition is based at least in part on the presence or the absence of the peak of the cell-free DNA sequencing reads. In some embodiments, the disease or condition comprises a cancer. In some embodiments, the cancer can comprise a colorectal cancer, a breast cancer, a lung cancer, a prostate cancer, or any combination thereof.

In some embodiments, the identifying comprises distinguishing a cancer type from a plurality of cancer types. In some embodiments, the distinguishing can comprise a specificity of at least about 80%. In some embodiments, the distinguishing can comprise a sensitivity of at least about 80%. In some embodiments, the method can further comprise identifying a stage of the disease or condition. In some embodiments, the disease or condition comprises a cancer, and wherein the stage comprises stage I, stage II, stage III, or stage IV. In some embodiments, the epigenetic modification comprises a plurality of epigenetic modifications. In some embodiments, the epigenetic modification can comprise a methylated base, a hydroxymethylated base, a formylated base, a carboxylated base, or any combination thereof. In some embodiments, the epigenetic modification comprises the hydroxymethylated base. In some embodiments, the hydroxymethylated base comprises a 5-hydroxymethylated cytosine. In some embodiments, the enriching comprises (a) hybridizing a substantially complementary strand to a fragment of the one or more fragments comprising the epigenetic modification and (b) amplifying the substantially complementary strand in a reaction in which the fragment is substantially not present. In some embodiments, the hybridizing is performed before the enriching of (a). In some embodiments, the method can further comprise separating the substantially complementary strand from the fragment. In some embodiments, the substantially complementary strand hybridizes to the fragment under stringent hybridization conditions. In some embodiments, the quantifying of (b) comprises employing a machine learning algorithm. In some embodiments, the machine learning algorithm classifies the cell-free DNA sample from the subject as benign or malignant for a cancer at greater than about 80% sensitivity.
In some embodiments, the machine learning algorithm classifies the cell-free DNA sample from the subject as benign or malignant for a cancer at greater than about 80% specificity. In some embodiments, the machine learning algorithm is trained using a training set of samples. In some embodiments, the training set of samples comprises cell-free DNA samples. In some embodiments, the training set of samples comprises cell-free DNA samples and genomic DNA samples. In some embodiments, the training set of samples comprises a sample having a sequence comprising a CpG island. In some embodiments, the training set of samples comprises a combination of malignant samples and benign samples. In some embodiments, when the machine learning algorithm identifies the cell-free DNA sample as benign for the disease or condition, the method further comprises assaying a second sample from the subject to monitor a change over time. In some embodiments, the method can further comprise selecting a treatment for the subject. In some embodiments, the subject is suspected of having a disease or condition. In some embodiments, the subject is asymptomatic for a disease or condition. In some embodiments, the subject has not previously been diagnosed with a disease or condition. In some embodiments, the subject has cancer. In some embodiments, the cell-free DNA sample is identified as benign for a disease or condition in an absence of the subject having a further medical procedure. In some embodiments, the further medical procedure can comprise: obtaining a biopsy from the subject, performing an imaging scan of the subject, or a combination thereof. In some embodiments, the method is performed in an absence of a screening procedure.
In some embodiments, the screening procedure can comprise a genetic mutation screen, a colonoscopy, an assay performed on a stool sample provided by the subject, a sigmoidoscopy, or any combination thereof, a mammograph, an MRI, a breast self-examination, prostate specific antigen (PSA) assay, rectal examination, a bronchial lavage, chest x-ray, CT scan, or any combination thereof.

INCORPORATION BY REFERENCE

[0004] All publications, patents, and patent applications herein are incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede or take precedence over any such contradictory material.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] The novel features herein are set forth with particularity in the appended claims. A better understanding of the features and advantages herein will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles herein are utilized, and the accompanying drawings (also “figure” and “FIG.” herein), of which:

[0006] FIG. 1 shows an ARCH4 representation of RNA-seq profiles in normal tissue, showing relatedness of organ tissue samples that have high incidence or prevalence rates.

[0007] FIG. 2 shows an extracted panel 3(G) from Li, et al. (2017), showing predictive capacity of CRC cfDNA classifier over several tissue types.

[0008] FIG. 3 shows a bar chart showing distribution of gender across cancer indications.

[0009] FIG. 4 shows a box plot showing distribution of age across cancer indications demonstrating lower patient ages in breast cancer.

[0010] FIG. 5 and FIG. 6 show the first two principal components in PCA of the Quality Control (QC) metrics for Pulldown and Input libraries.

[0011] FIG. 7 and FIG. 8 show the number of peaks vs the average enrichment in the peak, for two peak callers MACS2 (FIG. 7), and EPIC (FIG. 8).

[0012] FIG. 9A-9B show key Pulldown Quality Control (QC) metrics for the Tissue Map Project.

[0013] FIG. 10A-10B and FIG. 11A-11B show additional QC metrics that indicate the quality of the sample.

[0014] FIG. 12A-12C show additional QC metrics that indicate the quality of the input library.

[0015] FIG. 13A-13C show PCA plots of Tumor gene body profiles.

[0016] FIG. 14 shows Breast - Tumor v. Three.

[0017] FIG. 15 shows Breast - Tumor v. Normal.

[0018] FIG. 16 shows Breast - Tumor v. Normal+3.

[0019] FIG. 17 shows Colorectal - Tumor v. Three.

[0020] FIG. 18 shows Colorectal - Tumor v. Normal.

[0021] FIG. 19 shows Colorectal - Tumor v. Normal+3.

[0022] FIG. 20 shows Prostate - Tumor v. Three.

[0023] FIG. 21 shows Prostate - Tumor v. Normal.

[0024] FIG. 22 shows Prostate - Tumor v. Normal+3.

[0025] FIG. 23 shows Lung - Tumor v. Three.

[0026] FIG. 24 shows Lung - Tumor v. Normal.

[0027] FIG. 25 shows Lung - Tumor v. Normal+3.

[0028] FIG. 26A-26B shows example genes, RP11-404P21.8 and FAT1, with high specificity in colorectal cancer tumors relative to the three other tumor types.

[0029] FIG. 27 shows a Number of Narrow Peaks called in Tissue Map II.

[0030] FIG. 28 shows a Number and Length of Broad Peaks in Tissue Map II.

[0031] FIG. 29 shows a Hierarchical clustering of peak set over all samples in the Tissue Map II project.

[0032] FIG. 30 shows a Hierarchical clustering of peak set over all samples in the Tissue Map II and Tissue Map I projects.

[0033] FIG. 31A-31F through FIG. 38A-38C show examples of the identification of CRC-specific gDNA peaks and the same discriminatory regions in cfDNA profiles.

[0034] FIG. 39 shows 110 total colorectal cancer (CRC) and healthy volunteer (HV) plasma samples processed through the HMCP v2 protocol.

[0035] FIG. 40 shows the HMCP-110 protocol overview.

[0036] FIG. 41 shows one example of the 5-hydroxymethylcytosine (5-hmC) Pulldown Label Copy Enrich (HMCP_LCE) method detailed herein.

[0037] FIG. 42 shows one example of the 5-hmC Pulldown Copy Label Enrich (HMCP_CLE) method detailed herein.

[0038] FIG. 43 shows one example of the 5-hmC Pulldown Label Random prime Enrich (HMCP_LRE) method detailed herein.

[0039] FIG. 44 shows one example of the 5-hmC Pulldown Random primer Label Enrich (HMCP_RLE) method detailed herein.

[0040] FIG. 45 shows one example of the 5-hmC Pulldown Label Loci Specific Enrich (HMCP_LLSE) method detailed herein.

[0041] FIG. 46 shows one example of the 5-hmC Pulldown Loci Specific Label Enrich (HMCP_LSLE) method detailed herein.

[0042] FIG. 47 shows an example of the methods as described herein including obtaining a cfDNA sample and a gDNA sample and performing sequencing on the cfDNA sample and the gDNA sample. In some cases, a plasma sample can be obtained and cfDNA can be extracted from the plasma sample. Library preparation, PCR, and sequencing can be performed on the extracted sample. A tissue sample can also be obtained and gDNA can be extracted from the tissue sample. In some cases, the extracted gDNA may be sheared into smaller fragment sizes. Library preparation, PCR, and sequencing can be performed on the gDNA sample. Sequencing of cfDNA and gDNA can be performed in separate containers. Sequencing of cfDNA and gDNA can be performed in a single container.

[0043] FIG. 48A-B shows sequencing depth achieved for Tissue Map phase I libraries.

[0044] FIG. 49 shows enrichment of 100bp 2hmC spike-ins from the pulldown libraries of Tissue Map phase I.

[0045] FIG. 50 shows genebody, genehancer, and peak data.

[0046] FIG. 51A-B show development of a data driven 5hmC peak reference set from HMCP110 dataset.

[0047] FIG. 52 shows additional versions of data sets produced applying a stricter read count filter on both gene bodies, genehancers and peak data.

[0048] FIG. 53A-C show data to address cross-reactivity using the 5hmC Tissue Map Project.

[0049] FIG. 54 shows peak data.

[0050] FIG. 55 demonstrates that adjacent normal tissue samples cluster closer to cancer than healthy tissues, suggesting there may be an epigenetic cancer field effect.

[0051] FIG. 56A-B shows 5hmC genebody and peak data can distinguish between tumor types.

[0052] FIG. 57A-B shows gDNA and cfDNA peaks discriminate between cancer types.

[0053] FIG. 58A-B shows cancer-specific peaks were identified in gDNA.

[0054] FIG. 59 shows a pairwise approach to identify 135 CRC-specific peaks from tissue map cfDNA data.

[0055] FIG. 60 shows an updated version of FIG. 58A showing the cancer-specific peaks identified in gDNA.

DETAILED DESCRIPTION

[0056] While various embodiments have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It should be understood that various alternatives to the embodiments herein may be employed.

[0057] Methods as described herein may include methods of preparing a sample for diagnostic testing. Methods may include methods of analyzing a sample. Methods may include methods of detecting one or more features in a sample. Methods may include enriching a portion of a sample comprising one or more features. Methods may not include an enrichment. Methods may include enriching a sample comprising one or more nucleic acids for one or more features, such as an epigenetic modification, a variant (such as a single nucleotide polymorphism), a mutation, a chemical modification, or any combination thereof. Methods may include analysis, or sample preparing or detection in a DNA or RNA sample. In some cases, analysis, detection, enrichment, sample preparation or others may be performed on a sample, a portion of a sample, or on a plurality of samples. Samples may comprise nucleic acids, such as DNA or RNA.

Samples can comprise cell-free DNA, genomic DNA, or a combination thereof. A first sample can comprise cell-free DNA and a second sample can comprise genomic DNA.

[0058] Methods may include sequencing one or more fragments of a nucleic acid sequence. One or more sequencing reads may be aligned to at least a portion of a region of a genome. A presence or absence of sequencing reads may be quantified, such as at said portion. A number or amount of sequencing reads may be quantified, such as at said portion. A presence or an absence of a peak of sequencing reads may be quantified, such as at said portion. A peak may at least in part be a function of a number of substantially the same sequencing reads of substantially the same portion of the region of the genome. Methods may include comparing an overlap of one or more peaks, such as at a portion of a region of a genome. For example, a first peak of cell-free DNA may overlap with a second peak of genomic DNA at a portion of a region of a genome. A first peak of genomic DNA may overlap with a second peak of cell-free DNA at a portion of a region of a genome. An overlap of one or more peaks may be at least about: 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95% or more. An overlap of one or more peaks may be from about 40% to about 95%. An overlap of one or more peaks may be from about 50% to about 95%. An overlap of one or more peaks may be from about 60% to about 95%. An overlap of one or more peaks may be from about 70% to about 95%. An overlap of one or more peaks may be from about 80% to about 95%. A presence or an absence of sequencing reads may be quantified at a region of a genome. One or more sequencing reads may be aligned to at least a portion of a region of a genome. Methods may include quantifying a plurality of sequencing reads, in one or more samples, such as a cell-free DNA sample and a genomic DNA sample. Quantification of a plurality of sequencing reads may occur at a region of the genome, such as a single base or spanning across a genomic region such as a portion of a gene body, an intergenic region, a gene enhancer, or any combination thereof. 
Quantification of a plurality of sequencing reads may occur at multiple regions of a genome to form a signature or panel of regions.
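By way of a non-limiting illustration, the percent overlap between a cell-free DNA peak and a genomic DNA peak could be computed as sketched below. The disclosure does not fix the denominator of the overlap; this sketch assumes it is the length of the cell-free DNA peak. The names Peak and overlap_fraction, and the half-open coordinate convention, are hypothetical and chosen only for this example.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    """A called peak on one chromosome (hypothetical container for this sketch)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def overlap_fraction(cf_peak: Peak, g_peak: Peak) -> float:
    """Fraction of the cfDNA peak's length that is covered by the gDNA peak.

    Assumption: overlap is measured relative to the cfDNA peak; other
    denominators (gDNA peak length, union length) are equally plausible.
    """
    if cf_peak.chrom != g_peak.chrom:
        return 0.0
    length = cf_peak.end - cf_peak.start
    shared = min(cf_peak.end, g_peak.end) - max(cf_peak.start, g_peak.start)
    if length <= 0 or shared <= 0:
        return 0.0
    return shared / length


cf = Peak("chr1", 1000, 1200)
g = Peak("chr1", 1100, 1500)
# 100 shared bases out of a 200-base cfDNA peak
print(overlap_fraction(cf, g))  # 0.5
```

Under this assumption, an overlap of "at least about 50%" corresponds to overlap_fraction returning 0.5 or more.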

[0059] Cell-free DNA may be circulating DNA found in a blood sample of a subject.

Methods to identify the tissue of origin of the cell-free DNA may be advantageous. Assaying both a cell-free DNA sample and a genomic DNA sample at a region of a genome may provide for regions of the genome with diagnostic potential that may be tissue specific. For example, analysis of genomic DNA may correlate with a tissue of origin and analysis of a cell-free DNA (such as cell-free DNA having epigenetic modifications or others such as variants or mutations) may correlate with a presence or an absence of a disease or condition. Methods as described herein that combine sample preparation or analysis of both cell-free DNA and genomic DNA may provide a panel of regions in a genome or a panel of tissue-specific markers for the disease or condition. Methods as described herein may reduce off-targeting effects, for example, methods may distinguish one cancer from a plurality of cancers, such as distinguishing a presence or an absence of a breast cancer from a uterine or ovarian cancer. Methods may include comparing an overlap of one or more peaks, such as at a portion of a region of a genome. For example, a first peak of cell-free DNA may overlap with a second peak of genomic DNA at a portion of a region of a genome. A first peak of genomic DNA may overlap with a second peak of cell-free DNA at a portion of a region of a genome.

[0060] Methods may include enriching a sample obtained from a subject for one or more fragments comprising an epigenetic modification. In some cases, samples may be enriched for one or more variants or one or more mutations, epigenetic modifications, or any combination thereof. Methods may include quantifying a plurality of sequencing reads corresponding to at least a portion of a region of a genome of a subject. Sequencing reads may be produced by sequencing one or more fragments, such as nucleic acid fragments. Quantifying a plurality of sequencing reads may be based on a reference, such as a database of sequencing reads, such as sequencing reads obtained from a healthy subject, a subject diagnosed with a disease or condition, or a combination thereof.
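As a non-limiting sketch of the quantifying step, aligned reads overlapping a region could be counted and scaled by library size so that a sample can be compared with a reference. The interval representation, the function names count_reads_in_region and counts_per_million, and the use of counts-per-million normalization are assumptions made for this example and are not specified by the disclosure.

```python
from typing import Iterable, Tuple

# (chromosome, start, end) with half-open coordinates; hypothetical convention
Interval = Tuple[str, int, int]


def count_reads_in_region(reads: Iterable[Interval], region: Interval) -> int:
    """Count aligned reads that overlap the region by at least one base."""
    chrom, start, end = region
    return sum(1 for c, s, e in reads if c == chrom and s < end and e > start)


def counts_per_million(count: int, total_reads: int) -> float:
    """Scale a raw count by library size (assumed normalization, not from the source)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return count * 1_000_000 / total_reads


sample_reads = [
    ("chr1", 100, 250),
    ("chr1", 240, 390),
    ("chr1", 900, 1050),
    ("chr2", 100, 250),
]
region = ("chr1", 200, 400)

raw = count_reads_in_region(sample_reads, region)
print(raw)                                         # 2
print(counts_per_million(raw, len(sample_reads)))  # 500000.0
```

The same two functions could be applied to reads from a reference, for example reads obtained from a healthy subject, and the normalized values compared region by region.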

[0061] A region can comprise any portion of a genome. A region can comprise a base. A region can comprise a plurality of bases. A region can comprise a gene body, an intergenic region, an enhancer, or any combination thereof. A region can comprise from about 1 to about 200 basepairs. A region can comprise from about 1 to about 100 basepairs. A region can comprise from about 1 to about 50 basepairs. A region can comprise a plurality of regions, such as a first region associated with a first gene and a second region associated with a second gene.

A region can comprise a panel of regions or a signature of regions that may be specific for a tissue type or a disease type.

[0062] Methods may include identifying a tissue of origin of a sample, such as a cell-free DNA sample. Identification of a tissue of origin may be based at least in part on a plurality of genomic DNA sequencing reads. For example, a plurality of cell-free DNA sequencing reads may form a peak at a region in a genome. The peak may at least partially overlap with a peak formed by genomic DNA sequencing reads at the region in the genome. Overlapping peaks may indicate a tissue of origin of the cell-free DNA sample. Two peaks may overlap by at least about: 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95% or more.
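One possible way to turn overlapping peaks into a tissue-of-origin call is sketched below, purely for illustration. Each candidate tissue is scored by the number of cell-free DNA peaks that overlap one of its genomic DNA peaks by at least a threshold, and the highest-scoring tissue is reported. The scoring rule, the function name score_tissues, the 0.5 default threshold, and the example coordinates are all assumptions; the disclosure states only that overlapping peaks may indicate a tissue of origin.

```python
from typing import Dict, List, Tuple

Peak = Tuple[str, int, int]  # (chromosome, start, end), half-open; hypothetical


def overlap_fraction(a: Peak, b: Peak) -> float:
    """Fraction of peak a covered by peak b (assumed definition of overlap)."""
    if a[0] != b[0] or a[2] <= a[1]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return max(shared, 0) / (a[2] - a[1])


def score_tissues(
    cf_peaks: List[Peak],
    tissue_peaks: Dict[str, List[Peak]],
    min_overlap: float = 0.5,
) -> Dict[str, int]:
    """Count, per tissue, the cfDNA peaks matching at least one gDNA peak."""
    return {
        tissue: sum(
            1
            for cf in cf_peaks
            if any(overlap_fraction(cf, g) >= min_overlap for g in g_peaks)
        )
        for tissue, g_peaks in tissue_peaks.items()
    }


# Made-up coordinates for illustration only
cf_peaks = [("chr1", 100, 300), ("chr2", 500, 700), ("chr3", 50, 150)]
tissue_peaks = {
    "colon": [("chr1", 150, 400), ("chr2", 480, 720)],
    "lung": [("chr3", 140, 300)],
}

scores = score_tissues(cf_peaks, tissue_peaks)
best = max(scores, key=scores.get)
print(scores)  # {'colon': 2, 'lung': 0}
print(best)    # colon
```

A simple count is used here for readability; a weighted score or a trained classifier could be substituted without changing the overall idea.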

[0063] Methods may include sequencing nucleic acid sequences. Sequencing may comprise nanopore sequencing or high throughput sequencing. Sequencing analysis may include aligning sequencing reads to a portion of a genome. Methods may include measuring a presence or an absence of a peak of sequencing reads, such as cell-free DNA sequencing reads or genomic DNA sequencing reads, or reference sequencing reads. A presence or an absence of a peak may comprise a level of a peak, a height of a peak, an area under a curve of the peak, a width of the peak (such as a narrow peak or a broad peak), or any combination thereof. A presence or an absence of a peak may be indicative of a disease or condition.
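For illustration only, the peak properties named above (level or height, area under the curve, and width) could be derived from per-base read depth across a region as follows. The function name peak_features, the depth threshold, and the use of a simple sum as a discrete stand-in for area under the curve are assumptions for this sketch, not part of the disclosure.

```python
from typing import Dict, Sequence, Union


def peak_features(
    coverage: Sequence[int], threshold: int = 1
) -> Dict[str, Union[int, bool]]:
    """Summarize a candidate peak from per-base read depth.

    height:  maximum depth in the region
    area:    sum of per-base depth (discrete approximation of area under the curve)
    width:   number of positions with depth at or above the threshold
    present: whether any position reaches the threshold
    """
    height = max(coverage, default=0)
    area = sum(coverage)
    width = sum(1 for depth in coverage if depth >= threshold)
    return {
        "height": height,
        "area": area,
        "width": width,
        "present": height >= threshold,
    }


depth = [0, 2, 5, 9, 5, 2, 0]  # made-up per-base depth across a region
features = peak_features(depth, threshold=3)
print(features)
# {'height': 9, 'area': 23, 'width': 3, 'present': True}
```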

[0064] Cell-free DNA may be obtained from a sample, such as a blood sample or a tissue sample. Cell-free DNA may comprise fetal DNA or maternal DNA or combination thereof. A cell-free DNA may comprise bacterial DNA, viral DNA, plant DNA or any combination thereof. Cell-free DNA may comprise a portion of genomic DNA. Genomic DNA may be obtained from a tissue sample. A tissue sample may comprise a tissue biopsy, a buccal swab, a fine needle aspirate, or any combination thereof. A cell-free DNA sample may be obtained from a same sample as a genomic DNA sample. A cell-free DNA sample may be obtained from a sample that may be different from a sample from which a genomic DNA sample is obtained.

[0065] Methods as described herein may include sample preparation or analysis of a sample obtained from a subject. The subject may be a subject that has or is suspected of having a disease or condition. The subject may have a cancer. The subject may be suspected of having a cancer. The subject may be a healthy subject. The subject may be asymptomatic for the cancer. The subject may have received a previous diagnosis. The subject may not have received a previous diagnosis. Methods may include identifying a presence or an absence of a disease or condition in the subject - in some cases, based at least in part on quantifying a plurality of sequencing reads. The disease or condition may be cancer, such as a colorectal cancer, a breast cancer, a lung cancer, a prostate cancer, a uterine cancer, an ovarian cancer, a skin cancer, a thyroid cancer, a pancreatic cancer, a bone cancer, or any combination thereof. Methods may include distinguishing a cancer type from a plurality of cancer types. Measuring a presence or absence of a peak of sequencing reads may distinguish a cancer type from another cancer type.

A panel of peaks at a plurality of regions in a genome may be tissue-specific, such as distinguishing a breast cancer from an ovarian cancer. Methods may include identifying a stage of a disease or condition, such as a stage I, II, III, IV of a cancer. Identifying (i) a presence or an absence of a disease or condition, (ii) identifying a stage of a cancer, (iii) distinguishing a cancer from a plurality of cancers, or (iv) a combination thereof may comprise a specificity of at least about: 70%, 75%, 80%, 85%, 90%, 95% or more. Identifying (i) a presence or an absence of a disease or condition, (ii) identifying a stage of a cancer, (iii) distinguishing a cancer from a plurality of cancers, or (iv) a combination thereof may comprise a sensitivity of at least about: 70%, 75%, 80%, 85%, 90%, 95% or more.

[0066] One or more fragments of a nucleic acid sequence may be enriched. Enrichment may be based at least in part on a presence of at least one epigenetic modification in the sequence, or a plurality of epigenetic modifications in the sequence. An epigenetic modification can comprise mC, hmC, caC, fC, or any combination thereof. Enrichment may be based in part on a presence or an absence of an epigenetic marker, presence or an absence of a variant, a mutation, or other genetic marker, or any combination thereof.

[0067] Methods as described herein may include identifying a disease or condition, such as a cancer. The cancer may be any type of cancer such as a breast cancer, a prostate cancer, a colon cancer, or a lung cancer. Methods may include distinguishing one disease from a different disease. Methods may include identifying a disease subtype or stage of disease. Methods may include identifying a risk of disease occurrence. Methods may include identifying a disease or condition with at least: 90% specificity, 90% sensitivity, 90% accuracy, or any combination thereof. Methods to distinguish one disease from a different disease may include identifying unique non-overlapping feature spaces between the two diseases, such as unique epigenetic peaks, gene bodies, genehancers, or any combination thereof. For example, gene expression data, sequence variant data, or epigenetic peak data that overlaps between two or more diseases may be filtered out or removed.
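For illustration only, filtering out features that overlap between two diseases might be sketched as a set difference, as below. The feature identifiers are hypothetical, and treating features as exact-match labels (rather than, for example, overlapping genomic intervals) is a simplifying assumption.

```python
def unique_feature_spaces(features_a, features_b):
    """Return the features unique to each disease after removing shared ones."""
    set_a, set_b = set(features_a), set(features_b)
    shared = set_a & set_b
    return set_a - shared, set_b - shared


# Hypothetical epigenetic peak identifiers for two diseases.
breast_peaks = ["peak_1", "peak_2", "peak_3"]
ovarian_peaks = ["peak_2", "peak_4"]
unique_breast, unique_ovarian = unique_feature_spaces(breast_peaks, ovarian_peaks)
print(sorted(unique_breast))   # ['peak_1', 'peak_3']
print(sorted(unique_ovarian))  # ['peak_4']
```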

[0068] Methods may identify a disease with at least: 90% specificity, 90% sensitivity, and 90% clinical accuracy by assaying for epigenetic peak data. Clinical accuracy, specificity, sensitivity, or a combination thereof may be improved by including gene bodies data, genehancer data, or a combination thereof.

[0069] Methods may include identifying an epigenetic-based disease field effect. A field effect may include detectable changes in normal tissue that may be adjacent or proximal to a diseased tissue. Assaying a sample from normal tissue may permit positive diagnosis of a presence of the diseased tissue. In some cases, assaying normal tissue adjacent or proximal to the diseased tissue may permit a less invasive sample to be obtained. In some cases, detecting a change in a normal tissue may permit identification of a risk of disease occurrence in a subject that may be asymptomatic for the disease. In some cases, sampling normal tissue adjacent or proximal to a diseased tissue may permit detection or identification of a presence of the disease based on detecting an epigenetic change in the normal tissue.

[0070] To identify a disease or condition, methods may include assaying for one or more of: a gene expression product, a polymorphism or sequence variant, an epigenetic modification, or any combination thereof. To identify a disease or condition, methods may include assaying for one or more of: gene bodies, genehancers, epigenetic peaks, or any combination thereof.

Methods may include identifying a presence of an epigenetic modification, a level of an epigenetic modification, a pattern of an epigenetic modification, a presence of an epigenetic modification at a specific location in a sequence of a sample, or any combination thereof.

Epigenetic-based data may be included in the method to identify a disease or condition with a higher level of accuracy as compared to comparable data that does not include epigenetic-based data.

[0071] Data obtained from assaying a reference sample or obtained from a reference database may be input into a machine learning algorithm to train a classifier. A reference sample or a reference database may include epigenetic peak data. Data obtained from assaying an independent sample may be input into a classifier to identify a presence or absence of a disease or condition in the independent sample. Samples that may be assayed may include liquid biopsies, fine needle aspirate samples, buccal swabs, tissue biopsies, blood samples, extracellular fluid samples, or any combination thereof. Portions of a sample that may be assayed may comprise genomic DNA (gDNA), cell free DNA (cfDNA), or a combination thereof. Prior to assaying, samples may be enriched for portions of the sample comprising an epigenetic modification.

[0072] Methods may include establishing an epigenetic reference set. The epigenetic reference set may include epigenetic peak data. The epigenetic reference set may include data for hydroxymethylated bases, methylated bases, formylated bases, carboxylated bases or any combination thereof. Peak data may include a specific epigenetic modification at a specific location in a sequence that is predictive of a disease condition. Peak data may include a level of epigenetic modification in a sample that is predictive of a disease condition. Peak data may include a pattern of epigenetic modification in a sample that is predictive of a disease condition.

Definitions

[0073] The terminology used herein is for the purpose of describing particular cases only and is not intended to be limiting. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, to the extent that the terms “including”, “includes”, “having”, “has”, “with”, or variants thereof are used in either the detailed description and/or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising”.

[0074] The term “about” or “approximately” can mean within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, e.g., the limitations of the measurement system. For example, “about” can mean plus or minus 10%, per the practice in the art. Alternatively, “about” can mean a range of plus or minus 20%, plus or minus 10%, plus or minus 5%, or plus or minus 1% of a given value. Alternatively, particularly with respect to biological systems or processes, the term can mean within an order of magnitude, within 5-fold, or within 2-fold, of a value. Where particular values are described in the application and claims, unless otherwise stated the term “about” meaning within an acceptable error range for the particular value should be assumed. Also, where ranges and/or subranges of values are provided, the ranges and/or subranges can include the endpoints of the ranges and/or subranges.

[0075] The term “substantially” as used herein can refer to a value approaching 100% of a given value. In some cases, the term can refer to an amount that can be at least about 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, 99.9%, or 99.99% of a total amount. In some cases, the term can refer to an amount that can be about 100% of a total amount.

[0076] The term “homology” can refer to a % identity of a sequence to a reference sequence. As a practical matter, whether any particular sequence is at least 50%, 60%, 70%, 80%, 85%, 90%, 92%, 95%, 96%, 97%, 98% or 99% identical to any sequence described herein (which may correspond with a particular nucleic acid sequence described herein) can be determined conventionally using known computer programs such as the Bestfit program (Wisconsin Sequence Analysis Package, Version 8 for Unix, Genetics Computer Group, University Research Park, 575 Science Drive, Madison, Wis. 53711). When using Bestfit or any other sequence alignment program to determine whether a particular sequence is, for instance, 95% identical to a reference sequence, the parameters can be set such that the percentage of identity is calculated over the full length of the reference sequence and that gaps in homology of up to 5% of the total reference sequence are allowed.

[0077] For example, in a specific embodiment the identity between a reference sequence (query sequence, i.e., a sequence of the present invention) and a subject sequence, also referred to as a global sequence alignment, may be determined using the FASTDB computer program based on the algorithm of Brutlag et al. (Comp. App. Biosci. 6:237-245 (1990)). In some embodiments, parameters for a particular embodiment in which identity is narrowly construed, used in a FASTDB amino acid alignment, can include: Scoring Scheme=PAM (Percent Accepted Mutations) 0, k-tuple=2, Mismatch Penalty=1, Joining Penalty=20, Randomization Group Length=0, Cutoff Score=1, Window Size=sequence length, Gap Penalty=5, Gap Size Penalty=0.05, Window Size=500 or the length of the subject sequence, whichever is shorter. According to this embodiment, if the subject sequence is shorter than the query sequence due to N- or C-terminal deletions, not because of internal deletions, a manual correction can be made to the results to take into consideration the fact that the FASTDB program does not account for N- and C-terminal truncations of the subject sequence when calculating global percent identity. For subject sequences truncated at the N- and C-termini, relative to the query sequence, the percent identity can be corrected by calculating the number of residues of the query sequence that are lateral to the N- and C-terminal of the subject sequence, which are not matched/aligned with a corresponding subject residue, as a percent of the total bases of the query sequence. Whether a residue is matched/aligned can be determined by results of the FASTDB sequence alignment. This percentage can be then subtracted from the percent identity, calculated by the FASTDB program using the specified parameters, to arrive at a final percent identity score. This final percent identity score can be used for the purposes of this embodiment. In some embodiments, only residues to the N- and C-termini of the subject sequence, which are not matched/aligned with the query sequence, are considered for the purposes of manually adjusting the percent identity score. That is, only query residue positions outside the farthest N- and C-terminal residues of the subject sequence are considered for this manual correction. For example, a 90 residue subject sequence can be aligned with a 100 residue query sequence to determine percent identity.
The deletion occurs at the N-terminus of the subject sequence and therefore, the FASTDB alignment does not show a matching/alignment of the first 10 residues at the N-terminus. The 10 unpaired residues represent 10% of the sequence (number of residues at the N- and C-termini not matched/total number of residues in the query sequence) so 10% is subtracted from the percent identity score calculated by the FASTDB program. If the remaining 90 residues were perfectly matched the final percent identity may be 90%. In another example, a 90 residue subject sequence is compared with a 100 residue query sequence. This time the deletions are internal deletions so there are no residues at the N- or C-termini of the subject sequence which are not matched/aligned with the query. In this case the percent identity calculated by FASTDB is not manually corrected. Once again, only residue positions outside the N- and C-terminal ends of the subject sequence, as displayed in the FASTDB alignment, which are not matched/aligned with the query sequence are manually corrected for.
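The manual correction described above can be illustrated with a minimal sketch. The function name and its arguments are hypothetical; the sketch takes the percent identity reported by an alignment program as an input and does not run FASTDB itself.

```python
def corrected_percent_identity(reported_identity, query_length, unmatched_terminal):
    """Apply the terminal-truncation correction to a reported percent identity.

    reported_identity:  percent identity reported by the alignment program.
    query_length:       total number of residues in the query sequence.
    unmatched_terminal: number of query residues lying outside the N- and
                        C-terminal ends of the subject sequence that are not
                        matched/aligned (internal deletions are not counted).
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    if not 0 <= unmatched_terminal <= query_length:
        raise ValueError("unmatched_terminal must be between 0 and query_length")
    penalty = 100.0 * unmatched_terminal / query_length
    return reported_identity - penalty


# N-terminal deletion: 10 of 100 query residues unmatched, remaining 90 perfectly matched.
print(corrected_percent_identity(100.0, 100, 10))  # 90.0
# Internal deletions only: no terminal residues unmatched, so no correction is applied.
print(corrected_percent_identity(90.0, 100, 0))    # 90.0
```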

[0078] The term “fragment,” as used herein, may be a portion of a sequence, a subset that may be shorter than a full length sequence. A fragment may be a portion of a gene. A fragment may be a portion of a peptide or protein. A fragment may be a portion of an amino acid sequence. A fragment may be a portion of an oligonucleotide sequence. A fragment may be less than about: 20, 30, 40, 50 amino acids in length. A fragment may be less than about: 20, 30, 40, 50 nucleotides in length.

[0079] The term “epigenetic modification” as used herein, may be any covalent modification of a nucleic acid base. In some cases, a covalent modification may comprise (i) adding a methyl group, a hydroxymethyl group, a carbon atom, an oxygen atom, or any combination thereof to one or more bases of a nucleic acid sequence, (ii) changing an oxidation state of a molecule associated with a nucleic acid sequence, such as an oxygen atom, or (iii) a combination thereof. A covalent modification may occur at any base, such as a cytosine, a thymine, a uracil, an adenine, a guanine, or any combination thereof. In some cases, an epigenetic modification may comprise an oxidation or a reduction. A nucleic acid sequence may comprise one or more epigenetically modified bases. An epigenetically modified base may comprise any base, such as a cytosine, a uracil, a thymine, an adenine, or a guanine. An epigenetically modified base may comprise a methylated base, a hydroxymethylated base, a formylated base, or a carboxylic acid containing base or a salt thereof. An epigenetically modified base may comprise a 5-methylated base, such as a 5-methylated cytosine (5-mC). An epigenetically modified base may comprise a 5-hydroxymethylated base, such as a 5-hydroxymethylated cytosine (5-hmC). An epigenetically modified base may comprise a 5-formylated base, such as a 5-formylated cytosine (5-fC). An epigenetically modified base may comprise a 5-carboxylated base or a salt thereof, such as a 5-carboxylated cytosine (5-caC). In some cases, an epigenetically modified base may comprise a methyltransferase-directed transfer of an activated group (mTAG).

[0080] An epigenetically modified base may comprise one or more bases of a purine (such as Structure 1) or one or more bases of a pyrimidine (such as Structure 2). An epigenetic modification may occur at one or more of any positions. For example, an epigenetic modification may occur at one or more positions of a purine, including positions 1, 2, 3, 4, 5, 6, 7, 8, 9, as shown in Structure 1. In some cases, an epigenetic modification may occur at one or more positions of a pyrimidine, including positions 1, 2, 3, 4, 5, 6, as shown in Structure 2.

[0081]

Structure 1

[0082]

Structure 2

[0083] A nucleic acid sequence may comprise an epigenetically modified base. A nucleic acid sequence may comprise a plurality of epigenetically modified bases. A nucleic acid sequence may comprise an epigenetically modified base positioned within a CG site, a CpG island, or a combination thereof. A nucleic acid sequence may comprise different epigenetically modified bases, such as a methylated base, a hydroxymethylated base, a formylated base, a carboxylic acid containing base or a salt thereof, a plurality of any of these, or any combination thereof.

[0084] The term “nucleic acid sequence” as used herein may comprise DNA or RNA. In some cases, a nucleic acid sequence may comprise a plurality of nucleotides. In some cases, a nucleic acid sequence may comprise an artificial nucleic acid analogue. In some cases, a nucleic acid sequence comprising DNA may comprise cell-free DNA, cDNA, fetal DNA, or maternal DNA. In some cases, a nucleic acid sequence may comprise miRNA, shRNA, or siRNA.

[0085] The term “substantially complementary strand” as used herein, may comprise from about 70% - 100% bases that base pair with bases of a nucleic acid sequence. This percentage of base pairing may be measured by UV absorption of the nucleic acid sequence. In some cases, a substantially complementary strand may be hybridized to at least a portion of a nucleic acid sequence under stringent hybridization conditions.

[0086] The term “substantially free of an epigenetically modified base” as used herein, may comprise a complementary strand having no epigenetically modified base, or a complementary strand having from about 0.000001% to about 5% of a plurality of epigenetically modified bases of a nucleic acid sequence.

[0087] The term “click-chemistry” as used herein may comprise a reaction having at least one of the following properties: (a) high yielding, (b) wide in scope, (c) creating only byproducts that may be removed in the absence of chromatography, (d) stereospecific, (e) simple to perform, (f) conducted in easily removable or benign solvents. In some cases, click-chemistry comprises tagging, such as tagging a nucleic acid sequence or a complementary strand. In some cases, click-chemistry may associate a nucleic acid sequence with a label. Click-chemistry may comprise a reaction having a [3+2] cycloaddition; a thiol-ene reaction; a Diels-Alder reaction; an inverse electron demand Diels-Alder reaction; a [4+1] cycloaddition; a nucleophilic substitution; a carbonyl-chemistry-like formation of urea; an addition to a carbon-carbon double bond; or any combination thereof. In some cases, a [3+2] cycloaddition may comprise a Huisgen 1,3-dipolar cycloaddition. In some cases, a [4+1] cycloaddition may comprise a cycloaddition between an isonitrile and a tetrazine. Click-chemistry may comprise a copper(I)-catalyzed azide-alkyne cycloaddition (CuAAC); a strain-promoted azide-alkyne cycloaddition (SPAAC); a strain-promoted alkyne-nitrone cycloaddition (SPANC); or any combination thereof.

[0088] The term “sequencing” as used herein, may comprise bisulfite-free sequencing, bisulfite sequencing, TET-assisted bisulfite (TAB) sequencing, ACE-sequencing, high-throughput sequencing, Maxam-Gilbert sequencing, massively parallel signature sequencing, Polony sequencing, 454 pyrosequencing, Sanger sequencing, Illumina sequencing, SOLiD sequencing, Ion Torrent semiconductor sequencing, DNA nanoball sequencing, Heliscope single molecule sequencing, single molecule real time (SMRT) sequencing, nanopore DNA sequencing, shotgun sequencing, RNA sequencing, Enigma sequencing, or any combination thereof.

[0089] In some cases, a method may comprise sequencing. The sequencing may include bisulfite sequencing or bisulfite-free sequencing. In some cases, a method may comprise oxidizing one or more bases of a nucleic acid sequence or complementary strand or combination thereof. In some cases, a method may comprise selectively enriching for a nucleic acid sequence that contains at least one epigenetic modification.

[0090] The term “tissue” as used herein, may be any tissue sample. A tissue may be a tissue suspected or confirmed of having a disease or condition. A tissue may be a sample that may be substantially healthy, substantially benign, or otherwise substantially free of a disease or a condition. A tissue may be a tissue removed from a subject, such as a tissue biopsy, a tissue resection, an aspirate (such as a fine needle aspirate), a tissue washing, a cytology specimen, a bodily fluid, or any combination thereof. A tissue may comprise cancerous cells, tumor cells, non-cancerous cells, or a combination thereof. A tissue may comprise colon tissue, colorectal tissue, rectal tissue, a polyp, a blood sample (such as a cell-free DNA sample), or any combination thereof. A tissue may be a sample that may be genetically modified.

[0091] As used herein, the term “cell-free” refers to the condition of the nucleic acid sequence as it appeared in the body before the sample is obtained from the body. For example, circulating cell-free nucleic acid sequences in a sample may have originated as cell-free nucleic acid sequences circulating in the bloodstream of the human body. In contrast, nucleic acid sequences that are extracted from a solid tissue, such as a biopsy, are generally not considered to be “cell-free”. In some cases, cell-free DNA may comprise fetal DNA, maternal DNA, or a combination thereof. In some cases, cell-free DNA may comprise DNA fragments released into a blood plasma. In some cases, the cell-free DNA may comprise circulating tumor DNA. In some cases, cell-free DNA may comprise circulating DNA indicative of a tissue origin, a disease or a condition. A cell-free nucleic acid sequence may be isolated from a blood sample. A cell-free nucleic acid sequence may be isolated from a plasma sample. A cell-free nucleic acid sequence may comprise a complementary DNA (cDNA). In some cases, one or more cDNAs may form a cDNA library.

[0092] The term “subject,” as used herein, may be any animal or living organism. Animals can be mammals, such as humans, non-human primates, rodents such as mice and rats, dogs, cats, pigs, sheep, rabbits, and others. Animals can be fish, reptiles, or others. Animals can be neonatal, infant, adolescent or adult animals. Humans can be more than about: 1, 2, 5, 10, 20, 30, 40, 50, 60, 65, 70, 75, or about 80 years of age. The subject may have or be suspected of having a condition or a disease, such as cancer. The subject may be a patient, such as a patient being treated for a condition or a disease, such as a cancer patient. The subject may be predisposed to a risk of developing a condition or a disease such as cancer. The subject may be in remission from a condition or a disease, such as a cancer patient. The subject may be healthy.

[0093] A nucleic acid sequence may be from a sample. A sample may be isolated from a subject. A subject may be a human subject. A sample may comprise a buccal sample, a saliva sample, a blood sample, a plasma sample, a reproductive sample (such as an egg or a sperm), a mucus sample, a cerebral spinal fluid sample, a tissue sample, a tissue biopsy, a surgical resection, a fine needle aspirate sample, or any combination thereof. In some cases, a sample may comprise a blood sample. In some cases, a sample may comprise a buccal sample.

[0094] In some cases, a subject may have previously received a diagnosis of a disease or condition prior to performing a method as described herein. A subject may have previously received a positive diagnosis of a disease, such as a cancer. A subject may have previously received an indeterminate or inconclusive diagnosis of a disease, such as a cancer. A subject may be a subject in need thereof, such as a need for a definitive diagnosis or a need for a selection of a therapeutic treatment regime.

[0095] A result of the method or a result output from the trained algorithm may include a recommendation for a treatment. A treatment may include further monitoring of the subject, such as obtaining a second sample from the subject and repeating a method as described herein. A treatment may include performing surgery or removing of a tissue from the subject, performing an imaging scan on the subject, performing a diagnostic test on a sample from the subject, performing radiation, chemotherapy, or other cancer treatment procedure.

[0096] In some cases, a subject may not have previously received a diagnosis of a disease or condition prior to performing a method as described herein. In some cases, a subject may be suspected of having a disease or condition, such as having one or more symptoms of a disease or condition. In some cases, a subject may be at risk of developing a disease or condition, such as a subject having a biomarker or genetic indication that may be indicative of a risk of developing a disease or condition. In some cases, a disease or a condition may comprise a cancer.

[0097] A nucleic acid sequence may comprise a cytosine guanine (CG) site, a cytosine phosphate guanine (CpG) island, a portion of any of these, or a combination thereof. A CpG island may comprise one or more CG sites. A nucleic acid sequence may comprise one or more CG sites or portions thereof. A nucleic acid sequence may comprise dense CG sites, dense CpG islands or a combination thereof. A nucleic acid sequence may comprise a plurality of CG sites or portions thereof. A nucleic acid sequence may comprise one or more CpG islands or portions thereof. A nucleic acid sequence may comprise a plurality of CpG islands or portions thereof. One or more bases of a nucleic acid sequence comprising a CG site, a CpG island, a portion thereof, or any of these may comprise an epigenetically modified base, such as a methylated base or a hydroxymethylated base. One or more cytosines of a nucleic acid sequence comprising a CG site, a CpG island, a portion thereof, or any of these may comprise an epigenetically modified cytosine, such as a methylated cytosine or a hydroxymethylated cytosine. A CpG island (or a CG island) may be a region with a high frequency of CG sites. A CpG island may be a region of a nucleic acid sequence with at least about 200 basepairs (bp) and a GC percentage that may be greater than about 50% and with an observed-to-expected CpG ratio that may be greater than about 60%. An "observed-to-expected CpG ratio" may be derived where the observed may be calculated as:

[0098] (number of CpGs)

[0099] and the expected may be calculated as:

[00100] (number of C * number of G) / length of sequence

[00101] or the expected may be calculated as:

[00102] ((number of C + number of G) / 2)^2 / length of sequence
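As a non-limiting illustration, the observed-to-expected CpG ratio and the CpG island criteria stated above might be computed as follows. The function names are hypothetical, and the default thresholds (200 bp, 50% GC, ratio of 0.6) simply restate the approximate values given in the text.

```python
def cpg_stats(sequence):
    """Compute GC fraction and observed-to-expected CpG ratios for a sequence."""
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("sequence must not be empty")
    n_c = seq.count("C")
    n_g = seq.count("G")
    observed = seq.count("CG")                    # number of CpGs
    expected = n_c * n_g / length                 # (number of C * number of G) / length
    expected_alt = ((n_c + n_g) / 2) ** 2 / length  # alternative expected value
    return {
        "length": length,
        "gc_fraction": (n_c + n_g) / length,
        "obs_exp": observed / expected if expected else 0.0,
        "obs_exp_alt": observed / expected_alt if expected_alt else 0.0,
    }


def is_cpg_island(sequence, min_length=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the length, GC percentage, and observed-to-expected criteria."""
    stats = cpg_stats(sequence)
    return (stats["length"] >= min_length
            and stats["gc_fraction"] > min_gc
            and stats["obs_exp"] > min_obs_exp)


# Artificial 200 bp sequences used only to exercise the arithmetic.
print(cpg_stats("CG" * 100)["obs_exp"])  # 2.0
print(is_cpg_island("CG" * 100))         # True
print(is_cpg_island("AT" * 100))         # False
```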

Samples

[00103] The methods of the present invention provide for storing the sample for a time such as seconds, minutes, hours, days, weeks, months, years or longer after the sample is obtained and before the sample is analyzed by one or more methods of the invention. In some cases, the sample obtained from a subject can be subdivided prior to the step of storage or further analysis such that different portions of the sample may be subject to different downstream methods or processes including but not limited to storage, cytological analysis, adequacy tests, nucleic acid extraction, molecular profiling or a combination thereof.

[00104] In some cases, a portion of the sample may be stored while another portion of said sample is further manipulated. Such manipulations may include but are not limited to molecular profiling (epigenetics, gene expression levels, sequence variant, copy number); sequencing, labeling, cytological or histological staining; flow cytometry analysis; nucleic acid (RNA or DNA) extraction, detection, or quantification; gene expression product (RNA or Protein) extraction, detection, or quantification; fixation; and examination. The sample may be fixed prior to or during storage by any method known to the art such as using glutaraldehyde, formaldehyde, or methanol. In other cases, the sample is obtained and stored and subdivided after the step of storage for further analysis such that different portions of the sample are subject to different downstream methods or processes including but not limited to storage, cytological analysis, adequacy tests, nucleic acid extraction, molecular profiling or a combination thereof. In some cases, samples are obtained and analyzed by for example cytological analysis, and the resulting sample material is further analyzed by one or more molecular profiling methods of the present invention. In such cases, the samples may be stored between the steps of cytological analysis and the steps of molecular profiling. Samples may be stored upon acquisition to facilitate transport, or to wait for the results of other analyses. In another embodiment, samples may be stored while awaiting instructions from a physician or other medical professional.

Classifiers

[00105] The results obtained from the assaying can be analyzed using feature selection techniques including filter techniques which assess the relevance of features by looking at the intrinsic properties of the data, wrapper methods which embed the model hypothesis within a feature subset search, and embedded techniques in which the search for an optimal set of features is built into a classifier algorithm.

[00106] Filter techniques useful in the methods of the present invention include (1) parametric methods such as the use of two sample t-tests, ANOVA analyses, Bayesian frameworks, and Gamma distribution models; (2) model free methods such as the use of Wilcoxon rank sum tests, between-within class sum of squares tests, rank products methods, random permutation methods, or TNoM which involves setting a threshold point for fold-change differences in expression between two datasets and then detecting a threshold point in each gene that minimizes the number of misclassifications; and (3) multivariate methods such as bivariate methods, correlation based feature selection methods (CFS), minimum redundancy maximum relevance methods (MRMR), Markov blanket filter methods, and uncorrelated shrunken centroid methods. Wrapper methods useful in the methods of the present invention include sequential search methods, genetic algorithms, and estimation of distribution algorithms.

Embedded methods useful in the methods of the present invention include random forest algorithms, weight vector of support vector machine algorithms, and weights of logistic regression algorithms.
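By way of example only, a parametric filter technique based on a two sample t-test might rank features as sketched below. The sketch uses the Welch (unequal variance) form of the t-statistic, which is one of several possible choices and is an assumption of this example; the feature names and values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch two sample t-statistic (each group needs at least two values)."""
    diff = mean(group_a) - mean(group_b)
    std_err = sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    if std_err == 0:
        # Both groups are constant: no difference, or a perfectly separating one.
        return 0.0 if diff == 0 else float("inf") * (1 if diff > 0 else -1)
    return diff / std_err


def rank_features(case, control):
    """Rank features shared by both groups by absolute t-statistic, largest first.

    case, control: dicts mapping a feature name to a list of measured values.
    """
    scores = {name: abs(welch_t(case[name], control[name]))
              for name in case if name in control}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical per-sample signal at two epigenetic peaks.
case = {"peak_1": [10, 12, 11], "peak_2": [5, 6, 5]}
control = {"peak_1": [2, 3, 2], "peak_2": [5, 5, 6]}
print(rank_features(case, control))  # ['peak_1', 'peak_2']
```

The top-ranked features from such a filter could then be passed to a classifier algorithm; the number of features retained is a design choice not fixed by this sketch.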

[00107] Selected features may then be classified using a classifier algorithm. Illustrative algorithms include but are not limited to methods that reduce the number of variables such as principal component analysis algorithms, partial least squares methods, and independent component analysis algorithms. Illustrative algorithms further include but are not limited to methods that handle large numbers of variables directly such as statistical methods and methods based on machine learning techniques. Statistical methods include penalized logistic regression, prediction analysis of microarrays (PAM), methods based on shrunken centroids, support vector machine analysis, and regularized linear discriminant analysis. Machine learning techniques include bagging procedures, boosting procedures, random forest algorithms, and combinations thereof.

[00108] Classifiers may be developed using top varying genes, enhancers, or a combination thereof to demonstrate the predictive power of 5-hmC in diagnosing cancer, early detection of cancer, recurrence of cancer, metastasis of cancer, presence of a malignant tissue, or any combination thereof. A trained model may successfully predict a disease status, a risk of occurrence or recurrence of a disease, or any combination thereof in a test set with greater than about 90% sensitivity and greater than about 80% specificity.
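For illustration, the sensitivity and specificity of a trained model on a test set might be computed from true and predicted labels as sketched below. The labels are hypothetical and are not results from any actual experiment; 1 denotes a disease-positive sample and 0 a disease-negative sample.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    sensitivity = tp / (tp + fn)  # fraction of positives correctly called
    specificity = tn / (tn + fp)  # fraction of negatives correctly called
    return sensitivity, specificity


# Hypothetical test set: 10 positive and 10 negative samples.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 9 + [0] + [0] * 8 + [1] * 2
print(sensitivity_specificity(y_true, y_pred))  # (0.9, 0.8)
```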

[00109] In some cases, the trained model provides a result having greater than about 90% sensitivity and greater than about 80% specificity. In some cases, the trained model provides a result having greater than about 90% sensitivity and greater than about 85% specificity. In some cases, the trained model provides a result having greater than about 90% sensitivity and greater than about 90% specificity. In some cases, the trained model provides a result having greater than about 90% sensitivity and greater than about 95% specificity.

[00110] In some cases, the trained model provides a result having greater than about 95% sensitivity and greater than about 80% specificity. In some cases, the trained model provides a result having greater than about 95% sensitivity and greater than about 85% specificity. In some cases, the trained model provides a result having greater than about 95% sensitivity and greater than about 90% specificity. In some cases, the trained model provides a result having greater than about 95% sensitivity and greater than about 95% specificity.

[00111] In some cases, the trained model provides a result having greater than about 98% sensitivity and greater than about 80% specificity. In some cases, the trained model provides a result having greater than about 98% sensitivity and greater than about 85% specificity. In some cases, the trained model provides a result having greater than about 98% sensitivity and greater than about 90% specificity. In some cases, the trained model provides a result having greater than about 98% sensitivity and greater than about 95% specificity.

[00112] In some cases, the trained algorithm provides a result having greater than about 80% sensitivity. In some cases, the trained algorithm provides a result having greater than about 85% sensitivity. In some cases, the trained algorithm provides a result having greater than about 90% sensitivity. In some cases, the trained algorithm provides a result having greater than about 95% sensitivity. In some cases, the trained algorithm provides a result having greater than about 96% sensitivity. In some cases, the trained algorithm provides a result having greater than about 97% sensitivity. In some cases, the trained algorithm provides a result having greater than about 98% sensitivity.

[00113] In some cases, the trained algorithm provides a result having greater than about 70% specificity. In some cases, the trained algorithm provides a result having greater than about 75% specificity. In some cases, the trained algorithm provides a result having greater than about 80% specificity. In some cases, the trained algorithm provides a result having greater than about 85% specificity. In some cases, the trained algorithm provides a result having greater than about 90% specificity. In some cases, the trained algorithm provides a result having greater than about 95% specificity. In some cases, the trained algorithm provides a result having greater than about 96% specificity.

[00114] In some cases, the trained algorithm provides a result having greater than about 80% clinical diagnostic accuracy. In some cases, the trained algorithm provides a result having greater than about 85% clinical diagnostic accuracy. In some cases, the trained algorithm provides a result having greater than about 90% clinical diagnostic accuracy. In some cases, the trained algorithm provides a result having greater than about 95% clinical diagnostic accuracy.

[00115] Sensitivity typically refers to TP/(TP+FN), where TP is true positive and FN is false negative. That is, sensitivity may be the number of Continued Indeterminate results divided by the total number of malignant results based on adjudicated histopathology diagnosis. Specificity typically refers to TN/(TN+FP), where TN is true negative and FP is false positive. That is, specificity may be the number of benign results divided by the total number of benign results based on adjudicated histopathology diagnosis. Positive Predictive Value (PPV) typically refers to TP/(TP+FP) and Negative Predictive Value (NPV) typically refers to TN/(TN+FN). The clinical accuracy as used herein includes specificity, sensitivity, positive predictive value, negative predictive value, or any combination thereof.
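The four ratios defined above can be computed directly from the counts of true and false positives and negatives. The following is a minimal sketch; the class name, function name, and example counts are hypothetical and are not results from any study described herein.

```python
from dataclasses import dataclass

@dataclass
class ConfusionCounts:
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives

def clinical_metrics(c: ConfusionCounts) -> dict:
    """Return sensitivity, specificity, PPV and NPV as fractions in [0, 1]."""
    def ratio(num: int, den: int):
        # A ratio is undefined when its denominator is zero.
        return num / den if den else None

    return {
        "sensitivity": ratio(c.tp, c.tp + c.fn),  # TP/(TP+FN)
        "specificity": ratio(c.tn, c.tn + c.fp),  # TN/(TN+FP)
        "ppv": ratio(c.tp, c.tp + c.fp),          # TP/(TP+FP)
        "npv": ratio(c.tn, c.tn + c.fn),          # TN/(TN+FN)
    }

# Hypothetical counts for a test set of 50 malignant and 50 benign samples.
metrics = clinical_metrics(ConfusionCounts(tp=45, fp=10, tn=40, fn=5))
```

With these illustrative counts, sensitivity is 45/50 and specificity is 40/50.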

Biomarkers

[00116] Methods as described herein may assay for at least one biomarker or an active fragment thereof. In some cases, about: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50, 100, 150, 200 or more biomarkers may be assayed. In some cases, about 2 biomarkers may be assayed. In some cases, about 5 biomarkers may be assayed. In some cases, about 10 biomarkers may be assayed. In some cases, about 15 biomarkers may be assayed. In some cases, at least 20 biomarkers may be assayed.

[00117] Methods as described herein may utilize at least one biomarker or an active fragment thereof to classify a sample. In some cases, about: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50, 100, 150, 200 biomarkers may be utilized to classify a sample. In some cases, about 2 biomarkers may be utilized to classify a sample. In some cases, about 5 biomarkers may be utilized to classify a sample. In some cases, about 10 biomarkers may be utilized to classify a sample. In some cases, about 15 biomarkers may be utilized to classify a sample. In some cases, about 20 biomarkers may be utilized to classify a sample.

[00118] Methods as described herein may select at least one biomarker or an active fragment thereof. In some cases, about: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50, 100, 150, 200 biomarkers may be selected. In some cases, about 2 biomarkers may be selected. In some cases, at least 5 biomarkers may be selected. In some cases, about 10 biomarkers may be selected. In some cases, about 15 biomarkers may be selected. In some cases, about 20 biomarkers may be selected.

[00119] Methods as described herein may compare a result to at least one biomarker or an active fragment thereof of a control or derivative thereof, such as a reference sample. In some cases, a result may be compared to about: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 30, 40, 50, 100, 150, 200 biomarkers. In some cases, a result may be compared to about 2 biomarkers. In some cases, a result may be compared to about 5 biomarkers. In some cases, a result may be compared to about 10 biomarkers. In some cases, a result may be compared to about 15 biomarkers. In some cases, a result may be compared to about 20 biomarkers.

[00120] A biomarker or active fragment thereof may be a gene, a portion of a gene, a genehancer, a transcription factor, or any combination thereof. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least: 70%, 75%, 80%, 85%, 90%, 95%, 99% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 70% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 75% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 80% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 85% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 90% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 95% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 96% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 97% sequence homology to the biomarker. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 98% sequence homology to the biomarker. A biomarker may be a genehancer. A nucleotide sequence from a sample may comprise a nucleotide sequence having at least 99% sequence homology to the biomarker. A biomarker may be a transcription factor. A biomarker may be a site that is proximal to a gene. A biomarker may be a site associated with a gene but more than 10 basepairs away from the gene.
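A percentage threshold such as those listed above could be checked as follows. This is a minimal sketch that assumes "sequence homology" is measured as percent identity over two equal-length, already aligned sequences without gaps; the function names and example sequences are hypothetical, and a real comparison would normally use an alignment tool.

```python
def percent_identity(sample_seq: str, biomarker_seq: str) -> float:
    """Percent of matching positions between two pre-aligned, equal-length sequences."""
    if not biomarker_seq or len(sample_seq) != len(biomarker_seq):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(sample_seq.upper(), biomarker_seq.upper()))
    return 100.0 * matches / len(biomarker_seq)

def meets_threshold(sample_seq: str, biomarker_seq: str, threshold: float = 90.0) -> bool:
    """True when the identity is at least the given percentage."""
    return percent_identity(sample_seq, biomarker_seq) >= threshold

# Hypothetical 10-base sequences that differ at a single position.
identity = percent_identity("ACGTACGTAC", "ACGTACGTTC")
```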

[00121] A biomarker may not have been previously associated with a cancer. An expression of a biomarker may be associated with cancer but a change in an epigenetic modification in the biomarker may not have been previously associated with a cancer. A presence or absence of an epigenetic modification may be indicative of a cancer.

[00122] A presence of an epigenetic modification may comprise a level of methylation or a level of hydroxymethylation. A presence of an epigenetic modification may comprise a number of methylated sites, hydroxymethylated sites, hypo-hydroxymethylated sites, hyper-hydroxymethylated sites, or any combination thereof.

[00123] One or more biomarkers or active fragments thereof may be selected for use in the methods described herein. About: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100 biomarkers or more may be selected. From 1 to 5 biomarkers may be selected. From 1 to 10 biomarkers may be selected. From 1 to 20 biomarkers may be selected. From 1 to 40 biomarkers may be selected. From 1 to 50 biomarkers may be selected. From 1 to 60 biomarkers may be selected. From 1 to 100 biomarkers may be selected. From 2 to 5 biomarkers may be selected. From 2 to 10 biomarkers may be selected. From 2 to 20 biomarkers may be selected. From 2 to 50 biomarkers may be selected. From 2 to 100 biomarkers may be selected. From 5 to 10 biomarkers may be selected. From 5 to 20 biomarkers may be selected. From 5 to 30 biomarkers may be selected. From 5 to 40 biomarkers may be selected.

[00124] One or more biomarkers may be assayed according to the methods described herein. At least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100 biomarkers or more may be assayed. From 1 to 5 biomarkers may be assayed. From 1 to 10 biomarkers may be assayed. From 1 to 20 biomarkers may be assayed. From 1 to 40 biomarkers may be assayed. From 1 to 50 biomarkers may be assayed. From 1 to 60 biomarkers may be assayed. From 1 to 100 biomarkers may be assayed. From 2 to 5 biomarkers may be assayed. From 2 to 10 biomarkers may be assayed. From 2 to 20 biomarkers may be assayed. From 2 to 50 biomarkers may be assayed. From 2 to 100 biomarkers may be assayed. From 5 to 10 biomarkers may be assayed. From 5 to 20 biomarkers may be assayed. From 5 to 30 biomarkers may be assayed. From 5 to 40 biomarkers may be assayed.

[00125] A result from one or more biomarkers may be compared to a result from a control sample. A result from at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100 biomarkers or more may be compared. From 1 to 5 biomarkers may be compared. From 1 to 10 biomarkers may be compared. From 1 to 20 biomarkers may be compared. From 1 to 40 biomarkers may be compared. From 1 to 50 biomarkers may be compared. From 1 to 60 biomarkers may be compared. From 1 to 100 biomarkers may be compared. From 2 to 5 biomarkers may be compared. From 2 to 10 biomarkers may be compared. From 2 to 20 biomarkers may be compared. From 2 to 50 biomarkers may be compared. From 2 to 100 biomarkers may be compared. From 5 to 10 biomarkers may be compared. From 5 to 20 biomarkers may be compared. From 5 to 30 biomarkers may be compared. From 5 to 40 biomarkers may be compared.

[00126] In some cases, one or more biomarkers not previously associated with a cancer may be selected to use in the methods as described herein to identify a sample as benign or malignant for the cancer. In some cases, one or more biomarkers having an epigenetic marker or epigenetic change not previously associated with a cancer may be selected for use in the methods as described herein to identify a sample as benign or malignant for the cancer.

HMCP-110 Workflow

[00127] The HMCP-110 workflow may improve workflow and reduce sample attrition from 30% to 5% and eliminate strong operator biases seen in the HMCP-150 study. The analysis may identify many significantly differential hydroxymethylated features (both gene bodies and enhancers) that have been previously associated with cancer (such as CRC) or not previously associated with cancer.

[00128] FIG. 39 shows key improvements of the HMCP-110 protocol as compared to the HMCP-150 protocol. A total of 110 colorectal cancer (CRC) and healthy volunteer (HV) plasma samples are processed through the HMCP v2 protocol with significant improvements to project management, data analysis and overall execution. Improvements may include a reduction in operator bias, a reduction in attrition rate, or a combination thereof.

[00129] The HMCP-110 protocol is shown in FIG. 40.

Day 1. Summary. cfDNA samples will undergo end repair, addition of an A-base overhang, adaptor ligation, and post-ligation purification. About 3.8% of the ligation product will be amplified, purified and QC’ed by Qubit and BioAnalyzer while the remainder is reserved for processing on day 2.

Day 2. Summary. The remaining purified ligation product from day 1 is denatured into single strands, which are copied to produce double-stranded material. 5-Hydroxymethylated cytosines are chemically labeled and then bound to a biotin conjugate, followed by a clean-up of this reaction.

Day 3. Summary. The biotin-conjugated 5hmC-containing DNA fragment material is bound to streptavidin beads. Using a magnet, the unbound material (non-5hmC-containing fragments) is washed away. Following this, the bound DNA fragments are denatured into single-stranded DNA, leaving the copy strand in solution while the biotin-conjugated original strand remains bound to the streptavidin beads. The single-stranded copy strand is amplified. The library size and molarity are determined for both the amplified enriched (5hmC-containing) libraries by the bioanalyzer.

HMCP v2 protocol

[00130] The HMCP-110 protocol may be a modified version of the HMCP v2 protocol as described above and in FIG. 39 and FIG. 40.

[00131] A method as described herein may comprise associating a label with an epigenetically modified base of a nucleic acid sequence to form a labeled nucleic acid sequence; hybridizing a substantially complementary strand to the labeled nucleic acid sequence; and amplifying the substantially complementary strand in a reaction in which the labeled nucleic acid sequence is substantially not present. One or more individual elements of the method need not be performed in a particular order. For example, associating a label may occur after the hybridizing. One or more individual elements of a given method may be performed in a different order than described herein.

Variation 1

[00132] FIG. 41 shows one example of the 5-hmC Pulldown Label Copy Enrich (HMCP_LCE) method detailed herein. In some cases, the HMCP_LCE method may provide: (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or (c) any combination thereof.

[00133] In this example of FIG. 41, a first element 201 may be to prepare a plurality of double-stranded fragments 202, such as a library of oligonucleotide fragments. The plurality of double-stranded fragments may comprise cell-free DNA. The plurality of double-stranded fragments may comprise one or more epigenetic modifications on one or both strands. A second element 203 may be to associate a label (such as an azido-glucose label) with at least one of the oligonucleotide fragments from the plurality of double-stranded fragments to form a modified oligonucleotide fragment 204. The label may associate with an epigenetic modification present at one or more bases of the modified oligonucleotide fragment. A third element 205 may be to separate the modified oligonucleotide fragment to form one or more single-stranded modified oligonucleotide fragments 206. A fourth element 207 may be to hybridize a complementary strand, such as a substantially complementary strand, to a single-stranded modified oligonucleotide fragment to form a modified oligonucleotide fragment 208, such as a labeled chimeric library. The complementary strand may lack one or both of the label and the epigenetic modification. A fifth element 210 may be to associate a label 209 with the modified oligonucleotide fragment wherein the label 209 may also associate with a substrate. The label 209 may bind to an epigenetic modification or to a label previously associated with an epigenetic modification. The label 209 may not bind directly to the complementary strand. The complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment. The association between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond. A sixth element 211 may be to enrich a sample for one or more complementary strands 212 by removing or separating or washing away from the substrate one or more complementary strands (such as by disrupting the bond between the complementary strand and the opposing strand) and then separating the complementary strand from the modified oligonucleotide fragment that remains associated with the substrate. A seventh element 213 may be to amplify the enriched complementary strand in the absence of the modified oligonucleotide fragment to form one or more daughter strands 214 of the complementary strand.

[00134] In FIG. 41, the library may comprise double-stranded oligonucleotide fragments or single-stranded oligonucleotide fragments. The oligonucleotide fragments may be DNA or RNA. The library may be a next-generation (NGS) library. The library may comprise an oligonucleotide fragment having an adaptor (such as an NGS adaptor) at (a) one or both ends of the fragment, (b) at one or both strands of the double-stranded oligonucleotide fragment, or (c) a combination thereof. The adaptor may uniquely identify the oligonucleotide fragment from other oligonucleotide fragments in a sample or in a library. The adaptor may be specific to or selective for double-stranded DNA.

[00135] In FIG. 41, a label may associate with an epigenetic modification (such as 5-hmC) or a type of epigenetic modification present at a base of the oligonucleotide fragment. A label may associate with a plurality of epigenetic modifications present on one or both strands of a double-stranded oligonucleotide fragment. A label may associate with a type of epigenetic modification (such as 5-hmC). A label may be selective for a type of epigenetic modification (such as a 5-hmC). The label may be selective for double-stranded oligonucleotide fragments and may not label single-stranded fragments. The label may be selective for single-stranded oligonucleotide fragments. The label may associate with (such as bind to) the epigenetic modification with an aid, such as an enzyme. The enzyme may be selective for double-stranded oligonucleotide fragments, such as beta-glucosyltransferase (bGT). The label may associate with the epigenetic modification by click chemistry. The label may be an azido-sugar, such as an azido-glucose.

[00136] In FIG. 41, a double-stranded oligonucleotide fragment may be separated to form single stranded fragments, such as separating by denaturation. A complementary strand may be hybridized to at least a portion of a single stranded oligonucleotide. A complementary strand may be a primer, such as a primer that may be complementary to the adaptor (such as an NGS adaptor). A complementary strand may be a substantially complementary strand, such as substantially complementary along an entire length of the oligonucleotide fragment. The substantially complementary strand may be absent (a) the label that may be present in the parent oligonucleotide fragment, (b) the epigenetic modification that may be present in the parent oligonucleotide fragment, or (c) a combination thereof. The substantially complementary strand may be hybridized to the parent oligonucleotide fragment by DNA extension or cDNA extension.

[00137] In FIG. 41, parent oligonucleotide fragments and the substantially complementary strand may be indirectly associated with a substrate. The association to the substrate may occur via the label associated with the epigenetic modification on the parent oligonucleotide fragment. The substantially complementary strand may be free of any label and/or free of any epigenetic modification. The association between the label and the substrate may be disrupted.

[00138] In FIG. 41, oligonucleotide fragments comprising an epigenetic modification may be separated from oligonucleotide fragments absent any epigenetic modifications or absent a type of epigenetic modification. Separation may occur by associating the label with a substrate, such that any fragment absent the epigenetic modification or the type of epigenetic modification may be removed. Removal may occur by washing, such as stringent washing of the substrate. Following removal of oligonucleotide fragments lacking an epigenetic modification or a type of epigenetic modification, the substantially complementary strand may be separated from the parent oligonucleotide fragment strand. The parent oligonucleotide fragment strand may remain associated with the substrate. The parent oligonucleotide fragment strand and the substrate may be discarded. The substantially complementary strand may be amplified in a reaction vessel that may be free of the parent oligonucleotide fragment strand.
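The order of elements in the label, copy, enrich sequence of FIG. 41 can be illustrated with a toy data model. This is a minimal sketch and not a simulation of the chemistry: the class names, the example sequences, and the simplification that every fragment carrying at least one 5-hmC is labeled and captured are assumptions made for the example.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]

@dataclass(frozen=True)
class Fragment:
    name: str
    sequence: str
    hmc_positions: frozenset = frozenset()  # indices of 5-hmC bases, if any

@dataclass(frozen=True)
class Duplex:
    parent: Fragment       # original strand; may carry the label
    copy_strand: str       # complementary strand, free of label and modification
    parent_labeled: bool

def label_copy_enrich(library):
    # Label: only fragments carrying the epigenetic modification receive a label.
    labeled = {f.name: bool(f.hmc_positions) for f in library}
    # Copy: hybridize an unmodified, unlabeled complementary strand to each parent.
    duplexes = [Duplex(f, reverse_complement(f.sequence), labeled[f.name]) for f in library]
    # Enrich: labeled parents bind the substrate; unlabeled duplexes are washed away.
    bound = [d for d in duplexes if d.parent_labeled]
    # Release the copy strands; parents stay on the substrate and are discarded.
    return [(d.parent.name, d.copy_strand) for d in bound]

# Hypothetical two-fragment library: only frag_a carries a 5-hmC (at index 4).
library = [
    Fragment("frag_a", "ACGTCCGA", frozenset({4})),
    Fragment("frag_b", "TTGACGAT"),
]
enriched = label_copy_enrich(library)
```

Only the copy strand of the modified fragment is returned, which mirrors the idea that amplification proceeds in the absence of the labeled parent strand.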

Variation 2

[00139] FIG. 42 shows one example of the 5-hmC Pulldown Copy Label Enrich (HMCP_CLE) method detailed herein. In some cases, the HMCP_CLE method may provide: (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or (c) any combination thereof.

[00140] In this example of FIG. 42, a first element 301 may be to prepare a plurality of double stranded oligonucleotide fragments 302, such as a library. The double stranded oligonucleotide fragments may comprise cell-free DNA. The double stranded oligonucleotide fragments may have epigenetic modifications on one or more bases of one or both strands. A second element 303 may be to separate the strands of a double-stranded oligonucleotide fragment of the plurality to form one or more single-stranded oligonucleotide fragments 304. The one or more single-stranded oligonucleotide fragments may comprise one or more bases having an epigenetic modification. A third element 305 may be to hybridize a complementary strand, such as a substantially complementary strand, to at least one single-stranded oligonucleotide fragment to form a modified oligonucleotide fragment 306. The complementary strand may be substantially free of the epigenetic modification present in the opposing single-stranded oligonucleotide fragment. A fourth element 307 may be to associate a label (such as an azido-glucose label) with the modified oligonucleotide fragment to form a labeled modified oligonucleotide fragment 308, such as a labeled chimeric library. The label may associate with an epigenetic modification present in the modified oligonucleotide fragment. The label may not be associated with the substantially complementary strand that may lack an epigenetic modification. A fifth element 310 may be to associate a label 309 with the modified oligonucleotide fragment wherein the label 309 may also associate with a substrate. The label 309 may not bind directly to the complementary strand. The complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment. The association between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond. A sixth element 311 may be to enrich a sample for one or more complementary strands 312 by removing or separating or washing away from the substrate one or more complementary strands (such as by disrupting the bond between the complementary strand and the opposing strand). Upon separation, the modified oligonucleotide fragment may remain associated with the substrate. In some cases, enriching a sample for one or more complementary strands may comprise washing a substrate, such as stringent washing of a substrate. Washing may remove one or more non-covalently bound fragments, one or more non-specifically physisorbed fragments, or a combination thereof. Washing may not disrupt or alter an association between a modified oligonucleotide fragment and a substrate, such that a sample may be enriched for the complementary strand. A seventh element 313 may be to amplify the complementary strand in the absence of the modified oligonucleotide fragment to form one or more daughter strands 314 of the complementary strand.

[00141] In FIG. 42, the library may comprise double-stranded oligonucleotide fragments or single-stranded oligonucleotide fragments. The oligonucleotide fragments may be DNA or RNA. The library may be a next-generation (NGS) library. The library may comprise an oligonucleotide fragment having an adaptor (such as an NGS adaptor) at (a) one or both ends of the fragment, (b) at one or both strands of the double-stranded oligonucleotide fragment, or (c) a combination thereof. The adaptor may uniquely identify the oligonucleotide fragment from other oligonucleotide fragments in a sample or in a library. The adaptor may be specific to or selective for double-stranded DNA.

[00142] In FIG. 42, a double-stranded oligonucleotide fragment may be separated to form single stranded fragments, such as separating by denaturation. A complementary strand may be hybridized to at least a portion of a single stranded oligonucleotide. A complementary strand may be a primer, such as a primer that may be complementary to the adaptor (such as an NGS adaptor). A complementary strand may be a substantially complementary strand, such as substantially complementary along an entire length of the oligonucleotide fragment. The substantially complementary strand may be absent the epigenetic modification that may be present in the parent oligonucleotide fragment. The substantially complementary strand may be hybridized to the parent oligonucleotide fragment by cDNA extension.

[00143] In FIG. 42, a label may associate with an epigenetic modification (such as 5-hmC) or a type of epigenetic modification present at a base of the parent oligonucleotide fragment. A label may associate with a plurality of epigenetic modifications present on the parent oligonucleotide fragment. A label may associate with a type of epigenetic modification (such as 5-hmC). A label may be selective for a type of epigenetic modification (such as a 5-hmC). The label may be selective for double-stranded fragments and may not label single-stranded fragments. The label may be selective for single-stranded fragments. The label may associate with (such as bind to) the epigenetic modification of the parent strand with an aid, such as an enzyme. The enzyme may be selective for double-stranded oligonucleotide fragments, such as beta-glucosyltransferase (bGT). The label may associate with the epigenetic modification by click chemistry. The label may be an azido-sugar, such as an azido-glucose.

[00144] In FIG. 42, parent oligonucleotide fragments and the substantially complementary strand may be indirectly associated with a substrate. The association to the substrate may occur via the label associated with the epigenetic modification on the parent oligonucleotide fragment. The substantially complementary strand may be free of any label and/or free of any epigenetic modification. The association between the label and the substrate may be disrupted.

[00145] In FIG. 42, oligonucleotide fragments comprising an epigenetic modification may be separated from oligonucleotide fragments absent any epigenetic modifications or absent a type of epigenetic modification. Separation may occur by associating the label with a substrate, such that any fragment absent the epigenetic modification or the type of epigenetic modification may be removed. Removal may occur by washing, such as stringent washing of the substrate. Following removal of oligonucleotide fragments lacking an epigenetic modification or a type of epigenetic modification, the substantially complementary strand may be separated from the parent oligonucleotide fragment strand. The parent oligonucleotide fragment strand may remain associated with the substrate. The parent oligonucleotide fragment strand and the substrate may be discarded. The substantially complementary strand may be amplified in a reaction vessel that may be free of the parent oligonucleotide fragment strand.

Variation 3

[00146] FIG. 43 shows one example of the 5-hmC Pulldown Label Random prime Enrich (HMCP_LRE) method detailed herein. In some cases, the HMCP_LRE method may provide: (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (c) a substantially improved robustness at low input mass as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or (d) any combination thereof.

[00147] In this example of FIG. 43, a first element 401 may be to associate a label (such as an azido-glucose label) with a double stranded oligonucleotide fragment to yield a modified oligonucleotide fragment 402. The double stranded oligonucleotide may comprise cell-free DNA. The label may associate with an epigenetic modification or a type of epigenetic modification present at a base of one or both strands of the double stranded oligonucleotide fragment to form the modified oligonucleotide fragment 402. A second element 403 may be to separate the strands of the modified oligonucleotide fragment to form one or more single-stranded modified oligonucleotide fragments and then to hybridize a complementary strand, such as a substantially complementary strand, to at least one of the single-stranded modified oligonucleotide fragments to form a double stranded modified oligonucleotide fragment 404 having a complementary strand and a modified oligonucleotide fragment having the label. The complementary strand may be absent the label and absent the epigenetic modification. A third element 405 may associate an adaptor to the double stranded modified oligonucleotide fragment (such as to one or both ends of one or both strands of the double stranded modified oligonucleotide fragment) to form a double stranded modified oligonucleotide fragment having one or more adaptors 406, such as a labeled chimeric library. A fourth element 408 may be to associate a label 407 with the modified oligonucleotide fragment wherein the label 407 may also associate with a substrate. The label 407 may bind to an epigenetic modification or to the label previously associated with an epigenetic modification. The label 407 may not bind directly to the complementary strand. The complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment. The interaction between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond. A fifth element 409 may be to enrich a sample for one or more complementary strands 410 by removing or separating or washing away from the substrate one or more complementary strands that lack a label associated with the substrate (such as by disrupting the bond between the complementary strand and the opposing strand) and then separating the complementary strand from the modified oligonucleotide fragment that remains associated with the substrate. A sixth element 411 may be to amplify the complementary strand in the absence of the modified oligonucleotide fragment to form one or more daughter strands 412 of the complementary strand.

[00148] In FIG. 43, a label may associate with an epigenetic modification (such as 5-hmC) present at a base of the parent oligonucleotide fragment. A label may associate with a plurality of epigenetic modifications present on the parent oligonucleotide fragment. A label may associate with a type of epigenetic modification (such as 5-hmC). A label may be selective for a type of epigenetic modification (such as a 5-hmC). The label may be selective for double-stranded fragments and may not label single-stranded fragments. The label may be selective for single-stranded fragments. The label may associate with (such as bind to) the epigenetic modification of the parent strand with an aid, such as an enzyme. The enzyme may be selective for double-stranded oligonucleotide fragments, such as beta-glucosyltransferase (bGT). The label may associate with the epigenetic modification by click chemistry. The label may be an azido-sugar, such as an azido-glucose.

[00149] In FIG. 43, a position of a label may be determined by the presence/absence of 5-hmC in a dsDNA parent fragment. A label may be an azido-glucose, transferred to a 5-hmC from UDP-6-azide-glucose (UDP-N3-glc) by beta-glucosyltransferase (bGT). Labeling may be performed directly on a purified circulating tumor DNA (ctDNA) extract. An advantage may be that a ctDNA may not have been through a series of library preps ahead of labeling. There may therefore be more material available at labeling (improved efficiency), and the sample presented for labeling may be more representative than may be the case post NGS prep.

[00150] In some cases, hybridizing may comprise (i) priming (such as random priming), (ii) ligation (such as adapter ligation), or (iii) a combination thereof. For example, in FIG. 43, random priming may be performed by incubating an azido-labeled double-stranded DNA (dsDNA) duplex in the presence of an oligomer pool (where each oligo in the pool may comprise a degenerate N6, N7, N8, N9, N10 or beyond “head” attached to a “NGS-adapter” tail), a DNA polymerase (e.g. Klenow) and a native deoxynucleoside triphosphate (dNTP) mix in a given buffer, and performing a single extension reaction at 37 °C for a defined time (e.g. 10 mins). A degenerate primer “head” may randomly prime a template DNA and may make multiple copies for each of the parent strands. If using a strand displacing polymerase, the random primer that primes closest to the 3’ end of the template strand may extend and displace the other copies, leading to a long, double stranded chimeric product with a 3’ A-overhang at the end of the daughter copy. Random priming may achieve two elements in one by: 1) introducing an NGS-specific adapter sequence and 2) generating a modification-free copy (daughter strand) of the modified parent strand.
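The oligomer pool used for random priming can be illustrated with a minimal Python sketch. It assumes the degenerate head sits at the 3’ end of each oligo, so that a polymerase can extend from it, and it uses a made-up placeholder adapter tail because no adapter sequence is given in this description. All function names are hypothetical.

```python
import random

# Hypothetical placeholder tail; not a real platform adapter sequence.
ADAPTER_TAIL = "GATTACAGATTACAGATTACA"


def make_random_primer(head_length=6, adapter_tail=ADAPTER_TAIL, rng=random):
    """Return one oligo written 5'->3': adapter tail, then a degenerate head."""
    head = "".join(rng.choice("ACGT") for _ in range(head_length))
    return adapter_tail + head


def make_primer_pool(n_oligos, head_length=6, seed=0):
    """Draw n_oligos primers with a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return [make_random_primer(head_length, rng=rng) for _ in range(n_oligos)]


def pool_complexity(head_length):
    """Number of distinct heads possible for a fully degenerate N-mer."""
    return 4 ** head_length


if __name__ == "__main__":
    for oligo in make_primer_pool(3, head_length=6, seed=1):
        print(oligo)
    print(pool_complexity(6))
```

Under these assumptions, longer heads (N7 to N10) increase the number of possible head sequences by a factor of four per added base.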

[00151] In FIG. 43, adapter ligation may occur by incubating a mono-adapted chimeric labelled duplex template with a NGS-platform specific adapter (a forked adapter, a linear duplex adapter, a hairpin adapter, or a combination thereof) with 3’ T overhang and 5’ PO4 end, a dsDNA ligase (e.g. T4 ligase) and necessary cofactors (e.g. Mg2+, adenosine triphosphate (ATP), polyethylene glycol (PEG)) in a given buffer, at 20 °C for a defined period of time (e.g. 15 minutes). The A overhang of the monoadapted chimeric labelled duplex may match with the T overhang of the adapter and may promote ligation efficiency. Only one end of each duplex (that being formed by the 3’ end of the daughter strand) may be adapted. A successful ligation product may have a singly adapted azido-labeled parent strand (5’ adapted) and a doubly adapted non-modified daughter strand (both 3’ and 5’ ends). In some cases, upon amplification of such a “library”, only a bottom strand may be amplifiable with an adapter-specific polymerase chain reaction (PCR) primer.

[00152] In FIG. 43, magnetic bead binding may enable selective enrichment of labeled chimeric next generation sequencing (NGS) library fragments. This may be achieved directly (i.e. by Sharpless azide-alkyne cycloaddition reaction (CLICK) chemistry between the azido-glucose label and dibenzocyclooctyne (DBCO)-magbead) or indirectly (i.e. by Sharpless azide-alkyne cycloaddition reaction (CLICK) of a dibenzocyclooctyne (DBCO)-biotin linker and then conjugation of the product to streptavidin-magbeads). In some cases, only azido-labeled fragments (i.e. 5-hmC-containing) may bind to the magbead. Azido-labeled fragments may be immobilized to a bead, such as a magnetic bead. In some cases, this interaction may only occur via a labeled parent strand of the chimeric NGS library duplex. A copied complement may not be azido-labeled and thus may be immobilized to a bead only by virtue of the hydrogen-bonding interaction between the complementary duplex strands. As this H-bonding interaction may be non-covalent, it may be disrupted and exploited in downstream steps.

[00153] In FIG. 43, enrichment by stringent washing may be essential to maximize a signal-to-noise ratio of an enrichment process. Chimeric NGS library immobilized beads may be washed stringently (e.g. specific buffers; mild heat; mild denaturants etc.) to selectively remove non-covalently bound NGS library fragments that are non-specifically physisorbed to their surface. In some cases, such types of fragments may cause noise in a final sequencing result. Chimeric NGS library fragments covalently bound to the bead surface may be selected for in the enrichment (i.e. signal, those whose insert may originally have contained 5-hmC). After stringent washing, a daughter strand may be eluted from the bead (e.g. heat, high pH, low ionic strength buffer etc.) and taken forward to a PCR reaction. In some cases, the bead-immobilized fraction may be discarded. In some cases, these daughter strands may be exact complements of the labeled strands immobilized to a bead. However, they may not contain any epigenetic modifications and hence may be free from “5-hmC-density” amplification bias. Amplification of these eluted daughter strands may give a superior result over existing methodologies for two reasons: 1) an improved resolution (higher signal-to-noise) and 2) an improved representation (decreased selection bias).

[00154] The methods and systems as described herein may provide a result that may be far more representative of an extent to which a nucleic acid may be marked epigenetically. In some cases, the methods and systems may be superior to other methods of identification of epigenetic modifications. Other methods of identification may include the HMCP method or a method that comprises associating a sugar, a protein, an antibody, or a fragment of any of these with an epigenetic modification and detecting a presence of the sugar, the protein, the antibody, or fragment thereof. In some cases, nucleic acid sequences, such as fragments containing a high density of epigenetic modifications, may not be detected using other methods of identification of epigenetic modifications. The unbiased approach of the present methods and systems provides for detection of high density epigenetic modifications of nucleic acid sequences, such as short fragments, yielding an unbiased detection.

[00155] In FIG. 43, a daughter strand PCR amplification may occur. In some cases, PCR may be employed using only an eluted daughter strand as amplification template using standard protocols and procedures. In some cases, minimizing a number of PCR cycles may minimize duplicates. In some cases, using UMI-codes within an adapter sequence may help quantitation during downstream analysis. In some cases, a genome wide library of enriched fragments may be ready for sequencing.
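How UMI codes may help quantitation can be sketched as follows. This is a minimal illustration that assumes each sequencing read has already been aligned and reduced to a (chromosome, start, end, UMI) tuple, and that reads sharing both coordinates and UMI are PCR duplicates of one original molecule. Exact-match collapsing is a simplification (no UMI error correction), and the function name and record layout are hypothetical.

```python
from collections import defaultdict


def count_unique_molecules(reads):
    """Count distinct UMIs per fragment position.

    reads: iterable of (chrom, start, end, umi) tuples.
    Returns a dict mapping (chrom, start, end) -> number of distinct UMIs,
    i.e. an estimate of original molecules after removing PCR duplicates.
    """
    umis_by_fragment = defaultdict(set)
    for chrom, start, end, umi in reads:
        umis_by_fragment[(chrom, start, end)].add(umi)
    return {frag: len(umis) for frag, umis in umis_by_fragment.items()}


if __name__ == "__main__":
    # Hypothetical reads: the first two are duplicates of one molecule.
    example_reads = [
        ("chr1", 100, 260, "ACGTAC"),
        ("chr1", 100, 260, "ACGTAC"),
        ("chr1", 100, 260, "TTGCAA"),
        ("chr2", 500, 640, "GGATCC"),
    ]
    print(count_unique_molecules(example_reads))
```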

Variation 4

[00156] FIG. 44 shows one example of the 5-hmC Pulldown Random prime Label Enrich (HMCP_RLE) method detailed herein. In some cases, the HMCP_RLE method may provide: (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (c) a substantially improved robustness at low input mass as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or (d) any combination thereof.

[00157] FIG. 44 is similar to the method of FIG. 43 except that in some cases, priming (such as random priming) and ligation (such as adapter ligation) may occur before labeling as shown in FIG. 44 and in some cases, priming and ligation may occur after labeling as shown in FIG. 43.

[00158] As shown in FIG. 44, a first element 501 may (i) separate strands of a double stranded oligonucleotide fragment, such as a cell-free DNA fragment (having one or more epigenetic modifications at one or more bases on one or both strands) and (ii) initiate random priming to form a complementary strand, such as a substantially complementary strand, to at least one of the single stranded oligonucleotide fragments. Random priming may form a double stranded modified oligonucleotide fragment 502. The complementary strand formed by random priming may not have epigenetic modifications or may be substantially free of epigenetic modifications. A second element 503 may associate an adaptor to the double stranded modified oligonucleotide fragment (such as to one or both ends of one or both strands of the double stranded modified oligonucleotide fragment) to form a double stranded modified oligonucleotide fragment having one or more adaptors 504. A third element 505 may associate a label (such as an azido-glucose label) with the double stranded modified oligonucleotide fragment to yield a labeled fragment 506, such as a labeled chimeric library. The label may associate with an epigenetic modification or a type of epigenetic modification present at a base of the double stranded oligonucleotide fragment to form the labeled fragment 506. A fourth element 508 may be to associate a label 507 with the double stranded modified oligonucleotide fragment wherein the label 507 may also associate with a substrate. The label 507 may not bind directly to the complementary strand. The complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment. The interaction between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond.
A fifth element 509 may be to enrich a sample for one or more complementary strands 510 by removing or separating or washing away from the substrate one or more complementary strands that lack a label associated with the substrate (such as by disrupting the interaction between the complementary strand and the opposing strand). Upon separation, the modified oligonucleotide fragment may remain associated with the substrate. A sixth element 511 may be to amplify the complementary strand in the absence of the modified oligonucleotide fragment to form one or more daughter strands 512 of the complementary strand.

Variation 5

[00159] FIG. 45 shows one example of the 5-hmC Pulldown Label Loci Specific Enrich (HMCP_LLSE) method detailed herein. In some cases, the HMCP_LLSE method may provide (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in a 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (c) a substantially improved robustness at low input mass as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (d) targeted regions of 5-hmC enriched DNA as compared with other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or (e) any combination thereof.

[00160] As shown in FIG. 45, a first element 601 may associate a label (such as an azido-glucose label) with the double stranded oligonucleotide fragment, such as a cell-free DNA fragment, to yield a labeled fragment 602. The label may associate with an epigenetic
modification or a type of epigenetic modification present at one or more bases of the double stranded oligonucleotide fragment to form the labeled fragment 602. A second element 603 may (i) separate strands of a labeled fragment and (ii) initiate loci specific priming to form a complementary strand, such as a substantially complementary strand, to at least one of the single stranded oligonucleotide fragments. Loci specific priming may form a double stranded modified oligonucleotide fragment 604 having a label associated with an epigenetic modification of the parent strand. The complementary strand may be absent both epigenetic modifications and the associated label. A third element 605 may associate an adaptor to the double stranded modified oligonucleotide fragment (such as to one or both ends of one or both strands of the double stranded modified oligonucleotide fragment) to form a double stranded modified oligonucleotide fragment having one or more adaptors 606, such as a labeled and loci-enriched chimeric library. A fourth element 608 may be to associate a label 607 with the double stranded modified oligonucleotide fragment wherein the label 607 may also associate with a substrate. The label 607 may not bind directly to the complementary strand. The complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment. The interaction between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond. A fifth element 609 may be to enrich a sample for one or more complementary strands 610 by removing or separating or washing away from the substrate one or more complementary strands that lack a label associated with the substrate (such as by disrupting the bond between the complementary strand and the opposing strand). Upon separation, the opposing strand may remain associated with the substrate. 
A sixth element 611 may be to amplify the complementary strand in the absence of the modified oligonucleotide fragment to form one or more daughter strands 612 of the complementary strand.

[00161] In this example, both strands of double stranded DNA (dsDNA) fragments containing 5-hmC may be labeled using beta-glucosyltransferase (bGT) and UDP-6-azide-glucose (UDP-N3-glc). This step may be dsDNA selective (bGT may not work on single stranded DNA (ssDNA)). The position of the label may be determined by the presence/absence of 5-hmC in the dsDNA parent fragment. A label may be azido-glucose, transferred to the 5-hmC from UDP-N3-glc by bGT. The labeling may be performed directly on the purified circulating tumor DNA (ctDNA) extract. An advantage of this may be that the ctDNA may not have been through a series of library prep steps ahead of labeling. There may therefore be more material available at the labeling (improved efficiency), and the sample presented for labeling may be more representative than may be the case post NGS prep.

[00162] In some cases, hybridizing may comprise (i) priming (such as loci specific priming), (ii) ligation (such as adapter ligation), or (iii) a combination thereof. For example, in FIG. 45, loci specific priming may be performed by incubating azido-labeled dsDNA duplexes in the presence of an oligomer pool (where each oligo in the pool may comprise a loci specific “head” attached to a “NGS-adapter” tail), a DNA polymerase (e.g. Klenow) and a native dNTP mix in a given buffer, and performing a single extension reaction at 37 °C for a defined time (e.g. 10 mins). A loci specific head may be designed to be complementary to specific, defined regions of interest (ROI). Extension from an annealed loci specific primer may result in an A-overhang at an end of a daughter copy. Loci specific priming may achieve two elements in one: 1) it may introduce an NGS-specific adapter sequence in a loci-specific manner and 2) it may generate a modification-free copy (daughter strand) of the modified parent strand.

[00163] In FIG. 45, a labelled loci-monoadapted chimeric duplex template may be incubated with a NGS-platform specific adapter (illustration shows forked adapter, but linear duplex adapter or hairpin adapter may be substituted) with 3’ T overhang and 5’ PO4 end, a dsDNA ligase (e.g. T4 ligase) and necessary cofactors (e.g. Mg2+, adenosine triphosphate (ATP), polyethylene glycol (PEG)) in a given buffer, at 20 °C for a defined period of time (e.g. 15 minutes). The A overhang of the monoadapted chimeric labelled duplex may match with the T overhang of the adapter and may promote ligation efficiency. In some cases, only one end of each duplex (that being formed by the 3’ end of the daughter strand) may be adapted. A successful ligation product may have a singly adapted azido-labeled parent strand (5’ adapted) and a doubly adapted non-modified daughter strand (both 3’ and 5’ ends). Were one to amplify this “library”, it may be that only a bottom strand may be amplifiable with adapter-specific PCR primers.

[00164] In FIG. 45, following adapter ligation, an enrichment of the daughter strand by a substrate may be employed followed by PCR amplification of the daughter strand that may be substantially free of epigenetic modifications.

Variation 6

[00165] FIG. 46 shows one example of the 5-hmC Pulldown Loci Specific Label Enrich (HMCP_LSLE) method detailed herein. In some cases, the HMCP_LSLE method may provide (a) an improved resolution as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (b) a decrease in a 5-hmC-density bias as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (c) a substantially improved robustness at low input mass as compared to other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; (d) targeted regions of 5-hmC enriched DNA as compared with other methods, such as a HMCP method or a method that may associate a sugar, an antibody, a protein, a fragment of any of these, a label, or any combination thereof with an epigenetically modified base of the nucleic acid; or (e) any combination thereof.

[00166] FIG. 46 is similar to the method of FIG. 45 except that in some cases, priming (such as loci specific priming) and ligation (such as adapter ligation) may occur before labeling as shown in FIG. 46 and in some cases, priming and ligation may occur after labeling as shown in FIG. 45.

[00167] As shown in FIG. 46, a first element 701 may (i) separate strands of a double stranded oligonucleotide fragment, such as a cell-free DNA fragment and (ii) initiate loci specific priming to form a complementary strand, such as a substantially complementary strand, to at least one of the single stranded parent strands. Loci specific priming may form a double stranded modified oligonucleotide fragment 702. The double stranded oligonucleotide fragment may have one or more epigenetic modifications at one or more bases on one or both strands. The complementary strand, such as a substantially complementary strand, formed by loci specific priming may not have epigenetic modifications. A second element 703 may associate an adaptor to the double stranded modified oligonucleotide fragment (such as to one or both ends of one or both strands of the double stranded modified oligonucleotide fragment) to form a double stranded modified oligonucleotide fragment having one or more adaptors 704. A third element 705 may associate a label (such as an azido-glucose label) with the double stranded modified oligonucleotide fragment to yield a labeled fragment 706, such as a labeled chimeric library. The label may associate with an epigenetic modification or a type of epigenetic modification present at a base of the double stranded modified oligonucleotide fragment to form the labeled fragment 706. A fourth element 708 may be to associate a label 707 with the double stranded modified oligonucleotide fragment wherein the label 707 may also associate with a substrate. The label 707 may not bind directly to the complementary strand. The complementary strand may be indirectly associated with the substrate via the interaction between the substrate and the modified oligonucleotide fragment. The association between the complementary strand and the opposing strand may be disruptable, such as a disruptable bond. 
A fifth element 709 may be to enrich a sample for one or more complementary strands 710 by removing or separating or washing away from the substrate one or more complementary strands that lack a label associated with the substrate (such as by disrupting the bond between the complementary strand and the opposing strand). Upon separation, the opposing strand may remain associated with the substrate. A sixth element 711 may be to amplify the complementary strand in the absence of the parent strand to form one or more daughter strands 712 of the complementary strand.

[00168] The HMCP method may be referred to herein as the ‘standard’ method. The HMCP method may be referred to herein as HMCP, HMCP-v1, HMCP v1, v1HMCP, v1 HMCP, or V1. The CLE method may be referred to herein as HMCP_CLE, HMCP-v2, HMCPv2, CLE-HMCP, v2HMCP, v2 HMCP, or V2.

[00169] For any of the methods described herein, including CLE, HMCP_LCE, HMCP_CLE, HMCP_LRE, HMCP_RLE, HMCP_LLSE, HMCP_LSLE, one or more individual elements of a given method may be performed in the order as described herein. In some cases, one or more individual elements of a given method need not be performed in a particular order described herein. In some cases, one or more individual elements of a given method may be performed in a different order than described herein.

[00170] In some cases, the complementary strand may be a substantially complementary strand or may comprise a portion that may be substantially complementary to a portion of a nucleic acid sequence.

[00171] Hybridizing may comprise hybridizing at least two complementary strands to at least two portions of a nucleic acid sequence. Hybridizing may comprise hybridizing at least a portion of a complementary strand to an adapter sequence of the nucleic acid sequence. Hybridizing may comprise extension, such as cDNA extension. Hybridizing may comprise priming, such as loci specific priming or random priming. Hybridizing may comprise ligation, such as adapter ligation. Hybridizing may comprise hybridizing a primer to a nucleic acid sequence and elongating from the primer to form a complementary strand. Hybridizing may comprise obtaining a complementary strand and hybridizing the complementary strand to the nucleic acid sequence.

[00172] A label may be associated with an epigenetically modified base of a nucleic acid sequence. A label may be associated with an epigenetically modified base before hybridizing. A label may be associated with an epigenetically modified base after hybridizing.

[00173] The method may comprise amplifying the complementary strand in a reaction in which the nucleic acid sequence may be substantially not present. The amplifying may comprise associating the nucleic acid sequence and complementary strand with a substrate, such as by a label. The amplifying may comprise washing a substrate that may be associated with the nucleic acid sequence and complementary strand, such as stringent washing. The amplifying may comprise eluting a complementary strand from the substrate on which the nucleic acid sequence remains. The amplifying may comprise amplifying the complementary strand.

[00174] An epigenetic modification may comprise a DNA methylation. A DNA methylation may comprise a hyper-methylation or a hypo-methylation. A DNA methylation may comprise a modification of a DNA base, such as a 5-methylcytosine (5-mC), a 4- methylcytosine, a 6-methyladenine, or a combination thereof.

[00175] TissueMap2 may identify differences in 5hmC enrichment between the tumors of individuals with Breast, Colorectal, Prostate and Lung Cancer. 5hmC enrichment in plasma samples from the same patients was profiled, and the relationship with the 5hmC signal in the tumor was also examined.

[00176] Table 10 - Genes identified to have differential signal at an FDR < 0.01 specific to one cancer (for both tumor and plasma cases) relative to the other three (e.g. Breast vs [Prostate + Colorectal + Lung]).
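The FDR < 0.01 cutoff applied to a one-versus-rest contrast can be sketched in Python. The correction procedure is not named here, so the sketch assumes the Benjamini-Hochberg procedure, and it assumes a per-gene p-value for the contrast (e.g. Breast versus the other three) has already been computed. Gene names, p-values and function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_genes(p_by_gene, fdr=0.01):
    """Return genes whose adjusted p-value is below the FDR cutoff."""
    genes = list(p_by_gene)
    adjusted = benjamini_hochberg([p_by_gene[g] for g in genes])
    return [g for g, q in zip(genes, adjusted) if q < fdr]


if __name__ == "__main__":
    # Hypothetical p-values for one cancer type versus the other three.
    p_by_gene = {"GENE_A": 0.001, "GENE_B": 0.008, "GENE_C": 0.039,
                 "GENE_D": 0.041, "GENE_E": 0.042}
    print(significant_genes(p_by_gene, fdr=0.01))
```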

[00177] Table 13 - A table reporting the number of genes that are specifically enriched for one cancer type in plasma given that they are also significantly enriched in the tumor.

[00178] FIG. 26A-B gives two key examples of enriched genes in the tumor which are also enriched in the plasma.

[00179] Table 14 is similar to Table 10 in that it compares one cancer type to the other three for both tumors and plasma, but it reports results for a new feature type called a 5hmC enriched “peak”.

[00180] Table 15 describes a number of differential 5hmC enriched peaks comparing each tumor type to a set of normal tissue.

[00181] FIG. 31-38 give examples of peaks that are enriched in a tumor and also enriched in the plasma.

[00182] In some cases, where standard tests may not identify differential 5hmC enrichment between cancers in plasma samples, promising candidates may be identified using other methods. Tables 19-22 give the genome coordinates for these potential cancer-specific 5hmC enriched peaks.

EXAMPLES

[00183] One of the challenges for non-invasive diagnostic tests for early cancer detection may be to avoid diagnosing off-target cancer events. In Example 1, an experiment was designed to identify tissue specific 5-hydroxymethylated cytosine (5hmC) regions that are specific to several key cancer indications and demonstrate the specificity of a 5hmC cell-free DNA (cfDNA) colorectal cancer (CRC) classifier to off-target indications. In Example 1, the 5hmC cfDNA classifier (trained exclusively on colorectal cancer patients and healthy volunteers) can classify patients with off-target cancer indications as diseased (i.e. CRC). This may be demonstrated in at least prostate, lung or breast cancer, all of which have a high prevalence rate in the US population. Using methods described herein (such as a 5hmCP workflow), one can distinguish tumor specific regions and correspondence to functional genomic evidence. In Example 1, candidate regions may be identified that can be used to help prioritize cancer specific regions, deprioritize non-CRC specific regions, or a combination thereof.

[00184] The goal of Example 1 was to establish tumor specific 5hmC genomic regions for a set of cancer indications based on high prevalence, incidence and biological relatedness to aid in developing more precise genomic signatures from 5hmC cfDNA.

[00185] The identification of tumor specific 5hmC enriched genomic regions may provide locations in the genome that can be used to indicate the presence of tumor DNA from 5hmC cfDNA profiles. These regions can be prioritized in a machine learning strategy to indicate tumor presence. This may have particular importance for determining the difference between related cancers in a liquid biopsy diagnostic setting. There is some evidence that cancers in tissues that have high prevalence in the population (e.g. Prostate, Breast, Lung, Colon) may be close enough in their genomic profiles to cause misdiagnosis simply by the biological relatedness of their primary tissue (FIG. 1).

[00186] Advantages of methods as described herein may include: (i) obtaining genome-wide HMCP profiles of matching tumor genomic DNA (gDNA) and plasma cfDNA in multiple tissues, in replicates, and in healthy controls and cancer cases; (ii) conducting a data quality study on a plurality of samples; (iii) obtaining tumor-specific regions in the genome for each profiled tumor type; (iv) assessing tumor-specific regions in cfDNA samples; (v) obtaining CRC classifier performance on cfDNA Prostate, Breast, Lung and CRC profiles; (vi) a showing of how tumor profiles may improve performance of the classifier; (vii) an evaluation of tissue meta-features/models including (a) assessing how this may contribute to a tumor burden measure; (b) assessing how this may enable a tissue of origin fractionated read out for cfDNA; (c) assessing how this may result in a priori weighting in different genomic features for use in the classifier; (viii) an evaluation of SNV correspondence in patient tumor and patient cfDNA, including validating various regions with digital PCR or qPCR in cfDNA samples, or even an SNV array; or (ix) any combination thereof.

[00187] Referring to FIG. 1, an ARCH4 representation of RNA-seq profiles in normal tissue shows relatedness of organ tissue samples that may have high incidence or prevalence rates. Colon, Prostate, Breast and Lung may inhabit the same “neighborhood” in the low dimensional spatial representation.

[00188] In Example 1, a phase II of the 5hmC tissue map project collected tumor biopsies and blood plasma from 24 patients harboring 4 types of primary tumors: Prostate, Lung, Colorectal and Breast. All tissue was collected using a fresh frozen protocol. Analysis of this Example 1 also utilized samples collected in Phase I of the tissue map project, which profiled 21 normal tissues.

[00189] Table 1: Sample Description Overview. Number of tissue samples for each indication, 24 patients provided matched tumor and plasma samples.

[00190] Table 2. Detailed Patient Information.

[00191] Sample Material. Approximately 300 mg of tissue was extracted that resulted in 100 ng of gDNA for the tumor samples. 2 mL of patient plasma was used to obtain ~10 ng of cfDNA.

[00192] The Data/Experiment Balance and Biases. The gDNA and cfDNA extractions were performed separately on different dates. The method (HMCP method) was performed by a single operator. Gender was associated with cancer type due to the presence of gender specific cancers in the study (FIG. 3). There was a difference in the age of patients per cancer type (AOV p-value <0.01) (FIG. 4). Colorectal, lung and breast samples were mostly stage I samples. Prostate samples were split evenly between stage II and III (Table 3).
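The age comparison across cancer types rests on a one-way analysis of variance. A minimal Python sketch of the F statistic is given below under stated assumptions: the ages are invented for illustration and are not study data, the function name is hypothetical, and converting F to a p-value requires the F distribution, which is omitted to keep the sketch dependency-free.

```python
def one_way_anova_f(groups):
    """Compute the one-way ANOVA F statistic.

    groups: list of lists of numeric observations (one list per group).
    Returns (F, df_between, df_within).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Hypothetical patient ages per cancer type (illustration only).
    ages = {
        "breast": [48, 52, 55],
        "lung": [63, 66, 70],
        "prostate": [64, 68, 71],
    }
    print(one_way_anova_f(list(ages.values())))
```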

[00193] Referring to FIG. 3, a bar chart shows distribution of gender across cancer indications. Referring to FIG. 4, a box plot shows distribution of age across cancer indications demonstrating lower patient ages in breast cancer.

[00194] Table 3: Cancer Type by Cancer Stage. The Colorectal samples were all stage I, breast and lung samples were early stage, and prostate samples were mixed between stage II and stage III tumors.

[00195] Table 4: Summary of statistical tests between experimental variables to assess balance. Statistical tests were performed between experimental (categorical/continuous) variables to assess the balance of the cohort via either Fisher, ANOVA or correlation test (Pearson).

[00196] In Example 1, there were biases found between the clinical diagnosis and gender, extraction date, stage and histological grade. The gender variable was associated with the clinical diagnosis due to sex-specific indications (prostate, breast). The clinical diagnosis variable was also associated strongly with the extraction date and stage. The latter association may be due to the later stage cancer samples being nearly exclusively from prostate cancer donors. Gender, extraction date, stage and histological grade biases were identified due to the available samples.

[00197] Data Attrition. All samples were collected and processed through the methods described herein (HMCP method). There was no loss of samples at the sample collection or HMCP processing stage.

[00198] Data Preparation. Data preparation is described herein.

[00199] Quality Control (QC) checks of sequencing data

[00200] Pre-sequencing QC. This section provides basic information on the laboratory QC Readout using BioAnalyser/Qubit/Tapestation instruments. No samples were excluded before sequencing.

[00201] Post-sequencing QC. The quality of each sample was assessed using a set of sequencing metrics (Table 5) that have been specifically chosen to represent important quality components of the input and pulldown samples. These quality metrics were individually assessed for each sample for any unexpected values outside the normal ranges for cfDNA (Table 6) and gDNA (Table 7) from reference datasets.

[00202] To further aid in assessing trends across all of the metrics combined, a Principal Components Analysis (PCA) was employed to identify samples that may be outliers relative to the cohort. Outlying samples were identified using the Mahalanobis distance.

[00203] Failure Criteria. Two sample failures were identified based on at least one of the failure criteria: (1) Deduplicated read counts < 1,000,000 reads; and (2) 2hmC spike-in ratio < 1.

[00204] Outlier analysis using quality metrics. Using the Mahalanobis distance, six candidate outliers were identified. These were compared to 5hmC enrichment statistics (peak call statistics) which were deemed to be within acceptable ranges. Assessing performance of the classifier at different quality levels was part of the purpose of Example 1, and therefore these samples were not excluded (Table 5). Outliers were identified separately for the gDNA context and cfDNA context due to the considerable differences observed in the quality metrics (FIG. 5).

[00205] Excluded Samples. Two samples were excluded based on failure criteria. Zero samples were excluded based on outlier analysis.

[00206] Table 5: Table of quality metrics used to ascertain data quality of the HMCP profiles.

[00207] Table 6: QC metrics thresholds for cfDNA samples based on mean centered 2 x standard deviation (SD) thresholds.

[00208] Table 7: QC metrics thresholds for gDNA samples based on mean centered 2 x standard deviation (SD) thresholds.

[00209] Table 8: Excluded samples

[00210] Table 9: Mahalanobis distance on a PCA of quality score metrics.

[00211] Referring to FIG. 5 (Pulldown Library) and FIG. 6 (Input Library), the first two principal components in a PCA of the QC metrics are shown for Pulldown and Input libraries. These show separation between gDNA and cfDNA and the importance of assessing these separately in cfDNA and gDNA.

[00212] Referring to FIG. 7 and FIG. 8, the number of peaks vs the average enrichment in the peak is shown for two peak callers, MACS2 (FIG. 7) and EPIC (FIG. 8). Overall, cfDNA samples had fewer and shorter peaks than gDNA.

[00213] Referring to FIG. 9A-9B, key Pulldown QC metrics for the Tissue Map Project show absolute failure of a pulldown library. FIG. 9A shows sequencing depth after de-duplication and FIG. 9B shows spike-in ratio for 2hmC controls. One sample was identified as an outlier based on both of these metrics (CEG74_150_017PC, colorectal cancer cfDNA).

Boxes are organized from right to left in pairs. For example, cfDNA Breast Cancer, gDNA Breast Cancer, cfDNA Colorectal Cancer, gDNA Colorectal Cancer, and so on.

[00214] Referring to FIG. 10A-10B and FIG. 11A-11B, additional QC metrics indicate the quality of a sample. These additional QC metrics may include the median insert size (FIG. 10A), genebody to intergenic ratio (FIG. 10B), percentage read duplication rate (FIG. 11A) and the RPKM of the mitochondrial genes (FIG. 11B). Insert Size. Several outliers were present based on the sample insert sizes (cfDNA average may be approximately 166 bp; gDNA was sheared to ~150 bp). Gene body to intergenic ratio. A good quality pulldown may have a gene body to intergenic ratio of about > 1, but this may vary per sample type, and there was an increase in the ratio for gDNA over cfDNA for pulldown libraries (FIG. 10B). Deduplication rate. The rates were similar to those achieved in additional examples (i.e. HMCP110 project at ~40-85%) and were lower for gDNA than for cfDNA (FIG. 11A). RPKM of the Mitochondria. This may indicate the noisiness of a sample, as little to no reads are expected on this region of the genome. For example, a single breast cancer gDNA sample had > 500 reads and was far higher than expected (FIG. 11B).

[00215] Referring to FIG. 12A-12C, additional QC metrics are shown that may indicate the quality of the input library. FIG. 12A shows the diversity score, FIG. 12B shows the GC bias global error, and FIG. 12C shows the uniformity score. The diversity scores for Example 1 may be lower than other HiSeq runs (FIG. 12A). The uniformity scores were within ranges from reference data.

[00216] A focus of Example 1 was to identify cancer specific regions in tissue that may be informative in the cfDNA context. The tissue map project has produced 5hmC profiles from two key tissue sample sets: (1) Tissue Map Phase I normal tissues (TMapI); and (2) Tissue Map Phase II tumor tissues (TMapII). This leads to the following comparisons to identify discriminatory genes: (i) Using the TMapII data, the tumor samples of one cancer type were compared to the other three tumor types. For example, 6 CRC samples were compared to 18 samples (6 Lung tumor, 6 Prostate tumor, 6 Breast tumor). This was referred to subsequently as a Tumor1Vs3 comparison; (ii) Tumor samples of one type from TMapII were compared to the TMapI tissues. This was referred to as a TumorVsNormal comparison; (iii) Tumor samples of one type from the TMapII were compared to a combined set of TMapI normal tissues and the other TMapII tumor samples. For example, 6 CRC samples were compared to 39 samples (6 Lung tumor, 6 Prostate tumor, 6 Breast tumor, 21 normal tissue types). This was referred to subsequently as a TumorVsNormal+3 comparison; and (iv) Where noted, a pairwise comparison was also used that identifies a tumor-specific feature as one that was statistically significant in all three comparisons. For example, a CRC-specific feature is one for which all three comparisons (CRC vs breast, CRC vs prostate, CRC vs lung) reach statistical significance.
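The control sets used in comparisons (i) to (iii) can be illustrated with a short sketch. The function name `build_controls` and the sample identifiers are hypothetical and serve only to show how the group sizes of 18, 21 and 39 arise from 6 samples per tumor type and 21 normal tissues.

```python
def build_controls(target, tumor_samples, normal_samples):
    """Return the control sample lists for the three comparison schemes.

    target         -- tumor type under test, e.g. "CRC"
    tumor_samples  -- dict mapping tumor type to a list of sample ids
    normal_samples -- list of normal tissue sample ids
    """
    # All tumor samples that do not belong to the target tumor type.
    other_tumors = [
        sample
        for tumor_type, samples in tumor_samples.items()
        if tumor_type != target
        for sample in samples
    ]
    return {
        "Tumor1Vs3": other_tumors,
        "TumorVsNormal": list(normal_samples),
        "TumorVsNormal+3": list(normal_samples) + other_tumors,
    }


# Hypothetical sample identifiers: 6 donors per tumor type, 21 normal tissues.
tumor_samples = {
    t: [f"{t}_{i}" for i in range(1, 7)]
    for t in ("CRC", "Lung", "Prostate", "Breast")
}
normal_samples = [f"normal_{i}" for i in range(1, 22)]
controls = build_controls("CRC", tumor_samples, normal_samples)
```

In this sketch the target samples themselves never appear in any control set, so each comparison is always target versus non-target.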

[00217] The methods as described herein may show that the samples separated by tumor type using gene bodies or GeneHancer feature sets. The methods may show that prostate and breast may have higher numbers of discriminatory genes when using a Tumor 1 vs 3 comparison. All indications may have similar numbers when in a TumorVsNormal comparison. The addition of the three other indications to the normal samples may reduce the significant genes by an order of magnitude in breast, colorectal and lung tumors (TumorVsNormal+3 comparison). The methods may show that for CRC, lung and prostate, the most relevant set of discriminatory genes for the purpose of functional validation may be obtained using a control set with normal tissue and the addition of cancer tissue (TumorVsNormal+3). The methods may show candidate genes that may be significant in gDNA samples and may contribute to increased enrichment in the cfDNA of diseased patients. The methods may show a global assessment of 5hmC enrichment and the observation that the key separator may be the lower enrichment and shorter peak sizes in cfDNA compared to gDNA samples. However, a demonstrable separation between colorectal and prostate cancer in both cfDNA and gDNA may exist. The methods may show that using peak features (instead of genes and genehancers) may also produce specific clustering between the tumor samples and may separate them from the cfDNA samples. Furthermore, the 21 normal tissue samples also may separate distinctly from the tumor tissue. The methods may show that using a TumorVs3 peak strategy identifies thousands of tumor specific peaks in breast, colon and lung cancer. In prostate, tens of thousands of peaks may be identified. The methods may demonstrate a number of 5hmC enriched “peak” regions from tumor tissue (gDNA) and cfDNA that may be used to obtain more specific CRC features for cfDNA classification. The methods may demonstrate that a classifier trained on CRC and HV samples assigns most of the prostate and lung cancer samples to the CRC class, while breast samples may be assigned a mix of CRC and HV classes.

[00218] In summary, 5hmC enrichment was quantified in Gene and GeneHancer features using a readcount filter of 30 reads per feature and a coefficient of variation of between 0.2 and 2. Principal Components Analysis showed that samples separate by tumor type (FIG. 13A-13C). For gene bodies, prostate tumors were separated by the first principal component. Breast tumors separate from the other samples on principal components 2 and 3. Lung and colorectal tumor profiles appear to cluster closer together on principal component 2.

[00219] FIG. 13A-13C shows PCA plots of Tumor gene body profiles. The three plots show left to right: (FIG. 13A) PC1 vs PC2, (FIG. 13B) PC1 vs PC3 and (FIG. 13C) PC2 vs PC3. The prostate samples may be clearly separated by PC1, the Breast Cancer and CRC samples separated by PC2, the Lung by PC3.

[00220] Using the HMCP Data Quality Pipeline, 5hmC enrichment levels for Genes and GeneHancer features were produced. Two statistical tests (Mann-Whitney U test and DESeq2 tests) were used to identify genes that differ in their 5hmC enrichment between the tumors and control group according to the schemes 1 to 3. The Mann Whitney U test was run on the ratio of pulldown RPKMs to input RPKMs. The DESeq2 test was also run on the pulldown library read count values. The results of the Mann-Whitney U test for gene bodies are discussed herein.

[00221] Using the Tumor1Vs3 strategy, higher numbers of prostate and breast specific genes were identified.

[00222] The TumorVsNormal comparisons identified thousands of significant genes, but adding the Tumor tissue (TumorVsNormal+3 strategy) reduced this to hundreds of significant genes, except in the Prostate cancer case, which did not show the same reduction in number.

[00223] Tumor Tissue vs other Tumor Tissue (TumorVs3) (gDNA)

[00224] No differences were found in the cfDNA (Table 10) using the Tumor1Vs3 scheme for any of the indications. For the Tumor gDNA samples, using the same Tumor1Vs3 scheme, the following were identified: (i) No significant colorectal cancer genes; (ii) ~70 significant lung cancer genes; (iii) ~450 significant breast cancer genes; (iv) ~6000-8000 significant prostate cancer specific genes.

[00225] Table 10: Number of significant genes <= 0.01 Benjamini-Hochberg FDR adjusted Mann-Whitney p-value in the comparison 1 vs All other Tumor Types. “Cofv filtered” uses only genes and genehancers that obtain > 30 reads, and with a coefficient of variation > 0.2 and < 2.

[00226] Cancer Tissue vs Other Cancer Tissue + All normal tissue (TumorVsNormal and TumorVs3 +Normal)

[00227] Genes were identified with specific 5hmC enrichment to a tumor type using two control sample strategies that employed normal tissues. A first control set was composed of 21 different normal tissue types. A second control set was composed of the 21 normal tissues with the addition of the remaining three other tumor type samples. A Mann-Whitney U test was applied with adjustment for false discovery rate and filtered at adjusted p-value < 0.01.
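The false discovery rate adjustment and the adjusted p-value filter can be sketched as follows. This is a minimal illustration of the standard Benjamini-Hochberg step-up procedure applied to per-gene p-values that are assumed to have been computed already (for example by a Mann-Whitney U test). The function names and the example gene labels are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_features(p_by_feature, threshold=0.01):
    """Keep features whose adjusted p-value is below the threshold."""
    names = list(p_by_feature)
    adjusted = benjamini_hochberg([p_by_feature[n] for n in names])
    return [n for n, q in zip(names, adjusted) if q < threshold]


# Hypothetical raw p-values for three features.
hits = significant_features({"geneA": 0.0001, "geneB": 0.5, "geneC": 0.004})
```

The threshold of 0.01 matches the adjusted p-value filter stated above. Any other threshold can be passed through the `threshold` argument.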

[00228] As a result, the TumorVsNormal+3 strategy may reduce the number of significantly differential genes in the breast, colon and lung tumors from the thousands to the hundreds. This may not hold with the prostate tumor comparison, where the number of significant genes remained approximately the same.

[00229] Table 11: The number of significant genes that may be differential between a tumor type and two control sets that include a range of normal tissue samples. The control set “vsNormal” = 21 Normal Tissue, “vsAll” = 21 Normal Tissue + the other Tumor Tissues. FDR < 0.01.

[00230] Lists of significant genes, as described herein, were assessed for functional relevance using the GeneCards GeneAnalytics database. The lists were sorted by Benjamini-Hochberg FDR and then the top 100 statistically ranked genes were used as input to the GeneCards GeneAnalytics application. The disease specificity results were used to functionally validate the identified differential genes.

[00231] The three control strategies were assessed for functional relevance over each of the indications. Dependent on the control strategy, it may be found that all indications displayed some specificity towards the target comparison (Table 12). For colorectal, prostate and lung cancer comparisons, using the normal tissue with the addition of the three remaining cancer tissue as the control strategy may give good functional validation. Breast cancer enriched regions were functionally validated best by a Tumor 1 vs 3 control strategy.

[00232] Table 12: Functional relevance of the gene sets produced by different comparison

[00233] Summary of the functional validation by GeneCards GeneAnalytics database for each indication and control comparison strategy.

[00234] Breast: TumorVsThree is shown in FIG. 14. TumorVsNormal is shown in FIG. 15. TumorVsNormal+3 is shown in FIG. 16.

[00235] Colorectal: TumorVsThree is shown in FIG. 17. TumorVsNormal is shown in FIG. 18. TumorVsNormal+3 is shown in FIG. 19.

[00236] Prostate: TumorVsThree is shown in FIG. 20. TumorVsNormal is shown in FIG. 21. TumorVsNormal+3 is shown in FIG. 22.

[00237] Lung: TumorVsThree is shown in FIG. 23. TumorVsNormal is shown in FIG. 24. TumorVsNormal+3 is shown in FIG. 25.

[00238] Regions identified as tumor-specific based on our TumorVsNormal+3 strategy were utilized. It was tested whether fragments from patient cfDNA might be indication-specific in these regions.

[00239] Table 13 shows the number of genes identified from tumor samples as being specifically enriched in the tumor and having evidence for enrichment in the same indication in the cfDNA samples. None of the tests gave statistically significant results at an FDR of about 0.05. Candidates emerge with relaxed statistical criteria; about 100 genes in CRC may be specific in both gDNA and cfDNA. However, it may be clear that the tests are constrained by the available power in the cfDNA samples.

[00240] Gene features may be ordered by median difference in 5hmC enrichment between the tumor and control. Many of the most highly enriched CRC gDNA genes may also show CRC specificity in cfDNA. Furthermore, these genes may be enriched in CRC patients over healthy volunteers in the HMCP110 cohort study and thus may provide further candidates to promote based on higher CRC specificity.

[00241] Table 13: Summary of the number of tumor specific genes that show differences in cfDNA

[00242] It was tested whether tumor (gDNA) specific genes (FDR < 0.05) retain specificity in the plasma (cfDNA) samples. The grayed boxes highlight genes that may be specific to the indication in both gDNA and cfDNA. The remaining rows show the number of features identified in cfDNA that may not be specific to the tumor indication (gDNA) in cfDNA. These may be potential candidate genes to deprioritize in a CRC classification approach (see the bolded rows for colon cancer case). Logic for deprioritization may follow the assumption that in the general population of cancer patients, genes that show fundamental specificity at the tumor level may be more reliable candidates. It may be the case that specificity for a cancer may be achieved via another fundamental mechanism (e.g. the immune system) in cfDNA. Thus, a caveat may exist that genes selected for deprioritization may be deemed candidates that require further evidence of their behavior before being utilized in a classifier.

[00243] FIG. 26A-26B: RP11-404P21.8 and FAT1 are examples of genes with high specificity in colorectal cancer tumors relative to the three other tumor types. Both these genes are also enriched in CRC over HV in the HMCP110 project with FDR < 0.01.

[00244] Enriched 5hmC regions may be identified using models for both narrow and broad enrichment (such as the MACS2 and EPIC methods, respectively) in the cfDNA and gDNA samples.

[00245] The median number of peaks may be significantly higher (p < 2.5 x 10^-7; Mann-Whitney U test) in gDNA (N=364303) than in cfDNA (N=154900), although the average length of peaks (gDNA average length = 511 bp, cfDNA average length = 477 bp) may be less separated between the two types and not significant (p = 0.07). This highlights the overall flatter enrichment in cfDNA.

[00246] Visually, it may be apparent that the colorectal tumors have the least number of peaks, while prostate may be twice the amount. The breast and lung samples may be more variable and may have numbers in between colorectal and prostate. Interestingly, the CRC cfDNA samples may also have lower numbers of peaks relative to the other cfDNA samples (FIG. 27). This lower number of peaks in both cfDNA and gDNA in CRC may also hold over both narrow and broad peak calling.

[00247] FIG. 27: Number of Narrow Peaks called in Tissue Map II. gDNA samples may have significantly higher numbers of 5hmC peaks than cfDNA samples.

[00248] FIG. 28: Number and Length of Broad Peaks in Tissue Map II. gDNA samples may have significantly higher numbers of 5hmC peaks than cfDNA samples.

[00249] A peak set may be formed that captures enriched regions across the different indications specific to both gDNA and cfDNA. Tumor samples may cluster by tumor type and may be separated from cfDNA samples (FIG. 29). Using hierarchical clustering on this peak set, normal and cancer tissues may cluster separately on different branches (FIG. 30).

[00250] FIG. 29: Hierarchical clustering of peak set over all samples in the tissue map2 project. Y axis = 1 - correlation. The peak set used to identify regions of the genome to count fragments is the intersection of the cfDNA and gDNA peaks.

[00251] FIG. 30: Hierarchical clustering of peak set over all samples in the tissue map II and tissue map I project. Y axis = 1 - correlation. The peak set used to identify regions of the genome to count fragments is the intersection of the cfDNA and gDNA peaks. Notably, lung and colorectal cancers may cluster together. Breast and normal tissue samples may cluster in the branch. Prostate samples may be closer to the breast and normal tissue, than the lung and colorectal cancer samples.

[00252] A peak set may be created that captures enriched regions across the different indications specific to both gDNA and cfDNA. Two different control strategies may be used to identify tumor specific peaks: (1) Tumor 1 Vs 3 (e.g. Breast Cancer Samples vs CRC, Prostate, Lung Samples combined) and (2) Tumor Vs Normal Tissues. For each of these control strategies, a strategy may be followed to identify differential peaks with 5% FDR and 2-fold change. Using a Tumor 1 vs 3 approach, thousands of tumor specific peaks may be identified for breast, colorectal and lung cancer, and tens of thousands for prostate cancer. Similar to the gene and genehancer features, one may not find any peaks differentially 5-hydroxylated in cfDNA.

[00253] Table 14: Tumor 1 Vs 3 strategy.

[00254] Table 15: Tumor 1 Vs Tissue normals [panel of 12 normal tissues from TissueMapI]

[00255] The top CRC tumor specific peaks may be ranked and the corresponding specificity for CRC in cfDNA peaks may be assessed. Peaks may be identified that show CRC specificity in gDNA and cfDNA and may be candidates to be prioritized / utilized in a model development strategy (FIG. 32 - FIG. 38). FIG. 32 - FIG. 38 show examples of the identification of CRC specific gDNA peaks and the same discriminatory regions in cfDNA profiles. The left plot shows the profile in TMapII tumor samples, the middle plot shows the sample in context with TMapI normal samples, the right plot shows the cfDNA TMapII samples in context with the HMCP cfDNA.

[00256] Use of 5hmCP tumor profiles may be explored to aid increasing the performance of a classifier, particularly focusing on decreasing the likelihood to predict off-target cancer indications. Many tumor specific features may be identified and validated with functional information from the GeneCards GeneAnalytics database. Breast cancer may be more closely related to normal tissue types and this may appear to be reflected in cfDNA, where it may be more likely to be assigned a HV class from the CEGX predictor. Candidate features may be identified that display specificity in the gDNA and cfDNA profiles that may be utilized (perhaps altogether) to indicate tumor type and may serve as candidates for prioritization in model development.

[00257] Primary Pipeline Processing: (1) create mapped mapq=1 deduplicated bams; (2) call narrow peaks per donor; (3) call broad peaks per donor.

[00258] Data quality may be assessed at several levels:

[00259] A. Quality of Technical Factors

[00260] 1. Failure Criteria: Failure criteria may be based on low sequencing depth of < 1M reads after de-duplication for either pulldown or input libraries and enrichment of the 100bp 2hmC spike-in controls at a ratio < 1 over 2mC and C spike-ins in the pulldown library.
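The two failure criteria can be expressed as a simple check. The following is a minimal sketch. The function name `failure_reasons`, its argument names and the reason labels are hypothetical, and the spike-in ratio is assumed to have been computed upstream.

```python
def failure_reasons(dedup_reads_pulldown, dedup_reads_input, spike_in_ratio,
                    min_reads=1_000_000, min_ratio=1.0):
    """Return the failure criteria a sample triggers (empty list if it passes)."""
    reasons = []
    # Criterion 1: fewer than 1M de-duplicated reads in either library.
    if min(dedup_reads_pulldown, dedup_reads_input) < min_reads:
        reasons.append("low_depth")
    # Criterion 2: spike-in control enrichment ratio below 1 in the pulldown.
    if spike_in_ratio < min_ratio:
        reasons.append("spike_in_ratio")
    return reasons
```

A sample is treated as failed when the returned list is non-empty, which corresponds to failing at least one of the criteria.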

[00261] 2. Quality Score Assessment: A series of thresholds may be applied to key QC metrics which may be specific to either pulldown or the input library. A quality score may be generated based on passing thresholds detailed in Table 5. In addition to this, the data may be compared to historic thresholds from HiSeq4000 data based on the Tissue Map phase I and HMCP110 projects, which are outlined in Table 6 and Table 7.
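The historic thresholds of Table 6 and Table 7 are described as mean centered 2 x standard deviation ranges. A minimal sketch of such a range is given below. It assumes the sample standard deviation is used, which is not stated in the text, and the function names are hypothetical.

```python
from statistics import mean, stdev


def two_sd_range(reference_values, n_sd=2.0):
    """Return (low, high) bounds centered on the mean of the reference data."""
    centre = mean(reference_values)
    # Assumption: sample standard deviation (n - 1 denominator).
    spread = stdev(reference_values)
    return centre - n_sd * spread, centre + n_sd * spread


def outside_range(value, bounds):
    """True if a QC metric value falls outside the reference bounds."""
    low, high = bounds
    return value < low or value > high
```

For example, reference insert sizes of 160, 165 and 170 bp give a range of 155 to 175 bp, so a sample with a median insert size of 150 bp would be flagged for review.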

[00262] 3. Outlier analysis: A PCA outlier analysis may be performed, where the pulldown and input QC metrics may be used to generate the PCA and outliers may be identified using a Mahalanobis distance test with a p-value threshold of 0.05.
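A Mahalanobis distance test on PCA scores can be sketched as follows. This illustration assumes that each sample has already been projected onto the first two principal components, and that the squared distance is compared against a chi-square distribution with 2 degrees of freedom. Neither detail is specified in the text, and the function name is hypothetical.

```python
import math


def mahalanobis_outliers(points, alpha=0.05):
    """Return indices of (pc1, pc2) points whose p-value is below alpha."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix of the two components.
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    flagged = []
    for i, (x, y) in enumerate(points):
        dx, dy = x - mean_x, y - mean_y
        # Squared Mahalanobis distance via the closed-form 2x2 inverse.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        # Chi-square survival function for 2 degrees of freedom: exp(-d2 / 2).
        p_value = math.exp(-d2 / 2)
        if p_value < alpha:
            flagged.append(i)
    return flagged
```

Because the covariance is estimated from all samples including any outliers, extreme samples inflate the covariance and the test is conservative in small cohorts. A robust covariance estimate would be one alternative design.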

[00263] 4. Qualitative Assessment: The candidate outliers and peak enrichment statistics may be discussed to decide on whether further samples may be excluded from the study.

[00264] B. Quality of Biological Factors

[00265] After outliers are removed, exploratory analysis (feature-based PCAs) and differential testing (Mann-Whitney U test) may be performed based on the RPKM ratio matrix, per sample type, to assess the strength of the biological signal.

[00266] Core Analysis Pipeline for Peak Set Generation

[00267] The union of peaks from all indications may be collected into one set and used as the base set. These peaks may then be filtered according to Table 16:

[00268] Table 16: Creation of peak set using MACS2 features

[00269] Tissue-map Reference Peak set

[00270] For each group (e.g. cfDNA colorectal cancer, gDNA breast cancer), sequenced reads are pooled from all 6 samples to call a group-specific peak set. Peaks are called using both narrow and broad peak calling methods, and the narrow and broad peaks are merged. The following regions are removed: chromosomes X, Y, M; peaks that do not contain CpGs; ENCODE blacklisted regions. When comparing between cfDNA and gDNA peak sets, 80% of cfDNA peaks are contained in the gDNA peak set. This is observed in all 6 indications. Hence, the gDNA peak sets are used to create a tissue-map reference peak set. A simplified approach may be adopted to create a tissue-map reference peak set (i.e. taking the union of 4 tumor peak sets). Note that the union method may result in overlapping contiguous peaks. The reference peak set is filtered according to Table 16. Only the filtered reference peak set is used for statistical tests. Each peak is annotated with nearest gene, nearest gene enhancer and number of CpGs in peak.
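The fraction of cfDNA peaks contained in a gDNA peak set can be estimated with a short interval-overlap sketch. It assumes peaks are given as (chromosome, start, end) tuples with half-open coordinates, and it adopts the criterion stated for FIG. 54 that a peak counts as overlapping when at least 50% of its base pairs are covered. The function names are hypothetical.

```python
def overlap_bp(a, b):
    """Number of base pairs shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def fraction_overlapping(query_peaks, reference_peaks, min_fraction=0.5):
    """Fraction of query peaks covered by reference peaks for >= min_fraction of bp.

    Assumes the reference peaks on a chromosome do not overlap one another,
    so that summing per-peak overlaps does not double count bases.
    """
    by_chrom = {}
    for chrom, start, end in reference_peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = 0
    for chrom, start, end in query_peaks:
        covered = sum(overlap_bp((start, end), ref) for ref in by_chrom.get(chrom, []))
        if covered >= min_fraction * (end - start):
            hits += 1
    return hits / len(query_peaks)
```

This linear scan is adequate for illustration only. For peak sets of the size reported here (hundreds of thousands of peaks), a sorted sweep or an interval tree would be a more practical choice.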

[00271] Quantification of peak counts

[00272] Bamreadcounts at mapq=1 deduplicated using reference peak set

[00273] Core Analysis Pipeline Description for identification of differential 5hmC enriched Peaks

[00274] The following analysis strategy is carried out to obtain tissue-specific peaks: a DESeq pipeline is used to find differential peaks, with statistical significance imposed at 5% FDR and 2-fold change.

[00275] A two-step strategy is carried out to discover putative tissue-specific cancer peaks. First, tissue specificity is tested by a “1 vs 3” test (i.e. breast vs other 3 [colorectal, lung, prostate], colorectal vs other 3). Four “1 vs 3” tests were done. Hence, for each peak, a statistical significance of tissue specificity for each of the four cancer indications is provided. As an example, a CRC-specific peak is one that is statistically significant only in CRC vs other 3 and not in any of the other “1 vs 3” tests. Second, a case-control or cancer-healthy comparison is tested. For each tumor gDNA, a panel of tissue normals from 12 samples in Tissue map2 is used as control. For cell-free DNA, the results of the HMCP110 study were used to identify CRC cancer peaks. By selecting peaks that are differential in both steps, peaks can be identified that are not only good cancer indicators but are also specific to one cancer type.
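The two-step selection can be sketched with set operations. The sketch assumes the sets of significant peaks from each “1 vs 3” test and from the cancer vs normal test have already been produced by the differential analysis. The function name and the peak identifiers are hypothetical.

```python
def tissue_specific_peaks(one_vs_three_hits, cancer_vs_normal_hits, tissue):
    """Select peaks specific to one tissue that are also cancer indicators.

    one_vs_three_hits     -- dict mapping tissue to the set of peak ids that
                             are significant in that tissue's "1 vs 3" test
    cancer_vs_normal_hits -- set of peak ids significant in the case-control
                             comparison for the same tissue
    """
    # Step 1: significant in this tissue's test and in no other "1 vs 3" test.
    specific = set(one_vs_three_hits[tissue])
    for other, hits in one_vs_three_hits.items():
        if other != tissue:
            specific -= hits
    # Step 2: also differential in the cancer vs normal comparison.
    return specific & cancer_vs_normal_hits


# Hypothetical peak identifiers.
hits_1v3 = {
    "CRC": {"peak1", "peak2", "peak3"},
    "Breast": {"peak2"},
    "Lung": set(),
    "Prostate": {"peak9"},
}
crc_peaks = tissue_specific_peaks(hits_1v3, {"peak1", "peak2", "peak7"}, "CRC")
```

In this example only `peak1` survives: `peak2` is also significant for breast, and `peak3` is not differential against normal tissue.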

[00276] Core Analysis Pipeline Description For Genomic Feature Analysis

[00277] Gencode Version v. 25, GeneHancers set.

[00278] Normalization: Samples can be normalized using the following formula for both the pulldown and input samples to gain the Reads Per Kilobase Million (RPKM) value for each feature:

[00279] feature_rpm = (feature_readcount + pseudo_count) / (total_readcount + pseudo_count) * 1000000

[00280] feature_rpkm = feature_rpm / feature_length * 1000

[00281] The log2 fold change of the pulldown over the input can be calculated.

[00282] log2_fold_change = log2(pd_rpkm / input_rpkm)

[00283] This log fold change value can be used in subsequent analysis, such as the machine learning pipeline and gene body / geneHancer analysis.
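The normalization formulas above translate directly into code. The following is a minimal sketch. The default pseudo-count of 1 is an assumption, since the value is not stated in the text, and the function names are hypothetical.

```python
import math


def feature_rpkm(feature_readcount, total_readcount, feature_length, pseudo_count=1):
    """Reads Per Kilobase Million for one feature, following the formulas above."""
    feature_rpm = (
        (feature_readcount + pseudo_count)
        / (total_readcount + pseudo_count)
        * 1_000_000
    )
    return feature_rpm / feature_length * 1000


def log2_fold_change(pd_rpkm, input_rpkm):
    """log2 ratio of pulldown RPKM over input RPKM for one feature."""
    return math.log2(pd_rpkm / input_rpkm)
```

Because the pseudo-count is added to the feature read count, the RPKM is never zero, so the ratio and its logarithm remain defined for features with no reads in the input library.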

[00284] Feature Filtering: Two levels of filters have been applied: (1) a read count filter of > 1 read per feature; (2) a read count filter of > 30 reads per feature and a coefficient of variation filter of > 0.2 and < 2.
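The second, stricter filter can be sketched as follows. The text does not state how the read count threshold is aggregated across samples, nor which values the coefficient of variation is computed on. This sketch assumes every sample must exceed the threshold and that the coefficient of variation is taken over the per-sample read counts using the sample standard deviation. The function names are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def keep_feature(read_counts, min_reads=30, cv_low=0.2, cv_high=2.0):
    """Apply the read count filter and the coefficient of variation filter."""
    # Assumption: the feature must exceed min_reads in every sample.
    if min(read_counts) <= min_reads:
        return False
    cv = coefficient_of_variation(read_counts)
    return cv_low < cv < cv_high
```

The lower bound on the coefficient of variation removes features that barely vary between samples, while the upper bound removes features dominated by a few extreme values.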

[00285] Software location

[00286] Core Analysis Pipeline for Genome Browser Snapshots

[00287] Core Analysis Pipeline for Functional Annotation

[00288] Core Analysis Pipeline for Classifier Model Building

[00289] Experimental Workflows and Plan

[00290] Summary

[00291] Overview:

[00292] 24 = 6 donors X 4 cancer types

[00293] Minimum of 13 working days to do the whole experiment

[00294] 3 operators but only one operator to do one step resulting in a reduced operator bias

[00295] All four tissues are present per experimental batch

[00296] Staggered batches

[00297] Validate libraries in MiniSeq

[00298] Check how many reads per index (low coverage - single end 25bp perhaps)

[00299] May ensure the pooling has succeeded

[00300] May also review sequencing coverage

[00301] Sequencing: r? x c? Plex in NovaSeq S4, at 75bp PE and 50M seq depth per sample

[00302] HMCP Experimental Workflow

[00303] FIG. 39 shows an experimental Workflow for the tissue profiling phase II.

[00304] Tissue Map phase I

[00305] The aim of tissue map phase I was to obtain samples from a wide variety of human tissues that can be processed with the HMCP protocol to begin a map of 5hmC diversity across tissues. The data may help us to begin to identify peaks of 5hmC throughout the genome and improve our feature definitions. A total of 23 samples were processed successfully with the HMCP protocol with both input and pulldown libraries sequenced on a HiSeq4000 (Run497). The samples span a wide variety of tissues (n=1) with two organs at n=2 (kidney and pancreas) from two vendors (AMSBio and Origene). The majority of samples were obtained with cause of death (COD), age, gender and ethnicity. This dataset was used to support Tissue Map phase II, which contains tumor tissue replicates (n=6) with matched cfDNA samples for breast, colorectal, lung and prostate cancer. For comparisons between tumor and normal, or indication-specific tumor comparisons, 21 of the 23 tissue types were used. The excluded samples were excluded because the cause of death was cancer of another organ. It is important to note that the sequencing depth achieved for the Tissue Map phase I was an average of 48M read pairs for pulldown libraries, while the NovaSeq data achieved an average of 24.3M read pairs for pulldown libraries.

[00306] Sample Information: The 23 tissues underwent gDNA extraction and the HMCP protocol using 100 ng of starting material and 10 PCR cycles for both input and pulldown libraries. The HMCP protocol was performed by a single operator (PG).

[00307] Quality Assessment and Outliers: The libraries were assessed using the draft quality scoring via the Data Quality System. All samples passed the threshold of 1M sequencing reads after de-duplication and successful 100bp 5hmC spike-ins (ratio > 1). Following these criteria, the samples were assessed using the quality scoring system, with the exception of the historic thresholds comparison, as no previous gDNA sequencing data was available from the HiSeq for this assessment. Plots for individual QC metrics of importance can be completed for the pulldown and input libraries respectively. A library with no potential quality issues scores 0.

[00308] Quality score: (1) Spike in failure / low sequencing depth (<1M) = + 5; (2) Low gene body to intergenic ratio (<1) = + 3; (3) High RPKM for mitochondrial genes (>1000) = + 3; (4) Less than 10% of reads in peaks (epic + macs2) = + 3; (5) Low input library uniformity (<0.8) = + 3.
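The additive quality score can be sketched directly from the five penalties listed above. The function and argument names are hypothetical, and the metric values are assumed to have been computed by the upstream QC pipeline.

```python
def quality_score(spike_in_failed, dedup_reads, genebody_intergenic_ratio,
                  mito_rpkm, fraction_reads_in_peaks, input_uniformity):
    """Sum the penalties for one library. A score of 0 means no issues found."""
    score = 0
    # (1) Spike-in failure or low sequencing depth (< 1M reads).
    if spike_in_failed or dedup_reads < 1_000_000:
        score += 5
    # (2) Low gene body to intergenic ratio (< 1).
    if genebody_intergenic_ratio < 1:
        score += 3
    # (3) High RPKM for mitochondrial genes (> 1000).
    if mito_rpkm > 1000:
        score += 3
    # (4) Less than 10% of reads in peaks.
    if fraction_reads_in_peaks < 0.10:
        score += 3
    # (5) Low input library uniformity (< 0.8).
    if input_uniformity < 0.8:
        score += 3
    return score
```

Under these penalties the maximum possible score is 17, reached when all five checks fail.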

[00309] Table 17: Sample information for Tissue Map phase I normal tissues.

[00310] FIG. 40A-40B shows sequencing depth achieved for Tissue Map phase I libraries

[00311] FIG. 41 shows enrichment of 100bp 2hmC spike-ins from the pulldown libraries of Tissue Map phase I.

[00312] FIG. 50. Genebody - genomic region from the transcription start site to the transcription termination site. Genebody definitions from GENCODE V25 human genome assembly 38 are used. Enhancer - (Genehancers) - genomic regions of open DNA sites where proteins can bind and regulate gene expression. Enhancer definitions from GeneCards are used. Peaks - data driven / derived features - highlighted by dotted box.

[00313] FIG. 51A-B. Development of a data driven 5hmC peak reference set from the HMCP110 dataset. The development of a data driven feature set may be advantageous for several reasons: (a) it covers more of the genome; (b) it reduces the dilution of signal in many cases.

[00314] Three feature classes to utilize for signature discovery. Three sets of features are considered: (1) genebodies (56,788 features); (2) genehancers (218,187 features); and (3) peaks (104,933 features*). *Peaks with the mean read count > 50 in the HMCP110 training set.

[00315] FIG. 52. Additional versions of data sets were produced by applying a stricter read count filter on gene bodies, genehancers and peaks. Features with median < 50 reads on the input across all samples were removed: (i) genebodies 50 reads filter (35,007 features kept); (ii) genehancers 50 reads filter (91,671 features kept); (iii) peaks 50 reads filter (~80,000 features kept).

[00316] FIG. 53A-C. To address cross reactivity, 5hmC Tissue Map Project.

[00317] FIG. 54. About 80% of cfDNA peaks overlap with gDNA in tissue map samples. Peaks are considered overlapping if the overlap is >= 50% of bp. gDNA has 1.3X to 1.9X the number of peaks of cfDNA at the same sequencing depth. DQS reports better enrichment of gDNA than cfDNA. Starting DNA material: 100 ng gDNA >> 5 ng cfDNA. 62% of HMCP110 cfDNA peaks overlap with gDNA.

[00318] FIG. 55. Adjacent normal tissue samples cluster closer to cancer than healthy, suggesting there may be an epigenetic cancer field effect. gDNA 5hmC peak call-based hierarchical clustering demonstrates distinct groupings by normal and cancer derived tissue. Adjacent‘normal’ clusters closer to cancer than healthy.

[00319] FIG. 56A-B. 5hmC genebody and peak data separates tumors.

[00320] FIG. 57A-B shows gDNA and cfDNA peaks discriminate between cancer types.

[00321] FIG. 58A-B. Cancer-specific peaks were identified in gDNA. FIG. 60 shows an updated version of the cancer-specific peaks identified in gDNA.

[00322] FIG. 59. To address potential cross-reactivity, a pairwise approach was used to identify 135 CRC-specific peaks from tissue map cfDNA data. Using a pairwise approach on the cfDNA samples, the CRC-specific peaks were detected. A peak was marked as CRC-specific if it showed a significant enrichment/depletion with respect to all the other three cancer types. A Wilcoxon test and a p-value of 0.05 were used to identify specific peaks. This strategy allowed the identification of 135 peaks: 95 depleted in CRC and 40 enriched in CRC. No strong consistency was found between cfDNA and gDNA.

[00323] Approximate categorization of 21 tissue/organ types in terms of developmental origin

Table 18

Table 19. Breast Cancer Specific

Table 20. Colorectal Cancer Specific

Table 21. Lung Cancer Specific

Table 22. Prostate Cancer Specific