Title:
HIGHLY SENSITIVE METHOD FOR DETECTING CANCER DNA IN A SAMPLE
Document Type and Number:
WIPO Patent Application WO/2022/029688
Kind Code:
A1
Abstract:
Described herein is a method for detecting cancer DNA in a test sample of DNA from a patient. In some embodiments, the method may comprise: (a) sequencing multiple aliquots of the test sample to produce, for each aliquot, sequence reads corresponding to two or more target regions that each have a sequence variation present within the patient's cancer; (b) for each aliquot, for each target region: i. determining the number of sequence reads that have the sequence variation; ii. determining the total number of sequence reads; and iii. comparing i. and ii. to one or more error probability distribution models for the sequence variation, wherein the one or more models are obtained from DNA that does not contain the sequence variation; and (c) integrating the collective results of step (b) to determine if there is cancer DNA in the test sample.

Inventors:
PERRY MALCOLM (GB)
MARSICO GIOVANNI (GB)
OSBORNE ROBERT (GB)
ROSENFELD NITZAN (GB)
FORSHEW TIM (GB)
Application Number:
PCT/IB2021/057217
Publication Date:
February 10, 2022
Filing Date:
August 05, 2021
Assignee:
INIVATA LTD (GB)
International Classes:
C12Q1/6806; C12Q1/6827; C12Q1/6869
Domestic Patent References:
WO2016009224A1 2016-01-21
WO2020031048A1 2020-02-13
WO2013036929A1 2013-03-14
WO2012142611A2 2012-10-18
Foreign References:
US5948902A 1999-09-07
US20050233340A1 2005-10-20
US5635400A 1997-06-03
EP0799897A1 1997-10-08
US5981179A 1999-11-09
Other References:
KORNBERG, BAKER: "DNA Replication", 1992, W.H. FREEMAN
LEHNINGER: "Biochemistry", 1975, WORTH PUBLISHERS
STRACHAN, READ: "Human Molecular Genetics", 1999, WILEY-LISS
"Oligonucleotides and Analogs: A Practical Approach", 1991, OXFORD UNIVERSITY PRESS
"Oligonucleotide Synthesis: A Practical Approach", 1984, IRL PRESS
ZHANG ET AL., NATURE CHEMISTRY, vol. 4, 2012, pages 208 - 214
GORELENKOV ET AL., BIOTECHNIQUES, vol. 31, 2001, pages 1326 - 30
KEMENA ET AL., BIOINFORMATICS, vol. 25, 2009, pages 2455 - 65
LO ET AL., AM J HUM GENET, vol. 62, 1998, pages 768 - 75
BRENNER ET AL., PROC. NATL. ACAD. SCI., vol. 97, 2000, pages 1665 - 1670
SHOEMAKER ET AL., NATURE GENETICS, vol. 14, 1996, pages 450 - 456
FUNARI ET AL., BLOOD, vol. 128, 2016, pages 3176
HEUSER ET AL., DTSCH. ARZTEBL. INT., vol. 113, 2016, pages 317 - 322
SINT ET AL., METHODS ECOL EVOL., vol. 3, 2012, pages 898 - 90
SHEN ET AL., BMC BIOINFORMATICS, vol. 11, 2010, pages 143
YAMADA ET AL., NUCLEIC ACIDS RES., vol. 34, 2006, pages W665 - 9
LEE ET AL., APPL. BIOINFORMATICS, vol. 5, 2006, pages 99 - 109
VALLONE ET AL., BIOTECHNIQUES, vol. 37, 2004, pages 226 - 31
RACHLIN ET AL., BMC GENOMICS, vol. 6, 2005, pages 102
MARGULIES ET AL., NATURE, vol. 437, 2005, pages 376 - 80
RONAGHI ET AL., ANALYTICAL BIOCHEMISTRY, vol. 242, 1996, pages 84 - 9
SHENDURE, SCIENCE, vol. 309, 2005, pages 1728
IMELFORT ET AL., BRIEF BIOINFORM, vol. 10, 2009, pages 609 - 18
APPLEBY ET AL., METHODS MOL BIOL., vol. 513, 2009, pages 19 - 108
ENGLISH, PLOS ONE, vol. 7, 2012, pages e47768
MOROZOVA, GENOMICS, vol. 92, 2008, pages 255 - 64
FORSHEW ET AL., SCI. TRANSL. MED., vol. 4, 2012, pages 136 - 68
GALE ET AL., PLOS ONE, vol. 13, 2018, pages e0194630
WEAVER ET AL., NAT. GENET., vol. 46, 2014, pages 837 - 843
CASBON, NUCL. ACIDS RES., vol. 22, 2011, pages e81
ALEXANDROV, NATURE, vol. 578, 2020, pages 94 - 101
MARTINCORENA, CAMPBELL, SCIENCE, vol. 349, 2015, pages 1483 - 9
Claims:
CLAIMS

What is claimed is:

1. A method for detecting cancer DNA in a test sample of DNA from a patient, comprising:

(a) sequencing multiple aliquots of the test sample to produce, for each aliquot, sequence reads corresponding to two or more target regions that each have a sequence variation present within the patient’s cancer;

(b) for each aliquot, for each target region: i. determining the number of sequence reads that have the sequence variation; ii. determining the total number of sequence reads; and iii. comparing i. and ii. to one or more error probability distribution models for the sequence variation, wherein the one or more models are obtained from DNA that does not contain the sequence variation; iv. eliminating variants that are above a threshold in a statistically improbable number of aliquots; and

(c) integrating the collective results of step (b) to determine if there is cancer DNA in the test sample.

2. The method of claim 1, wherein a statistically improbable number of aliquots are identified by: measuring the amount of test sample DNA added to each aliquot; calculating the fraction of cancer DNA in the test sample using sequencing data for all or a subset of the variants; and estimating the probability of observing the number of aliquots that contain the sequence variation above a threshold, based on i. and ii.

3. The method of any prior claim, wherein the fraction of cancer DNA in the test sample of DNA is equal or less than 0.01%.

4. The method of any prior claim, wherein step (a) comprises sequencing at least 10 target regions in at least 3 aliquots of the test sample.

5. The method of any prior claim, wherein the method comprises, before step (a), identifying a set of sequence variations that are present within the patient’s cancer.

6. The method of any prior claim, wherein the cancer is a blood cancer and the test sample comprises cellular DNA isolated from cells from peripheral blood, a lymph node or bone marrow.

7. The method of any of claims 1-5, wherein the cancer is a solid tumor and the test sample comprises cfDNA.

8. The method of any prior claim, wherein step (b) comprises: v.

(i) deriving an estimate of the number of molecules that have the sequence variation,

(ii) calculating the probability that there is at least one molecule that has the sequence variation,

(iii) determining if the frequency of sequence reads that have the sequence variation compared to the total number of sequence reads is above a threshold,

(iv) calculating a likelihood ratio for (i); and/or

(v) determining if any of (i), (ii) or (iv) is above a threshold.

9. The method of any prior claim, further comprising calculating the fraction of cancer DNA in the test sample or the total quantity based on the results of step (b).

10. The method of claim 8, wherein (b)(iv) is done by calculating a likelihood ratio between the likelihood of observing the results obtained in (b)(i) in samples:

(i) if cancer DNA is present

(ii) if cancer DNA is not present; and combining the individual likelihood ratios into a cumulative likelihood ratio score across all sequence variations and aliquots of the test sample

11. The method of any prior claim, further comprising identifying the patient as having cancer if the result of step (c) is at or above the threshold.

12. The method of any prior claim, further comprising administering a therapy to the patient.

13. The method of any prior claim, wherein the patient has previously undergone a first therapy and, based on the results of step (c), the method comprises administering a second therapy that is different to the first therapy to the patient.

14. The method of any prior claim, wherein the patient has or had cancer or has a clonal growth that is not yet cancer but has the potential to transform.

15. The method of any prior claim, wherein the patient has undergone or is undergoing treatment for the cancer.

Description:
HIGHLY SENSITIVE METHOD FOR DETECTING CANCER DNA IN A SAMPLE

CROSS-REFERENCING

This application claims the benefit of U.S. provisional application serial no. 63/061,568, filed on August 5, 2020, which application is incorporated by reference herein.

BACKGROUND

In many cases, cancer treatment may require at least two steps: a first treatment intended to remove the tumor cells then a second treatment aiming to eradicate any remaining cancer cells in the patient’s body if the initial treatment is not completely successful. The treatment used to eradicate the remaining cancer cells often differs from the first treatment.

The small number of cancer cells that remain in the person after initial treatment when a patient may apparently be in remission is often called “minimal residual disease” (MRD) or residual disease. These residual cells will ultimately be the cause of relapse in many cancers. It is critical to determine the likelihood of a patient having disease recurrence and relapsing following initial treatment so that those most likely to need additional treatment can receive additional treatment, while those that don’t need additional treatment are spared, thereby reducing harm to the patient and decreasing the cost of treatment. As such, effective methods for detecting minimal residual disease are highly desirable. It is also critical to have sensitive methods that detect risks of cancer recurrence earlier than current methods (e.g., which are usually done by imaging or clinical analysis).

MRD has been successfully detected in some hematological malignancies because relatively large amounts of DNA can be analyzed and because common tumor-specific fusions occur frequently and can be measured in a straightforward way. There is now strong evidence that MRD can be detected for many solid tumors by assessing cell free DNA (cfDNA) for circulating tumor DNA (ctDNA). The problem with detecting minimal residual disease in cfDNA, however, is that many of the tests used to detect sequence variations in a sample are not sensitive enough. Many of today’s molecular tests are done by sequencing cfDNA for a panel of known genes. The problem with detecting minimal residual disease by sequencing cfDNA is that the amount of tumor DNA in cell-free DNA is often well below the limit of detection of such methods. Specifically, the frequency at which an individual tumor sequence variation is expected to occur in the cfDNA of patients that have minimal residual disease is typically well below the frequency at which sequencing artefacts are generated by PCR errors, base mis-calls and/or DNA damage. This problem is compounded by the fact that, in some cases, the level of mutant DNA may be so low that, on average, there is less than a single copy of each mutation being assessed in the cfDNA sample being analyzed. In addition, relatively small amounts of mutant DNA derived from white blood cells that have lysed in the bloodstream can lead to erroneous results. Thus, detection of minimal residual disease by sequencing-based approaches has remained challenging.

This disclosure provides a highly sensitive method for detecting tumor DNA. The method may be used to diagnose minimal residual disease, among other things.

SUMMARY

Described below is a method for detecting cancer DNA in a test sample of DNA from a patient. In some embodiments, the method may comprise: (a) sequencing multiple aliquots of the test sample to produce, for each aliquot, sequence reads corresponding to two or more target regions that each have a sequence variation present within the patient’s cancer; (b) for each aliquot, for each target region: i. determining the number of sequence reads that have the sequence variation; ii. determining the total number of sequence reads; and iii. comparing i. and ii. to one or more error probability distribution models for the sequence variation, wherein the one or more models are obtained from DNA that does not contain the sequence variation; and (c) integrating the collective results of step (b) to determine if there is cancer DNA in the test sample. In any embodiment, step (b) may comprise iv. eliminating variants that are above a threshold in a statistically improbable number of aliquots. These variants (i.e., the variants that are in a statistically improbable number of aliquots) can be identified by measuring the amount of test sample DNA added to each aliquot, calculating the fraction of cancer DNA in the test sample and estimating the probability of observing the number of aliquots with the variant above a threshold based on i and ii.
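As a rough illustration of step (b)(iii), the sketch below compares the variant and total read counts for each aliquot and target region against a plain binomial error model. The binomial form, the error rates, the variant names and the read counts are all assumptions chosen for illustration; the disclosure contemplates several kinds of error probability distribution and does not prescribe this one.

```python
import math

def binom_log_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def binom_tail(k, n, p):
    """P(X >= k): chance that errors alone give at least k variant reads."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    lower = sum(math.exp(binom_log_pmf(i, n, p)) for i in range(k))
    return max(0.0, 1.0 - lower)

# Hypothetical per-variant error rates, as if estimated from DNA
# that does not contain the sequence variation.
error_rate = {"var1": 1e-4, "var2": 5e-4}

# Hypothetical counts: (aliquot, variant) -> (variant reads, total reads).
counts = {
    (1, "var1"): (0, 20000),
    (2, "var1"): (12, 20000),
    (1, "var2"): (9, 20000),
    (2, "var2"): (11, 20000),
}

# One tail probability per aliquot and variant; small values mean the
# observed variant reads are hard to explain by background error alone.
p_values = {
    key: binom_tail(k, n, error_rate[key[1]])
    for key, (k, n) in counts.items()
}
```

In this toy data only aliquot 2 of `var1` stands out from its background; the `var2` counts are consistent with that variant's higher assumed error rate.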

The present method relies on two features: (i) aliquot-based sequencing (i.e., sequencing the same target regions in multiple aliquots of the same sample, i.e., a sample that has been divided or partitioned) and (ii) analysis of multiple variants assessing for a signal in any of the aliquots (as opposed to identifying variant DNA in one aliquot and then determining that the sample definitely contains cancer DNA because the same variant can be found in another aliquot), and analyzing all of the data, after statistically improbable data points have been removed.

One problem solved by this method is that for some samples (i.e., samples that contain a small fraction of cancer DNA, e.g., less than 0.01% tDNA) the number of sequence reads that contain a particular sequence variation is virtually indistinguishable from the variations that are caused by noise (i.e., the combination of base-miscalls, PCR errors, damaged DNA, etc.). As such, in many cases it is simply impossible to reliably determine that a sample contains cancer DNA by conventional sequencing approaches.

As noted above, the present invention is aliquot-based. For example, in some embodiments, the method may involve sequencing at least 10 target regions in at least 3 aliquots of the test sample and, in practice, the method may involve sequencing at least 24 target regions in at least 4 aliquots of the test sample. While aliquot-based sequencing may initially seem like a waste of effort because the same number of wild type and variant molecules are still being sequenced (but split across multiple aliquots), the signal-to-noise ratio actually increases in the aliquot-based method. Specifically, in situations in which there are very few variant molecules in the sample (e.g., one or two variant molecules), the ratio of variant molecules to wild type molecules will be much higher in the aliquot that contains the variant molecule. This, in turn, reduces mis-calls and makes the data more reliable. In addition to increasing the signal-to-noise ratio, the method produces more data than conventional approaches, which, in turn, allows the data to be analyzed by more sophisticated statistical and/or threshold-based methods. For example: (i) so-called “noisy bases” (i.e., positions that have a high intrinsic background and are frequently miscalled) can be identified and eliminated because the signal will be consistently high (relative to background) in most or all aliquots and (ii) variants that are associated with improbably high signals (e.g., a variant that has three times the number of sequence reads than would be expected for a single variant molecule in one aliquot and a background number of sequence reads in the other aliquots, or a variant that appears to be in three of four aliquots when the other variants are only in one or zero of the aliquots) can be identified and eliminated. Various other advantages are described below.
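The signal-to-noise argument can be checked with simple arithmetic. The sketch below assumes, purely for illustration, a sample of 10,000 molecules containing a single variant molecule, split evenly into 4 aliquots; the function names and numbers are hypothetical.

```python
def pooled_vaf(variant_molecules, total_molecules):
    """Variant allele fraction when the whole sample is analyzed together."""
    return variant_molecules / total_molecules

def positive_aliquot_vaf(variant_in_aliquot, total_molecules, n_aliquots):
    """Variant allele fraction inside the aliquot that received the variant,
    assuming the sample was split into equal-sized aliquots."""
    molecules_per_aliquot = total_molecules / n_aliquots
    return variant_in_aliquot / molecules_per_aliquot

total = 10_000                              # assumed molecules in the sample
single = pooled_vaf(1, total)               # 1 variant in 10,000 molecules
split = positive_aliquot_vaf(1, total, 4)   # 1 variant in 2,500 molecules
enrichment = split / single                 # fold increase in variant fraction
```

Under these assumptions the variant fraction in the positive aliquot is about four times the pooled fraction, while the per-read background error rate is unchanged, so the single molecule is easier to separate from noise.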

Depending on how the method is implemented, the method may have certain advantages over conventional methods. For example, the method may be used to consistently and reliably determine whether a DNA sample has cancer DNA, even if the fraction of cancer DNA in the sample is less than 0.01%. This is well below the level of sensitivity of conventional methods, and well below the frequencies at which sequencing artefacts can be generated by errors. By assessing several sequence variations, the method is also able to detect cancer DNA in a sample of DNA in which there is on average less than a single copy of each individual sequence variation. The method can be implemented in a way that reaches this level of sensitivity without sacrificing specificity (i.e., without generating many false positive results). The presence of ctDNA can be estimated at the level of variant molecules added to each aliquot, not variant reads following DNA sequencing. This can reduce false positives in some situations (for example, a low initial input of DNA molecules with high sequencing depth), and provides a more accurate estimate of the global fraction of cancer DNA.
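The benefit of assessing several sequence variations when each is present at less than one copy on average can be sketched as follows. The example assumes that the number of molecules carrying each variant is Poisson distributed with mean equal to the tumor fraction times the number of genome copies analyzed, and that variants are independent; the input of 5,000 genome copies is a hypothetical figure.

```python
import math

def prob_any_variant_molecule(tumor_fraction, genome_copies, n_variants):
    """Probability that at least one variant molecule is present in the
    sample, under an assumed independent Poisson model per variant."""
    expected_per_variant = tumor_fraction * genome_copies
    # 1 - exp(-x), written with expm1 for accuracy when x is small.
    return -math.expm1(-n_variants * expected_per_variant)

tumor_fraction = 1e-4   # 0.01% cancer DNA
genome_copies = 5_000   # assumed input; gives 0.5 expected copies per variant

one_variant = prob_any_variant_molecule(tumor_fraction, genome_copies, 1)
many_variants = prob_any_variant_molecule(tumor_fraction, genome_copies, 24)
```

With a single tracked variant, a variant molecule is present in well under half of such samples; with 24 tracked variants, at least one variant molecule is almost always present under this model.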

Additionally, in some embodiments, the present method optionally determines whether the sample contains cancer DNA by scoring all variations in all aliquots in a probabilistic continuum (i.e. a probability distribution over the number of molecules observed), rather than calculating the number of positives (the number of aliquots with clear evidence of ctDNA), and determining a positive or negative result through the application of simple rules. This allows exploration of borderline signals which are not significant when taken individually, but can be combined into strong evidence of ctDNA across multiple variants, increasing sensitivity. It also allows for flexible reporting based on degree of confidence, and the potential to combine other data e.g. prior probability of disease recurrence based on cancer type or stage.
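A minimal sketch of how borderline signals might be combined is shown below. It scores each (variant reads, total reads) observation with a log-likelihood ratio between a "cancer DNA present" and a "cancer DNA absent" hypothesis and sums the scores. The read-level binomial model, the single shared error rate, the assumed signal level and the decision threshold of 3 are simplifying assumptions for illustration; the disclosure describes scoring over the number of molecules, which is not reproduced here.

```python
import math

def log_likelihood_ratio(variant_reads, total_reads, error_rate, signal_vaf):
    """log[ P(data | cancer DNA present) / P(data | cancer DNA absent) ]
    under binomial read sampling; the binomial coefficients cancel."""
    p_absent = error_rate
    p_present = error_rate + signal_vaf
    k, n = variant_reads, total_reads
    return (k * math.log(p_present / p_absent)
            + (n - k) * math.log((1 - p_present) / (1 - p_absent)))

def cumulative_score(observations, error_rate, signal_vaf):
    """Sum of log-likelihood ratios over all variants and aliquots."""
    return sum(log_likelihood_ratio(k, n, error_rate, signal_vaf)
               for k, n in observations)

# Hypothetical (variant reads, total reads) pairs across variants and aliquots.
observations = [(3, 10000), (4, 10000), (1, 10000),
                (3, 10000), (4, 10000), (2, 10000)]
ERROR_RATE = 1e-4   # assumed background error rate
SIGNAL_VAF = 2e-4   # assumed variant fraction if cancer DNA is present
THRESHOLD = 3.0     # hypothetical decision threshold on the summed score

individual = [log_likelihood_ratio(k, n, ERROR_RATE, SIGNAL_VAF)
              for k, n in observations]
total_score = cumulative_score(observations, ERROR_RATE, SIGNAL_VAF)
```

In this toy data no single observation reaches the threshold, but the summed score does, which is the behavior the paragraph above describes for signals that are not significant individually.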

In addition, rare errors, such as DNA damage prior to amplification or early-cycle PCR errors, can be directly modelled by this approach. These would appear to be real signal based on the estimation process described in the previous paragraph. These effects are not captured in most models of DNA sequencing errors and could therefore lead to false positives if left unaccounted for. Alternatively, these can be dealt with by requiring signal to be detected in more than one aliquot (since 2 such events in a single sample would be very unlikely); however, this reduces sensitivity. The method can model this effect by considering whether molecules detected in each aliquot are more likely to come from ctDNA or from a rare error, by considering factors such as the estimated cancer DNA fraction or type of DNA base change.

The method can use a further error-reduction strategy, by excluding variants which show an unusually high level of signal in multiple aliquots, based on the estimated cancer DNA fraction. Intuitively, if only a handful of variant molecules are detected in the sample as a whole, it is unlikely that these would all be present at a single location (barring amplification or copy number changes). This could result from Clonal Hematopoiesis of Indeterminate Potential (CHIP) mutations, contamination, or similar errors. It could also be due to a single DNA base producing many more sequencing errors than accounted for in the background model, which makes this method suitable for “one-shot” use without first sequencing against a panel of normal samples.
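One way to estimate whether a variant is above threshold in a statistically improbable number of aliquots is sketched below. It assumes that the number of variant molecules in an aliquot is Poisson distributed with mean equal to the estimated cancer DNA fraction times the genome copies added to that aliquot, and that aliquots are independent. The copy numbers, the estimated fraction and the cut-off are hypothetical.

```python
import math

def prob_aliquot_positive(tumor_fraction, copies_in_aliquot):
    """Chance an aliquot holds at least one variant molecule (Poisson model)."""
    return -math.expm1(-tumor_fraction * copies_in_aliquot)

def positive_count_distribution(probs):
    """Distribution of the number of positive aliquots when each aliquot has
    its own probability of being positive (aliquots assumed independent)."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for count, mass in enumerate(dist):
            new[count] += mass * (1 - p)
            new[count + 1] += mass * p
        dist = new
    return dist

def prob_at_least(k, probs):
    """P(at least k aliquots are positive)."""
    return sum(positive_count_distribution(probs)[k:])

copies_per_aliquot = [2500, 2400, 2600, 2500]   # measured input, hypothetical
estimated_fraction = 1e-5                       # estimated from other variants
probs = [prob_aliquot_positive(estimated_fraction, c)
         for c in copies_per_aliquot]

IMPROBABLE = 1e-3                               # hypothetical cut-off
p_three_of_four = prob_at_least(3, probs)
exclude_variant = p_three_of_four < IMPROBABLE
```

Here a variant seen above threshold in three of four aliquots would be excluded, because at the estimated cancer DNA fraction that pattern is far less likely than the chosen cut-off.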

These and other advantages may become apparent in view of the following discussion.

BRIEF DESCRIPTION OF THE FIGURES

The skilled artisan will understand that the drawings, described below, are for illustration purposes only. The drawings are not intended to limit the scope of the present teachings in any way.

Fig. 1 is a flow chart showing how aliquot-based sequencing can be implemented. As would be apparent, the different aliquots of the test sample can be barcoded with different aliquot identifier sequences and then combined prior to sequencing.

Fig. 2 is a flow chart that follows from the flow chart of Fig. 1. Fig. 2 shows how the sequence reads can be processed to determine, for each aliquot, for each target region, the number of sequence reads that have the sequence variation and the total number of sequence reads.

Fig. 3 is a flow chart that shows an example of how the workflow shown in the flow chart of Fig. 2 can be implemented. The steps illustrated in Fig. 3 can be done in any convenient order.

Fig. 4 is a flow chart that follows from the flow chart of Fig. 2. Fig. 4 shows how the variant and total read counts for each sequence variation and aliquot can be analyzed along with probability distributions for each sequence variation and then integrated to determine if there is cancer DNA in the sample.

Fig. 5 is a flow chart illustrating how probability distribution models for each sequence variation can be produced. Probability distributions include binomial, over-dispersed binomial, beta, normal, exponential or gamma probability distribution models. Such models may not be needed in embodiments that use molecular indexes.

Fig. 6 is a flow chart illustrating a threshold-based approach for analyzing data for each sequence variation in each aliquot.

Fig. 7 is a flow chart that illustrates a way to integrate the results of the threshold-based method illustrated in Fig. 6.

Fig. 8 is a flow chart illustrating a statistical approach for analyzing data for each sequence variation in each aliquot.

Fig. 9 is a flow chart illustrating how the statistical results shown in Fig. 8 can be integrated.

Fig. 10 is a flow chart illustrating the last step in Fig. 1, showing two approaches by which the results of one test sample can be compared to one or more additional samples.

Fig. 11 schematically illustrates some of the principles of an embodiment of the present method.

Fig. 12 illustrates the principles of a probability distribution for estimating the number of variant molecules.

Figs. 13A and 13B illustrate examples of error probability distributions. In the model shown in Fig. 13A, the data corresponding to low frequency high signal events are hatched. The model shown in Fig. 13B is a mixture model. “VAF” refers to variant allele frequency. Such models are obtained from DNA that does not contain the sequence variation and they indicate the probability of different variant allele fractions in this normal DNA (or the number of variant reads over the total wild type reads). Such distributions may differ from variant class to variant class and from sequencing depth to sequencing depth. In some cases, 2 or more distributions are required to account for the different types of error. In some cases, a threshold may be established above which one can be reasonably certain that a sequence variation identified in sequence reads is not an error.

Fig. 14 illustrates how data from “noisy” bases can be identified and eliminated using an aliquot approach.

Fig. 15 illustrates some of the difficulties in detecting cancer DNA by methods in which the individual aliquots are scored for whether they contain a particular variant or not.

Fig. 16 shows how the fraction of cancer DNA can be calculated.

Fig. 17 shows the results of an experiment in which over 40 sequence variations in four aliquots of each of three different samples containing varying levels of circulating tumor DNA (ctDNA) were assessed.

DEFINITIONS

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Still, certain elements are defined for the sake of clarity and ease of reference.

Terms and symbols of nucleic acid chemistry, biochemistry, genetics, and molecular biology used herein follow those of standard treatises and texts in the field, e.g. Kornberg and Baker, DNA Replication, Second Edition (W.H. Freeman, New York, 1992); Lehninger, Biochemistry, Second Edition (Worth Publishers, New York, 1975); Strachan and Read, Human Molecular Genetics, Second Edition (Wiley-Liss, New York, 1999); Eckstein, editor, Oligonucleotides and Analogs: A Practical Approach (Oxford University Press, New York, 1991); Gait, editor, Oligonucleotide Synthesis: A Practical Approach (IRL Press, Oxford, 1984); and the like.

The term “nucleotide” is intended to include those moieties that contain not only the known purine and pyrimidine bases, but also other heterocyclic bases that have been modified. Such modifications include methylated purines or pyrimidines, acylated purines or pyrimidines, alkylated riboses or other heterocycles. In addition, the term “nucleotide” includes those moieties that contain hapten or fluorescent labels and may contain not only conventional ribose and deoxyribose sugars, but other sugars as well. Modified nucleosides or nucleotides also include modifications on the sugar moiety, e.g., wherein one or more of the hydroxyl groups are replaced with halogen atoms or aliphatic groups, or are functionalized as ethers, amines, or the like.

The terms “nucleic acid” and “polynucleotide” are used interchangeably herein to describe a polymer of any length, e.g., greater than about 2 bases, greater than about 10 bases, greater than about 100 bases, greater than about 500 bases, greater than 1000 bases, greater than 10,000 bases, greater than 100,000 bases, greater than about 1,000,000, up to about 10^10 or more bases, composed of nucleotides, e.g., deoxyribonucleotides or ribonucleotides, and may be produced enzymatically or synthetically (e.g., PNA as described in U.S. Patent No. 5,948,902 and the references cited therein) which can hybridize with naturally occurring nucleic acids in a sequence specific manner analogous to that of two naturally occurring nucleic acids, e.g., can participate in Watson-Crick base pairing interactions. Naturally-occurring nucleotides include guanine, cytosine, adenine, thymine and uracil (G, C, A, T and U, respectively). DNA and RNA have a deoxyribose and ribose sugar backbone, respectively, whereas PNA’s backbone is composed of repeating N-(2-aminoethyl)-glycine units linked by peptide bonds. In PNA various purine and pyrimidine bases are linked to the backbone by methylenecarbonyl bonds. A locked nucleic acid (LNA), often referred to as inaccessible RNA, is a modified RNA nucleotide. The ribose moiety of an LNA nucleotide is modified with an extra bridge connecting the 2' oxygen and 4' carbon. The bridge “locks” the ribose in the 3'-endo (North) conformation, which is often found in the A-form duplexes. LNA nucleotides can be mixed with DNA or RNA residues in the oligonucleotide whenever desired. The term “unstructured nucleic acid,” or “UNA,” is a nucleic acid containing non-natural nucleotides that bind to each other with reduced stability. For example, an unstructured nucleic acid may contain a G' residue and a C' residue, where these residues correspond to non-naturally occurring forms, i.e., analogs, of G and C that base pair with each other with reduced stability, but retain an ability to base pair with naturally occurring C and G residues, respectively. Unstructured nucleic acid is described in US20050233340, which is incorporated by reference herein for disclosure of UNA.

The term “nucleic acid sample,” as used herein, denotes a sample containing nucleic acids. Nucleic acid samples used herein may be complex in that they contain multiple different molecules that contain sequences. Genomic DNA samples from a mammal (e.g., mouse or human) are types of complex samples. Complex samples may have more than about 10^4, 10^5, 10^6, 10^7, 10^8, 10^9 or 10^10 different nucleic acid molecules. Any sample containing nucleic acid, e.g., genomic DNA from tissue culture cells or a sample of tissue, may be employed herein.

The term “oligonucleotide” as used herein denotes a single-stranded multimer of nucleotides of from about 2 to 200 nucleotides, up to 500 nucleotides in length. Oligonucleotides may be synthetic or may be made enzymatically, and, in some embodiments, are 30 to 150 nucleotides in length. Oligonucleotides may contain ribonucleotide monomers (i.e., may be oligoribonucleotides) or deoxyribonucleotide monomers, or both ribonucleotide monomers and deoxyribonucleotide monomers. An oligonucleotide may be 10 to 20, 21 to 30, 31 to 40, 41 to 50, 51 to 60, 61 to 70, 71 to 80, 80 to 100, 100 to 150 or 150 to 200 nucleotides in length, for example.

“Primer” means an oligonucleotide, either natural or synthetic, that is capable, upon forming a duplex with a polynucleotide template, of acting as a point of initiation of nucleic acid synthesis and being extended from its 3' end along the template so that an extended duplex is formed. The sequence of nucleotides added during the extension process is determined by the sequence of the template polynucleotide. Primers are extended by a DNA polymerase. Primers are generally of a length compatible with their use in synthesis of primer extension products, and are usually in the range of 8 to 200 nucleotides in length, such as 10 to 100 or 15 to 80 nucleotides in length. A primer may contain a 5’ tail that does not hybridize to the template.

Primers are usually single-stranded for maximum efficiency in amplification, but may alternatively be double-stranded or partially double-stranded. Also included in this definition are toehold exchange primers, as described in Zhang et al (Nature Chemistry 2012 4: 208-214), which is incorporated by reference herein.

Thus, a “primer” is complementary to a template, and complexes by hydrogen bonding or hybridization with the template to give a primer/template complex for initiation of synthesis by a polymerase, which is extended by the addition of covalently bonded bases linked at its 3' end complementary to the template in the process of DNA synthesis.

The term “hybridization” or “hybridizes” refers to a process in which a region of a nucleic acid strand anneals to and forms a stable duplex, either a homoduplex or a heteroduplex, under normal hybridization conditions with a second complementary nucleic acid strand, and does not form a stable duplex with unrelated nucleic acid molecules under the same normal hybridization conditions. The formation of a duplex is accomplished by annealing two complementary nucleic acid strand regions in a hybridization reaction. The hybridization reaction can be made to be highly specific by adjustment of the hybridization conditions under which the hybridization reaction takes place, such that two nucleic acid strands will not form a stable duplex, e.g., a duplex that retains a region of double-strandedness under normal stringency conditions, unless the two nucleic acid strands contain a certain number of nucleotides in specific sequences which are substantially or completely complementary. “Normal hybridization or normal stringency conditions” are readily determined for any given hybridization reaction. See, for example, Ausubel et al., Current Protocols in Molecular Biology, John Wiley & Sons, Inc., New York, or Sambrook et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press. As used herein, the term hybridizing or hybridization refers to any process by which a strand of nucleic acid binds with a complementary strand through base pairing.

A nucleic acid is considered to be “selectively hybridizable” to a reference nucleic acid sequence if the two sequences specifically hybridize to one another under moderate to high stringency hybridization conditions. Moderate and high stringency hybridization conditions are known (see, e.g., Ausubel, et al., Short Protocols in Molecular Biology, 3rd ed., Wiley & Sons 1995 and Sambrook et al., Molecular Cloning: A Laboratory Manual, Third Edition, 2001 Cold Spring Harbor, N.Y.).

The term “duplex,” or “duplexed,” as used herein, describes two complementary polynucleotide region that are base-paired, i.e., hybridized together.

“Genetic locus,” “locus,”, "locus of interest", “region” or “segment” in reference to a genome or target polynucleotide, means a contiguous sub-region or segment of the genome or target polynucleotide. As used herein, genetic locus, locus, or locus of interest may refer to the position of a nucleotide, a gene or a portion of a gene in a genome or it may refer to any contiguous portion of genomic sequence whether or not it is within, or associated with, a gene, e.g., a coding sequence. A genetic locus, locus, or locus of interest can be from a single nucleotide to a segment of a few hundred or a few thousand nucleotides in length or more. In general, a locus of interest will have a reference sequence associated with it (see description of "reference sequence" below).

The terms “plurality”, “population” and “collection” are used interchangeably to refer to something that contains at least 2 members. In certain cases, a plurality, population or collection may have at least 5, at least 10, at least 100, at least 1,000, at least 10,000, at least 100,000, at least 10^6, at least 10^7, at least 10^8 or at least 10^9 or more members.

The term “sample identifier sequence”, “sample index”, “multiplex identifier” or “MID” is a sequence of nucleotides that is appended to a target polynucleotide, where the sequence identifies the source of the target polynucleotide (i.e., the sample from which the target polynucleotide is derived). In use, each sample is tagged with a different sample identifier sequence (e.g., one sequence is appended to each sample, where the different samples are appended to different sequences), and the tagged samples are pooled. After the pooled sample is sequenced, the sample identifier sequence can be used to identify the source of the sequences. A sample identifier sequence may be added to the 5’ end of a polynucleotide or the 3’ end of a polynucleotide. In certain cases, some of the sample identifier sequence may be at the 5’ end of a polynucleotide and the remainder of the sample identifier sequence may be at the 3’ end of the polynucleotide. When elements of the sample identifier have sequence at each end, together, the 3’ and 5’ sample identifier sequences identify the sample. In many examples, the sample identifier sequence is only a subset of the bases which are appended to a target oligonucleotide. An identifier sequence can be appended to a polynucleotide by ligation or by primer extension. In the latter embodiments, the identifier sequence may be in the 5’ tail of the primer used for primer extension. In such embodiments the target polynucleotide is a copy of the original target polynucleotide.

The term “aliquot identifier sequence” refers to an appended sequence that allows sequence reads from different aliquots to be distinguished from one another. Aliquot identifier sequences work in the same way as sample identifier sequences described above, except that they are used on aliquots of a sample, rather than different samples. A single sequence may serve as a sample identifier and an aliquot identifier.
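As an illustration of how identifier sequences can be used after pooling, the following minimal Python sketch assigns reads to their sample (or aliquot) of origin. It assumes a fixed-length identifier at the 5’ end of each read and an exact-match lookup; the function name `demultiplex`, the `"unassigned"` bin and the example sequences are hypothetical and are not taken from the specification, which does not prescribe a particular demultiplexing procedure.

```python
from collections import defaultdict


def demultiplex(reads, index_to_source, index_length):
    """Group reads by the identifier found in their first `index_length` bases.

    reads:           iterable of read strings (identifier + insert)
    index_to_source: dict mapping identifier sequence -> sample or aliquot name
    Returns a dict mapping source name -> list of inserts (identifier removed).
    Reads whose identifier is not recognised are kept whole under "unassigned".
    """
    by_source = defaultdict(list)
    for read in reads:
        index, insert = read[:index_length], read[index_length:]
        source = index_to_source.get(index)
        if source is None:
            by_source["unassigned"].append(read)
        else:
            by_source[source].append(insert)
    return dict(by_source)


# Hypothetical 4-base identifiers for two aliquots of one sample.
pooled_reads = ["ACGTTTGACC", "TGCAGGATCC", "NNNNAAAA"]
identifiers = {"ACGT": "aliquot_1", "TGCA": "aliquot_2"}
assigned = demultiplex(pooled_reads, identifiers, index_length=4)
```

In practice an identifier may be split between the two ends of a molecule, as noted above, in which case both parts would need to be read before the lookup.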

The term “variable”, in the context of two or more nucleic acid sequences that are variable, refers to two or more nucleic acids that have different sequences of nucleotides relative to one another. In other words, if the polynucleotides of a population have a variable sequence, then the nucleotide sequence of the polynucleotide molecules of the population may vary from molecule to molecule. The term “variable” is not to be read to require that every molecule in a population has a different sequence to the other molecules in a population.

The term “substantially identical” refers to sequences that are near-duplicates as measured by a similarity function, including but not limited to a Hamming distance, Levenshtein distance, Jaccard distance, cosine distance, etc. (see, generally, Kemena et al, Bioinformatics 2009 25: 2455-65). The exact threshold depends on the error rate of the sample preparation and sequencing used to perform the analysis, with higher error rates requiring lower thresholds of similarity. In certain cases, substantially identical sequences have at least 98% or at least 99% sequence identity.
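By way of illustration only, the sketch below applies one of the similarity functions named above (Hamming distance) to two equal-length sequences and compares the resulting percent identity with a 98% threshold. The function names and the example sequences are hypothetical, and the restriction to equal-length sequences is a simplification; an edit distance such as Levenshtein would be needed where indels are present.

```python
def hamming_distance(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def percent_identity(a, b):
    """Percentage of positions at which the two sequences agree."""
    return 100.0 * (len(a) - hamming_distance(a, b)) / len(a)


def is_near_duplicate(a, b, min_identity=98.0):
    """True if identity meets the threshold (98% is one value given in the text)."""
    return percent_identity(a, b) >= min_identity


reference = "ACGT" * 25                                 # 100 bases
one_mismatch = "T" + reference[1:]                      # 99% identity
three_mismatches = "TTT" + reference[3:]                # 97% identity
```

With these inputs, `is_near_duplicate(reference, one_mismatch)` is true and `is_near_duplicate(reference, three_mismatches)` is false.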

The term “sequence variation”, as used herein, is a variant that is different to a reference sequence, such as a reference genome or a sequence from a patient sample not anticipated to contain somatic variants, such as a buccal swab. In many instances a “sequence variation” is a variant that is present at a frequency of less than 50%, relative to other molecules in the sample. Many sequence variations, e.g., indels and nucleotide substitutions, are substantially identical to the molecules that do not contain the sequence variation. In some cases, a particular sequence variation may be present in a sample at a frequency of less than 20%, less than 10%, less than 5%, less than 1%, less than 0.5%, less than 0.1%, less than 0.05% or less than 0.01%. The term “nucleic acid template” is intended to refer to the initial nucleic acid molecule that is copied during amplification. Copying in this context can include the formation of the complement of a particular single-stranded nucleic acid. The “initial” nucleic acid can comprise nucleic acids that have already been processed, e.g., amplified, extended, labeled with adaptors, etc.

The term “tailed”, in the context of a tailed primer or a primer that has a 5’ tail, refers to a primer that has a region (e.g., a region of at least 12-50 nucleotides) at its 5’ end that does not hybridize or partially hybridizes to the same target as the 3’ end of the primer.

The term “initial template” refers to a sample that contains a target sequence to be amplified. The term “amplifying” as used herein refers to generating one or more copies of a target nucleic acid, using the target nucleic acid as a template.

The term “amplicon” as used herein refers to the product (or “band”) amplified by a particular pair of primers in a PCR reaction.

The term “replicate amplicon” as used herein refers to the same amplicon amplified using different portions or aliquots of a sample. Replicate amplicons typically have near identical sequences, except for sequence variations in the template, PCR errors, and differences in the sequences of the primers used for each aliquot (e.g., differences in the 5’ ends of the primers such as in the aliquot identifier sequence, etc.).

A “polymerase chain reaction” or “PCR” is an enzymatic reaction in which a specific template DNA is amplified using one or more pairs of sequence specific primers.

“PCR conditions” are the conditions in which PCR is performed, and include the presence of reagents (e.g., nucleotides, buffer, polymerase, etc.) as well as temperature cycling (e.g., through cycles of temperatures suitable for denaturation, renaturation and extension), as is known in the art.

A “multiplex polymerase chain reaction” or “multiplex PCR” is an enzymatic reaction that employs two or more primer pairs for different target templates. If the target templates are present in the reaction, a multiplex polymerase chain reaction results in two or more amplified DNA products that are co-amplified in a single reaction using a corresponding number of sequence-specific primer pairs.

The term “next generation sequencing” refers to the so-called highly parallelized methods of performing nucleic acid sequencing and comprises the sequencing-by-synthesis or sequencing-by-ligation platforms currently employed by Illumina, Life Technologies, Pacific Biosciences and Roche, etc. Next generation sequencing methods may also include, but not be limited to, nanopore sequencing methods such as offered by Oxford Nanopore or electronic detection-based methods such as the Ion Torrent technology commercialized by Life Technologies.

The term “sequence read” refers to the output of a sequencer. A sequence read typically contains a string of Gs, As, Ts and Cs, of 50-1000 or more bases in length and, in many cases, each base of a sequence read may be associated with a score indicating the quality of the base call.

The terms “assessing the presence of” and “evaluating the presence of” include any form of measurement, including determining if an element is present and estimating the amount of the element. The terms “determining”, “measuring”, “evaluating”, “assessing” and “assaying” are used interchangeably and include quantitative and qualitative determinations. Assessing may be relative or absolute. “Assessing the presence of” includes determining the amount of something present, and/or determining whether it is present or absent.

If two nucleic acids are “complementary,” they hybridize with one another under high stringency conditions. The term “perfectly complementary” is used to describe a duplex in which each base of one of the nucleic acids base pairs with a complementary nucleotide in the other nucleic acid. In many cases, two sequences that are complementary have at least 10, e.g., at least 12 or 15 nucleotides of complementarity.

An “oligonucleotide binding site” refers to a site to which an oligonucleotide hybridizes in a target polynucleotide. If an oligonucleotide “provides” a binding site for a primer, then the primer may hybridize to that oligonucleotide or its complement.

The term “strand” as used herein refers to a nucleic acid made up of nucleotides covalently linked together, e.g., by phosphodiester bonds. In a cell, DNA usually exists in a double-stranded form, and as such, has two complementary strands of nucleic acid referred to herein as the “top” and “bottom” strands. In certain cases, complementary strands of a chromosomal region may be referred to as “plus” and “minus” strands, the “first” and “second” strands, the “coding” and “noncoding” strands, the “Watson” and “Crick” strands or the “sense” and “antisense” strands. The assignment of a strand as being a top or bottom strand is arbitrary and does not imply any particular orientation, function or structure. The nucleotide sequences of the first strand of several exemplary mammalian chromosomal regions (e.g., BACs, assemblies, chromosomes, etc.) are known, and may be found in NCBI’s Genbank database, for example. The term “extending”, as used herein, refers to the extension of a primer by the addition of nucleotides using a polymerase. If a primer that is annealed to a nucleic acid is extended, the nucleic acid acts as a template for the extension reaction.

The term “sequencing,” as used herein, refers to a method by which the identity of at least 10 consecutive nucleotides (e.g., the identity of at least 20, at least 50, at least 100 or at least 200 or more consecutive nucleotides) of a polynucleotide is obtained.

The term “pooling”, as used herein, refers to the combining, e.g., mixing, of two or more samples or aliquots of a sample such that the molecules within those samples or aliquots become interspersed with one another in solution.

The term “pooled sample”, as used herein, refers to the product of pooling.

The term “portion”, as used herein in the context of different portions of the same sample, refers to an aliquot or part of a sample. For example, if one microliter of a 100 µl sample is added to each of 10 different PCR reactions, then those reactions each contain different portions of the same sample.

As used herein, the term “cell-free DNA” (“cfDNA”) refers to DNA that is free in a bodily fluid rather than contained in cells. cfDNA can be isolated from blood plasma, blood serum, cerebrospinal fluid, urine, saliva, or stool, for example. “Cell-free DNA from the bloodstream” and “circulating cell-free DNA” refer to DNA that is circulating in the peripheral blood of a patient. The DNA molecules in cell-free DNA may have a median size that is below 1 kb (e.g., in the range of 50 bp to 500 bp, 80 bp to 400 bp, or 100 bp to 1,000 bp), although fragments having a median size outside of this range may be present. Cell-free DNA may contain tumor DNA (tDNA), e.g., tumor DNA circulating freely in the blood of a cancer patient. cfDNA can be obtained by centrifuging the sample to remove all cells, and then isolating the DNA from the remaining liquid (e.g., plasma or serum). Such methods are well known (see, e.g., Lo et al, Am J Hum Genet 1998; 62:768-75). Circulating cell-free DNA can be double-stranded or single-stranded. This term is intended to encompass free DNA molecules that are circulating in the bloodstream as well as DNA molecules that are present in extra-cellular vesicles (such as exosomes) that are circulating in the bloodstream.

As used herein, the term “tumor DNA” (or “tDNA”) is tumor-derived DNA. tDNA can be identified because it contains mutations. tDNA can be isolated directly from a tissue biopsy, from circulating tumor cells (CTCs), from other cells that are no longer part of the tumor tissue but are not circulating, such as those in urine or stool samples, or it may be part of (a “fraction of”) the cfDNA of a patient. tDNA includes both clonal and sub-clonal mutations. In the evolution of a tumor, there is a transition between clonal and sub-clonal mutations. Sub-clonal mutations are only present in a subset of cells in the tumor: these occur after the most recent common ancestor of all cancer cells in the tumor sample. In contrast, clonal mutations occurred before the most recent common ancestor of all cancer cells. Clonal mutations are therefore present in all cells in the tumor unless there is some mechanism that has removed the mutation, e.g., a structural variation, in which case the entire locus will be lost in a subset of cells. ctDNA is of tumor origin and originates directly from the tumor or from circulating tumor cells (CTCs), which are viable, intact tumor cells that shed from primary tumors and can enter the bloodstream or lymphatic system. The precise mechanism of how ctDNA is released is unclear, although it is postulated to involve apoptosis and necrosis from dying cells, or active release from viable tumor cells. Circulating tDNA (ctDNA) can be highly fragmented and in some cases can have a mean fragment size of about 100-250 bp, e.g., 150 to 200 bp long. The amount of ctDNA in a sample of circulating cell-free DNA isolated from a cancer patient varies greatly: typical samples contain less than 10% ctDNA, although many samples from patients being assessed for MRD may have less than 0.01% ctDNA and some samples have over 10% ctDNA. Molecules of ctDNA can often be identified because they contain tumorigenic mutations.

As used herein, the term “sequence variation” refers to the combination of a position and type of a sequence alteration. For example, a sequence variation can be referred to by the position of the variation and which type of substitution (e.g., G to A, G to T, G to C, A to G, etc. or insertion/deletion of a G, A, T or C, etc.) is present at the position. A sequence variation may be a substitution, deletion, insertion or rearrangement of one or more nucleotides. In the context of the present method, a sequence variation can be generated by, e.g., a PCR error, an error in sequencing or a genetic variation.

As used herein, the term “genetic variation” refers to a variation (e.g., a nucleotide substitution, an indel or a rearrangement) that is present or deemed as being likely to be present in a nucleic acid sample. A genetic variation can be from any source. For example, a genetic variation can be generated by a mutation (e.g., a somatic mutation), or it can be germ line such as in an organ transplant or pregnancy. If a sequence variation is called as a genetic variation, the call indicates that the sample likely contains the variation; in some cases a “call” can be incorrect. In many cases, the term “genetic variation” can be replaced by the term “mutation”. For example, if the method is being used to detect sequence variations that are associated with cancer or other diseases that are caused by mutations, then “genetic variation” can be replaced by the term “mutation”. As used herein, depending on the context, the term “calling” can mean indicating whether a particular genetic variation is present in a sequence, whether a sample contains a genetic variation or whether a sample contains cancer DNA.

As used herein, the term “threshold” refers to a level of evidence (e.g., a ratio) that is required to make a call.

As used herein, the term “value” refers to a number, letter, word (e.g., “high”, “medium” or “low”) or descriptor (e.g., “+++” or “++”) that can indicate the strength of evidence. A value can contain one component (e.g., a single number) or more than one component, depending on how a value is analyzed.

As used herein, the term “aliquot” refers to a portion of a sample. For example, if three volumes are independently removed from the same sample, each of the volumes can be referred to as an aliquot. Aliquots do not need to be the same volume.

As used herein, the term “cancer-associated cells” means cells that are part of or genetically related to the cells of a patient’s cancer. Cancer-associated cells can be part of a blood/haematological cancer or a solid tumor. The presence of cancer-associated cells in a patient may be a sign that not all cancer cells were removed or killed during treatment. The cancer-associated cells have substantially the same somatic mutations as the cells of the patient’s cancer and, in some cases, may be progeny of one or more cells of a cancer. Cancer-associated cells may result from minimal residual disease or they could be generated by incomplete removal of a tumor, incomplete treatment, cancer recurrence or relapse at a primary or distal site and/or tumor metastasis (including micrometastasis).

As used herein, the term “sequence variation associated with (or present within) the patient’s cancer” is intended to mean a somatic mutation that is in the genome of cells of the patient’s cancer or was in the genome of cells of the patient’s cancer prior to any cancer treatment. It can also mean epigenetic changes present within a cancer sample.

As used herein, the term “minimal residual disease” (MRD) refers to the presence of cancer cells following a treatment with curative intent. MRD may also be referred to as “molecular residual disease” or “residual disease” in some publications.

As used herein, the term “detecting recurrence” refers to detecting the recurrence of a tumor through the identification of mutant DNA. In this context, the term “early detection” refers to the detection of mutant DNA before tumor recurrence can be reliably detected through conventional standard-of-care/surveillance monitoring methods such as radiological imaging etc. This may be achieved for example by monitoring serially collected blood samples at a plurality of time points for the presence of ctDNA in cfDNA, as described below.

The term “cancer” is used herein to refer to any disease characterized by uncontrolled cell division. A cancer can be a cancer of the blood (i.e., haematological cancer), e.g., leukemia, lymphoma, or multiple myeloma, or a cancer can be neoplastic, e.g., associated with an abnormal mass of tissue in which cells grow and divide more than they should or do not die when they should. Neoplastic cancers, e.g., lung, breast or liver cancer, are associated with a solid tumor.

The term “cancer DNA” refers to DNA that is from cancerous cells. Cancer DNA may be present in DNA isolated from a population of cells that are isolated from lymph, bone marrow or the circulating blood of a patient, if the patient has a blood cancer. Cancer DNA from a solid tumor can be found in cfDNA, in which case it is referred to as tDNA or ctDNA.

The terms “error probability distribution” and “error probability distribution model” refer to a distribution that estimates or models the probability that an observation (typically a variant allele fraction) is due to error. These terms capture both “high signal background events” (which may be due to DNA damage or very early cycle PCR errors) and “estimated background error rate” (which includes sequencer and PCR polymerase “errors”). Examples of such distributions are shown in Figs. 13A and B.
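To make the idea of an error probability distribution concrete, the following minimal sketch computes the probability that a given number of variant reads would arise from error alone. It assumes a binomial model with a single per-read error rate, which is one of the forms the specification mentions as an example elsewhere; the function names and the numeric inputs are hypothetical, and a real model would be calibrated on DNA that does not contain the sequence variation.

```python
import math


def binom_logpmf(k, n, p):
    """Log of the binomial probability of exactly k successes in n trials (0 < p < 1)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def prob_due_to_error(variant_reads, total_reads, error_rate):
    """P(X >= variant_reads) for X ~ Binomial(total_reads, error_rate).

    A small value means the observed variant read count is hard to explain
    by the modelled error alone.
    """
    if variant_reads <= 0:
        return 1.0
    if variant_reads > total_reads:
        return 0.0
    tail = sum(math.exp(binom_logpmf(i, total_reads, error_rate))
               for i in range(variant_reads, total_reads + 1))
    return min(1.0, tail)


# Hypothetical target region: 10,000 reads and a per-read error rate of 1e-4.
p_one_read = prob_due_to_error(1, 10_000, 1e-4)    # one variant read is unremarkable
p_ten_reads = prob_due_to_error(10, 10_000, 1e-4)  # ten variant reads are very unlikely
```

The upper tail is summed directly rather than computed as one minus the lower tail, so that very small probabilities are not lost to rounding.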

The term “collective” in the context of analyzing “collective results” means the results for all of the variants and aliquots (excluding any statistical outliers or other variants excluded, for example, because they are not present in the tumor DNA or are present in buffy coat DNA), not just a positive result.

Other definitions of terms may appear throughout the specification. It is further noted that the claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely”, “only” and the like in connection with the recitation of claim elements, or the use of a “negative” limitation.

DETAILED DESCRIPTION

Before the present invention is described in greater detail, it is to be understood that this invention is not limited to particular embodiments described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.

Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range and any other stated or intervening value in that stated range is encompassed within the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present invention, the preferred methods and materials are now described.

All publications and patents cited in this specification are herein incorporated by reference as if each individual publication or patent were specifically and individually indicated to be incorporated by reference and are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited. The citation of any publication is for its disclosure prior to the filing date and should not be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may be different from the actual publication dates which may need to be independently confirmed.

It must be noted that as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise. It is further noted that the claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely,” “only” and the like in connection with the recitation of claim elements, or use of a “negative” limitation.

As will be apparent to those of skill in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present invention. Any recited method can be carried out in the order of events recited or in any other order which is logically possible.

As may be apparent, each assay assessing multiple aliquots for two or more target regions may have a different lower limit at which it can reliably detect cancer DNA, sometimes referred to as Limit of Detection or LOD. It may also have a different limit at which amounts of cancer DNA can be accurately quantified, sometimes referred to as Limit of Quantification or LOQ. For such an assay to be most useful, in some cases it may be valuable to obtain an accurate estimate of either or both of the LOD or LOQ. Such an estimate can be obtained by combining factors which may include clonality, mappability, estimated error rate, estimated rate of high signal background events, presence within a region of copy number gain or amplification for each sequence variation associated with the patient’s cancer that is targeted. It may also include library preparation and sequencing run specific factors which may include: the number of aliquots, the total number of sequencing reads for the targeted regions and the number of molecules input into each aliquot.

As noted above, a method for detecting cancer DNA in a test sample of DNA from a patient (e.g., a cancer patient) is provided. In some embodiments, the method may comprise sequencing multiple aliquots of the test sample (e.g., at least 2, at least 3, at least 4, at least 5 or at least 6 aliquots of the sample) to produce, for each aliquot, sequence reads corresponding to two or more target regions (e.g., at least three, at least 5, at least 10, at least 20, at least 50, at least 100, at least 1000 or at least 5000 target regions) that each have a sequence variation present within the patient’s cancer. For example, the method may involve sequencing 3-10 aliquots of the test DNA sample to produce, for each aliquot, sequence reads corresponding to 8-100 target regions. In very general terms, sensitivity can be increased by increasing the number of aliquots, by increasing the number of variants, or by increasing the number of aliquots and variants. For example, in some embodiments the method may comprise sequencing at least two (e.g., three or four) aliquots of the test sample to produce, for each aliquot, sequence reads corresponding to ten or more target regions that each have a sequence variation. In other embodiments the method may comprise sequencing at least ten aliquots of the test sample to produce, for each aliquot, sequence reads corresponding to two (e.g., three or four) or more target regions that each have a sequence variation. Indeed, the method can be performed using a single aliquot if a sufficient number of sequence variations are analyzed.

This method may comprise: (a) sequencing multiple aliquots of the test sample to produce, for each aliquot, sequence reads corresponding to two or more target regions that each have a sequence variation present within the patient’s cancer; (b) for each aliquot, for each target region: i. determining the number of sequence reads that have the sequence variation; ii. determining the total number of sequence reads; and iii. comparing i. and ii. to one or more error probability distribution models for the sequence variation, wherein the one or more models are obtained from DNA that does not contain the sequence variation; and (c) integrating the collective results of step (b) to determine if there is cancer DNA in the test sample.
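Steps (b) and (c) can be sketched as follows. This is a minimal illustration rather than the claimed implementation: it assumes a binomial error model per target region, integrates the per-aliquot, per-target results by summing log-likelihood ratios over a grid of candidate variant fractions, and compares the best total with a threshold. The function names, the additive signal model (`error rate + fraction`), the grid and the threshold value are all assumptions made for the example.

```python
import math


def log_binom_kernel(k, n, p):
    """Binomial log-likelihood without the combinatorial term (it cancels in ratios)."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return k * math.log(p) + (n - k) * math.log1p(-p)


def integrate_results(counts, error_rates, candidate_fractions):
    """Sum evidence over every aliquot and target region.

    counts:      {aliquot: {target: (variant_reads, total_reads)}}  (steps b.i and b.ii)
    error_rates: {target: per-read error rate from variant-free DNA} (step b.iii)
    Returns (best log-likelihood ratio, variant fraction giving it); (0.0, 0.0)
    if no candidate fraction explains the data better than error alone.
    """
    best_llr, best_fraction = 0.0, 0.0
    for fraction in candidate_fractions:
        llr = 0.0
        for per_target in counts.values():
            for target, (variant_reads, total_reads) in per_target.items():
                e = error_rates[target]
                p_signal = min(1.0, e + fraction)  # assumed: signal adds to error
                llr += (log_binom_kernel(variant_reads, total_reads, p_signal)
                        - log_binom_kernel(variant_reads, total_reads, e))
        if llr > best_llr:
            best_llr, best_fraction = llr, fraction
    return best_llr, best_fraction


def cancer_dna_detected(counts, error_rates, candidate_fractions, threshold):
    """Step (c): call the sample from the integrated evidence (hypothetical threshold)."""
    llr, fraction = integrate_results(counts, error_rates, candidate_fractions)
    return llr >= threshold, llr, fraction


# Hypothetical data: two aliquots, two target regions, 10,000 reads each.
error_rates = {"T1": 1e-4, "T2": 1e-4}
grid = [1e-4, 1e-3, 1e-2]
positive = {a: {"T1": (12, 10_000), "T2": (12, 10_000)} for a in ("A1", "A2")}
negative = {a: {"T1": (0, 10_000), "T2": (0, 10_000)} for a in ("A1", "A2")}
```

Because the evidence is summed over all aliquots and target regions, a sample can be called even when no single observation is decisive on its own, which is the motivation given above for increasing the number of aliquots or variants.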

In these embodiments, the different aliquots contain different portions of the same sample. As would be appreciated, different barcode sequences can be added to the different samples and the different samples can be pooled prior to sequencing.

Flow charts

Some of the workflow for the present method is illustrated in the accompanying flow charts (Figs. 1-10). These flow charts are believed to be largely self-explanatory.

Before describing the method in more detail, it is noted that the present method can be used to detect cancer DNA from both solid tumors and haematological cancers. Therefore, when this claim uses the term “cancer”, the term refers to blood cancers and solid tumors. For solid tumor embodiments, the method may identify cancer DNA (or, more accurately, tumor DNA) in cfDNA (e.g., circulating cfDNA). For blood cancer embodiments, the method may identify cancer DNA in DNA extracted from cells taken from bone marrow, lymph node, or circulating white blood cells, or in cfDNA. For example, in blood cancer embodiments, one could take a bone marrow aspirate from an AML patient (pre-treatment), find out the variants in their AML, then, following treatment, one could look at further bone marrow aspirates, cell-free DNA or urine to determine if the patient still has cancer DNA.

In addition, the nucleic acid analyzed in the method may be DNA or RNA. The present disclosure is written describing embodiments that make use of DNA (specifically ctDNA). However the method should also work when one uses RNA (or cDNA) made from the same.

In addition, while the present method is described in detail using examples that make use of “amplicon” sequencing, the present method may be readily applied to methods that make use of molecular barcodes or indexes, e.g., random sequences that are appended to the nucleic acid, pre-amplification. Molecular barcode sequences may vary widely in size and composition; the following references provide guidance for selecting sets of barcode sequences appropriate for particular embodiments: Casbon (Nuc. Acids Res. 2011, 22 e81), Brenner, U.S. Pat. No. 5,635,400; Brenner et al, Proc. Natl. Acad. Sci., 97: 1665-1670 (2000); Shoemaker et al, Nature Genetics, 14: 450-456 (1996); Morris et al, European patent publication 0799897A1; Wallace, U.S. Pat. No. 5,981,179; and the like. In particular embodiments, a barcode sequence may have a length in the range of from 2 to 36 nucleotides, or from 6 to 30 nucleotides, or from 8 to 20 nucleotides. For example, the aliquot-based sequencing may be done on DNA that has been indexed, and the number of molecules (or the probability of a molecule being present) can be estimated using index sequences in each aliquot.

It is noted that, in the pre-calibration method shown in Fig. 5, the types and classes of variants for which the error probability distributions are generated may vary. For example, the specific variant may be analyzed within the context of its surrounding sequence. This can be achieved by sequencing the target region using DNA not expected to contain the variant (e.g. DNA from a healthy donor) or by spiking in synthetic DNA/RNA for the target region that contains the wild type sequence and a barcode (outside of the variant region) enabling the separation of barcode and spike to the test reaction. In another example, the specific variant may be analyzed within the context of a class of variant. Classes of variants include: the same type of variant (e.g. an SNV such as A>T, an indel such as insertion of a TTTT, a doublet-base substitution such as CT>AA, etc.); a transition or transversion; the single nucleotide variant and 1 to 5 bases either 3', 5' or both (e.g. A>T where the A has a 5' TTCA (TTCAA>TTCAT), or A>T where the A has a 5' T and a 3' G (TAG>TTG)). Alternatively, variants may be grouped into classes as above but where some or all of the bases 3' and/or 5' of the variant may be one of multiple bases as described by the IUPAC degenerate nucleotide codes (e.g. A>T where the A has a 5' K and a 3' S (KAS>KTS), where K=G/T and S=C/G). In an alternative embodiment the local sequence context is explored by selecting a window of N 3' and/or 5' bases around the variant of interest, where N is between 1 and 100, and extracting different sequence descriptors such as the base change at each location, the type of base change at each position (e.g. transition or transversion), the distance from a primer end, and the distance from a repeat sequence; these are then combined together to predict a categorical error rate class (e.g. high, medium, low) or a numeric error rate value by using a heuristic combination score or a machine learning method (unsupervised or supervised). In a further alternative, the method is as one of the above, but a penalty score is assigned in the form of a multiplicative factor to the estimated error rate of a variant in proximity of pre-defined sequence features, such as mono-nucleotide repeats, repeat regions, or similar. This analysis can be done by sequencing DNA not expected to contain the classes of variants (e.g. DNA from a healthy donor). In this embodiment, enough regions must be targeted and sequenced so that each variant class is represented at least once (and ideally more, e.g. 10 times or 50 times or 100 times).
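The variant classes described above can be illustrated with a short sketch that derives a context class for a single nucleotide variant, labels it as a transition or transversion, and tests it against a degenerate class written with IUPAC codes. The function names, the 0-based position convention and the example reference sequences are hypothetical; only the class notation (e.g. TAG>TTG, KAS>KTS) follows the text.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
PURINES = {"A", "G"}


def context_class(reference, position, alt, flank5=1, flank3=1):
    """Class label such as 'TAG>TTG' for an SNV at a 0-based position."""
    start, end = position - flank5, position + 1 + flank3
    if start < 0 or end > len(reference):
        raise ValueError("not enough flanking sequence for the requested context")
    left, ref, right = reference[start:position], reference[position], reference[position + 1:end]
    return f"{left}{ref}{right}>{left}{alt}{right}"


def substitution_type(ref, alt):
    """'transition' (purine<->purine or pyrimidine<->pyrimidine), else 'transversion'."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    return "transition" if (ref in PURINES) == (alt in PURINES) else "transversion"


def matches_class(context, pattern):
    """True if a concrete context fits a degenerate pattern such as 'KAS>KTS'."""
    if len(context) != len(pattern):
        return False
    return all(base in IUPAC.get(code, code) for base, code in zip(context, pattern))


ctx = context_class("CCTAGCC", 3, "T")                          # 'TAG>TTG'
ctx5 = context_class("TTCAA", 4, "T", flank5=4, flank3=0)       # 'TTCAA>TTCAT'
in_class = matches_class(ctx, "KAS>KTS")                        # T is a K, G is an S
```

A calibration run could then pool error observations by such class labels, so that a variant with few direct control observations borrows information from other sites in the same class.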

In addition, the number and type of error probability distributions may vary. In some versions, for each variant (or class) there is a single distribution for all errors. In other embodiments, there are multiple distributions separating the different types of error. In some embodiments there are two error distributions for each variant, one of which is for the “estimated background error rate”. These are typically sequencing errors and PCR errors that happen later in library preparation (e.g. after the first few cycles of PCR). Then there are events that happen much less frequently but, when they do, occur at much higher levels and typically at a similar level (in terms of variant allele frequency) to real variants in a sample. These “high signal background events” include things such as DNA damage and polymerase errors in the first few cycles of library preparation or pre-amplification. These can be captured by a second distribution (e.g. one binomial distribution for the estimated background error rate and one for the high signal background events). In some embodiments, a different type of distribution is used for the estimated background error rate and the high signal background events (e.g. a beta distribution for the estimated background error rate and a binomial distribution for the high signal background events).
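One way to picture the two-distribution arrangement is as a mixture: most observations follow the background error distribution, and a small proportion follow a second distribution centred at a much higher level. The sketch below assumes two binomial components and a mixing weight equal to the rate of high signal background events; the mixture form, the function names and the numeric values are assumptions made for illustration, not the specified model.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of exactly k variant reads in n reads (0 < p < 1)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def error_pmf(k, n, background_rate, event_rate, event_level):
    """Probability of k variant reads arising from error under the two-part model.

    background_rate: per-read rate of late PCR and sequencing errors
    event_rate:      proportion of observations hit by a high signal background event
    event_level:     variant allele fraction such an event typically produces
    """
    return ((1.0 - event_rate) * binom_pmf(k, n, background_rate)
            + event_rate * binom_pmf(k, n, event_level))


def error_tail(k, n, background_rate, event_rate, event_level):
    """P(X >= k) under the two-part error model."""
    tail = sum(error_pmf(i, n, background_rate, event_rate, event_level)
               for i in range(max(k, 0), n + 1))
    return min(1.0, tail)


# Hypothetical values: 2,000 reads, 10 variant reads observed.
background_only = error_tail(10, 2000, 1e-4, 0.0, 5e-3)
with_events = error_tail(10, 2000, 1e-4, 0.01, 5e-3)
```

With these inputs the background component alone makes ten variant reads look essentially impossible, whereas allowing for rare high signal events leaves a probability of roughly half a percent, which shows why such events matter when a single aliquot and target region is considered in isolation.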

In some embodiments, for each variant, the same variant class (e.g. 2 bp 3' and 2 bp 5') is used for both distributions. However, as the two different distributions are sometimes the outcome of different error processes (e.g. DNA damage and PCR error), in some embodiments a different variant class is used for each of the two distributions for each variant.

The control material and methods for producing the distribution or distributions may also vary. For example, the probability distribution can be generated in the same library preparation and run as the test sample, in advance using control DNA, or in advance then adjusted using all bases other than the bases expected to contain variants when assessing the test sample(s).

In all cases the same sequencing process (including library prep, sequencer) and optimally the same sample type and extraction method (e.g. cfDNA extracted from blood drawn into a cfDNA blood collection tube) should be used to generate the model(s).

In some cases a different model is produced for a range of different DNA inputs and the test sample is analysed using the model with the best matched DNA input. For example, a maximum, minimum and median DNA input for each aliquot can be defined then a distribution or distributions obtained for all three for all the classes of variants tested for. When a test sample is assessed it is compared to the distribution whose DNA input is the closest match.

Optimally there would be tens, hundreds or thousands of samples tested to build the model. The distribution can be stored in a database and/or be downloaded from a public database.

In some embodiments (e.g., as shown in Fig. 8), the amount of cancer DNA may be quantified using the method. In these embodiments, one may determine the amount of cancer DNA in the test sample, a range of likely amounts in the test sample or an estimated tumor fraction using one or a combination of: a mean or median variant allele fraction (across the variants and aliquots); a corrected mean or median variant allele fraction (generated by subtracting a pre-determined offset or baseline error rate); maximum likelihood (testing a range of levels and determining the most likely); a grid-based or expectation maximisation search method to select the tumor fraction giving the maximum likelihood; a Bayesian posterior; or summing the number of estimated variant molecules for each variant (and optionally each aliquot). In another embodiment the amount of cancer DNA may be determined by counting the number of variant positive target regions (target regions above a threshold) in each aliquot, comparing this against the total number of target regions multiplied by aliquots, and quantifying the mean number of variant containing target sequences per target region per aliquot by applying a Poisson correction to the fraction of positive results. In some embodiments, the rate of high signal background events estimated for the entire set of variants may also be used in the Poisson correction in order to give more accurate quantification.
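The Poisson correction mentioned above can be sketched as follows. This is a minimal illustration that treats each target region in each aliquot as one partition, as in digital PCR, and assumes that a partition is negative only if it contains no variant molecule and no high signal background event. The function name and the `high_signal_rate` parameter are hypothetical.

```python
import math

def poisson_corrected_mean(n_positive, n_total, high_signal_rate=0.0):
    """Mean number of variant molecules per target region per aliquot.

    n_positive: number of (target region, aliquot) partitions called positive.
    n_total:    number of target regions multiplied by number of aliquots.
    high_signal_rate: assumed per-partition probability of a high signal
                      background event (0.0 gives the plain correction).
    """
    if n_total <= 0 or not 0 <= n_positive <= n_total:
        raise ValueError("need n_total > 0 and 0 <= n_positive <= n_total")
    if not 0.0 <= high_signal_rate < 1.0:
        raise ValueError("high_signal_rate must be in [0, 1)")
    negative_fraction = 1.0 - n_positive / n_total
    if negative_fraction <= 0.0:
        raise ValueError("every partition is positive; the mean is not estimable")
    # Model: P(negative) = (1 - high_signal_rate) * exp(-mean)
    mean = -math.log(negative_fraction / (1.0 - high_signal_rate))
    return max(mean, 0.0)
```

For example, with 4 aliquots and 48 target regions there are 192 partitions; if half are positive and no background events are assumed, the estimated mean is ln 2, about 0.69 variant molecules per partition.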

General methodology

In some embodiments, the method comprises: (a) sequencing multiple aliquots of the test sample to produce, for each aliquot, sequence reads corresponding to two or more target regions that each have a sequence variation present within the patient’s cancer; (b) for each aliquot, for each target region: deriving an estimate of the number of molecules that have the sequence variation, calculating the probability that there is at least one molecule that has the sequence variation, or determining if the frequency of sequence reads of (a) that have the sequence variation compared to the total number of sequence reads is above a threshold; and (c) determining if there is cancer DNA in the test sample using the estimates, probabilities or frequencies of step (b). In some embodiments, step (b) may be done by a thresholding approach, described below and, in alternative embodiments, step (a) can be done without aliquoting as long as there are a sufficient number of target regions. In some embodiments, for each aliquot and target region, the number of molecules that have the sequence variation in the test sample or the probability that there is at least one molecule that has the sequence variation is estimated in (b) using: (i) the number of sequence reads of (a) that have the sequence variation; (ii) the total number of sequence reads of (a); and (iii) the estimated background error rate for the sequence variation. The background error rate of (iii) may be expressed as an error probability distribution. In addition, the probability that there is at least one molecule that has the sequence variation may be estimated using the number of molecules input into each aliquot of (a).
The estimated background error rate of (iii) is estimated by any convenient method, e.g., from prior sequencing reactions or publicly available information, from prior sequencing reactions adjusted using data for control bases obtained in step (a), and/or from the current sequencing reaction, excluding the variant of interest. For example, the estimated background error rate of (iii) may be estimated by analysis of control sequencing reads produced in step (a).

In any embodiment, the background error rate can be estimated using a probability distribution. In some embodiments, there may be two distributions of the same family (e.g. 2 binomial distributions) or, if two different families are used, there may be one distribution for the background error rate and another for the estimated rate of high signal background events. As noted above, in any embodiment, the estimate is a probability distribution over the number of variant molecules present.

In any embodiment, (c) may be done by calculating a likelihood ratio between the likelihood of observing the estimates in (b) in samples: (i) if cancer DNA is present and (ii) if cancer DNA is not present. Along similar lines, in any embodiment (c) may be done by calculating a likelihood ratio (LRi) between the likelihood of observing the estimates in (b) for each target region and aliquot: (i) if cancer DNA is present and (ii) if cancer DNA is not present. In these embodiments, the individual likelihood ratios LRi may be combined into a cumulative LR score (the product of the LRi, equivalent to the sum of log-likelihood ratios) across all regions and aliquots of a sample. In these embodiments, the likelihood of observing the estimates of (b) if there is cancer DNA in the test sample may be calculated based on: (i) the estimates or probabilities of step (b); and optionally (ii) an estimate of the cancer DNA fraction in the test sample. Likewise, the likelihood of observing the estimates of (b) if there is no cancer DNA in the test sample may be calculated based on: (i) the estimates or probabilities of step (b); and (ii) the estimated rate of high signal background events.
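The combination of per-region, per-aliquot likelihood ratios into a cumulative score can be sketched as below. The sketch assumes the two likelihoods for each observation have already been computed by some model and are strictly positive; the function name and input layout are hypothetical.

```python
import math

def cumulative_log_lr(observations):
    """Sum of log likelihood ratios across all target regions and aliquots.

    observations: iterable of (lik_cancer, lik_no_cancer) pairs, one pair per
    target region per aliquot. Summing logs is equivalent to taking the
    product of the individual ratios LRi, but avoids numerical underflow.
    """
    total = 0.0
    for lik_cancer, lik_no_cancer in observations:
        total += math.log(lik_cancer) - math.log(lik_no_cancer)
    return total
```

Two observations with ratios 2 and 4 give a cumulative ratio of 8, i.e. a summed log score of ln 8.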

In any embodiment, step (c) may be calculated by using a mixture model incorporating: (i) the estimates or probabilities of step (b); and (ii) the estimated rate of high signal background events; and optionally (iii) an estimate of the cancer DNA fraction in the test sample. For example, in some cases, step (c) may further comprise comparing the output of the mixture model or the likelihood ratio to a threshold, wherein an output that is at or above the threshold indicates that the test sample contains cancer DNA. The threshold may be determined by running at least 10 or at least 100 or at least 1000, or at least 10,000 samples without cancer DNA (or at least not known to have cancer DNA) through the assay and selecting a threshold above the signal identified in the control samples or a threshold such that the false positive rate as determined using the control samples is estimated to be 1% or below, 0.1% or below or 0.01% or below. As would be apparent, the method may further comprise identifying the patient as having cancer cells if the result is at or above the threshold and, for example, administering a therapy to the patient. In these embodiments, the patient may have previously undergone a first therapy. In these cases, the method comprises administering to the patient a second therapy that is different to the first therapy.
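One way the threshold could be derived from control samples is sketched below. It is an empirical-quantile illustration only, assuming each control sample yields a single score and that a test sample is called positive when its score is at or above the threshold. The function names are hypothetical, and `math.nextafter` requires Python 3.9 or later.

```python
import math

def threshold_from_controls(control_scores, max_false_positive_rate):
    """Lowest threshold whose empirical false positive rate is within the limit.

    control_scores: scores from samples not known to contain cancer DNA.
    max_false_positive_rate: e.g. 0.01, 0.001 or 0.0001; 0.0 places the
    threshold above every control signal.
    """
    if not control_scores:
        raise ValueError("at least one control score is required")
    ordered = sorted(control_scores, reverse=True)
    # Number of controls that may be called positive without exceeding the limit.
    allowed = int(max_false_positive_rate * len(ordered))
    if allowed >= len(ordered):
        return -math.inf
    # Calls are made "at or above" the threshold, so step just past the score.
    return math.nextafter(ordered[allowed], math.inf)

def is_positive(score, threshold):
    """A result at or above the threshold indicates cancer DNA."""
    return score >= threshold
```

With only a few control samples the empirical rate is coarse, which is consistent with the description's preference for hundreds or thousands of controls.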

In any embodiment, the method may further comprise determining the amount of cancer DNA or a range of likely amounts of cancer DNA in the test sample based on the estimates of step (b). This step may be done by, e.g., (i) calculating the mean or median variant allele fraction; (ii) maximum likelihood analysis; (iii) Bayesian posterior analysis; (iv) by counting the number of estimated mutant molecules for each variant and each aliquot or (v) by counting the number of variant positive target regions in each aliquot and comparing this against the total number of target regions multiplied by aliquots and quantifying the mean number of variant containing target sequences per target region per aliquot by applying a Poisson correction to the fraction of the positive results. This type of analysis has been done to calculate the number of starting molecules in digital PCR and can be adapted therefrom.

In any embodiment, the method may be performed on samples that are obtained from the patient during at least a first time point and a second time point, wherein the first time point is prior to a treatment and the second time point is after the treatment, and the method comprises determining if there is a change in the amount of cancer DNA or a range of likely amounts of cancer DNA between the first and second time points. This change may be determined using point estimates, confidence intervals or both. In these cases, a change of at least 20%, at least 30%, at least 50%, at least 70% or at least 90% may be considered significant. In some embodiments a change is considered significant if the change is greater than a threshold such as 50% and the confidence intervals when quantifying cancer DNA for the first and second time points do not overlap. In these embodiments, a significant decrease indicates the therapy is effective and no significant change or an increase indicates the therapy is not effective.
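The combined point-estimate and confidence-interval rule can be sketched as follows, assuming each time point is summarised as an estimate with lower and upper confidence bounds. The function name, the tuple layout and the default of 50% are illustrative choices, not values fixed by the description.

```python
def classify_change(first, second, min_relative_change=0.5):
    """Classify the change in cancer DNA between two time points.

    first, second: (estimate, ci_low, ci_high) tuples for the pre-treatment
    and post-treatment samples. A change is called significant only if the
    relative change exceeds min_relative_change and the intervals do not overlap.
    """
    est1, low1, high1 = first
    est2, low2, high2 = second
    if est1 <= 0:
        raise ValueError("the first estimate must be positive")
    relative_change = (est2 - est1) / est1
    intervals_overlap = low1 <= high2 and low2 <= high1
    significant = abs(relative_change) > min_relative_change and not intervals_overlap
    if significant and relative_change < 0:
        return "significant decrease"
    if significant:
        return "significant increase"
    return "no significant change"
```

Under the interpretation given in the description, only the first outcome would indicate that the therapy is effective.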

In any embodiment, sequence variations that are identified in a statistically improbable number of the aliquots, based on the estimated cancer DNA fraction, the number of DNA molecules added to each aliquot and optionally the number of times each variant is represented in an individual cancer cell (which may be determined through copy number analysis), are excluded from the results of step (b) prior to step (c). In any embodiment, step (a) may comprise sequencing at least three aliquots, e.g., 3, 4, 5, 6, 7, 8, 9, 10, 11 or 12 or more aliquots.

In some cases, if a variant is amplified in a cancer cell, then it may be expected to be in all aliquots. As such, this part of the method can be further improved by inputting the copy number of each variant in a cancer cell and using this to estimate the likely number of aliquots that should be above a threshold for each variant.
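The exclusion of variants seen in a statistically improbable number of aliquots can be sketched as below. The sketch assumes, for illustration only, that variant molecules are Poisson distributed across aliquots, that the expected count scales linearly with the copy number of the variant, and that aliquots are independent. The function names and the default `alpha` are hypothetical.

```python
import math

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k, n + 1))

def is_improbable(observed_positive, n_aliquots, tumor_fraction,
                  molecules_per_aliquot, copies_per_cell=1.0, alpha=0.001):
    """True if a variant is positive in more aliquots than is plausible.

    tumor_fraction: estimated cancer DNA fraction in the test sample.
    molecules_per_aliquot: number of DNA molecules added to each aliquot.
    copies_per_cell: assumed copies of the variant per cancer cell.
    """
    expected = tumor_fraction * molecules_per_aliquot * copies_per_cell
    p_positive = -math.expm1(-expected)   # P(at least one variant molecule)
    return prob_at_least(observed_positive, n_aliquots, p_positive) < alpha
```

At a tumor fraction of 0.001% and 5,000 molecules per aliquot, a single-copy variant seen in all of 4 aliquots would be flagged, whereas one seen in a single aliquot would not.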

In some embodiments, step (a) may also comprise sequencing positive and/or negative controls which may include at least one of: cancer DNA from an aspirate, biopsy or surgery sample coming from the same patient, buffy coat DNA, buccal swab DNA, whole blood DNA, adjacent normal DNA (i.e., tissue that is adjacent to a tumor that appears normal) or reference DNA. The sequencing of these samples may be performed at the same time as the test sample or it may be performed before or after sequencing the test sample.

In any embodiment, variants that are not detected in the cancer DNA are excluded. In addition or separately, variants that are detected in the buffy coat, buccal swab, adjacent normal or whole blood may be excluded.

In any embodiment, the two or more target regions are at least 2, at least 4, at least 10, at least 20, at least 50, at least 100, at least 500, at least 1000 or at least 5,000 target regions. In many embodiments, 2-200, e.g., 10-100, target regions may be examined. The sequence variations of step (a) may be independently single nucleotide variants, indels, doublet-base substitutions (DBSs), transpositions, rearrangements, variable number tandem repeats, short tandem repeats or a viral genome (such as HPV) integrated into the patient’s genome.

In some embodiments, the variants may be epigenetic variants, such as 5-methylcytosine (5mC) or 5-hydroxymethylcytosine, rather than sequence variants. In certain embodiments sequence variants and epigenetic variants are selected when 2 or more are present less than 10bp apart, less than 50bp apart or less than 100bp apart.

As noted above, the sequence variations analyzed in the method are pre-identified sequence variations. For example, the sequence variations may be identified by sequencing a sample of: (i) DNA or RNA isolated from a tissue biopsy that comprises cancer cells, (ii) DNA or RNA isolated from a cancer tissue obtained at surgery that comprises cancer cells, (iii) cell-free DNA or RNA or (iv) DNA or RNA isolated from circulating cancer cells, wherein the sample is from the same patient, e.g., prior to any treatment. For blood cancers, the sequence variations may be identified by sequencing a sample of DNA or RNA from bone marrow, circulating blood cells or lymph node, for example. In some embodiments both DNA and RNA are sequenced and the variants identified in each are combined. These sequence variations may be identified by sequencing the whole genome or by sequencing one or more of: the whole exome; genes frequently mutated in cancer (e.g. those in the COSMIC - Cancer Gene Census); the mitochondrial genome; regions of common structural rearrangements (e.g. common gene fusions or the edges of common amplifications such as MYC); regions of common amplification; regions of common rearrangements (e.g. chromothripsis); regions of common localized hypermutation (e.g. kataegis); or a region of the genome identified to typically contain sufficient numbers of mutations in the cancer type of interest that over 80% or 90% or 95% of the target patient population will have sufficient mutations identified to reach the required sensitivity (wherein the required sensitivity is pre-determined, as is the number of variants required to meet this sensitivity, and this is compared to the rate of mutations per megabase (Mb) and the variability between patients in the cancer type of interest in order to determine the number of Mb of the genome to target).

In some embodiments, viral sequences are targeted in order to identify those that have integrated into the human genome and where they have integrated. In some embodiments either the whole genome or specific regions of the genome are assessed for epigenetic changes, for example by whole-genome bisulfite sequencing, TET-assisted pyridine borane sequencing, enzymatic methyl-sequencing, reduced representation bisulfite sequencing, methylated DNA immunoprecipitation sequencing or targeted bisulfite sequencing. Both epigenetic and genetic changes can also be identified by array. In some embodiments, an assay utilising methylation changes and/or sequence variants is performed as an assay for early detection of cancer through the identification of these changes in ctDNA. In such an embodiment, when a patient is identified as likely to have ctDNA and therefore cancer, the epigenetic and/or sequence variants that are present in the patient’s ctDNA sample are identified and selected for targeting.

Hotspots could also be sequenced. Alternatively, the sequence variations may be identified by RNA-seq and optionally wherein RNA selection/depletion such as PolyA selection or Ribosomal RNA depletion is used to target specific types of RNA.

In some embodiments, a plurality of candidate sequence variations are first identified and then certain sequence variations may be selected. In some embodiments, the variations may be ranked and then the "best" variations may be selected, variants may be filtered to remove any that are not optimal for tracking, or variants may be first filtered then ranked. In some embodiments, the sequence variations are filtered, scored or ranked based on one or more of: i) clonality, wherein variants present throughout the tumor are preferred; ii) mappability, wherein variants whose reads are hard to map, based on attempted alignment of any predicted PCR amplicons designed to amplify the region or presence within pre-annotated black-listed regions or overlapping repeat and homopolymer region annotations, should be avoided; iii) estimated background error rate, wherein variants that have a high error rate are penalized or filtered; iv) estimated rate of high signal background events, wherein bases with low rates are preferred; v) distance from another selected variant. In some embodiments, the variants should be spaced evenly throughout the genome and not clustered together; for example, there are no more than 10% of all variants on any chromosome, or any chromosome arm, or any 1Mb region. This is to prevent loss of a region of the genome (e.g. through loss of a chromosome arm during evolution) causing many variants no longer to be present for tracking. In another embodiment, if two variants are close enough to be targeted in a single sequencing read and present on the same chromosome, such variants are preferred.
vi) predictive ability to sequence; vii) presence within a region of copy number gain or amplification, wherein variants present in multiple copies in a single cancer cell are preferred; viii) proximity of any germ line variants which may be used for enriching the mutant allele; ix) likelihood of being somatic; x) likelihood of being somatic but not being from the target cancer, such as being clonal hematopoiesis of indeterminate potential; xi) presence on a region frequently lost in the cancer type being tested, wherein avoiding such regions is preferred; xii) likelihood of the variant being a common SNP/polymorphism; xiii) likelihood of the variant being artefactual, arising from the specific protocol/sequencing method/capture kit.

This likelihood may be assessed through the prevalence of the variant in the current and/or previous reaction or sequencing batch, and through the variant profile matching that of known FFPE or other errors.

In some embodiments, all or a combination of these factors are scored, the variants are ranked by the score, and then selected. In some embodiments regions of the genome are ranked rather than specific variants. In such an embodiment the genome may be divided into overlapping or non-overlapping windows. The windows can, for example, be 10bp or 50bp or 100bp in length and these windows can overlap by 5bp, 25bp, 50bp or not at all. As would be apparent to someone skilled in the art, the window should be smaller than the typical length of DNA from the test sample and shorter than the sequencing read length of the intended sequencing platform. Therefore with high molecular weight DNA and long read sequencers, the window could be 100, or 1000 or 10,000bp, for example. With Illumina sequencers and cfDNA the windows should always be less than 160bp (the typical length of cfDNA). In a preferred embodiment the window is between 20 and 100 bp with an overlap that is half the length of the full window. Following the scoring of each variant, a score for each region is generated by combining the scores of all variants within the region, and optionally combining this with a score or scores for region specific features which may include mappability, predictive ability to sequence and presence within a region of copy number gain or amplification. In such an embodiment, the regions can be ranked and the best regions selected and an assay is designed to target these regions. An advantage of such a method is that it gives weight to regions of the genome where information may be obtained from multiple variants from a single molecule of test DNA (when the variants are in cis on the same chromosome), while still obtaining more information from targeting a single region when the variants are in the same genomic region but are in trans, i.e. on the other chromosome.
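The window-based ranking can be sketched as follows, assuming per-variant scores have already been computed and that a region score is the plain sum of the variant scores inside the window. The function name, the half-open coordinate convention and the summing rule are illustrative assumptions; region specific features such as mappability are omitted.

```python
def score_windows(variant_scores, region_start, region_end, window=50, overlap=25):
    """Rank fixed-length windows by the combined score of the variants inside.

    variant_scores: dict mapping a variant position to its score.
    window, overlap: lengths in bp; the default overlap is half the window.
    Returns (start, end, score) tuples, best first, with end exclusive.
    """
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the window length")
    scored = []
    for start in range(region_start, region_end - window + 1, step):
        end = start + window
        score = sum(s for pos, s in variant_scores.items() if start <= pos < end)
        scored.append((start, end, score))
    return sorted(scored, key=lambda item: item[2], reverse=True)
```

Because windows overlap, two nearby variants are captured together in at least one window whenever they are closer than the overlap length, which is what gives weight to regions carrying several variants.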
In some embodiments, different combinations of PCR primer pairs (forward and reverse) are designed to target the plurality of candidate sequence variations or regions identified and these are selected, scored, filtered out or ranked in order to identify one single best primer pair for each of the variations or regions based on features which may include: i) presence of a repetitive region within the primer sequence (e.g., avoid homopolymer regions of >= 6 nucleotides); ii) presence of known single nucleotide polymorphisms within the primer sequence (wherein this is either avoided or the tumor sequencing is used to confirm if the SNP is present); iii) predicted formation of unintended PCR products that are likely to be sequenceable as they are produced using one forward and one reverse primer, based on in silico PCR and/or local alignment and/or 3'-based alignment of primers to primers and/or primers to amplicon regions (wherein there is a high penalty for such primer combinations); iv) as in iii), but the predicted formation of unintended PCR products that are likely to be unsequenceable (because they are either made with 2 forward primers or 2 reverse primers and such products would not allow sequencing as they would not contain both required sequencer adaptors) (wherein there is a low penalty for such primer combinations compared to iii)); v) total amplicon size in nucleotides; vi) number of times the predicted PCR product aligns to regions of the genome beyond the expected target (ranking score may be based on multiple mapping); vii) number of times the primer sequences align to regions of the genome other than the intended target; viii) number of times there is alignment of a primer pair constituted by a forward and a reverse primer other than the intended one in close proximity (i.e. less than 50, less than 100 or less than 150 nucleotides, based on a pre-defined threshold); ix) combined score of all variants present within the target amplicon.
In some embodiments, the primers are filtered based on some or all of these features when a score is above a threshold. In some embodiments a composite scoring based on a linear or polynomial combination of some or all of the features is used to select the optimum multiplex. In some embodiments, a large number of variants are selected from a cancer DNA containing sample or cell line and a plurality of multiplex PCR panels are designed against these variants. A dilution series of the cancer DNA into normal DNA is generated then the plurality of multiplex PCR assays are used to generate sequencing libraries from the DNA. The process is optimally repeated with at least 10 or at least 100 samples. Some or all of the primer features along with the sequencing signal are inputted into a machine learning system or a neural network in order to determine the optimal combination of primers for detecting cancer DNA in a test sample.

In some embodiments, reagents to target the variants (e.g. capture baits or multiplex PCR primers) may be designed for all variants, then rather than selecting variants or regions, the best combination of primers or baits is selected. The primers or baits may be ranked and selected based on a combination of the score of all variants or regions targeted by each primer, pair of primers or baits and the predicted ability to amplify and/or enrich and/or sequence the targeted variants or regions within a multiplex of the other primers or baits. As would be apparent, it may be advantageous to select and rank the primers or baits in this way rather than the variants or regions. This is because the output of the assays is the integrated analysis of the collective results of multiple variants and it may therefore be preferable in some embodiments to assess larger numbers of variants at the expense of a few variants which may score highly but be challenging to multiplex with others.

In one embodiment, the best multiplex assay is designed after the top variants are selected.

In any embodiment, the patient has or had cancer or has a clonal growth that is not yet cancer but has the potential to transform. In some embodiments, the patient has undergone or is undergoing treatment for the cancer.

In any embodiment, the DNA is cell-free DNA, e.g., cell-free DNA isolated from blood plasma, blood serum, cerebrospinal fluid, urine, saliva, or stool. In other embodiments, the DNA may be isolated from cells, e.g., bone marrow cells, cells from a lymph node or circulating white blood cells in the case of a blood cancer, or cells from a lymph node, cells from a tumor margin or other sample types such as CSF and whole blood that are presently screened for the presence of cancer cells from solid tumors by other means.

The fraction of cancer DNA in the test sample of DNA may be equal or less than 0.01%, equal or less than 0.005%, equal or less than 0.002%, or equal or less than 0.001%, and in some embodiments, the test sample comprises less than 25,000 genome equivalents of DNA, e.g., less than 20,000, less than 10,000, or less than 5,000 genome equivalents of DNA.

In some embodiments, the number of aliquots and the maximum number of molecules per aliquot is adjusted based on the total number of input molecules and the estimated background error rate such that the number of input molecules in a single aliquot is low enough that if a single variant molecule were present it would produce a signal significantly different to background.
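The aliquot sizing described above can be sketched under a deliberately simple criterion: one variant molecule among m input molecules gives an allele fraction of about 1/m, and the sketch requires this to exceed the background error rate by a chosen factor. The function name and the `signal_to_background` factor are hypothetical; a formal significance test that accounts for read depth could be substituted.

```python
import math

def plan_aliquots(total_molecules, bg_error_rate, signal_to_background=5.0):
    """Return (number of aliquots, maximum molecules per aliquot).

    total_molecules: total number of input molecules available.
    bg_error_rate:   estimated background error rate for the variants.
    signal_to_background: assumed minimum ratio between the allele fraction
    of a single variant molecule (1 / molecules) and the background rate.
    """
    if total_molecules <= 0 or bg_error_rate <= 0:
        raise ValueError("inputs must be positive")
    max_per_aliquot = max(1, math.floor(1.0 / (signal_to_background * bg_error_rate)))
    n_aliquots = math.ceil(total_molecules / max_per_aliquot)
    return n_aliquots, max_per_aliquot
```

A higher background error rate therefore forces fewer molecules per aliquot and, for a fixed total input, more aliquots.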

In any embodiment, for each aliquot of each sequence variation, the read depth of step (a) may be at least 10,000, at least 25,000, at least 50,000 or at least 100,000 or at least 500,000. In any embodiment, the method may comprise measuring the amount of DNA in the test sample prior to step (a).

In any embodiment, the sequences of the target regions may be enriched from the test sample prior to step (a) by PCR, by hybridization to a nucleic acid probe, or using a one-sided PCR approach wherein there is a universal sequence on one side of the target DNA molecule and at least one and optionally a further nested primer are used to target the other side of the molecule. Other methods known to those skilled in the art such as Linked Target Capture, Molecular inversion probes and ATOM Seq may also be used.

As noted above, the present method may be done using a threshold-based approach. In these embodiments, any target region in any aliquot may be determined to contain at least one mutant molecule: i) if the estimate of the number of molecules that have the sequence variation in step (b) is 1 or greater, ii) if the probability calculated in step (b) is above a specificity threshold (e.g. 95%, 99%, 99.9%), iii) if the frequency is above the threshold, or iv) by calculating a likelihood ratio for each variant in each aliquot between the likelihood of observing the estimates in (b) in samples: (i) if cancer DNA is present and (ii) if cancer DNA is not present, then confirming whether the result is at or above a threshold. In some embodiments where a target region contains 2 variants, the region may be determined to contain at least one mutant molecule if signal for both variants is present within the same sequence.

In some embodiments, cancer DNA may be determined in step (c) of the method: i) if there are equal to or more than a threshold number of target regions in any aliquots that are determined to contain at least one mutant molecule, and/or ii) if there are at least 2 or at least 3 aliquots determined to contain at least one target region with at least one mutant molecule. In these embodiments, the threshold number of target regions may be: i) 2 or more (e.g., 3, 4, 5 or 10 or more) target regions in any aliquots that are determined to contain at least one mutant molecule, or ii) determined by combining the estimated rate of high signal background events for all target regions and aliquots to determine a threshold where one would expect the number of high signal background events to occur less than 5%, 0.5%, 0.1% or 0.01% or 0.001% of the time (for example, if there were 4 aliquots and 48 target regions, and for the specific combination of target regions and variants within these regions, it was estimated that you would get 4 or more high signal events across all aliquots less than 0.01% of the time, then a threshold of 4 would be set) or iii) a score rather than a fixed number of target regions or variants, wherein the threshold score is either 2 or 3, and wherein a positive target region or variant contributes a different score depending on its rate of high signal background events. In one embodiment, variants or classes of variants that never have high signal background events are given a score of 1 and the remaining variants or classes of variants are split into 1 or more groups based on their rate of high signal background events and given a lower score. For example, there may be two groups: the 50% of variants or variant classes with the lowest rate of high signal events receive a score of 0.75 whilst the 50% with the highest rate get a score of 0.5 whenever positive.
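Option ii), deriving the threshold number of positive target regions from the high signal background event rates, can be sketched as follows. The sketch assumes each target region in each aliquot experiences a high signal background event independently with its own probability, so the total count follows a Poisson binomial distribution. The function names are hypothetical.

```python
def count_distribution(event_probs):
    """Distribution of the number of events among independent trials.

    event_probs: one probability per (target region, aliquot) combination.
    Returns a list where entry k is P(exactly k events).
    """
    dist = [1.0]
    for p in event_probs:
        updated = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            updated[k] += mass * (1.0 - p)   # no event in this trial
            updated[k + 1] += mass * p       # event in this trial
        dist = updated
    return dist

def min_positive_threshold(event_probs, max_false_rate):
    """Smallest count t such that P(t or more background events) < max_false_rate."""
    dist = count_distribution(event_probs)
    tail = 0.0
    for k in range(len(dist) - 1, -1, -1):
        tail += dist[k]                      # tail is now P(count >= k)
        if tail >= max_false_rate:
            return k + 1
    return 0
```

For the example in the text, `event_probs` would hold 4 x 48 = 192 entries, and a returned value of 4 would correspond to the stated threshold of 4.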

In any embodiment, the threshold frequency of step (b) may be determined using a binomial, over-dispersed binomial, Beta, Normal, Exponential or Gamma probability distribution model of the background error rate for the sequence variation, wherein the frequency is selected such that a signal would be observed above this less than 5%, 2%, 1%, 0.1%, 0.01% or 0.001% of the time, depending on the desired pre-defined per variant specificity, when no mutant molecules are present.
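For the binomial case, the threshold frequency can be sketched as below. The sketch assumes a fixed per-read error rate and treats reads as independent; an over-dispersed model would change the tail calculation. The function name is hypothetical, and the threshold frequency is the returned read count divided by the read depth.

```python
def read_count_threshold(depth, error_rate, alpha):
    """Smallest variant read count k with P(X >= k) < alpha.

    X ~ Binomial(depth, error_rate) models the variant reads expected from
    background error alone; alpha is one minus the per variant specificity.
    """
    if depth <= 0 or not 0.0 < error_rate < 1.0 or not 0.0 < alpha < 1.0:
        raise ValueError("need depth > 0 and error_rate, alpha in (0, 1)")
    odds = error_rate / (1.0 - error_rate)
    pmf = (1.0 - error_rate) ** depth        # P(X = 0)
    cdf = 0.0                                # P(X <= k - 1) at the top of each loop
    for k in range(depth + 1):
        if 1.0 - cdf < alpha:
            return k
        cdf += pmf
        pmf *= (depth - k) / (k + 1) * odds  # recurrence to P(X = k + 1)
    return depth + 1
```

At a depth of 100,000 reads and an error rate of 0.001% (an expected one error read), a 1% false signal rate corresponds to 5 variant reads, i.e. a threshold frequency of 0.005%.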

Further details, alternative steps and embodiments of the present method are described below.

Sequence variations that are associated with the patient’s cancer

The present method involves analyzing multiple sequence variations that are associated with the patient’s cancer in a sample, where such sequence variations are believed to be present in the cells of a patient’s cancer. Any individual sequence variation may be a driver mutation or a passenger mutation, and a sequence variation may be clonal or non-clonal. The sequence variations used in the present method are cancer-associated in the sense that they are believed to be only in the cancer cells and not the normal cells in the patient. The set of mutations that define a patient’s cancer is patient-specific in the sense that it varies from patient to patient, although some mutations (e.g., in KRAS, etc.) may occur in several patients and/or in several different types of cancer. Because the positions of passenger mutations in the genome are difficult to predict beforehand (although there may be some hotspots) and the positions of the sequence variations differ from patient to patient, the sequence variations that are analyzed in the present method may be identified on a patient-to-patient basis. In some embodiments, the sequence variations can be identified from samples where the cancer fraction is higher - for example, a bone marrow aspirate, a tissue biopsy sample or isolated circulating cancer cell(s). For example, the sequence variations may have been identified by sequencing DNA isolated from a bone marrow aspirate, tumor tissue biopsy or surgical resection, from circulating tumor cells (CTCs), from other cells that are no longer part of the tumor tissue but are not circulating such as those in urine or stool samples, or cell-free DNA from the patient, where the sample from which the DNA is extracted was obtained from the patient prior to treatment for cancer when ctDNA levels are more likely to be high. In some embodiments, multiple sample types or multiple regions from the same sample may be sequenced in order to determine clonality.
This sequencing step may be done by whole genome sequencing, exome sequencing or targeted sequencing (e.g., by sequencing a panel of cancer genes or by sequencing a panel of sequences that are hotspots for mutations), etc. as described above. As would be apparent, the patient may be a cancer patient, where the patient has undergone, may be undergoing treatment for the cancer or may be about to undergo treatment. In other words, the sequence variations may be identified in a sample in which they are present at a relatively high level, e.g., in a sample that was collected before any cancer treatment has been initiated.

Depending on how the method is performed, the sequence variations may be identified before the test sample has been analyzed or at the same time as the test sample is being analyzed. As such, some embodiments of the present method use “pre-identified” sequence variation, where “pre-identified” sequence variations are sequence variations that have previously been identified as being associated with a patient’s cancer, e.g., before or during treatment. In other embodiments, the sequence variation is not pre-identified and, instead, the sequence variations may be identified by comparing sequence reads from the test sample to sequence reads obtained from control samples (e.g., positive and negative control samples, as described below).

The sequence variations analyzed in the method may independently be single nucleotide variations, indels, transpositions or rearrangements. In general, the sequence variations can be identified by sequencing DNA isolated from a tissue sample (e.g., a biopsy, surgical resection or fine needle/large needle aspiration) that comprises cancer cells or by sequencing cell-free DNA from the patient (e.g., by whole genome sequencing, exome sequencing or a targeted sequencing approach), where multiple regions are sequenced. For example, in some embodiments a list of sequence variants may be obtained through sequencing at least 50 kb of cancer DNA, through targeted sequencing of a large region of the genome or whole genome sequencing, where the cancer DNA is obtained from either tumor tissue (e.g., a biopsy) or a sample expected to have high levels of cancer DNA in it (such as a pre-treatment plasma DNA sample). In some embodiments just cancer DNA is sequenced. In an alternative embodiment, both cancer DNA and DNA expected to be normal, such as whole blood, buffy coat, apparently normal tissue adjacent to the tumor or buccal swab, may be sequenced. Variants may be classified as somatic or germline either by assessing the cancer and normal DNA or by assessing just the cancer DNA and using the variant allele fractions, optionally in addition to other features, as is known in the art.

In some cases, analysis of the initial cancer DNA sample may result in a list of candidate sequence variations, where some of the candidate sequence variations are eliminated to produce a list of pre-identified sequence variations. In some embodiments, this method may comprise obtaining a list of candidate variants that are believed to be somatic from the patient whose sample is being assessed (e.g., by sequencing a biopsy) and then prioritizing the variations. In these embodiments, the prioritization may be based on, e.g., the probability of being a real variant as opposed to a sequencing artefact, the probability of being a somatic genetic abnormality, the probability of being a clonal mutation, an estimate of the error rate, an estimate of the compatibility to multiplex with other variants and/or the mappability of the variant and surrounding regions, or the estimated number of copies of the variant in each cancer, such as presence in a region of gain or an amplification, in episomes or double minute chromosomes or regions of chromoplexy, etc. In addition to prioritizing the candidate variations, one or more of the candidate sequence variations may be eliminated and only a subset of the candidate sequence variations may be selected for future analysis. For example, after the candidate sequence variations are identified, the target regions that contain those sequence variations may be sequenced in DNA from normal cells (buffy coat, white blood cells, buccal swab, or adjacent tissue). This sequencing may be performed using the same approach as used for sequencing the tumor DNA or the sequencing may be performed using an assay designed to detect variants identified in the tumor DNA. Any variants identified in these normal cells may be eliminated from the candidates as being likely to be germline polymorphisms or clonal hematopoiesis and the remainder of the sequence variations can be prioritized.
For example, in some embodiments, the method may further comprise sequencing at least some of the target regions in the DNA of white blood cells from the patient. In these embodiments, the method may involve comparing the candidate genetic variations to the genetic variations called using the white blood cell DNA. If a variation is identified in both samples, then it may be eliminated from being a pre-identified sequence variation. This embodiment provides a way to identify variations that may be potentially due to clonal hematopoiesis of indeterminate potential (CHIP) (see, generally, Funari et al, Blood 2016 128:3176 and Heuser et al, Dtsch. Arztebl. Int. 2016 113: 317-322) and germline variants so that they can be eliminated from future analysis. In an alternative embodiment, the method may involve comparing the candidate genetic variations to the genetic variations called using the apparently normal tissue adjacent to the tumor. If a variation is identified in both samples, then it may be eliminated from being a pre-identified sequence variation. This embodiment provides a way to identify variations that may be potentially due to cancer field effect and germline variants so that they can be eliminated from future analysis.

As such, in any embodiment, the method may comprise sequencing one or more positive and/or negative control samples (which may be run prior to or at the same time as the test sample). As would be apparent, this assay is “personalized” in that the initial cancer DNA sample, the control samples and the test sample are obtained from the same individual. Positive and negative control samples include but are not limited to: tumor DNA from a biopsy or surgery sample, either from the primary tumor or a metastasis, buffy coat DNA, buccal swab DNA, whole blood DNA, DNA isolated from normal tissue (e.g., adjacent tissue) or reference DNA. In these embodiments, sequence variations that are not detected in the tumor DNA may be excluded, as may sequence variations that are detected in the buffy coat, buccal swab, adjacent normal tissue or whole blood. In any embodiment, a sequence variation may be prioritized based on one or more factors which may include: clonality, mappability, estimated error rate, distance from another selected variant, compatibility with other variants when designing a multiplex PCR or hybrid capture panel, predicted ability to sequence, presence within a region of copy number gain or amplification, and proximity of any germline variants either in cis or trans which may be used for enriching the mutant allele. Methods that would enable enrichment of sequence variations in close proximity to a germline variant include performing allele specific PCR wherein at least one of the primers is specific to the strand with the germline change and the variant is on the same strand (in cis), or targeting the germline change, for example with a restriction enzyme, Cas9 or similar method, when the variant is on the opposite strand (or in trans) in order to remove wild type strands.
In other embodiments a sequence variation may be prioritized based on its suitability for variant enrichment methods such as allele specific PCR, COLD-PCR or other methods known to those skilled in the art. As may be apparent, the sequence variations analyzed in the method may vary from patient to patient such that the sequence variations analyzed in the method are “customized” to each patient. As such, in many embodiments, the method may comprise identifying a first set of sequence variations from a DNA sample from a first patient, a second set of sequence variations from a DNA sample from a second patient, a third set of sequence variations from a DNA sample from a third patient, and so on.

Aliquot-based sequencing

The aliquot-based sequencing method may be practiced in a variety of different ways. In some embodiments, target regions that have the sequence variations may be sequenced using an “amplicon-based” approach in which the target fragments that have pre-identified sequence variations are directly amplified by PCR from the sample. In some embodiments the test sample may first be pre-amplified, for example by the ligation of adaptors and performing PCR targeting the ligated adaptors. In these embodiments, the sequencing adapters may be added during amplification or may be ligated on after the amplification. In other embodiments, target regions that have pre-identified sequence variations may be sequenced using a “target enrichment-based” approach in which adapters are ligated to the sample, and fragments containing the target regions are enriched by hybridization to a nucleic acid probe prior to amplification using primers that hybridize to the adapters. In such embodiments, either aliquot ligation reactions may be performed, or adaptors with a plurality of barcodes may be ligated onto the DNA, enabling the effective separation of groups of molecules into separate barcode groups or “aliquots”. As such, sequences of the target regions can be enriched from the sample by PCR or by hybridization to a nucleic acid probe. Other enrichment methods may be used. In other embodiments any other method with either physical replication or use of molecular barcodes may be utilized, such as Molecular Inversion Probes (MIP) or Anchored Multiplex PCR (AMP). Some of the principles of the amplicon-based method are described below. Similar concepts can be applied to the target enrichment approach.
In some embodiments the variant sequences may be enriched during the targeting step using methods including COLD-PCR, allele specific PCR targeting the variant, allele specific PCR targeting an adjacent germline change, digestion of wild type sequence through the utilization of adjacent germline changes or other methods known to those skilled in the art.

In embodiments that employ pre-identified sequence variations, multiple primer pairs are obtained after the pre-identified sequence variations have been identified, where each primer pair amplifies a target region that has one or more of the pre-identified sequence variations. In some embodiments, the length of each amplicon, independently, may be in the range of 50 bp to 500 bp, e.g., 70-150 bp, although longer or shorter amplicons may be used in some implementations. In some embodiments some of the variants are rearrangements. In these embodiments, primers are designed with one primer 3’ of the rearrangement and one primer 5’ of the rearrangement, wherein the rearranged sequence is used to design the primer pairs and the primers are specifically designed to amplify the rearranged sequence. After the primer pairs have been obtained, the method may comprise setting up at least two multiplex PCR reactions (e.g., up to 10 multiplex PCR reactions, such as 2, 3, 4, 5, 6, 7, 8, 9 or 10 multiplex PCR reactions) each containing a portion of the same sample (i.e., different aliquots of the same sample). In this step, the multiplex PCR reactions can be identical to one another in that all the reactions have the same primers and different portions of the same sample. In this method, the number of aliquots and the maximum number of molecules per aliquot may be adjusted based on the total number of input molecules and the estimated background error rate such that the number of input molecules in a single aliquot is low enough that if a single variant molecule were present it would produce a signal significantly different to background.
As would be apparent, each multiplex PCR reaction should contain compatible primers, where compatible primers are designed to specifically amplify regions of interest producing amplicons that correspond to the PCR primer pairs while minimizing the production of primer dimers and unintended or non-specific PCR products, when the reaction is subjected to appropriate thermocycling conditions with an appropriate template for the primers. Typically, although not always, each primer pair amplifies a single region of interest in a multiplex PCR reaction. Conditions for performing multiplex PCR and programs for designing compatible primers are well known (see, e.g., Sint et al, Methods Ecol Evol. 2012 3: 898-90 and Shen et al BMC Bioinformatics 2010 11: 143). Compatible primer pairs may be designed using any one of a number of different programs specifically designed to design primer pairs for multiplex PCR methods. For example, the primer pairs may be designed using the methods of Yamada et al. (Nucleic Acids Res. 2006 34:W665-9), Lee et al. (Appl. Bioinformatics 2006 5 :99-109), Vallone et al. (Biotechniques. 2004 37: 226-31), Rachlin et al. BMC Genomics. 2005 6:102 or Gorelenkov et al. (Biotechniques. 2001 31: 1326-30). In some embodiments, the method may employ at least 5 pairs of compatible primers, e.g., at least 10, at least 50, at least 100, at least 1000 or at least 5000 pairs of compatible primers. The amplicons amplified can be of any suitable length and may vary in length. In some embodiments, sequence variations may be prioritized based on the likely compatibility of primer designs in a multiplex PCR. Next, the amplicons produced by thermocycling the reaction, or amplification products thereof (if the amplicons are re-amplified by universal primers that hybridize to 5’ tails in the primers, for example) are sequenced to produce sequence reads. 
The various aliquot PCR reactions should produce replicate amplicons, where “replicate” amplicons are amplicons that are amplified by the same primers in the aliquots. Replicate amplicons generally have the same sequence (except for PCR errors, variations corresponding to genetic variations in the sample, any variations in the PCR primers, etc.).
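
The aliquot sizing consideration described above (keeping the number of input molecules per aliquot low enough that one variant molecule stands out from background) can be illustrated with a minimal sketch. The rule used here, that a single molecule's allele fraction of 1/n should be at least `min_fold` times the background error rate, is an assumption made for illustration; the function name, `min_fold` and `max_aliquots` are hypothetical and are not taken from the method as described.

```python
from math import ceil, floor

def plan_aliquots(total_molecules, error_rate, min_fold=4.0, max_aliquots=10):
    """Split a sample so that one variant molecule stands out from background.

    min_fold and max_aliquots are illustrative assumptions; the text only gives
    "up to 10" multiplex reactions as an example.
    """
    # One variant molecule among n input molecules has an allele fraction of 1/n.
    # Require 1/n >= min_fold * error_rate, i.e. n <= 1 / (min_fold * error_rate).
    max_per_aliquot = max(1, floor(1.0 / (min_fold * error_rate)))
    # At least two aliquots are used, capped at max_aliquots.
    n_aliquots = max(2, ceil(total_molecules / max_per_aliquot))
    n_aliquots = min(n_aliquots, max_aliquots)
    per_aliquot = ceil(total_molecules / n_aliquots)
    # False means the cap on aliquots prevents the signal target from being met.
    meets_target = per_aliquot <= max_per_aliquot
    return n_aliquots, per_aliquot, meets_target

plan = plan_aliquots(total_molecules=8000, error_rate=1e-4)
print(plan)
```

With 8,000 input molecules and an assumed error rate of 1 in 10,000 reads, this sketch suggests four aliquots of 2,000 molecules each. A much larger input would hit the cap on aliquots and return `False` for the last element.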

In sequencing the amplicons, the amplicons derived from each different multiplex PCR reaction may be sequenced separately to one another or the amplicons may be barcoded with an aliquot identifier and then pooled prior to sequencing. In some embodiments, the primers in the multiplex PCR reactions may have a 5’ tail that contains the aliquot identifier such that, after the PCR reactions have been completed, the sequence of the 5’ tail of the primers is present in the amplicons. In other embodiments, the multiplex PCR reactions can be done without using primers that have a 5’ tail that contains an aliquot identifier. In these embodiments, the PCR products may be barcoded with an aliquot identifier in a second round of amplification that uses PCR primers that have a 5’ tail containing an aliquot identifier. Adapter sequences could also be ligated onto the products. Either way, the amplicons may be amplified prior to sequencing, using primers that have a 5’ tail that provides compatibility with a particular sequencing platform. In certain embodiments, in addition to an aliquot identifier, one or more of the primers used in this step may additionally contain a sample identifier. In some embodiments, one or both of the primers may contain a barcode, which either independently or in combination may be used to identify both the sample and aliquot. If the primers have a sample identifier, then products derived from different samples can be pooled prior to sequencing. In some embodiments, the target specific primers contain from 5’ to 3’ a universal “tagging” sequence, an optional aliquot barcode sequence followed by a sequence designed to the target of interest. 
The primers used to further amplify the initial products may contain a 5’ tail that provides compatibility with a particular sequencing platform, a sample barcode and optionally a aliquot barcode or a barcode that identifies both the sample and aliquot, and a sequence that can bind to either part or all of the reverse complement of the tagging sequence present on the target specific primers. Typically, the forward and reverse primers will have different tagging sequences. As would be apparent, the primers used for the amplification step may be compatible with use in any next generation sequencing platform in which primer extension is used, e.g., Illumina’s reversible terminator method, Roche’s pyrosequencing method (454), Life Technologies sequencing by ligation (the SOLiD platform), Life Technologies Ion Torrent platform or Pacific Biosciences’ fluorescent base-cleavage method and any other platforms e.g. Oxford Nanopore. Examples of such methods are described in the following references: Margulies et al (Nature 2005 437: 376-80); Ronaghi et al (Analytical Biochemistry 1996 242: 84-9); Shendure (Science 2005 309: 1728); Imelfort et al (Brief Bioinform. 2009 10:609-18); Fox et al (Methods Mol Biol. 2009;553:79-108); Appleby et al (Methods Mol Biol. 2009;513:19-39) English (PLoS One. 2012 7: e47768) and Morozova (Genomics. 2008 92:255-64), which are incorporated by reference for the general descriptions of the methods and the particular steps of the methods, including all starting products, reagents, and final products for each of the steps.

In alternative embodiments, the aliquot-based sequencing could target a panel of mutation hotspots or a panel of cancer genes. Alternatively, the sequencing step could be performed by exome or whole genome sequencing, or by sequencing at least 1, at least 5 or at least 10 Mb of the genome to a suitable depth. In these embodiments, the sequence variations do not need to be “pre-identified”. Rather, the sequence variations can be identified in the same assay in which the test sample is sequenced, i.e., by comparison of the data to controls that are also run in the same assay (e.g., the same sequencing run). Once the sequence variations have been identified using the control samples, those sequence variations can be analyzed in the test sample.

The sequencing step may be done using any convenient next generation sequencing method and may result in at least 100,000, at least 500,000, at least 1M, at least 10M, at least 100M, at least 1B or at least 10B sequence reads per reaction. In some cases, the reads may be paired-end reads.

Processing sequences, estimating variant molecules and determining presence of cancer DNA

The sequence reads are then processed computationally. The initial processing steps may include identification of barcodes (including sample identifiers or aliquot identifier sequences) and trimming reads to remove low quality or adaptor sequences. In addition, quality assessment metrics can be run to ensure that the dataset is of an acceptable quality. After the sequence reads have undergone initial processing, they may be analyzed to identify which reads correspond to the target regions. These sequences can be identified because they are identical or near identical to the sequence of a target regions. As would be recognized, the sequence reads that are identical or near identical to the target region can be analyzed to determine if there is a potential variation in the target sequence. Sequences may be aligned with a reference sequence, e.g., a genomic sequence, in this method or matched to a database of expected sequences.
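
As a rough illustration of the initial processing described above, the following sketch assigns reads to aliquots by a leading barcode and counts, for each aliquot and target region, the variant reads and the total reads. It is a simplification under stated assumptions: the aliquot identifier is taken to be a fixed-length prefix of the read, and reads are matched exactly to a reference or variant sequence, whereas a real pipeline would trim, quality-filter and tolerate near-identical matches. All names and sequences are hypothetical.

```python
from collections import Counter

def count_reads(reads, barcode_to_aliquot, targets, barcode_len):
    """Count variant and total reads per (aliquot, target region).

    targets maps a region name to (reference_sequence, variant_sequence).
    Exact matching is an assumption made to keep the sketch short.
    """
    variant_counts = Counter()
    total_counts = Counter()
    for read in reads:
        aliquot = barcode_to_aliquot.get(read[:barcode_len])
        if aliquot is None:
            continue  # unknown aliquot identifier: discard the read
        insert = read[barcode_len:]
        for name, (reference_seq, variant_seq) in targets.items():
            if insert == reference_seq or insert == variant_seq:
                total_counts[(aliquot, name)] += 1
                if insert == variant_seq:
                    variant_counts[(aliquot, name)] += 1
                break  # a read is assigned to at most one target region
    return variant_counts, total_counts

# Toy data: two aliquot barcodes and one target region with a single-base change.
barcodes = {"AAA": 1, "CCC": 2}
targets = {"T1": ("ACGTACGT", "ACGTTCGT")}
reads = [
    "AAA" + "ACGTACGT",  # aliquot 1, reference
    "AAA" + "ACGTTCGT",  # aliquot 1, variant
    "CCC" + "ACGTACGT",  # aliquot 2, reference
    "GGG" + "ACGTACGT",  # unknown barcode, ignored
    "AAA" + "TTTTTTTT",  # off-target, ignored
]
variant_counts, total_counts = count_reads(reads, barcodes, targets, barcode_len=3)
```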

After the sequence reads have been processed, the method may comprise, for each aliquot and each sequence variation, counting the number of sequence reads that have the sequence variation and counting the total number of sequence reads. Methods for counting reads may be adapted from those described by e.g., Forshew et al (Sci. Transl. Med. 2012 4:136ra68), Gale et al (PLoS One 2018 13:e0194630), and Weaver et al (Nat. Genet. 2014 46:837-843). Similar results can be obtained using an approach that employs molecular indexes. In these methods the total number of molecules sequenced and the number of variant molecules can be estimated using the indexes. Such molecule identifier sequences may be used in conjunction with other features of the fragments (e.g., the end sequences of the fragments, which define the breakpoints) to distinguish between the fragments. Molecule identifier sequences are described in (Casbon Nucl. Acids Res. 2011, 22 e81). As illustrated in Fig. 11 , after counting the number of sequence reads that have the variation and counting the total number of sequence reads, an estimate of the number of molecules in the original sample before amplification, that had the sequence variation can be determined for each aliquot of each target region. Alternatively, one can calculate the probability that there is at least one molecule that has the sequence variation, for each aliquot of each target region. The latter can be derived by, for example, summing the individual probabilities for all non-zero numbers (i.e., all positive integers) of molecules. In these embodiments, the estimate can be a probabilistic estimate, meaning that the estimate is not a point estimate but is a probability distribution. This step may be done by assigning each possible number of variant molecules in the aliquot with a probability, which may be done via a probability density function, an example of which is illustrated in Fig. 12. 
In these embodiments, for each aliquot and target region the estimate of the number of molecules that have sequence variation or the probability that there is at least one molecule that has the sequence variation may be calculated using: (i) the number of sequence reads that have the sequence variation, (ii) the total number of sequence reads, (iii) the number of molecules input into each aliquot, and (iv) the estimated background error rate for the sequence variation. In these embodiments, the sequence of the target region will be represented by a number of sequence reads (e.g., at least 10,000 reads, although this number can vary depending on the number of aliquots that are sequenced) and some of those reads may contain the sequence variations. These reads can be counted in order to provide input values (i) and (ii). Input value (iii) can be calculated by measuring the amount of DNA in the DNA sample prior to initiating the method. This can be done, for example, by measuring the total amount of DNA, the total amount of double stranded DNA, the total amount of double and single stranded DNA, the total amount of DNA within a specific size range or the total amount of DNA that can be amplified using primers with specific parameters such as amplicon size. This step can be done by digital PCR, qPCR, fluorometrically, through electrophoresis or using any of a variety of kits or other strategies. The estimated background error rate for each sequence variation, i.e., input value (iv), can be determined from prior sequencing reactions, e.g., sequencing reactions done on samples that are known to not have the sequence variation or on samples from individuals not known to have cancer and therefore not anticipated to have large numbers of somatic variants. 
Specifically, background error rate for each variation can be estimated through the sequencing of similar variants in DNA not expected to contain somatic mutations in the similar variants being assessed either in the same run, in historical runs or using historical runs then adjusting using select control bases (or bases not known to contain variants), and wherein variants are considered to be similar based on features which may include; the base change, the type of base change (transition/transversion) and the trinucleotide context, the pentanucleotide context, the position in the amplicon in reference to a primer, size of insertion, type and number of inserted bases, size of deletion, type and number of deleted bases or class of rearrangement, for example tandem duplication. A hypothetical error model is shown as a frequency distribution in Fig. 13A or a mixture model shown in Fig. 13B. In these examples, multiple samples (e.g., several hundred samples) that are not known to contain somatic variants are sequenced, and the fraction of sequence reads that have a particular type of sequence variation can be calculated for each sample. The variant sequence reads are largely caused by errors that occur during PCR, base mis-calls and pre-PCR events such as DNA damage (e.g., the oxidation of guanine to 8-oxoguanine, which base pairs with A, resulting in what appears to be a G to T variation in a sequence read). These fractions can be plotted as a frequency distribution which, in turn, can be used to calculate the probability of whether a sequence variation observed in a sequence read is really a genetic variation.
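
The probabilistic estimate described above can be sketched from the four inputs: (i) variant reads, (ii) total reads, (iii) input molecules per aliquot and (iv) the background error rate. The model below is an illustrative assumption rather than the model of the method: reads are treated as binomial draws, the expected variant read fraction for m variant molecules is the true fraction plus background error on the remaining molecules, and a simple binomial prior (`prior_fraction`, hypothetical) is placed on m. Amplification bias and high signal background events are ignored.

```python
from math import exp, lgamma, log

def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    if p <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p >= 1.0:
        return 0.0 if k == n else float("-inf")
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_coeff + k * log(p) + (n - k) * log(1.0 - p)

def variant_molecule_posterior(variant_reads, total_reads, input_molecules,
                               error_rate, prior_fraction=1e-4, max_molecules=10):
    """Posterior probability of 0..max_molecules variant molecules in an aliquot."""
    log_weights = []
    for m in range(max_molecules + 1):
        # Assumed prior: each input molecule is variant with prior_fraction.
        log_prior = log_binom_pmf(m, input_molecules, prior_fraction)
        # Expected variant read fraction: true signal plus background error.
        true_fraction = m / input_molecules
        p_read = true_fraction + (1.0 - true_fraction) * error_rate
        log_like = log_binom_pmf(variant_reads, total_reads, p_read)
        log_weights.append(log_prior + log_like)
    peak = max(log_weights)
    weights = [exp(w - peak) for w in log_weights]  # avoid underflow
    total = sum(weights)
    return [w / total for w in weights]

posterior = variant_molecule_posterior(
    variant_reads=12, total_reads=10_000, input_molecules=1_000, error_rate=1e-4)
# Probability of at least one variant molecule: sum over all non-zero counts.
p_at_least_one = sum(posterior[1:])
```

In this toy case about one error read is expected from background alone, whereas one variant molecule among 1,000 would contribute roughly ten further reads, so 12 variant reads put most of the posterior mass on a single variant molecule.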

The presence or absence of cancer DNA in the sample can then be determined using the estimates (or probabilities) of variant molecules in each target region from each aliquot of the original sample. In some cases, the data can also be used to estimate the overall cancer DNA fraction in the sample. This estimate may be the most likely amount of cancer DNA or a range of likely amounts of cancer DNA in the test sample, and may be estimated based on the fraction of variant reads or estimates of variant molecules in the original sample, such as by mean or median variant allele fraction, maximum likelihood or Bayesian posterior.

In one embodiment, the presence or absence of cancer DNA in the sample can be determined via a likelihood ratio, by comparing the likelihood of observing the results given that cancer DNA is present with the likelihood that the same results could have been generated by a sample that does not contain any cancer DNA. If there is a higher likelihood that the same data could be produced by a sample that does not contain any cancer DNA, then the sample may not contain any cancer DNA. The first likelihood (the likelihood with cancer DNA present) may be calculated using (i) the estimated numbers of molecules with the sequence variation or probabilities, as calculated above for each aliquot of each target region; and, optionally, (ii) the cancer DNA fraction estimated in the sample. The second likelihood (the likelihood for the null hypothesis) may be calculated using (i) the probabilistic estimates or probabilities, as calculated above; and (ii) the estimated rate of high signal background events, where a “high signal background event” is an event which is not accounted for by the simple model of the background error rate per read. After the likelihood of there being cancer DNA in the sample and the likelihood of the null hypothesis have been calculated, they can be compared to obtain a likelihood ratio and, in turn, the likelihood ratio can be compared to a threshold. In some embodiments a likelihood ratio is determined for each aliquot of each target region. The individual likelihood ratios are then combined into a cumulative likelihood ratio score across all the regions and aliquots of the sample. A likelihood ratio that is at or above the threshold indicates that the DNA sample contains cancer DNA. Alternatively, the likelihood ratio can be interpreted as a probability that the sample contains cancer DNA, either directly or by comparison to a reference distribution calculated on control samples.
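
One way to picture the cumulative likelihood ratio is the sketch below. It assumes that each aliquot of each target region has already been reduced to two numbers: the likelihood of its read counts if no variant-like molecule was present and the likelihood if at least one was present. Under the null hypothesis a variant-like signal arises only from high signal background events; under the alternative it can also arise from cancer DNA. This two-state simplification, and every name and value in it, is an assumption for illustration only.

```python
from math import log

def cumulative_log_likelihood_ratio(aliquots, cancer_fraction,
                                    input_molecules, high_signal_rate):
    """Sum per-aliquot log likelihood ratios over all regions and aliquots.

    aliquots: list of (likelihood_no_variant, likelihood_variant) pairs, both > 0.
    """
    # Chance that an aliquot holds at least one cancer-derived variant molecule.
    p_cancer = 1.0 - (1.0 - cancer_fraction) ** input_molecules
    # Chance of a variant-like signal under each hypothesis.
    p_signal_h1 = 1.0 - (1.0 - p_cancer) * (1.0 - high_signal_rate)
    p_signal_h0 = high_signal_rate
    total = 0.0
    for like_no_variant, like_variant in aliquots:
        h1 = p_signal_h1 * like_variant + (1.0 - p_signal_h1) * like_no_variant
        h0 = p_signal_h0 * like_variant + (1.0 - p_signal_h0) * like_no_variant
        total += log(h1) - log(h0)
    return total

# Eight aliquots with no sign of variant reads, then the same with two strong signals.
quiet = [(0.4, 1e-6)] * 8
with_signal = [(0.4, 1e-6)] * 6 + [(1e-9, 0.1)] * 2
llr_quiet = cumulative_log_likelihood_ratio(quiet, 1e-4, 1000, 0.001)
llr_signal = cumulative_log_likelihood_ratio(with_signal, 1e-4, 1000, 0.001)
```

A positive cumulative score favors the presence of cancer DNA and a negative score favors the null hypothesis; the score would then be compared to a threshold as described above.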

Specifically, as noted above there are at least three types of errors in the model in Fig. 13A and B: errors that occur during PCR, base mis-calls during sequencing and pre-PCR events such as DNA damage. The pre-PCR errors are “high signal” in the sense that they are rare (they are not associated with every sample) but when they do occur, they result in a much higher fraction of variant reads than the other errors consistent with variant molecules being present in the original sample, i.e. they mimic the appearance of a true positive ctDNA variant. In some instances, errors that occur in the first one, two or three cycles of PCR may also produce high signal events. The rate of such errors can be determined using a variety of different methods. In some cases, an error distribution or distribution of error probability may be used. In these embodiments, the errors skew the distribution as illustrated in Fig. 13A and B. Analysis of such an error distribution allows the high signal events to be identified as separate events. For example, in some cases, the events can be identified using a threshold (e.g., an event that is one, two or three standard deviations from the mean or median) as illustrated in Fig. 13A. Such a threshold can change from variation-to-variation but, in general, they can be identified as having a frequency that is above a defined threshold as illustrated in Fig. 13 A. These high signal events can be separately modeled and used to determine the rate of high signal background events for each sequence variation.

In another embodiment, a determination of whether the test sample contains cancer DNA is calculated by using a mixture model (Fig. 13B) incorporating: (i) the estimates or probabilities of variant molecules in each aliquot of each target region, (ii) the estimated rate of high signal background events and, optionally, (iii) a prior estimate of the cancer DNA fraction in the test sample. The output of the mixture model can be compared to a threshold, wherein an output that is at or above the threshold indicates that the test sample contains cancer DNA. Such a threshold for either method may be determined by analyzing a plurality of samples not known to contain cancer DNA, determining a distribution of results and then setting a threshold such that a false positive would be expected less than 0.01% of the time, less than 0.1% of the time, less than 0.5% of the time, less than 1% of the time or less than 5% of the time.
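
The threshold-setting step can be sketched as an empirical calculation on control scores. The function below is a minimal illustration, assuming that higher scores indicate cancer DNA and that a sample is called positive when its score is at or above the threshold; the function name is hypothetical, and a real implementation might fit a distribution to the controls rather than use raw counts.

```python
def threshold_for_false_positive_rate(control_scores, max_fpr):
    """Smallest observed score at which fewer than max_fpr of controls are called.

    A control is "called" when its score is at or above the candidate threshold.
    """
    n = len(control_scores)
    for candidate in sorted(set(control_scores)):
        n_called = sum(1 for score in control_scores if score >= candidate)
        if n_called / n < max_fpr:
            return candidate
    # Even the highest control score is called too often for this rate.
    raise ValueError("too few control samples to support this false positive rate")

# Toy control scores 0, 1, ..., 999 from samples not known to contain cancer DNA.
controls = [float(i) for i in range(1000)]
threshold_1pct = threshold_for_false_positive_rate(controls, 0.01)
threshold_5pct = threshold_for_false_positive_rate(controls, 0.05)
```

Note that a very low target rate such as 0.01% cannot be supported empirically by a small control set, which is why the sketch raises an error in that case.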

In some embodiments, the probabilistic estimates or probabilities for sequence variations that are identified in a statistically improbable number of the aliquots based on the estimated cancer DNA fraction are excluded, prior to calculating likelihood of there being cancer DNA in the sample, or prior to assessing the sample with a mixture model for cancer DNA or prior to determining if sufficient target regions, variants and or aliquots are above a threshold to indicate cancer DNA is present. For example, if the estimates or probabilities for most aliquots of most variations are relatively low indicating that they are unlikely to contain variant DNA, except for occasional aliquots that are relatively high, it would be statistically improbable that one sequence variation would be present in all or almost all aliquots with a relatively high probability. As a further example, in an embodiment with 4 aliquots, if the evidence for most variants supports either 0 or 1 aliquots containing variant DNA, any variants where the evidence for all 4 aliquots supports the presence of variant DNA is likely to be an outlier. These outliers (which may be caused by “noisy bases”, or non-cancer specific changes that are derived from CHIP, for example) can be identified and eliminated from the calculation. In another example, using the number of test DNA molecules added to each aliquot and an estimate of the tumor fraction calculated using all variants (or a subset), the chance of each individual variant in each aliquot containing at least one cancer molecule can be calculated. The number of aliquots above a threshold can then be compared with the total number of aliquots to determine if the variant is giving an improbable result. In some embodiments the copy number of each variant is corrected for during this calculation. This concept is illustrated in Fig. 14.

In the present method, variant-containing regions that result in more aliquots than would be expected with a high signal (given the cfDNA concentration and the estimated ctDNA fraction) can be identified and eliminated. This may be calculated using the probability of sampling at least one ctDNA molecule per partition given a known cfDNA concentration and an estimated ctDNA fraction. Variants for which this is statistically improbable (e.g., p<0.05) may be excluded. For example, if each of 4 partitions had a 0.2 chance of containing a variant (based on the estimated ctDNA fraction and number of input molecules), the likelihood of seeing 2 partitions with a high score can be calculated.
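
The outlier check above amounts to a binomial tail calculation, sketched below. The per-partition probability is taken as the chance of sampling at least one ctDNA molecule given the number of input molecules and the estimated ctDNA fraction; treating partitions as independent with a shared probability is an assumption of this sketch, and the function names are hypothetical.

```python
from math import comb

def prob_partition_has_variant(molecules_per_partition, ctdna_fraction):
    """Chance that a partition contains at least one ctDNA molecule."""
    return 1.0 - (1.0 - ctdna_fraction) ** molecules_per_partition

def prob_at_least_k_positive(n_partitions, k, p_positive):
    """Binomial tail: chance that k or more of n partitions are positive."""
    return sum(
        comb(n_partitions, i) * p_positive ** i * (1.0 - p_positive) ** (n_partitions - i)
        for i in range(k, n_partitions + 1)
    )

# The example from the text: 4 partitions, each with a 0.2 chance of a variant.
p_two_or_more = prob_at_least_k_positive(4, 2, 0.2)    # about 0.18
p_three_or_more = prob_at_least_k_positive(4, 3, 0.2)  # about 0.027
p_all_four = prob_at_least_k_positive(4, 4, 0.2)       # about 0.0016
```

Under these numbers, two high-scoring partitions would not be unusual, whereas three or four would fall below the p<0.05 cut-off mentioned above and the variant could be excluded.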

For clarity, some embodiments of this method do not involve identifying (or “calling”) variations in the different aliquots. Specifically, some embodiments of the method do not involve determining whether the frequency of a potential sequence variation is above or below a threshold in each aliquot. Rather, these embodiments rely on analysis of the data as a whole.

While the method can be practiced on any type of sample that has cancer DNA in it, the method finds most use for the analysis of limited samples in which the fraction of cancer DNA is less than 0.01% (i.e., is less than 100 ppm), since this is when samples that contain cancer DNA become indistinguishable from samples that do not contain cancer DNA in other assays. For example, in some embodiments, the method may be used to detect cancer DNA in samples that contain 0.0001% (1 ppm) to 0.001% (10 ppm) cancer DNA, where the sample comprises less than 25,000 genome equivalents of DNA (e.g., 100 to 10,000, 500 to 5000 or 2000 to 20,000 genome equivalents of DNA), although these numbers may vary. Moreover, in order to obtain statistically significant results, each aliquot of each target region can be sequenced to a read depth of at least 5,000, at least 10,000, at least 20,000 or at least 100,000, as desired.

Estimating the amount of cancer DNA

In some embodiments, the amount of cancer DNA may be measured as a total number of variant containing molecules. In another embodiment, the amount of cancer DNA may be measured as an estimated variant allele fraction (VAF). In some embodiments, a mean or median VAF may be generated (i.e. a mean or median of all the variants analyzed); in other embodiments a corrected mean or median VAF may be determined (i.e. the mean or median level across the variants after subtracting a pre-determined offset or baseline error rate for each variant). In some embodiments, the VAF and the total number of cfDNA molecules added to the sequencing reaction may be multiplied together as a method for estimating the total number of variant tumor molecules that were added to the sequencing reaction.
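The corrected median VAF and the multiplication described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and flooring each corrected VAF at zero is an assumption that is not stated in the description.

```python
from statistics import median


def corrected_median_vaf(vafs, baseline_error_rates):
    """Median VAF after subtracting each variant's baseline error rate.

    Negative differences are floored at 0 (an assumption made for this sketch).
    """
    corrected = [max(v - e, 0.0) for v, e in zip(vafs, baseline_error_rates)]
    return median(corrected)


def variant_molecules(vaf, input_molecules):
    """Estimated number of variant molecules added to the sequencing reaction."""
    return vaf * input_molecules


# Hypothetical inputs: three variants and 10,000 cfDNA molecules.
vaf = corrected_median_vaf([0.0004, 0.0002, 0.0003], [0.0001, 0.0001, 0.0001])
estimate = variant_molecules(vaf, 10_000)
```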

In other embodiments, information obtained through sequencing the tumor tissue may be used to estimate the number of copies of each variant within a single cancer cell and this information may be used in combination with the variants detected in the sample and their frequencies to determine the number of tumor cells it represents, i.e., the “cancer cells represented”.

In some embodiments, the measure of variant containing molecules or the estimated number of cancer cells may be combined with the number of millilitres of fluid, such as blood plasma, from which the DNA was extracted in order to estimate the number of molecules per ml of sample. In examples of such an analysis one may calculate a range of outputs such as mean variant molecules per ml of plasma, median variant molecules per ml of plasma, median tumor cells per ml of plasma or median variant molecules per ml of CSF.

In some embodiments, this calculation may contain steps to correct for DNA lost between blood collection and sequencing analysis. This could include correcting for cfDNA extraction efficiency or correcting for library preparation efficiency. As an example, when working out the median variant molecules per ml of blood plasma, one would first determine the number of mutant molecules that could be detected in the sample and the volume of plasma from which the cfDNA sample was extracted. This number would then be corrected for the known proportion of molecules typically recovered by the extraction chemistry used and/or the rate at which such molecules are converted and then sequenced during sequencing library preparation and analysis. In some embodiments, at least one synthetic spike DNA sequence with a known sequence is added to the sample prior to extraction, and this sequence is analysed during sequencing to determine the efficiency of extraction and library preparation, which is then applied to correct the previously described mutant molecule estimates. In certain embodiments, the spike sequence could contain a molecular barcode to enable counting the number of molecules successfully read.
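The efficiency correction described above can be sketched as follows. This is an illustration only, assuming that a single combined recovery efficiency (extraction plus library preparation) is estimated from the spike sequence and applied as a simple divisor. The function names and the example numbers are hypothetical.

```python
def recovery_efficiency(spike_molecules_read, spike_molecules_added):
    """Fraction of spike molecules that survived extraction and library preparation."""
    return spike_molecules_read / spike_molecules_added


def variant_molecules_per_ml(variant_molecules_detected, plasma_volume_ml, efficiency):
    """Variant molecules per ml of plasma, corrected for molecules lost in processing."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the interval (0, 1]")
    return variant_molecules_detected / efficiency / plasma_volume_ml


# Hypothetical inputs: 300 of 1,000 spike molecules were read,
# and 6 variant molecules were detected in cfDNA from 4 ml of plasma.
efficiency = recovery_efficiency(300, 1000)
per_ml = variant_molecules_per_ml(6, 4.0, efficiency)
```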

Estimating limit of detection

As would be apparent to someone skilled in the art, a number of factors impact the sensitivity of a method such as this. Depending on the approach these factors could include the amount of DNA from the test sample added to the library preparation reaction and sequenced, the number of aliquots, the number of target regions and variants, the background error rate and the rate of high signal background events for each variant.

In some embodiments, a limit of detection is determined each time a sample is analysed. In some embodiments the amount of DNA from the sample added to the sequencing reaction is multiplied by the number of target regions in order to determine the number of DNA molecules assessed for variants. During analytical validation studies a range of samples with different numbers of molecules assessed for variants are tested in order to determine their limit of detection empirically. In some settings the variants are additionally separated into classes and the impact of each class is determined. When a sample is tested, its limit of detection is then estimated based on at least one of the number of variants, the amount of DNA added to each aliquot, the number of molecules assessed for variants or the class of variants assessed.
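The count of molecules assessed for variants can be sketched as follows. The second function is not the empirical limit of detection described above; it is only a theoretical sampling floor, added here as an assumption, that gives the lowest variant fraction at which at least one variant molecule would be expected among the molecules assessed with a given confidence. The function names and example values are hypothetical.

```python
def molecules_assessed(genome_copies_input, n_target_regions):
    """DNA molecules assessed for variants: input copies times target regions."""
    return genome_copies_input * n_target_regions


def sampling_floor(n_molecules_assessed, confidence=0.95):
    """Lowest variant fraction giving >= 1 variant molecule with the stated confidence.

    Solves 1 - (1 - f)**n = confidence for f. This ignores sequencing error and
    background events, so an empirically validated limit will generally be higher.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_molecules_assessed)


# Hypothetical inputs: 5,000 genome copies and 48 target regions.
n = molecules_assessed(5000, 48)
floor = sampling_floor(n)  # roughly 1.2e-5, i.e. about 12 ppm
```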

Utilizing cancer signatures

It is known in the art that a range of mutational processes drive somatic mutation formation in cancer genomes and that each of these generates a characteristic mutational signature (Alexandrov, Nature 2020 578: 94-101). Whilst some of these processes, and therefore their signatures, are common to many cancers, others are specific to certain cancers. By sequencing a sufficiently large region of the genome, such as the exome or whole genome, it is possible to detect these signatures in tumor DNA. In one embodiment of the present method, when tumor DNA from the patient is sequenced it may be analyzed in order to determine the signature(s) present. When the tumor is of unknown primary, the signature(s) may be used to infer the origin of the cancer. As an example, an SBS7a signature (Alexandrov, supra) present within the tumor would be consistent with the primary tumor being a melanoma.

In another embodiment, the signature may be used to determine the likelihood that a variant identified in the tumor is a somatic change specific to the cancer rather than an artefact, a germline variant or a CHIP-derived change. In such an embodiment a plurality of potential tumor specific somatic variants are identified by sequencing tumor DNA. The type of tumor (e.g. melanoma) is identified, as are the common signatures present in that tumor type (e.g. SBS7a, which is mainly C>T at TCN). Variants that are consistent with the common signatures of the cancer type are included, prioritized or given a score indicating they are more likely to be real somatic changes when selecting, ranking or scoring variants for targeted sequencing, whilst variants that are not consistent with the main signatures are either filtered out or given lower priority or score.
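The prioritization described above can be sketched for the C>T at TCN example as follows. This is an illustration only: it tests a single substitution pattern rather than a full signature profile, it treats variants reported on the opposite strand as equivalent, and the function names and variant records are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def to_pyrimidine_context(context, alt):
    """Express a trinucleotide context and alt base on the strand with C or T in the middle."""
    if context[1] in "CT":
        return context, alt
    reverse_complement = "".join(COMPLEMENT[base] for base in reversed(context))
    return reverse_complement, COMPLEMENT[alt]


def matches_c_to_t_at_tcn(context, alt):
    """True if the variant is a C>T change in a TCN trinucleotide context."""
    ctx, alt_base = to_pyrimidine_context(context.upper(), alt.upper())
    return ctx[0] == "T" and ctx[1] == "C" and alt_base == "T"


def prioritize(variants):
    """Order (name, context, alt) records so signature-consistent variants come first."""
    return sorted(variants, key=lambda v: not matches_c_to_t_at_tcn(v[1], v[2]))


# Hypothetical candidates: v2 and v3 (the same change on the opposite strand) rank first.
ranked = prioritize([("v1", "ACA", "T"), ("v2", "TCA", "T"), ("v3", "TGA", "A")])
```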

Method for assessing cfDNA quality

In some embodiments, the test sample is cell free DNA and, prior to sequencing cell free DNA from blood plasma, the cell free DNA is assessed to determine the quantity or proportion that is high molecular weight. Cell free DNA is typically short (~160bp). When blood samples are poorly handled or shipped, white blood cells may lyse, and when they do they can release high molecular weight DNA which can mask the cfDNA. Therefore a high proportion of long DNA molecules can signify a poor sample with a risk of a false negative. In some embodiments, a ratio between the number of short DNA molecules and the number of long DNA molecules is determined, wherein short may be less than 50bp, 60bp, 70bp, 80bp, 90bp, 100bp, 110bp, 120bp, 130bp, 140bp, 150bp or 160bp and long is more than 320bp, 480bp, 1000bp or 2000bp. In some embodiments, if more than 1:10, 1:5, 1:4, 1:3 or 1:2 of the DNA is long, the sample is flagged as potentially containing high levels of long DNA molecules that may be a sign of white blood cell DNA released after blood collection.

In some embodiments, the ratio is measured using electrophoresis such as agarose gel analysis or commercial systems such as the Fragment Analyzer or TapeStation. In other embodiments, the ratio is measured using PCR based approaches. Examples include using digital PCR or qPCR with primers and probes targeting both long and short regions of the genome. Either one long and one short region could be targeted or the assay could be multiplexed with a range of different sizes or multiple markers of one size and multiple markers of another size. Advantages of such a method include the ability to compensate when some regions of the genome are impacted by copy number changes. Alternatively the assays could target repetitive sequences wherein a short region of a repetitive sequence is targeted and a long region of a repetitive sequence is targeted. An advantage of such an embodiment is that less of the test DNA is required in order to measure the ratio. In another embodiment, two or more pairs of primers which target short regions of the genome are used wherein the two regions are on the same chromosome but separated by greater than 320bp, greater than 480bp, greater than 1000bp or greater than 2000bp. Replicate PCR reactions are performed on test DNA diluted such that there is typically less than a single copy of the genome per reaction in order to determine the number of times both regions amplify in the same reaction, the number of times just one region amplifies in a reaction and the number of times neither region amplifies. The frequency of these three events can be used to estimate the number of long and short molecules. In another embodiment, next generation sequencing may be used. In one embodiment, a standard library is generated from the cfDNA by ligating on sequencer adaptors and optionally amplifying the DNA. In an alternative example, one or more primers that target one or more repetitive regions are used to amplify the cfDNA before sequencing.
Sequencing reads are then aligned to the genome and the size of the molecules determined by identifying the start and end of each sequencing read. The ratio between short and long molecules can then be obtained by grouping the sequencing reads into groups based on the length of the sequencing read then determining a ratio. In such settings it may be important to use a correction factor as PCR and next generation sequencing methods both typically have a bias for shorter DNA molecules. Alternative methods that ligate adaptors on at least one side of the cfDNA molecules and PCR using one or more targeted primers and also primers targeting the adaptors followed by NGS can be used to obtain a measure of the cfDNA lengths. In some embodiments the test sample is cell free DNA and prior to generating a sequencing library, size selection is used to enrich for shorter cfDNA molecules and increase the fraction of ctDNA wherein this enrichment may be performed using beads or size selection on a gel and wherein short molecules are those that are less than 160bp or 150bp or 140bp in length.
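The sequencing-based size check described in this section can be sketched as follows. This is an illustration only, assuming 1-based inclusive alignment coordinates, reading "1:5 of the DNA is long" as one fifth of the molecules, and modelling the correction for short-fragment bias as a single multiplicative weight on the long-molecule count. The function names, defaults and example coordinates are hypothetical.

```python
def fragment_length(start, end):
    """Molecule length from 1-based inclusive alignment start and end positions."""
    return end - start + 1


def long_fraction(lengths_bp, long_min_bp=320, long_weight=1.0):
    """Fraction of molecules that are long, with an optional bias-correction weight."""
    n_long = sum(1 for n in lengths_bp if n > long_min_bp)
    n_other = len(lengths_bp) - n_long
    weighted_long = n_long * long_weight
    total = n_other + weighted_long
    return weighted_long / total if total else 0.0


def flag_sample(lengths_bp, max_long_fraction=0.2, long_min_bp=320, long_weight=1.0):
    """True if more than max_long_fraction of the DNA is long."""
    return long_fraction(lengths_bp, long_min_bp, long_weight) > max_long_fraction


# Hypothetical alignments: eight 160bp molecules and two 500bp molecules.
alignments = [(1000, 1159)] * 8 + [(5000, 5499)] * 2
lengths = [fragment_length(start, end) for start, end in alignments]
uncorrected = flag_sample(lengths)                # 2 of 10 long: not above 1:5
corrected = flag_sample(lengths, long_weight=2.0)  # 4 of 12 after weighting: flagged
```

The weight of 2.0 is arbitrary; in practice a correction factor would need to be measured for the specific library preparation and sequencing workflow.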

Utility

If the DNA sample from the patient contains cancer DNA then the patient may have cancer-associated cells resulting from minimal residual disease, early relapse or metastasis, for example. ctDNA is an especially powerful biomarker in this setting because it has a half-life of approximately 1 hour, so if a tumor has been fully removed any remaining ctDNA should have been cleared rapidly.

In some cases, when testing for minimal residual disease using cell-free DNA taken from a patient after treatment, it may be valuable to first confirm if the tumor releases ctDNA at a sufficiently high level for accurate minimal residual disease detection. In one embodiment a cell free DNA sample is taken prior to treatment with curative intent and tested, and any patient without detectable ctDNA prior to treatment, or for whom the probability of the sample containing tumor DNA prior to treatment is below a certain threshold, may be excluded from further analysis as they release too little ctDNA for accurate minimal residual disease detection. In an alternative embodiment patients may be excluded from further analysis if the pre-treatment ctDNA is estimated to be below a threshold such as 0.01% VAF, 0.005% VAF or 0.001% VAF. In another embodiment, the level of ctDNA prior to treatment is correlated with tumor volume prior to treatment as assessed by imaging in order to give an estimate of the amount of ctDNA released by a set volume of tumor and thus a standardised measure of tumor ctDNA release. Patients may be excluded for whom this standardised measure is below a set threshold, for example wherein a tumor of 1 cm³ would be predicted to release a level of ctDNA below the predetermined limit of detection of the assay. Alternatively, changes in ctDNA level following treatment may be combined with this estimate to predict the tumor volume change and to determine if it is consistent with complete removal of the tumor or if it is equally consistent with residual disease remaining.
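The standardised measure of ctDNA release described above can be sketched as follows. This is an illustration only, assuming that ctDNA level (expressed as VAF) scales linearly with tumor volume, which is a simplification. The function names and example values are hypothetical.

```python
def ctdna_release_per_cm3(pre_treatment_vaf, tumor_volume_cm3):
    """Standardised ctDNA release: pre-treatment VAF per cm3 of imaged tumor volume."""
    return pre_treatment_vaf / tumor_volume_cm3


def exclude_patient(pre_treatment_vaf, tumor_volume_cm3, assay_lod_vaf,
                    reference_volume_cm3=1.0):
    """True if a tumor of the reference volume is predicted to fall below the assay LOD."""
    release = ctdna_release_per_cm3(pre_treatment_vaf, tumor_volume_cm3)
    return release * reference_volume_cm3 < assay_lod_vaf


def implied_tumor_volume_cm3(post_treatment_vaf, release_per_cm3):
    """Tumor volume that would explain a post-treatment VAF under linear scaling."""
    return post_treatment_vaf / release_per_cm3


# Hypothetical patient: 0.05% VAF from a 20 cm3 tumor, assay LOD of 0.001% VAF.
release = ctdna_release_per_cm3(0.0005, 20.0)
excluded = exclude_patient(0.0005, 20.0, assay_lod_vaf=0.00001)
residual_volume = implied_tumor_volume_cm3(0.00005, release)
```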

The patient that provides the test sample may have cancer, may have been treated for cancer in the past (e.g., at least 2 weeks before, at least 3 months before, at least 6 months before, at least a year before), may be in complete remission and/or may have a clonal growth (e.g., a tumorous growth such as a nodule, polyp, cyst or lump) that has the potential to transform.

Likewise, the source of the cancer DNA in the sample may vary. For example, the cancer DNA may be the result of MRD, a clonal growth becoming malignant, tumor metastasis, incomplete tumor removal, or an ineffective treatment.

In some embodiments, the method may comprise providing a report indicating whether there is cancer DNA in the sample. In some embodiments, the report may contain the likelihood ratio, mixture model, score, or threshold number of variants and aliquot output described above or another number representing the same as well as a threshold to which the likelihood ratio or mixture model result can be compared to determine if the sample contains cancer DNA. In some embodiments, a report may additionally list approved (e.g., FDA approved) therapies for treatment of residual disease, e.g., chemotherapies or immunotherapies, etc. This information can help in diagnosing a disease (e.g., whether the patient has MRD) and/or the treatment decisions made by a physician.

In some embodiments, the report may be in an electronic form, and the method comprises forwarding the report to a remote location, e.g., to a doctor or other medical professional to help identify a suitable course of action, e.g., to diagnose a subject or to identify a suitable therapy for the subject. The report may be used along with other metrics of the patient to determine whether the subject is susceptible to a therapy, for example.

In any embodiment, a report can be forwarded to a “remote location”, where “remote location,” means a location other than the location at which the sequences are analyzed. For example, a remote location could be another location (e.g., office, lab, etc.) in the same city, another location in a different city, another location in a different state, another location in a different country, etc. As such, when one item is indicated as being "remote" from another, what is meant is that the two items can be in the same room but separated, or at least in different rooms or different buildings, and can be at least one mile, ten miles, or at least one hundred miles apart. "Communicating" information references transmitting the data representing that information as electrical signals over a suitable communication channel (e.g., a private or public network). "Forwarding" an item refers to any means of getting that item from one location to the next, whether by physically transporting that item or otherwise (where that is possible) and includes, at least in the case of data, physically transporting a medium carrying the data or communicating the data. Examples of communicating media include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the internet, including email transmissions and information recorded on websites and the like. In certain embodiments, the report may be analyzed by an MD or other qualified medical professional, and a report based on the results of the analysis of the sequences may be forwarded to the patient from which the sample was obtained.

In some embodiments, a sample may be collected from a patient at a first location, e.g., in a clinical setting such as in a hospital or at a doctor’s office, and the sample may be forwarded to a second location, e.g., a laboratory where it is processed and the above-described method is performed to generate a report. A “report” as described herein, is an electronic or tangible document which includes report elements that provide test results that may indicate the presence and/or quantity of cancer DNA in the sample. Once generated, the report may be forwarded to another location (which may be the same location as the first location), where it may be interpreted by a health professional (e.g., a clinician, a laboratory technician, or a physician such as an oncologist, surgeon, pathologist or virologist), as part of a clinical decision.

The patient analyzed in this method may have any type of cancer or may have previously undergone treatment for any type of cancer. For example, the patient may have or may have had melanoma, carcinoma, lymphoma, sarcoma or glioma. For example, the cancer may be melanoma, lung cancer (e.g., non-small cell lung cancer), breast cancer, head and neck cancer, bladder cancer, Merkel cell cancer, cervical cancer, hepatocellular cancer, gastric cancer, cutaneous squamous cell cancer, classic Hodgkin lymphoma, B-cell lymphoma, colorectal carcinoma, pancreatic carcinoma, gastric or breast carcinoma, among many others, including other solid tumors and blood cancers.

In some embodiments, the method may be used to guide treatment decisions. In some embodiments, the method may be used to determine if a patient should be treated again, e.g., with the same therapy or a second therapy. For example, if the patient has previously been treated with a first cancer therapy and the patient has been identified as having MRD using the present method, then the patient may be treated with a second cancer therapy that is the same as or different to the first cancer therapy. For example, if the patient has previously been treated with surgery or an immune checkpoint inhibitor and the patient is identified as having MRD, then the patient may be treated with further surgery, the same or a different immune checkpoint inhibitor or another type of therapy, where immune checkpoint therapy includes administration of CTLA-4, PD1, PD-L1, TIM-3, VISTA, LAG-3, IDO or KIR checkpoint inhibitors, and the other types of therapy include, for example, (a) anthracycline therapy (e.g., by administering daunomycin, doxorubicin, or mitoxantrone), (b) alkylating agent therapy (e.g., by administering mechlorethamine, cyclophosphamide, ifosfamide, melphalan, cisplatin, carboplatin, nitrosourea, dacarbazine, procarbazine or busulfan), (c) topoisomerase II inhibitor therapy (e.g., by administering etoposide or teniposide), (d) bleomycin therapy, (e) anti-metabolite therapy (e.g., by administering methotrexate, 5-fluorouracil, cytarabine, 6-mercaptopurine or 6-thioguanine), (f) vinca alkaloid therapy (e.g., by administering vincristine or vinblastine), (g) steroid therapy (e.g., by administering prednisone or dexamethasone) and (h) radiation treatment, etc.
Alternative therapies include targeted therapies and non-targeted chemotherapies, where targeted therapy includes treatment with erlotinib (Tarceva), afatinib (Gilotrif), gefitinib (Iressa) or osimertinib (Tagrisso) which may be administered to patients having an activating mutation in EGFR, crizotinib (Xalkori), ceritinib (Zykadia), alectinib (Alecensa) or brigatinib (Alunbrig) which may be administered to patients having an ALK fusion, crizotinib (Xalkori), entrectinib (RXDX-101), lorlatinib (PF-06463922), repotrectinib (TPX-0005), DS-6051b, ceritinib, ensartinib or cabozantinib which may be administered to patients having an ROS1 fusion, or dabrafenib (Tafinlar) or trametinib (Mekinist) which may be administered to patients having an activating mutation in BRAF. Many other actionable mutations are known. If the patient is going to be switched to a non-targeted chemotherapy, the therapy may be, for example, a platinum-based doublet chemotherapy (in which the platinum-based doublet chemotherapy may comprise a platinum-based agent selected from cisplatin (CDDP), carboplatin (CBDCA) and nedaplatin (CDGP), and one third-generation agent selected from docetaxel (DTX), paclitaxel (PTX), vinorelbine (VNR), gemcitabine (GEM), irinotecan (CPT-11), pemetrexed (PEM), and tegafur gimeracil oteracil (S-1)). In some embodiments, the method may be used to monitor a treatment. For example, the method may comprise analyzing a sample obtained at a first timepoint using the method, and analyzing a sample obtained at a second time point by the method, and comparing the results, i.e., determining whether there is cancer DNA in the sample or determining if there is a change in the amount of cancer DNA or a range of likely amounts of cancer DNA between the first and second time points.
In some embodiments, such a change may be determined using point estimates or confidence intervals and a significant decrease may indicate the therapy is effective whilst no significant decrease or an increase may indicate the therapy is not effective. The first and second timepoints may be before and after a treatment, or two timepoints after treatment. For example, by comparing results obtained from one timepoint to another, the method may be used to determine if the previously identified variations are no longer present in the subject during the course of a treatment. The time period between the first and second timepoints may be at least one month, at least 6 months or at least one year and in some cases a patient may be tested periodically, e.g., every three months, every six months or every year for several years, e.g., 5 years or more.
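A comparison of two time points using confidence intervals can be sketched as follows. This is an illustration only: treating non-overlapping intervals as a significant change is an assumption made here (it is a conservative criterion), and the function name and example intervals are hypothetical.

```python
def classify_change(first_interval, second_interval):
    """Compare (low, high) confidence intervals for the amount of cancer DNA.

    Non-overlapping intervals are treated as a significant change (an assumption).
    """
    first_low, first_high = first_interval
    second_low, second_high = second_interval
    if second_high < first_low:
        return "significant decrease"
    if second_low > first_high:
        return "significant increase"
    return "no significant change"


# Hypothetical VAF intervals before and after a treatment.
result = classify_change((0.0004, 0.0009), (0.00001, 0.0001))
```

In this sketch a "significant decrease" would correspond to the therapy being indicated as effective, and either of the other two outcomes to the therapy being indicated as not effective.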

This method may also be used to determine if a subject is disease-free, or whether a disease is recurring. As noted above, the method may be used for the analysis of minimal residual disease and recurrence detection. In these embodiments, the primer pairs used in the method may be designed to amplify sequences that contain variations that have been previously identified in a patient’s cancer through sequencing cancer material, sequencing cfDNA from an earlier time point or sequencing another suitable sample.

In some embodiments, when testing for minimal residual disease or recurrence detection, the test sample of DNA from a patient would be cell-free DNA. This cell-free DNA may be taken from a patient at any point after treatment. In some embodiments this cell free DNA may be taken at a point by which any remaining ctDNA from a cancer would have been cleared if the cancer were successfully treated. This time point may depend on factors such as the initial amount of ctDNA and the treatment modalities. For methods where all tumor is removed at once, such as surgery, time points may be 1 week, 2 weeks, 3 weeks or 4 weeks after treatment with curative intent. Where a treatment may more gradually remove the cancer, these time points may be longer, such as 1 month or 2 months. As would be apparent, other DNA extracted from alternative sources could also be assessed for the presence or quantity of cancer DNA. Examples include but are not limited to: the cellular fraction of cerebrospinal fluid, the cellular and cell-free fraction of cerebrospinal fluid, stool samples, cells present within urine, biopsy or fine needle aspirate materials. In some embodiments, the method may also be used to assess for the presence of remaining cancer cells within biopsy or fine needle aspirate materials such as from lymph nodes. As would be apparent, such methods would be particularly powerful when the number of tumor cells in a biopsy sample is at such a low level that it is not practical for a pathologist performing histopathological analysis to review enough cells in the biopsy to identify the remaining cancer.

In some embodiments, the method may also be used to track a plurality of variants in parallel, for example tracking predicted neoantigen-coding mutations following immunotherapy or a personalized vaccine.

In some embodiments, the method may be employed in a clinical trial. For example, the method may potentially be used to identify a specific group of patients for clinical enrollment or to evaluate the efficacy of a new drug (e.g., a neoadjuvant therapy or adjuvant therapy that may be non-specific or targeted to a patient’s cancer, or any combination therapy). In some embodiments, the amount of ctDNA in a patient’s bloodstream could be estimated at multiple time points, thereby allowing the dose of a drug administered to a patient to be altered mid-trial, for example. In some embodiments, the amount of ctDNA in a patient’s bloodstream could be estimated at multiple time points during a clinical trial and used to determine if a particular therapy, level of treatment, duration of treatment or combination of treatment type and patient is working. As would be readily appreciated, many steps of the method, e.g., the sequence processing steps and the generation of a report indicating a presence of cancer DNA in a test sample of DNA, may be implemented on a computer. As such, in some embodiments, the method may comprise executing an algorithm that calculates the likelihood of whether a patient has cancer DNA present in a test sample of DNA taken from a patient based on the analysis of the sequence reads, and outputting the likelihood. In some embodiments, this method may comprise inputting the sequences into a computer and executing an algorithm that can calculate the likelihood using the input measurements.

As would be apparent, the computational steps described may be computer-implemented and, as such, instructions for performing the steps may be set forth as programming that may be recorded in a suitable physical computer readable storage medium. The sequencing reads may be analyzed computationally.

EMBODIMENTS

Embodiment 1. A method for detecting tumor DNA in a test sample of DNA from a patient, comprising: (a) sequencing multiple aliquots of the test sample to produce, for each aliquot, sequence reads corresponding to two or more target regions that each have a sequence variation associated with the patient’s tumor; (b) for each aliquot, for each target region: deriving an estimate of the number of molecules that have the sequence variation, or calculating the probability that there is at least one molecule that has the sequence variation; and (c) determining if there is tumor DNA in the test sample using estimates or probabilities of step (b).

Embodiment 2. The method of embodiment 1, wherein for each aliquot, for each target region the number of molecules that have the sequence variation in the test sample or the probability that there is at least one molecule that has the sequence variation is estimated (b) using: (i) the number of sequence reads of (a) that have the sequence variation; (ii) the total number of sequence reads of (a); (iii) the number of molecules input into each aliquot of (a); and (iv) the estimated background error rate for the sequence variation.

Embodiment 3. The method of embodiment 2, wherein the estimated background error rate of (iv) is estimated from prior sequencing reactions.

Embodiment 4. The method of embodiment 3, wherein the estimated background error rate of (iv) is estimated from prior sequencing reactions, adjusted using data for control bases obtained in step (a).

Embodiment 5. The method of embodiment 2, wherein the estimated background error rate of (iv) is estimated by analysis of control sequencing reads produced in step (a).

Embodiment 6. The method of embodiment 1, wherein the estimate is not a point estimate but a probability distribution over the number of variant molecules present.

Embodiment 7. The method of any prior embodiment, wherein (c) is done by calculating a likelihood ratio between the likelihood of observing the estimates in (b) in samples: (i) if ctDNA is present and (ii) if ctDNA is not present.

Embodiment 8. The method of embodiment 7, wherein the likelihood of observing the estimates of (b) if there is tumor DNA in the test sample is calculated based on: (i) the estimates or probabilities of step (b); and optionally (ii) an estimate of the tumor fraction in the test sample.

Embodiment 9. The method of embodiment 7 or 8, wherein the likelihood of observing the estimates of (b) if there is no tumor DNA in the test sample is calculated based on: (i) the estimates or probabilities of step (b); and (ii) the estimated rate of high signal background events.

Embodiment 10. The method of any prior embodiment, wherein (c) is calculated by using a mixture model incorporating: (i) the estimates or probabilities of step (b); and (ii) the estimated rate of high signal background events; and optionally (iii) an estimate of the tumor fraction in the test sample.

Embodiment 11. The method of embodiment 7 or 10, wherein step (c) further comprises comparing the output of the mixture model or the likelihood ratio to a threshold, wherein an output that is at or above the threshold indicates that the test sample contains tumor DNA.

Embodiment 12. The method of embodiment 11, further comprising identifying the patient as having tumor-associated cells if the result is at or above the threshold.

Embodiment 13. The method of embodiment 12, further comprising administering a therapy to the patient.

Embodiment 14. The method of embodiment 13, wherein the patient has previously undergone a first therapy and the method comprises administering a second therapy that is different to the first therapy to the patient.

Embodiment 15. The method of any prior embodiment, wherein the method further comprises determining the amount of tumor DNA or a range of likely amounts of tumor DNA in the test sample based on the estimates of step (b), such as by mean or median variant allele fraction, maximum likelihood or Bayesian posterior.

Embodiment 16. The method of embodiment 15, wherein the method is performed on samples that are obtained from the patient during at least a first time point and a second time point, wherein the first time point is prior to a treatment and the second time point is after the treatment, and the method comprises determining if there is a change in the amount of tumor DNA or a range of likely amounts of tumor DNA between the first and second time points.

Embodiment 17. The method of embodiment 16, wherein a change is determined using point estimates or confidence intervals, and wherein a significant decrease indicates the therapy is effective and no significant change or an increase indicates the therapy is not effective.

Embodiment 18. The method of embodiment 17, further comprising generating a report indicating whether the therapy is or is not effective.

Embodiment 19. The method of any prior embodiment, wherein estimates for sequence variations that are identified in a statistically improbable number of the aliquots based on the estimated tumor fraction are excluded from the results of step (b) prior to step (c).

Embodiment 20. The method of any prior embodiment, wherein step (a) comprises sequencing at least three aliquots.

Embodiment 21. The method of any prior embodiment, wherein step (a) also comprises sequencing positive and/or negative controls, which may include at least one of: tumor DNA from a biopsy or surgery sample, buffy coat DNA, buccal swab DNA, whole blood DNA, adjacent normal DNA, and reference DNA.

Embodiment 22. The method of embodiment 21, wherein variants that are not detected in the tumor DNA are excluded and wherein variants detected in the buffy coat, buccal swab, adjacent normal or whole blood are excluded.

Embodiment 23. The method of any prior embodiment, wherein the two or more target regions is at least 10 target regions.

Embodiment 24. The method of any prior embodiment, wherein the sequence variations of step (a) are independently single nucleotide variants, indels, transpositions or rearrangements.

Embodiment 25. The method of any prior embodiment, wherein the sequence variations are pre-identified sequence variations.

Embodiment 26. The method of any prior embodiment, wherein the sequence variations are identified by sequencing: (i) DNA isolated from a tissue biopsy that comprises tumor cells, (ii) DNA isolated from a tumor tissue obtained at surgery that comprises tumor cells, (iii) cell-free DNA or (iv) DNA isolated from circulating tumor cells.

Embodiment 27. The method of embodiment 26, wherein sequence variations are identified by sequencing the whole genome, the whole exome or a region of the genome selected due to commonly containing cancer mutations.

Embodiment 28. The method of embodiments 26-27, wherein a plurality of candidate sequence variations is first identified and the sequence variations are selected based on one or more of: clonality; mappability; estimated error rate; distance from another selected variant; predictive ability to sequence; presence within a region of copy number gain or amplification; and proximity of any germline variants which may be used for enriching the mutant allele.

Embodiment 29. The method of any prior embodiment, wherein the patient has or had cancer or has a clonal growth that is not yet cancer but has the potential to transform.

Embodiment 30. The method of any prior embodiment, wherein the patient has undergone or is undergoing treatment for the cancer.

Embodiment 31. The method of any prior embodiment, wherein the DNA is cell-free DNA.

Embodiment 32. The method of embodiment 31, wherein the cell-free DNA is isolated from blood plasma, blood serum, cerebrospinal fluid, urine, saliva, or stool.

Embodiment 33. The method of any prior embodiment, wherein the fraction of tumor DNA in the test sample of DNA is equal to or less than 0.01%.

Embodiment 34. The method of any prior embodiment, wherein the test sample comprises less than 25,000 genome equivalents of DNA.

Embodiment 35. The method of any prior embodiment, wherein the number of aliquots and the maximum number of molecules per aliquot is adjusted based on the total number of input molecules and the estimated background error rate such that the number of input molecules in a single aliquot is low enough that if a single variant molecule were present it would produce a signal significantly different to background.

Embodiment 36. The method of any prior embodiment, wherein for each aliquot of each sequence variation, the read depth of step (a) is at least 10,000.

Embodiment 37. The method of any prior embodiment, further comprising measuring the amount of DNA in the test sample prior to step (a).

Embodiment 38. The method of any prior embodiment, wherein the sequences of the target regions are enriched from the test sample prior to step (a) by PCR or by hybridization to a nucleic acid probe.

EXAMPLES

The following examples are put forth so as to provide those of ordinary skill in the art with a complete disclosure and description of how to make and use the present invention, and are not intended to limit the scope of what the inventors regard as their invention.

Fig. 15 shows why calling a sample as containing tumor DNA can be challenging, particularly for samples that have a low tumor fraction. As shown in the top panel, for samples that have a high tumor fraction (TF), tumor DNA can be readily called because several positive signals are obtained in multiple aliquots. This eliminates most false positives. As shown in the bottom panel, samples that have a low tumor fraction are more difficult to call since the data may be accounted for by the background error rates. For example, if each positive variant has an 80% probability of corresponding to an actual sequence variation, the evidence shown for the low tumor fraction sample in Fig. 15 is insufficient to call the sample as containing tumor DNA. However, if the evidence is aggregated across multiple variants and aliquots there may be sufficient evidence to call a sample as containing tumor DNA. Fig. 11 shows how evidence can be combined across multiple variants. For dilute samples (≪ 0.1% tumor fraction), the fraction of mutant reads for individual variants in each sample is not expected to approximate the overall tumor fraction because of dropout effects. For example, many variants and aliquots will contain zero variant molecules. Instead, the effect of taking n/input reads per aliquot is modeled as a discrete distribution. In this example the tumor fraction is not measured directly. Rather, it is marginalized over all possible inputs, which provides an accurate estimate of the tumor fraction of the sample. Specifically, instead of guessing the number of variant molecules, the probabilities of all possible values are calculated based on: (i) the number of sequencing reads that have the sequence variation; (ii) the total number of sequencing reads; (iii) the number of molecules input into each aliquot; and (iv) the estimated background error rate for the sequence variation, and the value with the highest probability is identified. This avoids assuming a fixed number of variant molecules. In Fig. 15, the variants are shown as present or absent for each aliquot. However, these are in fact probabilities which take into account many factors such as tumor fraction and per-base noise estimates. A ground truth line (Fig. 16) can be constructed. Fig. 14 shows that particularly noisy variations, i.e., variations that are identified in a statistically improbable number of the aliquots, can be excluded from the analysis.
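By way of a non-limiting illustration, the marginalization described above may be sketched as follows. The sketch assumes, purely for illustration, that the number of variant molecules in an aliquot follows a binomial distribution given the tumor fraction and that mutant read counts follow a binomial distribution given the molecule fraction plus the background error rate; the specification does not prescribe these particular distributions. All function names, the truncation at 30 molecules and the example data are hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    """Log probability of k successes in n trials with success probability p."""
    if p <= 0.0:
        return 0.0 if k == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if k == n else -math.inf
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)

def log_likelihood(tumor_fraction, observations, max_molecules=30):
    """Sum over variant/aliquot observations of the log-likelihood obtained by
    marginalizing over the unknown number of variant molecules k.
    The sum over k is truncated at max_molecules (an assumption suited to
    dilute samples only)."""
    total = 0.0
    for mutant_reads, total_reads, n_input, error_rate in observations:
        terms = []
        for k in range(min(max_molecules, n_input) + 1):
            molecule_fraction = k / n_input
            # Expected mutant read fraction: true molecules plus background errors.
            p_read = molecule_fraction + (1.0 - molecule_fraction) * error_rate
            terms.append(log_binom_pmf(k, n_input, tumor_fraction)
                         + log_binom_pmf(mutant_reads, total_reads, p_read))
        peak = max(terms)
        if peak == -math.inf:
            return -math.inf
        # Log-sum-exp keeps the sum numerically stable.
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total

# Hypothetical data: (mutant reads, total reads, input molecules, background error rate)
observations = [
    (7, 20000, 3000, 1e-5), (0, 20000, 3000, 1e-5), (0, 20000, 3000, 1e-5),
    (0, 20000, 3000, 2e-5), (6, 20000, 3000, 2e-5), (1, 20000, 3000, 2e-5),
]
# Candidate tumor fractions: zero plus a log-spaced grid from 1e-6 to 1e-2.
grid = [0.0] + [10 ** (e / 4) for e in range(-24, -7)]
best_tf = max(grid, key=lambda tf: log_likelihood(tf, observations))
print("tumor fraction with highest likelihood on the grid:", best_tf)
```

In this sketch most aliquots contain no mutant reads, yet the combined likelihood still favors a non-zero tumor fraction, which reflects the point made above that evidence is aggregated across variants and aliquots rather than judged per variant.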

Fig. 17 shows the results of an experiment in which over 40 sequence variations in four aliquots of each of three different samples, containing varying levels of circulating tumor DNA (ctDNA), were analyzed using the present method. The 52 ppm and 544 ppm samples are identified as having ctDNA, which illustrates the advantage of combining evidence across multiple aliquots and variants. In this figure, the color intensity correlates with the VAF (variant allele fraction), with the brightest color representing >=1%. Some variant names are greyed out in order to indicate their absence in the original tumor sample.

Example 1

In order to build an optimal assay for detecting residual disease, the cancer type of interest, in this instance breast cancer, was first selected. The mutational rate of the cancer was reviewed and identified to be over 0.5 mutations per Mb in approximately 90% of patients, with the average patient having over 1 mutation per Mb (Martincorena and Campbell, Science 2015 349: 1483-9). In a pilot study of 22 early breast cancer patients, ctDNA was detected at a median of 0.06% VAF and down to 0.0007% VAF.

Studies diluting 3 cancer cell lines into normal DNA were performed using a personalized assay tracking 48 variants, demonstrating that cancer DNA can be detected consistently at 0.001% VAF when analyzing 48 variants in combination, but that the level of sensitivity halves each time the number of variants halves.

Based on the mutational rate of breast cancer and the observation that ctDNA is detected ~50% of the time at below 0.06% VAF and is detectable all the way down to 0.0007% VAF in the pilot study, a target of at least 90% of breast cancer samples having a limit of detection of at least 0.001% VAF was set. With a mutation rate of 0.5 mutations per Mb, a 96 Mb region of the genome was required for sequencing in breast cancer.
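As an illustrative check of the arithmetic, the 96 Mb figure is consistent with dividing a target of 48 variants by the 0.5 mutations per Mb rate, and the dilution result above implies a detection limit that scales inversely with the number of variants. The sketch below assumes exactly that proportional relationship; the function names are hypothetical and the relationship is an approximation, not a stated formula.

```python
def required_region_mb(target_variants, mutations_per_mb):
    """Megabases to sequence so the expected number of mutations reaches the target."""
    return target_variants / mutations_per_mb

def approx_lod_percent(n_variants, base_lod_percent=0.001, base_variants=48):
    """Approximate limit of detection (% VAF), assuming sensitivity halves
    each time the number of tracked variants halves."""
    return base_lod_percent * (base_variants / n_variants)

print(required_region_mb(48, 0.5))   # 96.0 Mb
print(approx_lod_percent(24))        # about 0.002 % VAF with half the variants
```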

The main advantages of this approach include reproducibly achieving the levels of sensitivity needed for the cancer type of interest, since in at least 90% of patients >48 variants are identified. Another advantage is that when a sample with a lower mutation rate is targeted, sequencing costs can be reduced.

Example 2

In order to design the optimal MRD assay, the system is designed to interrogate as many high quality variants as possible. To do this, a tumor biopsy is first obtained and macro-dissected targeting 50% tumor content, exome capture is performed, and the sample is then sequenced using an Illumina sequencer. All potential variants are identified using standard Illumina pipelines and then given a combined score based on 1) the likelihood of being real, 2) the likelihood of being somatic, 3) the background error rate for the variant, 4) the high signal background error rate, 5) the probability of being clonal, and 6) the level of amplification or copy number gain of the variant. The genome is divided into 50bp windows that overlap by 25bp. Each window is given a combined score that includes 1) the scores of all variants present within the window, 2) a score for the ability to uniquely align the region (where a penalty is given for regions that cannot be uniquely aligned, and the penalty is higher the greater the number of misalignments), and 3) a score for the ability to amplify and sequence the region (where a penalty is given to features known to challenge sequencing, including repeats). The regions are then sorted by score and the top 100 are selected for PCR primer design. Where 2 overlapping regions are both in the top 100 list, the region with the higher score is maintained and the region with the weaker score is discarded. The 101st region is then added to the list, and so on. A multiplex PCR is designed for the top 48 variants. In silico PCR is performed using all primer pairs. When primer combinations are identified producing >2 non-specific regions, the primer for the lowest scoring region which is causing this non-specific product is discarded and alternative primers are designed. If none overcome the non-specific PCR problem, the region is discarded and the next region is added to the primer design.
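The windowing and ranking steps above may be sketched, by way of example only, as a greedy selection that walks the windows from highest to lowest combined score and skips any window overlapping one already kept. This is one way to realize the described "discard the weaker overlapping region and add the next region" procedure; the combined score itself is taken as an input because its formula is not specified, and all names are hypothetical.

```python
def make_windows(region_length, size=50, step=25):
    """Yield (start, end) half-open windows of `size` bp; with step=25 and
    size=50, consecutive windows overlap by 25 bp."""
    for start in range(0, max(region_length - size, 0) + 1, step):
        yield (start, start + size)

def select_regions(scored_windows, top_n):
    """Keep the top_n highest-scoring windows that do not overlap each other.

    scored_windows: iterable of (score, chrom, start, end) tuples.
    """
    kept = []
    for score, chrom, start, end in sorted(scored_windows, reverse=True):
        # Two half-open intervals on the same chromosome overlap if each
        # starts before the other ends.
        overlaps = any(c == chrom and start < e and s < end for _, c, s, e in kept)
        if not overlaps:
            kept.append((score, chrom, start, end))
        if len(kept) == top_n:
            break
    return kept

# Hypothetical combined scores for four windows.
windows = [
    (0.9, "chr1", 100, 150),
    (0.8, "chr1", 125, 175),  # overlaps the 0.9 window, so it is discarded
    (0.7, "chr1", 150, 200),
    (0.6, "chr2", 0, 50),
]
selected = select_regions(windows, top_n=3)
print(selected)
```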

One challenge with this tumor-informed method of detecting cancer DNA in a test sample is the number of regions that can be robustly and cost-effectively targeted. This strategy of ranking regions could maximize the number of variants that are successfully interrogated in the test DNA sample. When the variants are in cis (next to each other on the same chromosome) they can be read together and this increases the ability to separate signal from noise. When the variants are in trans, but still readable with the same primer pairs (or other targeting reagents like baits), the amount of information from the single targeted region should be doubled. The approach should also limit the number of reads wasted on non-specific products.

Example 3

In order to detect cancer DNA in test samples with high sensitivity, it is advantageous to target multiple variants. For some cancer types it is sufficient to target just one type of variant. Sometimes, though, it is better to target multiple types of variants. In this example, it is identified that for certain breast cancer patients a large number of structural variants are present, whilst in other patients there are more SNVs and indels. A large panel is designed to sequence breast cancer tumor DNA, assessing for SNVs, indels and rearrangements. The optimal variant-containing regions are identified. Primers are designed to target these regions. Where a region contains 1 or more SNVs/indels, the primers are designed to flank all the SNVs and indels. Where the “region” is identified to contain a rearrangement, two different parts of the same chromosome or two different chromosomes will have been brought together. The rearrangement sequence is used for primer design and one primer is 3’ of the rearrangement and one is 5’. In instances where an SNV, indel or other variant (e.g. DBSs) is in cis with the rearrangement, the primers are designed to flank both the rearrangement and the other variant(s) using the rearranged sequence obtained from the tumor. An advantage of this approach is the ability to consistently obtain a large number of variants for assessment of cancer DNA in a test sample.

Example 4

In order to determine both the background error rate and the rate of high signal background events, 50 different panels, each with 48 amplicons, are designed. Each of the panels is designed against the exome of a patient that has either lung cancer, CRC or breast cancer. Each amplicon in the panel is on average ~100 bp long and within this there is on average ~60bp of sequence that is readable from the test DNA (i.e. non-primer sequence). Blood is obtained from 200 healthy donors. Each donor's blood is drawn into a Streck cell-free DNA blood collection tube. The blood is spun to plasma, cell-free DNA is extracted, then the DNA is quantified by digital PCR. Each panel is tested with the cfDNA from 4 donors. A multiplex PCR with multiple aliquots (3) is set up using the panel and cfDNA. This PCR is barcoded. The barcoded products from patients are pooled together. These are run on an Illumina NovaSeq sequencer. The variant types to be assessed are agreed as SNVs and indels. These variants are split into the following classes: type of SNV (e.g. C>A, T>A or G>A), and type and size of indel (e.g. 1bp, 2bp, 3bp del etc). The results from the donors are split into 3 groups (low DNA input, medium DNA input and high DNA input) based on digital PCR quantification of the cfDNA. Excluding primer sequences, a buffer of 3bp and all locations wherein a potential germline variant has been reported in gnomAD, for the remaining bases at each location the total number of reads, the number of each non-reference base and the count of each different type/size of indel are obtained. For each change (e.g. C>A) a beta distribution is fitted to the data. Both the mean and CV are obtained. Using a cumulative distribution function (CDF) for the particular base change, a threshold of 0.9999 is used to determine an allele fraction cutoff that a sample must reach to be considered positive. This is the background error rate. To determine the rate of high signal background events, for each change (e.g. C>A), all instances of the change in the test panels are assessed and the rate of detecting a signal above the CDF-determined allele fraction threshold is calculated.
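A minimal sketch of the fitting step for one change class follows. It assumes a method-of-moments beta fit and approximates the 0.9999 quantile by seeded Monte Carlo sampling using only the Python standard library; neither choice is stated in this example, and in practice a library inverse CDF (for instance `scipy.stats.beta.ppf`) would typically replace the sampling step. The error fractions shown are invented for illustration.

```python
import random
import statistics

def fit_beta_moments(fractions):
    """Method-of-moments beta fit; returns (alpha, beta, mean, cv)."""
    mean = statistics.fmean(fractions)
    var = statistics.variance(fractions)
    common = mean * (1.0 - mean) / var - 1.0
    if common <= 0:
        raise ValueError("variance too large for a beta distribution")
    cv = var ** 0.5 / mean
    return mean * common, (1.0 - mean) * common, mean, cv

def beta_quantile(alpha, beta, q, draws=200_000, seed=0):
    """Approximate the q-quantile of Beta(alpha, beta) by seeded sampling."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    return samples[min(int(q * draws), draws - 1)]

# Hypothetical per-position C>A error fractions from healthy-donor cfDNA.
c_to_a_fractions = [1.0e-5, 2.0e-5, 1.5e-5, 3.0e-5, 0.5e-5, 2.5e-5, 1.2e-5, 1.8e-5]

a, b, mean, cv = fit_beta_moments(c_to_a_fractions)
cutoff = beta_quantile(a, b, 0.9999)
print(f"mean={mean:.3g} CV={cv:.2f} allele-fraction cutoff={cutoff:.3g}")
```

The rate of high signal background events for the class would then be the proportion of healthy-donor observations of that change whose allele fraction exceeds `cutoff`.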

Example 5

A panel is designed for the tumor of a breast cancer patient by obtaining a biopsy sample and sequencing 96Mb of the tumor’s genome, then selecting primers to amplify 48 regions, wherein the 48 regions include in total 50 variants (SNVs and indels) believed to be somatic and specific to the tumor. The patient-specific primers are multiplexed and a multiplex PCR is set up using the tumor DNA. The PCR products are barcoded then sequenced on an Illumina sequencer. The variants not detected in the tumor DNA are bioinformatically filtered. The same panel is applied to the buffy coat DNA from the patient. A library is generated and sequenced. All variants identified at over 40% VAF are flagged as germline and filtered. All variants identified over the allele fraction cutoff as determined by the variant type and background error rate but below 40% are flagged as likely clonal hematopoiesis of indeterminate potential (CHIP) and filtered. If greater than 12 variants remain following the filtering, the panel is applied to the cfDNA extracted from the patient (if fewer remain, a panel redesign is attempted). The cfDNA is split into 3 aliquots and a multiplex PCR is performed using the patient-specific primers on all 3 aliquots. The PCR products are barcoded, bead cleanup is performed, then the samples are pooled and sequenced. At the completion of sequencing, the reads are demultiplexed, trimmed, filtered based on quality and aligned to the reference genome. At each target region, for all variants in each target region, the number of wild type reads and the total number of reads are counted.
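The filtering rules in this example may be illustrated with the short sketch below. The 40% germline boundary and the "greater than 12 variants" requirement are taken from the text; the record layout, the field names and the handling of a VAF of exactly 40% are assumptions made for illustration.

```python
def classify_buffy_coat_call(vaf, background_cutoff, germline_vaf=0.40):
    """Label a variant from its buffy coat VAF."""
    if vaf > germline_vaf:
        return "germline"
    if vaf > background_cutoff:
        return "likely_CHIP"
    return "keep"

def filter_panel(variants, min_remaining=12):
    """Return (names of retained variants, whether to proceed to cfDNA).

    variants: list of dicts with hypothetical keys
    name, detected_in_tumor, buffy_vaf and cutoff.
    """
    kept = [
        v["name"] for v in variants
        if v["detected_in_tumor"]
        and classify_buffy_coat_call(v["buffy_vaf"], v["cutoff"]) == "keep"
    ]
    # The panel is used only if greater than min_remaining variants remain.
    return kept, len(kept) > min_remaining

# Hypothetical panel: 13 clean variants plus three that should be filtered.
panel = [
    {"name": f"var{i}", "detected_in_tumor": True, "buffy_vaf": 0.0, "cutoff": 1e-4}
    for i in range(13)
]
panel += [
    {"name": "germline_snp", "detected_in_tumor": True, "buffy_vaf": 0.48, "cutoff": 1e-4},
    {"name": "chip_like", "detected_in_tumor": True, "buffy_vaf": 0.02, "cutoff": 1e-4},
    {"name": "not_in_tumor", "detected_in_tumor": False, "buffy_vaf": 0.0, "cutoff": 1e-4},
]
kept, proceed = filter_panel(panel)
print(len(kept), proceed)
```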

Example 6

Following the completion of sequencing of 3 aliquots of cfDNA from a breast cancer patient, the number of mutant reads and the total number of reads are obtained for all aliquots of all variants, excluding the filtered variants. The variant allele fraction (mutant/total reads) is determined, then this variant allele fraction is compared to the threshold generated using the background error rate. All aliquots for all variants are assessed to determine if they are positive (above the threshold) or negative. The tumor fraction is estimated by first correcting all VAFs using the background error rate then averaging across all aliquots of all variants. The number of DNA molecules added to each library preparation is compared with the average VAF to determine how likely it is that at least one mutant molecule would be present in each aliquot of each variant. Each variant is then assessed to determine if there are more positive aliquots than would be expected by chance, and those that are determined to have an improbable number of positive aliquots (P <0.05) are filtered. A score of 1 is then given to any variants that have no high signal background events (e.g. typically indels). The remaining variants are separated into those with a high rate of “high signal background events” (the top 50%) and those with a low rate of “high signal background events” (all those that are in the bottom 50%, excluding those that have no “high signal background events”). All variants with a low rate contribute a score of 0.75 and those with a high rate contribute a score of 0.5. If the test DNA sample is determined to have a total score equal to or greater than 2 and if at least 2 aliquots have a score of 0.5 or greater, the test sample is deemed to have cancer DNA. There are a number of advantages of such an approach. In some approaches one could simply determine if enough variants are above a threshold (e.g. 2 variants above a threshold). This is limited as some variants commonly produce high signal background events whilst others never do. This approach therefore enables confident calling with high specificity when just 2 variants are detected, provided these variants never produce high signal background events. When the variants identified are more prone to high signal background events, the scoring approach is more cautious and between 3 and 4 variants are needed in order to make a call, enabling the assay to maintain high specificity. By requiring a score in more than one aliquot, the assay prevents false positives due to contamination of a single aliquot. By filtering out variants that are either present in buffy coat or present in more aliquots than is likely based on the estimated tumor fraction, common sources of false positives, including CHIP and error-prone bases, are eradicated.
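The scoring rule of this example may be sketched as follows. The weights (1, 0.75, 0.5) and the calling criteria (total score of at least 2, at least 2 aliquots scoring 0.5 or more) come from the text. The sketch additionally assumes that each positive variant contributes its weight once to the sample total, that an aliquot's score is the sum of the weights of the variants positive in that aliquot, and that the high/low split is made at the median of the non-zero rates; these details, and all names, are illustrative assumptions.

```python
import statistics

def call_sample(positives, hsb_rates, min_total=2.0,
                min_aliquot_score=0.5, min_aliquots=2):
    """Return True if the sample is called as containing cancer DNA.

    positives: dict mapping variant -> set of aliquot indices in which the
               variant is above its VAF threshold.
    hsb_rates: dict mapping every unfiltered variant -> its rate of high
               signal background events.
    """
    nonzero = [r for r in hsb_rates.values() if r > 0]
    split = statistics.median(nonzero) if nonzero else 0.0
    weights = {
        v: 1.0 if r == 0 else (0.5 if r > split else 0.75)
        for v, r in hsb_rates.items()
    }
    # Sample total: each positive variant counts once.
    total = sum(weights[v] for v, aliquots in positives.items() if aliquots)
    # Per-aliquot score: weights of the variants positive in that aliquot.
    aliquot_scores = {}
    for v, aliquots in positives.items():
        for a in aliquots:
            aliquot_scores[a] = aliquot_scores.get(a, 0.0) + weights[v]
    supported = sum(1 for s in aliquot_scores.values() if s >= min_aliquot_score)
    return total >= min_total and supported >= min_aliquots

# Hypothetical panel: two indels with no background events, two noisier SNVs.
hsb_rates = {"indel_A": 0.0, "indel_B": 0.0, "snv_C": 0.001, "snv_D": 0.01}

print(call_sample({"indel_A": {0}, "indel_B": {1}}, hsb_rates))  # True
print(call_sample({"snv_C": {0}, "snv_D": {1}}, hsb_rates))      # False
print(call_sample({"indel_A": {0}, "indel_B": {0}}, hsb_rates))  # False
```

The first call succeeds with two clean variants in different aliquots (total score 2). The second fails because two background-prone variants total only 1.25. The third fails because both positives fall in a single aliquot.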

Example 6

Following the completion of sequencing of 3 aliquots of cfDNA from a breast cancer patient, the number of mutant reads and the total number of reads are obtained for all aliquots of all variants, excluding the filtered variants. The variant allele fraction (mutant/total reads) is determined, then this variant allele fraction is compared to the threshold generated using the background error rate. All aliquots for all variants are assessed to determine if they are positive (above the threshold) or negative. The tumor fraction is estimated by first correcting all VAFs using the background error rate then averaging across all aliquots of all variants. The number of DNA molecules added to each library preparation is compared with the average VAF to determine how likely it is that at least one mutant molecule would be present in each aliquot of each variant. Each variant is then assessed to determine if there are more positive aliquots than would be expected by chance, and those that are determined to have an improbable number of positive aliquots (P <0.05) are filtered. A calling threshold for the number of variants is then determined by obtaining the estimated rate of high signal background events for all remaining unfiltered variants, then calculating a distribution of the likely number of high signal background events across all remaining aliquots and variants. A threshold number of positive variants is then obtained wherein there is less than a 0.01% chance of obtaining the number of positive events purely through high signal background events. The sample is then called positive if the total number of positive variants (variants above the VAF threshold) is above this threshold number of positive variants and if at least 2 aliquots have a positive variant. There are a number of advantages of such an approach. In some approaches one could simply determine if enough variants are above a threshold (e.g. 2 variants above a threshold). This is limited as some variants commonly produce high signal background events whilst others never do. This approach therefore enables confident calling by estimating how commonly high signal background events would be present and with what distribution. A personalized threshold is then set depending on how noisy the variants are and how many variants there are. This enables very high sensitivity but also balances this with specificity (for example, when a large number of variants with common high signal background events are tested the threshold is higher than when a small number of variants that rarely have high signal background events is tested). By requiring a positive in more than one aliquot, the assay prevents false positives due to contamination of a single aliquot. By filtering out variants that are either present in buffy coat or present in more aliquots than is likely based on the estimated tumor fraction, common sources of false positives, including CHIP and error-prone bases, are eradicated.

Example 7

FFPE tumor material is obtained. The tissue is sectioned and total RNA is extracted from 10 slides. Ribosomal RNA depletion, reverse transcription and sequencing library preparation are performed. The sequencing library is barcoded then multiplexed with other libraries from patients. Sequencing on an Illumina NovaSeq platform is performed. The reads are demultiplexed and aligned, then the variants are called. The variants include SNVs, indels and gene fusions. These variants are then mapped from their RNA transcripts to the correct genomic DNA coordinates for primer design.