

Title:
DETECTING CROSS-CONTAMINATION IN CELL-FREE RNA
Document Type and Number:
WIPO Patent Application WO/2023/147509
Kind Code:
A1
Abstract:
The present disclosure relates to an improved method for analyzing sequencing data to detect cross-sample contamination in a test sample. Determining cross-contamination in a test sample can be informative for determining that the test sample will be less likely to correctly identify the presence of cancer in the subject. Pre-determined single nucleotide polymorphisms (SNPs) selected from: an allele present in a select database or a genotyping SNP associated with a sample type are used to identify contamination. A sample is determined to be contaminated using the determined contamination probabilities of the one or more pre-determined SNPs.

Inventors:
MAUNTZ RUTH (US)
BAGARIA SIDDHARTHA (US)
BURKHARDT DAVID (US)
LARSON MATTHEW (US)
PORTELA DOS SANTOS PIMENTEL MONICA (US)
Application Number:
PCT/US2023/061502
Publication Date:
August 03, 2023
Filing Date:
January 27, 2023
Assignee:
GRAIL LLC (US)
International Classes:
C12Q1/6809; C12Q1/6869; G16B30/00
Foreign References:
US20180237838A12018-08-23
US20220101135A12022-03-31
Attorney, Agent or Firm:
JACOBSON, Anthony, T. et al. (US)
Claims:
What is claimed is:

1. A method for identifying contamination in a sample, comprising:

(a) obtaining a plurality of sequencing reads for a plurality of nucleic acid fragments isolated from a sample comprising cell-free RNA (cfRNA);

(b) identifying sequencing reads that comprise one or more pre-determined single nucleotide polymorphisms (SNPs), thereby determining an observed allele frequency for each pre-determined SNP in the plurality of sequencing reads, wherein each of the one or more pre-determined SNPs are selected from: an allele present in one or more selected databases; or a genotyping SNP associated with a sample type; and

(c) determining whether the sample is contaminated using a determined contamination probability of the one or more pre-determined SNPs.

2. The method of claim 1, wherein the identified sequencing reads that comprise the one or more pre-determined SNPs comprise a sequencing depth of at least 10 reads per million mapped reads (RPM).

3. The method of claim 1 or 2, wherein the identified sequencing read comprising the one or more pre-determined SNPs each comprise an exonic sequence.

4. The method of claim 3, wherein the exonic sequence comprises an exon-exon junction.

5. The method of any one of claims 1-4, wherein the allele present in one or more select databases comprises an allele present in a universal human reference database.

6. The method of claim 5, wherein the one or more pre-determined SNPs are selected from Table 1.

7. The method of any one of claims 1-6, wherein the allele present in the one or more select databases comprises an allele present in a NCBI dbSNP database (Build 155) that has a reference allele frequency in a range between 0.2 and 0.7.

8. The method of claim 7, wherein the one or more pre-determined SNPs are selected from Table 2.

9. The method of claim 8, wherein the one or more pre-determined SNPs does not include a conversion type comprising: A>G; T>C; C>T; or G>A.

10. The method of any one of claims 1-9, wherein the one or more pre-determined SNPs are selected from Table 3.

11. The method of any one of claim 1-10, further comprising determining a contamination probability for each pre-determined SNP using its observed allele frequency.

12. The method of any one of claims 1-11, further comprising identifying two or more pre-determined SNPs in the sequencing reads, thereby determining an observed allele frequency for each of the two or more pre-determined SNPs in the plurality of sequencing reads.

13. The method of claim 12, wherein the two or more pre-determined SNPs are selected from Table 1, Table 2, Table 3, or any combination thereof.

14. The method of any one of claims 1-13, wherein the allele present in a Universal Human Reference (UHR) comprises an allele having a homozygous frequency of at least 75% in the UHR and a homozygous frequency of 5% or less in a human sample.

15. The method of any one of claims 1-14, wherein the reference allele frequency is in a range between 0.3 and 0.7.

16. The method of any one of claims 1-15, wherein the reference allele frequency comprises a MAF, a VAF, a sequencing depth, or any combination thereof.

17. The method of claim 16, wherein the reference allele frequency comprises a MAF, wherein the MAF is in a range between 0.3 and 0.7.

18. The method of claim 1, further comprising filtering the sequences by removing sequencing reads comprising SNPs including no-calls prior to determining a contamination probability.

19. The method of claim 18, wherein filtering further comprises removing sequences having a SNP with a A>G; G>A; T>C; or C>T conversion.

20. The method of any one of claims 1-19, wherein the observed allelic frequency comprises: a minor allele frequency (MAF), a variable allele frequency, a sequencing depth, a noise rate, or any combination thereof.

21. The method of any one of claims 1-20, wherein the observed allelic frequency comprises a MAF indicating contamination.

22. The method of claim 21, wherein the MAF is 0.5 or greater.

23. The method of any one of claims 1-22, further comprising discarding the sample following a determination that the sample is contaminated.

24. The method of any one of claims 1-22, further comprising assessing a risk introduced by contamination and using the risk in determining whether the sample is discarded.

25. The method of claim 24, wherein the risk introduced by the contamination is determined in part by determining a likely source of contamination.

26. The method of claim 25, wherein determining the contamination source lowers the risk introduced by the contamination, and wherein not determining the contamination source increases the risk introduced by the contamination.

27. The method of any one of claims 1-26, further comprising applying a contamination model to the sequencing reads identified as having one or more pre-determined SNPs and an observed allele frequency in the plurality of sequencing reads.

28. The method of any one of claims 1-27, wherein the contamination model comprises at least one likelihood test.

29. The method of claim 28, wherein one or more likelihood tests are applied to a sequencing read of the plurality of sequencing reads using the associated contamination probability, wherein each test to obtain a current contamination probability is indicative of whether the sequencing reads are contaminated.

30. The method of claim 28 or 29, further comprising: determining that the sequencing reads are contaminated based on the current contamination probability of the at least one test being above a threshold associated with the at least one test likelihood test.

31. The method of any one of claims 28-30, further comprising: determining that the sequencing reads are contaminated based on the current contamination probability of at least two likelihood tests being above a threshold associated with the at least two likelihood tests.

32. The method of any one of claims 28-31, wherein the at least one likelihood test maximizes a likelihood function, the likelihood function proportional to the probability of an event occurring in a data set given a variable.

33. The method of any of claims 28-32, wherein applying the at least one likelihood test of the contamination model comprises: comparing a set of generated contaminated sequencing reads to a set of previously obtained non-contaminated sequencing reads to determine the contamination probability.

34. The method of any one of claims 28-33, wherein applying at least one likelihood test of the contamination model comprises: generating a null hypothesis representing that the sequencing reads are not contaminated; generating a set of contamination hypotheses representing that the sequencing reads are contaminated, wherein each contamination hypothesis of the set of contamination hypotheses is contaminated at a different contamination level; and applying a likelihood ratio test between the set of contamination hypotheses and the null hypothesis, wherein the likelihood ratio test obtains the current contamination probability.

35. The method of any one of claims 28-34, wherein applying the at least one likelihood test of the contamination model comprises: comparing a set of generated contaminated sequencing reads to an average of previously obtained sequencing reads to determine the contamination probability, wherein the contamination probability is associated with the likelihood that the sequencing reads are contaminated at a contamination level.

36. The method of any one of claims 28-35, wherein applying at least one likelihood test of the contamination model comprises: generating a set of contamination hypotheses representing that the sequencing reads are contaminated, wherein each contamination hypothesis of the set of contamination hypotheses is contaminated at a different contamination level; generating a null hypothesis representing the mean minor allele frequency at a contamination level for a plurality of previously obtained sequencing reads, wherein the contamination level is associated with the contamination hypothesis most likely to be contaminated; and applying a likelihood ratio test between the set of contamination hypotheses and the null hypothesis, wherein the likelihood ratio test obtains the current contamination probability.

37. The method of any one of claims 1-27, wherein the contamination model comprises generating a noise model.

38. The method of claim 37, wherein the noise model represents a measure of background noise in a subset of sequencing reads, and wherein the noise model is generated based on the subset of the sequencing reads.

39. The method of claims 37 or 38, further comprising applying the contamination model to an identified sequencing read using the observed allele frequency of the one or more pre-determined SNPs in the identified sequencing reads and the generated noise model to obtain a confidence score representing a measure of the predicted contamination in the sequencing reads.

40. The method of any one of claims 37-39, wherein the background noise is a population measure of allele frequency in the subset of sequencing reads.

41. The method of claim 40, wherein the background noise is representative of the static noise generated when sequencing a SNP.

42. The method of any of claims 38-41, wherein the subset of sequencing reads comprises SNPs from uncontaminated and healthy test samples.

43. The method of any of claims 37-42, wherein generating the noise model further comprises: determining a noise coefficient for each SNP of the subset of sequencing reads, wherein the noise coefficient predicts the expected noise level for each SNP.

44. The method of any of claims 37-43, wherein the noise model generated based on the subset of sequencing reads is additionally based on a sample type of the sequencing reads.

45. The method of any of claims 37-44, wherein when the confidence score is above a threshold the contamination model predicts that the sequencing reads are contaminated.

46. The method of any of claims 37-45, wherein the contamination model additionally includes a random error term.

47. A system for determining contamination in a sample, comprising:

(a) a computer processor; and

(b) a non-transitory computer-readable storage medium storing instructions that, when executed by the computer processor, cause the computer processor to perform steps of any of the methods of claims 1-46.

48. A method of predicting presence of a disease in a sample, comprising:

(a) obtaining a plurality of sequencing reads for a plurality of nucleic acid fragments isolated from a sample comprising cell-free RNA (cfRNA);

(b) identifying contamination in a sample using any of the methods of claims 1-46; and

(c) identifying SNPs from the plurality of sequencing reads that are informative for the presence of a disease.

49. The method of claim 48, further comprising assessing the risk introduced by contamination identified in step (b).

50. The method of claim 49, wherein the risk introduced by the contamination is determined in part by determining a likely source of contamination.

51. The method of claim 50, wherein determining the contamination source lowers the risk introduced by the contamination, and wherein not determining the contamination source increases the risk introduced by the contamination.

52. The method of any one of claims 48-51, wherein a contaminated sample is discarded based in part on the presence of contamination, the risk introduced by the contamination, or both.

53. The method of claim 48, wherein the disease is cancer.

Description:
DETECTING CROSS-CONTAMINATION IN CELL-FREE RNA

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application claims the benefit of U.S. Provisional Application Ser. No. 63/304,503 filed on January 28, 2022, which is hereby incorporated in its entirety by reference.

BACKGROUND

1. FIELD OF ART

[0002] This application relates generally to detecting contamination in a sample, and more specifically to detecting contamination in a sample including targeted sequencing used for early detection of cancer.

2. DESCRIPTION OF THE RELATED ART

[0003] Next generation sequencing-based assays of circulating tumor DNA must achieve high sensitivity and specificity in order to detect cancer early. Early cancer detection and liquid biopsy both require highly sensitive methods to detect low tumor burden as well as specific methods to reduce false positive calls. Contaminating DNA from adjacent samples can compromise specificity which can result in false positive calls. In various instances, compromised specificity can be because rare SNPs from the contaminant may look like low level mutations. Methods currently exist for detecting and estimating contamination in whole genome sequencing data, typically from relatively low-depth sequencing studies. However, existing methods are not designed for detection of contamination in sequencing data from cancer detection samples, which typically require high-depth sequencing studies and include tumor-derived mutations (e.g., single base mutations and/or copy number variations (CNVs)) that may be present at varying frequencies (e.g., clonal and/or sub-clonal tumor-derived mutations). There is a need for new methods of detecting cross-sample contamination in sequencing data from a test sample used for cancer detection.

SUMMARY

[0004] Embodiments described herein relate to methods of analyzing sequencing data to detect cross-sample contamination in a test sample. Determining cross-contamination in a test sample can be informative for determining that the test sample will be less likely to correctly identify the presence of cancer in the subject. In one example, cross-contamination is determined in a nucleic acid sample obtained from a human subject and used for the early detection of cancer.

[0005] In various embodiments, samples (e.g., test samples) are obtained from subjects and prepared using genome sequencing techniques to generate sequencing reads representing a plurality of nucleic acid fragments from the sample, including cell-free RNA. The sequencing reads include a number of sequencing reads having one or more pre-determined SNPs that can be used to identify contamination in the sample. Identifying a sequencing read as having one or more pre-determined SNPs modifies the data set of the sequencing reads such that it can be more easily analyzed to determine contamination. In addition, predetermining a SNP enables identification of types of contamination, while also increasing the confidence with which contamination can be identified and lowering the limit of detection. Sequencing reads having one or more of the pre-determined SNPs are identified and an observed allele frequency is determined. Contamination probabilities can be based on the observed allelic frequency for each of the one or more pre-determined SNPs within the sample. Determining whether the sample is contaminated relies, at least in part, on the contamination probabilities of the one or more pre-determined SNPs.
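
As a minimal sketch of the allele-frequency step described above, the following Python assumes that reference and alternate allele counts have already been tallied per pre-determined SNP. The function name, the input layout, and the SNP identifiers are hypothetical and are not taken from the disclosure.

```python
def observed_allele_frequencies(counts):
    """Return alt / (ref + alt) for each pre-determined SNP.

    counts maps a SNP identifier to (ref_count, alt_count). SNPs with no
    covering reads are skipped because no frequency can be computed.
    """
    freqs = {}
    for snp_id, (ref, alt) in counts.items():
        depth = ref + alt
        if depth == 0:
            continue  # no reads cover this SNP
        freqs[snp_id] = alt / depth
    return freqs


# Hypothetical counts for three SNPs; "snp_b" has no coverage.
example = {"snp_a": (95, 5), "snp_b": (0, 0), "snp_c": (50, 50)}
print(observed_allele_frequencies(example))
```

The per-SNP frequencies returned here would be the input to whatever contamination probability calculation is applied downstream.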

[0006] In some embodiments, to determine contamination, the system can apply a contamination model including at least one likelihood test to a sequencing read of the plurality of sequencing reads. Here, the likelihood test obtains a current contamination probability representing the likelihood that the sample (e.g., the plurality of sequencing reads) is contaminated.

[0007] In some embodiments, to determine contamination, the system can apply a contamination model including generating a noise model. Generally, SNPs of the sample (e.g., test sample) at a given site are expected to have a variant allele frequency that can be modeled as a function of the minor allele frequency for SNPs at that site in a population, a contamination level, and a noise level. In some cases, the model can include a probability function based on the minor allele frequencies. Therefore, when analyzing the test sample obtained from a subject, variations from the expected variant allele frequency can be determined utilizing regression modeling. Specifically, regression modeling can be used to determine a contamination level and its statistical significance based on the relationship between the variant allele frequency and the minor allele frequency for a given site. If the determined contamination level of the test sample is above a threshold contamination level and the determined contamination level is statistically significant, a contamination event can be called. Calling a contamination event can indicate that at least some sequences included in the test sample originate from a different subject.
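
The regression idea in the paragraph above can be illustrated with a small sketch. It assumes a linear model, `vaf = noise + alpha * maf`, fitted by ordinary least squares over sites where the sample itself is homozygous for the reference allele. The model form, the function name, and the site restriction are illustrative assumptions; the significance test mentioned in the text is not implemented here.

```python
def fit_contamination(mafs, vafs):
    """Least-squares fit of vaf = noise + alpha * maf.

    mafs: population minor allele frequency per site.
    vafs: observed variant allele frequency per site in the test sample.
    Returns (alpha, noise): the slope is read as the contamination level
    and the intercept as the background noise level.
    """
    n = len(mafs)
    if n < 2 or n != len(vafs):
        raise ValueError("need at least two paired observations")
    mean_x = sum(mafs) / n
    mean_y = sum(vafs) / n
    sxx = sum((x - mean_x) ** 2 for x in mafs)
    if sxx == 0:
        raise ValueError("population MAFs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(mafs, vafs))
    alpha = sxy / sxx
    noise = mean_y - alpha * mean_x
    return alpha, noise


# Synthetic data generated with alpha = 0.02 and noise = 0.001.
mafs = [0.1, 0.2, 0.3, 0.4]
vafs = [0.001 + 0.02 * m for m in mafs]
alpha, noise = fit_contamination(mafs, vafs)
print(alpha, noise)
```

A contamination event would then be called only if the fitted level exceeds a chosen threshold and the fit is statistically significant, as the paragraph describes.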

[0008] In one aspect, this disclosure features a method for identifying contamination in a sample, comprising: obtaining a plurality of sequencing reads for a plurality of nucleic acid fragments isolated from a sample comprising cell-free RNA (cfRNA); identifying sequencing reads that comprise one or more pre-determined single nucleotide polymorphisms (SNPs), thereby determining an observed allele frequency for each pre-determined SNP in the plurality of sequencing reads, wherein each of the one or more pre-determined SNPs are selected from: an allele present in one or more selected databases; or a genotyping SNP associated with a sample type; and determining whether the sample is contaminated using a determined contamination probability of the one or more pre-determined SNPs.

[0009] In some embodiments, the identified sequencing reads that comprise the one or more pre-determined SNPs comprise a sequencing depth of at least 10 reads per million mapped reads (RPM).
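
For illustration, the depth criterion can be expressed as a short normalization. The function names are hypothetical, and normalizing the reads covering a SNP by the total number of mapped reads in the sample is the usual reading of RPM rather than a detail stated here.

```python
def reads_per_million(snp_reads, total_mapped_reads):
    """Normalize the number of reads covering a SNP to reads per million."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return snp_reads * 1_000_000 / total_mapped_reads


def passes_depth_filter(snp_reads, total_mapped_reads, min_rpm=10.0):
    """True when the SNP meets the minimum RPM depth (10 RPM in the text)."""
    return reads_per_million(snp_reads, total_mapped_reads) >= min_rpm


# 250 covering reads in a library of 20 million mapped reads.
print(reads_per_million(250, 20_000_000))
print(passes_depth_filter(250, 20_000_000))
```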

[0010] In some embodiments, the identified sequencing read comprising the one or more pre-determined SNPs each comprise an exonic sequence.

[0011] In some embodiments, the exonic sequence comprises an exon-exon junction.

[0012] In some embodiments, the allele present in one or more select databases comprises an allele present in a universal human reference database.

[0013] In some embodiments, the one or more pre-determined SNPs are selected from Table 1.

[0014] In some embodiments, the allele present in the one or more select databases comprises an allele present in a NCBI dbSNP database (Build 155) that has a reference allele frequency in a range between 0.2 and 0.7.
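
A sketch of this frequency-based selection follows. It assumes the reference allele frequencies have already been extracted from the database into a dictionary and treats the range as inclusive at both ends; both points, and the function name, are assumptions made for illustration.

```python
def select_by_reference_frequency(snp_freqs, low=0.2, high=0.7):
    """Keep SNPs whose reference allele frequency lies within [low, high].

    snp_freqs maps a SNP identifier to its reference allele frequency.
    """
    return [snp for snp, freq in snp_freqs.items() if low <= freq <= high]


# Hypothetical identifiers and frequencies.
candidates = {"snp_a": 0.10, "snp_b": 0.20, "snp_c": 0.50, "snp_d": 0.90}
print(select_by_reference_frequency(candidates))
```

Passing `low=0.3` gives the narrower 0.3 to 0.7 range mentioned elsewhere in the disclosure.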

[0015] In some embodiments, the one or more pre-determined SNPs are selected from Table 2.

[0016] In some embodiments, the one or more pre-determined SNPs does not include a conversion type comprising: A>G; T>C; C>T; or G>A.

[0017] In some embodiments, the one or more pre-determined SNPs are selected from Table 3.

[0018] In some embodiments, the method further comprises determining a contamination probability for each pre-determined SNP using its observed allele frequency.

[0019] In some embodiments, the method further comprises identifying two or more pre-determined SNPs in the sequencing reads, thereby determining an observed allele frequency for each of the two or more pre-determined SNPs in the plurality of sequencing reads.

[0020] In some embodiments, the two or more pre-determined SNPs are selected from Table 1, Table 2, Table 3, or any combination thereof.

[0021] In some embodiments, the allele present in a Universal Human Reference (UHR) comprises an allele having a homozygous frequency of at least 75% in the UHR and a homozygous frequency of 5% or less in a human sample.
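
The two homozygous-frequency criteria can be written as a simple predicate. This is a sketch only: the function and parameter names are hypothetical, and frequencies are assumed to be expressed as fractions between 0 and 1.

```python
def is_uhr_marker(uhr_hom_freq, human_hom_freq, min_uhr=0.75, max_human=0.05):
    """True when an allele is mostly homozygous in the UHR but rarely
    homozygous in human samples, per the 75% and 5% cutoffs in the text."""
    return uhr_hom_freq >= min_uhr and human_hom_freq <= max_human


print(is_uhr_marker(0.90, 0.01))  # common in UHR, rare in human samples
print(is_uhr_marker(0.90, 0.20))  # too common in human samples
```

An allele passing this check would be informative for contamination originating from UHR material, because it is unlikely to appear as homozygous in a genuine human sample.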

[0022] In some embodiments, the reference allele frequency is in a range between 0.3 and 0.7.

[0023] In some embodiments, the reference allele frequency comprises a MAF, a VAF, a sequencing depth, or any combination thereof.

[0024] In some embodiments, the reference allele frequency comprises a MAF, wherein the MAF is in a range between 0.3 and 0.7.

[0025] In some embodiments, the method further comprises filtering the sequences by removing sequencing reads comprising SNPs including no-calls prior to determining a contamination probability.

[0026] In some embodiments, filtering further comprises removing sequences having a SNP with an A>G; G>A; T>C; or C>T conversion.
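
The two filtering steps above (dropping no-calls, then dropping the excluded conversion types) might be sketched as follows. The tuple layout, the use of `None` to mark a no-call, and the reading of the four excluded types as the transitions A>G, G>A, C>T, and T>C are assumptions made for this example.

```python
# The four transition-type conversions assumed to be excluded.
EXCLUDED_CONVERSIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def filter_snp_calls(calls):
    """Drop no-calls and excluded conversion types.

    calls is a list of (snp_id, ref_base, alt_base) tuples, where
    alt_base is None when no call could be made at the site.
    """
    kept = []
    for snp_id, ref, alt in calls:
        if alt is None:
            continue  # no-call
        if (ref, alt) in EXCLUDED_CONVERSIONS:
            continue  # excluded transition
        kept.append((snp_id, ref, alt))
    return kept


calls = [("s1", "A", "G"), ("s2", "A", "C"), ("s3", "G", None), ("s4", "T", "A")]
print(filter_snp_calls(calls))
```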

[0027] In some embodiments, the observed allelic frequency comprises: a minor allele frequency (MAF), a variable allele frequency, a sequencing depth, a noise rate, or any combination thereof.

[0028] In some embodiments, the observed allelic frequency comprises a MAF indicating contamination.

[0029] In some embodiments, the MAF is 0.5 or greater.

[0030] In some embodiments, the method further comprises discarding the sample following a determination that the sample is contaminated.

[0031] In some embodiments, the method further comprises assessing a risk introduced by contamination and using the risk in determining whether the sample is discarded.

[0032] In some embodiments, the risk introduced by the contamination is determined in part by determining a likely source of contamination.

[0033] In some embodiments, determining the contamination source lowers the risk introduced by the contamination, and wherein not determining the contamination source increases the risk introduced by the contamination.

[0034] In some embodiments, the method further comprises applying a contamination model to the sequencing reads identified as having one or more pre-determined SNPs and an observed allele frequency in the plurality of sequencing reads.

[0035] In some embodiments, the contamination model comprises at least one likelihood test.

[0036] In some embodiments, one or more likelihood tests are applied to a sequencing read of the plurality of sequencing reads using the associated contamination probability, wherein each test to obtain a current contamination probability is indicative of whether the sequencing reads are contaminated.

[0037] In some embodiments, the method further comprises:

[0038] determining that the sequencing reads are contaminated based on the current contamination probability of the at least one likelihood test being above a threshold associated with the at least one likelihood test.

[0039] In some embodiments, the method further comprises:

[0040] determining that the sequencing reads are contaminated based on the current contamination probability of at least two likelihood tests being above a threshold associated with the at least two likelihood tests.
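
The threshold-based calls described above can be sketched as a count of likelihood tests whose contamination probability exceeds the threshold associated with that test. The function name, the pairing of each probability with its own threshold, and the strict "greater than" comparison are assumptions.

```python
def call_contaminated(test_results, min_tests_passing=2):
    """Return True when enough likelihood tests exceed their thresholds.

    test_results is a list of (contamination_probability, threshold)
    pairs, one pair per likelihood test.
    """
    passing = sum(1 for prob, threshold in test_results if prob > threshold)
    return passing >= min_tests_passing


# Two hypothetical tests, each with its own threshold.
print(call_contaminated([(0.9, 0.5), (0.7, 0.6)]))
print(call_contaminated([(0.9, 0.5), (0.1, 0.6)]))
```

Setting `min_tests_passing=1` gives the single-test variant, in which one test above its threshold is sufficient.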

[0041] In some embodiments, the at least one likelihood test maximizes a likelihood function, the likelihood function proportional to the probability of an event occurring in a data set given a variable.

[0042] In some embodiments, applying the at least one likelihood test of the contamination model comprises:

[0043] comparing a set of generated contaminated sequencing reads to a set of previously obtained non-contaminated sequencing reads to determine the contamination probability.

[0044] In some embodiments, applying at least one likelihood test of the contamination model comprises: generating a null hypothesis representing that the sequencing reads are not contaminated; generating a set of contamination hypotheses representing that the sequencing reads are contaminated, wherein each contamination hypothesis of the set of contamination hypotheses is contaminated at a different contamination level; and applying a likelihood ratio test between the set of contamination hypotheses and the null hypothesis, wherein the likelihood ratio test obtains the current contamination probability.
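
One way to picture the likelihood ratio test described above is the following sketch. It rests on illustrative assumptions that are not details from the disclosure: only sites where the sample itself is homozygous for the reference allele are used, alternate-allele counts are binomial, and the expected alternate fraction is `noise + level * maf`. The function names and the fixed `noise` value are hypothetical, and the sketch returns a log-likelihood ratio rather than a calibrated probability.

```python
import math


def log_likelihood(sites, level, noise=1e-3):
    """Binomial log-likelihood of the alt counts at a contamination level.

    sites is a list of (alt_count, depth, population_maf) tuples. The
    binomial coefficient is omitted because it cancels in the ratio.
    """
    total = 0.0
    for alt, depth, maf in sites:
        p = min(max(noise + level * maf, 1e-12), 1 - 1e-12)
        total += alt * math.log(p) + (depth - alt) * math.log(1 - p)
    return total


def likelihood_ratio_test(sites, levels, noise=1e-3):
    """Compare each contamination hypothesis with the null (level 0).

    Returns the best-supported level and its log-likelihood ratio
    against the null hypothesis of no contamination.
    """
    null = log_likelihood(sites, 0.0, noise)
    best = max(levels, key=lambda lv: log_likelihood(sites, lv, noise))
    return best, log_likelihood(sites, best, noise) - null


# Synthetic sites consistent with 2% contamination.
sites = [(70, 10000, 0.3), (90, 10000, 0.4), (110, 10000, 0.5)]
print(likelihood_ratio_test(sites, [0.005, 0.01, 0.02, 0.05]))
```

A positive ratio favors the contamination hypothesis; a threshold on the ratio would then decide whether contamination is called.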

[0045] In some embodiments, applying the at least one likelihood test of the contamination model comprises: comparing a set of generated contaminated sequencing reads to an average of previously obtained sequencing reads to determine the contamination probability, wherein the contamination probability is associated with the likelihood that the sequencing reads are contaminated at a contamination level.

[0046] In some embodiments, applying at least one likelihood test of the contamination model comprises: generating a set of contamination hypotheses representing that the sequencing reads are contaminated, wherein each contamination hypothesis of the set of contamination hypotheses is contaminated at a different contamination level; generating a null hypothesis representing the mean minor allele frequency at a contamination level for a plurality of previously obtained sequencing reads, wherein the contamination level is associated with the contamination hypothesis most likely to be contaminated; and applying a likelihood ratio test between the set of contamination hypotheses and the null hypothesis, wherein the likelihood ratio test obtains the current contamination probability.

[0047] In some embodiments, the contamination model comprises generating a noise model.

[0048] In some embodiments, the noise model represents a measure of background noise in a subset of sequencing reads, and wherein the noise model is generated based on the subset of the sequencing reads.

[0049] In some embodiments, the method further comprises applying the contamination model to an identified sequencing read using the observed allele frequency of the one or more pre-determined SNPs in the identified sequencing reads and the generated noise model to obtain a confidence score representing a measure of the predicted contamination in the sequencing reads.

[0050] In some embodiments, the background noise is a population measure of allele frequency in the subset of sequencing reads.

[0051] In some embodiments, the background noise is representative of the static noise generated when sequencing a SNP.

[0052] In some embodiments, the subset of sequencing reads comprises SNPs from uncontaminated and healthy test samples.

[0053] In some embodiments, generating the noise model further comprises: determining a noise coefficient for each SNP of the subset of sequencing reads, wherein the noise coefficient predicts the expected noise level for each SNP.
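
The text does not give a formula for the noise coefficient. As one hedged illustration, the sketch below estimates a per-SNP coefficient as the pooled alternate-allele fraction across healthy, uncontaminated samples. The pooling rule, the function name, and the data layout are all assumptions.

```python
def noise_coefficients(healthy_samples):
    """Estimate a per-SNP noise coefficient from healthy samples.

    healthy_samples is a list of dicts, each mapping a SNP identifier to
    (alt_count, depth) at a site where no alternate allele is expected.
    The coefficient is total alt reads divided by total depth per SNP.
    """
    alt_totals = {}
    depth_totals = {}
    for sample in healthy_samples:
        for snp_id, (alt, depth) in sample.items():
            alt_totals[snp_id] = alt_totals.get(snp_id, 0) + alt
            depth_totals[snp_id] = depth_totals.get(snp_id, 0) + depth
    return {
        snp_id: alt_totals[snp_id] / depth_totals[snp_id]
        for snp_id in alt_totals
        if depth_totals[snp_id] > 0
    }


samples = [{"s1": (1, 1000)}, {"s1": (3, 1000), "s2": (0, 500)}]
print(noise_coefficients(samples))
```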

[0054] In some embodiments, the noise model generated based on the subset of sequencing reads is additionally based on a sample type of the sequencing reads.

[0055] In some embodiments, when the confidence score is above a threshold the contamination model predicts that the sequencing reads are contaminated.

[0056] In some embodiments, the contamination model additionally includes a random error term.

[0057] In another aspect, this disclosure features a system for determining contamination in a sample, comprising: (a) a computer processor; and (b) a non-transitory computer-readable storage medium storing instructions that, when executed by the computer processor, cause the computer processor to perform steps of any of the methods described herein.

[0058] In another aspect, this disclosure features a method of predicting presence of a disease in a sample, comprising: obtaining a plurality of sequencing reads for a plurality of nucleic acid fragments isolated from a sample comprising cell-free RNA (cfRNA); identifying contamination in a sample using any of the methods described herein; and identifying SNPs from the plurality of sequencing reads that are informative for the presence of a disease.

[0059] In some embodiments, the method further comprises assessing the risk introduced by contamination identified in step (b).

[0060] In some embodiments, the risk introduced by the contamination is determined in part by determining a likely source of contamination.

[0061] In some embodiments, determining the contamination source lowers the risk introduced by the contamination, and wherein not determining the contamination source increases the risk introduced by the contamination.

[0062] In some embodiments, a contaminated sample is discarded based in part on the presence of contamination, the risk introduced by the contamination, or both.

[0063] In some embodiments, the disease is cancer.

BRIEF DESCRIPTION OF DRAWINGS

[0064] FIG. 1 is a flowchart of a method for preparing a nucleic acid sample for sequencing, according to one example embodiment.

[0065] FIG. 2 is a block diagram of a processing system for processing sequence reads, according to one example embodiment.

[0066] FIG. 3 is a flowchart of a method for determining variants of sequence reads, according to one example embodiment.

[0067] FIG. 4 shows an error plot with mean error rate (y-axis) plotted against mean sequencing depth (x-axis), according to one example embodiment.

[0068] FIGs. 5A-5B show histograms for error rate (y-axis) for each of the different conversion types (x-axis), according to one example embodiment. FIG. 5A shows error rate (y-axis) for each of the different conversion types (x-axis) when analyzing SNPs from whole transcriptome data. FIG. 5B shows error rate (y-axis) for each of the different conversion types (x-axis) when analyzing SNPs from targeted panels. Error rate = alt counts / depth for each error mode in a sample.
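
The error-rate definition in the caption above (alt counts / depth for each error mode in a sample) can be sketched as follows. Pooling counts across all sites that share a conversion type, and the tuple layout used as input, are assumptions made for the example.

```python
def error_rates_by_conversion(observations):
    """Compute alt counts / depth for each conversion type in one sample.

    observations is an iterable of (ref_base, alt_base, alt_count, depth)
    tuples, one per site; counts are pooled per conversion type.
    """
    alt_sum = {}
    depth_sum = {}
    for ref, alt, alt_count, depth in observations:
        key = f"{ref}>{alt}"
        alt_sum[key] = alt_sum.get(key, 0) + alt_count
        depth_sum[key] = depth_sum.get(key, 0) + depth
    return {k: alt_sum[k] / depth_sum[k] for k in alt_sum if depth_sum[k] > 0}


obs = [("A", "G", 2, 1000), ("A", "G", 4, 1000), ("C", "A", 1, 2000)]
print(error_rates_by_conversion(obs))
```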

[0069] FIG. 6 illustrates a flow diagram of a workflow for detecting contamination in a plurality of sequencing reads using contamination probabilities for one or more predetermined SNPs, according to one example embodiment.

[0070] FIG. 7 illustrates a flow diagram of a workflow for detecting contamination in a plurality of sequencing reads using likelihood tests based on prior probabilities of contamination for one or more pre-determined SNPs, according to one example embodiment.

[0071] FIG. 8A illustrates a limit of detection workflow, according to one example embodiment.

[0072] FIG. 8B shows the limit of detection for the workflow of FIG. 8A.

[0073] FIG. 9A is a plot showing the analytical validation for limit of detection for cfRNA contamination, according to one example embodiment.

[0074] FIG. 9B shows the limit of detection for the workflow of FIG. 8A.

[0075] FIG. 10A is a plot showing the analytical validation for limit of detection of UHR contamination, according to one example embodiment.

[0076] FIG. 10B shows the limit of detection for the workflow of FIG. 8A.

[0077] FIG. 11 illustrates a workflow of a method of validating the contamination detection application, according to one example embodiment.

[0078] FIG. 12A illustrates a workflow for in silico validation, according to one example embodiment.

[0079] FIG. 12B is a contamination estimation plot showing in silico validation, according to one example embodiment.

[0080] FIG. 12C shows contamination fraction (y-axis) plotted against average likelihood (Log) showing in silico validation when analyzing SNPs from targeted panels.

[0081] FIG. 12D shows contamination fraction (y-axis) plotted against average likelihood (Log) showing in silico validation when analyzing SNPs from whole transcriptome data.

[0082] FIG. 13 illustrates a block diagram of a contamination detection application for detecting and calling contamination in a plurality of sequence reads, according to one example embodiment. Dashed lines indicate optional workflow.

[0083] FIG. 14 illustrates a block diagram of a contamination detection application for detecting and calling contamination in a plurality of sequence reads, according to one example embodiment. Dashed lines indicate optional workflow.

[0084] The figures depict embodiments of the present invention for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles of the invention described herein.

DETAILED DESCRIPTION

I. DEFINITIONS

[0085] The term “individual” refers to a human individual. The term “healthy individual” refers to an individual presumed to not have cancer or disease. The term “subject” refers to an individual who is known to have, or potentially has, cancer or disease.

[0086] The term “sample” refers to a biological specimen taken from an individual or subject. Sample can refer to one or more samples taken from an individual or subject and combined prior to performing the detection methods described herein. For example, genome sequencing techniques commonly combine samples prior to performing a sequencing reaction. In such cases, the samples are labeled prior to combining. Sample can refer to nucleic acid fragments taken from targeted panels. Sample can refer to nucleic acid fragments taken from whole transcriptome and/or whole genome data.

[0087] FIG. 12D shows contamination fraction (y-axis) plotted against average likelihood (Log) showing in silico validation when analyzing SNPs from whole transcriptome data.

[0088] The term “sequence reads” or “sequencing reads” refers to nucleotide sequences read from a sample. Sequence reads can be obtained through various methods known in the art.

[0089] The term “a plurality of sequencing reads” refers to all or a portion of a plurality of nucleic acid sequences or fragments from a sample.

[0090] The term “read segment” or “read” refers to any nucleotide sequences including sequence reads obtained from an individual and/or nucleotide sequences derived from the initial sequence read from a sample obtained from an individual. For example, a read segment can refer to an aligned sequence read, a collapsed sequence read, or a stitched read. Furthermore, a read segment can refer to an individual nucleotide base, such as a single nucleotide variant.

[0091] The term “single nucleotide variant” or “SNV” refers to a substitution of one nucleotide to a different nucleotide at a position (e.g., site) of a nucleotide sequence, e.g., a sequence read from an individual. A substitution from a first nucleobase X to a second nucleobase Y may be denoted as “X>Y.” For example, a cytosine to thymine SNV may be denoted as “C>T.”

[0092] The term “single nucleotide polymorphism” or “SNP” refers to a substitution of one nucleotide to a different nucleotide at a position (e.g., site) of a nucleotide sequence, e.g., a sequence read from an individual. For example, at a specific base site, the nucleobase C may appear in most individuals, but in a minority of individuals, the position is occupied by base A. There is a SNP at this specific site.

[0093] The term “pre-determined single nucleotide polymorphism” or “pre-determined SNP” refers to a SNP identified prior to performing any of the methods described herein (e.g., prior to identifying sequencing reads). For example, a pre-determined SNP is identified prior to identifying sequence reads that comprise one or more pre-determined single nucleotide polymorphisms. A pre-determined SNP, alone or in combination with one or more additional pre-determined SNPs, enables identification of contamination in a sample.

[0094] The term “indel” refers to any insertion or deletion of one or more base pairs having a length and a position (which may also be referred to as an anchor position) in a sequence read. An insertion corresponds to a positive length, while a deletion corresponds to a negative length.

[0095] The term “mutation” refers to one or more SNVs or indels.

[0096] The term “true positive” refers to a mutation that indicates real biology, for example, the presence of potential cancer, disease, or germline mutation in an individual. True positives are not caused by mutations naturally occurring in healthy individuals (e.g., recurrent mutations) or other sources of artifacts such as process errors during assay preparation of nucleic acid samples.

[0097] The term “false positive” refers to a mutation incorrectly determined to be a true positive. Generally, false positives may be more likely to occur when processing sequence reads associated with greater mean noise rates or greater uncertainty in noise rates.

[0098] The term “cell-free nucleic acid,” “cell-free DNA,” “cfDNA,” “cell-free RNA,” or “cfRNA” refers to nucleic acid fragments that circulate in an individual’s body (e.g., bloodstream) and originate from one or more healthy cells and/or from one or more cancer cells. A sample, as described herein, can include cell-free nucleic acids (e.g., cfRNA).

[0099] The term “circulating tumor DNA” or “ctDNA” refers to nucleic acid fragments that originate from tumor cells or other types of cancer cells, which may be released into an individual’s bloodstream as a result of biological processes such as apoptosis or necrosis of dying cells or actively released by viable tumor cells. Nucleic acid fragments that originate from tumor cells or other types of cancer cells can be informative of the presence or absence of cancer (or disease), cancer status, or a cancer classification (e.g., cancer type or tissue of origin).

[00100] The term “genomic nucleic acid,” “genomic DNA,” or “gDNA” refers to nucleic acid including chromosomal DNA that originates from one or more healthy cells.

[00101] The term “alternative allele” or “ALT” refers to an allele having one or more mutations relative to a reference allele, e.g., corresponding to a known gene.

[00102] The term “minor allele” or “MIN” refers to the second most common allele in a given population.

[00103] The term “sequencing depth” or “depth” refers to a total number of read segments from a sample obtained from an individual that have a particular location in the genome. A non-limiting example of sequencing depth described herein includes “reads per million” (RPM) mapped reads.

[00104] The term “allele depth” or “AD” refers to a number of read segments in a sample that supports an allele in a population. The terms “AAD” and “MAD” refer to the “alternate allele depth” (i.e., the number of read segments that support an ALT) and “minor allele depth” (i.e., the number of read segments that support a MIN), respectively.

[00105] The term “contaminated” refers to a test sample that is contaminated with at least some portion of a second test sample. That is, a contaminated test sample unintentionally includes DNA sequences from an individual that did not generate the test sample. Similarly, the term “uncontaminated” refers to a test sample that does not include at least some portion of a second test sample.

[00106] The term “contamination level” refers to the degree of contamination in a test sample. That is, the contamination level is the proportion of reads in a first test sample that originate from a second test sample. For example, if a first test sample of 1000 reads includes 30 reads from a second test sample, the contamination level is 3.0%.
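The arithmetic in this definition can be sketched in a few lines of Python. The function name `contamination_level` and its argument names are illustrative assumptions and do not appear in the disclosure.

```python
def contamination_level(total_reads: int, foreign_reads: int) -> float:
    """Fraction of reads in a first test sample that come from a second sample.

    Hypothetical helper for illustration only.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= foreign_reads <= total_reads:
        raise ValueError("foreign_reads must lie between 0 and total_reads")
    return foreign_reads / total_reads


# The example from the definition: 30 foreign reads out of 1000.
level = contamination_level(1000, 30)
print(f"{level:.1%}")  # prints "3.0%"
```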

[00107] The term “contamination event” refers to a test sample being called contaminated. Generally, a test sample is called contaminated if the determined contamination level is above a threshold contamination level and the determined contamination level is statistically significant.

[00108] The term “allele frequency” or “AF” refers to the frequency of a given allele in a population. The terms “AAF” and “MAF” refer to the “alternate allele frequency” and “minor allele frequency”, respectively. Herein, the term “variant allele frequency” or “VAF” refers to the minor allele frequency for an allele of the test sample. In this case, the VAF may be determined by dividing the corresponding variant allele depth of a test sample by the total depth of the sample for the given allele.
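As a minimal sketch of the depth-to-frequency relationship described above, the following Python computes allele depths and a VAF from the bases observed at one site. The pileup representation (a list of single-base strings) and the function names are assumptions made for illustration.

```python
from collections import Counter


def allele_depths(bases_at_site):
    """Count the read segments supporting each base at one site."""
    return Counter(bases_at_site)


def variant_allele_frequency(bases_at_site, variant_base):
    """VAF = variant allele depth / total depth at the site."""
    depths = allele_depths(bases_at_site)
    total_depth = sum(depths.values())
    if total_depth == 0:
        raise ValueError("no coverage at this site")
    return depths[variant_base] / total_depth


# Hypothetical site covered by 100 read segments: 97 support C, 3 support A.
pileup = ["C"] * 97 + ["A"] * 3
vaf = variant_allele_frequency(pileup, "A")  # 3 / 100
```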

[00109] The term “reference allele frequency” refers to the frequency of a given allele in a previously sequenced sample. For example, a reference allele frequency refers to allele frequency for an allele in a previously sequenced sample that included cfRNA where allele frequency was determined. In another example, the reference allele frequency refers to allele frequency for an allele in a NCBI dbSNP database (Build 155).

[00110] The term “observed allele frequency” refers to the frequency of a given allele in a sample where the detection methods described herein were used, at least in part, to determine the allele frequency. An observed allele frequency can then be used to determine whether the sample is contaminated.

II. DETECTING CONTAMINATION BASED ON PRE-DETERMINED SNPS

[00111] In various embodiments, samples (e.g., test samples) are obtained from subjects and prepared using genome sequencing techniques to generate sequencing reads representing a plurality of nucleic acid fragments from the sample, including cell-free RNA. The sequencing reads include a number of sequencing reads having one or more pre-determined SNPs that can be used to identify contamination in the sample. Identifying a sequencing read as having one or more pre-determined SNPs modifies the data set of the sequencing reads such that it can be more easily analyzed to determine contamination. In addition, pre-determining a SNP enables identification of types of contamination, while also increasing the confidence with which contamination can be identified and lowering the limit of detection. Sequencing reads having one or more of the pre-determined SNPs are identified and an observed allele frequency is determined. Contamination probabilities can be based on the observed allele frequency for each of the one or more pre-determined SNPs within the sample. Determining whether the sample is contaminated relies, at least in part, on the contamination probabilities of the one or more pre-determined SNPs. In some embodiments, to determine contamination, the system can apply a contamination model including at least one likelihood test to a sequencing read of the plurality of sequencing reads. Here, the likelihood test obtains a current contamination probability representing the likelihood that the sample (e.g., the plurality of sequencing reads) is contaminated.

II. A. EXAMPLE ASSAY PROTOCOL

[00112] FIG. 1 is a flowchart of a method 100 for preparing a nucleic acid sample for sequencing according to one embodiment. The method 100 includes, but is not limited to, the following steps. For example, any step of the method 100 may comprise a quantitation substep for quality control or other laboratory assay procedures known to one skilled in the art.

[00113] In step 110, a nucleic acid sample (DNA or RNA) is extracted from a subject. In the present disclosure, DNA and RNA may be used interchangeably unless otherwise indicated. That is, the following embodiments for using error source information in variant calling and quality control may be applicable to both DNA and RNA types of nucleic acid sequences. However, the examples described herein may focus on DNA for purposes of clarity and explanation. The sample may be any subset of the human genome, including the whole genome. The sample may be extracted from a subject known to have or suspected of having cancer. The sample may include blood, plasma, serum, urine, fecal, saliva, other types of bodily fluids, or any combination thereof. In some embodiments, methods for drawing a blood sample (e.g., syringe or finger prick) may be less invasive than procedures for obtaining a tissue biopsy, which may require surgery. The extracted sample may comprise cfDNA and/or ctDNA. For healthy individuals, the human body may naturally clear out cfDNA and other cellular debris. If a subject has cancer or disease, ctDNA in an extracted sample may be present at a detectable level for diagnosis.

[00114] In step 120, a sequencing library is prepared. During library preparation, unique molecular identifiers (UMI) are added to the nucleic acid molecules (e.g., DNA molecules) through adapter ligation. The UMIs are short nucleic acid sequences (e.g., 4-10 base pairs) that are added to ends of DNA fragments during adapter ligation. In some embodiments, UMIs are degenerate base pairs that serve as a unique tag that can be used to identify sequence reads originating from a specific DNA fragment. During PCR amplification following adapter ligation, the UMIs are replicated along with the attached DNA fragment, which provides a way to identify sequence reads that came from the same original fragment in downstream analysis.

[00115] In step 130, targeted DNA sequences are enriched from the library. During enrichment, hybridization probes (also referred to herein as “probes”) are used to target, and pull down, nucleic acid fragments informative for the presence or absence of cancer (or disease), cancer status, or a cancer classification (e.g., cancer type or tissue of origin). For a given workflow, the probes may be designed to anneal (or hybridize) to a target (complementary) strand of DNA or RNA. The target strand may be the “positive” strand (e.g., the strand transcribed into mRNA, and subsequently translated into a protein) or the complementary “negative” strand. The probes may range in length from tens to hundreds or thousands of base pairs. In one embodiment, the probes are designed based on a gene panel to analyze particular mutations or target regions of the genome (e.g., of the human or another organism) that are suspected to correspond to certain cancers or other types of diseases. Moreover, the probes may cover overlapping portions of a target region. By using a targeted gene panel rather than sequencing all expressed genes of a genome, also known as “whole exome sequencing,” the method 100 may be used to increase sequencing depth of the target regions, where depth refers to the count of the number of times a given target sequence within the sample has been sequenced. Increasing sequencing depth reduces required input amounts of the nucleic acid sample. After a hybridization step, the hybridized nucleic acid fragments are captured and may also be amplified using PCR.

[00116] In step 140, sequence reads are generated from the enriched DNA sequences. Sequencing data may be acquired from the enriched DNA sequences by known means in the art. For example, the method 100 may include next-generation sequencing (NGS) techniques including synthesis technology (Illumina), pyrosequencing (454 Life Sciences), ion semiconductor technology (Ion Torrent sequencing), single-molecule real-time sequencing (Pacific Biosciences), sequencing by ligation (SOLiD sequencing), nanopore sequencing (Oxford Nanopore Technologies), or paired-end sequencing. In some embodiments, massively parallel sequencing is performed using sequencing-by-synthesis with reversible dye terminators.

[00117] In some embodiments, the sequence reads may be aligned to a reference genome using known methods in the art to determine alignment position information. The alignment position information may indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read. Alignment position information may also include sequence read length, which can be determined from the beginning position and end position. A region in the reference genome may be associated with a gene or a segment of a gene.

[00118] In various embodiments, a sequence read is comprised of a read pair denoted as R1 and R2. For example, the first read R1 may be sequenced from a first end of a nucleic acid fragment whereas the second read R2 may be sequenced from the second end of the nucleic acid fragment. Therefore, nucleotide base pairs of the first read R1 and second read R2 may be aligned consistently (e.g., in opposite orientations) with nucleotide bases of the reference genome. Alignment position information derived from the read pair R1 and R2 may include a beginning position in the reference genome that corresponds to an end of a first read (e.g., R1) and an end position in the reference genome that corresponds to an end of a second read (e.g., R2). In other words, the beginning position and end position in the reference genome represent the likely location within the reference genome to which the nucleic acid fragment corresponds. An output file having SAM (sequence alignment map) format or BAM (binary) format may be generated and output for further analysis such as variant calling, as described below with respect to FIG. 2.

II.B. EXAMPLE PROCESSING SYSTEM

[00119] FIG. 2 is a block diagram of a processing system 200 for processing sequence reads, according to one example embodiment. The processing system 200 includes a sequence processor 205, sequence database 210, model database 215, machine learning engine 220, models 225, parameter database 230, score engine 235, variant caller 240, and copy number variation (CNV) caller (not pictured). FIG. 3 is a flowchart of a method 300 for determining variants (e.g., a SNP and/or a pre-determined SNP) in a sequencing read from a plurality of sequencing reads, according to one example embodiment. In some embodiments, the processing system 200 performs the method 300 to perform variant calling (e.g., for SNPs) based on input sequencing data. Further, the processing system 200 may obtain the input sequencing data from an output file associated with a nucleic acid sample (e.g., a plurality of nucleic acid fragments isolated from a sample comprising cell-free RNA (cfRNA)) prepared using the method 100 described above. The method 300 includes, but is not limited to, the following steps, which are described with respect to the components of the processing system 200. In other embodiments, one or more steps of the method 300 may be replaced by a step of a different process for generating variant calls, e.g., using Variant Call Format (VCF), such as HaplotypeCaller, VarScan, Strelka, or SomaticSniper.

[00120] The processing system 200 can be any type of computing device that is capable of running program instructions. Examples of processing system 200 may include, but are not limited to, a desktop computer, a laptop computer, a tablet device, a personal digital assistant (PDA), a mobile phone or smartphone, and the like. In one example, when the processing system 200 is a desktop or laptop computer, models 225 may be executed by a desktop application. Applications can, in other examples, be a mobile application or web-based application configured to execute the models 225.

[00121] At step 310, the sequence processor 205 collapses aligned sequence reads of the input sequencing data. In one embodiment, collapsing sequence reads includes using UMIs, and optionally alignment position information from sequencing data of an output file (e.g., from the method 100 shown in FIG. 1) to collapse multiple sequence reads into a consensus sequence for determining the most likely sequence of a nucleic acid fragment or a portion thereof. Since the UMIs are replicated with the ligated nucleic acid fragments through enrichment and PCR, the sequence processor 205 may determine that certain sequence reads originated from the same molecule in a nucleic acid sample. In some embodiments, sequence reads that have the same or similar alignment position information (e.g., beginning and end positions within a threshold offset) and include a common UMI are collapsed, and the sequence processor 205 generates a collapsed read (also referred to herein as a consensus read) to represent the nucleic acid fragment. The sequence processor 205 designates a consensus read as “duplex” if the corresponding pair of collapsed reads have a common UMI, which indicates that both positive and negative strands of the originating nucleic acid molecule are captured; otherwise, the collapsed read is designated “non-duplex.” In some embodiments, the sequence processor 205 may perform other types of error correction on sequence reads as an alternative to, or in addition to, collapsing sequence reads.
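One way to picture the collapsing in step 310 is the short Python sketch below. It is an illustration under stated assumptions rather than the disclosed implementation: reads are grouped only when the UMI and both alignment positions match exactly (the text also allows a threshold offset), all reads in a group are assumed to have equal length, and the consensus is a simple per-position majority vote. The `Read` type and `collapse_reads` function are hypothetical names.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    umi: str    # unique molecular identifier attached during adapter ligation
    start: int  # beginning position in the reference genome
    end: int    # end position in the reference genome
    seq: str    # aligned bases


def collapse_reads(reads):
    """Group reads by (UMI, start, end) and build one consensus per group."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.umi, read.start, read.end)].append(read.seq)

    consensus = {}
    for key, seqs in groups.items():
        # Column-wise majority vote across the reads in this group.
        consensus[key] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


reads = [
    Read("ACGT", 100, 104, "ACGTA"),
    Read("ACGT", 100, 104, "ACGTA"),
    Read("ACGT", 100, 104, "ACCTA"),  # one PCR or sequencing error
    Read("TTTT", 100, 104, "GGGGG"),  # different original molecule
]
collapsed = collapse_reads(reads)
```

In this toy input, the three reads sharing UMI `ACGT` collapse to a single consensus read in which the lone mismatching base is outvoted, while the read with UMI `TTTT` is kept as its own fragment.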

[00122] At step 320, the sequence processor 205 stitches the collapsed reads based on the corresponding alignment position information. In some embodiments, the sequence processor 205 compares alignment position information between a first read and a second read to determine whether nucleotide base pairs of the first and second reads overlap in the reference genome. In one use case, responsive to determining that an overlap (e.g., of a given number of nucleotide bases) between the first and second reads is greater than a threshold length (e.g., threshold number of nucleotide bases), the sequence processor 205 designates the first and second reads as “stitched”; otherwise, the collapsed reads are designated “unstitched.” In some embodiments, a first and second read are stitched if the overlap is greater than the threshold length and if the overlap is not a sliding overlap. For example, a sliding overlap may include a homopolymer run (e.g., a single repeating nucleotide base), a dinucleotide run (e.g., two-nucleotide base sequence), or a trinucleotide run (e.g., three- nucleotide base sequence), where the homopolymer run, dinucleotide run, or trinucleotide run has at least a threshold length of base pairs.
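The stitching decision in step 320 can be sketched as follows. This is a simplified reading: an overlap is treated as a "sliding overlap" when the entire overlapping sequence is a repeat of a 1-, 2-, or 3-base unit, whereas the text only requires such a run to reach a threshold length. The function names and the default `min_overlap` value are assumptions chosen for illustration.

```python
def overlap_length(start1, end1, start2, end2):
    """Number of reference positions shared by two reads (inclusive coordinates)."""
    return max(0, min(end1, end2) - max(start1, start2) + 1)


def is_repeat_run(seq, max_unit=3):
    """True if seq is a pure homopolymer, dinucleotide, or trinucleotide run."""
    for unit in range(1, max_unit + 1):
        if len(seq) >= 2 * unit and all(
            seq[i] == seq[i % unit] for i in range(len(seq))
        ):
            return True
    return False


def should_stitch(overlap_seq, min_overlap=5):
    """Stitch only if the overlap is long enough and is not a sliding overlap."""
    return len(overlap_seq) > min_overlap and not is_repeat_run(overlap_seq)
```

For example, an overlap of `ACGTTGCA` would be stitched under these assumptions, while `AAAAAAAA` or `ACACACAC` would not, because a repeat run can be aligned against itself at several offsets.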

[00123] At step 330, the sequence processor 205 assembles reads into paths. In some embodiments, the sequence processor 205 assembles reads to generate a directed graph, for example, a de Bruijn graph, for a target region (e.g., a gene). Unidirectional edges of the directed graph represent sequences of k nucleotide bases (also referred to herein as “k-mers”) in the target region, and the edges are connected by vertices (or nodes). The sequence processor 205 aligns collapsed reads to a directed graph such that any of the collapsed reads may be represented in order by a subset of the edges and corresponding vertices.
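A minimal sketch of the graph construction in step 330 is given below, following the convention stated above that edges represent k-mers. Each k-mer becomes a directed edge from its (k-1)-base prefix to its (k-1)-base suffix, and edge counts record how many times the k-mer was seen. The function name and the dictionary-of-dictionaries representation are illustrative assumptions.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return graph[prefix][suffix] = number of times the k-mer edge was seen."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the (k-1)-mer prefix vertex to the (k-1)-mer suffix vertex.
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


graph = build_de_bruijn(["ACGTA", "CGTAC"], k=3)
# The edge CG -> GT (k-mer "CGT") is supported by both reads.
```

Edges with low counts are natural candidates for pruning, and a path through the graph that departs from the reference path marks a candidate variant.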

[00124] At step 340, the variant caller 240 identifies sequencing reads that include one or more pre-determined SNPs from the paths assembled by the sequence processor 205. In one embodiment, the variant caller 240 identifies sequencing reads that include one or more predetermined SNPs by comparing a directed graph (which may have been compressed by pruning edges or nodes in step 310) to a reference sequence of a target region of a genome or a reference sequence that includes one or more of the pre-determined SNPs (e.g., sequencing reads obtained from a sequenced UHR or from a sample that includes cfRNA). The variant caller 240 may align edges of the directed graph to the reference sequence and record the genomic positions of mismatched edges and mismatched nucleotide bases adjacent to the edges as the locations of candidate variants. Additionally, the variant caller 240 may identify sequencing reads that include one or more pre-determined SNPs based on the sequencing depth of a target region. In particular, the variant caller 240 may be more confident in identifying sequencing reads that include one or more pre-determined SNPs in target regions that have greater sequencing depth, for example, because a greater number of sequence reads help to resolve (e.g., using redundancies) mismatches or other base pair variations between sequences.

[00125] Further, multiple different models may be stored in the model database 215 or retrieved for application post-training. For example, models may be trained to determine the presence of a contamination event (e.g., contamination of a test sample during process 100 or process 300) and/or verify contamination detection. Further, the score engine 235 may use parameters of the model 225 to determine a likelihood of one or more true positives or contamination in a sequence read. The score engine 235 may determine a quality score (e.g., on a logarithmic scale) based on the likelihood. For example, the quality score is a Phred quality score $Q = -10 \cdot \log_{10} P$, where P is the likelihood of an incorrect candidate variant call (e.g., a false positive). In some embodiments, CNV caller 240 can call copy number variations using a model stored in the model database 215. In one example, CNVs associated with one or more pre-determined SNPs are identified using a model that analyzes the presence or absence of one or more of the pre-determined SNPs. In one example, CNVs associated with cancer are identified using a model that analyzes random sequencing data. In another example, CNVs associated with cancer are identified using a model that analyzes allele ratios at a plurality of heterozygous loci within a region of the genome.
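The Phred relationship between an error likelihood P and a quality score Q can be illustrated with a short Python sketch. The function names are assumptions; only the formula itself comes from the text.

```python
import math


def phred_quality(p_error: float) -> float:
    """Phred quality score Q = -10 * log10(P) for an error likelihood P."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_probability(q: float) -> float:
    """Inverse mapping: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


q = phred_quality(0.001)  # approximately 30: a 1-in-1000 chance of a wrong call
```

Each increase of 10 in Q corresponds to a tenfold drop in the likelihood of an incorrect call, which is why the score is convenient to threshold.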

[00126] At step 350, the score engine 235 scores the identified sequencing reads and/or the pre-determined SNPs based on the model 225 (e.g., the presence or absence of the one or more pre-determined SNPs) or corresponding likelihoods of true positives, contamination, quality scores, etc. Training and application of the model 225 are described in more detail below.

[00127] At step 360, the processing system 200 outputs the identified sequencing reads and/or the pre-determined SNPs. In some embodiments, the processing system 200 outputs some or all of the identified sequencing reads and/or pre-determined SNPs along with the corresponding scores. Downstream systems, e.g., external to the processing system 200 or other components of the processing system 200, may use the pre-determined SNPs and scores for various applications including, but not limited to, predicting the presence of cancer, predicting contamination in test sequences, or predicting noise levels.

II.C. USING PRE-DETERMINED SNPS

[00128] In one aspect this disclosure features methods for identifying contamination in a sample where the method includes: (a) obtaining a plurality of sequencing reads for a plurality of nucleic acid fragments isolated from a sample comprising cell-free RNA (cfRNA); (b) identifying sequencing reads that comprise one or more pre-determined single nucleotide polymorphisms (SNPs) thereby determining an observed allele frequency for each pre-determined SNP in the plurality of sequencing reads, and wherein each of the one or more pre-determined SNPs are selected from: (i) an allele present in a Universal Human Reference (UHR) database; (ii) an allele present in a NCBI dbSNP database (Build 155) that has a reference allele frequency in a range between 0.3 and 0.7; and (iii) a genotyping SNP associated with a sample type; and (c) determining whether the sample is contaminated using the determined contamination probabilities of the one or more pre-determined SNPs. In some embodiments, the methods provided herein further comprise determining a contamination probability for each pre-determined SNP using its observed allele frequency and determining whether the sample is contaminated using the determined contamination probabilities of the one or more pre-determined SNPs.

[00129] In a non-limiting example, FIG. 6 provides a flow diagram illustrating a contamination detection workflow 600. In some embodiments, the workflow 600 is executed on the processing system 200. The detection workflow 600 of this embodiment includes, but is not limited to, the following steps.

[00130] At step 610, sequencing data obtained from a sample (e.g., using the process 300) is cleaned up. For example, data cleaning may include removing a pre-determined SNP with: no coverage, a sequencing depth less than a threshold (e.g., any of the sequence depth thresholds described herein), a high error frequency (e.g., > 0.1%), high variance, and/or a particular genomic location (e.g., when the SNP is present within an intron or other noncoding region).
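The data cleaning in step 610 amounts to filtering a list of per-SNP summaries, as in the sketch below. The record layout (`SnpStats`), the function name, and the default thresholds are assumptions for illustration: the 0.1% error cutoff comes from the example in the text and the depth cutoff of 25 reads is one of the depth thresholds described herein. The "high variance" criterion is omitted because the text gives no threshold for it.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    snp_id: str        # hypothetical identifier
    depth: int         # reads covering the SNP position
    error_rate: float  # observed error frequency at the position
    is_exonic: bool    # False for intronic or other non-coding positions


def clean_snps(snps, min_depth=25, max_error_rate=0.001):
    """Keep SNPs with coverage, adequate depth, low error, and exonic location."""
    return [
        s for s in snps
        if s.depth > 0                      # drop SNPs with no coverage
        and s.depth >= min_depth            # drop shallow SNPs
        and s.error_rate <= max_error_rate  # drop error-prone SNPs (> 0.1%)
        and s.is_exonic                     # drop intronic / non-coding SNPs
    ]
```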

[00131] At step 615, optionally, observed allele frequencies for each of the one or more pre-determined SNPs are determined.

[00132] At step 620, optionally, a contamination probability for each of the one or more pre-determined SNPs is calculated using its observed allele frequency. In some cases, step 620 includes applying a contamination model to the sequencing reads identified as having one or more pre-determined SNPs and an observed allele frequency in the plurality of sequencing reads. In one embodiment, method 600 also includes applying a contamination model that includes performing likelihood tests based, at least in part, on the observed allele frequencies for each of the one or more pre-determined SNPs identified in the sample (see, e.g., FIG. 7). In another embodiment, method 600 also includes applying a contamination model that includes generating a noise model analysis as described herein.

[00133] At step 625, a determination is made whether or not the sample is contaminated using the determined contamination probabilities of the one or more pre-determined SNPs. In one embodiment, at decision step 625, it is determined whether the plurality of sequencing reads is contaminated. If the plurality of sequencing reads has observed allele frequencies at one or more of the pre-determined SNPs that indicate contamination is present, then the sample is contaminated and workflow 600 proceeds to step 630. If the plurality of sequencing reads does not have an observed allele frequency at the one or more pre-determined SNPs that indicates contamination is present, then the sample is not contaminated and workflow 600 ends.
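To make the idea of a likelihood test at steps 620 and 625 concrete, the following Python sketch compares a "contaminated at fraction alpha" hypothesis against a "noise only" hypothesis. It is not the contamination model of this disclosure. It assumes, for illustration only, that (1) every SNP used is one where the test sample itself is homozygous for the reference allele, (2) alternate reads under the null arise at a fixed error rate, (3) under contamination the expected alternate read fraction is (1 - alpha) * error_rate + alpha * p, where p is the population allele frequency, and (4) read counts are binomial. All names, the alpha grid, and the threshold are hypothetical.

```python
import math


def log_lik(alt_reads, depth, rate):
    """Binomial log-likelihood up to a constant that cancels in the ratio."""
    return alt_reads * math.log(rate) + (depth - alt_reads) * math.log(1.0 - rate)


def contamination_llr(snps, alpha, error_rate=1e-3):
    """Sum over SNPs of log L(contaminated at alpha) - log L(noise only).

    snps: iterable of (depth, alt_reads, population_allele_frequency).
    """
    llr = 0.0
    for depth, alt_reads, pop_af in snps:
        contaminated_rate = (1.0 - alpha) * error_rate + alpha * pop_af
        llr += log_lik(alt_reads, depth, contaminated_rate)
        llr -= log_lik(alt_reads, depth, error_rate)
    return llr


def call_contamination(snps, alphas=(0.001, 0.005, 0.01, 0.05, 0.1),
                       llr_threshold=10.0, error_rate=1e-3):
    """Grid-search alpha; call contamination if the best LLR clears a threshold."""
    best_alpha = max(alphas, key=lambda a: contamination_llr(snps, a, error_rate))
    best_llr = contamination_llr(snps, best_alpha, error_rate)
    return best_llr > llr_threshold, best_alpha, best_llr


# Hypothetical data: 20 SNPs at depth 1000 with population frequency 0.5.
clean_sample = [(1000, 1, 0.5)] * 20    # alt reads consistent with noise
mixed_sample = [(1000, 26, 0.5)] * 20   # alt reads well above the noise rate
```

Under these assumptions, `call_contamination(clean_sample)` does not call contamination, while `call_contamination(mixed_sample)` does and selects the grid value closest to the simulated contamination fraction. A production model would also need to handle heterozygous sites, per-SNP error rates, and the contaminant's actual genotype.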

[00134] At step 630, a likely source of contamination is identified. In one embodiment, a genotyping SNP (e.g., a genotyping SNP as described herein, e.g., in Table 1) is used to identify the source of contamination. In another embodiment, contamination is identified based on the prior probabilities of SNPs from known genotypes of other samples that were processed in the same batch as the test sample (or a set of related batches).

III. SELECTING PRE-DETERMINED SINGLE NUCLEOTIDE POLYMORPHISMS

[00135] In one aspect, this disclosure features methods for identifying contamination in a sample where the method includes identifying one or more pre-determined single nucleotide polymorphisms (SNPs) prior to determining contamination. A SNP can be considered a “predetermined SNP” based, at least in part, on its ability to aid in the determination of whether a sample is contaminated. In some embodiments, a pre-determined SNP is selected based on one or more of the following: an allele present in one or more selected databases; or a genotyping SNP associated with a sample type. In some embodiments, a pre-determined SNP is selected based on one or more of the following: (i) an allele present in a universal human reference database; (ii) an allele present in a NCBI dbSNP database (Build 155) that has a reference allele frequency in a range between 0.2 and 0.8 (or any of the subranges therein); and/or (iii) a genotyping SNP associated with a sample type.
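The allele-frequency criterion in item (ii) can be sketched as a simple range filter. The function name, the mapping from SNP identifier to reference allele frequency, and the choice to treat the range as inclusive are assumptions for illustration; the identifiers below are placeholders, not real dbSNP entries.

```python
def select_candidate_snps(reference_af, low=0.2, high=0.8):
    """Return SNP identifiers whose reference allele frequency lies in [low, high]."""
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("require 0 <= low <= high <= 1")
    return [snp for snp, af in reference_af.items() if low <= af <= high]


# Placeholder identifiers with hypothetical reference allele frequencies.
reference_af = {"snp_a": 0.05, "snp_b": 0.35, "snp_c": 0.50, "snp_d": 0.95}
candidates = select_candidate_snps(reference_af)
narrow = select_candidate_snps(reference_af, low=0.4, high=0.6)
```

Common alleles are useful here because two unrelated individuals are likely to differ in genotype at such sites, so foreign reads tend to shift the observed allele frequency away from the values expected for a single individual.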

[00136] In some embodiments, the step of selecting a pre-determined SNP to be included in the contamination detection method occurs prior to obtaining a plurality of sequencing reads for a plurality of nucleic acid fragments isolated from a sample comprising cell-free RNA (cfRNA) or after obtaining the plurality of sequencing reads. In some embodiments, one or more pre-determined SNPs are selected based on the outputs of one or more of the steps related to method 300. For example, a SNP is selected as a pre-determined SNP, based, at least in part, on the sequencing depth determined after step 320. In another example, a SNP is selected, based, at least in part, on the statistical significance associated with the paths assembled in step 330.

[00137] In some embodiments, one or more pre-determined SNPs can be removed/filtered out based, at least in part, on the outputs of one or more of the steps related to the method 300. For example, a SNP is not selected (e.g., removed or filtered out) as a pre-determined SNP based, at least in part, on the sequencing depth determined after step 320. In another example, a SNP is not selected (e.g., removed or filtered out) as a pre-determined SNP based, at least in part, on the statistical significance associated with the paths assembled in step 330.

[00138] Additional criteria can be used to select a SNP as a pre-determined SNP. Non-limiting examples of additional criteria include: observed sequencing depth in previously sequenced samples, low error rates in previously sequenced samples, and genomic location (e.g., a sequencing read including all or a portion of an exonic sequence).

[00139] In some embodiments, the method is premised in part on obtaining sequencing reads (e.g., a sequencing read identified as having one or more pre-determined SNPs) sequenced at sufficient sequencing depth to enable contamination detection. For example, a pre-determined SNP has sufficient sequencing depth when at least 25 sequencing reads (e.g., at least 50 sequencing reads, at least 75 sequencing reads, at least 100 sequencing reads, at least 125 sequencing reads, at least 150 sequencing reads, at least 175 sequencing reads, or at least 200 sequencing reads) map to the genomic location of the pre-determined SNP. In some embodiments, a pre-determined SNP has sufficient sequencing depth when the sample has a sequencing depth of at least 10 reads per million mapped reads (RPM), at least 25 RPM, at least 50 RPM, at least 100 RPM, at least 500 RPM, or at least 1000 RPM in the plurality of sequencing reads (or sample).
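The depth criteria above can be illustrated with a minimal Python sketch. The function names, the default thresholds (25 reads, 10 RPM), and the choice to require both criteria at once are assumptions made for illustration; the disclosure presents the raw-read and RPM thresholds as alternative embodiments.

```python
def reads_per_million(snp_reads: int, total_mapped_reads: int) -> float:
    """Depth at a SNP locus normalized to reads per million mapped reads (RPM)."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return snp_reads * 1_000_000 / total_mapped_reads


def has_sufficient_depth(snp_reads: int, total_mapped_reads: int,
                         min_reads: int = 25, min_rpm: float = 10.0) -> bool:
    """Hypothetical check: the locus must meet both the raw-read and RPM thresholds."""
    if snp_reads < min_reads:
        return False
    return reads_per_million(snp_reads, total_mapped_reads) >= min_rpm
```

For example, 50 reads at a locus in a library of 2,000,000 mapped reads corresponds to 25 RPM, which satisfies both default thresholds in this sketch.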

[00140] As shown in FIG. 4, high error rates correlate with low sequencing depth. FIG. 4 shows 50,000 candidate dbSNPs having wild-type (WT) noncancer expression, sequencing depth between 15 sequencing reads and 150 sequencing reads, and a minor allele frequency (MAF) of 0.3 < MAF < 0.7. Reads with low sequencing depth had higher error rates, including error rates above the assay error rate of between about 10^-4 and about 10^-3 described herein. As such, pre-determined SNPs present at a genomic locus that have a sequencing depth below a threshold (e.g., any of the sequencing depth criteria described herein) are excluded due to high error rates.

[00141] In some embodiments, a pre-determined SNP comprises a low error rate when detected in the plasma cfRNA. A low error rate at a pre-determined SNP enables technical errors arising from or during performance of the assay to be distinguished from trace contamination events.

[00142] In some embodiments, a pre-determined SNP is present in an exon. In some embodiments, a sequencing read identified as having one or more pre-determined SNPs is excluded if the sequencing read does not include all or a portion of an exonic sequence. In some embodiments, a sequencing read identified as having one or more pre-determined SNPs and including all or a portion of an exonic sequence results in greater statistical significance being assigned to paths assembled in step 330. In some embodiments, a sequencing read identified as having one or more pre-determined SNPs is given greater weight (e.g., a contamination model is adjusted to weight the presence of the pre-determined SNP more heavily) if the sequencing read includes all or a portion of an exonic sequence (e.g., an exon-exon junction).

[00143] In some embodiments, one or more of the pre-determined SNPs do not include SNPs having a conversion type comprising: A>G; T>C; C>T; or G>A. Conversion types including A>G; T>C; C>T; or G>A can be difficult to differentiate from low-level contamination events (see, e.g., FIGs. 5A-5B). In some embodiments, a pre-determined SNP having a conversion type comprising A>G; T>C; C>T; or G>A is removed/filtered out after being selected as a pre-determined SNP but before a contamination probability is determined. In some embodiments, target SNP error rates are between 10^-4 and 10^-3. For example, FIG. 5A shows greater error rates (y-axis) for A>G; T>C; C>T; or G>A conversion types (x-axis) when analyzing SNPs from whole transcriptome data. In another example, FIG. 5B shows error rate (y-axis) for A>G; T>C; C>T; or G>A conversion types (x-axis) when analyzing SNPs from targeted panels.

[00144] In some embodiments, the step of selecting one or more pre-determined SNPs to be included in the contamination detection method includes determining whether the one or more pre-determined SNPs enable a contamination limit of detection (LoD) approaching the assay error rate. In some embodiments, the assay error rate is between about 10^-4 and about 10^-3 (or any of the subranges therein). In some embodiments, the contamination LoD should be about 12 / effective coverage (e.g., number of sequencing reads mapping to the genomic locations of the SNPs). In some embodiments, determining the contamination LoD includes determining how many pre-determined SNPs are needed to detect contamination. Determining how many pre-determined SNPs are needed to detect contamination can include, without limitation: determining LoD ≈ 3 / (0.5 * 0.5 * total sampling events), where the first 0.5 is the fraction of pre-determined SNPs that are homozygous SNPs and the second 0.5 is the fraction of pre-determined SNPs that will have the opposite haplotype in the contaminating sample; determining effective coverage = number of SNPs * mean depth; determining LoD ≈ 3 / (0.25 * effective coverage); and/or determining the number of SNPs ≈ 3 / (0.25 * LoD * mean depth).
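The LoD relationships in the preceding paragraph can be worked through with a short Python sketch. The function and constant names are hypothetical; the constants (3, 0.5, 0.5) are taken from the text, and rounding up to a whole number of SNPs is an assumption added here.

```python
import math

MIN_EVENTS = 3             # approximate numerator used in the text
HOMOZYGOUS_FRACTION = 0.5  # fraction of pre-determined SNPs that are homozygous
OPPOSITE_FRACTION = 0.5    # fraction with the opposite haplotype in the contaminant
INFORMATIVE = HOMOZYGOUS_FRACTION * OPPOSITE_FRACTION  # 0.25


def effective_coverage(num_snps: int, mean_depth: float) -> float:
    """Effective coverage = number of SNPs * mean depth."""
    return num_snps * mean_depth


def contamination_lod(num_snps: int, mean_depth: float) -> float:
    """LoD ~ 3 / (0.25 * effective coverage), i.e. about 12 / effective coverage."""
    return MIN_EVENTS / (INFORMATIVE * effective_coverage(num_snps, mean_depth))


def snps_needed(target_lod: float, mean_depth: float) -> int:
    """Number of SNPs ~ 3 / (0.25 * LoD * mean depth), rounded up (assumption)."""
    raw = MIN_EVENTS / (INFORMATIVE * target_lod * mean_depth)
    # Round first so floating-point noise does not push the ceiling up by one.
    return math.ceil(round(raw, 9))
```

Under these formulas, 1,000 SNPs at a mean depth of 100 reads give an effective coverage of 100,000 and an LoD of about 1.2 * 10^-4, which falls within the assay error rate range stated above.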

III. A. PRE-DETERMINED SNPS INCLUDING UNIVERSAL HUMAN REFERENCE ALLELES

[00145] In some embodiments, one or more pre-determined SNPs include an allele present in a universal human reference database. In some embodiments, a universal human reference (UHR) includes a plurality of nucleic acid fragments isolated from common human cell lines. Non-limiting examples of commercially available UHRs include those from Agilent, Thermo Fisher, Stratagene, and Clontech. One or more of the exemplary UHRs described herein includes cell lines selected from: adenocarcinoma (e.g., mammary gland); melanoma; hepatoblastoma (e.g., liver); liposarcoma; adenocarcinoma (e.g., cervix); histiocytic lymphoma (e.g., macrophages and histiocytes); embryonal carcinoma (e.g., testis); lymphoblastic leukemia (e.g., T lymphoblasts); glioblastoma (e.g., brain); plasmacytoma (e.g., myeloma and B-lymphocyte).

[00146] In one embodiment, an allele present in a UHR is selected as a pre-determined SNP based, at least in part, on an allele frequency considered to be homozygous. For example, an allele present in a UHR is selected as a pre-determined SNP based, at least in part, on an allele frequency greater than 0.75 in a UHR. In some embodiments, an allele present in a UHR is selected as a pre-determined SNP based, at least in part, on the SNP having an allele frequency considered to be homozygous in a UHR and the SNP having an allele frequency considered not to be homozygous in a human sample (e.g., a previously sequenced human sample). For example, an allele present in a UHR is selected as a pre-determined SNP based, at least in part, on an allele frequency of at least 0.75 (e.g., a homozygous frequency) in a UHR and an allele frequency of 0.05 or less (e.g., a non-homozygous frequency) in a human sample.

[00147] In some embodiments, UHR allele frequencies are determined empirically by sequencing UHR samples and/or human plasma samples.
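As an illustration of the UHR selection criteria, the following Python sketch keeps alleles that look homozygous in the UHR and are essentially absent from a human sample. The function name, argument names, and the example record layout are hypothetical; the thresholds (at least 0.75 in the UHR, 0.05 or less in the human sample) come from the example above.

```python
def is_uhr_marker(uhr_af: float, human_af: float,
                  min_uhr_af: float = 0.75, max_human_af: float = 0.05) -> bool:
    """True when the allele is near-homozygous in the UHR and rare in the human sample."""
    return uhr_af >= min_uhr_af and human_af <= max_human_af


# Hypothetical empirically measured allele frequencies: (SNP id, UHR AF, human AF).
candidates = [
    ("snp_a", 0.98, 0.01),
    ("snp_b", 0.80, 0.30),
    ("snp_c", 0.50, 0.02),
]
selected = [snp_id for snp_id, uhr_af, human_af in candidates
            if is_uhr_marker(uhr_af, human_af)]
```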

[00148] Non-limiting examples of one or more pre-determined SNPs having an allele present in a UHR are provided in Table 1.

III. B. PRE-DETERMINED SNPS INCLUDING NCBI DBSNP ALLELES

[00149] In some embodiments, one or more pre-determined SNPs include an allele present in the National Center for Biotechnology Information’s (NCBI) Single Nucleotide Polymorphism Database (“dbSNP”) (e.g., dbSNP Build 155). The NCBI dbSNP database includes greater than 500 million SNPs compiled from various sources, which are vetted by NCBI before being placed into the dbSNP.

[00150] In some embodiments, an allele present in the NCBI dbSNP database is selected as a pre-determined SNP based, at least in part, on having a reference allele frequency in a range between 0.2 and 0.8. In some embodiments, an allele present in the NCBI dbSNP database is selected as a pre-determined SNP based, at least in part, on having a reference allele frequency between 0.3 and 0.7. In some embodiments, an allele present in the NCBI dbSNP database is selected as a pre-determined SNP based, at least in part, on having a reference allele frequency between 0.4 and 0.6.

[00151] In some embodiments, an allele present in the NCBI dbSNP database is selected as a pre-determined SNP based, at least in part, on allele frequency comprising a MAF, a VAF, sequencing depth, or any combination thereof. For example, an allele present in the NCBI dbSNP database is selected as a pre-determined SNP based, at least in part, on having a MAF in a range between 0.3 and 0.7, or optionally in a range between 0.4 and 0.6.
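A dbSNP-based selection step of this kind can be sketched in Python as a simple filter. The function name and signature are hypothetical, the range bounds are treated as inclusive (an assumption), and the exclusion of A>G, T>C, C>T, and G>A conversion types follows the embodiments described in this section.

```python
# Conversion types that the disclosure describes as hard to tell apart
# from low-level contamination events.
EXCLUDED_CONVERSIONS = {("A", "G"), ("T", "C"), ("C", "T"), ("G", "A")}


def is_candidate_snp(ref: str, alt: str, ref_allele_freq: float,
                     low: float = 0.2, high: float = 0.8) -> bool:
    """Keep a SNP whose reference allele frequency lies in [low, high]
    and whose conversion type is not one of the excluded types."""
    if (ref.upper(), alt.upper()) in EXCLUDED_CONVERSIONS:
        return False
    return low <= ref_allele_freq <= high
```

Passing `low=0.3, high=0.7` or `low=0.4, high=0.6` gives the narrower ranges mentioned in the other embodiments.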

[00152] In some embodiments, one or more SNPs that are present in the dbSNP database are not used as a pre-determined SNP because the SNP is a conversion type comprising: A>G; T>C; C>T; or G>A (see, e.g., FIGs. 5A-5B). In some cases, these types of conversions can be difficult to differentiate from low-level contamination events and so SNPs that match these conversion types can be excluded. In some embodiments, a pre-determined SNP present in the dbSNP database having a conversion type comprising A>G; T>C; C>T; or G>A is removed/filtered out after being selected as a pre-determined SNP but before a contamination probability is determined.

[00153] Non-limiting examples of a pre-determined SNP having an allele present in the dbSNP database where the allele has a reference allele frequency in a range between 0.3 and 0.7 are provided in Table 2.

Table 2. cfRNA Contamination SNPs

III. C. GENOTYPING SNPS

[00154] In some embodiments, one or more pre-determined SNPs include a genotyping SNP. Genotyping SNPs are SNPs associated with a particular sample or sample type and therefore can be used to differentiate samples.

[00155] In some embodiments, an allele is selected as a pre-determined SNP based, at least in part, on the SNP’s ability to provide genotype information across samples (e.g., samples prepared with different assays).

[00156] Non-limiting examples of a pre-determined SNP that can be used as a genotyping SNP are provided in Table 3.

IV. ANALYTICAL VALIDATION TO DETERMINE LIMIT OF DETECTION FOR METHODS USING PRE-DETERMINED SNPS

[00157] To determine the limit of detection (LOD) of contamination detection workflow 600, different contamination levels of cfRNA (“cfRNA spike-ins”) and UHR (“UHR spike-ins”) ranging from 5% down to 0.01% by mass (see, e.g., FIGs. 8A-8B) were mixed into background cfRNA. Limit of detection was assessed using maximum likelihood estimation of contamination fraction (i.e., at step 620 in FIG. 6 a maximum likelihood estimation was used). Here, the limit of detection is considered to be the lowest contamination level at which the specificity is above 95%.

[00158] FIG. 9A is a plot showing the analytical validation for limit of detection for cfRNA contamination using the detection methods described herein. Plot 910 shows a best fit line 920 of the detection rate obtained at each cfRNA spike-in level (see, e.g., FIG. 9A numeral 920 having Adj R^2 = 0.9261, p = 5.728e-45). FIG. 9B shows that the limit of detection of cfRNA spike-ins using detection workflow 600 (and as shown in FIG. 8A) was a 0.5% contamination level.

[00159] FIG. 10A is a plot showing the analytical validation for limit of detection of UHR contamination using the detection methods described herein. Plot 1010 shows a best fit line 1020 of the detection rate obtained at each UHR spike-in level (see, e.g., FIG. 10A numeral 1020 having Adj R^2 = 0.9562, p = 7.803e-23). FIG. 10B shows that the limit of detection of UHR spike-ins using detection workflow 600 (and as shown in FIG. 8A) was a 0.5% contamination level.

[00160] Limit of detection for detection workflow 600 (e.g., Step 620) can also be measured using a robust linear regression model for contamination detection (see, e.g., PCT/IB2018/050979, which is incorporated herein by reference in its entirety).

V. VALIDATION OF CONTAMINATION DETECTION USING PRE-DETERMINED SNPS AND LIKELIHOOD TESTS

[00161] Detection workflow 600 using maximum likelihood estimation for contamination probability determinations (i.e., at step 620 in FIG. 6 a maximum likelihood estimation was used) was validated using a three-step process. FIG. 11 illustrates an example of a method 1100 for validating contamination detection workflow (e.g., workflow 600 or 700). Validation method 1100 may include, but is not limited to, the following steps.

[00162] At a step 1100, a background noise baseline for each SNP is generated using a set of normal training samples (e.g., 80 normal, uncontaminated samples). The noise baseline provides an estimate of the expected noise for each SNP and is used to distinguish a contamination event from a background noise signal. Generation of a noise (contamination) baseline is described in more detail in PCT/US2018/039609, which is incorporated herein by reference in its entirety.

[00163] At a step 1115, a 5-fold cross-validation process is performed. For example, datasets of 24 normal samples and in silico titrations are partitioned into a validation set and a training set. Here, the contamination levels range from 0.05% to 50%. The training set is used to train detection method 600 and set a threshold for calling a contamination event versus normal background noise. That is, detection method 600 can include a different threshold for each threshold and repeat of an SNP. The threshold is then tested on the validation set. This process is repeated a total of 10 times to identify a final threshold and LOD for calling a contamination event.

[00164] At a step 1120, the final threshold and LOD are tested on a real dataset (e.g., a cfDNA dataset from cancer patient samples).
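The cross-validation at step 1115 can be sketched in Python as follows. This is a minimal illustration, not the validated procedure: the function names are hypothetical, each sample is assumed to be summarized by a single contamination score, and the threshold rule (the highest score seen among normal training samples) is an assumption chosen for simplicity. The sketch requires at least as many normal samples as folds.

```python
def k_fold_indices(n_samples: int, k: int = 5):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        valid = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, valid
        start = stop


def pick_threshold(normal_scores):
    """Assumed rule: call contamination above the highest score seen in normals."""
    return max(normal_scores)


def cross_validate(normal_scores, contaminated_scores, k: int = 5):
    """Return one (threshold, false_positive_rate, detection_rate) tuple per fold."""
    results = []
    for train, valid in k_fold_indices(len(normal_scores), k):
        threshold = pick_threshold([normal_scores[i] for i in train])
        held_out = [normal_scores[i] for i in valid]
        fpr = sum(s > threshold for s in held_out) / len(held_out)
        detected = sum(s > threshold for s in contaminated_scores)
        results.append((threshold, fpr, detected / len(contaminated_scores)))
    return results
```

Repeating this procedure with different partitions, as the paragraph above describes, would give a distribution of thresholds from which a final threshold could be chosen.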

[00165] FIGs. 12A-D show a workflow (FIG. 12A) and a plot (FIG. 12B) showing preliminary in silico validation of the detection method workflow 600 using whole transcriptome data of plasma from two individuals titrated with background plasma at 0%, 0.01%, 0.05%, 0.1%, 0.5%, 1% and 5%. Observed allele frequencies were determined for sequencing reads identified as having one or more pre-determined single nucleotide polymorphisms (SNPs). Contamination probability was determined using maximum likelihood estimation using the methods described herein and described in PCT/US2018/039609, which is incorporated herein by reference in its entirety.

[00166] FIG. 12C and FIG. 12D show that contamination fraction estimates with small panels correlate better with average log likelihood (predicting the presence of contamination in a sample) than the same correlation calculation when analyzing SNPs from whole transcriptome data.

VI. DETECTING CONTAMINATION USING LIKELIHOOD TESTS

[00167] In one embodiment, a method for identifying contamination in a sample includes applying at least one likelihood test (i.e., a contamination model) to the sequencing reads. In one embodiment, a method for identifying contamination in a sample includes applying at least one likelihood test (i.e., a contamination model) to the sequencing reads identified as having one or more pre-determined SNPs and an observed allele frequency in the plurality of sequencing reads. Exemplary methods for using likelihood tests for contamination detection are described in PCT/US2018/039609, which is incorporated herein by reference in its entirety.

[00168] In some embodiments, one or more likelihood tests are applied to a sequencing read of the plurality of sequencing reads using the associated contamination probability. In such cases, each likelihood test is used to obtain a current contamination probability that is indicative of whether the sequencing reads are contaminated. In one embodiment, each likelihood test is used to obtain a confidence score representing a measure of the predicted contamination in the sequencing reads.

[00169] In one embodiment, a method of identifying contamination in a sample that includes applying at least one likelihood test (e.g., a contamination model) further includes a step of determining that the sequencing reads are contaminated based on the current contamination probability of the at least one likelihood test being above a threshold associated with the at least one likelihood test.

[00170] In one embodiment, a method of identifying contamination in a sample that includes applying at least one likelihood test (e.g., a contamination model) further includes a step of determining that the sequencing reads are contaminated based on the current contamination probability of at least two likelihood tests being above a threshold associated with the at least two likelihood tests. In such cases, the threshold for each likelihood test can be the same. In other cases, the threshold for each likelihood test can be different.

[00171] In one embodiment, the at least one likelihood test maximizes a likelihood function, the likelihood function proportional to the probability of an event occurring in a data set given a variable.

[00172] In one embodiment, applying the at least one likelihood test of the contamination model comprises: comparing a set of generated contaminated sequencing reads to a set of previously obtained non-contaminated sequencing reads to determine the contamination probability.

[00173] In one embodiment, applying at least one likelihood test of the contamination model comprises: generating a null hypothesis representing that the sequencing reads are not contaminated; generating a set of contamination hypotheses representing that the sequencing reads are contaminated, wherein each contamination hypothesis of the set of contamination hypotheses is contaminated at a different contamination level; applying a likelihood ratio test between the set of contamination hypotheses and the null hypothesis, the likelihood ratio test to obtain the current contamination probability.

[00174] In one embodiment, applying the at least one likelihood test of the contamination model comprises: comparing a set of generated contaminated sequencing reads to an average of previously obtained sequencing reads to determine the contamination probability, the contamination probability associated with the likelihood that the sequencing reads are contaminated at a contamination level.

[00175] In one embodiment, applying at least one likelihood test of the contamination model comprises: generating a set of contamination hypotheses representing that the sequencing reads are contaminated, wherein each contamination hypothesis of the set of contamination hypotheses is contaminated at a different contamination level; generating a null hypothesis representing the mean minor allele frequency at a contamination level for a plurality of previously obtained sequencing reads, wherein the contamination level is associated with the contamination hypothesis most likely to be contaminated; and applying a likelihood ratio test between the set of contamination hypotheses and the null hypothesis, the likelihood ratio test to obtain the current contamination probability.

[00176] In some embodiments, it is important to be able to distinguish between contamination and noise. As noted above, processing system 200 can be used to detect contamination in a test sample. For example, using the contamination detection workflow 700 a contamination event can be detected based on a plurality (or set) of observed variant allele frequencies in a test sample. In one embodiment, the observed variant allele frequencies can be compared to population MAFs from a plurality of SNPs for the detection of cross-sample contamination.

[00177] In a non-limiting example, FIG. 7 illustrates a flow diagram illustrating a contamination detection workflow 700. The detection workflow 700 of this embodiment includes, but is not limited to, the following steps.

[00178] At step 710, sequencing data obtained from a sample (e.g., using the process 300) is cleaned up. In some embodiments, data cleaning may include removing pre-determined SNPs with no-calls (e.g., no coverage), a sequencing depth less than a threshold (e.g., any of the sequencing depth thresholds described herein), high error frequencies (e.g., > 0.1%), high variance, and/or low coverage. In other examples, homozygous alternative SNPs with variant frequency 0.8 to 1.0 can be negated (e.g., variant frequency 0.95 becomes 0.05) in order to put all the variant frequency data in one scale that can be linearly compared to minor allele frequency values. Further, the MAF values can be negated based on a sample’s genotype.

[00179] At step 715, optionally, observed allele frequencies for each of the one or more pre-determined SNPs are determined.
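The data cleaning described for step 710 can be sketched in Python as shown below. The record layout (dictionaries with `depth`, `error_rate`, and `vf` keys) and the function names are hypothetical; the 0.1% error cutoff, the 0.8 to 1.0 negation range, and the 0.95 to 0.05 example come from the text, while the depth cutoff of 25 reads is one of the thresholds listed earlier in the disclosure.

```python
def negate_variant_frequency(vf: float, hom_alt_min: float = 0.8) -> float:
    """Map homozygous-alternative frequencies (0.8 to 1.0) onto the minor-allele scale."""
    if not 0.0 <= vf <= 1.0:
        raise ValueError("variant frequency must be between 0 and 1")
    return 1.0 - vf if vf >= hom_alt_min else vf


def clean_snps(records, min_depth: int = 25, max_error: float = 0.001):
    """Drop no-calls, low-depth SNPs, and high-error SNPs; negate the remainder."""
    kept = []
    for rec in records:
        if rec["depth"] is None or rec["depth"] < min_depth:
            continue  # no-call (no coverage) or insufficient depth
        if rec["error_rate"] > max_error:
            continue  # error frequency above 0.1%
        kept.append({**rec, "vf": negate_variant_frequency(rec["vf"])})
    return kept
```

The variance-based filter mentioned in the text is omitted here because it needs per-SNP statistics across samples.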

[00180] At step 717, optionally, a contamination probability for each pre-determined SNP is determined using the observed allele frequency for each pre-determined SNP. In one example, a prior probability of contamination is calculated for each SNP based on the host sample’s genotype and minor allele frequency.

[00181] At step 720, a likelihood model including a maximum likelihood estimation is applied to determine contamination based on the probability of contamination for the predetermined SNPs. The likelihood model includes a first and a second likelihood test as described herein.

[00182] At a decision step 725, it is determined whether the test sample is contaminated. If a test sample passes both likelihood tests, then the sample is contaminated and workflow 700 proceeds to a step 730. If a test sample does not pass both likelihood tests, then the sample is not contaminated and workflow 700 ends.

[00183] At step 730, a likely source of contamination is identified based on the prior probabilities of SNPs from known genotypes of other samples that were processed in the same batch as the sample (or a set of related batches).

[00184] In one embodiment, method 700 is executed according to workflow 1300. For example, FIG. 13 provides a diagram of a contamination detection workflow 1300 executing on the processing system 200 for detecting and calling contamination, in accordance with applying at least one likelihood test (i.e., a contamination model).

[00185] In the illustrated example, contamination detection workflow 1300 includes a single sample component 1310, a baseline batch component 1320, and an optional loss of heterozygosity (LOH) batch component 1330. Single sample component 1310 of contamination detection workflow 1300 is informed, for example, by the contents of a single variant call file 1312 and a minor allele frequencies (MAF) variant call file 1314 called by the variant caller 240. The single variant call file 1312 is the variant call file for a single target sample. The MAF variant call file 1314 is the MAF variant call file for any number of SNP population allele frequencies AF.

[00186] Baseline batch component 1320 of contamination detection workflow 1300 generates a background noise baseline for each SNP from uncontaminated samples as another input to single sample component 1310. Generating a background noise baseline using a contamination noise baseline workflow is described in more detail in regard to FIG. 13. Baseline batch component 1320 is informed, for example, by the contents of multiple variant call files 1322 called by the variant caller 240. The multiple variant call files 1322 can be the variant call files of multiple samples.

[00187] LOH batch component 1330 of contamination detection workflow 1300 determines a LOH in samples as another input to the single sample component 1310. LOH batch component 1330 is informed, for example, by the contents of LOH call files 1332. The LOH call files are call files for a plurality of alleles previously determined to include SNPs with LOH in the sample. The LOH call files can be called by the variant caller 240 and stored in the sequence database 210.

[00188] In one embodiment, the contamination detection workflow 1300 can generate output files 1340 and/or plots 1342 from sequencing data processed by contamination detection algorithm 110. For example, contamination detection workflow 1300 may generate log-likelihood data and/or display log-likelihood plots 1342 as a means for evaluating a DNA test sample for contamination. Data processed by contamination detection workflow 1300 can be visually presented to the user via a graphical user interface (GUI) 1350 of the processing system 200. For example, the contents of output files 1340 (e.g., a text file of data opened in Excel) and log-likelihood plots 1342 can be displayed in GUI 1350.

[00189] In another embodiment, the contamination detection workflow 1300 may use the machine learning engine 220 to improve contamination detection. Various training datasets (e.g., parameters from parameter database 230, sequences from sequence database 210, etc.) may be used to supply information to the machine learning engine 220 as described herein. In accordance with this embodiment, the machine learning engine 220 may be used to train a contamination noise baseline to identify a noise threshold, detect loss of heterozygosity, and determine the limit of detection (LOD) for contamination detection.

[00190] Single sample component 1310 of contamination detection workflow 1300 is, for example, a runnable script that is used to estimate contamination in a sample. By contrast, baseline batch component 1320 of contamination detection algorithm 110 is, for example, a runnable script that is used for generating estimates across a batch of samples, and may also be used to generate the noise model across these samples (if the input batch is healthy). Similarly, LOH batch component 1330 of the contamination detection model is, for example, a runnable script that is used for generating estimates across a batch of samples, and may be used to determine the LOH in single samples based on the generated estimates.

[00191] In one embodiment, the contamination detection workflow 1300 may be based on a model for estimating contamination. In one embodiment, the model is a maximum likelihood model (herein referred to as the likelihood model) for detecting contamination in sequencing data from a sample. However, in other examples, the model can be any other estimation model such as an M-estimator, maximum spacing estimation, method of support, etc.

[00192] In one example, the likelihood model determines contamination by calculating the probability of observing a MAF of a sample at a given contamination level α and, subsequently, determining if the sample is contaminated. In some embodiments, the likelihood model is informed by prior probabilities of contamination that are first calculated for each pre-determined SNP in the sample based on the genotype of previously observed contaminated samples.

[00193] Further, the contamination detection workflow 1300 can, in some cases, determine the likely source of contamination for the observed sample. That is, the likelihood model can compare sequencing data from several contaminated samples to determine a source of contamination. The likelihood model can be informed by prior probabilities of contamination from other samples with a known genotype to identify a likely source of contamination. In some embodiments, genotype is determined by identifying sequencing reads that have a pre-determined genotyping SNP.

VI. A. PROBABILITY OF CONTAMINATION FOR A SINGLE PRE-DETERMINED SNP

[00194] The contamination detection workflow 1300 determines a probability that a sample is contaminated using prior probabilities and observed sequencing data (FIG. 13). In some examples, the observed sequencing data can be included in a sample call file (such as single variant call file 1312), optionally a LOH call file (such as LOH call file 1332), and optionally a population call file (such as MAF call file 1314). The prior probabilities of contamination can be determined based on the observed sequencing data. Here, for purpose of example, the probability of contamination for a single pre-determined SNP is based on a sample’s minor allele frequency (MAF) and the error rate of previously observed homozygous SNPs. In some embodiments, the contamination detection workflow 1300 can additionally or alternatively use, for example, alternate allele frequency, noise rates, and read depths to determine a contamination probability.

[00195] Contamination detection workflow 1300 compares the probability of observing data in the plurality of sequencing reads using two different models. In one model, there is no contamination and any sequencing reads with alternative alleles at the site are either the result of noise in the plurality of sequencing reads or of heterozygosity of the plurality of sequencing reads at a site of a pre-determined SNP. In the other model, there is contamination of the sample and sequencing reads with alternative alleles can be the result of correctly reading a contaminating cfRNA strand. In this context, contamination detection workflow 1300 calculates a ratio between the likelihood the sample is contaminated and the likelihood the sample is uncontaminated using the two models. Based on the ratio, contamination detection workflow can determine if the sample is contaminated or uncontaminated.

[00196] In one embodiment, the probability of contamination at a single pre-determined SNP site for a given set of data D is calculated as:

$$P(\alpha \mid D) = P(D \mid \alpha) \cdot P(\alpha) \qquad (1)$$

where P(α|D) is the probability of observing the contamination level α given the data D, P(D|α) is the probability of observing the data given the contamination level α, and P(α) is the probability of the contamination level α. Therefore, in an example where there is no contamination in the sample, the probability of contamination in a sample can be represented as:

$$P(\alpha = 0 \mid D) = P(D \mid \alpha = 0) \cdot P(\alpha = 0) \qquad (2)$$

where α = 0 indicates that the contamination level α is 0.0%.

[00197] In one embodiment, in samples where the contamination level is non-zero, the probability of observing data D with a contamination level α (P(D|α)) is further based on the genotype of the contaminant G_C and the genotype of the host G_H (the source of the test sample). That is, the probability of observing data D given a contamination level α can be represented as:

$$P(D \mid \alpha) = \sum_{G_C} \sum_{G_H} P(G_C) \cdot P(G_H) \cdot P(D \mid p) \qquad (3)$$

where P(G_C) is the probability that the contamination at the pre-determined SNP site will be the type associated with the genotype of the contaminant at that site, P(G_H) is the probability that the contamination at the site will be the genotype of the host at that site, and P(D|p) is the probability of observing the data D given a set of characteristics p. Here, the set of characteristics p includes the probability of a SNP mutation ε for the pre-determined SNP site and the contamination level α but can include any other characteristics of the sample. The summation over the genotypes indicates that the probability of observing data at a contamination level α includes contributions based on the three possible genotypes of the contaminant and host (A/A, A/B, and B/B).

[00198] For a given pre-determined SNP the probability of observing the data at a given contamination level alpha can be represented with a generic site specific model. The generic site specific model can be represented as:

$$P(D \mid \alpha) = \ldots \cdot P(BB_{cont}) \cdot P(p = \varepsilon + \alpha) + P(BB_{host}) \cdot P(BB_{cont}) \cdot P(p = \varepsilon) \qquad (4)$$

where AA is a homozygous reference allele, AB is a heterozygous allele, BB is a homozygous alternative allele, the subscript “host” represents the genotype of the host G_H, the subscript “cont” represents the genotype of the contaminant, ε is the probability of observing a specific mutation, and α is the contamination level.

[00199] In some cases, the generic site specific model can be modeled with a binomial distribution. For example, for a specific case from the generic site specific model, the probability of observing the data D at a given contamination level α can be represented as:

where “binomial” is the binomial probability of observing the data based on the depth DP and the minor allele depth MAD of the test sample, the genotype of the host (A/A), the genotype of the contaminant (A/B), the contamination level α, and the probability ε of observing a specific error or mutation.
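As a minimal sketch of this binomial case, the Python below evaluates the probability of observing MAD alternative reads out of DP total reads for a homozygous reference host (A/A) and a heterozygous contaminant (A/B). The expected alternative-allele fraction of ε + α/2 is an assumption made for illustration (half of the contaminant's alleles would carry B); the function and parameter names are hypothetical and are not taken from the disclosure.

```python
from math import comb


def binomial_pmf(successes: int, trials: int, p: float) -> float:
    """Probability of exactly `successes` successes in `trials` independent trials."""
    return comb(trials, successes) * (p ** successes) * ((1.0 - p) ** (trials - successes))


def site_probability_aa_host_ab_contaminant(
    depth: int, minor_allele_depth: int, alpha: float, epsilon: float
) -> float:
    """P(D | alpha) for host A/A and contaminant A/B at one SNP site.

    Assumption (not from the source): a heterozygous contaminant present at
    level alpha contributes the B allele at alpha / 2, and sequencing error
    or mutation adds epsilon on top of that.
    """
    expected_fraction = epsilon + alpha / 2.0
    return binomial_pmf(minor_allele_depth, depth, expected_fraction)
```

Under these assumptions, a site with one alternative read in 100 is better explained by a 2% contamination level than by error alone when ε is 0.001.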

[00200] The generic site specific model can be simplified using prior probabilities of contamination. The simplified model can be represented as:

$$P(D \mid \alpha) = P_C \cdot P(D \mid \alpha, C) + (1 - P_C) \cdot P(D \mid \alpha = 0, !C) \qquad (6)$$

where P_C is the probability of contamination of the sample based on a prior observation of a contaminant with a genotype different from the host genotype (C), P(D|α,C) is the probability of observing the data D with a contamination level α given the SNP is contaminated, (1 − P_C) is the probability of no contamination, and P(D|α=0, !C) is the probability of observing data D with a contamination level α of 0% (i.e., no contamination, denoted as !C).

[00201] Alternatively stated, Pc is the probability that an SNP at a site is contaminated with a contaminant of a different allele type than the host given a contamination level a. In one example, the simplified model determines the prior probability of contamination Pc using the following:

$$P_C = \begin{cases} 1 - (1 - \mathrm{MAF})^2 & \text{if host is A/A} \\ 1 - \mathrm{MAF}^2 & \text{if host is B/B} \end{cases}$$

where MAF is the minor allele frequency, A/A is a homozygous reference allele, and B/B is a homozygous alternative allele. Here, heterozygous alleles are removed and are not considered in determining the probability of contamination for a sample.

VI.B PROBABILITY OF CONTAMINATION FOR A SAMPLE

[00202] As previously described, in one embodiment, the contamination detection workflow 1300 uses a likelihood model to determine contamination in a sample. Here, to determine contamination in a sample, the likelihood model determines a level of contamination α that maximizes a likelihood function L(α). The likelihood function L(α) can be written as:

$$L(\alpha) = \prod_{i=1}^{N} \max\left(P(D_i \mid \alpha),\ \beta\right) \qquad (7)$$

where P(D_i|α) is the probability of observing data D_i given contamination level α, β is a minimum allowable probability, N is the number of homozygous (A/A or B/B) SNPs of the sample, and D_i is the observed data for a given pre-determined SNP.

[00203] The likelihood function L(α) is proportional to the probability of observing data D given a contamination level α (P(D|α)). The probability of the data D given a contamination level α takes into account all pre-determined SNPs of the sample. That is, L(α) is the product, over each pre-determined SNP in the sample, of the larger of the probability of the data at that pre-determined SNP given the contamination level α (P(D_i|α)) and a minimum value β. For each pre-determined SNP, if the probability of the data D given a contamination level α is below a threshold, the probability for that pre-determined SNP can be assigned the value β. The value β is a minimum probability that is set as a black swan term (e.g., β = 3.3 × 10^-7), which limits the lowest value each evaluated pre-determined SNP can contribute to the likelihood function L(α). The probability of contamination at a single pre-determined SNP site (P(D_i|α)) is described in more detail in Section V.A.
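The per-SNP prior, the simplified site model, and the floored likelihood described above can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the per-site probabilities are supplied by the caller rather than derived from reads, and summing logarithms instead of multiplying probabilities is a numerical choice made here to avoid underflow, not something stated in the disclosure.

```python
import math

BLACK_SWAN_FLOOR = 3.3e-7  # example minimum probability quoted in the text


def prior_contamination_probability(maf: float, host_genotype: str) -> float:
    """P_C: probability that a contaminant differs from a homozygous host."""
    if host_genotype == "A/A":
        return 1.0 - (1.0 - maf) ** 2
    if host_genotype == "B/B":
        return 1.0 - maf ** 2
    raise ValueError("heterozygous sites are not considered by the model")


def simplified_site_probability(
    p_c: float, prob_if_contaminated: float, prob_if_clean: float
) -> float:
    """Simplified site model: mixture of the two cases weighted by P_C."""
    return p_c * prob_if_contaminated + (1.0 - p_c) * prob_if_clean


def log_likelihood(site_probabilities, floor: float = BLACK_SWAN_FLOOR) -> float:
    """log L(alpha): sum over SNPs of log(max(P(D_i | alpha), floor))."""
    return sum(math.log(max(p, floor)) for p in site_probabilities)
```

Because of the floor, a single SNP with a vanishingly small probability lowers the log-likelihood by at most log(3.3 × 10^-7), so one outlier site cannot dominate the result.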

VI.C PROBABILITY OF CONTAMINATION FOR A SAMPLE USING LIKELIHOOD TESTS

[00204] In one example of determining the likelihood of contamination, the contamination detection workflow 1300 applies a likelihood model including two separate likelihood tests.

[00205] In the first likelihood test, the product term of the likelihood function L(α) is used to calculate a first likelihood ratio (LR1) representing the maximum contamination likelihood that is obtained from testing a series of contamination levels α_i against the minor allele frequency in a sample. That is, the test identifies which level of contamination α gives the highest contamination likelihood.

[00206] The first likelihood ratio LR1 uses a first null hypothesis that the sample is contaminated at the maximum of a series of contamination levels α (L(α = α_i)) based on the MAF of the observed, pre-determined SNPs. That is, the sample is contaminated at a contamination level α_max giving the highest likelihood of contamination. Therefore, the first null hypothesis can be written as:

$$L_{max} = \max\left[L_1(\alpha = 0.001),\ L_2(\alpha = 0.002),\ \ldots,\ L_n(\alpha = 0.5)\right] \qquad (8)$$

[00207] The first likelihood ratio also uses a first hypothesis that there is no contamination in the sample (L(α = 0.000)). Therefore, the first likelihood ratio test LR1 can be written as:

$$LR_1 = \frac{\max\left[L(\alpha = 0.001),\ L(\alpha = 0.002),\ L(\alpha = 0.003),\ \ldots,\ L(\alpha = 0.5)\right]}{L(\alpha = 0.000)} \qquad (9)$$

[00208] Generally, the first likelihood ratio LR1 results in a single value. The sample is considered to pass the first likelihood test if the value of the first likelihood ratio LR1 is above a threshold level. That is, it is likely that the sample is contaminated at a contamination level α.
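The grid search behind the first likelihood test can be sketched as below. The sketch assumes the caller supplies a function returning the log-likelihood at a given contamination level, so the ratio becomes a difference of logarithms; that log-space formulation, the function names, and the toy likelihood in the usage lines are illustrative assumptions rather than part of the disclosure.

```python
from typing import Callable, Tuple


def first_likelihood_ratio(
    log_likelihood_at: Callable[[float], float], steps: int = 500
) -> Tuple[float, float]:
    """Return (alpha_max, log LR1) from a grid of contamination levels.

    The grid runs from 0.001 to steps / 1000 (0.5 by default) in
    increments of 0.001, matching the series of levels in the text.
    """
    grid = [i / 1000.0 for i in range(1, steps + 1)]
    alpha_max = max(grid, key=log_likelihood_at)
    # log of (L(alpha_max) / L(0)) is a difference of log-likelihoods
    log_lr1 = log_likelihood_at(alpha_max) - log_likelihood_at(0.0)
    return alpha_max, log_lr1


# Toy log-likelihood that peaks at alpha = 0.02 (purely illustrative).
toy_log_likelihood = lambda alpha: -((alpha - 0.02) ** 2)
best_alpha, log_ratio = first_likelihood_ratio(toy_log_likelihood)
```

A sample would pass the first test when the returned log ratio exceeds the logarithm of the chosen threshold.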

[00209] In the second likelihood test, the likelihood function L(a) is used to calculate a second likelihood ratio LR2 representing a likelihood that observed minor allele frequencies are due to contamination rather than due to a constant increase in noise across all predetermined SNPs or all SNPs.

[00210] The second likelihood ratio LR2 uses a second null hypothesis L_max MAF that is the same as the first null hypothesis (Eqn. 8). Additionally, the second likelihood ratio LR2 uses a second hypothesis L_noise that a sample contaminated at contamination level α_max includes minor allele frequencies at an average allele frequency of previously observed SNPs (e.g., pre-determined SNPs or all SNPs) (uniform(MAF)). The second hypothesis can be written as:

$$L_{noise} = L(\alpha_{max} \mid \mathrm{uniform}(\mathrm{MAF})) \qquad (10)$$

[00211] Accordingly, the second likelihood ratio can be written as:

$$LR_2 = \frac{L_{max\ MAF}}{L_{noise}} \qquad (11)$$

[00212] The second likelihood ratio LR2 results in a single value. The sample is considered to pass the second likelihood test if the value of LR2 is above a threshold. That is, it is likely that the observed MAF is due to contamination and not due to noise. Alternatively stated, the second likelihood test passes when the specific arrangement of previously observed MAFs is significant in determining the contamination likelihood, while a random distribution of previously observed MAFs is insignificant in determining the contamination likelihood.

[00213] If a sample passes both of the likelihood tests, then the sample is called as contaminated at the contamination level α that passes the tests. If a sample fails either of the likelihood tests, then it is not called as contaminated.
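The two-test calling rule can be sketched as follows. The sketch assumes that likelihoods are handled as logarithms, that LR2 is the likelihood under the observed MAFs divided by the likelihood under the uniform-noise hypothesis, and that both thresholds are supplied by the caller; the names and the example threshold values are hypothetical.

```python
def second_log_likelihood_ratio(log_l_max_maf: float, log_l_noise: float) -> float:
    """log LR2: observed-MAF likelihood versus the uniform-noise likelihood."""
    return log_l_max_maf - log_l_noise


def call_contamination(
    log_lr1: float, log_lr2: float, log_threshold_1: float, log_threshold_2: float
) -> bool:
    """A sample is called contaminated only if it passes both likelihood tests."""
    return log_lr1 > log_threshold_1 and log_lr2 > log_threshold_2


# Hypothetical values: the first test passes, the second fails, so no call.
log_lr2 = second_log_likelihood_ratio(-120.0, -119.5)
called = call_contamination(
    log_lr1=8.0, log_lr2=log_lr2, log_threshold_1=3.0, log_threshold_2=0.0
)
```

Requiring both tests keeps a uniform rise in noise across all SNPs from being mistaken for contamination.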

[00214] In other configurations, the contamination detection workflow can use additional or fewer likelihood tests to determine if a sample is contaminated.

VI.D DETERMINING A CONTAMINATION SOURCE

[00215] In one example of determining the likelihood of contamination, the likelihood model of the contamination detection workflow 400 can additionally determine a likely source of contamination. Detecting the source of contamination enables the assessment of risk introduced by the contaminant, as well as the point in sample processing at which it occurred, such as, for example, any step of process 100 or 300. In contamination detection workflow 600 or 700, the genotypes of likely contaminants may be used in place of prior probabilities from population SNPs. Introduction of prior probabilities of contamination will either increase or decrease the likelihood ratio relative to the likelihood ratio obtained using probabilities based on the population.

[0101] The likelihood model can be informed by the prior probabilities of pre-determined SNPs from the known genotypes of samples that were processed in the same batch as the test sample (or a set of related batches). A likelihood test is then performed to determine if knowing the exact genotype probabilities gives a higher value than the likelihood obtained using the population MAF probability. If the difference is significant, it can be concluded that a given sample is the contaminant.

[0102] For a given pre-determined SNP, three observed genotypes are possible: homozygous reference 0/0, heterozygous 0/1, and homozygous alternative 1/1, where 0 represents the reference allele and 1 the alternative allele. In a normal (uncontaminated) sample, the observed allele frequency values are expected to be close to 0, 0.5 and 1 for genotypes 0/0, 0/1 and 1/1, respectively. However, in a contaminated sample, the observed allele frequency values can be expected to shift from 0, 0.5, and 1, as the pre-determined SNPs vary across the population, and thus, have a higher likelihood of being present in a contaminating sample.

[00216] VII. DETECTING CONTAMINATION USING REGRESSION

[00217] In one embodiment, a method for identifying contamination in a sample includes generating a noise model (i.e., a contamination model) based on the sequencing reads identified as having one or more pre-determined SNPs and an observed allele frequency in the plurality of sequencing reads. Exemplary methods for using regression analysis for contamination detection are described in PCT/IB2018/050979, which is incorporated herein by reference in its entirety.

[00218] In one embodiment, the noise model represents a measure of background noise in a subset of sequencing reads, the noise model generated based on the subset of the sequencing reads. The background noise can be a population measure of allele frequency in the subset of sequencing reads. Additionally, the background noise can be representative of the static noise generated when sequencing a SNP.

[00219] In one embodiment, a method of identifying contamination in a sample that includes applying a noise model (e.g., a contamination model) further includes applying the contamination model to an identified sequencing read using the observed allele frequency of the one or more pre-determined SNPs in the identified sequencing reads and the generated noise model to obtain a confidence score representing a measure of the predicted contamination in the sequencing reads. In such cases, a plurality of sequencing reads (e.g., a sample) is identified as contaminated when the confidence score is above a threshold that the contamination model predicts is indicative of contamination. Contamination models can include a random error term to aid in generating a confidence score.

[00220] In one embodiment, generating the noise model further comprises: determining a noise coefficient for each SNP of the subset of sequencing reads, the noise coefficient predicting the expected noise level for each SNP. In some embodiments, the noise model generated based on the subset of sequencing reads is additionally based on a sample type of the sequencing reads.

[00221] In a non-limiting example, FIG. 14 provides a diagram of a contamination detection workflow 1400 executing on the processing system 200 for detecting and calling contamination, applying a noise model (i.e., a contamination model).

[00222] In the illustrated example, contamination detection workflow 1400 includes a single sample component 1410 and a baseline batch component 1420. Single sample component 1410 of contamination detection workflow 1400 is informed, for example, by the contents of a single variant call file 1412 and a minor allele frequencies (MAF) variant call file 1414 called by the variant caller 240. The single variant call file 1412 is the variant call file for a single target sample. The MAF variant call file 1414 is the MAF variant call file for any number of SNP population allele frequencies AF.

[00223] Baseline batch component 1420 of contamination detection workflow 1400 generates a background noise baseline for each SNP from uncontaminated samples as another input to the single sample component 1410. Generating a background noise baseline is described in more detail below. Baseline batch component 1420 is informed, for example, by the contents of multiple variant call files 1422 called by the variant caller 240. The multiple variant call files 1422 can be the variant call files of multiple samples and are, in some examples, from samples that are determined to be healthy. Healthy samples are samples previously determined not to include cancer.

[00224] In one embodiment, the contamination detection workflow 1400 can generate output files 1440 and/or plots 1442 from sequencing data processed by contamination detection algorithm 110. For example, contamination detection workflow 1400 may generate variant allele frequency distribution plots or regression plots as a means for evaluating a DNA test sample for contamination. Data processed by contamination detection workflow 1400 can be visually presented to the user via a graphical user interface (GUI) 1450 of the processing system 200. For example, the contents of output files 1440 (e.g., a text file of data opened in Excel) and regression plots 1442 can be displayed in GUI 1450.

[00225] In another embodiment, the contamination detection workflow 1400 may use the machine learning engine 220 and training module 1455 to improve contamination detection. Various training datasets 1456 (e.g., parameters from parameter database 230, sequences from sequence database 210, etc.) may be used to supply information to the machine learning engine 220 as described herein. In accordance with this embodiment, the machine learning engine 220 may be used to train a contamination noise baseline to identify a noise threshold, determine a contamination level, determine a contamination event, and determine the limit of detection (LOD) for contamination detection. Additionally, the machine learning engine 220 may be used to calculate the sensitivity (true positive rate) and specificity (true negative rate) for contamination detection. That is, machine learning engine 220 can analyze different statistical significance indicators (such as p-values) and determine the threshold that achieves the highest sensitivity at the minimum desired specificity level (e.g., 99%) for determining a contamination event.

[00226] Single sample component 1410 of contamination detection workflow 1400 is, for example, a runnable script that is used to estimate contamination in a sample. By contrast, baseline batch component 1430 of contamination detection algorithm 110 is, for example, a runnable script that is used for generating estimates across a batch of samples, and may also be used to generate a background noise model across these samples. The noise model is generated from a batch of samples previously determined to be healthy.
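One way to picture the threshold selection described for the machine learning engine is the exhaustive search below, which scans candidate p-value thresholds and keeps the one with the highest sensitivity among those meeting a minimum specificity. The disclosure does not specify the search procedure, so this is only an assumed illustration; the function name, the convention that a sample is called contaminated when its p-value is at or below the threshold, and the example p-values are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def choose_p_value_threshold(
    contaminated_p: Sequence[float],
    clean_p: Sequence[float],
    min_specificity: float = 0.99,
) -> Optional[Tuple[float, float]]:
    """Return (sensitivity, threshold), or None if no threshold qualifies.

    A sample is called contaminated when its p-value is <= threshold.
    """
    best: Optional[Tuple[float, float]] = None
    for threshold in sorted(set(contaminated_p) | set(clean_p)):
        # Specificity: fraction of clean samples that are not called.
        specificity = sum(p > threshold for p in clean_p) / len(clean_p)
        # Sensitivity: fraction of contaminated samples that are called.
        sensitivity = sum(p <= threshold for p in contaminated_p) / len(contaminated_p)
        if specificity >= min_specificity and (best is None or sensitivity > best[0]):
            best = (sensitivity, threshold)
    return best


known_contaminated = [0.001, 0.004, 0.2]   # illustrative p-values
known_clean = [0.3, 0.5, 0.7, 0.03]        # illustrative p-values
strict = choose_p_value_threshold(known_contaminated, known_clean, min_specificity=1.0)
```

With these illustrative values, the strict setting selects a threshold of 0.004, which calls two of the three contaminated samples without calling any clean sample.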

VIII. DETECTING CONTAMINATION USING MAF AND NOISE

[00227] Exemplary methods for using regression analysis for detecting contamination are described in PCT/IB2018/050979, which is incorporated herein by reference in its entirety.

[00228] In one embodiment, the contamination detection workflow 1400 may be based on a model for estimating contamination. In one example, the model is a linear regression model based on population mean allele frequencies of the one or more pre-determined SNPs, herein referred to as the “population model” for clarity, that is configured for detecting contamination in sequencing data from a sample (e.g., a plurality of sequencing reads).

[00229] In one example, the population model determines contamination by calculating a probability that the observed variant allele frequency VAF for a sample (e.g., a plurality of sequencing reads) is statistically significant relative to the population mean allele frequency MAF and a background noise baseline. That is, the population model calculates a probability of observing a variant allele frequency VAF of a sample at a given contamination level α of the average minor allele frequency MAF of the population for any one or more of the pre-determined SNPs. If the population model determines that the observed VAF for the sample at a given contamination level α is above a threshold contamination level and statistically significant, the contamination detection workflow 1400 can call a contamination event.

[00230] In some embodiments, the population model can be informed by a sample call file (e.g., single variant call file 1412), a population call file (e.g., MAF call file 1414), and a set of variant call files (e.g., multiple variant call files 1422). The single variant call file 1412 includes, at least in part, observed variant allele frequencies VAFs for each of the one or more of the pre-determined SNPs that are present in the plurality of sequencing reads. Similarly, the population call file includes the minor allele frequencies of a population of test samples (MAFp). The minor allele frequency of the population of test samples MAFp can include the minor allele frequencies MAF of any number of SNPs of the population at any number of sites k. The set of variant call files includes the variant allele frequencies for a set of test samples (VAFB). The set of variant allele frequencies for a set of test samples can include variant allele frequencies VAF of any number of SNPs at any number of sites k.

VIII.A REGRESSION MODEL FOR MAF AND NOISE

[00231] In one embodiment, a contamination detection workflow 1400 determines a likelihood that a sample is contaminated using observed sequencing data and a background noise model. In some examples, the observed sequencing data can be included in a test sample call file (such as single variant call file 1412) and a population call file (such as MAF call file 1414). The background noise model can use a set of variant call files (such as multiple variant call files 1422) to determine a background noise baseline. Here, for the purpose of example, the probability of contamination for a single SNP is based on the relationship between a sample’s observed variant allele frequency VAFs of the one or more pre-determined SNPs present in the sample, a population minor allele frequency MAFp, and a background noise baseline generated from a set of variant allele frequencies VAFB.

[00232] In one embodiment, the contamination detection workflow 1400 uses a population model on a sample including a number of SNPs, including one or more of the pre-determined SNPs. The population model can be represented as:

$$VAF_S = \alpha \cdot MAF_P + \beta \cdot N(VAF_B) + \varepsilon \qquad (12)$$

where α is the contamination level, β is the noise fraction for the sample (i.e., number of noisy SNPs over number of non-noisy SNPs), N is the background noise model based on a set of observed variant allele frequencies VAFB, and ε is a random error term determined by the regression.

[00233] In some cases, the observed variant allele frequency of the sample VAFs and the minor allele frequency MAFp of the population can include a negated variant allele frequency VAF and a negated minor allele frequency MAF. Negated variant allele frequencies and negated minor allele frequencies allow the data used by the population model to be similarly scaled, such that data from homozygous alternative alleles and homozygous reference alleles in a test sample are similarly analyzed in the population model.

[00234] In one example embodiment, the population model includes each pre-determined SNP i in a sample. Each pre-determined SNP i of the test sample is associated with a site k (i.e., genomic position) and any number of reads of the test sample can be associated with site k. Therefore, each SNP i of a test sample has an observed variant allele frequency VAF associated with its site k. Further, each pre-determined SNP i at site k is associated with a minor allele frequency MAF for that site k. The minor allele frequency MAF for site k is the minor allele frequency MAF for reads from multiple samples at site k. For example, a first SNP i_1 of a test sample is associated with a first site k_1. The variant allele frequency VAF for the site k_1 is determined to be 0.03 from 1235 reads in the test sample associated with the first site k_1. The minor allele frequency MAF at the first site k_1 associated with the SNP i_1 is determined to be 0.01 from 1 × 10^8 SNPs in the population. A second SNP i_2 of a test sample is associated with a second site k_2. The variant allele frequency VAF for the site k_2 is determined to be 0.81 from 1792 reads in the test sample associated with the site k_2. The minor allele frequency MAF at site k_2 associated with the SNP i_2 is determined to be 0.90 from 1 × 10^9 SNPs in the population.

[00235] Therefore, the variant allele frequency of the test sample VAFs can be represented as:

$$VAF_S = \sum_k \sum_i VAF_k^i \qquad (13)$$

where VAFs is the variant allele frequency of the test sample, the summation over k indicates that the variant allele frequency VAFs includes the variant allele frequency of SNPs at all sites k included in the test sample, and the summation over i indicates that the variant allele frequency VAF at site k includes all SNPs i at site k. Similarly, the minor allele frequency of the population MAFp can be represented as:

$$MAF_P = \sum_k \sum_i MAF_k^i \qquad (14)$$

where MAFp is the minor allele frequency of the population, the summation over k indicates that the minor allele frequency MAF includes the minor allele frequency MAF of SNPs of the population at all sites k included in the test sample, and the summation over i indicates that there is a minor allele frequency MAF associated with each SNP i at a site k of the test sample.

[00236] In one example embodiment, for a given test sample, there are three possible observed genotypes for each SNP i at a site k: homozygous reference 0/0, heterozygous 0/1, and homozygous alternative 1/1, where 0 represents the reference allele and 1 the alternative allele. In an uncontaminated test sample, the variant allele frequency values observed are expected to be close to 0, 0.5 and 1 for genotypes 0/0, 0/1 and 1/1, respectively. However, in a contaminated sample, the variant allele frequency values can be expected to shift from 0, 0.5, and 1, as the SNPs vary across the population, and thus, have a higher likelihood of being present in a contaminating sample. Modifying the variant allele frequencies VAF of the homozygous reference and homozygous alternative alleles such that the population model can analyze all genotypes of a test sample is beneficial.

[00237] Therefore, in some embodiments, the population model can negate the variant allele frequencies VAF for some SNPs i such that the population model can more easily process the variant allele frequency VAF data. In one example embodiment, the variant allele frequency VAF for SNPs i at site k (VAF_k^i) included in the test sample can be described by:

$$VAF_k^i = \begin{cases} VAF_k & \text{if } 0 < VAF_k < 0.2 \\ \text{NA} & \text{if } 0.2 \le VAF_k \le 0.8 \\ 1 - VAF_k & \text{if } 0.8 < VAF_k < 1.0 \end{cases} \qquad (15)$$

where VAF_k^i is the variant allele frequency VAF for an SNP i at site k of the test sample, VAF_k is the variant allele frequency of all SNPs of the test sample at site k, and NA indicates that a SNP will not be considered. Here, the variant allele frequency VAF for SNP i at site k of the test sample (VAF_k^i) is the determined variant allele frequency for the SNPs at site k (VAF_k) if the SNP i is a homozygous reference genotype call. A homozygous reference call is a reference call with a variant allele frequency VAF of SNPs at site k greater than 0.0 and less than 0.2 (0 < VAF_k < 0.2). The variant allele frequency for an SNP i at site k of the test sample (VAF_k^i) is not considered (marked as “NA” above) if the SNP i is a heterozygous reference genotype call. A heterozygous reference call is a reference call with a variant allele frequency VAF of SNPs at site k greater than or equal to 0.2 and less than or equal to 0.8 (0.2 ≤ VAF_k ≤ 0.8). Finally, the variant allele frequency VAF for an SNP i at site k of the test sample (VAF_k^i) is 1 less the determined variant allele frequency VAF_k for all the SNPs at site k if the SNP i is a homozygous alternative reference call. A homozygous alternative reference call is a reference call with a variant allele frequency VAF of SNPs at site k greater than 0.8 and less than 1.0 (0.8 < VAF_k < 1.0).
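The negation rule for variant allele frequencies can be sketched in Python as follows. The function name is hypothetical, and `None` stands in for “NA”. Values of exactly 0 or 1 fall outside the three ranges given above; returning `None` for them is an assumption of this sketch.

```python
from typing import Optional


def negated_vaf(vaf: float) -> Optional[float]:
    """Apply the VAF negation rule to the site-level VAF of one SNP."""
    if 0.0 < vaf < 0.2:
        return vaf          # homozygous reference call: keep VAF as is
    if 0.8 < vaf < 1.0:
        return 1.0 - vaf    # homozygous alternative call: negate
    # Heterozygous range [0.2, 0.8] is excluded; exactly 0 or 1 is also
    # excluded here, which is an assumption rather than a stated rule.
    return None
```

For the two example sites discussed earlier, a VAF of 0.03 is kept as 0.03 and a VAF of 0.81 becomes approximately 0.19, which places both homozygous classes on the same scale.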

[00238] In some embodiments, the population model can, for some SNPs i, negate the minor allele frequencies MAF based on the variant allele frequency for an SNP i at site k such that the population model can more easily process the data. For example, the minor allele frequency for an SNP i at site k can be described by:

$$MAF_k^i = \begin{cases} MAF_k & \text{if } 0 < VAF_k < 0.2 \\ \text{NA} & \text{if } 0.2 \le VAF_k \le 0.8 \\ 1 - MAF_k & \text{if } 0.8 < VAF_k < 1.0 \end{cases} \qquad (16)$$

where MAF_k^i is the minor allele frequency MAF associated with SNP i at site k of the test sample, MAF_k is the minor allele frequency of population SNPs at site k, NA indicates that a SNP will not be considered, and VAF_k is the variant allele frequency of the SNPs of the test sample at site k. Here, the minor allele frequency MAF associated with SNP i at site k of the test sample (MAF_k^i) is the minor allele frequency for the SNPs of the population at site k (MAF_k) if the SNP i is a homozygous reference genotype call. The minor allele frequency for a SNP i at site k of the test sample (MAF_k^i) is not considered (NA) if the SNP i is a heterozygous reference genotype call. Finally, the minor allele frequency associated with an SNP i at site k of the test sample (MAF_k^i) is 1 less the determined minor allele frequency MAF_k for all the SNPs at site k if the SNP i is a homozygous alternative reference call.

[00239] The population model can also include a background noise model N based on the variant allele frequencies from a set of variants (VAFB). The background noise model N can be used to distinguish a background noise baseline that is generated during sequencing of each SNP, such as, for example, during processes 100 and 300. The introduced noise may be from the sequence context of a variant and, therefore, some sites k will have a higher noise level and some sites k will have a lower noise level. Generally, the noise model is the average variant allele frequency for healthy variants of the set of variants at a given site k. Therefore, a given SNP i at site k of the sample can be associated with a background noise baseline associated with the site k. The background noise model N can determine a noise coefficient β representing the expected background noise baseline of each SNP.
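A short sketch of the minor allele frequency negation and of a per-site noise baseline is given below. The function names and the dictionary layout keyed by site are hypothetical; the baseline is computed as a plain mean of the VAFs seen in healthy samples at each site, following the general description above, and `None` stands in for “NA”.

```python
from statistics import mean
from typing import Dict, List, Optional


def negated_maf(maf: float, vaf: float) -> Optional[float]:
    """Negate the population MAF according to the test sample's VAF at the site."""
    if 0.0 < vaf < 0.2:
        return maf          # homozygous reference call: keep MAF as is
    if 0.8 < vaf < 1.0:
        return 1.0 - maf    # homozygous alternative call: negate
    return None             # not considered (heterozygous or out of range)


def noise_baseline(healthy_vafs_by_site: Dict[str, List[float]]) -> Dict[str, float]:
    """Average VAF observed in healthy samples at each site."""
    return {site: mean(vafs) for site, vafs in healthy_vafs_by_site.items()}


baseline = noise_baseline({"k1": [0.001, 0.003], "k2": [0.010, 0.020, 0.030]})
```

For the second example site discussed earlier (VAF 0.81, MAF 0.90), the negated MAF is approximately 0.10.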

[00240] In one approach, the population model regresses the contamination level a against the variant allele frequency for a test sample VAFs, the minor allele frequency for the population MAFp, and the background noise model N. That is, contamination detection workflow 1400 calculates a contamination level a of a sample using the associated observed variant allele frequency VAF, minor allele frequency MAF, and background noise model N for the pre-determined SNPs present in the sample. Contamination detection workflow 1400 determines a p-value of the contamination fraction a using the regression model across all pre-determined SNPs of a test sample. Based on the p-value and the contamination level a, the contamination detection workflow 1400 can determine that the sample is contaminated. For example, in one embodiment, if the determined contamination level a is above a threshold contamination value (e.g., 3%) and the p-value is below a threshold p-value (e.g., 0.05) the sample can be called contaminated.
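The regression and calling step can be illustrated with a small ordinary least-squares fit written in plain Python. Several choices here are assumptions made for the sketch rather than details from the disclosure: the model is fitted without an intercept, the inputs are already negated per-SNP values with excluded SNPs removed, the p-value for α uses a normal approximation to the t-statistic (reasonable only when many SNPs are used), and all function names are hypothetical. The default thresholds are the example values of 3% and 0.05.

```python
import math
from typing import Sequence, Tuple


def fit_population_model(
    vaf: Sequence[float], maf: Sequence[float], noise: Sequence[float]
) -> Tuple[float, float, float]:
    """Fit VAF = alpha * MAF + beta * noise + error; return (alpha, beta, p_alpha)."""
    n = len(vaf)
    if n < 3:
        raise ValueError("need at least three SNPs to estimate two coefficients")
    s_mm = sum(m * m for m in maf)
    s_nn = sum(b * b for b in noise)
    s_mn = sum(m * b for m, b in zip(maf, noise))
    s_my = sum(m * y for m, y in zip(maf, vaf))
    s_ny = sum(b * y for b, y in zip(noise, vaf))
    det = s_mm * s_nn - s_mn * s_mn
    if det <= 0.0:
        raise ValueError("MAF and noise predictors are collinear")
    # Solve the 2x2 normal equations for the two coefficients.
    alpha = (s_my * s_nn - s_ny * s_mn) / det
    beta = (s_ny * s_mm - s_my * s_mn) / det
    rss = sum((y - alpha * m - beta * b) ** 2 for y, m, b in zip(vaf, maf, noise))
    std_err = math.sqrt(rss / (n - 2) * s_nn / det)
    if std_err == 0.0:
        p_value = 0.0 if alpha != 0.0 else 1.0
    else:
        # Two-sided p-value from a normal approximation (an assumption).
        p_value = math.erfc(abs(alpha / std_err) / math.sqrt(2.0))
    return alpha, beta, p_value


def is_contaminated(
    alpha: float, p_value: float, alpha_threshold: float = 0.03, p_threshold: float = 0.05
) -> bool:
    """Call contamination when alpha is large enough and statistically significant."""
    return alpha > alpha_threshold and p_value < p_threshold
```

A production analysis would more likely rely on a statistics library that reports exact t-distribution p-values; the hand-written solver is used here only to keep the sketch self-contained.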

[00241] In an alternative approach, the population model can calculate two contamination levels using the variant allele frequencies VAF and minor allele frequencies MAF of the pre-determined SNPs in the test sample. In one example, the population model can include a first regression including a first contamination level α_1 using SNPs with homozygous alternative reference calls and a second regression including a second contamination level α_2 using SNPs with homozygous reference calls. If a significant regression p-value is observed from both regressions, contamination detection workflow 1400 can determine that the sample is contaminated. In this case, using two regression equations to detect a contamination event provides stronger evidence for contamination than a single regression equation.

IX. DETECTING CONTAMINATION USING CONTAMINATION PROBABILITY AND NOISE

[00242] Exemplary methods for using contamination probability and noise models for detecting contamination are described in PCT/IB2018/050979, which is hereby incorporated by reference in its entirety.

[00243] In another example embodiment of contamination detection workflow 1400 and the methods described herein, the contamination model for detecting contamination is a linear regression model based on a contamination probability generated from population mean allele frequencies, herein referred to as a “probability model” for convenience of description and delineation from the “population model” discussed previously. The probability model determines contamination by calculating a probability that the observed variant allele frequency for a plurality of sequencing reads is statistically significant relative to a contamination probability and background noise baseline. That is, the probability model calculates a probability of observing a variant allele frequency VAF in a plurality of sequencing reads at a given contamination level α of the probable contamination frequency generated from the population. If the probability model determines that the observed VAF for the test sample at a given contamination level α is above a threshold contamination level and statistically significant, the contamination detection workflow 1400 can determine a contamination event.

[00244] In some embodiments, the probability model is informed by a test sample call file (e.g., single variant call file 1412), a population call file (e.g., MAF call file 1414), and a set of variant call files (e.g., multiple variant call files 1422). The test sample call file includes the observed variant allele frequencies VAFs for a single test sample. The variant allele frequency of the test sample VAFs can include observed variant allele frequencies VAF of each of the one or more pre-determined SNPs. Similarly, the population call file includes the minor allele frequencies MAFp of a plurality of sequencing reads. The minor allele frequency of the plurality of sequencing reads MAFp can include the minor allele frequencies of each of the one or more pre-determined SNPs. The set of variant call files includes the variant allele frequencies VAFB for a set of samples (i.e., different pluralities of sequencing reads). The set of variant allele frequencies for a set of samples can include variant allele frequencies at each of the one or more pre-determined SNPs.

IX.A REGRESSION MODEL FOR CONTAMINATION PROBABILITY AND NOISE

[00245] In one embodiment, a contamination detection workflow 1400 determines a likelihood that a sample is contaminated using observed sequencing data and a background noise model. In some examples, the observed sequencing data can be included in a sample call file (such as single variant call file 1412) and a population call file (such as MAF call file 1414). The background noise model can use a set of variant call files (such as multiple variant call files 1422) to determine a background noise baseline. Here, for the purpose of example, the probability of contamination for a single pre-determined SNP is based on the relationship between a sample’s (i.e., plurality of sequencing reads) variant allele frequency VAFs, a contamination probability C based on a population minor allele frequency MAFp, and a background noise baseline generated from a set of variant allele frequencies VAFB.

[00246] In one embodiment, the contamination detection workflow 1400 uses a population model on a test sample including a number of SNPs. The population model can be represented as:

$$\mathrm{VAF}_S = \alpha \cdot C(\mathrm{MAF}_P) + \beta \cdot N(\mathrm{VAF}_B) + \epsilon \tag{17}$$

where C is the contamination probability based on the minor allele frequency of the population MAFp, α is the contamination level for the population, β is the noise fraction for the test sample, N is the background noise model generating a background noise baseline from the variant allele frequencies for a set of variants VAFB, and ε is a random error term determined by the regression.

[00247] Here, the variant allele frequency of the test sample VAFs and the minor allele frequency of the population MAFp are similarly defined as in Eqns. 2 and 3. That is, each SNP i of the test sample is associated with a site k and the variant allele frequency for an SNP i is the variant allele frequency based on all SNPs at site k in the test sample. Further, each SNP i of the test sample is associated with a minor allele frequency MAF of all SNPs of the population at site k.

[00248] In some embodiments, contamination detection workflow 1400 uses a probability model based on the population minor allele frequency MAFp. Therefore, the contamination probability associated with each SNP i at site k of the test sample can be represented as: where $C_k^i$ is the contamination probability associated with each SNP i at site k of the test sample, the summation over k indicates that the contamination probability C includes the minor allele frequency MAF of SNPs of the population at all sites k included in the test sample, and the summation over i indicates that there is a contamination probability C associated with each SNP i of the test sample.

[00249] The contamination probability represents the likelihood a sample is contaminated based on the minor allele frequency MAF and genotype of the SNP i at site k. In one example embodiment, contamination probability C for an SNP i at site k ($C_k^i$) included in the test sample can be described as:

$$C_k^i = \begin{cases} 1 - (1 - \mathrm{MAF}_k)^2 & \text{if } 0 < \mathrm{VAF}_k < 0.2 \\ \mathrm{NA} & \text{if } 0.2 < \mathrm{VAF}_k < 0.8 \\ 1 - (\mathrm{MAF}_k)^2 & \text{if } 0.8 < \mathrm{VAF}_k < 1.0 \end{cases} \tag{19}$$

where $C_k^i$ is the contamination probability C associated with SNP i at site k of the test sample, MAFk is the minor allele frequency of population SNPs at site k, NA indicates that an SNP will not be considered, and VAFk is the variant allele frequency of the SNPs of the test sample at site k. Here, the contamination probability C associated with SNP i at site k of the test sample ($C_k^i$) is one minus the square of the quantity one minus the minor allele frequency for SNPs of the population at site k (i.e., 1 - (1 - MAFk)^2) if the SNP i is a homozygous reference genotype call. The contamination probability for an SNP i at site k of the test sample ($C_k^i$) is not considered (marked as “NA” above) if the SNP i is a heterozygous genotype call. Finally, the contamination probability C associated with SNP i at site k of the test sample ($C_k^i$) is one minus the square of the minor allele frequency for SNPs of the population at site k (i.e., 1 - (MAFk)^2) if the SNP i is a homozygous alternative genotype call.
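
By way of illustration only, the piecewise rule of Eqn. 19 can be sketched in Python as follows. The function name `contamination_probability` is hypothetical and does not appear in this disclosure. The printed equation leaves the boundary values (0, 0.2, 0.8, and 1.0) unassigned, so the sketch assumes that each lower bound is inclusive and that 1.0 belongs to the last case; that choice is an assumption, not part of the disclosure.

```python
from typing import Optional


def contamination_probability(maf_k: float, vaf_k: float) -> Optional[float]:
    """Per-SNP contamination probability following the three cases of Eqn. 19.

    maf_k: population minor allele frequency at site k.
    vaf_k: variant allele frequency observed in the test sample at site k.
    Returns None for the "NA" case (the SNP is not considered).
    """
    if not (0.0 <= maf_k <= 1.0 and 0.0 <= vaf_k <= 1.0):
        raise ValueError("frequencies must lie in [0, 1]")
    if vaf_k < 0.2:
        # Homozygous reference call: 1 - (1 - MAF_k)^2.
        return 1.0 - (1.0 - maf_k) ** 2
    if vaf_k < 0.8:
        # Heterozygous call: not considered.
        return None
    # Homozygous alternative call: 1 - (MAF_k)^2.
    return 1.0 - maf_k ** 2
```

For a site with MAFk = 0.1, for example, the sketch gives approximately 0.19 for a homozygous reference call and approximately 0.99 for a homozygous alternative call.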

[00250] In some embodiments, the probability model can include a background noise model N similar to the noise model described for detection workflow 1400. That is, the noise model is the average variant allele frequency for healthy variants of the set of variants at a given site k (i.e., VAFB). Therefore, a given SNP i at site k of the test sample can be associated with a background noise baseline associated with the site k. The background noise model N can determine a noise coefficient β (the noise fraction of Eqn. 17) representing the expected background noise baseline of each SNP.
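
As a minimal sketch of the per-site averaging described above, the background noise baseline could be computed as follows. The function name `background_noise_baseline` and the input layout (one dictionary per background sample, mapping a site identifier to a variant allele frequency) are assumptions made for illustration and are not a format defined in this disclosure.

```python
from statistics import fmean
from typing import Dict, List


def background_noise_baseline(background_vafs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the variant allele frequency at each site over a set of samples.

    background_vafs: one mapping per background sample, site id -> VAF.
    Sites missing from a sample are skipped for that sample.
    """
    per_site: Dict[str, List[float]] = {}
    for sample in background_vafs:
        for site, vaf in sample.items():
            per_site.setdefault(site, []).append(vaf)
    # The baseline for a site is the mean of the VAFs observed at that site.
    return {site: fmean(values) for site, values in per_site.items()}
```

Each SNP of the test sample can then look up the baseline for its site k in the returned mapping.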

[00251] In this example, the probability model estimates the contamination level α by regressing the variant allele frequency of the test sample VAFs on the contamination probability C and the background noise model N. That is, contamination detection workflow 1400 calculates a contamination level α of a test sample using the associated variant allele frequency VAF, contamination probability C, and background noise model N for the SNPs of the test sample. Contamination detection workflow 1400 determines a p-value of the contamination level α of the SNPs in a test sample using the probability model. Based on the p-value and the contamination level α, the contamination detection workflow 1400 can determine that the test sample is contaminated. For example, in one embodiment, if the determined contamination level α is above a threshold contamination value (such as, for example, 3%) and the p-value is below a threshold p-value (such as, for example, 0.05), the sample can be called contaminated.
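
The regression and the example thresholds above can be sketched as follows, for illustration only. The disclosure does not specify the fitting procedure or the statistical test, so this sketch assumes an ordinary least-squares fit without an intercept and a one-sided p-value from a normal approximation to the t-statistic; both are assumptions. The names `fit_contamination` and `ContaminationFit` are hypothetical. SNPs whose contamination probability is NA are assumed to have been removed before the call.

```python
from math import sqrt
from statistics import NormalDist
from typing import NamedTuple, Sequence


class ContaminationFit(NamedTuple):
    alpha: float        # estimated contamination level
    beta: float         # estimated noise coefficient
    p_value: float      # one-sided p-value for alpha > 0 (normal approximation)
    contaminated: bool  # True if both example thresholds are met


def fit_contamination(
    vaf_s: Sequence[float],
    c: Sequence[float],
    n: Sequence[float],
    alpha_threshold: float = 0.03,
    p_threshold: float = 0.05,
) -> ContaminationFit:
    """Least-squares fit of VAF_S = alpha*C + beta*N + error, with no intercept."""
    m = len(vaf_s)
    if not (m == len(c) == len(n)) or m < 3:
        raise ValueError("need at least three SNPs and equal-length inputs")

    # Entries of the 2x2 normal equations.
    s_cc = sum(x * x for x in c)
    s_nn = sum(x * x for x in n)
    s_cn = sum(x * y for x, y in zip(c, n))
    s_cy = sum(x * y for x, y in zip(c, vaf_s))
    s_ny = sum(x * y for x, y in zip(n, vaf_s))
    det = s_cc * s_nn - s_cn * s_cn
    if det <= 0.0:
        raise ValueError("C and N are collinear; alpha is not identifiable")

    alpha = (s_nn * s_cy - s_cn * s_ny) / det
    beta = (s_cc * s_ny - s_cn * s_cy) / det

    # Residual variance with m - 2 degrees of freedom, then the standard error of alpha.
    rss = sum((y - alpha * x - beta * z) ** 2 for y, x, z in zip(vaf_s, c, n))
    se_alpha = sqrt(rss / (m - 2) * s_nn / det)

    if se_alpha == 0.0:
        # Perfect fit: the sign of alpha decides the outcome.
        p_value = 0.0 if alpha > 0.0 else 1.0
    else:
        p_value = 1.0 - NormalDist().cdf(alpha / se_alpha)

    contaminated = alpha > alpha_threshold and p_value < p_threshold
    return ContaminationFit(alpha, beta, p_value, contaminated)
```

The default thresholds mirror the example values given above (3% and 0.05). With few SNPs, a t-distribution would be a more appropriate reference than the normal approximation used here.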

X. METHOD OF PREDICTING PRESENCE OF A DISEASE

[00252] In another aspect, this disclosure provides a method of predicting presence of a disease in a sample using, in part, the contamination detection methods described herein. In some cases, the disease is cancer. In some embodiments, the method of predicting presence of a disease in a sample includes: obtaining a plurality of sequencing reads for a plurality of nucleic acid fragments isolated from a sample comprising cell-free RNA (cfRNA); identifying contamination in a sample using any of the contamination detection methods described herein; and identifying SNPs from the plurality of sequencing reads that are informative for the presence of the disease.

[00253] In some embodiments, the methods of predicting presence of a disease include discarding a sample following determination that the sample is contaminated. In some embodiments, the methods of predicting presence of a disease include assessing the risk introduced by contamination and using the risk in determining whether the sample is discarded. In some embodiments, the risk introduced by the contamination is determined in part by determining a likely source of contamination. In some embodiments, determining the contamination source lowers the risk introduced by the contamination, whereas not determining the contamination source increases the risk introduced by the contamination.

XI. ADDITIONAL CONSIDERATIONS

[00254] The foregoing description of the embodiments of the invention has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.

[00255] Some portions of this description describe the embodiments of the invention in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.

[00256] Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product including a computer-readable non-transitory medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.

[00257] Embodiments of the invention may also relate to a product that is produced by a computing process described herein. Such a product may include information resulting from a computing process, where the information is stored on a non-transitory, tangible computer-readable storage medium and may include any embodiment of a computer program product or other data combination described herein.

[00258] Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments of the invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.