Title:
TARGETED DETECTION OF RECURRENT GENOMIC REARRANGEMENTS
Document Type and Number:
WIPO Patent Application WO/2014/055920
Kind Code:
A1
Abstract:
The present invention is based in part on an assay method used to detect genetic variations or markers specific to tumor cells. The method targets structural variation (SV) breakpoints occurring in samples composed of heterogeneous tumor and germline DNA. Additionally, this method can validate SVs called by whole exome/genome sequencing and hybridization arrays. The invention further provides for the identification of patient-specific markers and methods to monitor a patient's therapeutic response, remission and relapse. The invention additionally provides kits for the detection of such variants.

Inventors:
BAFNA VINEET (US)
PATEL ANAND (US)
Application Number:
PCT/US2013/063539
Publication Date:
April 10, 2014
Filing Date:
October 04, 2013
Assignee:
UNIV CALIFORNIA (US)
International Classes:
C12Q1/68; C12N15/11
Domestic Patent References:
WO2005094291A22005-10-13
Foreign References:
US20100086918A12010-04-08
US20090202999A12009-08-13
US20030157543A12003-08-21
Other References:
BASHIR ET AL.: "Optimization of primer design for the detection of variable genomic lesions in cancer", BIOINFORMATICS, vol. 23, no. 21, 2007, pages 2807 - 2815
Attorney, Agent or Firm:
TAYLOR, Stacy, L. et al. (4365 Executive Drive Suite 110, San Diego CA, US)
Claims:
What is claimed is:

1. A high sensitivity method for detecting a variant polynucleotide of unknown nucleotide sequence believed to differ from the wildtype nucleotide sequence of a nucleic acid molecule of interest, wherein the variant polynucleotide is in a sample containing up to about 99.9% of the wildtype nucleic acid molecules, the method comprising:

a) computational design of a multiplicity of primers to detect one or more breakpoint loci in the wildtype nucleic acid molecule of interest;

b) use of the multiplicity of primers in multiplex PCR to amplify any variant polynucleotides present in the sample;

c) analysis of the sequence of each multiplex PCR product;

d) detection of the variant polynucleotide in the sample; and

e) high through-put sequence analysis of the variant polynucleotide to confirm the differences in nucleotide sequence in the variant as compared to the sequence of the wild-type nucleic acid.

2. The method of claim 1, wherein the variant is a mutation, deletion, insertion, substitution or genomic rearrangement.

3. The method of claim 2, wherein the genomic rearrangement is an inversion, deletion, translocation, alteration or a duplication.

4. The method of claim 1, wherein the multiplicity of primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately 1-20 kb around the locus of interest and the innermost primers are separated by approximately at least 15 - 100 kb.

5. The method of claim 4, wherein the multiplicity of primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 5-15 kb around the locus of interest and the innermost primers are separated by approximately at least 20-80 kb.

6. The method of claim 5, wherein the primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 6 kb around the locus of interest and the innermost primers are separated by approximately at least 20 kb.

7. The method of claim 1, wherein the ratio of primer spacing to locus of interest is 3 kb to 380 kb.

8. The method of claim 1, wherein the ratio of primer spacing to locus of interest is 3 kb to 80 kb.

9. The method of claim 1, wherein the multiplicity of primers are selected for minimum cost in loss of assay sensitivity, wherein the cost is calculated by the equation:

$$C(P) = \sum_{(i,j) \in E_P} w + \sum_{i} \max\{\Delta_i(P) - d,\ 0,\ (1-\rho)d - \Delta_i(P)\}$$

10. The method of claim 1, wherein the analysis of the multiplex PCR products comprises:

a) sequencing the nucleotide of interest; and

b) analyzing the sequence by:

i) sequence alignment to wild-type nucleotide sequence,

ii) alignment trimming,

iii) clustering of breakpoints, and

iv) confirming the variant sequence.

11. The method of claim 1, wherein the multiplicity of primers consists of approximately 2-80 primers.

12. The method of claim 11, wherein the multiplicity of primers consists of approximately 16 primers.

13. The method of claim 1, wherein a variant is detected in multiple samples simultaneously.

14. The method of claim 1, further comprising the step of:

(f) prognosing, determining progression of cancer, predicting a therapeutic regimen or predicting benefit from therapy in a subject having a disease based on the detection of the variant.

15. The method of claim 14, wherein the disease is cancer.

16. The method of claim 15, wherein the cancer is selected from the group consisting of a carcinoma, sarcoma, leukemia, lymphoma, myeloma, and a CNS tumor. In some embodiments, the cancer is selected from the group consisting of skin cancer, lung cancer, colon cancer, pancreatic cancer, prostate cancer, liver cancer, thyroid cancer, ovarian cancer, uterine cancer, breast cancer, cervical cancer, kidney cancer, epithelial carcinoma, squamous cell carcinoma, basal cell carcinoma, melanoma, papilloma, and adenomas.

17. A kit for detecting a variant polynucleotide having a nucleotide sequence differing from the wildtype nucleotide sequence of a nucleic acid molecule, wherein the variant polynucleotide is in a sample containing up to about 99.9% of the wildtype nucleic acid molecules, the kit comprising:

a) a multiplicity of primers computationally designed to detect one or more breakpoint loci in the wildtype nucleic acid molecule; and

b) algorithms to detect the variant.

18. The kit of claim 17, wherein the multiplicity of primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately 1-20 kb around the locus of interest and the innermost primers are separated by approximately at least 15 - 100 kb.

19. The kit of claim 17, wherein the multiplicity of primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 5-15 kb around the locus of interest and the innermost primers are separated by approximately at least 20-80 kb.

20. The kit of claim 17, wherein the primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 6 kb around the locus of interest and the innermost primers are separated by approximately at least 20 kb.

21. The kit of claim 17, wherein the ratio of primer spacing to locus of interest is 3 kb to 80 kb.

22. A system for analyzing a variant polynucleotide of unknown nucleotide sequence believed to differ from the wildtype nucleotide sequence of a nucleic acid molecule of interest in a subject, comprising:

a) a multiplicity of primers computationally designed to detect one or more breakpoint loci in the wildtype nucleic acid sequence of the subject;

b) a computer-executable algorithm for detecting the variant in a sample from a subject having a cancer, the algorithm comprising computational design of a multiplicity of primers;

c) use of the multiplicity of primers in multiplex PCR to amplify any variant polynucleotide present in the sample;

d) analysis of the sequence of each multiplex PCR product;

e) detection of the variant polynucleotide in the sample; and

f) high through-put sequence analysis of the variant polynucleotide to confirm the differences in nucleotide sequence in the variant as compared to the sequence of the wild-type nucleic acid.

23. The system of claim 22 further comprising a machine to perform the multiplex PCR.

24. The system of claim 22, further comprising a machine to sequence the multiplex PCR products.

25. The system of claim 22, further comprising a machine to perform long read sequence analysis of a detected variant polynucleotide.

26. A method for confirming variant polynucleotide of unknown nucleotide sequence which differs from the wildtype nucleotide sequence of a nucleic acid molecule of interest, wherein the variant polynucleotide is in a sample containing up to about 99.9% of the wildtype nucleic acid molecules, the method comprising:

a) sequencing the nucleotide of interest; and

b) analyzing the sequence by:

i) sequence alignment to wild-type nucleotide sequence,

ii) alignment trimming,

iii) clustering of breakpoints, and

iv) confirming the variant sequence.

27. The method of claim 1 or 26, wherein the high through-put sequence analysis is single molecule sequence analysis.

28. The method of claim 1 or 26, wherein multiple polynucleotide variants are detected simultaneously.

29. The method of claim 1 or 26, wherein the sequence analysis has a high error rate.

30. The method of claim 1 or 26, wherein the high through-put error rate is comprised of up to about 20% insertion error rates and deletion rates and/or up to about 5% substitution error rates.

Description:
TARGETED DETECTION OF RECURRENT GENOMIC REARRANGEMENTS

STATEMENT OF GOVERNMENT SUPPORT

[0001] This work is supported in part by grant number R01-HG004962 from the National Institutes of Health. The Government has certain rights to this invention.

FIELD OF THE INVENTION

[0002] The invention relates to methods for identification of genomic variants. The invention further relates to identification of chromosomal anomalies arising from mutation, deletion, substitution, insertion or rearrangement of such gene segments in cancer cells.

BACKGROUND OF THE INVENTION

[0003] Cancer develops through a series of genetic mutations, with tumor cells acquiring pernicious mutations that eventually lead to metastatic disease. The DNA mutations contributing to oncogenesis are not limited to point mutations, but include large chromosomal rearrangements, duplications, and deletions. It has been suggested that recurring mutations are the likely drivers for cancer, and might be viable biomarkers for disease detection and prognosis. However, capturing the DNA breakpoints associated with recurrent deletions is challenging via current targeted sequencing methods due to variability in the deletion size (up to 2 Mb) and exact position. The effort is particularly challenging when the exact loci of the breakpoints that make up the deletion in a given patient (and therefore the sequence of the resulting variant polynucleotide relative to the wild-type genomic nucleic acid) are unknown.

[0004] The utility of a method which overcomes the limitations of the art, however, would be significant. For instance, a rare translocation between chromosomes 21 and 8 fuses the RUNX1 and RUNX1T1 genes to form a chimeric oncoprotein. The chimeric protein contributes to initial leukemia cell growth mostly through transcriptional repression of wild-type RUNX1 targets and is found in, and serves as a marker for, 12% of acute myeloid leukemia (AML) cases. A more common translocation between chromosomes 9 and 22 results in the Philadelphia chromosome observed in 95% of chronic myelogenous leukemia (CML) cases. The translocation event joins the regulation of the BCR gene to the ABL1 tyrosine kinase, with the overall effect of speeding up cell division.

[0005] Loss of DNA may also contribute to cancer progression. For example, many human cancers frequently delete the chromosome 9p21-22 locus containing the MTAP, CDKN2A, and CDKN2B genes. The locus encodes INK4 proteins (p15INK4b, p16INK4a) that inhibit the cyclin-dependent kinases CDK4 and CDK6, and p14ARF, which inactivates MDM2 and thereby regulates p53. Thus, expression of these proteins is responsible for G1 cell-cycle arrest and signaling apoptosis. Homozygous deletions are frequent at the 9p21-22 locus, in particular at CDKN2A, which encodes both p16INK4a and p14ARF. Each single deletion event diminishes expression of multiple proteins with unique tumor suppressor activity.

[0006] Thus, recurrent DNA lesions in cancer are the ideal markers for monitoring cancer progression and therapeutic response in patients. DNA lesions with high frequency and large deviation from normal DNA serve as the best targets because they are detected more often. For example, INK4A and ARF (collectively referred to as CDKN2A) are adjacent tumor suppressor genes that are frequently deleted (~33%) in the early neoplastic stages of different types of cancer.

[0007] In a clinical setting, DNA lesions can be used to (a) detect/characterize tumor DNA in individuals and (b) monitor tumor burden during or after treatment. It has been demonstrated that identification of BCR-ABL gene fusion at the DNA level in leukemia patients leads to a more sensitive test for measuring tumor burden than current BCR-ABL mRNA tests. Measuring changes in tumor burden during therapeutic treatment is critical for checking therapy effectiveness and deciding to continue treatment. It has been found that circulatory tumor DNA had the highest correlation with tumor burden and greater dynamic range than current standard of care CA 15-3 biomarker and circulatory tumor cell counting in metastatic breast cancer.

[0008] These studies all focused on tumor burden monitoring after the specific lesion had been identified and characterized. While monitoring is easy for point mutations and structural variants with known breakpoints, it is very difficult when the breakpoint of the structural variation is not known. For example, deletion of the CDKN2A gene pervades cancer genomes across numerous different types of solid tumor and hematologic malignancies. While the gene loses function in each case, the exact DNA sequence lost in each cancer genome differs. At the same time, large variants are potentially much more specific for tumor detection and monitoring, and a test that could identify them reliably would have higher sensitivity for monitoring tumor burden. Reliable and sensitive identification of breakpoints in tumor DNA could also serve as a diagnostic for early detection.

[0009] Whole genome sequencing experiments (analyzed with appropriate tools like BreakDancer, Pindel, and SVDetect) have the potential to identify point mutations and structural variations in individual samples. However, clinical tumor samples are a mixture of tumor cells and normal cells and require ultra-deep sequencing to analyze tumor DNA.

[0010] Therefore, current approaches apply ultra-deep sequencing after targeted amplification of select genes. Unfortunately, these methods are unable to reliably identify structural variation with uncertain breakpoints. Alternatively, DNA hybridization microarrays (SNP-arrays), which are still widely used in clinics, are capable of calling copy number variation, from which deletions and gene amplifications can be inferred. However, the technology is only reliable with homogeneous samples and only reports low-resolution boundary estimates, which are insufficient for performing tumor burden monitoring assays. Thus, a challenge remains in how to detect DNA markers, specifically somatic structural variations, in a complex patient sample containing a mixture of tumor DNA and germline DNA. This is particularly challenging when the exact breakpoints are needed for quantitative DNA assays. To address these challenges, described herein are methods for more reliable and expansive capture of chromosomal rearrangement events while still using simple molecular biology techniques.

Summary of the Invention

[0011] The present invention is based in part on a method used to detect genetic variations or markers specific to tumor cells. The method targets structural variation (SV) breakpoints occurring in samples composed of heterogeneous tumor and germline DNA. Additionally, this method can validate SVs called by whole exome/genome sequencing and hybridization arrays. The invention further provides for the identification of patient-specific markers and methods to monitor a patient's therapeutic response, remission and relapse. The invention additionally provides kits for the detection of such variants.

[0012] Deletions in a cancer genome are all the result of one breakpoint in a normal genome, while inversions and translocations in a cancer genome are the result of two breakpoints. Therefore, according to the invention, forward oligos are selected from one DNA stretch and reverse oligos are selected from another DNA stretch. If both breakpoints occur in the targeted DNA stretches, then the chromosomal rearrangement will be selectively amplified. A computational pipeline was developed for specialized oligo selection to allow for long range PCR amplification and maximize coverage of potential breakpoints. Rearranged sequences that are amplified are then sequenced and analyzed.
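The geometry behind this selective amplification can be illustrated with a minimal Python sketch. It assumes a simple deletion, forward primers tiled on the upstream stretch, and reverse primers tiled on the downstream stretch; the function name, the coordinates, and the tiling positions are hypothetical and are not taken from the AmBre pipeline.

```python
from bisect import bisect_left, bisect_right

def deletion_amplicon_length(forward, reverse, left_bp, right_bp):
    """Return the expected amplicon length for a deletion, or None.

    forward  -- sorted reference positions of forward primers (upstream stretch)
    reverse  -- sorted reference positions of reverse primers (downstream stretch)
    left_bp, right_bp -- reference coordinates bounding the deleted segment
    (All names and the coordinate model are illustrative assumptions.)
    """
    # Closest forward primer at or before the left breakpoint.
    i = bisect_right(forward, left_bp) - 1
    # Closest reverse primer at or after the right breakpoint.
    j = bisect_left(reverse, right_bp)
    if i < 0 or j == len(reverse):
        return None  # a breakpoint lies outside the tiled stretches
    # After the deletion, the retained flanks are joined end to end.
    return (left_bp - forward[i]) + (reverse[j] - right_bp)

forward = [1_000, 7_000, 13_000]    # hypothetical upstream tiling
reverse = [40_000, 46_000, 52_000]  # hypothetical downstream tiling
print(deletion_amplicon_length(forward, reverse, 9_500, 44_000))  # 4500
```

In wild-type DNA the same primer pair would be tens of kilobases apart and would not amplify, which is why only the rearranged molecule yields a product in this simplified model.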

[0013] Structural variants (SVs) or breakpoints, like CDKN2A deletion, can be utilized as patient-specific tumor biomarkers. The methods described herein, Amplification of Breakpoints ("AmBre"), target SV breakpoints occurring in samples composed of heterogeneous tumor and germline DNA. Additionally, AmBre can validate SVs called by whole exome/genome sequencing and hybridization arrays.

[0014] AmBre involves a PCR-based approach to amplify the DNA segment containing an SV breakpoint and then confirms breakpoints using DNA sequencing technology. To amplify breakpoints with PCR, primers tiling specified target regions are carefully selected with a simulated annealing algorithm to minimize off-target amplification and maximize efficiency at capturing all possible breakpoints within the target regions. To confirm breakpoints, PCR amplicons are combined without barcoding and long-read sequenced simultaneously. The algorithm efficiently separates reads based on breakpoints. Each read group supporting the same breakpoint corresponds with an amplicon, and a consensus amplicon sequence is called. AmBre can target SVs where DNA harboring the breakpoints is present in 1:1000 mixtures.
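As a rough illustration of selecting primers by simulated annealing, the Python sketch below chooses `k` primer positions from a candidate list. It is a generic textbook annealer, not the AmBre-design implementation: the toy cost (`gap_cost`, which only penalizes gaps wider than `d`), the swap move, and the `t0` and `cooling` parameters are all assumptions made for the example.

```python
import math
import random

def gap_cost(positions, start, end, d):
    """Toy cost (assumption): total length by which gaps between adjacent
    primers, including the region ends, exceed the amplifiable distance d."""
    pts = [start] + sorted(positions) + [end]
    return sum(max(b - a - d, 0) for a, b in zip(pts, pts[1:]))

def anneal(candidates, k, d, start, end,
           steps=5000, t0=1000.0, cooling=0.999, seed=0):
    """Pick k candidate positions with low gap_cost by simulated annealing."""
    rng = random.Random(seed)
    current = rng.sample(candidates, k)
    cur_cost = gap_cost(current, start, end, d)
    best, best_cost = list(current), cur_cost
    temp = t0
    for _ in range(steps):
        # Propose swapping one chosen primer for an unused candidate.
        unused = [c for c in candidates if c not in current]
        proposal = list(current)
        proposal[rng.randrange(k)] = rng.choice(unused)
        new_cost = gap_cost(proposal, start, end, d)
        # Accept improvements always; accept worse moves with a
        # probability that shrinks as the temperature falls.
        if new_cost <= cur_cost or rng.random() < math.exp((cur_cost - new_cost) / temp):
            current, cur_cost = proposal, new_cost
            if cur_cost < best_cost:
                best, best_cost = list(current), cur_cost
        temp *= cooling  # geometric cooling schedule
    return sorted(best), best_cost

cands = list(range(0, 100_001, 1000))  # hypothetical candidate sites
best, cost = anneal(cands, k=10, d=10_000, start=0, end=100_000)
print(best, cost)
```

The `cooling` factor plays a role loosely comparable to a convergence rate: slower cooling explores more designs before settling, at the price of more iterations.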

[0015] Therefore, in one embodiment, the invention provides for a high sensitivity method for detecting a variant polynucleotide of unknown nucleotide sequence believed to differ from the wildtype nucleotide sequence of a nucleic acid molecule of interest, wherein the variant polynucleotide is in a sample containing up to about 99.9% of the wildtype nucleic acid molecules, the method comprising: a) computational design of a multiplicity of primers to detect one or more breakpoint loci in the wildtype nucleic acid of interest; b) use of the multiplicity of primers in multiplex PCR to amplify any variant polynucleotides present in the sample; c) analysis of the sequence of each multiplex PCR product; d) detection of the variant polynucleotide in the sample; and e) high through-put sequence analysis of the variant polynucleotide to confirm the differences in nucleotide sequence in the variant as compared to the sequence of the wild-type nucleic acid.

[0016] In one aspect, the variant is a mutation, deletion, insertion, substitution or genomic rearrangement. In another aspect, the genomic rearrangement is an inversion, deletion, translocation, alteration or a duplication. In a further aspect, the multiplicity of primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately 1-20 kb around the locus of interest and the innermost primers are separated by approximately at least 15-100 kb. In an additional aspect, the multiplicity of primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 5-15 kb around the locus of interest and the innermost primers are separated by approximately at least 20-80 kb. In a preferred aspect, the primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 6 kb around the locus of interest and the innermost primers are separated by approximately at least 20 kb. In one aspect, the ratio of primer spacing to locus of interest is 3 kb to 380 kb. In a further aspect, the ratio of primer spacing to locus of interest is 6 kb to 80 kb. In an additional aspect, the ratio of primer spacing to locus of interest is 3 kb to 80 kb. In one aspect, the multiplicity of primers are selected for minimum cost in loss of assay sensitivity, wherein the cost is calculated by the equation:

$$C(P) = \sum_{(i,j) \in E_P} w + \sum_{i} \max\{\Delta_i(P) - d,\ 0,\ (1-\rho)d - \Delta_i(P)\}$$
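A cost of this shape can be evaluated with a few lines of Python. The symbols are not defined in this part of the document, so the sketch rests on an assumed reading: a fixed penalty `w` for each primer pair in a set of problematic pairs, plus a spacing term that is zero while the gap between adjacent primers lies between `(1 - rho) * d` and `d` and grows linearly outside that window. The function and argument names are hypothetical.

```python
def primer_set_cost(positions, flagged_pairs, w, d, rho):
    """Evaluate a two-term primer-set cost (assumed interpretation).

    positions     -- primer coordinates on the reference
    flagged_pairs -- set of primer index pairs treated as penalized
                     (assumed to correspond to the first sum)
    w             -- penalty added per flagged pair (assumption)
    d             -- largest acceptable gap between adjacent primers
    rho           -- tolerance; gaps below (1 - rho) * d are also penalized
    """
    pts = sorted(positions)
    pair_term = w * len(flagged_pairs)
    spacing_term = 0.0
    for a, b in zip(pts, pts[1:]):
        gap = b - a
        # Zero inside the window [(1 - rho) * d, d], linear outside it.
        spacing_term += max(gap - d, 0.0, (1 - rho) * d - gap)
    return pair_term + spacing_term

# Gaps of 6000, 4000 and 9000 with d = 6000 and rho = 0.25:
# penalties 0 + 500 + 3000, plus one flagged pair at w = 500.
cost = primer_set_cost([0, 6000, 10000, 19000], {(0, 2)}, w=500, d=6000, rho=0.25)
print(cost)  # 4000.0
```

Under this reading, widely spaced primers lose breakpoints that fall too far from any primer, while tightly packed primers waste oligos, so both directions add cost.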

[0017] In one aspect, the analysis of the multiplex PCR products comprises: a) sequencing the nucleotide of interest and b) analyzing the sequence by: i) sequence alignment to wild-type nucleotide sequence, ii) alignment trimming, iii) clustering of breakpoints, and iv) confirming the variant sequence. In an additional aspect, the multiplicity of primers consists of approximately 2-80 primers. In a preferred aspect, the multiplicity of primers consists of approximately 16 primers. In a further aspect, the variant is detected in multiple samples simultaneously. In one aspect, the method can detect multiple variants simultaneously. In one aspect the high through-put sequence analysis is single molecule sequence analysis. In a further aspect, the high through-put sequence analysis has a high error rate. In another aspect, the high through-put sequence analysis error rate is comprised of up to about 20% insertion error rates and deletion rates and/or up to about 5% substitution error rates.
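The breakpoint clustering step can be pictured with the simplified Python sketch below. It assumes each read has already been aligned and reduced to a pair of reference coordinates (left, right) for its breakpoint, and it groups pairs that agree within a tolerance `tol` on both coordinates. This is a plain gap-based grouping written for illustration; the function names, the tolerance, and the median consensus are assumptions, and it is not the clustering code used by AmBre.

```python
from statistics import median

def split_on_gaps(items, key, tol):
    """Sort items by key and cut wherever neighbours differ by more than tol."""
    items = sorted(items, key=key)
    groups, current = [], [items[0]]
    for prev, item in zip(items, items[1:]):
        if key(item) - key(prev) > tol:
            groups.append(current)
            current = []
        current.append(item)
    groups.append(current)
    return groups

def cluster_breakpoints(calls, tol=50):
    """Group per-read (left, right) breakpoint estimates into clusters.

    Noisy reads shift the estimated coordinates slightly, so estimates
    within tol bases of a neighbour are treated as the same breakpoint.
    """
    if not calls:
        return []
    clusters = []
    # First sweep along the left coordinate, then along the right one.
    for x_group in split_on_gaps(calls, key=lambda c: c[0], tol=tol):
        for group in split_on_gaps(x_group, key=lambda c: c[1], tol=tol):
            clusters.append({
                "left": median(c[0] for c in group),
                "right": median(c[1] for c in group),
                "support": len(group),  # reads supporting this breakpoint
            })
    return clusters

calls = [(1000, 5000), (1004, 4996), (998, 5003), (3000, 9000), (3002, 9001)]
clusters = cluster_breakpoints(calls)
print(clusters)
```

Taking the median within each cluster is one simple way to damp the insertion and deletion errors of long reads; the support count corresponds to the depth of breakpoint coverage.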

[0018] In an additional aspect, the method comprises a further step of prognosing, determining progression of cancer, predicting a therapeutic regimen or predicting benefit from therapy in a subject having a disease based on the detection of a variant. In one aspect, the disease is cancer. In a further aspect, the cancer is selected from the group consisting of a carcinoma, sarcoma, leukemia, lymphoma, myeloma, and a CNS tumor. In some embodiments, the cancer is selected from the group consisting of skin cancer, lung cancer, colon cancer, pancreatic cancer, prostate cancer, liver cancer, thyroid cancer, ovarian cancer, uterine cancer, breast cancer, cervical cancer, kidney cancer, epithelial carcinoma, squamous cell carcinoma, basal cell carcinoma, melanoma, papilloma, and adenomas.

[0019] In a further embodiment, the invention provides a kit for detecting a variant polynucleotide having a nucleotide sequence differing from the wildtype nucleotide sequence of a nucleic acid molecule, wherein the variant polynucleotide is in a sample containing up to about 99.9% of the wildtype nucleic acid molecules, the kit comprising a) a multiplicity of primers; and b) algorithms to detect the variant. In one aspect, the variant is a mutation, deletion, insertion, substitution or genomic rearrangement.

[0020] In a further aspect, the multiplicity of primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately 1-20 kb around the locus of interest and the innermost primers are separated by approximately at least 15-100 kb. In an additional aspect, the multiplicity of primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 5-15 kb around the locus of interest and the innermost primers are separated by approximately at least 20-80 kb. In a preferred aspect, the primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 6 kb around the locus of interest and the innermost primers are separated by approximately at least 20 kb. In a further aspect, the ratio of primer spacing to locus of interest is 6 kb to 80 kb.

[0021] In a further embodiment, the present invention provides a system for analyzing a variant polynucleotide of unknown nucleotide sequence believed to differ from the wildtype nucleotide sequence of a nucleic acid molecule of interest in a subject, comprising: a) a multiplicity of primers computationally designed to detect one or more breakpoints in the wildtype nucleic acid sequence of the subject; b) a computer-executable algorithm for detecting the variant in a sample from a subject having a cancer, the algorithm comprising computational design of a multiplicity of primers; c) use of the multiplicity of primers in multiplex PCR to amplify any variant polynucleotide present in the sample; d) analysis of the sequence of each multiplex PCR product; e) detection of the variant polynucleotide in the sample; and f) long range sequencing of the variant polynucleotide to confirm the differences in nucleotide sequence in the variant as compared to the sequence of the wild-type nucleic acid. In one aspect, the system further comprises a machine to perform the multiplex PCR. In another aspect, the system further comprises a machine to sequence the multiplex PCR products. In a further aspect, the system further comprises a machine to perform long range sequencing of a detected variant polynucleotide.

[0022] In a further embodiment, the invention provides a method for confirming a variant polynucleotide of unknown nucleotide sequence which differs from the wildtype nucleotide sequence of a nucleic acid molecule of interest, wherein the variant polynucleotide is in a sample containing up to about 99.9% of the wildtype nucleic acid molecules, the method comprising: a) sequencing the nucleotide of interest; and b) analyzing the sequence by i) sequence alignment to wild-type nucleotide sequence, ii) alignment trimming, iii) clustering of breakpoints, and iv) confirming the variant sequence. In one aspect the high through-put sequence analysis is single molecule sequence analysis. In an additional aspect, multiple polynucleotide variants are detected simultaneously. In a further aspect, the high through-put sequence analysis has a high error rate. In another aspect, the high through-put sequence analysis error rate is comprised of up to about 20% insertion error rates and deletion rates and/or up to about 5% substitution error rates.

Brief Description of the Drawings

[0023] Figure 1 shows a standard PAMP tiling design for capture of CDKN2A deletions. CDKN2A upstream and downstream breakpoint regions are defined on a germline genome (blue and red lines, respectively). Tiled forward primers and reverse primers are spaced 1 kb apart (width of hashed boxes; not to scale with reference). Overlap of a blue box and a red box on tumor DNA indicates that a forward and a reverse primer are less than 2 kb apart, which will lead to amplification of tumor DNA harboring CDKN2A deletion breakpoints.

[0024] Figure 2 is a flow chart of the AmBre method with primer designing and long fragment sequence analysis.

[0025] Figure 3 depicts the process of designing, analyzing and selecting primers. a) Candidate primers are uniformly distributed in the CDKN2A locus, suggesting good primer designs are possible. AmBre primer design was used to capture CDKN2A deletion upstream and downstream breakpoints in regions chr9:21,730,000-21,965,000 and chr9:21,975,000-22,129,000 (GRCh37 coordinates), respectively. b) Simulated annealing, using different convergence rates, is used to select good primer designs with the lowest sensitivity cost. The convergence rate that finds the lowest sensitivity cost primer design will depend on the input given to AmBre-design. c) Final low cost 68-primer design to capture CDKN2A deletions in a 400kb breakpoint region. The solution has a 97.6% and a 99.7% coverage of breakpoint regions. The fraction of break pairs captured by the design (resulting in amplicon length < 13kb) is 99.99%.

[0026] Figure 4 shows the PCR products of AMBRE-16 on cell lines: A549 (lane 2), CEM (lane 3), Detroit562 (lane 4), HeLa (lane 5), MCF7 (lane 6), and T98G (lane 7). Lane 1 contains 4 μl of 1 kb GeneRuler. Lanes are reactions starting with 10 ng of cell-line genomic DNA. HeLa cells (no CDKN2A deletion) and H2O are negative controls. Arrow denotes weak Detroit562 band; another PCR had stronger amplification and was used for subsequent sequencing.

[0027] Figure 5 shows the aggregates of breakpoints from PacBio™ fragments after sweep line clustering. Only breakpoints with L < 1000 are displayed for inset boxes. The height of each cluster corresponds with number of fragments supporting the breakpoint (depth of breakpoint coverage).

[0028] Figure 6 shows the breakpoint sequences for A549, CEM, Detroit562, MCF7, and T98G with orthogonal validation chromatogram of MCF7 and T98G. AmBre-analyze captures both breakpoints and nontemplated insert sequence.

[0029] Figure 7 shows that subsampling of 9 primers from the complete AMBRE-68 tiling design results in clean amplification of CDKN2A loss DNA fragments in 6 cell lines. From left to right, lanes contain 1 kb Plus GeneRuler DNA ladder and PCR products from samples A549 (2.2kb), CEM (5.8kb), MCF7 (3.6kb), MOLT4 (6.8kb), T98G (7.5kb), HEK (0kb), and water (0kb). The expected lengths of each amplicon according to the AMBRE-68 design are listed in parentheses. HEK cells (no CDKN2A deletion) and H2O are negative controls.

[0030] Figure 8 depicts successful A549 (arrow) and MCF7 (arrow) CDKN2A deletion amplification with heterogeneity ratios 1:1, 1:10, 1:100, 1:1000 (lanes 3-6 for A549 and lanes 10-13 for MCF7) and 16 primers starting with 400 ng of gDNA. Lane 1 contains 1 kbp Plus GeneRuler DNA ladder. Lanes 2 and 9 are A549 and MCF7 positive control reactions starting with 20 ng of homogenous gDNA. Lanes 7 and 14 are negative control reactions with wild-type DNA, and lanes 8 and 15 are water negative control reactions with corresponding 16 primer mixes.

[0031] Figure 9 shows a) a fragment-segmentation example for local alignments 1, 2, 3, and 4 along a PacBio fragment, and b) a triangle representation of adjacent alignments 1, 2, and 3 on the G x G plane.

[0032] Figure 10 shows characterization of the RUNX1-RUNX1T1 balanced translocation in Kasumi-1. Lanes 1, 2, 4, 6 and 8 contain 1 kb Plus GeneRuler DNA ladder, PCR products from Kasumi-1 Der8 with all 28 primers (3.5 kbp), 14 primer FE ∪ RO (3.5 kbp), 14 primer FO ∪ RO (6.8 kbp), and 14 primer FO ∪ RE (10.1 kbp). Lanes 3, 5, 7 and 9 contain matching water controls, which show no contamination. Lanes 10, 12, 14, and 16 contain PCR products from Kasumi-1 Der21 with all 29 primers (2.7 kbp), 15 primer FO ∪ RO (2.7 kbp), 15 primer FE ∪ RO (6.1 kbp), and 14 primer FE ∪ RE (8.1 kbp). The gel was loaded with 2 μl for lanes 2, 3, 4, 5, 10, 11, 12 and 13, and 4 μl for the remaining lanes. Reactions with shorter amplicons amplified extremely well and lesser volumes were used for visualization on the gel. The expected amplicon lengths according to the Der8 and Der21 design are listed in parentheses.

[0033] Figure 11 shows DNA helix stability around breakpoints. Using code from BreakSeq pipeline, DNA flexibility or the 6 breaks around proposed non-homologous end joining DNA breaks showed no significant deviation.

Detailed Description of the Invention

[0034] The present invention is based in part on a method used to detect genetic variations or markers specific to tumor cells. The method targets structural variation (SV) breakpoints occurring in samples composed of heterogeneous tumor and germline DNA. Additionally, this method can validate SVs called by whole exome/genome sequencing and hybridization arrays. The invention further provides for the identification of patient-specific markers and methods to monitor a patient's therapeutic response, remission and relapse. The invention additionally provides kits for the detection of such variants.

[0035] Before the present compositions and methods are described, it is to be understood that this invention is not limited to particular compositions, methods, and experimental conditions described, as such compositions, methods, and conditions may vary. It is also to be understood that the terminology used herein is for purposes of describing particular embodiments only, and is not intended to be limiting, since the scope of the present invention will be limited only in the appended claims.

[0036] As used in this specification and the appended claims, the singular forms "a", "an", and "the" include plural references unless the context clearly dictates otherwise. Thus, for example, reference to "the method" includes one or more methods, and/or steps of the type described herein which will become apparent to those persons skilled in the art upon reading this disclosure and so forth.

[0037] It will be understood that the terms "variant", "structural variant", "SV" and "breakpoint" as used throughout this disclosure refer to a polynucleotide having a sequence differing from the wildtype nucleotide sequence of a nucleic acid molecule.

[0038] It will be understood that the terms "high through-put sequence analysis" or "high through-put sequencing" as used throughout this disclosure refer to a method for obtaining digital readouts of sequence from a DNA sample. Typically, a multiplicity of reads is obtained from a single sample with lengths ranging from 30 to 20000 bp. Reads with lengths greater than 100 bp are considered long reads. Single molecule sequencing is an example of high through-put sequencing in which long nucleotide sequences can be sequenced in a single pass without shearing of PCR products. High throughput sequencing may be performed using any appropriate sequencing technology. Examples of such technologies include Sanger™ sequencing, SBS™ sequencing and Pacific Biosciences™ sequencing.

[0039] It will be understood that the terms "primer" or "primers" as used throughout this disclosure refer to an oligonucleotide sequence used as a reagent for PCR. The oligonucleotides are typically 20-40 nucleotides in length, but may be longer or shorter. Primers may be modified for detection. Such modifications include biotinylation, labeling with fluorescent dyes or other known labeling methods.

[0040] It will be understood that the term "wild type DNA" as used throughout this disclosure refers to any DNA which does not have a variant polynucleotide. Wild type DNA would include heterogeneous tumor DNA.

[0041] A mutation is a change of the nucleotide sequence of the genome. Mutations result from unrepaired damage to DNA or to RNA genomes, errors in the process of replication, or from the insertion or deletion of segments of DNA by mobile genetic elements. Mutation can result in several different types of change in sequences. Mutations in genes can either have no effect, alter the product of a gene, or prevent the gene from functioning properly or completely. Mutations can also occur in noncoding regions. Examples of mutations include point mutations, substitutions, insertions or deletions.

[0042] Genomic rearrangements may be large scale nucleic acid mutations. Examples of genomic rearrangements include inversion in which the nucleotide orientation is reversed, deletion in which nucleotides are deleted, translocation in which nucleotides are moved to a different chromosomal location, alteration or a duplication of nucleotides. Examples of genomic rearrangements include CDKN2A deletion, TP53 deletion, PIK3A deletion, EGFR deletion, PTEN deletion, KLF5 deletion, BCR-ABL translocation, RUNX1-RUNX1T1 translocation, RARA-PML translocation, MLLT3-MLL translocation, TMPRSS2-ERG translocation and ERB2 inversion.

[0043] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the invention, the preferred methods and materials are now described.

[0044] The invention provides advantages over existing technology by, for example: a) using long range polymerases, which provides a higher coverage of genetic lesion variability; and b) using long range sequencing to confirm detection, which reduces false positives and can pinpoint the exact breakpoints associated with the DNA lesion.

[0045] The method of the present invention, also referred to as Amplification of Breakpoints ("AmBre"), is based on PCR, where a DNA product is amplified only when a tumor specific genetic lesion is present within a patient's DNA sample. The invention improves on Primer Approximation Multiplex PCR (PAMP) methodology but differs in several aspects, including utilization of a long range PCR protocol instead of a standard PCR protocol and in not requiring a microarray hybridization step. PAMP is a PCR assay, developed to selectively amplify the tumor DNA sequence containing a structural variation (United States Patent Application No. 12/375,912 relating to PAMP is incorporated herein by reference).

[0046] PAMP has several drawbacks. In the multiplexed reaction, all primers must be evenly spaced so as to amplify any deletion in the region and primers cannot dimerize. In a large (e.g. 100 Kbp) region, 100 applicable primers must be identified from a large candidate set of over 5000 potential primers. An exhaustive search of all candidate primer combinations is infeasible (5000 candidate primers and 50 to 100 primers desired would result in searching $\sum_{50 \le i \le 100} \binom{5000}{i} \approx 10^{211}$ combinations).

[0047] PAMP is therefore limited to detecting recurrent structural variations where breakpoints appear in short breakpoint regions (< 40kb), as a large number of primers in a multiplex reaction inevitably leads to loss of sensitivity with off-target DNA synthesis and increased spurious primer-primer interactions. Finally, PAMP detects the amplified product and identifies breakpoints via DNA hybridization arrays which had the additional challenge of designing probes that match the primer designs. AmBre resolves these issues with a three-phase approach (Fig. 2). The first phase (AmBre-design) involves a revised computational approach to designing multiplex primers on discontiguous DNA regions ignoring regions known to not contain breakpoints. This requires some changes to the optimization function and results in a more flexible design with better performance on sparse regions. The output of this phase is a multiplicity of primers that can be mixed in a single multiplex reaction.
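The size of the exhaustive search space quoted for PAMP can be checked with exact integer arithmetic. The following is a minimal sketch; the function name `count_designs` is illustrative and not part of the disclosure.

```python
from math import comb

def count_designs(n_candidates: int, k_min: int, k_max: int) -> int:
    """Number of primer subsets whose size lies between k_min and k_max (inclusive)."""
    return sum(comb(n_candidates, k) for k in range(k_min, k_max + 1))

total = count_designs(5000, 50, 100)
# Order of magnitude: number of decimal digits minus one.
magnitude = len(str(total)) - 1
print(f"about 10^{magnitude} candidate designs")
```

The sum is dominated by its largest term (subsets of 100 primers), which is why an exhaustive search is ruled out and a heuristic search is used instead.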

[0048] In the second phase (AmBre-amplify) long range PCR amplifies target amplicons, reducing the number of primers required in a single reaction substantially as compared to PAMP. For example, PAMP would require 600 primers to cover a 600Kbp region, with over 180,000 putative interactions. In contrast, to cover the same region, AmBre would need < 100 primers with only 5,000 possible interactions, which improves reliable amplification from proposed designs.
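The interaction counts quoted above are consistent with counting every unordered primer pair, including each primer paired with itself. That counting convention is an assumption (the text does not define it); under it, the arithmetic can be sketched as:

```python
def interaction_count(n_primers: int) -> int:
    """Unordered primer pairs, counting each primer paired with itself (assumed convention)."""
    return n_primers * (n_primers + 1) // 2

pamp_pairs = interaction_count(600)   # 600 primers for a 600 Kbp region
ambre_pairs = interaction_count(100)  # upper bound of 100 primers for the same region
print(pamp_pairs, ambre_pairs)
```

The number of putative interactions grows quadratically with the number of primers, so a sixfold reduction in primers removes roughly 97% of the pairs that must be screened.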

[0049] In AmBre, the amplified products are sequenced using a system whose read analysis allows the amplicons to be mixed prior to sequencing, with breakpoints separated computationally in the third phase. A non-limiting example of such a system is the Pacific Biosciences RS Platform (PacBio™).

[0050] The third phase (AmBre-analyze) involves a customized analysis of sequenced reads to identify DNA breakpoints for each tumor genome. The analysis involves clustering of split mapped reads followed by error correction, and sequence reconstruction around the breakpoint regions. For example, AmBre can detect targeted structural variations (potential tumor DNA biomarkers) by identifying deletion breakpoints in the cancer cell lines A549, CEM, Detroit562, MCF7, MOLT4, and T98G, including resolution of previously unidentified CDKN2A breakpoints in MCF7 and T98G. Furthermore, AmBre can be used to identify translocations and inversions; e.g., as exemplified herein for RUNX1-RUNX1T1 translocation in the cancer cell line Kasumi-1.

[0051] Therefore, in one embodiment, the invention provides for a high sensitivity method for detecting a variant polynucleotide of unknown nucleotide sequence believed to differ from the wildtype nucleotide sequence of a nucleic acid molecule of interest, wherein the variant polynucleotide is in a sample containing up to about 99.9% of the wildtype nucleic acid molecules, the method comprising: a) computational design of a multiplicity of primers to detect one or more breakpoint loci in the wildtype nucleic acid of interest; b) use of the multiplicity of primers in multiplex PCR to amplify any variant polynucleotides present in the sample; c) analysis of the sequence of each multiplex PCR product; d) detection of the variant polynucleotide in the sample; and e) high through-put sequence analysis of the variant polynucleotide to confirm the differences in nucleotide sequence in the variant as compared to the sequence of the wild-type nucleic acid.

[0052] In one aspect, the variant is a mutation, deletion, insertion, substitution or genomic rearrangement. In another aspect, the genomic rearrangement is an inversion, deletion, translocation, alteration or a duplication. In a further aspect, the multiplicity of primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately 1-20 kb around the locus of interest and the innermost primers are separated by approximately at least 15-100 kb. In an additional aspect, the multiplicity of primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 5-15 kb around the locus of interest and the innermost primers are separated by approximately at least 20-80 kb. In one aspect, the primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 4, 5, 6, 7 or 8 kb around the locus of interest and the innermost primers are separated by approximately at least 15, 20, 25, 30, 35, 40, 45, 50, 60, 70 or 80 kb. In a preferred aspect, the primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 6 kb around the locus of interest and the innermost primers are separated by approximately at least 20 kb. In one aspect, the ratio of primer spacing to locus of interest is 3 kb to 380 kb. In a further aspect, the ratio of primer spacing to locus of interest is 6 kb to 80 kb. In an additional aspect, the ratio of primer spacing to locus of interest is 3 kb to 80 kb.

[0053] In one aspect, the multiplicity of primers are selected for minimum cost in loss of assay sensitivity, wherein the cost is calculated by the equation:

$$C(P) = \sum_{(i,j) \in E} w_p + \sum_{i} \max\{\Delta_i(P) - d,\ 0,\ (1-p)d - \Delta_i(P)\}$$

[0054] In one aspect, the analysis of the multiplex PCR products comprises: a) sequencing the nucleotide of interest and b) analyzing the sequence by: i) sequence alignment to wild-type nucleotide sequence, ii) alignment trimming, iii) clustering of breakpoints, and iv) confirming the variant sequence. In an additional aspect, the multiplicity of primers consists of approximately 2-80 primers. In another aspect, the multiplicity of primers consists of approximately 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 40, 50 or 60 primers. In a preferred aspect, the multiplicity of primers consists of approximately 16 primers. In a further aspect, the variant is detected in multiple samples simultaneously. In one aspect, the method can detect multiple variants simultaneously. In an additional aspect, the method can detect multiple variant polynucleotides in multiple samples simultaneously. In one aspect the high through-put sequence analysis is single molecule sequence analysis. In a further aspect, the high through-put sequence analysis has a high error rate. In another aspect, the high through-put sequence analysis error rate is comprised of up to about 20% insertion error rates and deletion rates and/or up to about 5% substitution error rates. In another aspect, the high throughput sequencing is performed with gel electrophoresis. The subject method can be used for both the detection of known variants and for the identification of unknown variants.

[0055] Primer design

[0056] The present invention utilizes unique computational analysis for the design of the multiplicity of primers. The primer design used in the subject method is critical. The primers must be unique, must not self-hybridize or hybridize with other primers and must be evenly spaced over the desired length of DNA and amplify the non-target locus wild type DNA. The input to AmBre-design is a collection of genomic intervals for the forward region, denoted by F, a collection of genomic intervals for the reverse region (R), and parameter d. The output is a collection of forward primers in F and reverse primers located in R spaced apart by approximately d.

[0057] In practice, AmBre-design has the following steps:

• Candidate primer generation from target breakpoint regions, where oligonucleotides are selected according to thermodynamic properties. Primers with significant self-dimerization are eliminated. Primers that are likely to dimerize, or cause off-target amplifications are marked as incompatible.

• The candidate list of primers and incompatible primers is used to design an optimal set of primers based on considerations outlined below.

[0058] Firstly, denote a primer design, P, as a subset of candidate primers numbered according to the order of genomic start locations $l_1, l_2, \ldots, l_n$. Let set E denote the incompatible primer pairs. A cost C(P) is associated with each design, and designs with minimum cost are preferred. The formulation of cost accommodates sparser primer designs and targeting discontiguous regions. The parameter d is set to be half the maximum feasible PCR amplicon size. Thus, for long-range polymerases used here, d = 6500 was used, corresponding to a desirable amplicon size < 13Kbp. The cost of the design is a sum of incompatibility costs for each pair, and coverage costs.

[0059] For the coverage, let $\Delta_i(P) = l_{i+1} - l_i$ denote the gap between adjacent primers. If $\Delta_i(P) > d$, there is a risk of the product being too long to be amplified. On the other hand, if $\Delta_i(P) \ll d$, the design has extra primers that greatly decrease the efficiency of the multiplex reactions. Let parameter $p$, with $0 < p < 1$, describe a target density of $1 + p$ primers every $d$ bp, corresponding to a primer every $d/(1+p) \approx (1-p)d$ bp. Ideally, the distance between adjacent primers is bounded by $(1-p)d < \Delta_i(P) < d$. A design is penalized if the distances violate these constraints. Formally,

$$C(P) = \sum_{(i,j) \in E} w_p + \sum_{i} \max\{\Delta_i(P) - d,\ 0,\ (1-p)d - \Delta_i(P)\}$$
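The cost of a design can be evaluated directly from the primer start locations. The following is a minimal sketch, not the inventors' implementation: primers are identified by their genomic start location, incompatible pairs are stored as a set of two-element frozensets, and the names `design_cost` and `incompatible` are hypothetical.

```python
import math
from itertools import combinations

def design_cost(positions, incompatible, d=6500, p=0.2, w_p=math.inf):
    """Cost C(P) of a primer design.

    positions    -- genomic start locations of the selected primers
    incompatible -- set of frozenset({l_i, l_j}) pairs marked incompatible (assumed encoding)
    """
    locs = sorted(positions)
    # Incompatibility cost: w_p for every selected pair that is marked incompatible.
    pair_cost = sum(w_p for a, b in combinations(locs, 2)
                    if frozenset((a, b)) in incompatible)
    # Coverage cost: penalize gaps longer than d or shorter than (1 - p) * d.
    gap_cost = 0.0
    for left, right in zip(locs, locs[1:]):
        gap = right - left
        gap_cost += max(gap - d, 0, (1 - p) * d - gap)
    return pair_cost + gap_cost

print(design_cost([0, 6000, 12000], set()))  # gaps inside the allowed band
print(design_cost([0, 7000], set()))         # gap exceeds d
print(design_cost([0, 4000], set()))         # gap shorter than (1 - p) * d
```

With d = 6500 and p = 0.2, gaps between 5200 and 6500 bp incur no coverage cost, and any design containing an incompatible pair has infinite cost when the pair weight is infinite.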

[0060] A single incompatible pair severely diminishes the multiplex reaction; therefore, $w_p = \infty$. For the designs herein, $p = 0.2$. Simulated annealing is used to find low cost primer designs by applying the cost function (Fig. 3b). The algorithm explores the large space of all primer designs by initiating a random primer subset and improving the primer subset with iterative additions or removals of primers. Since the algorithm involves randomization and has parameters governing convergence to low cost designs, simulated annealing is repeated multiple times under different rates of convergence. The lowest cost primer design from all simulated annealing runs is used as the final primer tiling design (Fig. 3b).
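The search described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the inventors' implementation: the acceptance rule (Metropolis), the geometric cooling schedule, the step counts, and the toy cost `anchored_gap_cost` (coverage cost only, with the region boundaries treated as fixed anchors) are all assumptions chosen to keep the example small.

```python
import math
import random

def anchored_gap_cost(design, start, end, d=6500, p=0.2):
    """Coverage cost only, with the region boundaries as fixed anchors (assumed toy cost)."""
    locs = [start] + sorted(design) + [end]
    return sum(max(b - a - d, 0, (1 - p) * d - (b - a)) for a, b in zip(locs, locs[1:]))

def anneal(candidates, cost, steps=5000, t_start=1000.0, cooling=0.999, seed=0):
    """Search for a low-cost primer subset by adding or removing one primer per step."""
    rng = random.Random(seed)
    current = {c for c in candidates if rng.random() < 0.5}  # random initial subset
    current_cost = cost(current)
    best, best_cost = set(current), current_cost
    temperature = t_start
    for _ in range(steps):
        proposal = current ^ {rng.choice(candidates)}  # toggle one primer in or out
        proposal_cost = cost(proposal)
        delta = proposal_cost - current_cost
        # Always accept improvements; accept worse designs with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = proposal, proposal_cost
            if current_cost < best_cost:
                best, best_cost = set(current), current_cost
        temperature *= cooling
    return best, best_cost

def cost(design):
    return anchored_gap_cost(design, 0, 40000)

candidates = list(range(500, 40000, 500))
# Repeat under different rates of convergence and keep the lowest-cost design.
runs = [anneal(candidates, cost, cooling=c, seed=s)
        for s, c in enumerate((0.995, 0.999, 0.9995))]
best_design, best_cost = min(runs, key=lambda run: run[1])
print(sorted(best_design), best_cost)
```

Running several schedules and keeping the minimum mirrors the repeated runs described in the text; with an infinite incompatibility weight, a practical implementation would need to reject such proposals outright rather than compute a cost difference.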

[0061] In one aspect the multiplicity of primers is approximately between 2-40. In other aspects the multiplicity of primers may be 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75 or 80. In a preferred aspect the multiplicity of primers is 16.

[0062] As used herein "locus of interest" refers to the nucleotide sequence targeted for detection or identification of a variant or breakpoint. In some aspects, the locus of interest is 15 kb to 100 kb in length. In other aspects, the locus of interest is 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 150, 200, 250, 300, 400, or 500 kb.

[0063] In an additional aspect, the multiplicity of primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately 1-20 kb around the locus of interest and the innermost primers are separated by approximately at least 15-100 kb. In another aspect, the multiplicity of primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 kb around the locus of interest and the innermost primers are separated by approximately at least 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75 or 80 kb. In a preferred aspect, the primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 6 kb around the locus of interest and the innermost primers are separated by approximately at least 20 kb.

[0064] In one aspect, the ratio of primer spacing to the locus of interest is approximately 2 kb:400 kb to 10 kb:100 kb. In further aspects, the ratio of primer spacing to the locus of interest is approximately 3 kb:380 kb, 3 kb:80 kb or 6 kb:80 kb.

[0065] In a further aspect, the primers are selected for a minimum cost in loss of sensitivity, wherein the cost of a primer design is calculated by the equation:

$$C(P) = \sum_{(i,j) \in E} w_p + \sum_{i} \max\{\Delta_i(P) - d,\ 0,\ (1-p)d - \Delta_i(P)\}$$

[0066] Amplification

[0067] In the subject method, amplification is performed using long range multiplex PCR, reducing the number of primers required in a single reaction. Long Range PCR refers to the amplification of DNA lengths that cannot typically be amplified using routine PCR methods or reagents. Multiplex PCR is a modification of polymerase chain reaction in order to target variable products or variability in breakpoints. This process amplifies genomic DNA samples using multiple primers and a temperature-mediated DNA polymerase in a thermal cycler. In contrast to PAMP, the subject method can be used for DNA lengths > 40 kb. For example, the subject method can be used for DNA lengths of 40 kb, 50 kb, 60 kb, 70 kb, 80 kb, 90 kb, 100 kb, 150 kb, 200 kb, 250 kb, 300 kb, 350 kb, 400 kb, 450 kb or 500 kb.

[0068] Analysis of Sequences of PCR Products, Identification of the Variant and Confirmation of the Variant

[0069] Sequencing of the amplified PCR products and computational analysis confirms capture of the variation or breakpoint (for example CDKN2A deletions). The subject method utilizes sequencing technology suited to long reads, structural variation calling, and throughput. An example of a suitable sequencing platform is the PacBio™ system from Pacific Biosciences of Menlo Park, California, although other platforms might be utilized. The amplified PCR products are sequenced and then aligned to a reference genome to identify variants or breakpoints, together with the sequence around each breakpoint.

[0070] A breakpoint is defined according to the computational algorithm applied in the AmBre-analyze step as a pair of disjoint coordinates a and b on a reference, and a non-template sequence s (of length $\ell$) such that the sample sequence brings a and b together, separated only by the insertion of s. The objective of AmBre-analyze is to take as input a collection of sample sequences aligned to the reference genome and output a collection of breakpoints along with the sequence around each breakpoint.

[0071] A unique aspect to the subject application is the detection of a variant polynucleotide while correcting for a high number of sequencing errors. For example the subject method can detect a variant polynucleotide in a multiplicity of reads with error rates such as up to 20% insertion and up to 5% substitution rates. AmBre-analyze works by (a) alignment trimming, (b) breakpoint or variant clustering of fragments, and (c) consensus sequence generation around each breakpoint or variant, as outlined in Figure 2 and described further below.

[0072] Alignment Trimming

[0073] BLASR-computed local alignments between the PacBio™ sequencing platform reads and human reference assembly were provided as input to alignment trimming. An alignment pair $(F_a, G_a)$, $(F_b, G_b)$ with $a \ll b$ between a fragment F and reference G implies a breakpoint. The goal of alignment trimming is to trim the ends of each alignment for each fragment F, so that (a) each segment of F participates in a single alignment; and, (b) F is maximally covered.

[0074] A local alignment is defined as a pair of intervals from the fragment and reference that can be aligned with a small number of edits. A split mapped fragment F supports a breakpoint (a, b, s) with two local alignments (denoted as $(F_a, G_a)$, $(F_b, G_b)$). In the ideal case, $G_a$ ends at a, and $G_b$ begins at b, while the fragment segment between $F_a$ and $F_b$ is exactly the inserted sequence s (Fig. 2). However, in real data, a fragment can span multiple breakpoints, sequence errors can result in spurious incorrect alignments, and the alignments output by standard tools like BLASR will have uncertain boundaries. Specifically, inaccurate boundaries might result in overlapping consecutive segments $F_a$, $F_b$.

[0075] AmBre-analyze resolves these errors by choosing the optimal alignment segments covering the fragment F. For a fragment F, the input is a chain of local alignments $\mathcal{F} = (F_a, G_a), (F_b, G_b), \ldots$. The output is a subset $\mathcal{F}' = (F'_a, G'_a), \ldots$ of $\mathcal{F}$ with alignment boundaries trimmed so (1) none of the fragment segments $F'_a, F'_b, \ldots$ overlap, (2) the number of distinct alignments is minimized, and (3) most of fragment F is covered. The second and third objectives reinforce the notion that a typical fragment covers a small number of breakpoints and is mostly well-aligned except for non-templated insertion sequence. The first objective helps to narrow down the breakpoint coordinates. To clarify, consider a trimmed reference interval $G'_a$ that ends at x and a consecutive interval $G'_b$ beginning at y, while the gap between corresponding fragment segments is L. Then, we expect that a > x, b < y, and

$$L = \ell + (a - x) + (y - b),$$

where $\ell$ is the length of the inserted sequence s.

[0076] Thus, the fragment constrains the location of the breakpoint (a, b) to lie in a small region between x, y. The subject method uses a unique algorithm for the alignment which works by combining the above objectives into a single objective function, and uses a dynamic programming approach to identify the optimal trimming.

[0077] Local alignments encompassed by other alignments are removed (e.g., 4 in Fig. 9). The remaining alignments are sorted by their location on the fragment, so that alignment i starts before alignment j if and only if i < j. Let $b_s(i)$ and $b_e(i)$ denote the fragment breaks before the beginning and after the end of alignment i.
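The filtering and ordering step can be sketched as follows. This is a minimal sketch that assumes containment is judged on fragment coordinates (half-open intervals); the `Alignment` record and its field names are hypothetical and do not correspond to BLASR output fields.

```python
from typing import NamedTuple

class Alignment(NamedTuple):
    frag_start: int   # first fragment position covered (inclusive)
    frag_end: int     # end of the covered fragment interval (exclusive)
    score: float

def drop_contained(alignments):
    """Remove alignments whose fragment interval lies inside another alignment's
    interval, then sort the survivors by their location on the fragment."""
    kept = []
    for i, a in enumerate(alignments):
        contained = any(
            j != i
            and b.frag_start <= a.frag_start and a.frag_end <= b.frag_end
            # For identical intervals keep only the earliest one in the input.
            and ((b.frag_start, b.frag_end) != (a.frag_start, a.frag_end) or j < i)
            for j, b in enumerate(alignments)
        )
        if not contained:
            kept.append(a)
    return sorted(kept, key=lambda a: (a.frag_start, a.frag_end))

alns = [Alignment(80, 200, 100.0), Alignment(0, 100, 90.0),
        Alignment(120, 150, 20.0), Alignment(190, 300, 95.0)]
kept = drop_contained(alns)
print([(a.frag_start, a.frag_end) for a in kept])
```

After this step no alignment is nested in another, so sorting by start position also orders the alignments by end position, which is what lets alignment i be said to start before alignment j exactly when i < j.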

[0078] Alignments can be represented on a grid with alignments as rows and fragment positions as columns (Fig. 9). An alignment is a series of breaks on the fragment (i.e. (1 , b ) to (1 , bs) in Fig. 9). Alignments are chained together to cover a portion of F exactly once. To chain adjacent alignments, for each alignment j with an alignment i that terminates before j starts, add a jump from $(i, b_e(i))$ to $(j, b_s(j))$ (for instance $(1, b_e(1))$ to $(3, b_s(3))$). Also, for each alignment j overlapping an earlier alignment i on the fragment, add a jump from $(i, b_s(j))$ to $(j, b_e(i))$ (for instance $(2, b_s(3))$ to $(3, b_e(2))$) if i spans $b_s(j)$ and j spans $b_e(i)$. By this process, any alignment chain covers positions exactly once.

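[0079] [Equation: piecewise definition of the weight $w[(i,u),(j,v)]$, with a case for alignments i, j that overlap from u to v and an "otherwise" case.]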

[0080] An alignment chain is scored by summing local alignment scores (Aln[i, u, v] for alignment i for fragment coordinates u to v) and penalizing for jumps between alignments (J(u, v) for a jump from u to v). A high scoring alignment chain corresponds to trimmed alignments that align well and cover most of the fragment. The score of a chain is computed using dynamic programming. Let S(j, v) denote the score of the best chain ending at (j, v). Then,

$$S(j, v) = \max_{(i,u)} \{ S(i, u) + w[(i,u),(j,v)] \}$$

[0081] In the recursion, $(i, u)$ is the start of alignment j, the start of a jump to $(j, v)$ (i.e. if $(j, v) = (3, b_e(2))$ then $(i, u)$ could be $(2, b_s(3))$), or the previous position on alignment j where a jump ends (i.e. if $(j, v) = (2, b_e(2))$ then $(i, u) = (2, b_e(1))$). By not computing the score for each alignment and fragment position on the grid, the optimal trimmed alignment chain is quickly found.

[0082] Along the maximum scoring chain, each jump, $(F'_a, G_a)$, $(F'_b, G_b)$, represents a breakpoint estimate $(a, b, F'_b - F'_a)$. For example, the jump from 1 to 3 corresponds with breakpoint estimate $(x_1, y_2, 6)$.
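The recursion for S(j, v) is a best-path computation over grid nodes (alignment, break). The following is a simplified sketch, not the inventors' implementation: the caller supplies the nodes in an order in which every edge points forward and supplies the edge weights w directly (alignment scores along a row, negative penalties for jumps), and a chain is assumed to be allowed to start at any node with score zero.

```python
def best_chain(nodes, edges):
    """Best-scoring chain over grid nodes.

    nodes -- list of (alignment, position) pairs, ordered so every edge goes forward
    edges -- dict mapping (source, target) to the weight w[(i, u), (j, v)]
    Returns (best score, chain of nodes from first to last).
    """
    incoming = {}
    for (src, dst), weight in edges.items():
        incoming.setdefault(dst, []).append((src, weight))
    score, back = {}, {}
    for node in nodes:
        score[node], back[node] = 0.0, None   # assumed base case: start a new chain here
        for src, weight in incoming.get(node, []):
            if score[src] + weight > score[node]:
                score[node], back[node] = score[src] + weight, src
    end = max(nodes, key=score.get)
    chain = [end]
    while back[chain[-1]] is not None:         # trace the chain back to its start
        chain.append(back[chain[-1]])
    return score[end], chain[::-1]

# Toy example: alignment 1 covers fragment 0..100, alignment 2 covers 110..200.
nodes = [(1, 0), (1, 100), (2, 110), (2, 200)]
edges = {((1, 0), (1, 100)): 90.0,      # local alignment score of alignment 1
         ((1, 100), (2, 110)): -10.0,   # jump penalty between alignments
         ((2, 110), (2, 200)): 80.0}    # local alignment score of alignment 2
best_score, chain = best_chain(nodes, edges)
print(best_score, chain)
```

Only breaks at alignment boundaries appear as nodes, which reflects the point made above that scores are not computed for every alignment and fragment position on the grid.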

[0083] In this formulation, two alignments that overlap may contribute to a high score since the overlap segment is scored as the average of both alignment scores. For a breakpoint estimate from overlapping alignments, boundaries around the overlap are used and do not resolve a tighter breakpoint within the overlap segment. Finding a tighter breakpoint estimate would require computing S for all breaks within overlap intervals, which is inefficient for thousands of fragments. In any case, the conservative breakpoint estimates are improved with downstream clustering and refinement steps.

[0084] Fragment Clustering

[0085] Fragment clustering uses the information from the alignment trimming for multiple fragments to further narrow the breakpoint location. Consider a two dimensional representation of the genomic space with F and R being the horizontal and vertical axes, respectively. In this representation, a true breakpoint (a, b) is represented by a point, and each split-mapped read (x, y, L) is represented by a triangle of possible breakpoints (a, b) that satisfy (a - x) + (y - b) < L (Fig. 9b). Multiple reads supporting the same breakpoint represent multiple triangles whose intersection reduces the uncertainty in breakpoint determination. Furthermore, if reads from multiple AmBre-amplify experiments are combined, the split-mapped reads will cluster according to overlap, revealing breakpoints for each experiment sample. The subject method uses a fast, customized method to recover the aggregated read clusters for each breakpoint. For example, the method took 2.5 min seconds on a single Desktop core to analyze all local alignments from 52,000 reads from a single PacBio™ SMRT cell experiment.
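The triangle constraint and its intersection across reads can be written in closed form. In the sketch below, each read is a tuple (x, y, L); the strict inequality of the text is relaxed to "less than or equal" as an assumption so that the estimate (x, y) itself is an admissible breakpoint, and the function names are illustrative.

```python
def in_triangle(a, b, x, y, L):
    """True if candidate breakpoint (a, b) is consistent with split-mapped read (x, y, L)."""
    return a >= x and b <= y and (a - x) + (y - b) <= L

def common_region(reads):
    """Intersect the triangles of several reads (x, y, L).

    Returns (a_min, b_max, max_a_minus_b), describing the shared region
    a >= a_min, b <= b_max, a - b <= max_a_minus_b, or None if it is empty.
    """
    a_min = max(x for x, _, _ in reads)
    b_max = min(y for _, y, _ in reads)
    # (a - x) + (y - b) <= L  is equivalent to  a - b <= L + x - y.
    max_a_minus_b = min(L + x - y for x, y, L in reads)
    return (a_min, b_max, max_a_minus_b) if a_min - b_max <= max_a_minus_b else None

supporting = [(100, 5000, 50), (120, 5010, 40)]
region = common_region(supporting)
unrelated = common_region([(100, 5000, 10), (200, 5000, 10)])
print(region, unrelated)
```

Each additional supporting read can only raise a_min, lower b_max, or tighten the bound on a - b, which is how multiple reads reduce the uncertainty in the breakpoint location.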

[0086] Consensus Sequence Determination

[0087] Predicted amplicon sequences are generated from the breakpoint estimates. In turn, these templates are supplied as reference sequences for further analysis. For example, the PacBio™ SMRT Analysis resequencing protocol calls consensus amplicon sequences by correcting the predicted templates.

[0088] Breakpoint clustering

[0089] Breakpoint estimates from all fragments supporting the same breakpoint are aggregated into groups using a sweep line algorithm. For a breakpoint estimate (x, y, L), the true breakpoint junctions (a, b) in reference G lie between x ... x + L and y - L ... y, respectively, subject to (a - x) + (y - b) < L. Here, we assume L, a spacing length on F, is a reasonable estimate for breakpoint uncertainty on G and the effect of sequencing deletion errors at the breakpoint junction is minimal. On a G × G plane, each breakpoint estimate x, y, and L with the above constraints defines a triangle which contains the true breakpoint (a, b) (Fig. 9 and Fig. 5).

[0090] A line sweeps the plane and tracks when breakpoint triangles overlap along the sweep line. Here, a cluster is a collection of triangles where each triangle overlaps one or more triangles in the cluster. The consensus breakpoint (a, b) for the cluster is the mode of (x, y) estimates (see Fig. 5).
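The clustering step can be sketched with a sweep along the first coordinate and a union-find structure. This is an illustrative sketch rather than the customized method of the disclosure: the active-set handling is deliberately simple, triangles are treated as closed, and the function names are hypothetical.

```python
from collections import Counter

def overlaps(t1, t2):
    """Two triangles (x, y, L) share a point iff their combined constraints are satisfiable."""
    (x1, y1, l1), (x2, y2, l2) = t1, t2
    return max(x1, x2) - min(y1, y2) <= min(l1 + x1 - y1, l2 + x2 - y2)

def cluster_estimates(estimates):
    """Group breakpoint estimates (x, y, L) whose triangles overlap, sweeping along x."""
    parent = list(range(len(estimates)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    order = sorted(range(len(estimates)), key=lambda i: estimates[i][0])
    active = []                               # triangles still crossing the sweep line
    for i in order:
        x = estimates[i][0]
        active = [j for j in active if estimates[j][0] + estimates[j][2] >= x]
        for j in active:
            if overlaps(estimates[i], estimates[j]):
                parent[find(i)] = find(j)
        active.append(i)

    clusters = {}
    for i in range(len(estimates)):
        clusters.setdefault(find(i), []).append(estimates[i])
    # Consensus breakpoint: the most common (x, y) pair in each cluster.
    return [(Counter((x, y) for x, y, _ in members).most_common(1)[0][0], members)
            for members in clusters.values()]

estimates = [(100, 5000, 30), (100, 5000, 25), (110, 5005, 40), (9000, 20000, 10)]
result = sorted(cluster_estimates(estimates))
print([(consensus, len(members)) for consensus, members in result])
```

Union-find matches the definition of a cluster given above: a triangle joins a cluster as soon as it overlaps any one member, so clusters are the connected components of the overlap relation.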

[0091] Accounting for reverse orientation alignments

[0092] With a slight modification, alignments in the reverse complement orientation can be accounted for to capture structural variations with inversions and bidirectional sequencing reads. For example, PacBio™ reads DNA amplicons in both directions; in particular, a read in the forward direction produces an alignment chain $(F_x, G_x)$, $(F_y, G_y)$ and a read in the reverse direction produces $(H_y, RC(G_y))$, $(H_x, RC(G_x))$, where RC reverse complements the sequence G. This is resolved by relabeling reverse complement alignments with a minus sign, such that H supports the breakpoint (-y, -x).

[0093] The relabeling applies naturally to the sequence analysis pipeline. Alignment trimming relies only on projections on sequenced fragments and therefore does not change. Each DNA amplicon containing a breakpoint is associated with two breakpoint estimates, (x, y) generated from forward reading and (-y, -x) from reverse reading.

[0094] In addition, the constraints of -y, -x, L in relation to -a, -b remain the same, therefore both forward and reverse direction breakpoint estimates have the same triangle orientation on the G × G plane. All forward and reverse breakpoints are simultaneously recovered with the sweep line algorithm.
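The invariance claimed for the relabeling can be verified numerically. The sketch below uses hypothetical helper names; it maps an estimate (x, y, L) to its reverse-read label (-y, -x, L) and checks that the quantity (a - x) + (y - b) is unchanged when the true breakpoint (a, b) is relabeled in the same way to (-b, -a).

```python
def relabel_reverse(estimate):
    """Map an estimate (x, y, L) to the label (-y, -x, L) used for reverse-direction reads."""
    x, y, L = estimate
    return (-y, -x, L)

def gap_used(breakpoint, estimate):
    """(a - x) + (y - b): the part of the gap L consumed by breakpoint (a, b)."""
    (a, b), (x, y, _) = breakpoint, estimate
    return (a - x) + (y - b)

forward_estimate = (100, 5000, 30)
true_breakpoint = (110, 4990)

reverse_estimate = relabel_reverse(forward_estimate)
reverse_breakpoint = (-true_breakpoint[1], -true_breakpoint[0])

print(gap_used(true_breakpoint, forward_estimate),
      gap_used(reverse_breakpoint, reverse_estimate))
```

Because the constraint has the same form in both labelings, forward and reverse estimates can be passed to the same clustering routine without special cases.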

[0095] Using reverse complement alignments, breakpoints associated with inversions, like that in A549, are captured. In this case, a breakpoint corresponds with (-x, y) and (-y, x) or (x, -y) and (y, -x).

[0096] Breakpoint reconstruction

[0097] In the final step, predicted amplicon templates for each cluster are created by joining reference sequence G(a - 6500, a) and G(b, b + 6500). A reference sequence reconstruction can be performed using, for example, an automated system such as the PacBio™ SMRT Analysis 1.4 pipeline for resequencing to refine the amplicon template predictions using all fragments generated from the SMRT cell. The resequencing protocol involves running BLASR for mapping followed by Quiver for consensus sequence calling. The protocol accurately recovers the sequence around breakpoints; for example, a consensus amplicon sequence starting at aligned a - 25 and ending at b + 25 matched either sequencing from previous studies or an independent Sanger sequencing chromatogram (Fig. 6). For clusters with L > 0, adding L "N" nucleotides at the breakpoint junction of the predicted amplicon template had no effect on the PacBio™ resequencing protocol. In both cases, the correct amplicon breakpoint junction sequence was found.
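Template construction amounts to joining two reference slices. The following is a minimal sketch that assumes 0-based, half-open string coordinates and a reference held in memory as a Python string; the function name and the `gap_length` parameter (the optional run of "N" nucleotides at the junction) are illustrative.

```python
def amplicon_template(reference, a, b, flank=6500, gap_length=0):
    """Join reference[a - flank:a] and reference[b:b + flank], optionally with N padding."""
    left = reference[max(a - flank, 0):a]     # clamp at the start of the reference
    right = reference[b:b + flank]            # slicing clamps at the end automatically
    return left + "N" * gap_length + right

reference = "AAAACCCCGGGGTTTT"
template = amplicon_template(reference, a=4, b=12, flank=4)
padded = amplicon_template(reference, a=4, b=12, flank=4, gap_length=2)
print(template, padded)
```

The default flank of 6500 matches the parameter d used in primer design, so a template spans at most the maximum amplicon size of about 13 Kbp.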

[0098] The subject method may be used to identify variants which are unique to individual subjects, and can be applied to detect variants in mixtures at ratios to wild-type of up to 1:100, 1:500, 1:1000, 1:2500, 1:5000, 1:7500 or 1:10,000. These unique variants serve as personalized biomarkers, where a quantitative PCR-based assay could accurately measure the subject's tumor burden. Once a subject's unique variants have been identified, the variants can then be used in prognosing, determining a treatment regimen, determining response to treatment or detecting recurrence of disease by detecting the variant polynucleotide.

[0099] In an additional aspect, the method comprises a further step of prognosing, determining progression of cancer, predicting a therapeutic regimen or predicting benefit from therapy in a subject having a disease based on the detection of a variant. In one aspect, the disease is cancer. In a further aspect, the cancer is selected from the group consisting of a carcinoma, sarcoma, leukemia, lymphoma, myeloma, and a CNS tumor. In some embodiments, the cancer is selected from the group consisting of skin cancer, lung cancer, colon cancer, pancreatic cancer, prostate cancer, liver cancer, thyroid cancer, ovarian cancer, uterine cancer, breast cancer, cervical cancer, kidney cancer, epithelial carcinoma, squamous cell carcinoma, basal cell carcinoma, melanoma, papilloma, and adenomas.

[0100] In a further embodiment, the invention provides a kit for detecting a variant polynucleotide having a nucleotide sequence differing from the wildtype nucleotide sequence of a nucleic acid molecule, wherein the variant polynucleotide is in a sample containing up to about 99.9% of the wildtype nucleic acid molecules, the kit comprising a) a multiplicity of primers; and b) algorithms to detect the variant. In one aspect, the variant is a mutation, deletion, insertion, substitution or genomic rearrangement.

[0101] In a further aspect, the multiplicity of primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately 1-20 kb around the locus of interest and the innermost primers are separated by approximately at least 15-100 kb. In an additional aspect, the multiplicity of primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 5-15 kb around the locus of interest and the innermost primers are separated by approximately at least 20-80 kb. In a preferred aspect, the primers are designed to hybridize to loci on the wildtype nucleic acid molecule evenly spaced approximately at least 6 kb around the locus of interest and the innermost primers are separated by approximately at least 20 kb. In a further aspect, the ratio of primer spacing to locus of interest is 6 kb to 80 kb.

[0102] In a further embodiment, the present invention provides a system for analyzing a variant polynucleotide of unknown nucleotide sequence believed to differ from the wildtype nucleotide sequence of a nucleic acid molecule of interest in a subject, comprising: a) a multiplicity of primers computationally designed to detect one or more breakpoints in the wildtype nucleic acid sequence of the subject; b) a computer-executable algorithm for detecting the variant in a sample from a subject having a cancer, the algorithm comprising computational design of a multiplicity of primers; c) use of the multiplicity of primers in multiplex PCR to amplify any variant polynucleotide present in the sample; d) analysis of the sequence of each multiplex PCR product; e) detection of the variant polynucleotide in the sample; and f) long range sequencing of the variant polynucleotide to confirm the differences in nucleotide sequence in the variant as compared to the sequence of the wild-type nucleic acid. In one aspect, the system further comprises a machine to perform the multiplex PCR. In another aspect, the system further comprises a machine to sequence the multiplex PCR products. In a further aspect, the system further comprises a machine to perform long range sequencing of a detected variant polynucleotide.

[0103] In a further embodiment, the invention provides a method for confirming a variant polynucleotide of unknown nucleotide sequence which differs from the wildtype nucleotide sequence of a nucleic acid molecule of interest, wherein the variant polynucleotide is in a sample containing up to about 99.9% of the wildtype nucleic acid molecules, the method comprising: a) sequencing the nucleotide of interest; and b) analyzing the sequence by i) sequence alignment to the wild-type nucleotide sequence, ii) alignment trimming, iii) clustering of breakpoints, and iv) confirming the variant sequence. In one aspect, the high-throughput sequence analysis is single molecule sequence analysis. In an additional aspect, multiple polynucleotide variants are detected simultaneously. In a further aspect, the high-throughput sequence analysis has a high error rate. In another aspect, the high-throughput sequence analysis error rate is comprised of up to about 20% insertion and deletion error rates and/or up to about 5% substitution error rates.

[0104] The following examples are intended to illustrate but not limit the invention.

EXAMPLES

EXAMPLE 1

CDKN2A Variant Identification

[0105] To test AmBre-design, cell-line copy number data were analyzed to identify a large cluster of deletions in the CDKN2A region. A 480Kbp region surrounding the CDKN2A gene was identified, spanning 230Kbp upstream and 250Kbp downstream of CDKN2A, that captures breakpoints in 55 of the 109 CDKN2A deletion cell lines considered. The value d = 6500 was chosen, as 13Kbp products can be reliably amplified with LongAmp Taq DNA polymerase (New England Biolabs, NEB).

[0106] AmBre exploits the fact that variable breakpoints aggregate along fragile regions of the chromosome by designing primers around the fragile regions. This was used to produce a single design for five cancer cell lines: A549, CEM, Detroit562, MCF7, and T98G. Breakpoints were estimated by copy number changes for four cancer cell lines (A549, CEM, MCF7, and T98G) from SNP-array data (Table 1), and the breakpoint was given for a fifth cell line (Detroit562) from prior studies. The error in breakpoint estimation for SNP-array data is roughly 10Kbp. Thus, to generate cluster target regions, each breakpoint estimate was expanded to be a 10Kbp interval and overlapping intervals were merged. This created four regions (F) upstream of CDKN2A and three downstream regions (R), and the target regions were used as input for AmBre-design (d = 6500bp). AmBre-design output a high quality 16 primer design (AMBRE-16) with primers spaced apart by approximately 6Kbp to cover the 100Kbp input region. The design was used by AmBre-amplify on DNA samples from each cell line. The experiment successfully amplified DNA from each cell line (S.M. 2), where each line produced a uniquely sized amplicon even though the same set of 16 primers was used for each reaction.

[0107] Primer generation and filtering

[0108] Primer3 2.3.0 was used with long-range PCR specific parameters to identify 31bp candidate AmBre primers that were capable of amplification under the same thermocycling conditions. To minimize the chance of off-target amplification, candidate primers were aligned to the reference human assembly (GRCh37) using Blat. Define an end-aligning match as an exact match of length > 18 between the 3' end of a primer and an off-target location. First, primers with greater than 10 end-alignments were removed as having a high chance for off-target amplification. Second, pairs of primers that have compatible end-alignments within a 2d long off-target region were marked as incompatible. Finally, each pair (including a self-pair) was tested for dimerization using Multi-Plx. Primers with self-dimerization (maximum binding energy ΔG less than -8.0 kcal/mol for any region) were removed, and pairs with high binding affinity (maximum binding energy ΔG less than -4.0 kcal/mol for primer-primer 3' end binding or -8.0 kcal/mol for any region of the primers) were marked as incompatible. The remaining candidate primers and incompatibilities formed the input to AmBre primer selection.
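
By way of non-limiting illustration, the filtering rules above may be sketched as follows. The thresholds are those stated in the text; the Candidate record, the filter_primers function, and the pair_dg and offtarget_pairs inputs are hypothetical names introduced only for this sketch, and the alignment counts and binding energies are assumed to have been computed beforehand by the alignment and dimerization tools.

```python
from dataclasses import dataclass
from itertools import combinations

MAX_END_ALIGNMENTS = 10   # more end-alignments than this: primer removed
SELF_DIMER_DG = -8.0      # kcal/mol, self-dimer threshold (any region)
PAIR_3END_DG = -4.0       # kcal/mol, primer-primer 3' end threshold
PAIR_ANY_DG = -8.0        # kcal/mol, primer-primer any-region threshold


@dataclass(frozen=True)
class Candidate:          # hypothetical record for one candidate primer
    name: str
    end_alignments: int   # number of off-target 3' end exact matches
    self_dg: float        # most negative self-dimer binding energy


def filter_primers(candidates, pair_dg, offtarget_pairs=frozenset()):
    """Return (kept primers, set of incompatible primer-name pairs).

    pair_dg maps frozenset({a, b}) to (dg_3end, dg_any); pairs missing from
    the map are treated as non-binding. offtarget_pairs holds pairs already
    flagged for compatible end-alignments within a 2d off-target region.
    """
    kept = [c for c in candidates
            if c.end_alignments <= MAX_END_ALIGNMENTS
            and c.self_dg >= SELF_DIMER_DG]
    names = {c.name for c in kept}
    # Keep only off-target flags whose two primers both survived filtering.
    incompatible = {pair for pair in offtarget_pairs if pair <= names}
    for a, b in combinations(sorted(names), 2):
        dg_3end, dg_any = pair_dg.get(frozenset((a, b)), (0.0, 0.0))
        if dg_3end < PAIR_3END_DG or dg_any < PAIR_ANY_DG:
            incompatible.add(frozenset((a, b)))
    return kept, incompatible
```

The kept primers and the incompatible pairs correspond to the two inputs that the text passes on to primer selection.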

[0109] The candidate primer generation and primer filtering stages resulted in 5181 candidate primers. As shown in Figure 3a, the candidate primers are uniformly spread across breakpoint regions, suggesting that good tiling primer designs may exist. The simulated annealing algorithm was repeated for 12 different rates of convergence, with the fastest convergence rate having a 10 minute average runtime and the slowest convergence rate having an 864 minute average runtime (Fig. 3b). When d = 6500, the lowest cost solution (AMBRE-68) requires only 68 primers with 99.99% in silico capture of simple CDKN2A deletions that may occur in the 480Kbp breakpoint region (Fig. 3c).

[0110] Primer selection with simulated annealing

[0111] A final AmBre primer design was selected from a filtered list of candidate primers (Pu) and primer-primer compatibilities. To compute an optimal primer design, that is, a design P of low cost according to C(P), a simulated annealing procedure was applied. An initial design P was computed using a random subset of 6 primers. Define the neighboring design of P, N(P), as either the removal of a single primer from P, or the addition of a single primer p ∈ Pu to P followed by removal of all primers p' ∈ P s.t. (p, p') ∈ E, where E is the set of incompatible primer pairs. The simulated annealing procedure described in Algorithm 1 was used to compute low cost designs.

Algorithm 1 Simulated Annealing Algorithm

procedure SimulatedAnnealing(N, C)
    P ← Random(Pu, 6)    ► Initialize random primer set P with size 6
    for t = T1, T2, T3, ... do    ► Iterate until design is stable
        i ← Random(Pu, 1)
        ► Move to neighboring design if it improves, or with probability proportional to extra cost and iteration
        if C(Ni(P)) < C(P) or Random[0, 1] < (acceptance probability) then
            P ← Ni(P)
        end if
    end for
    return P
end procedure

[0112] The temperature schedule, T1, T2, T3, ..., linearly decreases depending on intercept and slope parameters m and b. Parameters tested for were combinations of m = 1, 0.1, 0.01, 0.001 and b = 10^4, 10^5, 10^6. The maximum number of iterations run was determined by the temperature schedule, 2b + b/n, and constrained to be at least 10^6 and at most 10^5 iterations. Each parameter set was repeated 3 times. The lowest cost primer design of all runs was used as the final design. Figure 3 demonstrates convergence to design minima under different parameters of T for a target CDKN2A breakpoint region of length 480Kbp.
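
By way of non-limiting illustration, Algorithm 1 may be sketched in Python as follows. Several elements are assumptions made only for this sketch and are not stated in the text: the acceptance rule is taken to be the standard Metropolis criterion exp(-extra cost / t), the schedule is taken to be t = b - m·i, the neighbor move toggles one randomly chosen candidate primer, the best design seen is tracked and returned rather than the last one, and make_cost builds a hypothetical stand-in for C(P) (uncovered bases plus a per-primer penalty).

```python
import math
import random


def neighbor(design, p, incompatible):
    """Remove p if present; otherwise add p and evict primers that conflict."""
    if p in design:
        return design - {p}
    kept = {q for q in design if frozenset((p, q)) not in incompatible}
    return kept | {p}


def simulated_annealing(candidates, incompatible, cost, m=1.0, b=1e4, seed=0):
    """Sketch of Algorithm 1 with an assumed linear schedule t = b - m*i."""
    rng = random.Random(seed)
    design = set(rng.sample(candidates, min(6, len(candidates))))
    current_cost = cost(design)
    best, best_cost = set(design), current_cost
    for i in range(int(b / m)):
        t = b - m * i  # remains positive for every i in this range
        p = rng.choice(candidates)
        proposal = neighbor(design, p, incompatible)
        proposal_cost = cost(proposal)
        delta = proposal_cost - current_cost
        # Assumed Metropolis rule: always accept improvements, otherwise
        # accept with a probability that shrinks with extra cost and time.
        if delta < 0 or rng.random() < math.exp(-delta / t):
            design, current_cost = proposal, proposal_cost
            if current_cost < best_cost:
                best, best_cost = set(design), current_cost
    return best


def make_cost(region_len, d, primer_penalty=100.0):
    """Hypothetical stand-in for C(P); primers are positions on one axis."""
    def cost(design):
        pts = [0] + sorted(design) + [region_len]
        uncovered = sum(max(0, hi - lo - d) for lo, hi in zip(pts, pts[1:]))
        return uncovered + primer_penalty * len(design)
    return cost
```

Repeating the call over several values of m, b, and seed and keeping the lowest cost result mirrors the repeated runs described above.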

[0113] Amplification

[0114] PCR was performed using the following thermocycling conditions: initial denaturation at 95°C for 3 min; 10 cycles at 94°C for 20 sec, 64°C for 30 sec, 66°C for 15 min; 28 cycles at 94°C for 5 sec, 64°C for 30 sec, 66°C for 15 min + 20 sec for each cycle; final extension at 64°C for 45 min; and 4°C hold. The standard protocol for NEB Crimson™ LongAmp Taq was used for 50 µl PCR reactions with the following changes. The same mix of 16 primers was used in each reaction, where each primer is present at a final concentration of 0.2 µM. Starting genomic DNA for each cell line reaction is 10 ng. A QIAquick PCR purification kit was used to clean up PCR samples. Samples were quantified, and 2µg of A549 reaction sample was mixed with 1µg of each remaining cell line reaction sample and submitted for PacBio™ sequencing at the UCSD BioGen Core facility. Loading of DNA samples onto a PacBio™ SMRT cell is biased towards sequencing smaller amplicons, and increasing the amount of A549 reaction sample containing an 11kb DNA fragment was necessary to sufficiently sequence the A549 DNA fragment. For MCF7 and T98G PCR validation, primer sequences were generated using Primer3 2.3.0 given short genomic sequence around the MCF7 and T98G breakpoints as determined by PacBio sequencing and analysis. The standard protocol for NEB Standard Taq was used for 50 µl PCR reactions starting with 250ng of genomic DNA.

[0115] Copy number data for four cancer cell lines (A549, CEM, MCF7, and T98G) were investigated from SNP-array data (Table 3.3), and the breakpoint for a fifth cell line (Detroit562) was added. As the breakpoint estimates are not exact, each breakpoint estimate was expanded by 5kb to create four regions upstream of the CDKN2A gene (F) and three downstream regions (R). These were used as input for AmBre-design (d = 6500bp). AmBre-design output a high quality 16 primer design (AMBRE-16) with primers spaced apart by approximately 6kb to cover the 100kb input region. This design was used by AmBre-amplify on DNA samples from each cell line. Figure 4 shows the success of this experiment, with variably sized amplicons from each of the 5 samples.
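
By way of non-limiting illustration, the expansion and merging of breakpoint estimates described above may be sketched as follows. The function name merge_breakpoint_intervals and the pad parameter are hypothetical; the 5kb padding follows the text, and no coordinates from Table 1 are used.

```python
def merge_breakpoint_intervals(breakpoints, pad=5_000):
    """Expand each estimate to [bp - pad, bp + pad] and merge overlaps."""
    intervals = sorted((bp - pad, bp + pad) for bp in breakpoints)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous target region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, hypothetical estimates at positions 100000, 104000, and 130000 yield two target regions, because the first two padded intervals overlap.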

[0116] PCR products were mixed together for simultaneous preparation and sequencing on a single SMRT cell. The sequence data was the input to AmBre-analyze. The tool BLASR identified 52k alignable fragments. After clustering in AmBre-analyze, deep coverage of every breakpoint was retrieved (although with 6 clusters instead of 5; see below), with A549 having the lowest coverage of 300 fragments and CEM having the highest coverage of 15,000 fragments (Fig. 5b). The difference in coverage is due to different amplicon sizes, where shorter amplicons are easier to load onto a PacBio™ SMRT cell than longer amplicons. Later generation PacBio™ instrumentation normalizes for this sequencing bias.

[0117] AmBre-analyze generated consensus sequences for the A549, CEM, and Detroit562 breakpoints, which are concordant with previous studies (Fig. 6). The A549 cell line harbors a complex structural variation where, in addition to a large DNA segmental loss including CDKN2A, there is a 325bp internal inversion occurring at the deletion breakpoint junction. AmBre-analyze resolved the complex event as two separate breakpoints. The A549 amplicon template was created by ordering the reference segments corresponding to the two breakpoints. The results are found in Table 1.

[0118] Table 1: Five cell lines with CDKN2A deletion breakpoints in GRCH37. Estimated breakpoints are according to CGP. CGP coordinates were converted from GRCH36 to GRCH37 using UCSC liftover. The break coordinates for Detroit562 were identical to those previously described, and the cell line was not examined by CGP.

[0119] The nucleotide sequence for MCF7 and T98G had not been previously characterized in spite of previous efforts, including whole genome sequencing of the MCF-7 cell-line. The ease of the discovery in the AmBre assay attests to the value of a targeted approach to SV detection. Both MCF-7 and T98G sequences were confirmed using Sanger sequencing. The SNP-array estimate for MCF7 breakpoint is 15kb away from the AmBre detected breakpoint, a significant distance. Previous genome sequencing studies of MCF7 have not annotated the CDKN2A deletion breakpoints.

[0120] The physical properties of DNA around the breakpoints of CDKN2A deletions were analyzed using the BreakSeq™ pipeline. All five deletion events were predicted to result from non-homologous end joining (NHEJ). A characteristic of NHEJ is lower DNA duplex stability near the breakpoints of a structural variation. DNA duplex stability was assessed based on predictions of helix stability (average dissociation free energy of overlapping dinucleotides) and DNA flexibility (average twist angle of overlapping dinucleotides). No strong association with lower DNA duplex stability was found at CDKN2A deletion breakpoints, although far fewer structural variations were analyzed here (Figure 11). Alternatively, it has been suggested that the CDKN2A deletion in CEM is due to illegitimate V(D)J recombination, which is evidenced by V(D)J recombination motifs discovered near the deletion breakpoints.

EXAMPLE 2

Identifying CDKN2A deletion assuming no DNA break clustering

[0121] AmBre can be applied to contiguous break regions. A 68 primer design was developed to capture CDKN2A deletions with breaks in a 480kb region (AMBRE-68, see also Figure 3). The standard protocol for NEB Crimson LongAmp Taq was used for 50 µl PCR reactions with the following changes. The same mix of 9 primers was used in each reaction, where each primer is present at a final concentration of 0.4 µM. Starting genomic DNA for each cell line reaction is 20ng.

[0122] In AmBre-amplify experimentation, it was observed that a high degree of multiplexing and larger amplicon lengths (> 4kb) reduce amplification efficiency. Using all AMBRE-68 primers in a single reaction resulted in amplification of only the 2.2kb A549 CDKN2A deletion (data not shown). To mitigate this effect, primers were sub-sampled from a design and multiple reactions were performed per sample using different primer sets, which improved amplification results. To test whether the AMBRE-68 primers selected were viable at some level of subsampling, we sampled the nearest forward and reverse primer in AMBRE-68 to each CDKN2A break in six cell lines: A549, CEM, Detroit562, MCF7, MOLT4, and T98G. This resulted in a 9-primer subset, which again captures the CDKN2A deletion in each cell line. Of these cell lines, 5 produced amplicons ranging in length from 2.2Kbp to 7.5Kbp (Fig. 7). Detroit562 did not amplify, as the expected amplicon size with AMBRE-68 primers is 16Kbp. For each remaining cell line, the observed amplicon length matched the spacing between the CDKN2A breakpoints and the nearest primers in the AMBRE-68 design. Thus, a universal primer design divided into multiple primer subset experiments can be used to identify SVs.
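
By way of non-limiting illustration, selecting the nearest forward and reverse primer for each known deletion, and predicting the resulting amplicon length, may be sketched as follows. The function names and all coordinates are hypothetical. The sketch assumes that forward primers lie at or upstream of the left break and reverse primers at or downstream of the right break, so that the expected amplicon length is the sum of the two primer-to-break distances.

```python
from bisect import bisect_left, bisect_right


def nearest_flanking_primers(forward, reverse, left_break, right_break):
    """Return (forward primer, reverse primer, expected amplicon length).

    Returns None when no primer flanks the deletion on one of its sides.
    """
    fwd, rev = sorted(forward), sorted(reverse)
    i = bisect_right(fwd, left_break)    # forward primers at or before break
    j = bisect_left(rev, right_break)    # reverse primers at or after break
    if i == 0 or j == len(rev):
        return None
    f, r = fwd[i - 1], rev[j]
    return f, r, (left_break - f) + (r - right_break)


def subsample(forward, reverse, deletions):
    """Union of nearest flanking primers over all deletions in the mapping."""
    chosen = set()
    for left_break, right_break in deletions.values():
        hit = nearest_flanking_primers(forward, reverse, left_break, right_break)
        if hit is not None:
            chosen.add(("F", hit[0]))
            chosen.add(("R", hit[1]))
    return chosen
```

Because neighboring deletions can share a nearest primer, the union may contain fewer than two primers per cell line, which is consistent with an odd-sized subset such as the 9 primers reported above.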

EXAMPLE 3

Dealing with tumor heterogeneity

[0123] The AmBre assay, unlike other methods, can target DNA with the relevant deletion in the context of high background of germline DNA. This feature is important for sensitive detection of tumor DNA and establishing a patient specific tumor DNA marker for monitoring tumor burden.

[0124] To demonstrate the principle, a 2.2kbp CDKN2A deletion sequence was successfully amplified from A549 and a 3.6kbp deletion sequence from MCF7, starting with A549 and MCF7 genomic DNA mixed with HEK genomic DNA (Fig. 8). Each reaction starts with a heterogeneous mixture of approximately 400ng with tumor to wild-type gDNA mixture ratios of 1:1, 1:10, 1:100, and 1:1000. In a realistic application for AmBre, each reaction contains numerous primers where only 2 primers are responsible for amplification. In the experiment, each reaction contains 16 primers sampled from AMBRE-68 around CDKN2A deletion breakpoints for each cell line. For the Tumor:Wild-type genomic DNA heterogeneity experiment, the standard protocol for NEB Crimson LongAmp Taq is used for 50 µl PCR reactions with the following changes. Each primer has a final concentration of 0.4 µM. Each reaction contains ~400ng gDNA, with the following tumor to normal DNA ratios: 200ng : 200ng, 40ng : 400ng, 4ng : 400ng, 0.4ng : 400ng. Normal DNA is derived from HEK cells. In the heterogeneity experiment of A549, strong amplification is observed for each mixture ratio, whereas for MCF7 there is clearly a reduction of amplification efficiency as the fraction of starting cancer cell line gDNA decreases (Fig. 8). Amplification of longer amplicons with AmBre in the complex gDNA sample is also possible.
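
By way of non-limiting illustration, the tumor and wild-type fractions implied by the stated DNA masses may be computed as follows. The function name tumor_fraction is hypothetical, and the calculation assumes that mass ratios approximate genome-copy ratios.

```python
# Tumor : normal gDNA masses (ng) for the four mixtures listed in the text.
mixtures_ng = [(200, 200), (40, 400), (4, 400), (0.4, 400)]


def tumor_fraction(tumor_ng, normal_ng):
    """Fraction of total gDNA mass contributed by the tumor cell line."""
    return tumor_ng / (tumor_ng + normal_ng)


for tumor_ng, normal_ng in mixtures_ng:
    frac = tumor_fraction(tumor_ng, normal_ng)
    print(f"{tumor_ng} ng : {normal_ng} ng  tumor {frac:.3%}  wild-type {1 - frac:.3%}")
```

Under this assumption, the most dilute mixture corresponds to a sample that is roughly 99.9% wild-type DNA.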

EXAMPLE 4

General Experimental Methods

[0125] A549, CEM, Detroit562, and T98G cells were thawed from Moore's Cancer Biorepository. MCF7, HeLa, and HEK (293T) cells were collected. Standard DNAzol protocol was used for DNA extraction and DNA was quantified with NanoDrop 2000 spectrophotometer. DNA products are visualized on 1% agarose gels with EtBr. Gel images are either color value inverted or color curve adjusted uniformly across the image for visual enhancement. All PCRs were performed on a BioRad iCycler™ instrument.

EXAMPLE 5

Identification of Complex Rearrangements

[0126] AmBre also captures more complex rearrangements like inter-chromosomal translocations. This was demonstrated with an experiment characterizing the RUNX1-RUNX1T1 gene fusion, the result of a translocation between chr21 and chr8. In the tumor genome, breakpoint ends lie within a 30Kbp region (chr21:36205000-36235000) in the RUNX1 intron and a 55Kbp region (chr8:93030000-93085000) in RUNX1T1, and the derivative chromosome 8 (Der8) encodes a fusion oncoprotein. In some cases, the translocation is balanced and also generates a fusion of RUNX1T1-RUNX1 on a derivative chromosome 21 (Der21). To capture the translocation producing Der8, AmBre was used to design 10 reverse primers in the RUNX1 region and 18 forward primers in the RUNX1T1 region with ~3Kbp primer spacing.

[0127] Similarly, to capture Der21 breakpoints, 10 forward and 19 reverse primers were designed in the RUNX1 and RUNX1T1 regions, respectively. A ~3Kbp primer spacing supposes the maximum product size is approximately 6Kbp. The primer designs were tested on Kasumi-1, which carries the balanced translocation with both Der8 and Der21 breakpoints characterized. AmBre spaced the primers in the two regions unaware of the true Kasumi-1 breakpoints, and we assayed the Der8 and Der21 chromosomes in two independent reactions using the respective 28 and 29 primers. The primers closest to the breakpoints produce a 3.5Kbp and a 2.7Kbp amplicon from Der8 and Der21, respectively (Fig. 10). Both reactions resulted in a strong signal and virtually no background noise, despite there being close to 30 primers in each reaction.

[0128] Subsampling of primers and its efficacy in generating longer amplicons were investigated. For each primer design, forward and reverse primers were divided based on index parity when sorted by chromosome position. Thus, there are four primer sets: forward odd (FO), forward even (FE), reverse odd (RO), and reverse even (RE), with primers spaced by approximately 6Kbp. The forward and reverse primer sets make four combinations for capturing target breakpoints: FO ∪ RO, FO ∪ RE, FE ∪ RO, and FE ∪ RE. These combinations can be treated as four new primer designs, each with a maximum product size of 12Kbp, but half as many primers. Thus, amplification efficiency may be assessed across different amplicon lengths and primer density per reaction using the same DNA template.
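
By way of non-limiting illustration, the division of a design by index parity may be sketched as follows. The function names and the primer positions used with them are hypothetical; indices are counted from one after sorting by position, so the first primer is "odd".

```python
from itertools import product


def split_by_parity(positions):
    """Sort primers by position and return (odd-indexed, even-indexed) lists."""
    ordered = sorted(positions)
    return ordered[0::2], ordered[1::2]


def parity_designs(forward, reverse):
    """Return the four reduced designs keyed by (forward set, reverse set)."""
    fo, fe = split_by_parity(forward)
    ro, re_ = split_by_parity(reverse)
    forward_sets = [("FO", fo), ("FE", fe)]
    reverse_sets = [("RO", ro), ("RE", re_)]
    return {(fname, rname): (fset, rset)
            for (fname, fset), (rname, rset) in product(forward_sets, reverse_sets)}
```

With 18 forward and 10 reverse primers at a uniform 3Kbp spacing, each of the four designs contains 14 primers at a 6Kbp spacing; the sixth forward primer falls in FE and the ninth reverse primer falls in RO.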

[0129] For example, in the original 28 primer design, the Kasumi-1 breakpoints for Der8 were amplified by the sixth forward and ninth reverse primer. Thus, the 14 primer designs FE ∪ RO, FO ∪ RO, and FO ∪ RE produce 3.5Kbp, 6.8Kbp, and 10.1Kbp amplicons, respectively (Fig. 10).

[0130] Similarly, the 29 primer design for Der21 was subsampled into three reactions. Each reaction resulted in a strong signal band at the expected amplicon size, and all six amplicons were confirmed to span the Der8 and Der21 breakpoints via Sanger sequencing (S.M. 11). From each reaction, a general trend of better amplification for shorter amplicon lengths is observed. However, there was no significant difference in amplification efficiency between using all primers and half the primers to generate the shortest amplicons. Longer amplicons had a strong signal, but weaker false products were visible. This effect is not seen with the shorter amplicons, and false products may be more prevalent in reactions with a greater number of primers and longer amplicons.

EXAMPLE 6

Detection of a Variant Polynucleotide in Human Tumors

[0131] AmBre was used to identify CDKN2A deletion breakpoints in primary tumors. Namely, tumor tissue from three pancreatic cancer patients was expanded in xenograft mouse models. DNA was collected, and SNP-array analysis indicated a deletion in the CDKN2A locus. AmBre designed primers for each of these DNA samples and amplified DNA harboring the CDKN2A deletions. After simultaneous Pacific Biosciences RS™ sequencing of these products and sequencing data analysis, 1.67Mbp, 308Kbp, and 821Kbp deletions were confirmed in the pancreatic cancer samples.