


Title:
LIBRARY PREPARATION FOR WHOLE GENOME SEQUENCING
Document Type and Number:
WIPO Patent Application WO/2019/070598
Kind Code:
A1
Abstract:
Methods for making libraries for whole genome sequencing are disclosed herein. The methods are particularly suitable for making libraries from biological samples with damaged DNA, such as, for example, formalin-fixed paraffin embedded tissues. The methods make high quality libraries that can be whole genome sequenced for the tissues with damaged DNA.

Inventors:
HARTWIG ANNA (US)
SO AUSTIN (US)
Application Number:
PCT/US2018/053784
Publication Date:
April 11, 2019
Filing Date:
October 01, 2018
Assignee:
TOMA BIOSCIENCES INC (US)
International Classes:
C12Q1/68; C12N15/09; C12N15/10; C12P19/34; C12Q1/6806; C12Q1/6883; C12Q1/6886
Domestic Patent References:
WO2017139492A1, 2017-08-17
Foreign References:
US20160265027A1, 2016-09-15
Other References:
HYKIN, SM ET AL.: "Fixing Formalin: A Method to Recover Genomic-Scale DNA Sequence Data from Formalin-Fixed Museum Specimens Using High-Throughput Sequencing", PLOS ONE, vol. 10, no. 10, 27 October 2015 (2015-10-27), pages 1 - 16, XP055588252, ISSN: 1932-6203, DOI: 10.1371/journal.pone.0141579
DO, H ET AL.: "Dramatic reduction of sequence artefacts from DNA isolated from formalin-fixed cancer biopsies by treatment with uracil-DNA glycosylase", ONCOTARGET, vol. 3, no. 5, 24 May 2012 (2012-05-24), pages 546 - 558, XP055282871, DOI: 10.18632/oncotarget.503
Attorney, Agent or Firm:
KUMAMOTO, Andrew (US)
Claims:
CLAIMS

We claim:

1. A method for making a library of nucleic acids, comprising the steps of: obtaining a biological sample, wherein the biological sample is a formalin fixed paraffin embedded sample; obtaining a plurality of nucleic acids from the biological sample; treating the plurality of nucleic acids to remove a plurality of damaged nucleotides; treating the plurality of nucleic acids to add a phosphate on the 5' end of each of the plurality of nucleic acids; denaturing the plurality of nucleic acids to make a plurality of single stranded nucleic acids; attaching a 5'-adaptor to each of the plurality of single stranded nucleic acids to make a plurality of 5'-adaptor - single stranded nucleic acids, and attaching a 3'-adaptor to the plurality of 5'-adaptor - single stranded nucleic acids to make a plurality of 5'-adaptor - single stranded nucleic acid - 3'-adaptors.

2. The method of claim 1, further comprising the step of purifying the 5'-adaptor - single stranded nucleic acid - 3'-adaptor.

3. The method of claim 1, further comprising the step of purifying the 5'-adaptor - single stranded nucleic acid.

4. The method of claim 1, wherein the 5'-adaptor is attached to the single stranded nucleic acid using a ligase.

5. The method of claim 4, wherein the ligase is a single stranded ligase.

6. The method of claim 1, wherein each adaptor comprises a first sequence that is complementary to a sequencing primer, a second sequence that is complementary to a capture probe, and a bar code.

7. The method of claim 6, wherein the first sequence is at least 70% complementary to the sequencing primer.

8. The method of claim 6, wherein the second sequence is at least 70% complementary to the capture probe.

9. The method of claim 1, wherein the biological sample arises from a subject.

10. The method of claim 9, wherein the subject has a disease.

11. The method of claim 10, wherein the disease is a cancer.

12. A method for making a library of nucleic acids, comprising the steps of: obtaining a biological sample, wherein the biological sample has a plurality of damaged nucleotides; obtaining a nucleic acid from the biological sample; treating the nucleic acid to remove the damaged nucleotides; treating the nucleic acid to add a phosphate on the 5' end of the nucleic acid; denaturing the nucleic acid to make a single stranded nucleic acid; attaching a 5'-adaptor to the single stranded nucleic acid to make a 5'-adaptor - single stranded nucleic acid, and attaching a 3'-adaptor to the 5'-adaptor - single stranded nucleic acid to make a 5'-adaptor - single stranded nucleic acid - 3'-adaptor.

13. The method of claim 12, further comprising the step of purifying the 5'-adaptor - single stranded nucleic acid - 3'-adaptor.

14. The method of claim 12, further comprising the step of purifying the 5'-adaptor - single stranded nucleic acid.

15. The method of claim 12, wherein the 5'-adaptor is attached to the single stranded nucleic acid using a ligase.

16. The method of claim 15, wherein the ligase is a single stranded ligase.

17. The method of claim 12, wherein each adaptor comprises a first sequence that is complementary to a sequencing primer, a second sequence that is complementary to a capture probe, and a bar code.

18. A library comprising, a plurality of nucleic acids isolated from a biological sample wherein the biological sample is a formalin fixed paraffin embedded sample, a 5'-adaptor attached to the 5' end of each of the plurality of nucleic acids, a 3'-adaptor attached to the 3' end of each of the plurality of nucleic acids, wherein a plurality of damaged nucleotides have been removed from the plurality of nucleic acids, and wherein sequencing of the plurality of nucleic acids can provide a median coverage depth of at least 20 fold.

19. The library of claim 18, wherein the biological sample is obtained from a subject, wherein the subject has a disease.

20. The library of claim 19, wherein the disease is a cancer.

21. The library of claim 18, wherein sequencing of the plurality of nucleic acids can provide a coverage depth of at least 20 fold for at least 80% of a genome of the subject.

22. A method for identifying changes in a nucleic acid sequence, comprising the steps of: making a plurality of libraries of nucleic acids from a plurality of biological samples using the method of claim 1, wherein each library is from one of the biological samples, wherein the plurality of biological samples are obtained from a plurality of subjects, wherein each of the plurality of subjects has a disease that is the same, sequencing each of the plurality of libraries of nucleic acids to obtain a median depth of coverage of at least 20 for at least 80% of a genome of each subject; comparing the sequence information obtained for the plurality of libraries to a sequence for a genome of a healthy subject; identifying changes in nucleic acid sequence that are shared among a substantial number of the subjects with the disease.

23. A process for analyzing nucleic acids obtained from a formalin fixed paraffin embedded biological sample, comprising the steps of: obtaining a sample of the formalin fixed paraffin embedded biological sample; extracting DNA from the sample of formalin fixed paraffin embedded biological sample; treating the DNA to remove a plurality of damaged nucleotides; treating the DNA to add a phosphate on the 5' end of each fragment in the DNA; denaturing the DNA to make a plurality of single stranded DNA fragments; attaching a 5'-adaptor to each of the plurality of single stranded DNA fragments to make a plurality of 5'-adaptor - single stranded DNA fragments, and attaching a 3'-adaptor to the plurality of 5'-adaptor - single stranded DNA fragments to make a plurality of 5'-adaptor - single stranded DNA fragments - 3'-adaptors; sequencing the 5'-adaptor - single stranded DNA fragments - 3'-adaptors; and analyzing the sequencing data to identify a somatic change.

Description:
LIBRARY PREPARATION FOR WHOLE GENOME SEQUENCING

BACKGROUND OF THE INVENTION

[1] Cancer poses serious challenges for modern medicine. It has been estimated that cancer causes over 10% of all human deaths worldwide. Cancer encompasses a broad group of various diseases, generally involving unregulated cell growth. In cancer, cells can divide and grow uncontrollably, can form malignant tumors, and can invade nearby parts of the body. Cancer can also spread to more distant parts of the body, for example, via the lymphatic system or bloodstream. There are over 200 different known cancers that afflict humans. Many cancers are associated with mutations, for example, mutations in cancer-related genes. The mutational status of a cancer can vary widely from one individual subject to another, and even from one tumor cell to another tumor cell in the same subject. Knowledge of these mutations can aid in the selection of cancer therapy, and can also aid in informing disease prognosis and/or disease status.

[2] Next Generation Sequencing is increasingly used in translational cancer research and as a diagnostic test to identify actionable mutations in tumors of cancer patients. However, most tumor specimens are only available as formalin-fixed, paraffin-embedded (FFPE) blocks, whether from patient biopsies or a part of archival biobanks. FFPE-derived DNA is typically fragmented and has frayed ends, abasic positions, crosslinks, and modified bases. These features cause difficulties for standard library preparation methods: frayed ends prevent blunt ended double stranded adapter ligation, typically used for whole genome sequencing library preparation. In addition, fragmented DNA and modified bases interfere with the PCR process that underlies amplification based targeted sequencing methodology.

SUMMARY OF THE INVENTION

[3] In some aspects, libraries described herein are prepared from single stranded or double stranded nucleic acids. Single-stranded nucleic acids can be prepared from a sample of double-stranded nucleic acid using any means known in the art or described herein. Starting samples can be a biological sample obtained from a subject. The biological sample can be formalin-fixed paraffin-embedded (FFPE) tissues, serum, blood, urine, cerebrospinal fluid, other bodily fluids, tissue (e.g., organs), cells, swabs, etc. The nucleic acid can be obtained from the biological sample as RNA, DNA, or cDNA. The nucleic acid can be obtained from biological samples, for example, using commercially available kits (e.g., those sold by Qiagen or Covaris).

[4] Nucleic acids obtained from the biological sample can be fragmented to a desired size using, e.g., restriction enzymes, nuclease, sonication, shearing, other physical treatments that break the nucleic acids, or combinations of the foregoing. When the fragmented nucleic acids are obtained from sources with damaged DNA (e.g., FFPE samples), the fragmented nucleic acids can be treated to remove crosslinks and damaged DNA nucleotides from the nucleic acids. Damaged nucleotides can be removed using, for example, AP-endonuclease 1, Uracil DNA glycosylase (UDG), formamidopyrimidine [fapy]-DNA glycosylase (Fpg), bifunctional DNA glycosylase OGG1, other glycosylases, DNA polymerase β, X-ray repair cross-complementing group 1 (XRCC1), DNA ligase III, Poly(ADP-ribose) polymerase (PARP-1), Uvr proteins, Endo VIII, nucleotide excision repair enzymes (e.g., CETN2, DDB1, DDB2, ERCC1, ERCC2, ERCC3, ERCC4, ERCC5, ERCC6, ERCC8, LIG1, MNAT1, MMS19, RAD23A, RAD23B, RPA1, RPA2, TFIIH, XAB2, XPA, XPC). The damaged nucleotides can be removed using the above enzymes (or combinations of the enzymes) to remove the damaged base from the nucleic acid, and the sugar to which it was attached can also be removed resulting in a gap or break in the nucleic acid.

[5] The nucleic acid fragments can then be treated to add a phosphate to the 5' end of the fragments and optionally to remove the phosphate on the 3' end of the fragments. For example, the nucleic acid fragments can be treated with T4 polynucleotide kinase and ATP to add phosphate to the 5' end, and remove phosphates from the 3' end of the nucleic acids. In addition, the 3' end of the nucleic acids can optionally be blocked with appropriate blocking groups, e.g., dideoxynucleotides can be added to the 3' end, or reversible protection groups can be added to the 3' end to prevent ligation reactions at the 3' end of the DNA fragments or strands. Nucleic acid fragments with 5' phosphate and optionally dephosphorylated (and/or protected) 3' ends are ligated to 5'-adapters on the 5' end of the fragments. If desired, unligated 5'-adapters can be separated or removed from the fragment-5'-adapters by purification or other methods. The 5'-adapter-fragments are then ligated with 3'-adapters on the 3' end of the 5'-adapter-fragments. If a protective group has been placed on the 3' end of the fragments to prevent ligation reactions, this protective group must be removed prior to the ligation of the 3'-adapter. If desired, after ligation of the 3'-adapters, unligated 3'-adapters can be separated or removed from the 5'-adapter-fragment-3'-adapter by purification or other methods.
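The ordering constraints in the preceding paragraph can be illustrated with a small in silico model. This is only a sketch under stated assumptions: the `Fragment` class, the function names, and the placeholder adapter strings are hypothetical and do not appear in the disclosure, and the model tracks end states rather than simulating any enzyme chemistry.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical record of one single stranded fragment and its end states."""
    sequence: str
    five_prime_phosphate: bool = False
    three_prime_blocked: bool = False
    five_prime_adapter: str = ""
    three_prime_adapter: str = ""


def phosphorylate_5prime(frag: Fragment) -> None:
    # Models the kinase step that adds a phosphate to the 5' end.
    frag.five_prime_phosphate = True


def block_3prime(frag: Fragment) -> None:
    # Models adding a reversible protection group to the 3' end.
    frag.three_prime_blocked = True


def unblock_3prime(frag: Fragment) -> None:
    # Models removing the protection group before 3'-adapter ligation.
    frag.three_prime_blocked = False


def ligate_5prime_adapter(frag: Fragment, adapter: str) -> None:
    if not frag.five_prime_phosphate:
        raise ValueError("5' end has no phosphate; 5'-adapter cannot be ligated")
    frag.five_prime_adapter = adapter


def ligate_3prime_adapter(frag: Fragment, adapter: str) -> None:
    if not frag.five_prime_adapter:
        raise ValueError("5'-adapter must be attached before the 3'-adapter")
    if frag.three_prime_blocked:
        raise ValueError("3' end is blocked; remove the protective group first")
    frag.three_prime_adapter = adapter


def library_molecule(frag: Fragment) -> str:
    # 5'-adapter - fragment - 3'-adapter, written 5' to 3'.
    return frag.five_prime_adapter + frag.sequence + frag.three_prime_adapter


if __name__ == "__main__":
    frag = Fragment("ACGT")               # toy insert sequence
    phosphorylate_5prime(frag)
    block_3prime(frag)
    ligate_5prime_adapter(frag, "AAAA")   # placeholder adapter, not a real sequence
    unblock_3prime(frag)
    ligate_3prime_adapter(frag, "TTTT")   # placeholder adapter, not a real sequence
    print(library_molecule(frag))
```

Attempting the 3'-adapter ligation while the 3' end is still blocked raises an error in this model, mirroring the requirement that the protective group be removed first.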

[6] The library of 5'-adapter-fragment-3'-adapter nucleic acids can be directly sequenced, or the library can be subject to amplification. The amplification can be performed using primers specific for sequences in the adapters, or a target directed amplification can be done.

[7] Libraries made using the methods described herein can be made from genomic material or mRNA obtained from the biological sample that provide good coverage depth and high percentage coverage of the genome (or expressed genes). The libraries can provide a median coverage depth of at least 20, 25, 30, 35, 40, 50, 100, 500, 1000, 10,000 or 100,000 fold. The libraries can also provide 80%, 90%, 95%, 99% or 100% coverage of a genome, expressed genes, or target sequence with a coverage depth of at least 20, 25, 30, 35, 40, 50, 100, 500, or 1000 fold. The libraries can provide a sensitivity and/or precision in making sequence calls of 70%, 75%, 80%, 85%, 90%, 95%, 99%, 99.5% or 99.99%.
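The coverage and accuracy figures above can be computed from per-position read depths and variant-call counts. The following is a minimal sketch, assuming depths are supplied as a plain list of integers (one per reference position) and that sensitivity and precision carry their usual definitions, TP/(TP+FN) and TP/(TP+FP). The function names and the toy numbers are illustrative and are not taken from the disclosure.

```python
from statistics import median


def median_depth(depths):
    """Median per-position coverage depth."""
    return median(depths)


def breadth_at_depth(depths, min_depth=20):
    """Fraction of positions covered at min_depth fold or more."""
    if not depths:
        raise ValueError("no positions supplied")
    return sum(d >= min_depth for d in depths) / len(depths)


def sensitivity(true_pos, false_neg):
    """Fraction of real variants that were called."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos, false_pos):
    """Fraction of called variants that are real."""
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Toy depths for ten positions (made-up values for illustration only).
    depths = [35, 28, 22, 19, 41, 30, 25, 0, 33, 27]
    print(median_depth(depths))            # 27.5
    print(breadth_at_depth(depths, 20))    # 0.8
    print(sensitivity(90, 10))             # 0.9
    print(precision(90, 30))               # 0.75
```

With these toy values the library would meet a threshold of at least 20 fold median depth with at least 80% of positions covered at 20 fold or more.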

[8] Libraries made using the methods described herein can be used to detect known and new mutations, detect new alleles, diagnose and/or monitor disease, diagnose and/or monitor disorders, monitor and/or improve the treatment of subjects suffering from a disease or disorder, and/or perform retrospective studies for any of the preceding. The methods described herein can also be used to investigate and identify mutations, sequence changes, and/or variants that are associated with and have predictive value for diagnosis and treatment of diseases.

DETAILED DESCRIPTION OF THE INVENTION

[9] Before the various embodiments are described, it is to be understood that the teachings of this disclosure are not limited to the particular embodiments described, and as such can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present teachings will be limited only by the appended claims.

[10] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present teachings, some exemplary methods and materials are now described.

[11] As will be apparent to those of skill in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which can be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present teachings. Any recited method can be carried out in the order of events recited or in any other order which is logically possible.

Definitions

[12] As used herein, the term "adaptor-ligated" is defined as a nucleic acid that has been ligated to an adaptor. The adaptor can be ligated to a 5' end or a 3' end of a nucleic acid molecule, or can be added to an internal region of a nucleic acid molecule.

[13] As used herein, "amplification" of a nucleic acid sequence is defined as techniques for enzymatically increasing the number of copies of a nucleic acid. Amplification methods include both asymmetric methods (in which the predominant product is single-stranded) and symmetrical methods (in which the predominant product is double-stranded), such as, for example, PCR.

[14] As used herein, the terms "anneal," "hybridize," or "bind" are defined as two polynucleotide sequences, segments or strands, that have sufficient complementariness to each other to form a double stranded nucleic acid, and these terms can be used interchangeably. Two sequences with sufficient complementary bases (e.g., DNA and/or RNA) can anneal or hybridize by forming hydrogen bonds with complementary bases to produce a double-stranded polynucleotide or a double-stranded region of a polynucleotide.

[15] As used herein, the term "barcode sequence" is defined as a unique sequence of nucleotides that can encode information. A barcode sequence can encode information relating to the identity of an interrogated allele, identity of a target polynucleotide or genomic locus, identity of a sample, a subject, or any combination thereof. A barcode sequence can be a portion of a primer, a reporter probe, or both. A barcode sequence may be at the 5'-end or 3'-end of an oligonucleotide, or may be located in any region of the oligonucleotide. A barcode sequence generally is not part of a template sequence. Barcode sequences may vary widely in size and composition; the following references provide guidance for selecting sets of barcode sequences appropriate for particular uses: Brenner, U.S. Pat. No. 5,635,400; Brenner et al, Proc. Natl. Acad. Sci., 97: 1665-1670 (2000); Shoemaker et al, Nature Genetics, 14: 450-456 (1996); Morris et al, European patent publication 0799897A1; Wallace, U.S. Pat. No. 5,981,179, all of which are incorporated by reference in their entirety for all purposes. A barcode sequence may have a length of about 4 to 36 nucleotides, about 6 to 30 nucleotides, or about 8 to 20 nucleotides.
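As an illustration of a barcode encoding sample identity, the sketch below assigns reads to samples by an exact match on a leading barcode. It assumes, purely for the example, that the barcode occupies the first 8 nucleotides of each read; the definition above allows other positions and lengths. The barcode sequences, sample names, and the `demultiplex` function are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_len):
    """Group reads by sample using an exact match on the leading barcode.

    Returns a dict mapping sample name to the list of insert sequences
    (reads with the barcode trimmed off). Reads whose barcode is not in
    the lookup table are collected under "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode, "unassigned")
        bins[sample].append(insert)
    return dict(bins)


if __name__ == "__main__":
    # Hypothetical 8-nucleotide barcodes and made-up reads.
    table = {"ACGTACGT": "sample_A", "TTGGCCAA": "sample_B"}
    reads = [
        "ACGTACGTGGGATTC",
        "TTGGCCAACCCTAGA",
        "ACGTACGTTTTAAAC",
        "NNNNNNNNGATTACA",
    ]
    for sample, inserts in demultiplex(reads, table, 8).items():
        print(sample, inserts)
```

Exact matching is the simplest choice; tolerating sequencing errors in the barcode would need a mismatch-aware lookup, which is outside this sketch.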

[16] As used herein, the terms "CNV," "CNA," "copy number alteration" and "copy number variant" are used interchangeably, and are defined as sections of the genome that are repeated and the number of copies can vary between individuals.

[17] As used herein, the term "complementary" is defined as a relationship between two antiparallel nucleic acid sequences in which the sequences are related by the base-pairing rules: A pairs with T or U and C pairs with G. A first sequence or segment that is "perfectly complementary" to a second sequence or segment is complementary across its entire length and has no mismatches. A first sequence or segment is "substantially complementary" to a second sequence or segment when a polynucleotide consisting of the first sequence is sufficiently complementary to specifically hybridize to a polynucleotide consisting of the second sequence.
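The base-pairing rules above can be turned into a percent complementarity figure. The sketch below assumes two equal-length sequences written 5' to 3' and aligned antiparallel with no gaps, and scores the percentage of positions that form an A-T, A-U, or C-G pair. The function name and the ungapped, equal-length simplification are assumptions made for illustration; the disclosure does not specify how percent complementarity is calculated.

```python
# Watson-Crick pairs permitted by the definition (T and U both pair with A).
PAIRS = {("A", "T"), ("T", "A"), ("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def percent_complementarity(first, second):
    """Percent of positions that base pair when the strands are antiparallel.

    Both sequences are written 5' to 3'. Position i of `first` is compared
    with position i of `second` read from its 3' end.
    """
    first, second = first.upper(), second.upper()
    if not first or len(first) != len(second):
        raise ValueError("sequences must be non-empty and of equal length")
    paired = sum((a, b) in PAIRS for a, b in zip(first, reversed(second)))
    return 100.0 * paired / len(first)


if __name__ == "__main__":
    print(percent_complementarity("ACGTACGTAC", "GTACGTACGT"))  # 100.0
    print(percent_complementarity("ACGTACGTAC", "GTACGTACAA"))  # 80.0
```

Under this scoring, the second pair in the usage example would satisfy an "at least 70% complementary" threshold while not being perfectly complementary.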

[18] As used herein, the term "deletion" is defined as a mutation in which part of the genome or DNA sequence is lost or removed with respect to a human genome reference sequence. Deletions can be as small as 1 nucleotide or base pair, and can be as large as the loss of a chromosome.

[19] As used herein, the term "genomic sequence" is defined as a sequence that occurs in a genome. Because RNAs are transcribed from a genome, this term encompasses sequences that exist in the nuclear genome of an organism, as well as sequences that are present in a cDNA copy of an RNA (e.g., an mRNA) transcribed from such a genome.

[20] As used herein, the term "insertion" is defined as a mutation in which part of the genome or DNA sequence includes a sequence not present in a human genome reference sequence. Insertions can be as small as 1 nucleotide or base pair, and can be as large as tens of millions of base pairs.

[21] As used herein, the terms "library" or "sequencing library" are used interchangeably and are defined as a plurality of nucleic acid fragments obtained from a biological sample. Generally, the fragments are modified with an adaptor sequence which affects coupling (e.g., capture and/or immobilization) of the fragments to a sequencing platform and which adaptors also include primer sequences for amplifying and/or sequencing of the nucleic acid.

[22] As used herein, the term "ligating" is defined as the enzyme catalyzed joining of the terminal nucleotide at the 5' end of a first DNA molecule to the terminal nucleotide at the 3' end of a second DNA molecule.

[23] As used herein, the term "locus" is defined as a location of a gene, nucleotide, or sequence on a chromosome. An "allele" of a locus, as used herein, can refer to an alternative form of a nucleotide or sequence at the locus. A "wild-type allele" refers to an allele that has the highest frequency in a population of subjects. A "wild-type allele" generally is not associated with a disease.

[24] As used herein, the term "mutation" is defined as a change of the nucleotide sequence of a wild-type genome. Mutations can involve large sections of DNA (e.g., copy number variation). Mutations can involve whole chromosomes (e.g., aneuploidy). Mutations can involve small sections of DNA. Examples of mutations involving small sections of DNA include, e.g., point mutations or single nucleotide polymorphisms, multiple nucleotide polymorphisms, insertions (e.g., insertion of one or more nucleotides at a locus), multiple nucleotide changes, deletions (e.g., deletion of one or more nucleotides at a locus), and inversions (e.g., reversal of a sequence of one or more nucleotides).

[25] As used herein, the terms "polynucleotides," "nucleic acid," "nucleotides," and "oligonucleotides" can be used interchangeably, and are defined to mean a polymeric form of nucleotides of any length, either deoxyribonucleotides or ribonucleotides, or analogs thereof. Polynucleotides may have any three-dimensional structure, and may perform any function, known or unknown. The following are non-limiting examples of polynucleotides: coding or non-coding regions of a gene or gene fragment, loci (locus) defined from linkage analysis, exons, introns, messenger RNA (mRNA), transfer RNA, ribosomal RNA, ribozymes, cDNA, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, and primers. A polynucleotide may comprise modified nucleotides, such as methylated nucleotides and nucleotide analogs. If present, modifications to the nucleotide structure may be imparted before or after assembly of the polymer. The sequence of nucleotides may be interrupted by non-nucleotide components. A polynucleotide may be further modified after polymerization, such as by conjugation with a labeling component.

[26] As used herein, the term "rearrangement" is defined as chromosome changes that result in a different structure of a native chromosome. Rearrangements can be, for example, deletions, duplications, inversions, and translocations.

[27] As used herein, the terms "sample" and "nucleic acid sample" are defined as any substance containing or presumed to contain nucleic acid. The sample can be a biological sample obtained from a subject. The nucleic acids can be RNA, DNA, e.g., genomic DNA, mitochondrial DNA, viral DNA, synthetic DNA, or cDNA reverse transcribed from RNA. The nucleic acids in a nucleic acid sample generally serve as templates for extension of a hybridized primer. In some embodiments, the biological sample is a liquid sample. The liquid sample can be whole blood, plasma, serum, ascites, cerebrospinal fluid, sweat, urine, tears, saliva, buccal sample, cavity rinse, or organ rinse. The liquid sample can be an essentially cell-free liquid sample (e.g., plasma, serum, sweat, urine, tears, etc). In other embodiments, the biological sample is a solid biological sample, e.g., feces or tissue biopsy, e.g., a tumor biopsy. A sample can also comprise in vitro cell culture constituents (including but not limited to conditioned medium resulting from the growth of cells in cell culture medium, recombinant cells and cell components).

[28] As used herein, the term "sequencing" is defined as a method by which the identity of at least 10 consecutive nucleotides (e.g., the identity of at least 20, at least 50, at least 100, at least 200, or at least 500 or more consecutive nucleotides) of a polynucleotide is obtained.

[29] As used herein, the term "single nucleotide variant" or "SNV" is defined as a type of genomic sequence variation resulting from a single nucleotide substitution within a sequence. "SNV alleles" or "alleles of a SNV" generally refer to alternative forms of the SNV at a particular locus.

[30] As used herein, "small indels" is defined as small insertions and small deletions in the genome. These small insertions and small deletions are 1 to 200 bp in length.
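The size-based definitions of SNVs and small indels can be applied mechanically to a reference/alternate allele pair. The sketch below assumes alleles are given as plain strings in a VCF-like style, where an indel shares its leading base with the reference, so the length difference equals the number of inserted or deleted bases. The function name and the category labels are illustrative; only the 1 to 200 bp range comes from the definition above.

```python
def classify_variant(ref, alt, max_small_indel=200):
    """Classify a ref/alt allele pair by size.

    Uses the 1 to 200 bp range from the small-indel definition; anything
    longer is labelled as a large insertion or deletion.
    """
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    diff = len(alt) - len(ref)
    if diff > 0:
        return "small insertion" if diff <= max_small_indel else "large insertion"
    if diff < 0:
        return "small deletion" if -diff <= max_small_indel else "large deletion"
    # Same length but more than one base changed (e.g., a multiple nucleotide change).
    return "other"


if __name__ == "__main__":
    print(classify_variant("A", "G"))               # SNV
    print(classify_variant("A", "ATT"))             # small insertion
    print(classify_variant("ACGT", "A"))            # small deletion
    print(classify_variant("A", "A" + "T" * 201))   # large insertion
```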

[31] As used herein, the term "subject" is defined to mean a biological entity containing expressed genetic materials. The biological entity can be a plant, animal, or microorganism, including, e.g., bacteria, viruses, fungi, and protozoa. The subject can be tissues, cells and their progeny of a biological entity obtained in vivo or cultured in vitro. The subject can be a mammal. The mammal can be a human. The human may be diagnosed or suspected of being at high risk for a disease. The disease can be cancer.

[32] As used herein, the term "target polynucleotide," is defined as a polynucleotide of interest under study. A target polynucleotide may contain one or more sequences that are of interest and under study. A target polynucleotide can comprise, for example, a genomic sequence. The target polynucleotide can comprise a target sequence whose presence, amount, and/or nucleotide sequence, or changes in these, are desired to be determined.

[33] As used herein, the term "translocation" is defined as the transfer of a segment from one chromosome to another chromosome, or transfer of the segment to a new location in the same chromosome.

[34] As used herein, the term "wild-type" is defined as a gene sequence that is most prevalent for a locus among a population of subjects of the same species.

Preparation of Nucleic Acids

[35] In some aspects, libraries described herein are prepared from single stranded or double stranded nucleic acids. Single-stranded nucleic acids can be prepared from a sample of double-stranded nucleic acid using any means known in the art or described herein. Starting samples can be a biological sample obtained from a subject. The biological sample can be tissues, cells and their progeny obtained from a subject. The biological sample can be a liquid sample including, for example, whole blood, plasma, serum, ascites, cerebrospinal fluid, sweat, urine, tears, saliva, buccal sample, cavity rinse, or organ rinse. The liquid sample can be an essentially cell-free liquid sample (e.g., plasma, serum, sweat, urine, tears, etc). The biological sample can also be a solid biological sample, e.g., feces, tissue biopsy, a tumor biopsy, FFPE tissues, etc. A biological sample can also comprise in vitro cell culture constituents (including but not limited to conditioned medium resulting from the growth of cells in cell culture medium, recombinant cells and cell components). The biological sample can be any of the above stored in a suitable way such as, for example, FFPE, lyophilized, stored in buffers, frozen, etc.

[36] The nucleic acid can be DNA obtained from a biological sample. The DNA can be obtained from formalin-fixed paraffin-embedded (FFPE) tissues (or cells) or circulating DNA. DNA can be isolated from FFPE samples using commercially available kits (e.g., those sold by Qiagen or Covaris). The DNA can also be cDNA generated from RNA isolated from a biological sample using random primed reverse transcription (RNaseH+) to generate randomly sized cDNA.

[37] The DNA can be fragmented in situ to a desired size, e.g., the DNA can be sheared to an average size of 500-600 base pairs. Fragmented DNA can be treated with a base excision repair enzyme or enzyme cocktail (e.g., Endo VIII, formamidopyrimidine DNA glycosylase (FPG)) to excise damaged bases that can interfere with polymerization. The DNA can also be treated with a proof-reading polymerase (e.g. T4 DNA polymerase) to polish ends and replace damaged nucleotides (e.g. abasic sites) and a heat-labile phosphatase to remove all phosphate groups from DNA. The reaction mixture can be heated to 80 °C for 10 min to inactivate the phosphatase and/or polymerase and denature double stranded DNA to single strands.

[38] The nucleic acid sample can also be enriched for target polynucleotides. Target enrichment can be by any means known in the art. For example, the nucleic acid sample may be enriched by amplifying target sequences using target-specific primers. The target amplification can occur in a digital PCR format, using any methods or systems known in the art. The nucleic acid sample may be enriched by capture of target sequences onto an array of immobilized, target-selective oligonucleotides. The nucleic acid sample may be enriched by hybridizing to target-selective oligonucleotides free in solution. The oligonucleotides may comprise a capture moiety which enables capture by a capture reagent. Other target capture methods are described in United States patent application Serial No. 15/099,525 filed April 14, 2016, which is incorporated by reference in its entirety for all purposes.

Methods for Making Libraries

[39] Nucleic acid libraries can be made using the methods disclosed herein. For example, double stranded DNA (or other nucleic acids) can be fragmented to a desired size using, e.g., restriction enzymes, nuclease, sonication, shearing, other physical treatments that break the nucleic acids, or combinations of the foregoing. When the fragmented nucleic acids are obtained from formalin fixed paraffin embedded (FFPE) samples or from other sources with damaged DNA, the fragmented nucleic acids can be treated to remove damaged DNA nucleotides from the nucleic acids. Damage can include, for example, apurinic sites, apyrimidinic sites, thymine dimers, nicks, gaps, deaminated cytosine, and 8-oxoguanine. Treatments for damaged nucleic acids include, for example, use of AP-endonuclease 1, Uracil DNA glycosylase (UDG), formamidopyrimidine [fapy]-DNA glycosylase (Fpg), bifunctional DNA glycosylase OGG1, other glycosylases, DNA polymerase β, X-ray repair cross-complementing group 1 (XRCC1), DNA ligase III, Poly(ADP-ribose) polymerase (PARP-1), Uvr proteins, Endo VIII, nucleotide excision repair enzymes (e.g., CETN2, DDB1, DDB2, ERCC1, ERCC2, ERCC3, ERCC4, ERCC5, ERCC6, ERCC8, LIG1, MNAT1, MMS19, RAD23A, RAD23B, RPA1, RPA2, TFIIH, XAB2, XPA, XPC). Commercially available repair products include, for example, PreCR Repair Mix sold by New England Biolabs, NEBNext FFPE DNA Repair Mix sold by New England Biolabs, or the DNA Repair Kits (version B) catalog Nos. 51296 & 51796 sold by Active Motif North America. The damaged nucleic acids can be treated using the above to remove the damaged base and the sugar to which it was attached resulting in a gap or break in the nucleic acid.

[40] When the fragmented and damage treated nucleic acids are double stranded DNA, the double stranded DNA can be converted to single stranded DNA by denaturing the dsDNA with an appropriate treatment (e.g., heat, chaotropes, or combinations). Heat denaturation can be achieved by heating a dsDNA sample to about 60 °C or above, about 65 °C or above, about 70 °C or above, about 75 °C or above, about 80 °C or above, about 85 °C or above, about 90 °C or above, about 95 °C or above, or about 98 °C or above. The dsDNA sample can be heated by any means known in the art, including, e.g., incubation in a water bath, a temperature controlled heat block, or a thermal cycler. In some embodiments the sample is heated for 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10 minutes.

[41] Compounds like urea and formamide contain functional groups that can form H-bonds with the electronegative centers of the nucleotide bases. At high concentrations (e.g., 8M urea or 70% formamide) of the denaturant, the competition for H-bonds favors interactions between the denaturant and the N-bases rather than between complementary bases, thereby separating the two strands.

[42] Denaturation by incubation in basic pH can be achieved by, for example, incubation of a dsDNA sample in a solution comprising sodium hydroxide (NaOH) or potassium hydroxide (KOH). The solution can comprise about 1 mM NaOH, 2 mM NaOH, 5 mM NaOH, 10 mM NaOH, 20 mM NaOH, 40 mM NaOH, 60 mM NaOH, 80 mM NaOH, 100 mM NaOH, 0.2M NaOH, about 0.3M NaOH, about 0.4M NaOH, about 0.5M NaOH, about 0.6M NaOH, about 0.7M NaOH, about 0.8M NaOH, about 0.9M NaOH, about 1.0M NaOH, or greater than 1.0M NaOH. The solution can comprise about 1 mM KOH, 2 mM KOH, 5 mM KOH, 10 mM KOH, 20 mM KOH, 40 mM KOH, 60 mM KOH, 80 mM KOH, 100 mM KOH, 0.2M KOH, 0.5M KOH, 1M KOH, or greater than 1M KOH. The dsDNA sample can be incubated in NaOH or KOH for 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, or more than 60 minutes. The dsDNA can be incubated in Na-acetate following NaOH or KOH incubation.

[43] The ssDNA fragments (or dsDNA prior to denaturation) can then be treated to add a phosphate to the 5' end of the fragments and optionally to remove the phosphate on the 3' end of the fragments. For example, the DNA can be treated with T4 polynucleotide kinase and ATP to add phosphate to the 5' end and remove phosphates from the 3' end of the DNA strands. In addition to removing the phosphate on the 3' end of the DNA fragments or strands, the 3' end can also be blocked with appropriate blocking groups, e.g., dideoxynucleotides can be added to the 3' end, or reversible protection groups can be added to the 3' end to prevent ligation reactions at the 3' end of the DNA fragments or strands. Removal of 3' phosphates or blocking of the 3' end of the DNA fragments can minimize aberrant ligation of two library members. Accordingly, in some embodiments, 3' phosphates are removed and/or 3' ends are blocked in at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or greater than 95% of DNA fragments. Substantially all phosphate groups can be removed and/or substantially all 3' ends can be blocked in the DNA fragments. Substantially all phosphates are removed and/or substantially all 3' ends are blocked in at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or greater than 95% of DNA fragments in a sample.

[44] The 3'-end of a nucleic acid, adapter or other oligonucleotide can be blocked with a phosphate. When the reaction requiring the blocking of the 3' end is stopped, the 3'-phosphate can be removed, allowing the 3' end of the nucleic acid, adapter or other oligonucleotide to react in a chain polymerization reaction. Other 3' end blocking groups include, for example, a 3'-O-azidomethyl group, a dideoxynucleotide, 3'-O-(α-methoxyethyl)ether, 3'-O-isovaleryl ester, 3'-ONH2 blocking groups, certain dyes, etc.

[45] Single stranded DNA fragments with 5' phosphate and dephosphorylated (and/or protected) 3' ends are ligated to 5'-adapters on the 5' end of the fragments using ligase. The ligase can be a DNA or RNA ligase. Commercially sold DNA ligases include, for example, T3 DNA ligase, T4 DNA ligase, T7 DNA ligase, Taq DNA ligase, DNA ligase III, or Pfu DNA ligase. The RNA ligase can be an Rnl 1 or Rnl 2 family ligase. Generally, Rnl 1 family ligases can repair single-stranded breaks in tRNA. Exemplary Rnl 1 family ligases include, e.g., T4 RNA ligase, or thermostable RNA ligase 1 from Thermus scotoductus bacteriophage TS2126. These ligases generally catalyze the ATP-dependent formation of a phosphodiester bond between a nucleotide 3'-OH nucleophile and a 5' phosphate group. Generally, Rnl 2 family ligases can seal nicks in duplex RNAs. Exemplary Rnl 2 family ligases include, e.g., T4 RNA ligase 2. The RNA ligase can be an Archaeal RNA ligase, e.g., an archaeal RNA ligase from the thermophilic archaeon Methanobacterium thermoautotrophicum (MthRnl).

[46] The ligation of adaptors to the single-stranded or double stranded nucleic acid fragments can comprise preparing a reaction mixture comprising the DNA fragments, an adaptor, and ligase. The reaction can be performed at room temperature, or at lower temperatures. The reaction mixture can also be heated to effect ligation of the adaptors to the DNA fragments. The reaction mixture can be heated to about 30 °C, about 35 °C, 37 °C, about 40 °C, about 45 °C, about 50 °C, about 55 °C, about 60 °C, about 65 °C, about 70 °C, or above 70 °C. The reaction mixture can be heated to about 60-70 °C. The reaction mixture can be heated for a sufficient time to effect ligation of the adaptor to the DNA fragment. The reaction mixture can be heated for about 5 min, about 10 min, about 15 min, about 20 min, about 25 min, about 30 min, about 35 min, about 40 min, about 45 min, about 50 min, about 55 min, about 60 min, about 70 min, about 80 min, about 90 min, about 120 min, about 150 min, about 180 min, about 210 min, about 240 min, or more than 240 min.

[47] The adaptors can be present at a concentration that is greater than the concentration of DNA fragments in the mixture. The adaptors can be present at a concentration that is at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% or more than 100% greater than the concentration of DNA fragments in the mixture. The adaptors can be present at a concentration that is at least 10-fold, 100-fold, 1000-fold, or 10000-fold greater than the concentration of DNA fragments in the mixture. The adaptors can be present at a final concentration of 0.1 µM, 0.5 µM, 1 µM, 10 µM or greater. The ligase can be present in the reaction mixture at any amount suitable for ligation, including, for example, a saturating amount.
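
By way of illustration only, the relationship between adaptor and fragment concentrations can be expressed as a short calculation. The following Python sketch is not part of the disclosed methods; the function names are hypothetical, and the conversion from mass concentration to molar concentration assumes an average mass of about 330 g/mol per nucleotide of single stranded DNA, which is a common approximation rather than a value taken from this disclosure.

```python
# Illustrative sketch only: names and the 330 g/mol-per-nucleotide
# approximation for ssDNA are assumptions, not part of the disclosure.

AVG_SSDNA_NT_MASS = 330.0  # g/mol per nucleotide (approximation)


def ssdna_micromolar(ng_per_ul: float, length_nt: int) -> float:
    """Convert an ssDNA mass concentration (ng/uL) to micromolar."""
    if length_nt <= 0:
        raise ValueError("fragment length must be positive")
    molar_mass = length_nt * AVG_SSDNA_NT_MASS  # g/mol
    # 1 ng/uL equals 1e-3 g/L; dividing by g/mol gives mol/L; 1e6 gives uM.
    return ng_per_ul * 1e-3 / molar_mass * 1e6


def fold_excess(adaptor_um: float, fragment_um: float) -> float:
    """Return the adaptor concentration as a multiple of the fragment concentration."""
    if fragment_um <= 0:
        raise ValueError("fragment concentration must be positive")
    return adaptor_um / fragment_um


def percent_greater(adaptor_um: float, fragment_um: float) -> float:
    """Return how much greater (in percent) the adaptor concentration is."""
    return (fold_excess(adaptor_um, fragment_um) - 1.0) * 100.0


if __name__ == "__main__":
    fragments = ssdna_micromolar(ng_per_ul=1.0, length_nt=200)
    print(f"fragments: {fragments:.4f} uM")
    print(f"fold excess at 1 uM adaptor: {fold_excess(1.0, fragments):.0f}")
```

Under these assumptions, a 1 µM adaptor concentration is roughly a 66-fold molar excess over 200-nucleotide fragments at 1 ng/µL.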

[48] The reaction mixture can also comprise a high molecular weight inert molecule, e.g., PEG of MW 4000, 6000, or 8000. The inert molecule can be present in an amount that is about 0.5%, 1%, 2%, 3%, 4%, 5%, 7.5%, 10%, 12.5%, 15%, 17.5%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, or greater than 50% weight/volume. The inert molecule can be present in an amount that is about 0.5-2%, about 1-5%, about 2-15%, about 10-20%, about 15-30%, about 20-50%, or more than 50% weight/volume.

[49] If desired, unligated 5'-adapters can be separated or removed from the fragment-5'-adapters by purification or other methods including, for example, filtration by molecular weight cutoff, size exclusion chromatography, use of a spin column, selective precipitation with polyethylene glycol (PEG), selective precipitation with PEG onto a silica matrix, alcohol precipitation, sodium acetate precipitation, PEG and salt precipitation, or high stringency washing. The fragment-5'-adapters are then ligated with 3'-adapters on the 3' end of the 5'-adapters-fragment. If a protective group has been placed on the 3' end of the fragments to prevent ligation reactions, this protective group must be removed prior to the ligation of the 3'-adapter. If desired, unligated 3'-adapters can be separated or removed from the 5'-adapter-fragment-3'-adapter by purification or other methods.

[50] The library of 5'-adapter-fragment-3'-adapter nucleic acids can be directly sequenced, or the library can be subject to amplification. The amplification can be performed using primers specific for sequences in the adapters, or a target directed amplification can be done using approaches such as those described in, for example, United States patent application Serial No. 15/099,525 filed April 14, 2016, which is incorporated by reference in its entirety for all purposes.

Libraries

[51] The methods described herein can produce libraries for a variety of purposes including, for example, detection of mutations, detection of alleles, retrospective studies, disease diagnostics and monitoring, diagnostics and monitoring for disorders, research, etc.

[52] The libraries can be made from any biological sample including, for example, a sample obtained from a biological entity containing expressed genetic materials. The biological entity can be obtained from a plant, animal, or microorganism, including, e.g., bacteria, viruses, fungi, and protozoa. The biological sample can be tissues, cells and their progeny of a biological entity obtained in vivo or cultured in vitro. The biological entity can be a mammal, and the mammal can be a human. The human may be diagnosed or suspected of being at high risk for a disease. The disease can be cancer. The biological sample can be a liquid sample including, for example, whole blood, plasma, serum, ascites, cerebrospinal fluid, sweat, urine, tears, saliva, buccal sample, cavity rinse, or organ rinse. The liquid sample can be an essentially cell-free liquid sample (e.g., plasma, serum, sweat, urine, tears, etc.). The biological sample can be a solid biological sample, e.g., feces, tissue biopsy, tumor biopsy, FFPE tissues. A biological sample can also comprise in vitro cell culture constituents (including but not limited to conditioned medium resulting from the growth of cells in cell culture medium, recombinant cells and cell components). The biological sample can be any of the above stored in a suitable way such as, for example, FFPE, lyophilized, stored in buffers, frozen, etc.

[53] Nucleic acids are obtained from the biological sample for use in making the libraries described herein. Nucleic acids can be RNA, DNA, e.g., genomic DNA, mitochondrial DNA, viral DNA, synthetic DNA, or cDNA reverse transcribed from RNA. The nucleic acids in a nucleic acid sample can serve as templates for extension of a hybridized primer and/or can be substrates for attachment of adapters. The nucleic acids in the library can be single stranded or double stranded.

[54] The nucleic acids in the library can be directly sequenced or the nucleic acids can be amplified (selectively or non-selectively) followed by sequencing. The nucleic acids are generally modified with an adaptor sequence which affects coupling (e.g., capture and/or immobilization) of the fragments to a sequencing platform, and which adapters can include sequences complementary to primers useful for amplification or sequencing of the nucleic acids. The nucleic acids of the library can also be used to make whole exome libraries by generating a whole-genome library that can then be subject to capture of the known exon regions in the human or other organism genome, by methods known in the art such as hybridization with a set of biotinylated long oligonucleotide baits complementary to said regions and subsequent pull-down.

[55] The nucleic acids in the library can be enriched for target polynucleotides. Target enrichment can be by any means known in the art. For example, the nucleic acid sample may be enriched by amplifying target sequences using target-specific primers. The target amplification can occur in a digital PCR format, using any methods or systems known in the art. The nucleic acid sample may be enriched by capture of target sequences onto an array having target-selective oligonucleotides immobilized thereon. The nucleic acid sample may be enriched by hybridizing to target-selective oligonucleotides free in solution. The oligonucleotides may comprise a capture moiety which enables capture by a capture reagent. Exemplary capture moieties and capture reagents are described herein.

[56] Libraries can be made from genomic material or mRNA that provide good coverage depth and high percentage coverage of the genome (or expressed genes). The libraries can provide a median coverage depth of at least 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1,000, 5,000, 10,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000, or 100,000 fold. The libraries can provide a median coverage depth of at least 20-30, 20-35, 20-40, 20-50, 20-60, 20-70, 20-80, 20-90, 20-100, 30-40, 30-50, 30-60, 30-70, 30-80, 30-90, 30-100, 40-100, 50-100, 60-100, 70-100, 80-100, 90-100, 100-200, 100-300, 100-400, 100-500, 100-600, 100-700, 100-800, 100-900, 100-1000, 200-500, 500-1000, 1000-10,000, 10,000-50,000, or 50,000-100,000 fold. The libraries can also provide 70%, 80%, 90%, 95%, 99% or 100% coverage of a genome, expressed genes, or target sequence with a coverage depth of 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1,000, 5,000, 10,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000, or 100,000 fold. The libraries can provide a sensitivity and/or precision of 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99%, 99.5% or 99.99%.
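
By way of illustration only, median coverage depth and percentage coverage at a given depth can be computed from a list of per-position read depths. The following Python sketch assumes that such a list has already been obtained (for example, from an alignment); the function name and the example values are hypothetical and are not data from this disclosure.

```python
# Illustrative sketch only: the function name and example depths are
# hypothetical and do not represent results from the disclosed methods.
from statistics import median
from typing import Sequence, Tuple


def coverage_summary(depths: Sequence[int], min_depth: int) -> Tuple[float, float]:
    """Return (median depth, percent of positions with depth >= min_depth)."""
    if not depths:
        raise ValueError("at least one position is required")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return median(depths), 100.0 * covered / len(depths)


if __name__ == "__main__":
    example_depths = [0, 10, 20, 30, 40]  # made-up per-position depths
    med, pct = coverage_summary(example_depths, min_depth=20)
    print(f"median depth: {med}, covered at >=20x: {pct:.1f}%")
```

In this sketch, a statement such as "90% coverage with a coverage depth of 30 fold" corresponds to the second returned value being at least 90 when `min_depth` is 30.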

Sequencing Methodologies

[57] Any sequencing methodologies may be used with the nucleic acids disclosed herein. Commercially available sequencing methods include, e.g., sequencing-by-synthesis, ion semiconductor sequencing, pyrosequencing, reversible dye terminator sequencing, sequencing by ligation, single-molecule sequencing, sequencing by hybridization, and nanopore sequencing. Platforms for sequencing by synthesis are available from, e.g., Illumina, 454 Life Sciences, Helicos Biosciences, and Qiagen. Illumina platforms can include, e.g., Illumina's Solexa platform, Illumina's Genome Analyzer, and are described in Gudmundsson et al (Nat. Genet. 2009 41:1122-6), Out et al (Hum. Mutat. 2009 30:1703-12) and Turner (Nat. Methods 2009 6:315-6), U.S. Patent Application Pub nos. US20080160580 and US20080286795, U.S. Pat. Nos. 6,306,597, 7,115,400, and 7,232,656, all of which are incorporated by reference in their entirety for all purposes. 454 Life Science platforms include, e.g., the GS Flex and GS Junior, and are described in U.S. Pat. No. 7,323,305, which is incorporated by reference in its entirety for all purposes. Platforms from Helicos Biosciences include the True Single Molecule Sequencing platform. Platforms for ion semiconductor sequencing include, e.g., the Ion Torrent Personal Genome Machine (PGM) and are described in U.S. Pat. No. 7,948,015, which is incorporated by reference in its entirety for all purposes. Platforms for pyrosequencing include the GS Flex 454 system and are described in U.S. Pat. Nos. 7,211,390; 7,244,559; 7,264,929, which are incorporated by reference in their entirety for all purposes. Platforms and methods for sequencing by ligation include, e.g., the SOLiD sequencing platform from Thermo Fisher described in U.S. Pat. No. 5,750,341, which is incorporated by reference in its entirety for all purposes, and the DNA nanoball sequencing platform from Complete Genomics. Platforms for single-molecule sequencing include the SMRT system from Pacific Biosciences and the Helicos True Single Molecule Sequencing platform. Sanger sequencing, including automated Sanger sequencing, can also be used to sequence nucleic acids. Exemplary sequencing technologies are described below.

[58] The DNA sequencing technology can utilize the Ion Torrent sequencing platform, which pairs semiconductor technology with a sequencing chemistry to directly translate chemically encoded information (A, C, G, T) into digital information (0, 1) on a semiconductor chip. Without wishing to be bound by theory, when a nucleotide is incorporated into a strand of DNA by a polymerase, a hydrogen ion is released as a byproduct. The Ion Torrent platform detects the release of the hydrogen ion as a change in pH. A detected change in pH can be used to indicate nucleotide incorporation. The Ion Torrent platform comprises a high-density array of micro-machined wells to perform this biochemical process in a massively parallel way. Each well holds a different library member, which may be clonally amplified. Beneath the wells is an ion-sensitive layer and beneath that an ion sensor. The platform sequentially floods the array with one nucleotide after another. When a nucleotide, for example a C, is added to a DNA template and is then incorporated into a strand of DNA, a hydrogen ion will be released. The charge from that ion will change the pH of the solution, which can be identified by Ion Torrent's ion sensor. If the nucleotide is not incorporated, no voltage change will be recorded and no base will be called. If there are two identical bases on the DNA strand, the voltage will be double, and the chip will record two identical bases called. Direct identification allows recordation of nucleotide incorporation in seconds. Library preparation for the Ion Torrent platform generally involves ligation of two distinct adaptors at both ends of a DNA fragment.
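
By way of illustration only, the flow-based logic described above (no signal means no base called, a doubled signal means two identical bases called) can be sketched in a few lines of Python. This is a simplified model and not the base caller of any commercial platform; the function name, the flow order, and the normalized signal values are hypothetical, and the sketch assumes that each signal has already been normalized so that one incorporation is approximately 1.0.

```python
# Illustrative sketch only: a toy model of flow-space base calling.
# The flow order and signal values are hypothetical assumptions.
from typing import Sequence


def call_bases(flow_order: str, signals: Sequence[float]) -> str:
    """Translate normalized per-flow signals into a base sequence.

    Each flow introduces one nucleotide. The rounded signal is taken as
    the number of identical bases incorporated during that flow.
    """
    if len(flow_order) != len(signals):
        raise ValueError("one signal is required per flow")
    called = []
    for nucleotide, signal in zip(flow_order, signals):
        count = max(round(signal), 0)  # no signal -> no base called
        called.append(nucleotide * count)
    return "".join(called)


if __name__ == "__main__":
    flows = "TACGTACG"  # hypothetical repeating flow order
    measured = [1.0, 0.1, 2.1, 0.0, 0.0, 0.9, 0.0, 1.1]
    print(call_bases(flows, measured))
```

In this toy example the third flow (C) produces roughly double the single-base signal, so two C bases are called for that flow.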

[59] Illumina products generally employ cluster amplification of library members onto a flow cell and a sequencing-by-synthesis approach. Cluster-amplified library members are subjected to repeated cycles of polymerase-directed single base extension. Single-base extension can involve incorporation of reversible-terminator dNTPs, each dNTP labeled with a different removable fluorophore. The reversible-terminator dNTPs are generally 3' modified to prevent further extension by the polymerase. After incorporation, the incorporated nucleotide can be identified by fluorescence imaging. Following fluorescence imaging, the fluorophore can be removed and the 3' modification can be removed resulting in a 3' hydroxyl group, thereby allowing another cycle of single base extension. Library preparation for the Illumina platform generally involves ligation of two distinct adaptors at both ends of a DNA fragment.

[60] HELICOS™ True Single Molecule Sequencing (TSMS™) can employ sequencing-by-synthesis technology. In the TSMS™ technique, a polyA adaptor can be ligated to the 3' end of DNA fragments. The adapted fragments can be hybridized to poly-T oligonucleotides immobilized on the TSMS™ flow cell. The library members can be immobilized onto the flow cell at a density of about 100 million templates/cm². The flow cell can be then loaded into an instrument, e.g., HELISCOPE™ sequencer, and a laser can illuminate the surface of the flow cell, revealing the position of each template. A CCD camera can map the position of the templates on the flow cell surface. The library members can be subjected to repeated cycles of polymerase-directed single base extension. The sequencing reaction begins by introducing a DNA polymerase and a fluorescently labeled nucleotide. The polymerase can incorporate the labeled nucleotides to the primer in a template directed manner. The polymerase and unincorporated nucleotides can be removed. The templates that have directed incorporation of the fluorescently labeled nucleotide can be discerned by imaging the flow cell surface. After imaging, a cleavage step can remove the fluorescent label, and the process can be repeated with other fluorescently labeled nucleotides until a desired read length is achieved. Sequence information can be collected with each nucleotide addition step.

[61] The 454 sequencing platform (Roche) (e.g. as described in Margulies, M. et al. Nature 437:376-380 (2005), which is incorporated by reference in its entirety for all purposes) generally uses two steps. In a first step, DNA can be sheared into fragments. The fragments can be blunt-ended. Oligonucleotide adaptors can be ligated to the ends of the fragments. The adaptors generally serve as primers for amplification and sequencing of the fragments. At least one adaptor can comprise a capture reagent, e.g., a biotin. The fragments can be attached to DNA capture beads, e.g., streptavidin-coated beads. The fragments attached to the beads can be PCR amplified within droplets of an oil-water emulsion, resulting in multiple copies of clonally amplified DNA fragments on each bead. In a second step, the beads can be captured in wells, which can be pico-liter sized. Pyrosequencing can be performed on each DNA fragment in parallel. Pyrosequencing generally detects release of pyrophosphate (PPi) upon nucleotide incorporation. PPi can be converted to ATP by ATP sulfurylase in the presence of adenosine 5' phosphosulfate. Luciferase can use ATP to convert luciferin to oxyluciferin, thereby generating a light signal that is detected. A detected light signal can be used to identify the incorporated nucleotide.

[62] The SOLiD™ platform generally utilizes a sequencing-by-ligation approach. Library preparation for use with a SOLiD™ platform generally comprises ligation of adaptors to the 5' and 3' ends of the fragments to generate a fragment library. Alternatively, internal adaptors can be introduced by ligating adaptors to the 5' and 3' ends of the fragments, circularizing the fragments, digesting the circularized fragment to generate an internal adaptor, and attaching adaptors to the 5' and 3' ends of the resulting fragments to generate a mate-paired library. Next, clonal bead populations can be prepared in microreactors containing beads, primers, template, and PCR components. Following PCR, the templates can be denatured. Beads can be enriched for beads with extended templates. Templates on the selected beads can be subjected to a 3' modification that permits bonding to a glass slide. The sequence can be determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After a color is recorded, the ligated oligonucleotide can be removed and the process can then be repeated.

[63] Single molecule, real-time (SMRT™) sequencing (PACIFIC BIOSCIENCES®) uses the continuous incorporation of dye-labeled nucleotides with imaging during DNA synthesis. Single DNA polymerase molecules can be attached to the bottom surface of individual zero-mode waveguides (ZMWs) that obtain sequence information while phospholinked nucleotides are being incorporated into the growing primer strand. A ZMW generally refers to a confinement structure which enables observation of incorporation of a single nucleotide by DNA polymerase against a background of fluorescent nucleotides that rapidly diffuse in and out of the ZMW on a microsecond scale. By contrast, incorporation of a nucleotide generally occurs on a millisecond timescale. During this time, the fluorescent label can be excited to produce a fluorescent signal, which is detected. Detection of the fluorescent signal can be used to generate sequence information. The fluorophore can then be removed, and the process repeated. Library preparation for the SMRT™ platform generally involves ligation of hairpin adaptors to the ends of DNA fragments.

[64] Nanopore sequencing DNA analysis techniques are being industrially developed by a number of companies, including Oxford Nanopore Technologies (Oxford, United Kingdom). Nanopore sequencing is a single-molecule sequencing technology whereby a single molecule of DNA is sequenced directly as it passes through a nanopore. A nanopore can be a small hole, of the order of 1 nanometer in diameter. Immersion of a nanopore in a conducting fluid and application of a potential (voltage) across it can result in a slight electrical current due to conduction of ions through the nanopore. The amount of current which flows is sensitive to the size and shape of the nanopore and to occlusion by, e.g., a DNA molecule. As a DNA molecule passes through a nanopore, each nucleotide on the DNA molecule obstructs the nanopore to a different degree, changing the magnitude of the current through the nanopore to different degrees. Thus, this change in the current as the DNA molecule passes through the nanopore represents a reading of the DNA sequence.

[65] The DNA sequencing technology can utilize a chemical-sensitive field effect transistor (chemFET) array (e.g., as described in U. S. Patent Application Publication No. 20090026082, which is incorporated by reference in its entirety for all purposes). In one example of the technique, DNA molecules can be placed into reaction chambers, and the template molecules can be hybridized to a sequencing primer bound to a polymerase. Incorporation of one or more triphosphates into a new nucleic acid strand at the 3' end of the sequencing primer can be discerned by a change in current by a chemFET. An array can have multiple chemFET sensors. In another example, single nucleic acids can be attached to beads, and the nucleic acids can be amplified on the bead, and the individual beads can be transferred to individual reaction chambers on a chemFET array, with each chamber having a chemFET sensor, and the nucleic acids can be sequenced.

[66] The DNA sequencing technology can utilize transmission electron microscopy (TEM). The method, termed Individual Molecule Placement Rapid Nano Transfer (IMPRNT), generally comprises single atom resolution transmission electron microscope imaging of high-molecular weight (150 kb or greater) DNA selectively labeled with heavy atom markers and arranging these molecules on ultra-thin films in ultra-dense (3 nm strand-to-strand) parallel arrays with consistent base-to-base spacing. The electron microscope is used to image the molecules on the films to determine the position of the heavy atom markers and to extract base sequence information from the DNA. The method is further described in PCT patent publication WO 2009/046445, which is incorporated by reference in its entirety for all purposes. The method allows for sequencing complete human genomes in less than ten minutes.

[67] Sequencing By Hybridization (SBH) generally comprises contacting a plurality of polynucleotide sequences with a plurality of polynucleotide probes, wherein each of the plurality of polynucleotide probes can be optionally tethered to a substrate. The substrate might be a flat surface comprising an array of known nucleotide sequences. The pattern of hybridization to the array can be used to determine the polynucleotide sequences present in the sample. In other embodiments, each probe is tethered to a bead, e.g., a magnetic bead or the like. Hybridization to the beads can be identified and used to identify the plurality of polynucleotide sequences within the sample.
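
By way of illustration only, the idea that a pattern of hybridization reveals which sequences are present can be modeled very simply: a probe is treated as hybridizing when its reverse complement occurs in the target. The following Python sketch makes that simplifying assumption (perfect, full-length complementarity with no mismatches); the function names and the example sequences are hypothetical.

```python
# Illustrative sketch only: hybridization is modeled as an exact
# reverse-complement match, which ignores mismatches and kinetics.
from typing import Iterable, List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return sequence.translate(_COMPLEMENT)[::-1]


def hybridizing_probes(target: str, probes: Iterable[str]) -> List[str]:
    """Return the probes whose reverse complement occurs in the target."""
    return [probe for probe in probes if reverse_complement(probe) in target]


if __name__ == "__main__":
    sample_strand = "GATTACA"               # made-up target sequence
    probe_set = ["TGTAA", "AATC", "CCCC"]   # made-up probes on an array
    print(hybridizing_probes(sample_strand, probe_set))
```

The list of hybridizing probes plays the role of the hybridization pattern; inferring a full sequence from such a pattern requires additional assembly logic that is not shown here.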

[68] The length of the sequence read can vary depending on the particular sequencing technology utilized. Sequencing methodologies can provide sequence reads that vary in size from tens to hundreds, or thousands of base pairs. Using sequencing methods described herein, and others known in the art, the sequence reads can be about 20 bases long, about 25 bases long, about 30 bases long, about 35 bases long, about 40 bases long, about 45 bases long, about 50 bases long, about 55 bases long, about 60 bases long, about 65 bases long, about 70 bases long, about 75 bases long, about 80 bases long, about 85 bases long, about 90 bases long, about 95 bases long, about 100 bases long, about 110 bases long, about 120 bases long, about 130 bases long, about 140 bases long, about 150 bases long, about 200 bases long, about 250 bases long, about 300 bases long, about 350 bases long, about 400 bases long, about 450 bases long, about 500 bases long, about 600 bases long, about 700 bases long, about 800 bases long, about 900 bases long, about 1000 bases long, 2,000 bases long, 3000 bases long, 4000 bases long, 5000 bases long, 10,000 bases long, 20,000 bases long, 30,000 bases long, 40,000 bases long, 50,000 bases long, 100,000 bases long, or more than 100,000 bases long.

[69] Mapping of the sequences can be achieved by comparing the sequence with the sequence of a reference genome to determine the chromosomal origin of the sequenced nucleic acid (e.g. cell free DNA) molecule, and specific genetic sequence information is not needed. A number of computer algorithms are available for aligning sequences, including without limitation BLAST (Altschul et al, 1990), BLITZ (MPsrch) (Sturrock & Collins, 1993), FASTA (Pearson & Lipman, 1988), BOWTIE (Langmead et al, Genome Biology 10:R25.1-R25.10 [2009]), or ELAND (Illumina, Inc., San Diego, Calif, USA). One end of the clonally expanded copies of the DNA molecule can be sequenced and processed by bioinformatic alignment analysis for the Illumina Genome Analyzer, which uses the Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) software. Additional software includes SAMtools (SAMtools, Bioinformatics, 2009, 25(16):2078-9), and the Burrows-Wheeler block sorting compression procedure, which involves block sorting or preprocessing to make compression more efficient.
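
By way of illustration only, the block sorting step of the Burrows-Wheeler procedure can be sketched as follows. This Python sketch uses the naive approach of sorting every rotation of the input, which is suitable only for short strings and is not how production aligners or compression tools implement the transform; the function names and the `$` terminator convention are assumptions made for the example.

```python
# Illustrative sketch only: a naive Burrows-Wheeler transform and its
# inverse. Real tools use suffix arrays or similar structures instead.


def bwt(text: str, terminator: str = "$") -> str:
    """Return the Burrows-Wheeler transform of text.

    A terminator that sorts before every other character is appended so
    that the transform can be inverted unambiguously.
    """
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    padded = text + terminator
    rotations = sorted(padded[i:] + padded[:i] for i in range(len(padded)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(char + row for char, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("GATTACA")
    print(transformed)
    print(inverse_bwt(transformed))
```

The transform tends to group identical characters together, which is what makes subsequent compression more efficient, and it is fully reversible as the round trip above shows.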

[70] Sequences obtained from multiple samples from a plurality of chromosomes can also be compared to identify sequence variants using the tools described herein. This direct sequence comparison can be done without the use of a reference genome.

Adaptors

[71] An adaptor sequence can comprise a defined oligonucleotide sequence that affects coupling of a library member to a sequencing platform. The adaptor can include a bar code, and sequences complementary to an immobilizing polynucleotide, primers for amplification, sequencing primers, and other primers. An adaptor can include all of these sequences and others, or it can have a subset of these sequences.

[72] By way of example only, the adaptor can comprise a sequence appropriate for a capture probe immobilized onto a solid support (e.g., a sequencing flow cell or bead). The adaptor sequence for capture has sufficient complementarity so that the adaptor anneals to the capture probe under appropriate conditions. An adaptor sequence can also comprise a defined oligonucleotide sequence appropriate for a sequencing primer (e.g., the adaptor sequence has sufficient complementarity or identity so the sequencing primer can anneal under the appropriate conditions). The sequencing primer can enable nucleotide incorporation by a polymerase, wherein incorporation of the nucleotide is monitored to provide sequencing information. The sequencing primer can be about 15-25 bases. The sequencing primer can be conjugated to the 3' end of the adaptor. An adaptor can comprise a sequence that has sufficient complementarity or identity to an oligonucleotide sequence immobilized onto a solid support so that the immobilized probe and the adaptor can anneal under appropriate conditions. Coupling can also be achieved through serially stitching adaptors together. The number of adaptors that can be stitched can be 1, 2, 3, 4 or more. The stitched adaptors can be at least 35 bases, 70 bases, 105 bases, 140 bases or more.

[73] The adaptor can also comprise a barcode sequence. Each fragment in the library can have a unique bar code, or fragments in the library can share a bar code, depending on the use and purpose of the bar code. For example, at least 0.01%, 0.1%, 1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100% of sequencing library members in a library could comprise the same adaptor sequence. The adaptor sequence can be chosen by a user according to the sequencing platform used for sequencing. By way of example only, an Illumina sequencing by synthesis platform comprises a solid support with a first and second population of surface-bound oligonucleotides immobilized thereon. Such oligonucleotides comprise a sequence for hybridizing to a first and second Illumina-specific adaptor oligonucleotide and priming an extension reaction. Accordingly, a DNA library member can comprise a first Illumina-specific adaptor that is partially or wholly complementary to a first population of surface bound oligonucleotides of an Illumina system. By way of other example only, the SOLiD, Ion Torrent, and GS FLEX systems each comprise a solid support in the form of a bead with a single population of surface bound oligonucleotides immobilized thereon. Accordingly, in some embodiments the ssDNA library member comprises an adaptor sequence that is complementary to a surface-bound oligonucleotide of a SOLiD system, Ion Torrent system, or GS Flex system.
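
By way of illustration only, when fragments from different samples share sample-specific bar codes, sequence reads can be sorted by bar code after sequencing. The following Python sketch assumes that the bar code occupies a fixed number of bases at the start of each read and that only exact matches are accepted; the function name, the bar code sequences, and the sample names are hypothetical.

```python
# Illustrative sketch only: exact-match demultiplexing with the bar code
# assumed to sit at the start of each read. All sequences are made up.
from collections import defaultdict
from typing import Dict, Iterable, List


def demultiplex(
    reads: Iterable[str],
    barcode_to_sample: Dict[str, str],
    barcode_length: int,
) -> Dict[str, List[str]]:
    """Group reads by sample and strip the leading bar code from each read."""
    bins: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        barcode = read[:barcode_length]
        insert = read[barcode_length:]
        sample = barcode_to_sample.get(barcode, "unassigned")
        bins[sample].append(insert)
    return dict(bins)


if __name__ == "__main__":
    barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
    example_reads = ["ACGTGATTACA", "TGCACCGGTT", "GGGGAAAA"]
    print(demultiplex(example_reads, barcodes, barcode_length=4))
```

Reads whose leading bases match no known bar code are collected under "unassigned" in this sketch rather than being discarded, which is a design choice made for the example.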

Kits

[74] Kits for performing the methods are also disclosed herein. The kit can comprise a 5'-adaptor, a 3'-adaptor, and a ligase. The ligase can be a DNA ligase or an RNA ligase. The kit can optionally include a DNA repair mix comprising repair enzymes such as glycosylases, polymerases (e.g., proofreading DNA polymerases), nucleases, etc. The kit can optionally include a solid support, e.g., a bead with a capture reagent. The kit can optionally include a kinase and reagents for reacting the nucleic acids with the kinase. The kit can optionally include a polymerase. The polymerase can be a thermostable polymerase having a 5' to 3' exonuclease activity and not having a 3' to 5' exonuclease activity. The kit can include a negative control sample. The kit can also include a positive control sample.

[75] The kits can also include a packaging material. As used herein, the term "packaging material" refers to a physical structure housing the components of the kit. The packaging material can maintain sterility of the kit components, and can be made of material commonly used for such purposes (e.g., paper, corrugated fiber, glass, plastic, foil, ampules, etc.). Kits can also include a buffering agent, a preservative, or a protein/nucleic acid stabilizing agent.

[76] The methods and kits described herein can be used for the sensitive detection of a mutation in a target polynucleotide. In some aspects, the methods and kits of the invention can be used for the discrimination of alleles in a target tissue. For example, the invention provides methods and kits for the detection of mutant alleles in a background of high wild-type allelic ratio. As another example, the methods and kits disclosed herein can be used for the detection of multiple alleles.

[77] Kits can include one or more primer sets. Kits can further comprise instructions for use of the one or more primer sets, e.g., instructions for practicing a method described herein. In some embodiments, the kit includes a packaging material. Kits can also include a buffering agent, a preservative, or a protein/nucleic acid stabilizing agent. Kits can also include other components of a reaction mixture as described herein. For example, kits may include one or more aliquots of thermostable DNA polymerase, and/or one or more aliquots of dNTPs. Kits can also include control samples of known amounts of template DNA molecules harboring the individual alleles of a locus. The kit can include a negative control sample, e.g., a sample that does not contain DNA molecules harboring the individual alleles of a locus. The kit can also include a positive control sample, e.g., a sample containing known amounts of one or more of the individual alleles of a locus.

Uses of Libraries

[78] The libraries and methods disclosed herein can be used to detect new mutations and new alleles, to conduct retrospective studies, to diagnose and monitor diseases and disorders, and to perform research to study, monitor, and/or improve the treatment of subjects suffering from a disease or disorder.

[79] High-throughput sequencing of the libraries described herein can provide the sequence of regions of interest and/or the entire genome. The whole genome sequence information from one subject can identify the alleles and mutations carried by the subject in genes of interest from tissues of interest. This information on alleles and mutations can be used to diagnose the subject's disease and select treatment options with the best predicted outcomes. The whole genome sequence information from a plurality of subjects who share a disease or diagnosis can also be compared to characterize common alleles and mutations in the subjects that correlate with disease, severity of disease, response to various treatment options, morbidity, mortality, etc. This group of subjects can have known outcomes and responses to courses of treatment, and the whole genome sequencing of the subjects' nucleic acids can provide retrospective information on the genetic make-up of the subjects that influenced the outcomes in the subjects.

[80] The disease can be a cancer, e.g., a tumor or a leukemia such as acute leukemia, acute T-cell leukemia, acute lymphocytic leukemia, acute myelocytic leukemia, myeloblastic leukemia, promyelocytic leukemia, myelomonocytic leukemia, monocytic leukemia, erythroleukemia, chronic leukemia, chronic myelocytic (granulocytic) leukemia, or chronic lymphocytic leukemia, polycythemia vera, lymphomas such as Hodgkin's lymphoma, follicular lymphoma or non-Hodgkin's lymphoma, multiple myeloma, Waldenstrom's macroglobulinemia, heavy chain disease, solid tumors, sarcomas, carcinomas such as, e.g., fibrosarcoma, myxosarcoma, liposarcoma, chondrosarcoma, osteogenic sarcoma, lymphangiosarcoma, mesothelioma, Ewing's tumor, leiomyosarcoma, rhabdomyosarcoma, colon carcinoma, colorectal cancer, pancreatic cancer, breast cancer, ovarian cancer, prostate cancer, squamous cell carcinoma, basal cell carcinoma, adenocarcinoma, sweat gland carcinoma, sebaceous gland carcinoma, papillary carcinoma, papillary adenocarcinomas, cystadenocarcinoma, medullary carcinoma, bronchogenic carcinoma, renal cell carcinoma, hepatoma, bile duct carcinoma, choriocarcinoma, seminoma, embryonal carcinoma, Wilms' tumor, cervical cancer, uterine cancer, testicular tumor, lung carcinoma, small cell lung carcinoma, bladder carcinoma, epithelial carcinoma, glioma, craniopharyngioma, ependymoma, pinealoma, hemangioblastoma, acoustic neuroma, oligodendroglioma, meningioma, melanoma, neuroblastoma, retinoblastoma, endometrial cancer, or non-small cell lung cancer.

[81] The disease or disorder can be any that afflicts a subject including, for example, infectious diseases, hereditary diseases, autoimmune diseases, inflammatory syndromes or diseases, coronary artery diseases, cerebrovascular diseases, cognition diseases and disorders (e.g., Alzheimer's, dementia, Parkinson's, etc.), other disorders and diseases of the brain and central nervous system, substance abuse, etc.

[82] The nucleic acids sequenced can include a region of a gene associated with a disease. The nucleic acids can be obtained from tissue samples from subjects who have a disease (e.g., FFPE samples), or can be obtained from cell lines and organoids. The genome sequenced can include druggable targets. As used herein, the term "druggable target" means a gene or cellular pathway that can be modulated by a disease therapy. The disease can be cancer. Accordingly, the genome sequenced can contain known cancer-related genes. Cancer-related genes can include, for example, ABCA1, BRAF, CHD5, EP300, FLT1, ITPA, MYC, PIK3R1, SKP2, TP53, ABCA7, BRCA1, CHEK1, EPHA3, FLT3, JAK1, MYCL1, PIK3R2, SLC19A1, TP73, ABCB1, BRCA2, CHEK2, EPHA5, FLT4, JAK2, MYCN, PKHD1, SLC1A6, TPM3, ABCC2, BRIP1, CLTC, EPHA6, FN1, JAK3, MYH2, PLCB1, SLC22A2, TPMT, ABCC3, BUB1B, COL1A1, EPHA7, FOS, JUN, MYH9, PLCG1, SLCO1B3, TPO, ABCC4, C1orf144, COPS5, EPHA8, FOXO1, KBTBD11, NAV3, PLCG2, SMAD2, TPR, ABCG2, CABLES1, CREB1, EPHB1, FOXO3, KDM6A, NBN, PML, SMAD3, TRIO, ABL1, CACNA2D1, CREBBP, EPHB4, FOXP4, KDR, NCOA2, PMS2, SMAD4, TRRAP, ABL2, CAMKV, CRKL, EPHB6, GAB1, KIT, NEK11, PPARG, SMARCA4, TSC1, ACVR1B, CARD11, CRLF2, EPO, GATA1, KLF6, NF1, PPARGC1A, SMARCB1, TSC2, ACVR2A, CARM1, CSF1R, ERBB2, GLI1, KLHDC4, NF2, PPP1R3A, SMO, TTK, ADCY9, CAV1, CSMD3, ERBB3, GLI3, KRAS, NKX2-1, PPP2R1A, SOCS1, TYK2, AGAP2, CBFA2T3, CSNK1G2, ERBB4, GNA11, LMO2, NOS2, PPP2R1B, SOD2, TYMS, AKT1, CBL, CTNNA1, ERCC1, GNAQ, LRP1B, NOS3, PRKAA2, SOS1, UGT1A1, AKT2, CCND1, CTNNA2, ERCC2, GNAS, LRP2, NOTCH1, PRKCA, SOX10, UMPS, AKT3, CCND2, CTNNB1, ERCC3, GPR124, LRP6, NOTCH2, PRKCZ, SOX2, USP9X, ALK, CCND3, CYFIP1, ERCC4, GPR133, LTK, NOTCH3, PRKDC, SP1, VEGF, ANAPC5, CCNE1, CYLD, ERCC5, GRB2, MAN1B1, NPM1, PTCH1, SPRY2, VEGFA, APC, CD40LG, CYP19A1, ERCC6, GSK3B, MAP2K1, NQO1, PTCH2, SRC, VHL, APC2, CD44, CYP1B1, ERG, GSTP1, MAP2K2, NR3C1, PTEN, ST6GAL2, WRN, AR, CD79A, CYP2C19, ERN2, GUCY1A2, MAP2K4, NRAS, PTGS2, STAT1, WT1, ARAF, CD79B, CYP2C8, ESR1, HDAC1, MAP2K7, NRP2, PTPN11, STAT3, XPA, ARFRP1, CDC42, CYP2D6, ESR2, HDAC2, MAP3K1, NTRK1, PTPRB, STK11, XPC, ARID1A, CDC42BPB, CYP3A4, ETV4, HGF, MAPK1, NTRK2, PTPRD, SUFU, ZFY, ATM, CDC73, CYP3A5, EWSR1, HIF1A, MAPK3, NTRK3, RAD50, SULT1A1, ZNF521, ATP5A1, CDH1, DACH2, EXT1, HM13, MAPK8, OMA1, RAD51, SUZ12, ATR, CDH10, DCC, EZH2, HMGA1, MARK3, OR10R2, RAF1, TAF1, AURKA, CDH2, DCLK3, FANCA, HNF1A, MCL1, PAK3, RARA, TBX22, AURKB, CDH20, DDB2, FANCD2, HOXA3, MDM2, PARP1, RB1, TCF12, BAI3, CDH5, DDR2, FANCE, HOXA9, MDM4, PAX5, REM1, TCF3, BAP1, CDK2, DGKB, FANCF, HRAS, MECOM, PCDH15, RET, TCF4, BARD1, CDK4, DGKZ, FAS, HSP90AA1, MEN1, PCDH18, RICTOR, TEK, BAX, CDK6, DIRAS3, FBXW7, IDH1, MET, PCNA, RIPK1, TEP1, BCL11A, CDK7, DLG3, FCGR3A, IDH2, MITF, PDGFA, ROR1, TERT, BCL2, CDK8, DLL1, FES, IFNG, MLH1, PDGFB, ROR2, TET2, BCL2A1, CDKN1A, DNMT1, FGFR1, IGF1R, MLL, PDGFRA, ROS1, TGFBR2, BCL2L1, CDKN1B, DNMT3A, FGFR2, IGF2R, MLL3, PDGFRB, RPS6KA2, THBS1, BCL2L2, CDKN2A, DNMT3B, FGFR3, IKBKE, MPL, PDZRN3, RPTOR, TNFAIP3, BCL3, CDKN2B, DOT1L, FGFR4, IKZF1, MRE11A, PHLPP2, RSPO2, TNKS, BCL6, CDKN2C, DPYD, FH, IL2RG, MSH2, PIK3C3, RSPO3, TNKS2, BCR, CDKN2D, E2F1, FHOD3, INHBA, MSH6, PIK3CA, RUNX1, TNNI3K, BIRC5, CDX2, EED, FIGF, INSR, MTHFR, PIK3CB, SDHB, TNR, BIRC6, CEBPA, EGF, FLG2, IRS1, MTOR, PIK3CD, SF3B1, TOP1, BLM, CERK, EGFR, FLNC, IRS2, MUTYH, PIK3CG, SHC1, and TOP2A.

[83] The inventions disclosed herein will be better understood from the experimental details which follow. However, one skilled in the art will readily appreciate that the specific methods and results discussed are merely illustrative of the inventions as described more fully in the claims which follow thereafter. Unless otherwise indicated, the disclosure is not limited to specific procedures, materials, or the like, as such may vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.

EXAMPLES

Example 1: Making a Library from an FFPE Biological Sample

[84] A library was made from the Ashkenazim PGP Son reference standard in an FFPE format as sold by Horizon Discovery (Horizon Catalog ID GM24385). The Ashkenazim PGP Son reference standard is a reference genome material selected by the Genome in a Bottle Consortium and developed by the National Institute of Standards and Technology (NIST). DNA was isolated from the FFPE sample using the ReliaPrep FFPE gDNA miniprep kit from Promega. The DNA was fragmented to a size of about 550 base pairs using a Covaris sonicator. Damaged nucleotides in the DNA fragments were removed using the Repair mix in the TOMA Biosciences DNA Repair kit. After DNA repair, the DNA fragments were phosphorylated on the 5' ends and dephosphorylated on the 3' ends using the kinase mix from the TOMA Biosciences DNA Repair kit. The DNA fragments were then isolated from the reaction mix and denatured to make single-stranded DNA (ssDNA) fragments. The ssDNA fragments were ligated to 5'-adaptors using the TOMA Biosciences Adaptor Set, ligase, and the activation mix and AD buffer from the TOMA Biosciences Library Preparation Reagents. (See the TOMA OS-Seq Tumor Profiling System: Library Preparation Module, 2017). After this ligation, the 5'-adaptor-ssDNA fragments were isolated. These 5'-adaptor-ssDNA fragments were reacted with a 3'-adaptor (which has a sequence complementary to the appropriate Illumina sequencing primers and a sequence suitable for flow cell binding), ligase, and the activation mix and AD buffer from the TOMA Biosciences Library Preparation Reagents. After this ligation step, the 5'-adaptor ssDNA fragment 3'-adaptors were isolated.

[85] The library of 5'-adaptor ssDNA fragment 3'-adaptors made from the FFPE Ashkenazim PGP Son reference standard was sequenced using the Illumina HiSeq 2500 (2×250 paired-end sequencing). The resulting sequencing provided a median coverage depth of about 35 fold for the genome, and about 80% of the genome had a coverage depth of at least 20 fold. The sequence information obtained from the FFPE materials had a precision of about 99.5% (0.5% false positives) and a sensitivity of about 99.6% (0.4% false negatives).
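Coverage figures of the kind reported above (a median depth, and the fraction of the genome covered at 20 fold or more) can be computed from per-position read depths. The following is a minimal sketch on a toy list of depths; the function name and the input values are illustrative and are not data from this example. In practice, per-position depths would be taken from the output of an alignment or depth-reporting tool.

```python
from statistics import median


def coverage_summary(depths, min_depth=20):
    """Return (median depth, fraction of positions with depth >= min_depth)."""
    if not depths:
        raise ValueError("depths must be non-empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return median(depths), covered / len(depths)


# Toy per-position depths, for illustration only.
toy_depths = [10, 25, 35, 40, 36]
median_depth, fraction_at_20x = coverage_summary(toy_depths)
```

With these toy values the median depth is 35 and four of the five positions reach a depth of 20, giving a fraction of 0.8.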

[86] The results obtained with the FFPE reference materials were compared to sequencing results obtained from the NIST RM HG002 Ashkenazim PGP son reference material (a cell line). DNA was prepared from the cell line and then sequenced on an Illumina HiSeq 2500. This data set is found at ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/NHGRI_Illumina300X_novoalign_bams/. Data were selected from this set to provide a coverage depth of about 37 fold, with 80% of the genome having a coverage depth of about 27 fold. This sequence information had a precision of about 99.9% (0.1% false positives) and a sensitivity of about 90.1% (9.9% false negatives).

[87] Thus, the sequencing data obtained from the FFPE materials using the methods described herein were similar in quality to those obtained from sequencing DNA obtained from the cell line.
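For reference, precision and sensitivity are conventionally computed from counts of true-positive (TP), false-positive (FP), and false-negative (FN) variant calls relative to a truth set. The sketch below uses these standard definitions; the function names are illustrative and the counts are made-up numbers, not results from this example.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of called variants that are in the truth set: TP / (TP + FP)."""
    return tp / (tp + fp)


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of truth-set variants that were called: TP / (TP + FN)."""
    return tp / (tp + fn)


# Made-up counts, for illustration only.
tp, fp, fn = 995, 5, 4
print(f"precision   = {precision(tp, fp):.3f}")
print(f"sensitivity = {sensitivity(tp, fn):.3f}")
```

Under these definitions, one minus the precision is the share of calls that are false positives, and one minus the sensitivity is the share of true variants that were missed.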

[88] All publications and patents cited in this specification are herein incorporated by reference as if each individual publication or patent were specifically and individually indicated to be incorporated by reference and are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited.

[89] Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the following claims.




 