Title:
METHOD FOR USE IN POLYNUCLEOTIDE SEQUENCING
Document Type and Number:
WIPO Patent Application WO/2010/064040
Kind Code:
A1
Abstract:
A method for sequencing of polynucleic acids, the method comprising the steps of (i) ligating a library of polynucleic acid fragments to adapters which facilitate hybridisation of the library fragment to a solid support to provide a surface bound polynucleic acid; (ii) amplification of the surface bound polynucleic acid fragment by multiple cycles of annealing, extension and denaturation ("cluster amplification"); and (iii) sequencing the amplified polynucleic acids, wherein the polynucleic acid fragments ligated to adapters are not amplified prior to binding to the solid support.

Inventors:
KOZAREWA, Iwanka (Genome Research Limited, Gibbs Building, 215 Euston Road, London NW1 2BE, GB)
TURNER, Daniel John (Genome Research Limited, Gibbs Building, 215 Euston Road, London NW1 2BE, GB)
Application Number:
GB2009/051635
Publication Date:
June 10, 2010
Filing Date:
December 02, 2009
Assignee:
GENOME RESEARCH LIMITED (Gibbs Building, 215 Euston Road, London NW1 2BE, GB)
KOZAREWA, Iwanka (Genome Research Limited, Gibbs Building, 215 Euston Road, London NW1 2BE, GB)
TURNER, Daniel John (Genome Research Limited, Gibbs Building, 215 Euston Road, London NW1 2BE, GB)
International Classes:
C12Q1/68
Foreign References:
US20050100900A1 2005-05-12
Other References:
MARGULIES MARCEL ET AL: "Genome sequencing in microfabricated high-density picolitre reactors.", NATURE 15 SEP 2005, vol. 437, no. 7057, 15 September 2005 (2005-09-15), pages 376 - 380, XP002572507, ISSN: 1476-4687
HIMMELREICH R ET AL: "Comparative analysis of the genomes of the bacteria Mycoplasma pneumoniae and Mycoplasma genitalium.", NUCLEIC ACIDS RESEARCH 15 FEB 1997, vol. 25, no. 4, 15 February 1997 (1997-02-15), pages 701 - 712, XP002572503, ISSN: 0305-1048
QUAIL MICHAEL A ET AL: "A large genome center's improvements to the Illumina sequencing system.", NATURE METHODS DEC 2008, vol. 5, no. 12, 25 November 2008 (2008-11-25), pages 1005 - 1010, XP002572504, ISSN: 1548-7105
BENTLEY DAVID R ET AL: "Accurate whole human genome sequencing using reversible terminator chemistry.", NATURE 6 NOV 2008, vol. 456, no. 7218, 6 November 2008 (2008-11-06), pages 53 - 59, XP002572505, ISSN: 1476-4687
MARDIS ELAINE R: "Next-generation DNA sequencing methods", ANNUAL REVIEW OF GENOMICS AND HUMAN GENETICS, ANNUAL REVIEWS, US, vol. 9, 1 January 2008 (2008-01-01), pages 387 - 402, XP002512993, ISSN: 1527-8204, [retrieved on 20080624]
SHENDURE JAY ET AL: "Next-generation DNA sequencing.", NATURE BIOTECHNOLOGY OCT 2008, vol. 26, no. 10, October 2008 (2008-10-01), pages 1135 - 1145, XP002572506, ISSN: 1546-1696
Attorney, Agent or Firm:
STEPHEN, Robert et al. (90 High Holborn, London, Greater London WC1V 6XX, GB)
Claims:

1 A method for sequencing of polynucleic acids, the method comprising the steps of:

(i) ligating a library of polynucleic acid fragments to adapters which facilitate hybridisation of the library fragment to a solid support to provide a surface bound polynucleic acid; (ii) amplification of the surface bound polynucleic acid fragment by multiple cycles of annealing, extension and denaturation ("cluster amplification"); and (iii) sequencing the amplified polynucleic acids, wherein the polynucleic acid fragments ligated to adapters are not amplified prior to binding to the solid support.

2 A method according to claim 1 wherein the adapters comprise both a region suitable for sequencing primer annealing and a region suitable for attachment to a solid support.

3 A method according to claim 1 or 2 wherein the genome has a neutral GC content.

4 A method according to claim 1 or 2 wherein the library of polynucleic acid fragments is derived from a genome that has an uneven nucleotide composition.

5 A method according to claim 4 wherein the genome has a mean AT content of > 60% or a mean GC content of > 60%.

6 A method according to any preceding claim wherein 2 adapters are used which are partially non complementary.

7 A method according to any preceding claim wherein the product of step (i) comprises polynucleic acid fragments having a different adapter sequence at each end.

8 A method according to any preceding claim wherein the library generated in step (i) is quantified by qPCR.

Description:
Method for use in polynucleotide sequencing

The present invention relates to methods for DNA amplification and sequencing.

Background

Sequencing genomes with extremely uneven nucleotide compositions poses great technical challenges to all of the currently available sequencing platforms. The best documented examples of this are our attempts to sequence the highly GC-poor genomes of Plasmodium species, which is difficult even for the traditional BAC to BAC Sanger method [1-4]. The genomes of several malaria species, including Plasmodium falciparum, have an unusually high percentage of adenine and thymine nucleotides: in exons, the mean AT content is > 75%, and in intergenic and intronic regions, this content can be close to 100% [5,6].

When loaded at a reasonably high density, a single lane of an Illumina Genome Analyzer (GA) flowcell [7] can currently yield > 400 x 10^6 bases of purity filtered (PF) sequence data in a 7 day paired end run. This would represent >18 x coverage of the genome of the 23Mb reference P. falciparum clone, 3D7 [6]. To obtain the same amount of data on a 96-capillary Sanger sequencer would take several months. But to make the most of the sequencing capacity of a GA, it is essential to obtain as broad a representation of the genome as possible, and to avoid a high number of duplicate sequences. If this is not achieved, it becomes necessary to perform additional sequencing runs, in order to get the desired depth of coverage, which is costly and time-consuming.

The Illumina library preparation pipeline is a multi-step process: adapters are ligated onto fragmented, end-repaired, A-tailed sample DNA, via a 3' T-overhang. The structure of the adapters ensures that whenever they ligate to both ends of a template strand, each strand receives a unique adapter sequence at either end. Following ligation, to generate sufficient quantities of adapter-ligated DNA to allow accurate quantification, to enrich for successfully ligated fragments and to allow the sequencing reaction to take place, the Illumina library preparation pipeline exploits the polymerase chain reaction (PCR) [7]. For the last 20 years, PCR has been used ubiquitously to amplify specific sections of DNA exponentially [8], but it is an inherently biased procedure [9-12].

To help overcome these amplification biases, and also to reduce the formation of primer dimers, the Illumina library prep uses carefully designed universal PCR primers, which allow for more optimal, simultaneous amplification of all loci, and facilitate amplification of complex template pools [7]. There is a narrow range of template concentrations in the PCR that will give clean libraries with adequate representation: too high a template concentration often generates an unexpected peak of apparently higher molecular weight; too low a mass of template DNA in the PCR causes an increased incidence of PCR duplicates in the resulting sequences.

Even when performed under optimal conditions, however, the PCR step is still sensitive to biases, especially when the template to be amplified has a particularly high AT content.

The present invention addresses the issue of optimising sequencing techniques.

Statement of invention

The present invention relates to a method for sequencing of polynucleic acids, the method comprising the steps of

1 ligating a library of polynucleic acid fragments to adapters which facilitate hybridisation of the library fragment to a solid support to provide a surface bound polynucleic acid;

2 amplification of the surface bound polynucleic acid fragment by multiple cycles of annealing, extension and denaturation; and

3 sequencing the amplified polynucleic acids, wherein the polynucleic acid fragments ligated to adapters are not amplified prior to binding to the solid support.

Figures

Figure 1 shows the distribution of genome sequence coverage. (a) The distribution of sequence coverage across the unmasked genome is shown for various datasets with or without the PCR step. (b) Accumulated portion of unmasked genome at different depths of coverage.

Figure 2 shows 'No-PCR library preparation'.

Figure 3 shows GC content and depth of coverage.

Figure 4 shows frequencies of duplicate sequences.

Detailed description

The malaria sequencing programme at the Wellcome Trust Sanger Institute aims to sequence hundreds of cell lines, including clinical isolates collected from the wild. As a pilot study, we started with a few sequencing runs of Plasmodium falciparum 3D7, the same strain as the reference genome [6], on the Illumina Genome Analyzer, with the intention of correcting base errors in the reference. This was followed by several more sequencing runs for a variety of malaria strains. Sequencing using the standard Illumina pipeline was successful in throughput, but the quality of read mapping against the reference was very poor. The short reads were mapped to the reference using the modified SSAHA program [13] (see methods). As shown in Figure 1(a), data from the three previously sequenced 3D7 runs and one run of clinical isolates all failed to show a typical Poisson distribution with a peak around the average read depth. The situation is further illustrated in Figure 1(b), where accumulated fractions of unmasked genome are plotted against depth of base coverage. It can be seen, for example, that only 30% of bases are covered by the mapped reads 10 times or more for STD-245, for which the raw data should cover the genome 96 times on average (Supplementary Table 1). This causes serious difficulties with variation detection, such as SNPs and short indels, in addition to increasing the sequencing cost, as only a portion of the reads are useful.
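For orientation, the "typical Poisson distribution" mentioned above can be made concrete. The following is a minimal sketch, assuming an idealised model in which reads land uniformly at random so that per-base depth is Poisson distributed with mean equal to the average read depth. The function name is illustrative and is not part of the described pipeline.

```python
import math


def poisson_fraction_at_least(min_depth: int, mean_depth: float) -> float:
    """Fraction of bases expected at >= min_depth if depth ~ Poisson(mean_depth)."""
    below = 0.0
    term = math.exp(-mean_depth)  # P(depth = 0)
    for k in range(min_depth):
        below += term                 # accumulate P(depth = k)
        term *= mean_depth / (k + 1)  # P(depth = k + 1) from P(depth = k)
    return 1.0 - below


if __name__ == "__main__":
    # 96 is the average raw coverage quoted above for STD-245.
    print(poisson_fraction_at_least(10, 96.0))
```

Under this idealised model essentially every base would reach 10-fold depth at a mean depth of 96, which is the benchmark against which the roughly 30% observed for STD-245 can be judged.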

Here we report an alternative method of Illumina library preparation that omits the PCR step entirely. For the extremely GC-poor malaria genomes, datasets obtained from these libraries not only improve SNP detection significantly, but also facilitate de novo assemblies using newly developed short read assemblers. We also illustrate the wider applicability of this approach by applying it to the GC-neutral Escherichia coli and GC-rich Bordetella pertussis genomes (Supplementary Table 1).

The Illumina library preparation generally introduces tails onto library DNA in a 2-step process. Firstly, adapters, essentially consisting of the sequencing primer annealing sequences, are ligated on. Then an additional section is added, by PCR, which facilitates hybridisation of library fragments to the flowcell. But, even though the number of cycles of PCR amplification is kept low (10 cycles) [7], it is a source of duplicate sequences and amplification bias, and struggles with base compositions that lie at the extremes of low or high GC content [14].

The Illumina process is described in detail in reference [7] herein, incorporated fully by reference. In summary the process comprises the following steps:

A Preparation of samples

A library of polynucleic acid (such as DNA) fragments are generated, for example, by random shearing and joined to a pair of oligonucleotides in a forked adaptor configuration.

The ligated products are amplified using two oligonucleotide primers, resulting in double-stranded blunt-ended material with a different adaptor sequence on either end.

B Formation of clonal single-molecule array.

Polynucleic acid fragments prepared as in 'A' are denatured and single strands are annealed to complementary oligonucleotides on the flowcell surface. A new strand is copied from the original strand in an extension reaction that is primed from the 3' end of the surface-bound oligonucleotide; the original strand is then removed by denaturation. The adapter sequence at the 3' end of each copied strand is annealed to a new surface-bound complementary oligonucleotide, forming a bridge and generating a new site for synthesis of a second strand.

Multiple cycles of annealing, extension and denaturation in isothermal conditions result in growth of clusters, each ~1 μm in physical diameter.

This follows the basic method outlined in Fedurco, M., Romieu, A., Williams, S., Lawrence, I. & Turcatti, G. BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. Nucleic Acids Res. 34, e22 (2006).

C Sequencing

The polynucleic acid in each cluster is linearized by cleavage within one adapter sequence and denatured, generating single-stranded template for sequencing by synthesis to obtain a sequence read (read 1). To perform paired-read sequencing, the products of read 1 are removed by denaturation, the template is used to generate a bridge, the second strand is re-synthesized and the opposite strand is then cleaved to provide the template for the second read (read 2).

In contrast to the above protocol the present invention omits the step of PCR amplification and is thus able to improve on known methods.

The present invention relates to a method for sequencing of polynucleic acids, the method comprising the steps of:

1 ligating a library of polynucleic acid fragments to adapters which facilitate hybridisation of the library fragment to a solid support to provide a surface bound polynucleic acid;

2 amplification of the surface bound polynucleic acid fragment by multiple cycles of annealing, extension and denaturation ("cluster amplification"); and

3 sequencing the amplified polynucleic acids wherein the polynucleic acid fragments ligated to adapters are not amplified prior to binding to the solid support.

In a further aspect the invention relates to a method for preparation of a DNA library, the method comprising the step of ligating a library of polynucleic acid fragments to adapters which facilitate hybridisation of the library fragment to a solid support to provide a surface bound polynucleic acid, wherein the adapters comprise both a region suitable for sequencing primer annealing and a region suitable for attachment to a solid support.

In one aspect the present invention uses a solid support amplification step (cluster amplification) to select for fully ligated template strands, rather than a PCR step. In one aspect there is no PCR amplification of the library of polynucleic acid fragments before or after annealing of the adaptor.

In one aspect the method comprises ligating a library of polynucleic acid fragments from a genome that has an uneven nucleotide composition, for example having a mean AT content of > 60%, >70%, > 80%, or higher, or a genome having a mean GC content of > 60%, >70%, > 80%, or higher.

In one aspect the polynucleic acid is DNA.

In one aspect the solid support is glass or a plastics material, and may be a flat surface or curved surface, such as a bead.

In one aspect the adapter comprises a region suitable for sequencing primer annealing and a region suitable for attachment to a solid support. In one aspect these regions are separate, such that the target for sequencing may be accessible for productive sequencing even when the adapter is bound to a solid support.

In one aspect one of the adapters is equivalent in design (and may be identical) to one of the 2 "Illumina" PCR primers used to amplify the ligated products of the prior art method outlined above, whereas the other adapter is the reverse complement of the other Illumina PCR primer. Prior art methods are for example as disclosed in the Illumina methodology used in reference [7] and as generally disclosed at http://www.illumina.com.

In one aspect the library polynucleic acid fragments are ligated to appropriate adapters after end repair and A-tailing, suitably using standard Illumina protocols (see http://www.illumina.com).

In one aspect the adapters are partially non-complementary, thus ensuring that a different adapter sequence is added to either end of the template strands.

In one aspect the adapters consist of a pair of oligonucleotides that have a region of nucleotides that are complementary, the remainder being non-complementary. The oligonucleotides may be 40 or more nucleotides in length, such as 50, 60, 70 or 80 nucleotides or more. Oligonucleotides may have a complementary region of 10 or more nucleotides, such as 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more nucleotides.

In one aspect the pair of oligonucleotides are mixed together in substantially equimolar or equimolar quantities, in a suitable buffer, and suitably are phosphorylated at their 5' ends using, for example, an enzyme such as T4 polynucleotide kinase. The oligonucleotides are then hybridized to one another. In one method the phosphorylation reaction is allowed to proceed for a suitable length of time, then the temperature is raised so as to denature the kinase enzyme, and to disrupt any secondary structure within the oligonucleotides. The temperature is then lowered to room temperature slowly, so that the complementary sections of the two oligonucleotides can anneal together, forming a 'Y-shaped' structure.

In one aspect the complementary section of one strand of the adapter has an additional T nucleotide at its 3' end, so that after annealing, a T-overhang is formed, which allows more efficient ligation to a suitably prepared target DNA. This T-nucleotide is attached to the rest of that oligonucleotide strand via a phosphorothioate linkage, which confers resistance against exonuclease digestion. This prevents removal of this T-nucleotide, and thus prevents blunt ended self-ligation of adapter molecules.

One strand of the adapter has a nucleotide region at its 5' end, which facilitates hybridisation to oligonucleotides on a solid surface. The remaining nucleotides of this strand consist of a region to which a sequencing primer can hybridise or which otherwise facilitates sequencing primer hybridisation.

The other strand has a nucleotide region at its 3' end, which facilitates hybridisation to oligonucleotides on a solid surface. The remaining nucleotides of this strand consist of a region to which a sequencing primer can hybridise.

Once annealed, the adapter molecule can be ligated to any double stranded DNA template that has been prepared in such a way as to have a single protruding A-nucleotide at the 3' termini of both strands.

Once ligated, adapter - template complexes are suitably run in an agarose gel, for the purpose of size selection. This allows a gel slice to be taken, representing a mixture of ligated template molecules, all within a particular size range. It also allows removal of adapter dimers - i.e. those that have ligated to one another. DNA is extracted from the gel and is quantified by qPCR and then sequenced.

The invention also relates to mixtures of oligonucleotides as described above and to kits comprising such pairs of oligonucleotides, separately or in the form of ligated adapter molecules.

The use of forked adapters is described in US 2007/0172839 A1 (2007), the disclosure of which is incorporated by reference.

In one aspect a portion of the library is quantified using qPCR after ligation of the adaptors. In one aspect the qPCR is used to quantify only those strands that have an adapter at either end. In one aspect the amplification of the surface bound polynucleic acid fragment by multiple cycles of annealing, extension and denaturation is bridge amplification, wherein during the annealing step of the amplification cycle the extension product from one bound primer forms a bridge to the other bound primer. In one aspect the amplification step uses well known conditions for annealing, extension and denaturation.

In one aspect the sequence information obtained in step 3 is used for genomic DNA analysis.

In one aspect the sequencing in step 3 is carried out using the process of reversible terminator chemistry as disclosed in reference [7].

In one aspect the method is as described in reference [7] herein, with the exception of the PCR amplification step prior to attachment of the polynucleic acids to the solid support. The method thus comprises the following steps:

1 DNA fragments are joined to a pair of adapter oligonucleotides, suitably in a forked adaptor configuration wherein one end of the adaptor strand is not complementary to the other end of the other adaptor strand.

2 DNA fragments prepared as in '1' are denatured and single strands are annealed to complementary oligonucleotides on a solid support, suitably a flowcell surface. A new strand is copied from the original strand in an extension reaction that is primed from the 3' end of the surface-bound oligonucleotide; the original strand is then removed by denaturation. The adapter sequence at the 3' end of each copied strand is annealed to a new surface-bound complementary oligonucleotide, forming a bridge and generating a new site for synthesis of a second strand.

Multiple cycles of annealing, extension and denaturation in isothermal conditions result in growth of clusters.

3 The DNA in each cluster is linearized by cleavage within one adapter sequence and denatured, generating single-stranded template for sequencing by synthesis to obtain a sequence read (read 1). To perform paired-read sequencing (optional step), the products of read 1 are removed by denaturation, the template is used to generate a bridge, the second strand is re-synthesized and the opposite strand is then cleaved to provide the template for the second read (read 2).

Methods

Incompletely ligated fragments will be inert in the cluster amplification step, and so it is not necessary to retain the PCR step to enrich for properly ligated fragments, provided that only those fragments with an adapter at either end can be quantified. This can be achieved by quantitative PCR (see below) using primers that target the adapter regions, and so the library prep PCR step can be removed entirely, by ligating on appropriate adapters after A-tailing (Figure 2).

These adapters are partially non-complementary, thus ensuring that a different adapter sequence is added to either end of the template strands [15]. The 3' T overhang is modified with a phosphorothioate linkage to protect it from digestion by any contaminating exonuclease activity in the ligase preparation. The method yields ample library DNA for most purposes: > 400 high density lanes (40,000 clusters / tile GAI, 160,000 clusters / tile GAII) of adapter ligated malaria DNA from 5μg genomic DNA, as measured by qPCR, and considerably more from genomes with a more balanced nucleotide composition.

Adapter preparation

We obtained two HPLC-purified oligonucleotides (Sigma): A_adapter_t (AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC*T, * indicates phosphorothioate) and A_adapter_b (GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG). A_adapter_t contained a phosphorothioate modification to resist exonuclease digestion at the T-overhang. 40μM oligos were phosphorylated at the 5' end by 1 unit / μl T4 polynucleotide kinase in 1x T4 ligase buffer (both New England Biolabs) for 30 minutes at 37 °C in a thermocycler (MJ Research). We then denatured the kinase by heating, and annealed the oligos by reducing the temperature to 20 °C by 0.1 °C every 2 seconds. Annealing more slowly, e.g. by 0.1 °C every 5 seconds, can also be used. We divided adapter oligos into single-use aliquots and stored them at -20 °C.
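The partially complementary design of the two oligonucleotides can be checked directly from their sequences. The following is a minimal sketch, assuming that the spaces inside the printed sequences are line-break artefacts and that the final T of A_adapter_t is the single-base 3' overhang; the function names are illustrative only. It measures the duplex formed between the 3' end of A_adapter_t and the 5' end of A_adapter_b.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Sequences as given above, with internal spaces removed (assumed artefacts).
A_ADAPTER_T = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
A_ADAPTER_B = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG"


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def stem_length(top: str, bottom: str) -> int:
    """Length of the duplex between the 3' end of `top` (ignoring its
    single-base 3' overhang) and the 5' end of `bottom`."""
    top_core = top[:-1]  # drop the 3' T overhang
    paired = 0
    for a, b in zip(reverse_complement(top_core), bottom):
        if a != b:
            break
        paired += 1
    return paired


if __name__ == "__main__":
    print(stem_length(A_ADAPTER_T, A_ADAPTER_B))
```

For the two oligonucleotides listed, this gives a 13-nucleotide complementary stem, with the remaining bases of each strand left unpaired to form the arms of the fork.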

DNA preparation and adapter ligation

Genomic DNA was quantified by NanoDrop. We fragmented 4.5μg DNA to approximately 200bp using Covaris Adaptive Focused Acoustics technology, using the settings: 5% Duty Cycle; Intensity 10; 200 Cycles per burst over the course of 12 minutes (using Covaris product number 520031, 300μl, 6 x 32mm round bottom glass tube, with no fibre, with crimp-cap system). It is also possible to use Covaris product number 520052, 100μl, 6 x 16mm round bottom glass tube and AFA fibre with crimp-cap system, over a course of 90 seconds. After end repair and A-tailing, following the standard Illumina protocols, we set up ligation reactions in a total volume of 50μl containing 10μl end-repaired and A-tailed DNA, 8μM adapters, 1x Illumina DNA ligation buffer, 5μl Illumina DNA ligase, and incubated reactions for 15 minutes at room temperature.

qPCR quantification

We obtained two desalted oligonucleotide primers (cq_v2.1 AATGATACGGCGACCACCGAGATC and PEq_v2.2 CAAGCAGAAGACGGCATACGAGATC) and an HPLC-purified dual labeled probe (DLP [6FAM]CCCTACACGACGCTCTTCCGATCT[TAMRA]), all from Sigma. We diluted ligated libraries 200-fold, using 10mM Tris pH8.5 with 0.1% Tween20 and low bind tubes (Eppendorf).
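The reason that only fully ligated fragments are quantified can be seen from where the primers and probe sit on the adapters. The following is a minimal sketch, assuming that each strand of a fully ligated fragment reads 5'-A_adapter_t, insert, A_adapter_b-3' and that spaces in the printed adapter sequences are line-break artefacts. The helper names and the example insert are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

A_ADAPTER_T = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
A_ADAPTER_B = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG"
CQ_V2_1 = "AATGATACGGCGACCACCGAGATC"    # qPCR primer cq_v2.1
PEQ_V2_2 = "CAAGCAGAAGACGGCATACGAGATC"  # qPCR primer PEq_v2.2
DLP = "CCCTACACGACGCTCTTCCGATCT"        # dual labeled probe sequence


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def ligated_strand(insert: str) -> str:
    """One strand of a fully ligated fragment, 5'->3' (assumed layout)."""
    return A_ADAPTER_T + insert + A_ADAPTER_B


def is_quantifiable(strand: str) -> bool:
    """True if the strand carries both primer sites, one at each end."""
    has_forward_site = strand.startswith(CQ_V2_1)
    has_reverse_site = strand.endswith(reverse_complement(PEQ_V2_2))
    return has_forward_site and has_reverse_site


if __name__ == "__main__":
    print(is_quantifiable(ligated_strand("ACGTTGCAACGTA")))  # both adapters
    print(is_quantifiable(A_ADAPTER_T + "ACGTTGCAACGTA"))    # one adapter only
    print(DLP in A_ADAPTER_T)                                 # probe site
```

In this sketch cq_v2.1 matches the 5' end of A_adapter_t, PEq_v2.2 is the reverse complement of the 3' end of A_adapter_b, and the probe sequence lies within A_adapter_t, so a strand lacking either adapter has only one primer site and cannot be amplified exponentially.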

We performed qPCR reactions in a total volume of 25μl, containing 1x Platinum Taq buffer (Invitrogen), 1.5mM MgCl2, 1μl template DNA, 250nM DLP, 1x ROX (Invitrogen), 300nM cq_v2.1, 300nM PEq_v2.2, 200μM dNTPs, 0.04 units / μl Platinum Taq. Cycling conditions were: 94 °C for 2 minutes followed by 40 cycles of 94 °C for 15 seconds, 62 °C for 15 seconds and 72 °C for 32 seconds on an Applied Biosystems Step One Plus qPCR machine.

We performed quantification of our unknown libraries alongside three dilutions of a concentration standard library - i.e. one we had sequenced previously, and for which we knew the precise cluster number based on its Bioanalyzer concentration. We diluted this standard library to 100pM, 10pM and 1pM, based upon the Bioanalyzer concentration, and thus generated a standard curve in the qPCR. This allowed us to calculate the relative concentration of our libraries and to convert this to loading concentration for the sequencing reaction.
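The standard-curve calculation can be illustrated as follows. This is a minimal sketch, assuming the usual linear relationship between threshold cycle (Ct) and log10 of template concentration. The Ct values are hypothetical placeholders, since no measured values are given here, and the function names are illustrative.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def concentration_from_ct(ct, standards):
    """Interpolate a concentration (pM) from Ct using (pM, Ct) standards."""
    xs = [math.log10(conc) for conc, _ in standards]
    ys = [std_ct for _, std_ct in standards]
    slope, intercept = fit_line(xs, ys)
    return 10 ** ((ct - intercept) / slope)


# Standard dilutions from the text (100, 10 and 1 pM) with HYPOTHETICAL Ct values.
STANDARDS = [(100.0, 12.0), (10.0, 15.4), (1.0, 18.8)]
DILUTION_FACTOR = 200  # libraries were diluted 200-fold before qPCR

if __name__ == "__main__":
    diluted_pm = concentration_from_ct(15.4, STANDARDS)  # hypothetical unknown
    print(diluted_pm, diluted_pm * DILUTION_FACTOR)
```

Multiplying the interpolated value by the 200-fold dilution factor gives the concentration of the undiluted library, from which a loading concentration can be derived.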

Libraries were sequenced on Illumina GAI and GAII Analyzers following the manufacturer's standard cluster generation and sequencing protocols, for 35-76 cycles of sequencing per read.

Quail et al 2009 "Improved protocols for the Illumina Genome Analyzer Sequencing System", Current Protocols in Human Genetics 18.2 also discloses suitable protocols.

Standard library preparation

Standard sequencing libraries were prepared following the manufacturer's recommended protocol, except that genomic DNA was fragmented using Covaris Adaptive Focused Acoustics rather than nebulisation, as described above. We end repaired, phosphorylated and A-tailed the fragmented DNA and ligated Illumina paired end adapters to fragments, before size selection and 10 cycles of PCR. We quantified these libraries using an Agilent Bioanalyzer 2100 and by qPCR, as described above, and prepared paired end flowcells and sequenced for 35 or 36 cycles on an Illumina Genome Analyzer and 36 or 76 cycles on a Genome Analyzer II fitted with a paired end module. 76 cycle runs were performed using an alternative deblock reagent, supplied by Illumina.

Illumina read alignment, SNP calling and de novo assembly

For read mapping, we used our modified SSAHA (Sequence Search and Alignment by Hashing Algorithm) program [13] (http://www.sanger.ac.uk/Software/analysis/SSAHA2/), which has recently been optimized for short reads. SSAHA achieves its fast search speed by encoding sequence information in a perfect hash function. The first step is to identify regions in the reference which have similarity to the query; the query is then aligned to a small segment of the reference using a banded Smith-Waterman algorithm. This gapped alignment tool reports short indels of up to 3 bp and 2-base mismatches, and there are no restrictions on read length. The alignment files were further processed for SNP detection using a variation detection pipeline, ssaha_pileup (ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/). It is expected that the high sensitivity offered by the alignment tool should improve read mapping, particularly for extremely AT-biased genomes such as Plasmodium falciparum.
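The idea of locating candidate regions through a perfect hash of short words can be sketched as follows. This is an illustrative toy, not the SSAHA2 implementation: it assumes sequences containing only A, C, G and T, encodes each k-mer with 2 bits per base so that distinct k-mers never collide, indexes non-overlapping k-mers of the reference, and looks up every k-mer of the query. All names and the word length are assumptions made for the example.

```python
from collections import defaultdict

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer: str) -> int:
    """Map a k-mer to a unique integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | BASE_CODE[base]
    return value


def build_index(reference: str, k: int):
    """Index non-overlapping k-mers of the reference by start position."""
    index = defaultdict(list)
    for pos in range(0, len(reference) - k + 1, k):
        index[encode_kmer(reference[pos:pos + k])].append(pos)
    return index


def seed_hits(query: str, index, k: int):
    """Return (query_offset, reference_position) for each matching k-mer."""
    hits = []
    for offset in range(len(query) - k + 1):
        code = encode_kmer(query[offset:offset + k])
        for pos in index.get(code, []):
            hits.append((offset, pos))
    return hits


if __name__ == "__main__":
    idx = build_index("ACGTACGGTTCA", 4)
    print(seed_hits("GGACGGA", idx, 4))
```

In a full aligner, hits that share a consistent diagonal (reference position minus query offset) would mark the small reference segment that is then passed to the banded Smith-Waterman step.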

Paired-end Solexa-type data provide challenging but exciting prospects for de novo assembly, where the requirement for read coverage across the genome is greater than that for variation detection. Within a generated contig, every base has to be covered several times by raw reads in order to make a consensus. We used our newly developed short read assembler, fuzzypath (ftp://ftp.sanger.ac.uk/pub/zn1/fuzzypath), for the no-PCR datasets. The assembly process consists of two distinct steps: read extension and whole genome assembly. Firstly, raw Solexa sequencing reads are extended into segments of consensus sequence, each with a maximum length of 2 kb. Sequence extension starts from kmer seeds which are randomly sampled to ensure overlaps among extended segments. To obtain genome assemblies, the new data set of extended sequences with 10-15X coverage is processed and assembled using the Phusion assembler [16] - the previously-developed capillary read assembly pipeline.

Results

Using the no-PCR library preparation method (Figure 2), we produced paired-end P. falciparum 3D7 data of 1.0Gb for 36-base reads and 1.4Gb for 76-base reads, corresponding to read coverage of 44x and 65x respectively (Supplementary Table 1). Figure 1(a) shows the distribution of fraction of unmasked genome against coverage depth for 7 different datasets. The peak of the fraction distribution from the no-PCR data is in close agreement with the read depth, a feature which does not occur for the other four sets of data obtained using the standard Illumina library prep pipeline. Assuming that at least 10-fold coverage is required to call SNPs reliably, the no-PCR data performs significantly better than the other four datasets, with 97% of bases covered 10 times or more (Figure 1(b)). In variation analysis, we aligned reads using ssaha_pileup and identified 2059 SNPs. By comparison to our E. coli data (below), for which a finished sequence is available, we are confident that 95% of the P. falciparum SNPs are base errors in our previous finishing efforts on capillary data.

The high quality of the no-PCR P. falciparum datasets makes de novo assembly possible, whereas standard libraries do not permit this. From the 2 x 36-base dataset with approximately 14 million paired-end reads, we obtained an assembly of 19.0 Mb with 26,803 contigs (>100 bp) and N50 = 1,458. With the 2 x 76-base data, we produced an assembly of 21.1 Mb with 22,839 contigs (>100 bp) and N50 = 1,621 from 9.8 million paired end reads. The differences between the two assemblies are not large as the insert size is only 170 bp, which is close to the paired read length of 152 bp.
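For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows the conventional calculation: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the assembly size. The function name and the example lengths are illustrative only.

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches half the total."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


if __name__ == "__main__":
    # Illustrative contig lengths, not data from this study.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

A higher N50 for a similar assembly size indicates that more of the assembly is contained in longer contigs.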

The extreme AT bias of the Plasmodium falciparum genome may be correlated to the sequence GC content at zero coverage, shown in Figure 3 for the two no-PCR datasets. If this is the case, those sequences with zero coverage are likely to be repetitive segments, where short reads cannot be uniquely assigned even using the read pairs. When read coverage is low (e.g. less than 10), the curves observed in Figure 3 differ notably between library preparations. The value of GC content decreases with an increase in read coverage for the no-PCR datasets, indicating that it is more difficult to place reads on lower GC sequences. However, the value of GC content increases with an increase in read coverage for the 3D7 datasets using the standard Illumina pipeline (libraries STD-386, STD-245, and STD-851). Given that the fraction of covered genome is small (see Figure 1) for these standard datasets, it seems probable that those poorly covered regions are not caused by read mapping, but are due to a lack of reads truly belonging to those regions. Finally, for all the datasets, GC content increases with the increase of coverage depth before it reaches fluctuations at high values of GC content.
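A relationship between GC content and depth of coverage of the kind discussed above can be tabulated in a few lines. The following is a minimal sketch under stated assumptions: the window size, the grouping of windows by rounded mean depth and the function name are choices made for illustration, and the way Figure 3 itself was computed is not specified here.

```python
from collections import defaultdict


def gc_fraction_by_depth(reference: str, depth, window: int = 50):
    """Mean GC fraction of fixed-size windows, grouped by rounded mean depth.

    reference: uppercase reference sequence
    depth:     per-base coverage values, same length as reference
    window:    window size in bases (an assumption for this sketch)
    """
    groups = defaultdict(list)
    for start in range(0, len(reference) - window + 1, window):
        seq = reference[start:start + window]
        gc = sum(base in "GC" for base in seq) / window
        mean_depth = round(sum(depth[start:start + window]) / window)
        groups[mean_depth].append(gc)
    return {d: sum(vals) / len(vals) for d, vals in sorted(groups.items())}


if __name__ == "__main__":
    # Toy data: one uncovered AT-only window and two covered windows.
    ref = "AAAA" + "GGCC" + "ATGC"
    cov = [0] * 4 + [10] * 4 + [10] * 4
    print(gc_fraction_by_depth(ref, cov, window=4))
```

Plotting the returned mean GC fraction against depth for each library would give curves that can be compared between no-PCR and standard preparations.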

For E. coli strain 042, a 5.35 Mb genome with 50.5% GC, a finished sequence obtained from capillary sequencing is available for comparison (http://www.sanger.ac.uk/Projects/Escherichia_Shigella/ ). Using 7 million paired-end reads of 2 x 36 bases, we assembled the genome into 186 contigs with N50 = 91 kb, compared to an N50 of 20 kb on a standard library. Using the same reads against the reference, we detected 3 SNPs and 2 deletions, which are confirmed by de novo assembly, indicating that these variants are finishing errors in our previous assembly. The 2 x 36-base no-PCR library of Bordetella pertussis yielded an assembly with an N50 of 18 kb. This genome has a very high mean GC content (68%) which, coupled with a complicated repeat structure, makes assembly more difficult than for E. coli.

PCR duplicates are a major concern in Illumina sequencing and, from our previous experience, are generally dependent upon the quality of library preparations. Presumably, duplicates are more abundant in smaller genome libraries because they represent a higher percentage of the total number of possible fragments. Reducing duplicates would be beneficial for all sizes of genomes, both in lowering costs and allowing improved read mapping. We assessed the frequency of duplicate sequences in our no-PCR libraries by mapping to the reference sequence. Here a duplicate is regarded as a set of reads which share exactly the same start and end matching locations, and the number of reads in the set is defined as the duplication depth, or number of exactly duplicated matches. As shown in Figure 4, the duplication rate is very low for the two no-PCR datasets and high for STD-368 and STD-245, as the tails of their curves extend far. It is interesting to note that the duplication rate looks high for STD-883 from a PF clinical run, but in fact it is normal judging from the distribution of read duplication. The duplication distribution has a relatively short tail and a peak value at ~5.0, which is in agreement with the theoretical value of 4.7 obtained by dividing mean coverage by read length. The no-PCR duplication frequency is caused by noise in the cluster detection and analysis software, and by the fact that each double stranded template is capable of forming two identical clusters.
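
The duplicate definition given above (reads sharing exactly the same start and end matching locations) and the quantity plotted in Figure 4 (percentage of matched reads against duplication depth) can be sketched as follows. The function name and the tuple layout (reference name, start, end) are assumptions made for illustration; they do not describe the actual software used.

```python
from collections import Counter


def duplication_distribution(alignments):
    """alignments: iterable of (reference, start, end) tuples, one per matched read.

    Reads with identical tuples are treated as duplicates of one another; the
    size of each such group is its duplication depth. Returns
    {duplication depth: percentage of matched reads at that depth}.
    """
    per_location = Counter(alignments)
    total_reads = sum(per_location.values())
    reads_at_depth = Counter()
    for depth in per_location.values():
        reads_at_depth[depth] += depth  # a group of size `depth` holds `depth` reads
    return {d: 100.0 * n / total_reads for d, n in sorted(reads_at_depth.items())}


# Hypothetical alignments (not real data): three reads share one location.
reads = [
    ("chr1", 10, 180),
    ("chr1", 10, 180),
    ("chr1", 10, 180),
    ("chr1", 25, 190),
    ("chr2", 5, 170),
]
print(duplication_distribution(reads))  # {1: 40.0, 3: 60.0}
```

A library with few duplicates concentrates its reads at low duplication depths, whereas a long tail towards high depths indicates many exactly duplicated matches.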

Figure legends

Figure 1. Distribution of genome sequence coverage. a) The distribution of sequence coverage across the unmasked genome is shown for various datasets with or without the PCR step. b) Accumulated portion of unmasked genome at different depths of coverage.

Figure 2. No-PCR library preparation. a) Partially complementary ('Y-shaped') adapters with a 3' T overhang are ligated onto fragmented, end-repaired, 3' A-tailed DNA. This structure ensures that each strand of the template molecule receives a different adapter sequence at each end. The adapter strands each consist of two sections: FP1 or FP2', which allow hybridisation of the ligated template molecules to the flowcell surface, and R1 or R2', which allow hybridisation of sequencing primers. b) The standard library prep uses PCR to enrich for fully ligated templates. Only these templates amplify on the flowcell surface. With the no-PCR approach, the flowcell itself is used to select for fully ligated template molecules. Unligated template molecules do not anneal to the flowcell oligos, whereas semi-ligated template strands hybridise and will be copied in the initial extension reaction but, without the adapter at the other end, are unable to participate in bridge amplification. If, by virtue of coincidental sequence similarity, a semi-ligated template does amplify to form a cluster, it will not sequence from the unligated end because sequencing primer hybridisation will not be possible.

Figure 3. GC content and depth of coverage. GC content of genome sequence is plotted against the depth of genome coverage with various datasets with or without the PCR step.

Figure 4. Frequencies of duplicate sequences. Percentage of matched reads against duplication depth for sequence data derived from libraries prepared both with and without a PCR step.

URLs. Alignment software can be found at http://www.sanger.ac.uk/Software/analysis/SSAHA2/. All the computer code for the detection of SNPs and short indels can be found at ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/. The short read assembler is at ftp://ftp.sanger.ac.uk/pub/zn1/fuzzypath/. Some raw Solexa reads and assemblies can be found at: ftp://ftp.sanger.ac.uk/pub/zn1/PCR_free/

Discussion

The Illumina library preparation pipeline introduces tails onto library DNA in a 2-step process. Firstly, adapters, essentially consisting of the sequencing primer annealing sequences, are ligated on. Then an additional section, which facilitates hybridisation of library fragments to the flowcell, is added by PCR. However, even though the number of cycles of PCR amplification is kept low (10 cycles) [7], PCR is a source of duplicate sequences and amplification bias, and struggles with base compositions that lie at the extremes of low or high GC content [14]. The consequences are that each run becomes less efficient and that assembly, mapping and SNP detection are made more complicated than necessary.

By ligating on long adapters, which consist of all sections required for sequencing primer annealing and attachment to the flowcell surface, we can avoid the requirement of a PCR step. The quantity of template DNA generated in this way is lower than when PCR is employed, which complicates quantification. To overcome this problem, we have developed a sensitive and accurate qPCR-based assay for library quantification. Even though the yield of no-PCR libraries is lower than that of libraries prepared in the standard way, the yield of a 200 bp no-PCR library from 5 µg starting DNA is typically sufficient for > 400 lanes, more than enough for most sequencing purposes. As with the qPCR assay, Illumina cluster amplification can only amplify template strands that are correctly ligated, i.e. those with a different adapter sequence on either end. The structure of the adapters ensures that all fully ligated templates receive a different sequence at each end, though because the efficiency of ligation is not 100%, many template strands will receive no adapters, or will only be partially ligated. However, templates that lack an adapter sequence at either end cannot amplify on the flowcell surface, and in this way the cluster amplification step performs the enrichment that is otherwise provided by the PCR.
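
To illustrate why incomplete ligation leaves a mixture of fully ligated, partially ligated and unligated templates, the following sketch assumes, purely for illustration, that each end of a fragment receives an adapter independently with the same probability p. Neither this independence model nor the example value of p is taken from the present disclosure; the function name is hypothetical.

```python
def ligation_outcomes(p):
    """Expected fractions of template molecules by ligation state, assuming
    (hypothetically) that each end is ligated independently with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return {
        "fully_ligated": p * p,           # adapter on both ends: can form clusters
        "semi_ligated": 2 * p * (1 - p),  # adapter on one end only
        "unligated": (1 - p) ** 2,        # no adapters: does not anneal to the flowcell
    }


# Illustrative value only; the actual ligation efficiency is not stated here.
print(ligation_outcomes(0.5))
```

Under this simplified model only the fully ligated fraction can take part in cluster amplification, which corresponds to the enrichment role the text assigns to the flowcell.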

We have demonstrated that the sequence coverage provided by the no-PCR approach is more even than that of the standard, PCR-based Illumina library prep, contains very few duplicates, aids mapping and SNP calling, and makes assembly more straightforward. This is best illustrated by the P. falciparum genome, which until now has resisted attempts at de novo assembly from short read data. The success of this approach indicates that a combination of low coverage capillary sequences and deep coverage short read sequences will permit high quality malaria assemblies.

Although the benefits of the no-PCR library prep are most clearly demonstrated by the extremely AT-rich malaria 3D7 genome, they are not limited to this type of genome. Assembly is also improved in the GC-neutral E. coli genome and the GC-rich Bordetella pertussis genome. Because of the absence of the PCR step, the method is quicker to perform than the standard Illumina library prep [7], and we feel that it should be employed routinely in the preparation of libraries for Illumina sequencing.

For all genomes, whether GC-neutral, GC-rich or AT-rich, the method of the invention does not generate any PCR duplicates, which provides an advantage, as it makes sequencing more efficient.

All documents referred to herein are incorporated by reference.

It will be understood that particular embodiments described herein are shown by way of illustration and not as limitations of the invention. The principal features of this invention can be employed in various embodiments without departing from the scope of the invention. Those skilled in the art will recognize, or be able to ascertain using no more than routine study, numerous equivalents to the specific procedures described herein. Such equivalents are considered to be within the scope of this invention and are covered by the claims. All publications and patent applications mentioned in the specification are indicative of the level of skill of those skilled in the art to which this invention pertains. All publications and patent applications are herein incorporated by reference to the same extent as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference. The use of the word "a" or "an" when used in conjunction with the term "comprising" in the claims and/or the specification may mean "one," but it is also consistent with the meaning of "one or more," "at least one," and "one or more than one." The use of the term "or" in the claims is used to mean "and/or" unless explicitly indicated to refer to alternatives only or the alternatives are mutually exclusive, although the disclosure supports a definition that refers to only alternatives and "and/or."

Throughout this application, the term "about" is used to indicate that a value includes the inherent variation of error for the device, the method being employed to determine the value, or the variation that exists among the study subjects.

As used in this specification and claim(s), the words "comprising" (and any form of comprising, such as "comprise" and "comprises"), "having" (and any form of having, such as "have" and "has"), "including" (and any form of including, such as "includes" and "include") or "containing" (and any form of containing, such as "contains" and "contain") are inclusive or open-ended and do not exclude additional, unrecited elements or method steps.

Where combinations of elements are referred to herein, the term combination refers to all permutations and combinations of the listed items preceding the term. For example, "A, B, C, or combinations thereof" is intended to include at least one of: A, B, C, AB, AC, BC, or ABC, and if order is important in a particular context, also BA, CA, CB, CBA, BCA, ACB, BAC, or CAB. Continuing with this example, expressly included are combinations that contain repeats of one or more item or term, such as BB, AAA, AAB, BBC, AAABCCCC, CBBAAA, CABABB, and so forth. The skilled artisan will understand that typically there is no limit on the number of items or terms in any combination, unless otherwise apparent from the context.

All of the compositions and/or methods disclosed and claimed herein can be made and executed without undue experimentation in light of the present disclosure. While the compositions and methods of this invention have been described in terms of preferred embodiments, it will be apparent to those of skill in the art that variations may be applied to the compositions and/or methods and in the steps or in the sequence of steps of the method described herein without departing from the concept, spirit and scope of the invention. All such similar substitutes and modifications apparent to those skilled in the art are deemed to be within the spirit, scope and concept of the invention as defined by the appended claims.

All features of each of the aspects and embodiments of the invention apply to all other aspects and embodiments mutatis mutandis.

References

1. Goman M. et al. The establishment of genomic DNA libraries for the human malaria parasite Plasmodium falciparum and identification of individual clones by hybridisation. Mol Biochem Parasitol 5, 391-400 (1982).

2. Camargo A.A., Fischer K., Lanzer M. & del Portillo H.A. Construction and characterization of a Plasmodium vivax genomic library in yeast artificial chromosomes. Genomics 42, 467-473 (1997).

3. de Bruin D., Lanzer M. & Ravetch J.V. Characterization of yeast artificial chromosomes from Plasmodium falciparum: construction of a stable, representative library and cloning of telomeric DNA fragments. Genomics 14, 332-339 (1992).

4. Triglia T. & Kemp D.J. Large fragments of Plasmodium falciparum DNA can be stable when cloned in yeast artificial chromosomes. Mol Biochem Parasitol 44, 207-211 (1991).

5. Pollack Y., Katzen A.L., Spira D.T. & Golenser J. The genome of Plasmodium falciparum. I: DNA base composition. Nucleic Acids Res 10, 539-546 (1982).

6. Gardner M.J. et al. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419, 498-511 (2002).

7. Bentley D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53-59 (2008).

8. Saiki R.K. et al. Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science 239, 487-491 (1988).

9. Day D.J. et al. Identification of non-amplifying CYP21 genes when using PCR-based diagnosis of 21-hydroxylase deficiency in congenital adrenal hyperplasia (CAH) affected pedigrees. Human Molecular Genetics 5, 2039-2048 (1996).

10. Barnard R., Futo V., Pecheniuk N., Slattery M. & Walsh T. PCR bias toward the wild-type k-ras and p53 sequences: implications for PCR detection of mutations and cancer diagnosis. Biotechniques 25, 684-691 (1998).

11. Hahn S., Garvin A.M., Di Naro E. & Holzgreve W. Allele drop-out can occur in alleles differing by a single nucleotide and is not alleviated by preamplification or minor template increments. Genet Test 2, 351-355 (1998).

12. Ogino S. & Wilson R.B. Quantification of PCR bias caused by a single nucleotide polymorphism in SMN gene dosage analysis. J Mol Diagn 4, 185-190 (2002).

13. Ning Z., Cox A.J. & Mullikin J.C. SSAHA: a fast search method for large DNA databases. Genome Res 11, 1725-1729 (2001).

14. Dohm J.C., Lottaz C., Borodina T. & Himmelbauer H. Substantial biases in ultrashort read data sets from high-throughput DNA sequencing. Nucleic Acids Res 36, e105 (2008).

15. Smith D. & Malek J. Asymmetrical adapters and methods of use thereof. US 2007/0172839 A1 (2007).

16. Mullikin J.C. & Ning Z. The phusion assembler. Genome Res 13, 81-90 (2003).