Title:
NUCLEIC ACID PREPARATION METHOD
Document Type and Number:
WIPO Patent Application WO/2015/145133
Kind Code:
A1
Abstract:
This invention relates to the preparation of nucleic acids, for example bisulfite-treated nucleic acids, for the analysis of modified cytosine marks. Included is a method of preparing a bisulfite treated nucleic acid library comprising a two-step ligation procedure, where a first adapter is added before bisulfite treatment and a second adapter afterwards.

Inventors:
BALASUBRAMANIAN SHANKAR (GB)
RAIBER EUN-ANG (GB)
MCINROY GORDON (GB)
Application Number:
PCT/GB2015/050871
Publication Date:
October 01, 2015
Filing Date:
March 24, 2015
Assignee:
CAMBRIDGE ENTPR LTD (GB)
International Classes:
C12N15/10
Domestic Patent References:
WO2009132315A12009-10-29
Other References:
R. LISTER ET AL: "Finding the fifth base: Genome-wide sequencing of cytosine methylation", GENOME RESEARCH, vol. 19, no. 6, 9 March 2009 (2009-03-09), pages 959 - 966, XP055190057, ISSN: 1088-9051, DOI: 10.1101/gr.083451.108
SHAWN J. COKUS ET AL: "Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning", NATURE, vol. 452, no. 7184, 17 February 2008 (2008-02-17), pages 215 - 219, XP055190064, ISSN: 0028-0836, DOI: 10.1038/nature06745
RYAN LISTER ET AL: "Highly Integrated Single-Base Resolution Maps of the Epigenome in Arabidopsis", CELL, vol. 133, no. 3, 1 May 2008 (2008-05-01), pages 523 - 536, XP055190066, ISSN: 0092-8674, DOI: 10.1016/j.cell.2008.03.029
BOOTH MICHAEL J ET AL: "Oxidative bisulfite sequencing of 5-methylcytosine and 5-hydroxymethylcytosine", NATURE PROTOCOLS, NATURE PUBLISHING GROUP, GB, vol. 8, no. 10, 1 October 2013 (2013-10-01), pages 1841 - 1851, XP009175350, ISSN: 1750-2799, [retrieved on 20130905]
M. J. BOOTH ET AL: "Quantitative Sequencing of 5-Methylcytosine and 5-Hydroxymethylcytosine at Single-Base Resolution", SCIENCE, vol. 336, no. 6083, 18 May 2012 (2012-05-18), pages 934 - 937, XP055064913, ISSN: 0036-8075, DOI: 10.1126/science.1220671
TANAKA K ET AL: "Degradation of DNA by bisulfite treatment", BIOORGANIC & MEDICINAL CHEMISTRY LETTERS, PERGAMON, AMSTERDAM, NL, vol. 17, no. 7, 1 April 2007 (2007-04-01), pages 1912 - 1915, XP026265824, ISSN: 0960-894X, [retrieved on 20070312], DOI: 10.1016/J.BMCL.2007.01.040
HOEIJMAKERS WIETEKE A M ET AL: "Linear amplification for deep sequencing", NATURE PROTOCOLS, NATURE PUBLISHING GROUP, GB, vol. 6, no. 7, 1 July 2011 (2011-07-01), pages 1026 - 1036, XP009177348, ISSN: 1750-2799, [retrieved on 20110623]
AIRD DANIEL ET AL: "Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries", GENOME BIOLOGY, BIOMED CENTRAL LTD., LONDON, GB, vol. 12, no. 2, 21 February 2011 (2011-02-21), pages R18, XP021091793, ISSN: 1465-6906, DOI: 10.1186/GB-2011-12-2-R18
Attorney, Agent or Firm:
BARNES, Colin Lloyd (Meridian CourtComberton Road,Toft, Cambridge CB23 2RY, GB)
Claims:
1. A method of preparing a bisulfite treated nucleic acid library comprising a two-step ligation procedure, where a first adapter is added before bisulfite treatment and a second adapter afterwards.

2. A method of preparing a nucleic acid library comprising;

(i) providing a population of double-stranded nucleic acids,

(ii) adding an adaptor sequence to the 3' ends to produce a population of double stranded nucleic acids having an adaptor sequence at the 3' end of each strand,

(iii) denaturing the population of nucleic acids to produce a population of nucleic acid strands having the adaptor sequence at the 3' end,

(iv) hybridising an oligonucleotide primer to the adaptor sequences at the 3' ends of the nucleic acid strands,

(v) extending the primer along the nucleic acid strands to produce complementary strands, the strands and complementary strands forming double-stranded nucleic acids that comprise an adaptor at a first end, and;

(vi) ligating a second adaptor to the second end of the double-stranded nucleic acids to produce a library of double-stranded nucleic acids having an adaptor at the first end and a second adaptor at the second end.

3. A method according to claims 1 or 2 wherein the library is interrogated by sequencing the nucleic acids.

4. A method according to claim 2 wherein the nucleic acids are bisulfite treated after step ii and before step iii.

5. A method according to any one preceding claim comprising determining the identity of a base in one or more nucleic acids in the library at a position that corresponds to cytosine in the non-bisulfite treated nucleic acids.

6. A method according to any one of claims 1 to 5 wherein the nucleic acids are modified before treatment with bisulfite.

7. A method according to any one of the preceding claims comprising immobilising the population of nucleic acid strands on a solid support.

8. A method according to claim 7 wherein the adaptor oligonucleotide is linked to a binding tag and the method comprises binding the tag to a capture member immobilised on a solid support, thereby immobilising the population of nucleic acid strands.

9. A method according to claim 7 wherein the oligonucleotide primer is linked to a binding tag and the method comprises binding the tag to a capture member immobilised on a solid support thereby immobilising the population of nucleic acids.

10. A method according to any one of claims 8 or 9 wherein the nucleic acids are released from the solid support following ligation of the second adaptor.

11. A method according to claim 10 wherein the adaptor oligonucleotide or oligonucleotide primer is linked to the binding tag by a cleavable linker and the nucleic acids are released from the solid support by cleavage of the cleavable linker.

12. A method according to any one of the preceding claims wherein the population of double-stranded nucleic acids is provided by a method comprising;

isolating nucleic acids from one of: a cell, a sample of cells, and a biological fluid sample,

fragmenting the nucleic acids and,

repairing the ends of the nucleic acids.

13. A method according to any one of the preceding claims wherein the adaptor sequence is added to the 3' ends of the nucleic acids by ligating an adaptor oligonucleotide to said 3' ends but not to the 5' ends of the nucleic acids.

14. A kit for use in the preparation of a nucleic acid library according to any one of claims 1 to 13 comprising;

an adaptor oligonucleotide,

a complementary oligonucleotide,

an oligonucleotide primer,

a bisulfite reagent and

a second adaptor, or;

a hairpin adaptor comprising an adaptor sequence and a complementary sequence,

an oligonucleotide primer,

a bisulfite reagent and

a second adaptor.

15. A kit according to claim 14 wherein the adaptor oligonucleotide consists of methylated cytosine nucleotide analogues.

16. A kit according to claim 14 or claim 15 wherein the complementary oligonucleotide lacks a 3' hydroxyl group.

17. A kit according to any one of claims 14 to 16 further comprising one or more end-repair reagents.

18. A kit according to any one of claims 14 to 17 further comprising a solid support.

19. Use of a kit according to any one of claims 14 to 18 in a method of preparing a nucleic acid library according to any one of claims 1 to 13.

20. A method of assessing an individual for a disease or predisposition thereto comprising;

(i) providing a sample obtained from the individual,

(ii) isolating a population of double-stranded nucleic acids from the sample,

(iii) adding an adaptor oligonucleotide to the 3' ends of the strands of the nucleic acids to produce a population of nucleic acids having an adaptor sequence at the 3' ends,

(iv) denaturing the population of nucleic acids to produce a population of nucleic acid strands having the adaptor sequence at the 3' end,

(v) hybridising an oligonucleotide primer to the adaptor sequences at the 3' ends of the nucleic acid strands,

(vi) extending the primer along the nucleic acid strands to produce complementary strands, the strands and complementary strands forming double-stranded nucleic acids that comprise an adaptor at a first end, and;

(vii) ligating a second adaptor to the second end of the double-stranded nucleic acids to produce a library of double-stranded nucleic acids having an adaptor at the first end and a second adaptor at the second end,

(viii) optionally denaturing the double-stranded nucleic acids to produce a library of nucleic acid strands having an adaptor sequence at the first end and a second adaptor sequence at the second end, and

(ix) interrogating one or more nucleic acids in the library to determine the identity of one or more bases in said nucleic acids.

Description:
Nucleic Acid Preparation Method

This invention relates to the preparation of nucleic acids, for example bisulfite-treated nucleic acids, for the analysis of modified cytosine marks.

5-methylcytosine (5mC), formed by methylation at the C5 position of the DNA base cytosine, is an important epigenetic mark. 5mC has been shown to regulate gene expression (Deaton, A. M.; Bird, A. Genes & Development 2011, 25, 1010-22) and is involved in a plethora of important processes including X-chromosome inactivation (Jones, P. A.; Takai, D. Science (New York, N.Y.) 2001, 293, 1068-70), genomic imprinting and cancer progression (Jones, P. A. Oncogene 2002, 21, 5358-60). Other epigenetic marks at the C5 position of cytosine include 5-hydroxymethylcytosine (5hmC) and 5-formylcytosine (5fC). Central to understanding the function of 5mC, 5hmC, 5fC and other modified cytosine marks is determining where in the genome they occur. This positional information can be obtained by a variety of methods, including bisulfite sequencing (BS-seq) and variations of bisulfite sequencing, such as oxidative bisulfite sequencing (OxBS-seq; Booth et al (2013) Nat Protoc 8 1841-1851; Booth et al (2012) Science 336 934-937), reductive bisulfite sequencing (redBS-seq; M J Booth, G Marsico, M Bachman, D Beraldi and S Balasubramanian (2014) Nature Chemistry 6 435-440), tet-assisted bisulfite sequencing (TAB-seq; Yu et al (2012) Nat Protoc 7(12) 2159-2170), chemical assisted bisulfite sequencing (CAB-seq), 5fC chemical assisted bisulfite sequencing (fCAB-seq), reduced representation bisulfite sequencing, whole genome bisulfite sequencing, and targeted bisulfite sequencing.

A key step in bisulfite sequencing involves the chemical modification of a nucleic acid sample with bisulfite ions (HSO₃⁻), which convert cytosine bases to uracil but do not significantly affect 5mC. A significant drawback to this step is that, under the bisulfite treatment conditions, a great deal of the DNA sample is degraded by depyrimidination causing strand scission (Tanaka, K.; Okamoto, A. Bioorganic & Medicinal Chemistry Letters 2007, 17, 1912-1915). Quantification of full-length target DNA showed that only 10¹¹ copies out of 10¹⁴ of amplifiable DNA remained after 16h bisulfite incubation, which corresponded to 0.1% of the original DNA amount (Figure 1).
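The conversion rule and the yield figure quoted above can be illustrated with a minimal Python sketch. It assumes an idealised, complete conversion with no strand damage; the function names, the example sequence and the methylated position are hypothetical and are not taken from the application.

```python
def bisulfite_convert(sequence, methylated_positions=frozenset()):
    """Return the sequence after idealised bisulfite conversion.

    Unmodified cytosines become uracil ("U"). Cytosines whose 0-based
    index is listed in methylated_positions (5mC) are left as "C".
    """
    converted = []
    for index, base in enumerate(sequence.upper()):
        if base == "C" and index not in methylated_positions:
            converted.append("U")
        else:
            converted.append(base)
    return "".join(converted)


def surviving_fraction(copies_after, copies_before):
    """Fraction of amplifiable copies remaining after treatment."""
    return copies_after / copies_before


# Hypothetical 8-base strand with 5mC at index 4.
print(bisulfite_convert("ACGTCCGA", {4}))
# Copy numbers quoted in the text: 10^11 remaining out of 10^14.
print(f"{surviving_fraction(1e11, 1e14):.1%}")
```

In this sketch the unmethylated cytosines at indices 1 and 5 are converted while the methylated cytosine at index 4 is retained, and the quoted copy numbers reproduce the 0.1% figure.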

Due to the significant loss of DNA after bisulfite treatment, PCR amplification is necessary in order to obtain sufficient DNA for sequencing analysis. PCR, however, introduces additional biases and limitations. Loci with extreme base compositions are often severely under-represented, or absent entirely (Aird et al. Genome Biology 2011, 12:R18). GC-rich regions are known to be particularly problematic due to the formation of secondary structures and higher melting temperatures. This is a substantial problem as many human genes are located in GC-rich regions (defined as having a GC content > 60%) (Saccone S et al PNAS USA (1992) 89 4913-4917). AT-rich regions are also under-represented, which confounds research into organisms with AT-rich genomes, such as the pathogen Plasmodium.

The under-representation of fragments with certain sequence compositions reduces the coverage obtained during sequencing, i.e. read depth is lost in those areas. This leads to an additional and often overlooked problem: loss of quantitative power. Bisulfite sequencing is theoretically able to give the percentage methylation at single base resolution. This is achieved by exploiting the depth of coverage and digital readout that next generation sequencing (NGS) techniques give for a specific base. If 18 out of 30 reads covering a base indicate 5mC at that position, the location is said to be 60% methylated. However, after bisulfite treatment, the composition of reads changes to a greater or lesser extent, depending on their 5mC content. This in turn affects their amplification during PCR and thus their representation in the library. If a PCR amplification step is required, then the quantitative power of bisulfite sequencing is severely reduced.

The present inventors have developed a process that provides improved yields of nucleic acids, for example bisulfite-treated nucleic acids, that carry adaptors on both ends and are suitable for sequencing.
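The read-count arithmetic in the 18-out-of-30 example can be sketched as follows. This is an illustration only: the function names are hypothetical, and the sketch assumes for simplicity that a "C" call at a reference cytosine indicates 5mC and a "T" call indicates an unmodified, converted cytosine.

```python
def methylation_level(methylated_reads, total_reads):
    """Fraction of reads at one position that indicate 5mC."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= methylated_reads <= total_reads:
        raise ValueError("methylated_reads must lie between 0 and total_reads")
    return methylated_reads / total_reads


def methylation_from_calls(base_calls):
    """Estimate methylation from the bases read at a reference cytosine.

    "C" is counted as protected (methylated) and "T" as converted
    (unmethylated); any other call is ignored as uninformative.
    """
    informative = [base for base in base_calls.upper() if base in "CT"]
    return methylation_level(informative.count("C"), len(informative))


# The worked example from the text: 18 of 30 reads indicate 5mC.
print(f"{methylation_level(18, 30):.0%}")
print(methylation_from_calls("C" * 18 + "T" * 12))
```

Because the estimate is a ratio of read counts, any amplification step that changes the relative representation of converted and unconverted fragments shifts the ratio, which is the loss of quantitative power described above.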

Whilst bisulfite treatment leads to loss of sequenceable DNA via fragmentation, the majority of cleaved fragments still contain useful information and are of a mappable length. These lost fragments and the associated information can be recovered by employing a two-step ligation procedure, where a first adapter is added before bisulfite treatment and a second adapter afterwards.

An aspect of the invention provides a method of preparing a nucleic acid library comprising;

(i) providing a population of double-stranded target nucleic acids,

(ii) adding an adaptor sequence to the 3' ends of the strands of the target nucleic acids to produce a population of nucleic acids having an adaptor sequence at the 3' end,

(iii) denaturing the population of double stranded nucleic acids to produce a population of nucleic acid strands having the adaptor sequence at the 3' end,

(iv) hybridising an oligonucleotide primer to the adaptor sequences at the 3' ends of the nucleic acid strands,

(v) extending the primer along the nucleic acid strands to produce complementary strands, the strands and complementary strands forming double-stranded nucleic acids that comprise an adaptor at a first end, and;

(vi) ligating a second adaptor to the second end of the double-stranded nucleic acids to produce a library of double-stranded nucleic acids having a first adaptor at a first end and a second adaptor at a second end.
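Steps (ii) to (vi) above can be traced on a single strand with a toy string model. This is a simplified sketch under stated assumptions, not the inventors' protocol: strands are written 5' to 3' as plain strings, bisulfite treatment and ligation chemistry are ignored, and the function names and short example sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return sequence.translate(COMPLEMENT)[::-1]


def prepare_library_strand(target, first_adaptor, second_adaptor):
    """Trace one target strand through steps (ii) to (vi)."""
    # (ii) the first adaptor sequence is added to the 3' end of the strand
    adapted = target + first_adaptor
    # (iii) denaturation is modelled by following this one strand alone
    # (iv) the primer is complementary to the adaptor at the 3' end
    primer = reverse_complement(first_adaptor)
    # (v) extending the primer along the strand yields the complementary strand
    complementary = reverse_complement(adapted)
    assert complementary.startswith(primer)
    # (vi) the second adaptor is joined at the other end of the duplex,
    #      i.e. next to the 5' end of the original strand
    top_strand = second_adaptor + adapted
    bottom_strand = reverse_complement(top_strand)
    return top_strand, bottom_strand


# Hypothetical short sequences, for illustration only.
top, bottom = prepare_library_strand("ACGT", "GGA", "TTC")
print(top)
print(bottom)
```

The point of the model is that the primer-extension product begins with the complement of the first adaptor, so each strand that retains a 3' adaptor gives a duplex with one adapted end and one end available for the second ligation.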

Each double-stranded nucleic acid in the library may comprise a target nucleic acid from the population with the first adaptor at a first end and the second adaptor at a second end.

In some preferred embodiments, the nucleic acids with the 3' adaptor sequence may be treated with bisulfite. Optionally, the double-stranded nucleic acids in the library may be denatured to produce a library of nucleic acid strands having a first adaptor sequence at a first end and a second adaptor sequence at a second end. Each nucleic acid strand in the library may comprise the sequence of a target nucleic acid from the population with the first adaptor sequence at a first end and the second adaptor sequence at a second end.

Following preparation, the nucleic acids in the nucleic acid library may be interrogated, for example to determine the identity of one or more bases in the target nucleic acid sequence. Suitable methods of interrogation include sequencing or hybridisation, for example to a probe, e.g. a probe immobilised on an array. In some preferred embodiments, a method may comprise sequencing nucleic acids in the nucleic acid library following preparation as described above.

A nucleic acid library is a diverse collection of single or double stranded target nucleic acids. Preferably, the nucleic acids have adapted ends i.e. the nucleic acids in the library have an adaptor at each end. The presence of terminal adaptors at the ends of the target nucleic acid sequence allows the nucleic acids in the library to be sequenced. Preferably, all or substantially all of the nucleic acids in a library are sequenceable (i.e. the nucleic acids in the library each comprise adaptors at both ends) following production as described herein.

Nucleic acids may be ribonucleic acids (RNA) or more preferably deoxyribonucleic acids (DNA). For example, the population of nucleic acids may be DNA molecules.

The population of double-stranded target nucleic acids for preparation as described herein may be a diverse population, for example a population of genomic DNA molecules, such as mammalian genomic DNA; or RNA molecules, for example genomic RNA, mRNA, tRNA, rRNA or non-coding RNA. For example, DNA molecules in the population may comprise all or part of the sequence of one or more genes, including exons, introns or upstream or downstream regulatory elements, and/or the sequences may comprise a genomic sequence that is not associated with a gene. In some embodiments, a population of double-stranded DNA molecules may represent the whole genome or a specific genomic locus of an organism or a population of organisms.

Because an amplification step may be omitted in some preferred embodiments, methods described herein allow the efficient sequencing of nucleic acids that are under-represented in nucleic acid libraries that are produced by amplification. For example, the population of nucleic acids may comprise one or more CpG islands, GC-rich regions (GC content > 60%) and/or AT rich regions (AT content > 60%).
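The 60% thresholds used above to define GC-rich and AT-rich regions can be applied to a sequence with a short Python sketch. The function names and example sequences are hypothetical; only the threshold value comes from the text.

```python
def base_fraction(sequence, bases):
    """Fraction of positions in the sequence occupied by any of the given bases."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(1 for base in sequence if base in bases) / len(sequence)


def classify_region(sequence, threshold=0.60):
    """Label a region GC-rich or AT-rich when its content exceeds the threshold."""
    if base_fraction(sequence, "GC") > threshold:
        return "GC-rich"
    if base_fraction(sequence, "AT") > threshold:
        return "AT-rich"
    return "balanced"


# Hypothetical example regions.
print(classify_region("GCGCGCGCAT"))  # 80% GC
print(classify_region("ATATATATGC"))  # 80% AT
print(classify_region("ACGT"))        # 50% GC, 50% AT
```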

In some embodiments, the population of target nucleic acids may be obtained or isolated from a sample of cells, for example, mammalian cells, preferably human cells. Suitable samples include isolated cells and tissue samples, such as biopsies.

Suitable cells include somatic and germ-line cells and may be at any stage of development, including fully or partially differentiated cells or non-differentiated or pluripotent cells, including stem cells, such as adult or somatic stem cells, foetal stem cells or embryonic stem cells. Suitable cells also include induced pluripotent stem cells (iPSCs), which may be derived from any type of somatic cell in accordance with standard techniques.

For example, target nucleic acids may be obtained or isolated from neural cells, including neurons and glial cells, contractile muscle cells, smooth muscle cells, liver cells, hormone synthesising cells, sebaceous cells, pancreatic islet cells, adrenal cortex cells, fibroblasts, keratinocytes, endothelial and urothelial cells, osteocytes, and chondrocytes.

Suitable cells also include cells associated with disease conditions, for example cancer cells, such as carcinoma, sarcoma, lymphoma, blastoma or germ-line tumour cells, and cells with the genotype of a genetic disorder, such as Huntington's disease, cystic fibrosis, sickle cell disease, phenylketonuria, Down syndrome or Marfan syndrome.

The population of target nucleic acids may be obtained from a population of cells or an individual cell (e.g. single cell genomics). The analysis of nucleic acids from a single cell may, for example, allow genetic variability and epigenetic variability in individual cells and cell-types to be determined.

In other embodiments, the population of target nucleic acids may be obtained from a sample of biological fluid, for example a sample of amniotic fluid, cerebrospinal fluid, mucus, sebum, blood, plasma, serum, urine or saliva.

Methods of extracting and isolating target nucleic acids, such as genomic DNA or RNA, from an individual cell, a sample of cells or a biological fluid, are well-known in the art. For example, genomic DNA or RNA may be isolated using any convenient isolation techniques, such as phenol/chloroform extraction and alcohol precipitation, caesium chloride density gradient centrifugation, solid-phase anion-exchange chromatography and silica gel-based techniques.

Following isolation, a sample of target nucleic acids, such as genomic DNA or RNA, may be fragmented to produce target nucleic acid fragments. Fragmentation may reduce the size of the nucleic acids in the population. For example, following fragmentation, the nucleic acids may be 10bp to 5000bp, preferably 20bp to 2000bp or 30bp to 1000bp.

Suitable fragmentation methods are well-known in the art and include nebulization, sonication or acoustic shearing, mechanical shearing and endonuclease digestion. The whole or a fraction of the fragmented nucleic acid sample may be used as described herein.

Suitable fractions of genomic DNA and/or RNA may be based on size or other criteria. For example, a fraction of genomic DNA and/or RNA fragments which is enriched for CpG islands (CGIs) may be used as described herein.

Following fragmentation, the ends of the nucleic acid fragments may be repaired to produce a population of blunt-ended nucleic acids. Suitable methods of repairing nucleic acid ends are well-known in the art. For example, fragmented nucleic acids may be converted into blunt-ended molecules by filling in 5' overhangs using a 5'→3' polymerase, and removing 3' overhangs using a 3'→5' exonuclease, in accordance with standard techniques. Suitable polymerases include T4 DNA polymerase and/or Klenow fragment.

In some preferred embodiments, the ends of the nucleic acids are not treated with a 5' kinase, such as T4 polynucleotide kinase, and remain unphosphorylated at the 5' ends. This may be useful in preventing the ligation of adaptors or other nucleotide sequences to the 5' ends of the nucleic acid strands.

Following fragmentation and end-repair, the target nucleic acids in the population are blunt-ended and may, in some preferred embodiments, comprise free hydroxyls at both the 5' and 3' ends of each strand.

Suitable techniques and kits for the end-repair of nucleic acid molecules are widely available from commercial suppliers (e.g. End-It™, Epicentre; NEBNext™ End Repair Module, New England Biolabs; Fast DNA End Repair, Thermo Fisher Scientific; DNA End Repair Mix, Life Technologies; Paired-End Sample Prep Kit, Illumina Inc).

An adaptor or adaptor oligonucleotide may be ligated directly to the 3' strand of the blunt-ended target nucleic acids in the population as described herein or the blunt-ends may be modified before ligation of the adaptor oligonucleotide. For example, a one-base overhang consisting of an adenine residue (A-tail) may be added to the 3' strands of the blunt ends. This 3' overhang may facilitate ligation of the adaptor oligonucleotide. Other suitable modifications to facilitate the ligation of nucleic acids and oligonucleotides are well-known in the art.

In some embodiments, the double-stranded target nucleic acids may initially comprise a 5' phosphate group, and a method may comprise removing this 5' phosphate group, for example using a phosphatase, such as antarctic phosphatase or alkaline phosphatase, to produce double-stranded nucleic acids that lack 5' phosphate groups. This may prevent the ligation of oligonucleotides to the 5' ends of the DNA strands.

Suitable techniques and reagents for A-tailing and phosphatase treatment are well-known in the art.

In some preferred embodiments, the double-stranded nucleic acids in the population may comprise 3' adenine overhangs and lack 5' phosphate groups following end-repair and/or modification.

In other embodiments, the double-stranded nucleic acids may comprise a 5' phosphate group but ligation of an oligonucleotide to the 5' ends of the strands may be blocked by a blocking group on the oligonucleotide, for example a 3' blocking group (e.g. a group other than a 3' hydroxyl) or a 3' dideoxynucleotide residue.

An adaptor is a short nucleic acid at an end of a target nucleic acid that facilitates the sequencing of the target nucleic acid. Adaptors may be located at both ends of a nucleic acid. A nucleic acid may have different adaptors at each end, or more preferably the same adaptor at each end. The adaptors at the ends of a nucleic acid are preferably full-length sequencing adaptors that allow the sequencing of the nucleic acids without the need for the incorporation of additional nucleotide sequences.

A suitable adaptor for a single-stranded nucleic acid may comprise an adaptor sequence. A suitable adaptor for a double stranded nucleic acid may comprise an adaptor sequence and a complementary sequence which hybridises to all or part of the adaptor sequence, such that the adaptor comprises a double-stranded portion that is ligated to the target nucleic acid and a single-stranded overhang (i.e. a double-stranded region proximal to the target nucleic acid and a single-stranded tail).

A nucleic acid in a library may comprise a first adaptor at one end (e.g. a first end) and a second adaptor at the other end (e.g. a second end). The sequence of the target nucleic acid is located between the first and second adaptors. The nucleic acids in the library may have the same first adaptor at their 3' ends and the same second adaptor at their 5' ends i.e. all of the nucleic acids in the library may be flanked by the same pair of adaptors. The first and second adaptors may be different or more preferably the same (i.e. the nucleic acids may have the same adaptor at each end).

To facilitate high-throughput sequencing, all of the nucleic acids in a library may comprise the same adaptor sequence, or adaptor sequences which differ only in an index sequence. In some embodiments, the adaptors and adaptor sequences are synthetic sequences that are not found within the mammalian genome.

Adaptors suitable for use in sequencing nucleic acids are well-known in the art. Adaptors are generally specific for a sequencing platform and the sequence of the adaptor therefore depends on the specific sequencing method to be employed. Adaptors suitable for any specific sequencing method are well-known in the art and may be designed and produced using known techniques or obtained from commercial sources. Suitable sequencing platforms include Sanger sequencing, Solexa-Illumina sequencing platforms, such as HiSeq™, MiSeq™ and NextSeq™, semiconductor array sequencing (IonTorrent™; LifeTech), pyrosequencing (e.g. 454 Sequencing; Roche 454), single molecule real-time sequencing (SMRT™; PacBio RS) and ligation-based sequencing (SOLiD™; Life Technologies). Adaptors suitable for any of these sequencing platforms may be used in the methods described herein. In some embodiments, adaptors may include a region that is complementary to the universal primers on a solid support (e.g. a flowcell or bead) and a region that is complementary to universal sequencing primers (i.e. which when annealed to the adaptor sequence and extended allows the sequence of the nucleic acid molecule to be read).

Adaptor sequences suitable for use as described herein may be 20 to 80 nucleotides long. In some embodiments, the adaptor may comprise a sequence that hybridises to complementary primers immobilised on the solid support (e.g. 20-30 nucleotides); a sequence that hybridises to a sequencing primer (e.g. 30-40 nucleotides) and a unique index or barcode sequence (e.g. 6-10 nucleotides). For example, a suitable adaptor may be 56-80 nucleotides in length. Adaptors for Illumina TruSeq™ sequencing may be 64 nucleotides long (including a 6 nucleotide index).
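The 56-80 nucleotide figure follows from summing the example ranges of the three adaptor elements, as the short sketch below shows. The dictionary keys and function names are hypothetical labels chosen for illustration; the numeric ranges are those given in the text.

```python
# (minimum, maximum) lengths in nucleotides, from the example ranges in the text.
ELEMENT_LENGTHS = {
    "support_primer_binding": (20, 30),
    "sequencing_primer_binding": (30, 40),
    "index": (6, 10),
}


def total_length_range(elements):
    """Sum the minimum and maximum lengths of all adaptor elements."""
    minimum = sum(low for low, _ in elements.values())
    maximum = sum(high for _, high in elements.values())
    return minimum, maximum


def within_range(length, length_range):
    """Check whether a length falls inside an inclusive (minimum, maximum) range."""
    low, high = length_range
    return low <= length <= high


adaptor_range = total_length_range(ELEMENT_LENGTHS)
print(adaptor_range)
# The 64-nucleotide adaptor mentioned in the text falls inside this range.
print(within_range(64, adaptor_range))
```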

In applications involving bisulfite analysis, one or more of the adaptors may have all the cytosine bases in the methylated form. If the adaptors contain unmethylated cytosines, the adaptors are altered during the bisulfite conversion such that any unmethylated cytosines become uracil. Thus any adaptors attached to the sample prior to bisulfite exposure may be free of unmethylated cytosine bases.

Preferably, the adaptor sequence comprises 5-methylcytosines instead of cytosines, in order to prevent deamination of cytosines in the bisulfite conversion reaction. Preventing the conversion of cytosines in the adaptor sequence to U (read as T) may be useful in ensuring that the adaptor sequence is able to hybridise to the flowcell of the sequencing platform.
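The effect of methylating the adaptor cytosines can be illustrated by comparing a protected and an unprotected adaptor after an idealised conversion. This is a sketch only: the adaptor sequence and function names are hypothetical, and hybridisation is reduced to a simple count of mismatched positions.

```python
def adaptor_as_read(adaptor, cytosines_methylated):
    """Return the adaptor sequence as it would be read after bisulfite treatment.

    Methylated cytosines are left unchanged; unmethylated cytosines are
    converted to uracil, which is read as "T".
    """
    if cytosines_methylated:
        return adaptor
    return adaptor.replace("C", "T")


def mismatch_count(first, second):
    """Count positions at which two equal-length sequences differ."""
    if len(first) != len(second):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(first, second))


# Hypothetical 10-base adaptor sequence, for illustration only.
adaptor = "ACGTCCGATC"
protected = adaptor_as_read(adaptor, cytosines_methylated=True)
unprotected = adaptor_as_read(adaptor, cytosines_methylated=False)
print(protected, mismatch_count(adaptor, protected))
print(unprotected, mismatch_count(adaptor, unprotected))
```

In this example the unprotected adaptor differs from the designed sequence at every cytosine position, whereas the protected adaptor is unchanged and would still match its complementary primer.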

In some preferred embodiments, the 3' adaptor sequence may comprise one or more modified nucleotides or nucleotide analogues that are resistant to bisulfite damage. Preferably, the 3' adaptor sequence consists of modified nucleotides or nucleotide analogues or mimetics. Suitable modified nucleotides, nucleotide analogues and nucleotide mimetics are well known in the art. For example the adaptor sequence may comprise or consist of locked nucleic acid (LNA) nucleotides, peptide nucleic acid (PNA) nucleotides, glycol nucleic acid (GNA), threose nucleic acid (TNA), morpholino oligomers, 2-substituted nucleotides, such as 2-fluorocytosine, 2-aza-cytosine and 2-O-methylcytosine, 6-substituted nucleotides, such as 6-fluorocytosine, 6-O-methylcytosine, 6-aza-cytosine and/or other modified nucleotides. Suitable modified nucleotides prevent the bisulfite from forming an adduct with the bases, reducing the propensity for abasic site formation and hence reducing the chance of fragmentation.

An adaptor may comprise or consist of nucleotide sequences that are common to the members of the library i.e. each nucleic acid in the library may contain the same adaptor sequences. Libraries produced from different sources may be mixed before sequencing.

In some embodiments, one or both of the adaptors attached to a target nucleic acid may comprise an individual barcode or index nucleotide sequence that identifies the source of the nucleic acid (e.g. the sample) and allows the multiple samples to be sequenced in a multiplex sequencing reaction. Outside the index, the sequences of the adaptors or adaptor sequences may be the same for all the nucleic acids in the library.

The nucleotide sequence of the index allows unambiguous identification of reads from a specific sample in a pooled multiplex sequencing reaction. Each sample may have a unique index, so that all of the nucleic acid strands from the same sample receive the same index. Once prepared, populations of nucleic acid strands from different samples are mixed into a single pool and sequenced. The sample from which a sequence read from the pool originates may then be identified from the index. For example, a suitable index for the multiplex sequencing of 24 samples (a 24-plex reaction) may consist of at least 6 nucleotides, preferably 6 nucleotides (Craig DW et al. 2008. Nat Methods 5, 887; Cronn R et al. 2008. Nucleic Acids Res, 36, e122). The use of indexes, barcodes or identifiers in sequencing reactions is well-known in the art.

The adaptors at the 3' and 5' ends of the nucleic acids in a library produced as described herein from a first sample may have the same "core" sequence as the sequences at the 3' and 5' ends of nucleic acids in a library produced from a second sample except for the index, which is unique to the nucleic acid strands from a particular sample. For example, for the multiplex sequencing of n samples, an index of n/4 bases may differ in the sequences at the 3' and/or 5' ends of nucleic acids from different populations. This maintains the specificity of the adaptor sequences for the sequencing platform (e.g. solid support of the flowcell) whilst allowing discrimination between populations of nucleic acid strands prepared with adaptors having different indexes. For example, a sample may be allocated a unique index. Multiple samples may be pooled together in the same sequencing run and sequenced in parallel and then the sequences arising from each individual sample identified from the unique index sequences.
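Assigning pooled reads back to their samples by index can be sketched as below. The sketch assumes that each read has already been split into an index sequence and an insert sequence and that only exact index matches are accepted; the index sequences, sample names and function names are hypothetical.

```python
from collections import defaultdict


def index_capacity(index_length):
    """Number of distinct index sequences of a given length (4 bases per position)."""
    return 4 ** index_length


def demultiplex(reads, index_to_sample):
    """Group insert sequences by sample using the index of each read.

    reads: iterable of (index_sequence, insert_sequence) pairs.
    index_to_sample: dict mapping each index sequence to a sample name.
    Reads with an unrecognised index are collected under "undetermined".
    """
    by_sample = defaultdict(list)
    for index_sequence, insert_sequence in reads:
        sample = index_to_sample.get(index_sequence, "undetermined")
        by_sample[sample].append(insert_sequence)
    return dict(by_sample)


# Hypothetical 6-nucleotide indexes and pooled reads, for illustration only.
index_to_sample = {"ATCACG": "sample_1", "CGATGT": "sample_2"}
pooled_reads = [
    ("ATCACG", "TTGGAATT"),
    ("CGATGT", "GGTTAAGG"),
    ("ATCACG", "TATTGGTT"),
    ("NNNNNN", "TTTTGGGG"),
]
print(index_capacity(6))
print(demultiplex(pooled_reads, index_to_sample))
```

A 6-nucleotide index gives far more possible sequences than the 24 samples in the example, which leaves room to choose indexes that differ from one another at several positions.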

In some embodiments, the adaptor sequence may be added to the 3' ends of the target nucleic acids by ligating an adaptor oligonucleotide to the 3' ends. The adaptor oligonucleotide may be ligated to the 3' ends by any convenient ligation method. The adaptor oligonucleotide may be ligated to the 3' ends without ligation or other modification of the 5' ends of the nucleic acids.

The adaptor oligo

nucleotide sequen

sequence to which

comprises or cons

In some embodiments, the adaptor oligonucleotide may be linked to a binding tag, for example via a cleavable linker. This is described in more detail below.

The ligation of the adaptor oligonucleotide is directional, i.e. the 3' ends of the nucleic acids are ligated to the adaptor oligonucleotide and the 5' ends of the nucleic acids remain unadapted (i.e. no adaptor sequence is ligated to the 5' ends of the nucleic acids).

The adaptor oligonucleotide may be added to the 3' ends of the double-stranded target nucleic acids by any convenient method. For example, the adaptor oligonucleotide may be attached to the 3' ends by suitable ligation methods, including double-stranded ligation, single-stranded ligation, blunt-ended ligation or overhanging ligation.

Preferably, the adaptor oligonucleotide is added to the 3' ends of the double-stranded target nucleic acids as part of an at least partially double-stranded complex or molecule. For example, ligation may be carried out using an enzyme with double-stranded ligase activity in the presence of an inert hybridisation partner that hybridises to the adaptor oligonucleotide but does not ligate to the nucleic acids.

For example, the adaptor oligonucleotide may be hybridised to a complementary oligonucleotide to form a ligation complex that allows the adaptor oligonucleotide to be ligated to the double-stranded target nucleic acids using a double-strand specific ligase, such as T4 ligase, without ligation of the complementary oligonucleotide. The complementary oligonucleotide may be complementary to all or part of the adaptor oligonucleotide. For example, the complementary oligonucleotide may be complementary to the 5' end of the adaptor oligonucleotide, such that the ligation complex comprises a double stranded region at the 5' end of the adaptor oligonucleotide and a single stranded overhang at the 3' end of the adaptor oligonucleotide.

In some embodiments, the adaptor oligonucleotide of the ligation complex is ligated to the 3' ends of the double-stranded target nucleic acids but the complementary oligonucleotide of the complex is not ligated to the 5' ends of the double-stranded nucleic acids. For example, one or both of the 5' ends of the double-stranded target nucleic acids and the 3' end of the complementary oligonucleotide may be non-ligatable. The 5' ends of the double-stranded nucleic acids may be non-ligatable through the absence of a phosphate group and/or the 3' end of the complementary oligonucleotide may be non-ligatable through the absence of an OH group, for example due to the presence of a blocking group, such as a halogen, or more preferably a dideoxynucleotide.

For example, the complementary oligonucleotide may be 3' substituted or comprise a 3' dideoxynucleotide.

As described above, the directional ligation of the adaptor oligonucleotide specifically to the 3' ends of the nucleic acids may be facilitated by the modification of the nucleic acid ends. For example, a 3' overhanging adenine (A) residue may be present at the ends of the nucleic acids. The ligation complex may comprise a 3' overhanging T residue which facilitates ligation to nucleic acids in the population comprising a 3' overhanging A residue.

In some embodiments, the adaptor oligonucleotide may be ligated to the 3' ends of the nucleic acid population by a method comprising;

producing a population of nucleic acids lacking 5' phosphate groups and having overhanging 3' A residues,

contacting the population with a complex comprising the adaptor oligonucleotide hybridised to a complementary oligonucleotide, wherein the 3' residue of the complementary oligonucleotide forms a 3' T overhang in said complex,

ligating the population and the complex, such that the adaptor oligonucleotide is covalently linked to the 3' A residue of the nucleic acids but the complementary oligonucleotide is not covalently linked to the unphosphorylated 5' residue of the nucleic acids.
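The directionality of this ligation may be illustrated by a minimal sketch in which a junction is ligatable only if a 3' OH group meets a 5' phosphate group. The class and function names are hypothetical, and the presence of a 5' phosphate on the adaptor oligonucleotide is an assumption made for the purposes of the illustration.

```python
from dataclasses import dataclass


@dataclass
class End:
    """One terminus of a strand (illustrative model, not a chemical simulation)."""
    name: str
    five_prime_phosphate: bool = False  # meaningful when this is a 5' end
    three_prime_hydroxyl: bool = False  # meaningful when this is a 3' end


def can_ligate(three_prime_end: End, five_prime_end: End) -> bool:
    """A ligase joins a 3' OH to an adjacent 5' phosphate; both are required."""
    return three_prime_end.three_prime_hydroxyl and five_prime_end.five_prime_phosphate


# Target nucleic acid: free 3' OH on the overhanging A, no 5' phosphate.
target_3p = End("target 3' end", three_prime_hydroxyl=True)
target_5p = End("target 5' end", five_prime_phosphate=False)

# Ligation complex: adaptor oligonucleotide assumed to carry a 5' phosphate;
# complementary oligonucleotide blocked at its 3' end (e.g. a dideoxynucleotide).
adaptor_5p = End("adaptor 5' end", five_prime_phosphate=True)
complement_3p = End("complementary oligonucleotide 3' end", three_prime_hydroxyl=False)

adaptor_joined = can_ligate(target_3p, adaptor_5p)        # 3' end of target is adapted
complement_joined = can_ligate(complement_3p, target_5p)  # 5' end of target stays free
```

In this model either modification alone (the missing 5' phosphate or the blocked 3' end) is sufficient to prevent ligation at the 5' end of the target, consistent with the "one or both" wording above.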

In other embodiments, the adaptor sequence may be added to the 3' ends of the target nucleic acids by ligating a double-stranded adaptor to the ends of the target nucleic acids. The double stranded adaptor may comprise the adaptor sequence hybridised to a complementary sequence. The adaptor sequence may be ligated to the 3' ends of the double-stranded target nucleic acids and the complementary sequence may be ligated to the 5' ends of the double-stranded target nucleic acids. Any convenient double stranded ligation method may be used to ligate the double stranded adaptor. After ligation and optionally bisulfite treatment, the 5' ends of the nucleic acids may be cleaved to remove the complementary sequence and produce nucleic acids having an adaptor sequence at the 3' end but not at the 5' end.

In some embodiments, the adaptor sequence of the double-stranded adaptor may be linked to a binding tag, for example via a cleavable linker. This is described in more detail below.

Preferably, the double stranded adaptor comprises a cleavage site at the 3' end of the complementary sequence. After ligation of the double stranded adaptor, the adapted nucleic acids may be cleaved at the cleavage site to remove the complementary sequence ligated to the 5' ends of the nucleic acids, such that adapted nucleic acids have an adaptor sequence at the 3' end, but lack additional sequence at the 5' end.

In some embodiments, a double-stranded adaptor may comprise more than one cleavage site, for example two or three. For example, the double-stranded adaptor may comprise a cleavage site at the 3' end of the complementary sequence and one or more additional cleavage sites, for example within a hairpin sequence or elsewhere.

In preferred embodiments, the double stranded adaptor is added to the target nucleic acids before bisulfite treatment and the complementary sequence is removed after bisulfite treatment.

Preferably, the double-stranded adaptor is a hairpin adaptor which comprises a hairpin nucleotide sequence that links the adaptor sequence and the complementary sequence i.e. the double-stranded adaptor consists of a polynucleotide chain which forms a double stranded region and a single-stranded hairpin region. This may be useful in protecting the ends of the nucleic acids from damage, for example during bisulfite treatment.

Examples of the use of hairpin adaptors are shown in Figures 5 and 6.

Preferably, the hairpin adaptor comprises a first cleavage site at the 3' end of the complementary sequence and a second cleavage site at the 5' end of the adaptor sequence. Cleavage of the first and second cleavage sites produces a population of nucleic acids having the adaptor sequence at the 3' ends but lacking an adaptor sequence at the 5' ends.

Suitable cleavage sites include any site that is specifically cleavable by enzymatic, chemical or other means. Suitable cleavage sites are well known in the art and include modified nucleotides, such as 8-oxoguanine or 8-oxoadenine, which are cleavable by formamidopyrimidine [fapy]-DNA glycosylase (Fpg), and restriction endonuclease recognition sites.

In some preferred embodiments, the nucleic acids may be treated with bisulfite following addition of the adaptor sequence to the 3' ends.

In embodiments in which an adaptor oligonucleotide is ligated specifically to the 3' ends, the nucleic acids treated with bisulfite may have unmodified 5' ends. In embodiments in which a double stranded adaptor, such as a hairpin adaptor, is ligated specifically to the ends of the nucleic acids, the nucleic acids treated with bisulfite may have complementary sequences ligated to the 5' ends. These complementary sequences may be removed after the bisulfite treatment to produce bisulfite treated nucleic acids having an adaptor sequence at the 3' end but not the 5' end.

Treatment with bisulfite converts unmodified cytosine residues to uracil residues, thereby producing a population of nucleic acid strands comprising uracil residues instead of unmodified cytosines. This may be useful in bisulfite sequencing methods (BS-seq). Bisulfite treatment may also denature the population of double stranded nucleic acids. A method as described above may comprise treating the population of nucleic acids with bisulfite. In some embodiments, a method of preparing a nucleic acid library may comprise;

providing a population of double-stranded nucleic acids,

adding an adaptor sequence to the 3' ends to produce a population of double stranded nucleic acids having an adaptor sequence at the 3' end of each strand,

treating the nucleic acids with bisulfite, such that unmodified cytosine residues in said nucleic acids are converted to uracil,

denaturing the population of nucleic acids to produce nucleic acid strands,

hybridising an oligonucleotide primer to the adaptor sequences at the 3' ends of the nucleic acid strands,

extending the primer along the nucleic acid strands to produce complementary strands, the strands and complementary strands forming double-stranded nucleic acids that comprise an adaptor at a first end, and;

ligating a second adaptor to the second end of the double-stranded nucleic acids to produce a library of double-stranded nucleic acids having a first adaptor at a first end and a second adaptor at a second end.
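The sequence of steps above may be followed at the level of a single strand with a simple string model. This is a sketch under stated assumptions: the adaptor sequences are hypothetical, the first adaptor is treated as unaffected by bisulfite (for example because its cytosines are protected), and sequences are written 5' to 3'.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "U": "A"}


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def bisulfite_convert(seq, modified_positions=frozenset()):
    """Convert unmodified C to U; cytosines at modified_positions are retained."""
    return "".join(
        "U" if base == "C" and i not in modified_positions else base
        for i, base in enumerate(seq)
    )


def prepare_library_strands(insert, first_adaptor, second_adaptor,
                            modified_positions=frozenset()):
    # First adaptor at the 3' end, then bisulfite treatment. Only the insert is
    # converted here; the adaptor is assumed unaffected (a simplification).
    template = bisulfite_convert(insert, modified_positions) + first_adaptor
    # Primer extension from the 3' adaptor copies the template (U pairs with A).
    copy = reverse_complement(template)
    # The second adaptor is ligated at the previously unadapted end, which in
    # this model is the 3' end of the newly synthesised strand.
    return template, copy + second_adaptor


# Hypothetical input: the cytosine at position 1 is modified (e.g. 5mC).
template, library_strand = prepare_library_strands(
    insert="ACGTCC", first_adaptor="GGAT", second_adaptor="TTAA",
    modified_positions={1},
)
```

In this model the template strand contains uracil while the extended strand does not, which mirrors the uracil-containing and non-uracil containing strands described in this specification.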

The adaptor sequence may be added to the 3' ends of the nucleic acids by ligating an adaptor oligonucleotide to the 3' ends but not to the 5' ends of the nucleic acids.

Alternatively, the adaptor sequence may be added to the 3' ends of the nucleic acids by ligating a double-stranded adaptor comprising the adaptor sequence hybridised to a complementary sequence to the ends of the nucleic acids, preferably a hairpin adaptor, such that the adaptor sequence is ligated to the 3' ends of the double-stranded nucleic acids and the complementary sequence is ligated to the 5' ends of the double-stranded nucleic acids. After treatment with bisulfite, the 5' ends of the nucleic acids are cleaved to remove the complementary sequence.

A method of preparing a nucleic acid library may comprise;

providing a population of double-stranded nucleic acids,

ligating a double-stranded adaptor comprising the adaptor sequence hybridised to a complementary sequence to the ends of the nucleic acids, such that the adaptor sequence is ligated to the 3' ends of the double-stranded nucleic acids and the complementary sequence is ligated to the 5' ends of the double-stranded nucleic acids,

treating the nucleic acids with bisulfite, such that unmodified cytosine residues in said nucleic acids are converted to uracil,

cleaving the nucleic acids to remove the complementary sequence, thereby producing a population of nucleic acids having an adaptor sequence at the 3' end of each strand but not at the 5' end,

denaturing the population of nucleic acids to produce nucleic acid strands,

hybridising an oligonucleotide primer to the adaptor sequences at the 3' ends of the nucleic acid strands,

extending the primer along the nucleic acid strands to produce complementary strands, the strands and complementary strands forming double-stranded nucleic acids that comprise an adaptor at a first end, and;

ligating a second adaptor to the second end of the double-stranded nucleic acids to produce a library of double-stranded nucleic acids having a first adaptor at a first end and a second adaptor at a second end.

Preferably, the double-stranded adaptor is a hairpin adaptor comprising a single stranded hairpin nucleotide sequence that links the hybridised adaptor and complementary sequences, as described above.

In other embodiments, the nucleic acids may be treated with bisulfite before the addition of the adaptor oligonucleotide to the 3' ends. For example, the initial population of nucleic acids in step (i) above may be a bisulfite treated population of nucleic acids. The initial population of double-stranded nucleic acids may be provided by a method comprising;

(a) treating a population of nucleic acids with bisulfite to produce a population of bisulfite-treated nucleic acid strands,

(b) hybridising random primers to the bisulfite-treated nucleic acid strands, and;

(c) extending the random primers along the nucleic acid strands to generate complementary strands, thereby producing a population of double-stranded nucleic acids.

The population of double-stranded nucleic acids may then be treated in accordance with steps (i) to (vi) above.

The hybridisation and extension of random primers to convert nucleic acid strands into double stranded molecules is well-known in the art.

In some embodiments of the above methods, the nucleic acids may be subjected to an additional treatment before treatment with bisulfite. This may be useful, for example, in performing variants of standard bisulfite sequencing methods (BS-seq), for example to identify specific cytosine modifications, such as 5hmC, 5fC and 5caC.

For example, methods may comprise treating the nucleic acids with an oxidising agent, and then treating the oxidised nucleic acids with bisulfite. Suitable oxidising agents are well known in the art and include metal oxides, such as KRuO₄, MnO₂ and KMnO₄, and perruthenates, such as potassium perruthenate (KRuO₄). Techniques for oxidative bisulfite sequencing are well known in the art (OxBS-seq; Booth et al (2013) Nat Protoc 8 1841-1851; Booth et al (2012) Science 336 934-937) and reagents are available from commercial sources (e.g. Cambridge Epigenetix Ltd. UK).

Methods may comprise treating the nucleic acids with a reducing agent, and then treating the reduced nucleic acids with bisulfite. Suitable reducing agents are well-known in the art and include NaBH₄, NaCNBH₃ and LiBH₄. Techniques for reductive bisulfite sequencing are available in the art (redBS-seq; WO2013/017853).

Methods may comprise treating the nucleic acids with β-glucosyltransferase in the presence of UDP-Glucose to add a glucosyl protecting group to 5hmC residues in the nucleic acids; treating the nucleic acids with TET to oxidise 5mC residues in the nucleic acids to 5caC and then treating the TET-oxidised nucleic acids with bisulfite. Techniques for TET-assisted bisulfite sequencing are well-known in the art (TAB-seq; Yu et al (2012) Nat Protoc. 7 (12) 2159-2170; Yu et al Cell (2012) 149(6): 1368-1380) and reagents are available from commercial sources (e.g. Wisegene LLC USA).

Methods may comprise labelling 5caC residues in the nucleic acids with 1-ethyl-3-[3-dimethylaminopropyl]carbodiimide hydrochloride (EDC); and then treating the labelled nucleic acids with bisulfite. Techniques for chemical modification-assisted bisulfite sequencing are well-known in the art (CAB-seq; Lu et al J. Am. Chem. Soc. (2013) 135 (25) 9315-9317).

Methods may comprise labelling 5fC residues in the nucleic acids with O-ethylhydroxylamine; and then treating the labelled nucleic acids with bisulfite. Techniques for 5fC chemical modification-assisted bisulfite sequencing are well-known in the art (fCAB-seq; Song et al (2013) Cell 153 1-14).

The strands of nucleic acids in the libraries described herein may comprise nucleotide sequences that are bisulfite-treated (i.e. containing uracil instead of unmodified cytosine in the untreated sequence); nucleotide sequences that are the complement of bisulfite-treated sequences (i.e. containing adenine instead of unmodified cytosine in the untreated sequence); or nucleotide sequences that are the complement of the sequences complementary to bisulfite-treated sequences (i.e. containing thymine instead of unmodified cytosine in the untreated sequence). For example, unmodified cytosines may be replaced by uracil in nucleic acid strands following bisulfite treatment. Following primer extension, the resultant double-stranded molecules comprise a uracil-containing strand and a non-uracil containing complementary strand. Either or both of these strands may be subsequently isolated and sequenced. In some embodiments, it may be preferred to isolate and sequence the complementary strand to avoid the need to use uracil-tolerant polymerases. The sequences of bisulfite-treated nucleic acids (or complementary sequences thereto) may be useful in determining the presence or frequency of modified cytosine residues, such as 5mC, in samples of nucleic acid.
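The determination of modified cytosine residues from such sequences may be illustrated by comparing an untreated reference sequence with an aligned read of the bisulfite-treated strand. The following is a minimal sketch only: it assumes a gap-free alignment on the same strand, and the function names and example sequences are hypothetical.

```python
def call_cytosines(reference, read):
    """Classify each reference cytosine from an aligned bisulfite read."""
    if len(reference) != len(read):
        raise ValueError("reference and read must be aligned and equal in length")
    calls = {}
    for position, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[position] = "modified"    # retained, i.e. not converted
        elif read_base in ("T", "U"):
            calls[position] = "unmodified"  # converted to uracil, read as T
        else:
            calls[position] = "ambiguous"   # mismatch unrelated to conversion
    return calls


def modified_fraction(calls):
    """Fraction of informative cytosines called as modified, or None if none."""
    informative = [call for call in calls.values() if call != "ambiguous"]
    if not informative:
        return None
    return informative.count("modified") / len(informative)


# Hypothetical example: three reference cytosines, one retained in the read.
calls = call_cytosines("ACGTCCGA", "ACGTTTGA")
fraction = modified_fraction(calls)
```

A read of the complementary strand would first be reverse-complemented (or scored as G to A changes) before being compared in this way.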

Bisulfite treatment causes extensive depyrimidination and strand cleavage in populations of nucleic acids. For example, 50% or more, 60% or more, 70% or more, 80% or more, 90% or more, 95% or more, or 99% or more of the nucleic acids may be cleaved during bisulfite treatment. The nucleic acids in the populations used to produce libraries as described herein may include nucleic acids that are not cleaved by the bisulfite treatment and nucleic acids that are cleaved by the bisulfite treatment.

The bisulfite treated population therefore comprises nucleic acid strands of a range of sizes, depending on whether cleavage has occurred and its location relative to the 3' adaptor sequence. For example, a library produced as described herein may comprise nucleic acids ranging from 10bp to 5kb, 20bp to 2kb or 30bp to 1kb (Ehrich et al Nucleic Acids Research, 2007, Vol. 35, No. 5 e29).

Because bisulfite cleaved and uncleaved nucleic acids are represented in the libraries described herein, the libraries contain a greater proportion of the initial nucleic acid population than libraries that only comprise uncleaved nucleic acids. For example, the number of nucleic acids in the nucleic acid library may be greater than 0.1%, greater than 1%, greater than 5% or greater than 10% of the number of nucleic acids in the initial population.
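The effect of including cleaved strands may be illustrated with a toy simulation. The model below is an assumption made for illustration and is not taken from the present disclosure: each backbone bond is cleaved independently with a fixed probability, cleavage within the adaptor is ignored, and the parameter values are arbitrary.

```python
import random


def simulate(n_strands=2000, length_nt=200, p_cleave=0.01, seed=1):
    """Return (fraction of intact strands, lengths of 3' adapted fragments)."""
    rng = random.Random(seed)
    intact = 0
    adapted_fragment_lengths = []
    for _ in range(n_strands):
        # Bond i joins nucleotides i and i + 1 (numbered from the 5' end).
        cuts = [i for i in range(1, length_nt) if rng.random() < p_cleave]
        if not cuts:
            intact += 1
        # The fragment from the last cut to the 3' end keeps the 3' adaptor,
        # so every input strand contributes one adapted fragment.
        last_cut = cuts[-1] if cuts else 0
        adapted_fragment_lengths.append(length_nt - last_cut)
    return intact / n_strands, adapted_fragment_lengths


intact_fraction, adapted_lengths = simulate()

# Strands usable if only uncleaved molecules can enter the library,
# versus strands that still carry the 3' adaptor after cleavage.
uncleaved_only = intact_fraction
with_3p_adaptor = len(adapted_lengths) / 2000
```

Under these assumptions only the intact fraction would be available to a library restricted to uncleaved molecules, whereas every strand retains one 3' adapted fragment, of variable length, that can proceed to primer extension.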

Libraries produced by the methods described herein may contain sufficient sequenceable nucleic acid molecules to allow sequencing without amplification i.e. the nucleic acids in the library may be sequenced without being amplified. In preferred embodiments, a nucleic acid library is produced as described herein without any amplification of the nucleic acids in the sample.

Bisulfite treatment converts unmethylated cytosine residues in a polynucleotide into uracil. The use of bisulfite ions (HSO₃⁻) to convert unmethylated cytosines in nucleic acids into uracil is standard in the art and suitable reagents and conditions are well known to the skilled person. Numerous suitable protocols and reagents are also commercially available (for example, EpiTect™, Qiagen NL; EZ DNA Methylation™, Zymo Research Corp CA; CpGenome Turbo Bisulfite Modification Kit, Millipore).

In some embodiments, the population of double stranded nucleic acids may be treated with bisulfite by incubation with bisulfite ions (HSO₃⁻), for example sodium bisulfite (NaHSO₃). Suitable conditions for bisulfite treatment are well known in the art, and incubation times typically range from 1 to 16 hours.

Bisulfite treatment as described above may denature double-stranded 3' adapted nucleic acids to produce nucleic acid strands of a range of sizes that all have the adaptor sequence at the 3' end.

Following bisulfite treatment, the nucleic acid strands in the population are then converted into double stranded DNA molecules through the generation of a complementary strand.

In some embodiments, nucleic acids may be subjected to an additional denaturation step following bisulfite treatment.

In other embodiments, double-stranded nucleic acids may be denatured following addition of the 3' adaptor sequence without bisulfite treatment.

The nucleic acids may be denatured to disrupt any inter or intra molecular hybridisation. Denaturation converts 3' adapted double-stranded nucleic acids into single nucleic acid strands. The population of double-stranded nucleic acids may be denatured by any convenient method following the addition of the 3' adaptor sequence. For example, the nucleic acids may be denatured by heating or treatment with a chemical denaturant.

The complementary strand may be generated by annealing an oligonucleotide primer to the 3' adaptor of the nucleic acid strand and extending the primer in a 5' to 3' direction along the strands to synthesise complementary strands. The sequence of the oligonucleotide primer is preferably complementary to all or part of the 3' adaptor, so that it hybridises under standard hybridisation conditions.

Since the 3' adaptor is the same for all the nucleic acid strands in the population, complementary strands may be generated for all the nucleic acid strands using the same oligonucleotide primer.

In some embodiments, the oligonucleotide primer may be linked to a binding tag, for example via a cleavable linker. This is described in more detail below.

Suitable techniques and protocols for the hybridisation of oligonucleotide primers and primer extension along a single stranded template are well-known in the art and reagents are available from commercial sources.

Following the generation of the complementary strand, the population comprises double-stranded nucleic acids that have an adaptor at one end (i.e. an adapted first end) comprising the 3' adaptor sequence. The double-stranded nucleic acids in the population have a uracil-containing strand and a non-uracil containing complementary strand.

In some embodiments, the unadapted (i.e. second) end of the nucleic acids in the population may be repaired and/or adapted to facilitate ligation of the second adaptor. For example, 5' phosphate groups and/or 3' adenine overhangs may be added. Suitable methods for A tailing and/or phosphorylation are well-known in the art. The second ends of the double stranded nucleic acids are adapted through the ligation of the second adaptor. The second adaptor may have the same or a different nucleotide sequence to the adaptor. The second adaptor may be ligated to the second end by any convenient technique. In some preferred embodiments, the second adaptor may comprise a 3' T overhang at one or both ends to facilitate ligation to second ends that comprise a 3' A overhang. For example, the second adaptor may be ligated to the second ends by;

providing a population of nucleic acids having an adapted first end and a 3' A overhang and a 5' phosphate group at the second end, as described above,

contacting the population with a second adaptor comprising a 3' T overhang at an end,

ligating the second adaptor to the second end of the nucleic acids.

Other suitable methods of ligation are well-known in the art.

In preferred embodiments, nucleic acids that are sequenceable (i.e. molecules that comprise adaptors at both ends) are isolated from nucleic acids that are non-sequenceable (i.e. molecules not comprising adaptors at both ends). This may increase the reliability of library quantification.

For example, nucleic acids that comprise adaptors at both ends may be isolated, separated or removed from other nucleic acids.

Any suitable technique may be used to isolate nucleic acids that comprise adaptors at both ends. For example, the nucleic acids may be immobilised on a support and other nucleic acids washed away.

In some embodiments, immobilised nucleic acids may be interrogated directly. For example, the immobilised nucleic acids may be amplified, hybridised to probes or subjected to pyrosequencing or other sequencing methods. In other embodiments, the nucleic acids may be released from the solid support, following washing.

In some embodiments, nucleic acid strands that comprise the adaptor at the 3' end (i.e. sequenceable nucleic acids) are immobilised following treatment with bisulfite. 3' adapted nucleic acids may then be separated from nucleic acid strands cleaved by the bisulfite treatment and lacking a 3' adaptor. Preferably, the nucleic acid strands are immobilised before the double stranded nucleic acid is regenerated by primer extension. For example, the population of nucleic acid strands having a 3' adaptor sequence may be immobilised on a solid support following step (iii).

A method of preparing a nucleic acid library may comprise;

(i) providing a population of double-stranded target nucleic acids,

(ii) adding an adaptor sequence to the 3' ends of the strands of the target nucleic acids but not the 5' ends to produce a population of nucleic acids having an adaptor sequence at the 3' end but not the 5' end of each strand,

(iii) denaturing the population of nucleic acids, preferably by treating the population of nucleic acids with bisulfite such that unmodified cytosine residues in said molecules are converted to uracil, thereby producing a population of nucleic acid strands having the adaptor sequence at the 3' end,

(iv) immobilising the population of nucleic acid strands on a solid support,

(v) hybridising an oligonucleotide primer to the adaptor sequences at the 3' ends of the nucleic acid strands,

(vi) extending the primer along the nucleic acid strands to produce complementary strands, the strands and complementary strands forming double-stranded nucleic acids that comprise an adaptor at a first end, and;

(vii) ligating a second adaptor to the second end of the double-stranded nucleic acids to produce a library of double-stranded nucleic acids having a first adaptor at a first end and a second adaptor at a second end, and

(viii) optionally releasing the library of double-stranded nucleic acids from the solid support.

The nucleic acid strands may be immobilised through a binding tag that is linked to the adaptor sequence, for example via a chemical linker. The binding tag may be linked to the adaptor sequence, such that addition of the adaptor sequence, for example by ligation of the adaptor oligonucleotide, links the binding tag to the nucleic acid strands.

In other embodiments, the population of nucleic acids may be immobilised following primer extension and the generation of the complementary strand. For example, the population of nucleic acids comprising the first adaptor may be immobilised on a solid support following step (v). Preferably, the nucleic acids having the first adaptor are immobilised through a binding tag that is linked to the regenerated complementary strand. For example, the binding tag may be covalently linked to the oligonucleotide primer, such that hybridisation of the oligonucleotide primer and subsequent generation of the complementary strand links the binding tag to the regenerated double stranded DNA molecules.

A method of preparing a nucleic acid library may comprise;

(i) providing a population of double-stranded target nucleic acids,

(ii) adding an adaptor sequence to the 3' ends of the strands of the target nucleic acids to produce a population of nucleic acids having an adaptor sequence at the 3' end of each strand,

(iii) treating the population of nucleic acids with bisulfite such that unmodified cytosine residues in said molecules are converted to uracil, thereby producing a population of nucleic acid strands having the adaptor sequence at the 3' end,

(iv) hybridising an oligonucleotide primer to the adaptor sequences at the 3' ends of the nucleic acid strands,

(vi) extending the primer along the nucleic acid strands to produce complementary strands, the strands and complementary strands forming double-stranded nucleic acids that comprise an adaptor at a first end;

(vii) immobilising the complementary strands of the population of double-stranded nucleic acids on a solid support,

(viii) ligating a second adaptor to the second end of the double-stranded nucleic acids to produce a library of double-stranded nucleic acids having a first adaptor at a first end and a second adaptor at a second end, and;

(ix) optionally releasing the library of nucleic acids from the solid support.

The nucleic acids may be isolated and/or purified through the binding of a capture member to the binding tag.

The capture member and the binding tag may form a specific binding pair. Numerous suitable combinations of binding tags and capture members that form specific binding pairs are available in the art. Suitable specific binding pairs may include antibody/immunogenic epitope, such as anti-digoxigenin antibody/digoxigenin; glutathione S-transferase/glutathione; and biotin/biotin binding protein. For example, the binding tag may be an antigen, such as digoxigenin, glutathione, or biotin and the capture member may be an antibody, such as an anti-digoxigenin antibody, glutathione-S-transferase, or a biotin-binding protein, such as streptavidin, avidin, anti-biotin antibody or neutravidin, respectively.

In some preferred embodiments, the tag is biotin and the capture member is streptavidin.

The capture member may be immobilised, for example on a solid support. Binding of the tag to the capture member immobilises the nucleic acid linked to the tag on the solid support.

A solid support is an insoluble body which presents a surface on which the capture member can be immobilised for capture of the labelled nucleic acid. Examples of suitable supports include glass slides, microwells, membranes, or microbeads. The support may be in particulate or solid form, including for example a plate, a test tube, bead, a ball, filter, fabric, polymer or a membrane. Nucleic acids may, for example, be fixed to an inert polymer, a 96-well plate, other device, apparatus or material which is used in nucleic acid sequencing or other investigative context. The immobilisation of polynucleotides to the surface of solid supports is well-known in the art. In some embodiments, the solid support itself may be immobilised. For example, microbeads may be immobilised on a second solid surface. In preferred embodiments, the solid support may be a magnetic bead.

Following immobilisation, the nucleic acid-binding tag-capture member complex may be washed, for example, to remove non-immobilised molecules from its environment, including unlabelled nucleic acids and other reagents and molecules. Suitable techniques and reagents for washing immobilised complexes are well-known in the art.

The nucleic acids may then be released from the solid support using any convenient technique to produce a nucleic acid library.
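The capture and washing steps may be pictured with a simple model in which only molecules carrying a binding tag that pairs with the immobilised capture member are retained. The class and function names and the example population are hypothetical; the binding pairs are those named above.

```python
from dataclasses import dataclass
from typing import Optional

# Binding tag -> capture member, for specific binding pairs named above.
SPECIFIC_BINDING_PAIRS = {
    "biotin": "streptavidin",
    "digoxigenin": "anti-digoxigenin antibody",
    "glutathione": "glutathione-S-transferase",
}


@dataclass(frozen=True)
class NucleicAcid:
    name: str
    binding_tag: Optional[str] = None  # None models an unlabelled molecule


def capture_and_wash(population, capture_member):
    """Return (immobilised, washed_away) for a support bearing capture_member."""
    immobilised, washed_away = [], []
    for molecule in population:
        if SPECIFIC_BINDING_PAIRS.get(molecule.binding_tag) == capture_member:
            immobilised.append(molecule)
        else:
            washed_away.append(molecule)
    return immobilised, washed_away


# Hypothetical population after bisulfite treatment of 3' adapted strands.
population = [
    NucleicAcid("3' adapted strand", binding_tag="biotin"),
    NucleicAcid("cleaved fragment lacking adaptor"),
    NucleicAcid("excess unligated insert"),
]
immobilised, washed_away = capture_and_wash(population, "streptavidin")
```

Release from the support, for example by cleavage of a cleavable linker, would then return the immobilised molecules as the library.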

The oligonucleotide primer or oligonucleotide adaptor may be linked to the binding tag through a cleavable linker, for example a linker comprising a chemically sensitive cleavage site. This may facilitate release of the nucleic acids from the solid support.

The cleavable linker is attached to a terminus of the backbone of the nucleic acids in the population through a chemical modification, such as a 5' modified benzaldehyde group, and is not attached to a base in the nucleic acid.

The linker is chemically cleavable by the disruption of one or more covalent bonds to separate the ends of the probe. Preferably, the linker comprises a cleavage site which may be chemically cleaved under appropriate conditions. A range of suitable cleavage sites are available in the art, including azide masked hemiaminal ethers, protected hemiaminal ethers, phosphine containing groups, silicon containing groups, disulphides, cyanoethyl groups and photocleavable groups. Examples of suitable cleavage chemistries are shown in Figure 2. In some embodiments, the linker may comprise an azide masked hemiaminal ether site. Azide masked hemiaminal ether sites (-OCHN 3 -) may be cleaved by reduction of the azide to an amine, followed by spontaneous hemiaminal ether cleavage ( reaction 1 in Figure 2) .

Suitable reducing agents include phosphines (e.g.: TCEP) , thiols (e.g.: DTT, EDT) and metal-ligand complexes, including

organometallic Ru-, Ir-, Cr-, Rh- and Co- complexes. Suitable metal- ligand complexes may include organometallic ruthenium (II)

complexes, for example ruthenium (II) polypyridine complexes, tris (bipyridine ) ruthenium ( II ) (Ru(bpy) 3 2+ ) and salts thereof, including Ru(bpy) 3 Cl 2 . Other suitable metal-ligand complexes may include organometallic iridium (II) complexes for example iridium polypyridine complexes, such as Ir (ppy) 2 ( dtb-bpy) +, where ppy is phenylpyridine and dtb-bpy is 4, ' -di-tert-butyl-2 , 2 ' -bipyridine, and salts thereof.

The linker may comprise a protected hemiaminal ether site. Protected hemiaminal ether sites may be cleaved by removal of the amine protecting group, followed by spontaneous hemiaminal ether cleavage (reaction 2 in Figure 2). Suitable protecting groups include allyl or allyl carbamates, which may be cleaved using transition metals with water soluble ligands (e.g. Pd with water soluble phosphine ligands); sulfmoc, which may be cleaved with a mild base (e.g. 1% Na2CO3); m-chloro-p-acyloxybenzyl carbamate, which may be cleaved with mild base (e.g. 0.1 M NaOH); and 4-azidobenzyl carbamate, which may be cleaved with reducing agents (e.g. TCEP, DTT).

The linker may comprise a phosphine containing site. Phosphine containing sites, for example comprising the structure shown in reaction 3 of Figure 2, may be cleaved by the addition of an azide reagent, for example an alkyl or aryl azide, such as benzyl azide. The Staudinger aza-ylid generated reacts intramolecularly with an ester to release the captured DNA.

The linker may comprise a silicon containing site. Silicon containing sites may be cleaved by vicinal elimination of silicon in the presence of fluoride ions, such as KF and tetra-n-butylammonium fluoride (TBAF) (reaction 4 in Figure 2).

The linker may comprise a disulfide site. Disulfide sites may be cleaved by reduction with phosphines, such as TCEP, or thiols, such as DTT.

The linker may comprise a cyanoethyl site. Cyanoethyl sites may be cleaved under basic conditions, such as NH3 or 10% K2CO3.

The linker may comprise a photocleavable site. Photocleavable sites may be cleaved by treatment with UV light, preferably of a sufficiently long wavelength so as to not damage DNA. Suitable photocleavable sites are well known in the art. For example, an orthonitrobenzyl group may be cleaved by UV at 365 nm.

Other suitable cleavage sites are well known in the art.
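By way of illustration only, the pairing of cleavage sites with cleavage agents described above may be summarised as a simple lookup table. The sketch below merely restates the examples given in the preceding paragraphs; the dictionary name, the key spellings and the function `release_agents` are hypothetical conveniences and form no part of the described method.

```python
# Illustrative lookup of the example cleavage agents named in the text.
# Keys and the helper name are hypothetical; the reagent lists only restate
# the examples given in the description and are not exhaustive.
CLEAVAGE_AGENTS = {
    "azide_masked_hemiaminal_ether": ["TCEP", "DTT", "EDT", "Ru(bpy)3Cl2"],
    "protected_hemiaminal_ether": [
        "Pd with water soluble phosphine ligands (allyl or allyl carbamate)",
        "1% Na2CO3 (sulfmoc)",
        "0.1 M NaOH (m-chloro-p-acyloxybenzyl carbamate)",
        "TCEP or DTT (4-azidobenzyl carbamate)",
    ],
    "phosphine": ["benzyl azide"],
    "silicon": ["KF", "TBAF"],
    "disulfide": ["TCEP", "DTT"],
    "cyanoethyl": ["NH3", "10% K2CO3"],
    "photocleavable": ["UV light at 365 nm (orthonitrobenzyl group)"],
}


def release_agents(site):
    """Return the example cleavage agents listed for a cleavage site."""
    key = site.strip().lower().replace(" ", "_").replace("-", "_")
    if key not in CLEAVAGE_AGENTS:
        raise ValueError(f"no example cleavage agent listed for site: {site!r}")
    # Return a copy so callers cannot modify the table by accident.
    return list(CLEAVAGE_AGENTS[key])
```

A release buffer for a given linker could then be chosen by looking up its cleavage site, for example `release_agents("silicon")`.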

A suitable linker may have a total length not exceeding the length of a normal alkyl chain of 2-20 carbons and may comprise from one to about 50 atoms. For example, a suitable linker may have the formula: R1-Cs-R2, wherein Cs is the cleavage site and R1 and R2 are independently absent or possess a length not exceeding the length of a normal alkyl chain of 2-20 carbons and may comprise from one to about 50 atoms. Suitable R1 and R2 may be selected from the group consisting of substituted or unsubstituted alkyl, substituted or unsubstituted alkenyl, substituted or unsubstituted alkynyl, substituted or unsubstituted cycloalkyl, substituted or unsubstituted cycloalkenyl, substituted or unsubstituted aryl, substituted or unsubstituted heteroaryl, substituted or unsubstituted heteroalicyclyl, substituted or unsubstituted ether, substituted or unsubstituted thioether, and substituted or unsubstituted amines.

The nucleic acids may be released from the solid support by chemical cleavage of the cleavage site in the linker. This may, for example, release the nucleic acid from a nucleic acid-binding tag-capture member complex. Cleavage of the linker separates the nucleic acid from the binding tag which remains bound to the capture member.

Immobilisation through a binding tag linked to the non-uracil containing complementary strand allows the isolation or separation of the complementary strand from the original strand. This may be useful for example, because the complementary strand may be loaded directly onto a sequencer and, as it lacks uracil residues, it may be sequenced using standard techniques. For example, step (ix) above may comprise;

(a) denaturing the immobilised nucleic acids to produce a population of immobilised complementary nucleic acid strands, and

(b) releasing the complementary nucleic acid strands from the support to produce a nucleic acid library comprising nucleic acid strands having a first adaptor at a first end and a second adaptor at a second end.

The immobilised nucleic acids may be denatured, for example using a denaturant such as NaOH, in accordance with standard techniques.

Many techniques and protocols for nucleic acid manipulation, including fragmentation, end-repair, ligation, A-tailing, and primer extension as described herein, are known in the art (see, for example, Molecular Cloning: a Laboratory Manual: 3rd edition, Russell et al., 2001, Cold Spring Harbor Laboratory Press; Protocols in Molecular Biology, Second Edition, Ausubel et al. eds. John Wiley & Sons, 1992).

Methods described herein may comprise interrogating the nucleic acids in the library to identify one or more bases. In preferred embodiments, a method may comprise sequencing one or more, preferably all, of the nucleic acids in the nucleic acid library. Nucleic acids may be sequenced using any convenient low or high throughput sequencing technique or platform, including Sanger sequencing, Solexa-Illumina sequencing, ligation-based sequencing (SOLiD™), pyrosequencing, Pacific Biosciences single molecule real time sequencing (SMRT™), and semiconductor array sequencing (Ion Torrent™). Suitable protocols, reagents and apparatus for nucleic acid sequencing are well-known in the art and are available commercially.

Preferably, the nucleic acids in the library are sequenced without amplification.

Following preparation according to some embodiments described herein, a strand of the nucleic acids in the library may contain uracil residues. These nucleic acids may be sequenced using a uracil tolerant polymerase. Nucleic acid strands lacking uracil residues (i.e. the complement of the uracil containing strand) may be prepared and sequenced using standard techniques.

In other embodiments, the nucleic acids in the nucleic acid library may be interrogated by PCR, by hybridisation (for example to an array of immobilised probes) or by other analysis methods.

A method may comprise determining the identity of a residue in nucleic acids in the library at a position that corresponds to cytosine in a non-bisulfite treated nucleic acid. This may allow the identification of modified cytosine residues, such as 5-methylcytosine.
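By way of illustration only, the read-out logic described above may be sketched in code. The sketch assumes an idealised bisulfite conversion in which every unmodified cytosine is converted to uracil (read as T on sequencing) and every 5-methylcytosine is left as C; the function names and the restriction to a single strand are hypothetical simplifications.

```python
def bisulfite_convert(seq, methylated_positions):
    """Idealised read-out of a strand after bisulfite treatment.

    Unmodified C is converted to U and read as T; a C at a position listed
    in `methylated_positions` (0-based) is protected and still read as C.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_cytosines(reference, converted):
    """Classify each reference cytosine from the bisulfite-treated read."""
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference, converted)):
        if ref != "C":
            continue
        if obs == "C":
            calls[i] = "modified"      # resisted conversion
        elif obs == "T":
            calls[i] = "unmodified"    # converted to uracil
        else:
            calls[i] = "ambiguous"     # sequencing error or variant
    return calls
```

In this sketch a residue that still reads as C at a position corresponding to cytosine in the untreated nucleic acid is called as modified.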

The extent or amount of cytosine modification in the sample nucleotide sequence may be determined. For example, the proportion or amount of 5-methylcytosine at a position in a nucleotide sequence compared to unmodified cytosine may be determined in a sample.
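As a minimal sketch of how such a proportion might be computed, the bases observed at one reference cytosine across several bisulfite-treated reads can be tallied and the fraction reading as C taken as the modification level. The function name is hypothetical, and treating only C and T as informative is an assumption of this sketch.

```python
from collections import Counter


def methylation_level(bases_at_position):
    """Fraction of informative reads (C or T) that read as C at one position.

    Returns None when no informative read covers the position.
    """
    counts = Counter(bases_at_position)
    informative = counts["C"] + counts["T"]
    if informative == 0:
        return None
    return counts["C"] / informative
```

For example, three reads showing C and one showing T at the same position would give a level of 0.75.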

Another aspect of the invention provides a kit for use in the preparation of a nucleic acid library, for example using a method described above, comprising;

an adaptor oligonucleotide,

a complementary oligonucleotide,

an oligonucleotide primer, and

a double-stranded second adaptor.

Other kits for use in the preparation of a nucleic acid library as described herein may comprise;

a double stranded adaptor comprising an adaptor sequence and a complementary sequence,

an oligonucleotide primer, and

a double-stranded second adaptor.

Suitable adaptors, oligonucleotides and primers are described in detail above.

The complementary oligonucleotide and the oligonucleotide primer are hybridisable to the adaptor oligonucleotide and are preferably complementary to all or part of the adaptor oligonucleotide.

As discussed above, the adaptor oligonucleotide may comprise or consist of nucleotide analogues, such as PNA or LNA, or modified nucleotides, such as 2' substituted nucleotides, and may be resistant to bisulfite damage.

In some embodiments, the complementary oligonucleotide may lack a 3' OH group. For example, the complementary oligonucleotide may comprise a 3' blocking group, such as a 3' halogen group, or may comprise a 3' dideoxynucleotide.

One of the adaptor oligonucleotide and the oligonucleotide primer may be linked to a binding tag via a cleavable linker. Cleavable linkers and binding tags are discussed above.

The kit may further comprise a bisulfite reagent (HSO3-), as described above.

The kit may further comprise nucleic acid isolation reagents. Suitable reagents are well-known in the art and include spin-chromatography columns.

The kit may further comprise end-repair reagents, for example reagents to produce blunt ended nucleic acid fragments. Suitable reagents are well-known in the art and may include a 5'→3' polymerase and a 3'→5' exonuclease. For example, the end-repair reagents may comprise T4 DNA polymerase and Klenow fragment. In some preferred embodiments, the end-repair reagents produce blunt ended nucleic acid fragments lacking 5' phosphate groups and do not include a 5' kinase, such as T4 kinase.

In embodiments in which the kit comprises a hairpin adaptor containing one or more cleavage sites, the kit may further comprise a cleavage agent which cleaves the adaptor at the cleavage sites. Suitable cleavage agents are described in more detail above.

The kit may further comprise end-modification reagents, for example reagents for the addition of a 3' A tail, such as dATP and Taq DNA Polymerase.

The kit may further comprise one or more reagents for performing a variant bisulfite sequencing method. For example, a kit may comprise an oxidising agent, such as a metal oxide, such as KRuO4, MnO2 and KMnO4, or a perruthenate, such as potassium perruthenate (KRuO4), for oxidative bisulfite sequencing. A kit may comprise a reducing agent, such as NaBH4, NaCNBH4 or LiBH4, for reductive bisulfite sequencing. A kit may comprise a β-glucosyltransferase, UDP-Glucose and a TET enzyme for TET-assisted bisulfite sequencing. A kit may comprise 1-ethyl-3-[3-dimethylaminopropyl]carbodiimide hydrochloride (EDC) or O-ethylhydroxylamine for chemical modification-assisted bisulfite sequencing and 5fC chemical modification-assisted bisulfite sequencing, respectively.

A kit may include one or more other reagents required for the method, such as buffer solutions, sequencing reagents and other reagents. For example, the kit may further comprise a labelling buffer for attachment of the capture member to nucleic acid containing the binding tag.

The kit may further comprise a release buffer for cleavage of a cleavable linker which is attached to nucleic acid. Suitable release buffers depend on the cleavage chemistry involved and may comprise a reducing agent, for example a thiol, phosphine or metal-ligand complex reducing agent, as described above.

The kit may further comprise a capture member. The capture member may bind specifically to the binding tag of the oligonucleotide adaptor or primer in the kit.

The kit may further comprise a solid support. The solid support may be coated or coatable with the capture member. Suitable solid supports are described above and include magnetic beads. In some preferred embodiments, the binding tag is biotin and the solid support is streptavidin-coated magnetic beads. A magnet may be included in the kit for purification of the magnetic beads.

The kit may further comprise sequencing reagents. For example, the kit may comprise a uracil-tolerant polymerase.

The kit may further comprise one or more oligonucleotides or nucleic acids for use as controls. A suitable positive control oligonucleotide or nucleic acid may comprise at least one modified cytosine residue. A suitable negative control oligonucleotide or nucleic acid may be devoid of modified cytosines. Control oligonucleotides may be made synthetically by standard methods.

In some embodiments, the kit may comprise a DNA strand for quantitation.

A kit for use in preparation of a nucleic acid library may include one or more articles and/or reagents for performance of the method, such as means for providing the test sample itself, including DNA and/or RNA isolation and purification reagents, sample handling containers (such components generally being sterile), and other reagents required for the method, such as buffer solutions, sequencing reagents and other reagents.

The kit may include instructions for use in a method of preparation of a nucleic acid library as described above.

Another aspect of the invention provides the use of a kit as set out above in the preparation of a nucleic acid library, for example using a method described above.

Certain aspects and embodiments of the invention will now be illustrated by way of example and with reference to the figures described below.

Figure 1 shows quantification of the full-length target DNA before/after a 16 h bisulfite incubation (triangle). A 10-fold dilution series is shown as a standard plot (circle).

Figure 2 shows examples of cleavage chemistries which may be used in a cleavable linker in the library preparation protocols described herein.

Figure 3 shows a library preparation protocol according to an embodiment of the invention.

Figure 4 shows a library preparation protocol according to another embodiment of the invention.

Figure 5 shows a library preparation protocol according to another embodiment of the invention.

Figure 6 shows a library preparation protocol according to another embodiment of the invention.

Figure 7 shows qPCR of control after BS treatment and sample DNA using the modified protocol described above.

Figure 8 shows an Agilent Tapestation electropherogram of the DNA fragment distribution of the sample mixture, analysed after qPCR.

Figure 9 shows a first set of results of a sequencing run on 150 bp paired ends using an Illumina MiSeq instrument and 500 ng of E. coli genomic DNA. Upper track shows DNA prepared by a new PCR free BS treatment process according to an embodiment of the invention. Lower track shows DNA prepared using a standard process comprising BS treatment and 15 cycles of PCR.

Figure 10 shows a second set of results of a sequencing run on 150 bp paired ends using an Illumina MiSeq instrument and 500 ng of E. coli genomic DNA. Upper track shows DNA prepared by a new PCR free BS treatment process according to an embodiment of the invention. Lower track shows DNA prepared using a standard process comprising BS treatment and 15 cycles of PCR.

Figure 11 shows the genomic coverage of a BS sequencing preparation described herein relative to standard BS sequencing preparation as a log2 ratio (old method/new method). Coverage was summed in 100 base pair windows along the genome, and normalized to correct for differences in mapped reads. Output values are negative where the old method has fewer reads and positive where the new method has fewer reads. A noise threshold was set at +/- 1.5, beyond which a window was defined as a 'gap' (points), i.e. a window that has significantly fewer reads than it should.
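The windowed comparison described for Figure 11 may be sketched as follows. This is an illustration under stated assumptions rather than the analysis actually used: per-base depth lists are taken as input, normalisation is by total summed coverage as a stand-in for mapped read number, and a pseudocount (not mentioned in the text) is added so that empty windows do not produce an undefined logarithm. All function names are hypothetical.

```python
import math


def window_sums(depth, window=100):
    """Sum per-base depth in consecutive, non-overlapping windows."""
    return [sum(depth[i:i + window]) for i in range(0, len(depth), window)]


def log2_coverage_ratio(old_depth, new_depth, window=100, pseudocount=1.0):
    """log2(old/new) per window after normalising each track to its total."""
    old_w = window_sums(old_depth, window)
    new_w = window_sums(new_depth, window)
    old_total = sum(old_w) or 1
    new_total = sum(new_w) or 1
    ratios = []
    for o, n in zip(old_w, new_w):
        o_norm = (o + pseudocount) / old_total   # pseudocount is an assumption
        n_norm = (n + pseudocount) / new_total
        ratios.append(math.log2(o_norm / n_norm))
    return ratios


def classify_gaps(ratios, threshold=1.5):
    """Label windows whose log2 ratio lies beyond the noise threshold."""
    labels = []
    for r in ratios:
        if r < -threshold:
            labels.append("gap_old")   # old method has far fewer reads
        elif r > threshold:
            labels.append("gap_new")   # new method has far fewer reads
        else:
            labels.append("ok")
    return labels
```

With this sign convention a window labelled `gap_old` corresponds to a negative value in Figure 11 and a window labelled `gap_new` to a positive value.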

Figure 12A shows Tapestation images using standard sensitivity of ladder (left), Illumina adapter ligated product (middle) and hairpin ligated adapter product (right).

Figure 12B shows Tapestation images using high sensitivity of ladder (left), bisulfite treated Illumina adapted DNA (middle) and bisulfite treated hairpin adapted DNA (right).

Figure 13 Quality of P. berghei sequence data obtained with the method of the invention involving adapter attachment at one end of the sample fragments followed by bisulfite treatment and attachment of the second adaptor (REBUiLT) and the known method PCR-BS. (a) Summary of the retention of raw sequence data through bioinformatic preprocessing, showing the REBUiLT method consistently produced high quality data. Trimming indicates removal of adapter contamination and low quality sequences, and alignment was to a chimeric P. berghei/M. musculus genome to remove any host contamination - the mapping to P. berghei shows the fraction of reads aligning to the parasitic genome. Throughout the process the REBUiLT method retained almost twice the number of reads, reducing the sequencing effort required. (b) The distribution of mapped reads in 50 bp windows across the genome is shown. The Poisson distribution (dashed line) describes the ideal distribution in the absence of external biases. REBUiLT approximates this distribution, while PCR-BS does not.
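The ideal distribution referred to in Figure 13(b) may be illustrated with a short calculation. The sketch assumes that reads fall uniformly at random across the genome, so that the number of reads per window follows a Poisson distribution with mean equal to the read number divided by the number of windows; the function names and example parameters are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k reads when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def expected_window_histogram(n_reads, genome_size, window=50, max_k=10):
    """Expected number of windows containing 0..max_k reads.

    Assumes reads are placed uniformly at random (no external bias).
    """
    n_windows = genome_size / window
    lam = n_reads / n_windows          # mean reads per window
    return [n_windows * poisson_pmf(k, lam) for k in range(max_k + 1)]
```

An observed histogram of reads per window can be compared against this expectation; a pronounced excess of empty or heavily covered windows would indicate bias of the kind attributed to PCR-BS above.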

Figure 14 The effect of GC content on depth of coverage. (a) A genome browser view showing the coverage obtained across the P. berghei apicoplast for both methods. While near constant for REBUiLT, there are distinct read pile-ups in PCR-BS that appear to track GC content. (b) The GC content was calculated in 300 bp windows across the genome and plotted against the normalized informative read count. Optimally, no correlation would be observed; however, PCR amplification induces a strong preference for more balanced base compositions. (c) In P. berghei the GC content has a distinct profile across exons (dashed line). REBUiLT depth of coverage is unaffected across these genomic features, while PCR-BS tracks the GC percentage closely.

Figure 15 Cytosine methylation in P. berghei. (a) The sequence context of significantly methylated sites was found to be almost entirely CHH. Even against the background of all genomic cytosine contexts, there is a strong preference for CAH methylation in particular (H = A, C or T). (b) The distribution of methylation across genomic features. (c) The profile of 5mC levels over exons. Traditional PCR-BS gives the same profile as ReBuilT, but overestimates the 5mC levels.
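The sequence contexts referred to in Figure 15(a) may be assigned as in the following sketch, in which H denotes any base other than G. The function name is hypothetical, and only the strand given as input is considered; cytosines on the opposite strand would be classified from the reverse complement.

```python
def cytosine_context(seq, i):
    """Return 'CG', 'CHG' or 'CHH' for the cytosine at 0-based index i.

    Returns None when the downstream bases are missing or not A, C, G or T.
    """
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError("position does not hold a cytosine")
    downstream = seq[i + 1:i + 3]
    if any(base not in "ACGT" for base in downstream):
        return None
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) < 2:
        return None            # too close to the end to classify
    return "CHG" if downstream[1] == "G" else "CHH"
```

A CAH site, as discussed above, is the subset of CHH sites (and CHG sites are excluded) in which the base immediately following the cytosine is A and the next base is not G.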

Figure 16 Duplication Rates. Duplicate reads obtained using read one only. The dashed horizontal line indicates the expected duplication rate given the read number and genome size. The ReBuilT libraries show a small increase over the expected value, while the PCR-BS libraries show over double the expected duplication rate. The observed duplication rate includes PCR duplicates, but is also affected by uneven coverage. Local increases in coverage will increase the observed duplication rate over the expected.
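The text does not state how the expected duplication rate in Figure 16 was derived. One possible model, offered here purely as an assumption, treats each read as starting at a position chosen uniformly at random from the available start positions, so that the expected number of distinct start positions follows the standard occupancy approximation. The function name and parameters are hypothetical.

```python
import math


def expected_duplication_rate(n_reads, n_positions):
    """Expected fraction of duplicate reads under uniform random sampling.

    n_positions is the number of distinct possible read start positions
    (for example, genome size multiplied by two strands).
    """
    if n_reads == 0:
        return 0.0
    # Expected distinct positions hit: G * (1 - exp(-N / G)).
    expected_unique = n_positions * -math.expm1(-n_reads / n_positions)
    return 1.0 - expected_unique / n_reads
```

Under this model, an observed duplication rate well above the returned value would reflect PCR duplicates, uneven coverage, or both, as noted in the legend.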

Figure 17 Sequence composition and coverage. Plot shows the normalized read count against local GC content of the reference genome in 300 base pair windows. Ideally, the GC content of a window should have no impact on the read count. The ReBuilT libraries exhibit a remarkable insensitivity to the base composition of the window. However, the PCR-BS libraries show a strong preference for more balanced base compositions, as evidenced by the positive correlation of 0.69.
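A comparison of the kind shown in Figure 17 may be sketched as follows. The sketch assumes that the correlation reported is a Pearson correlation between per-window GC fraction and per-window normalized read count, which the text does not specify; the function names are hypothetical.

```python
import math


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(genome, window=300):
    """GC fraction in consecutive, non-overlapping full-length windows."""
    return [
        gc_fraction(genome[i:i + window])
        for i in range(0, len(genome) - window + 1, window)
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return float("nan")    # correlation undefined for constant input
    return cov / (sd_x * sd_y)
```

A coefficient near zero would correspond to the insensitivity to base composition described for the ReBuilT libraries, whereas a clearly positive coefficient would correspond to the behaviour described for PCR-BS.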

The invention may be described by the embodiments shown below:

1.1 A method of preparing a nucleic acid library comprising;

(i) providing a population of double-stranded nucleic acids,

(ii) adding an adaptor sequence to the 3' ends to produce a population of double stranded nucleic acids having an adaptor sequence at the 3' end of each strand,

(iii) denaturing the population of nucleic acids to produce a population of nucleic acid strands having the adaptor sequence at the 3' end,

(iv) hybridising an oligonucleotide primer to the adaptor sequences at the 3' ends of the nucleic acid strands,

(v) extending the primer along the nucleic acid strands to produce complementary strands, the strands and complementary strands forming double-stranded nucleic acids that comprise an adaptor at a first end, and;

(vi) ligating a second adaptor to the second end of the double-stranded nucleic acids to produce a library of double-stranded nucleic acids having an adaptor at the first end and a second adaptor at the second end.

1.2 A method according to embodiment 1.1 further comprising

(vii) denaturing the double-stranded nucleic acids to produce a library of nucleic acid strands having an adaptor sequence at the first end and a second adaptor sequence at the second end.

1.3 A method according to embodiment 1.1 or embodiment 1.2 comprising

(viii) interrogating one or more nucleic acids in the library to determine the identity of one or more bases in said nucleic acids.

1.4 A method according to embodiment 1.3 wherein the one or more nucleic acids are interrogated by sequencing the nucleic acids.

1.5 A method according to any one of embodiments 1.1 to 1.4 wherein the method does not comprise amplifying the nucleic acids.

1.6 A method according to any one of the preceding embodiments comprising treating the nucleic acids with bisulfite such that unmodified cytosine residues in said molecules are converted to uracil.

1.7 A method according to embodiment 1.6 wherein the nucleic acids are bisulfite treated after step ii and before step iii.

1.8 A method according to any one of embodiments 1.1 to 1.5 wherein the population of double-stranded nucleic acids is provided by a method comprising;

(a) treating a sample of nucleic acids with bisulfite to produce a population of bisulfite-treated nucleic acid strands,

(b) hybridising random primers to the bisulfite-treated nucleic acid strands, and;

(c) extending the random primers along the nucleic acid strands to generate complementary strands, thereby producing the population of double-stranded nucleic acids.

1.9 A method according to any one of embodiments 1.6 to 1.8 comprising determining the identity of a base in one or more nucleic acids in the library at a position that corresponds to cytosine in the non-bisulfite treated nucleic acids.

1.10 A method according to any one of embodiments 1.1 to 1.9 wherein the nucleic acids are modified before treatment with bisulfite.

1.11 A method according to embodiment 1.10 wherein the nucleic acids are oxidised to convert hydroxymethylcytosine residues into 5-formylcytosine (5fC) residues before said treatment with bisulfite.

1.12 A method according to embodiment 1.10 wherein the nucleic acids are reduced to convert 5-formylcytosine (5fC) residues into hydroxymethylcytosine residues before said treatment with bisulfite.

1.13 A method according to embodiment 1.10 wherein 5hmC residues in the nucleic acids are protected with a glucosyl protecting group and the nucleic acids are oxidised with TET protein to convert 5mC residues in the nucleic acids to 5caC, before said treatment with bisulfite.

1.14 A method according to embodiment 1.10 wherein the nucleic acids are labelled with 1-ethyl-3-[3-dimethylaminopropyl]carbodiimide hydrochloride (EDC), before said treatment with bisulfite.

1.15 A method according to embodiment 1.10 wherein the nucleic acids are labelled with O-ethylhydroxylamine before said treatment with bisulfite.

1.16 A method according to any one of the preceding embodiments comprising isolating double-stranded nucleic acids having an adaptor at a first end and a second adaptor at a second end.

1.17 A method according to any one of the preceding embodiments comprising immobilising the population of nucleic acid strands on a solid support following said denaturation.

1.18 A method according to embodiment 1.17 wherein the adaptor oligonucleotide is linked to a binding tag and the method comprises binding the tag to a capture member immobilised on a solid support, thereby immobilising the population of nucleic acid strands.

1.19 A method according to any one of embodiments 1.1 to 1.16 comprising immobilising the population of nucleic acids on a solid support following production of the complementary strands.

1.20 A method according to embodiment 1.19 wherein the oligonucleotide primer is linked to a binding tag and the method comprises binding the tag to a capture member immobilised on a solid support thereby immobilising the population of nucleic acids.

1.21 A method according to embodiment 1.19 or embodiment 1.20 wherein the method comprises;

(a) denaturing the immobilised nucleic acids to produce immobilised complementary nucleic acid strands, and

(b) releasing the complementary nucleic acid strands from the support to produce a nucleic acid library comprising nucleic acid strands having a first adaptor sequence at a first end and a second adaptor sequence at a second end.

1.22 A method according to embodiment 1.18 or embodiment 1.20 wherein the tag is biotin and the capture member is streptavidin.

1.23 A method according to any one of embodiments 1.17 to 1.22 wherein the nucleic acids are released from the solid support following ligation of the second adaptor.

1.24 A method according to embodiment 1.23 wherein the adaptor oligonucleotide or oligonucleotide primer is linked to the binding tag by a cleavable linker and the nucleic acids are released from the solid support by cleavage of the cleavable linker.

1.25 A method according to any one of the preceding embodiments wherein the nucleic acids are DNA molecules.

1.26 A method according to embodiment 1.24 wherein the DNA molecules are genomic DNA molecules.

1.27 A method according to any one of the preceding embodiments wherein the population of nucleic acids is obtained from a sample of cells.

1.28 A method according to any one of embodiments 1.1 to 1.26 wherein the population of nucleic acids is obtained or isolated from a single cell.

1.29 A method according to any one of embodiments 1.1 to 1.26 wherein the population of nucleic acids is obtained or isolated from a biological fluid sample.

1.30 A method according to embodiment 1.29 wherein the fluid sample is plasma.

1.31 A method according to any one of the preceding embodiments wherein the population of double-stranded nucleic acids is provided by a method comprising;

isolating nucleic acids from one of: a cell, a sample of cells, and a biological fluid sample,

fragmenting the nucleic acids and,

repairing the ends of the nucleic acids.

1.32 A method according to embodiment 1.31 comprising modifying the 3' ends of the nucleic acids.

1.33 A method according to embodiment 1.32 wherein the 3' ends are modified by the addition of an overhanging adenine residue.

1.34 A method according to any one of the preceding embodiments wherein the double-stranded nucleic acids in the population comprise a one base 3' overhang consisting of an adenine residue.

1.35 A method according to any one of the preceding embodiments wherein the adaptor sequence is added to the 3' ends of the nucleic acids by ligating an adaptor oligonucleotide to said 3' ends but not to the 5' ends of the nucleic acids.

1.36 A method according to embodiment 1.35 wherein the adaptor oligonucleotide is ligated to the 3' ends of the strands of the nucleic acids by contacting the population of double-stranded nucleic acids with a complex comprising the adaptor oligonucleotide hybridised to a complementary oligonucleotide.

1.37 A method according to embodiment 1.36 wherein the complex has a 3' T overhang at an end.

1.38 A method according to embodiment 1.36 or embodiment 1.37 comprising ligating the complex to the population such that the adaptor oligonucleotide of the complex is covalently linked to the 3' ends of the double-stranded nucleic acids and the complementary oligonucleotide of the complex is not linked to the 5' ends of the double-stranded nucleic acids.

1.39 A method according to embodiment 1.38 wherein the double-stranded nucleic acids in the population lack 5' phosphate groups.

1.40 A method according to embodiment 1.38 wherein the 3' end of the complementary oligonucleotide has a blocking group or dideoxynucleotide residue.

1.41 A method according to any one of embodiments 1.1 to 1.34 wherein the adaptor sequence is added to the 3' ends of the nucleic acids by;

ligating a double-stranded adaptor comprising the adaptor sequence hybridised to a complementary sequence to the ends of the nucleic acids,

such that the adaptor sequence is ligated to the 3' ends of the double-stranded nucleic acids and the complementary sequence is ligated to the 5' ends of the double-stranded nucleic acids, and cleaving the 5' ends of the nucleic acids to remove the complementary sequence.

1.42 A method according to embodiment 1.41 wherein the double-stranded adaptor is a hairpin adaptor,

said hairpin adaptor comprising a hairpin sequence that links the adaptor sequence and the complementary sequence.

1.43 A method according to embodiment 1.42 wherein the hairpin adaptor comprises a first cleavage site at the 3' end of the complementary sequence and a second cleavage site at the 5' end of the adaptor sequence.

1.44 A method according to embodiment 1.43 wherein cleavage of the first and second cleavage sites produces a population of nucleic acids having the adaptor sequence at the 3' ends but lacking an adaptor sequence at the 5' ends.

1.45 A method according to any one of embodiments 1.41 to 1.44 wherein the nucleic acids are treated with bisulfite after ligation of the double-stranded adaptor and before cleavage of the 5' ends.

1.46 A method according to any one of the preceding embodiments wherein the adaptor and the second adaptor are sequencing adaptors.

1.47 A method according to any one of the preceding embodiments comprising modifying the second ends of the double stranded nucleic acids generated by extension of the oligonucleotide primer.

1.48 A method according to embodiment 1.47 comprising adding a 5' phosphate group and a 3' adenine residue to the second end of the double-stranded nucleic acids.

1.49 A method according to embodiment 1.48 comprising;

contacting the double-stranded nucleic acids with a second adaptor comprising a 3' T overhang at an end, and;

ligating the second adaptor to the second end of the nucleic acids.

1.50 A kit for use in the preparation of a nucleic acid library according to any one of embodiments 1.1 to 1.49 comprising;

an adaptor oligonucleotide,

a complementary oligonucleotide,

an oligonucleotide primer, and

a second adaptor, or;

a hairpin adaptor comprising an adaptor sequence and a complementary sequence,

an oligonucleotide primer, and

a second adaptor.

1.51 A kit according to embodiment 1.50 wherein the adaptor oligonucleotide consists of nucleotide analogues or modified nucleotides.

1.52 A kit according to embodiment 1.50 or embodiment 1.51 wherein the complementary oligonucleotide lacks a 3' hydroxyl group.

1.53 A kit according to any one of embodiments 1.50 to 1.52 further comprising a bisulfite reagent.

1.54 A kit according to any one of embodiments 1.50 to 1.53 further comprising one or more nucleic acid isolation reagents.

1.55 A kit according to any one of embodiments 1.50 to 1.54 further comprising one or more end-repair reagents.

1.56 A kit according to embodiment 1.55 wherein the end-repair reagents do not include a 5' kinase.

1.57 A kit according to any one of embodiments 1.50 to 1.56 further comprising one or more end-modification reagents.

1.58 A kit according to any one of embodiments 1.50 to 1.57 further comprising one or more cleavage reagents for cleavage of the hairpin primer.

1.59 A kit according to embodiment 1.58 wherein the cleavage reagents comprise formamidopyrimidine [fapy]-DNA glycosylase.

1.60 A kit according to any one of embodiments 1.50 to 1.59 further comprising a solid support.

1.61 A kit according to any one of embodiments 1.50 to 1.60 further comprising one or more sequencing reagents.

1.62 Use of a kit according to any one of embodiments 1.50 to 1.61 in a method of preparing a nucleic acid library according to any one of embodiments 1.1 to 1.49.

1.63 A method of assessing an individual for a disease or predisposition thereto comprising;

(i) providing a sample obtained from the individual,

(ii) isolating a population of double-stranded nucleic acids from the sample,

(iii) adding an adaptor oligonucleotide to the 3' ends of the strands of the nucleic acids to produce a population of nucleic acids having an adaptor sequence at the 3' ends,

(iv) denaturing the population of nucleic acids to produce a population of nucleic acid strands having the adaptor sequence at the 3' end,

(v) hybridising an oligonucleotide primer to the adaptor sequences at the 3' ends of the nucleic acid strands,

(vi) extending the primer along the nucleic acid strands to produce complementary strands, the strands and complementary strands forming double-stranded nucleic acids that comprise an adaptor at a first end, and;

(vii) ligating a second adaptor to the second end of the double-stranded nucleic acids to produce a library of double-stranded nucleic acids having an adaptor at the first end and a second adaptor at the second end,

(viii) optionally denaturing the double-stranded nucleic acids to produce a library of nucleic acid strands having an adaptor sequence at the first end and a second adaptor sequence at the second end, and

(ix) interrogating one or more nucleic acids in the library to determine the identity of one or more bases in said nucleic acids .

1.64 A method according to embodiment 1.63 wherein the one or more nucleic acids are interrogated by sequencing the nucleic acids.

1.65 A method according to embodiment 1.63 or embodiment 1.64 wherein the identity of one or more bases or the sequence of the one or more nucleic acids in the library obtained from the sample is indicative of a disease or a predisposition to a disease in the individual.

1.66 A method according to any one of embodiments 1.63 to 1.65 wherein the sample is a plasma sample from the individual.

1.67 A method according to any one of embodiments 1.63 to 1.66 wherein the population of nucleic acids is denatured by treatment with bisulfite.

1.68 A method according to any one of embodiments 1.63 to 1.67 wherein the adaptor sequence is added to the 3' ends of the nucleic acids by ligating an adaptor oligonucleotide to said 3' ends.

1.69 A method according to any one of embodiments 1.63 to 1.68 wherein the adaptor sequence is added to the 3' ends of the nucleic acids by;

ligating a double-stranded adaptor comprising the adaptor sequence hybridised to a complementary sequence to the ends of the nucleic acids,

such that the adaptor sequence is ligated to the 3' ends of the double-stranded nucleic acids and the complementary sequence is ligated to the 5' ends of the double-stranded nucleic acids, and cleaving the nucleic acids to remove the complementary sequence.

1.70 A method according to embodiment 1.69 wherein the double-stranded adaptor is a hairpin adaptor comprising a hairpin sequence which links the adaptor sequence and the complementary sequence.

1.71 A method according to any one of embodiments 1.63 to 1.70 comprising isolating the double-stranded nucleic acids having an adaptor at a first end and a second adaptor at a second end.

Summary of experiments performed

Whilst bisulfite treatment leads to loss of sequenceable DNA via fragmentation, the majority of cleaved fragments still contain useful information and are of a mappable length. These lost fragments and the associated information can be recovered by employing a two-step ligation procedure, where the P7 adapter is added before bisulfite treatment and the P5 adapter afterwards.

The recovery after bisulfite treatment (ReBuilT) method begins with fragmentation, end repair and A-tailing. We then employ custom methylated adapters, with one strand bearing a 3' biotin label and the other a 3' dideoxythymidine (ddT) terminator. The presence of a 3' ddT prevents ligation to the 5' end of the insert DNA, resulting in a single stranded directional ligation to the 3' insert terminus. Following bisulfite conversion, a primer extension step with a high fidelity uracil tolerant polymerase is performed to generate blunt ended double stranded DNA, which is immobilized on streptavidin coated magnetic beads via the biotin label. The immobilized DNA is end repaired and A-tailed before ligation of a fully complementary adapter. To generate sequenceable fragments we copy the bisulfite-converted strands by single primer extension. These new strands contain only the canonical DNA bases (A, T, G and C), which is necessary as standard next-generation sequencing platforms are incompatible with uracil containing DNA. It should be noted that the first directional ligation prevents the formation of adapter dimers, a common sequencing contaminant that leads to non-insert sequencing reads. Adapter dimers forming during the second ligation have no impact on library composition, as they are completely removed during washing of the beads. Finally, the immobilization on beads enables near lossless library manipulation.

As a proof of concept experiment we generated sequencing libraries of a small genome (E. coli), and utilized qPCR to compare the concentration of sequenceable fragments obtained with either ReBuilT or a traditional BS-seq library preparation. Starting from equal input DNA, the concentration of sequenceable fragments was two orders of magnitude higher with the ReBuilT protocol than with a traditional protocol excluding PCR amplification (data shown below).

Ct value (1/1000 dilution)

Protocol               Replicate 1   Replicate 2   Average
ReBuilT protocol       13.47         13.53         13.5
Traditional protocol   20.06         20.21         20.14
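The relationship between the Ct values above and the stated difference of two orders of magnitude can be checked with a short calculation. The following is a minimal sketch, assuming ideal amplification in which template doubles every cycle (an efficiency of 2.0, which is an assumption and not a measured value); the function name fold_difference is illustrative only.

```python
def fold_difference(ct_reference, ct_sample, efficiency=2.0):
    """Fold excess of amplifiable template in the sample over the reference.

    Assumes every qPCR cycle multiplies the template by `efficiency`
    (2.0 means perfect doubling, which is an idealisation).
    """
    return efficiency ** (ct_reference - ct_sample)


# Replicate Ct values taken from the table (1/1000 dilution).
rebuilt_ct = (13.47 + 13.53) / 2
traditional_ct = (20.06 + 20.21) / 2

delta_ct = traditional_ct - rebuilt_ct
fold = fold_difference(traditional_ct, rebuilt_ct)
print(f"delta Ct = {delta_ct:.2f} cycles, fold difference = {fold:.0f}")
```

Under this assumption a difference of about 6.6 cycles corresponds to roughly a 100-fold difference in sequenceable fragments.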

We decided to exemplify our method by generating a PCR-free methylome for the challenging, AT-rich Plasmodium genome. We first analysed the global DNA modification levels by tandem mass spectrometry. The level of 5mC was 0.31% of total cytosine species, and no oxidised cytosine derivatives were detected (less than the detection limit of 1 hmC per 10,000 total cytosine species). It is noteworthy that the only other modification detected was N6-methyladenine (data shown below):

Quantitative concentration (nM) and relative composition

Sample                      C           mC       N6meA   mC/(C+mC) %   N6meA/(C+N6meA) %
QE_PR502_PV_011114-B56-33   11958.256   36.745   0.234   0.306         0.0020
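The relative composition columns follow directly from the measured concentrations. A minimal sketch of that arithmetic is given below; the helper name percent_of is hypothetical and the values are those tabulated above.

```python
def percent_of(part, base):
    """Return part / (base + part) expressed as a percentage."""
    return 100.0 * part / (base + part)


# Concentrations from the LC-MS/MS table.
c = 11958.256      # unmodified cytosine
mc = 36.745        # 5-methylcytosine
n6mea = 0.234      # N6-methyladenine

mc_percent = percent_of(mc, c)        # mC/(C+mC) %
n6mea_percent = percent_of(n6mea, c)  # N6meA/(C+N6meA) %
print(f"mC/(C+mC) = {mc_percent:.3f} %")
print(f"N6meA/(C+N6meA) = {n6mea_percent:.4f} %")
```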

We employed the ReBuilT method to generate PCR-free libraries from 50 ng of P. berghei DNA, extracted from an asynchronous population of erythrocytic stages. In tandem we generated traditional bisulfite libraries (termed PCR-BS here) that included post-bisulfite PCR amplification. We sequenced the libraries on the Illumina NextSeq platform, with paired end reads of 75 or 100 bases. We were able to obtain up to 285 million reads from 13% of an amplification free library generated from 50 ng (i.e. equivalent to 6.5 ng of input DNA), giving ample data for analysis of low methylation levels with high confidence.

The impact of library preparation method on sequencing data quality

To evaluate the benefit of the ReBuilT method, we compared a range of data quality metrics to the corresponding PCR-BS libraries generated from the same source of genomic DNA. We first looked at how much of the raw sequencing data was retained following adapter trimming, quality trimming and alignment to the Plasmodium genome (Fig. 13a). Following trimming the ReBuilT libraries retained on average 87.4% of the raw data, compared to 63.6% for the PCR-BS libraries. Importantly, the sequence quality of the ReBuilT libraries was higher than that of the PCR-BS libraries, with mode phred scores of 35 versus 31 for read mate 1 (data shown below).

Library                 Mate   Phred mode   % reads   Protocol
grm034_REBUiLT_AD04     R1     35           19.33     ReBuilT
grm034_REBUiLT_AD04     R2     34           13.91     ReBuilT
grm035_REBUiLT_AD06     R1     35           21.62     ReBuilT
grm035_REBUiLT_AD06     R2     34           15.17     ReBuilT
grm036_REBUiLT_AD12     R1     35           22.39     ReBuilT
grm036_REBUiLT_AD12     R2     34           15.34     ReBuilT
grm037_BS_plasb2_AD04   R1     31           13.77     BS
grm037_BS_plasb2_AD04   R2     32           9.56      BS
grm038_BS_plasb2_AD16   R1     31           13.94     BS
grm038_BS_plasb2_AD16   R2     32           10.29     BS

The read pairs were subsequently aligned to a chimeric P. berghei and M. musculus reference genome, as extracted Plasmodium DNA may be contaminated with some genomic material from the host. Average alignment rates were 80.5% and 72.1% for the ReBuilT and PCR-BS samples respectively. However, of these high quality aligned reads, 90.9% of ReBuilT and 70.6% of PCR-BS reads aligned to the P. berghei reference genome, with the remaining reads aligning to the host mouse genome. Following all data processing, the ReBuilT libraries retain approximately double the percentage of raw data for methylation calling when compared to the PCR-BS libraries. The ReBuilT method, therefore, yields considerably more useable data, which reduces the sequencing power required for methylation analysis.
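The statement that roughly twice as much raw data survives processing can be reproduced from the percentages quoted above. The sketch below assumes that the three stages (trimming, alignment, assignment to the parasite genome) act in sequence so that their retained fractions multiply; this composition is an assumption made for illustration and may not match exactly how the figures were derived.

```python
import math

# Average fractions retained at each stage, as quoted in the text.
stages = {
    "ReBuilT": {"trimming": 0.874, "alignment": 0.805, "parasite": 0.909},
    "PCR-BS": {"trimming": 0.636, "alignment": 0.721, "parasite": 0.706},
}


def retained_fraction(stage_fractions):
    """Overall fraction retained if the stages apply one after another."""
    return math.prod(stage_fractions.values())


retained = {name: retained_fraction(f) for name, f in stages.items()}
ratio = retained["ReBuilT"] / retained["PCR-BS"]

for name, value in retained.items():
    print(f"{name}: {100 * value:.1f} % of raw data retained")
print(f"ReBuilT / PCR-BS = {ratio:.2f}")
```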

For optimum whole genome bisulfite analysis it is essential that read depth remains even across the genome. Regions with uneven coverage would otherwise exhibit inaccuracies in apparent methylation levels. To address this issue, we down-sampled libraries to be of equal size, and examined the read depth distribution. The ReBuilT libraries consistently exhibit a higher normalized median read depth, and a dramatically reduced standard deviation, than the PCR-BS libraries (ReBuilT: 2.7 ± 1.7; PCR-BS: 1.9 ± 5.1). We further addressed this point by plotting the data as a density histogram, and overlaying the Poisson distribution expected in the complete absence of bias (Fig. 13b). While the ReBuilT libraries approximate the expected distribution, the biasing effect of PCR amplification is quite striking. The PCR-BS data is heavily skewed to low read counts, yet has a tail stretching towards very high values. This can be interpreted as many regions being inadequately represented, with a small subset of regions hugely over represented at their expense. This has a negative impact on methylation analysis, as in regions with low coverage there is a reduced ability to confidently detect methylation levels. This effect can be further seen in the duplication rate (Figure 16). When sequencing a small genome, a certain proportion of apparent duplicates are inevitable, and for an equal sample of our libraries, the expected duplication rate is approximately 12%. The ReBuilT libraries were found to have an average duplication rate of 16%. Clearly there are no PCR duplicates, as no amplification has been performed; however, this increase in duplicates is a reflection of the imperfect overlap with the Poisson distribution seen in Fig. 13b. The PCR-BS sample has almost double the duplication rate, at 30%, which is a cumulative effect of amplification duplicates and extremely uneven coverage. Uneven coverage leads to peaks and troughs in read depth, which will locally raise or lower the expected duplication rate.
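One way to see how far each library departs from the unbiased Poisson expectation is the index of dispersion (variance divided by mean), which equals 1 for a Poisson distribution. The sketch below is illustrative only: it uses the reported normalized median depth as a stand-in for the mean, which is an assumption, and the function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing depth k under a Poisson model with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def dispersion_index(centre, sd):
    """Variance-to-mean ratio; equals 1.0 for an ideal Poisson distribution."""
    return sd ** 2 / centre


# (normalized read depth, standard deviation) as reported in the text.
reported = {"ReBuilT": (2.7, 1.7), "PCR-BS": (1.9, 5.1)}

dispersion = {
    name: dispersion_index(centre, sd) for name, (centre, sd) in reported.items()
}
for name, (centre, sd) in reported.items():
    zero_depth = poisson_pmf(0, centre)  # expected share of uncovered positions
    print(f"{name}: dispersion = {dispersion[name]:.2f}, "
          f"Poisson P(depth = 0) = {zero_depth:.3f}")
```

A value close to 1 is consistent with near-random sampling of the genome, whereas a value far above 1 indicates over-dispersed, uneven coverage.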

We next examined how base composition affects coverage by plotting the normalized read count against local GC content of the reference genome (Figure 17). Our ReBuilT libraries show a remarkable insensitivity towards GC context (r = -0.08), even given the extreme composition of the P. berghei genome. The PCR-BS libraries, however, show a clear bias towards a more balanced GC content (r = 0.69), consistent with reports of PCR bias against highly skewed base compositions (Aird, D. et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 12, R18 (2011)). The effect can be severe in certain regions, as demonstrated in Fig. 14a. This GC bias is also the likely explanation for the difference in alignment rates to the mouse genome (Fig. 13a). Despite extensive purification procedures the P. berghei DNA sample will exhibit minor mouse cell contamination, resulting in the presence of murine sequences in the data.

Accordingly, we performed the sequence alignment against a chimeric reference genome, and discarded non-parasite reads. Although the input DNA was from the same sample of purified DNA, the two methods gave disparate levels of mouse contamination. The ReBuilT libraries averaged 9.1% of reads aligning to the mouse genome, while the PCR-BS samples averaged 29.5% contamination. We suggest this is due to the mouse genome having a more balanced base composition (42% GC) relative to the Plasmodium genome, resulting in preferential amplification during PCR.
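The correlation between local GC content and normalized read count referred to above can be computed along the following lines. This is a minimal sketch on invented toy windows and counts (they are not data from the experiments), intended only to show the calculation of the GC fraction per window and the Pearson coefficient r.

```python
import math


def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) in seq that are G or C."""
    called = [b for b in seq.upper() if b in "ACGT"]
    if not called:
        return 0.0
    return sum(b in "GC" for b in called) / len(called)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical reference windows and read counts, for illustration only.
windows = ["ATATATATAT", "ATGCATATTA", "GCGCATATGC", "ATTTTAAATA"]
read_counts = [4, 9, 15, 3]

gc = [gc_fraction(w) for w in windows]
mean_count = sum(read_counts) / len(read_counts)
normalized = [count / mean_count for count in read_counts]
r = pearson_r(gc, normalized)
print(f"r = {r:.2f}")
```

In this toy example the counts rise with GC content, mimicking the behaviour described for the PCR-BS libraries; an unbiased library would give r close to zero.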

We extended this analysis to consider only the informative read depth, as only data originating from the C-strand can be used for methylation calling (Fig. 14b). The ReBuilT method has almost no correlation between the GC content and informative read count. Meanwhile the PCR-BS samples exhibit a strong preference for the relatively GC rich windows. The preference for a balanced base composition has the potential to introduce two types of artifacts when analyzing the methylome. Firstly, the quantification of methylation levels will be affected. As 5mC bases are not converted to thymine during bisulfite treatment, DNA fragments containing methylated loci will tend to have a higher GC content. As GC content can clearly affect amplification efficiency, it is no longer correct to determine the methylation level at a site with the C/(C+T) formula. Secondly, certain biological features display characteristic base compositions. For example, though the P. berghei genome has a GC content of 22.1%, intergenic regions average 19.7% GC and exonic regions average 23.8% GC. Thus the base composition bias may cause overrepresentation of coding regions. Indeed the average coverage profile in and around exons shows how the PCR-BS read depth tracks the GC content, while the ReBuilT read depth remains constant irrespective of the genomic feature (Fig. 14c). Taken together, these analyses suggest that traditional PCR-dependent bisulfite experiments have poor quantitative power, and may fail to capture methylation sites in certain genomic features.
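The first artifact can be illustrated numerically. The sketch below computes the C/(C+T) methylation level and then shows, under a deliberately simple and hypothetical model, how preferential amplification of fragments that retain cytosine would inflate the observed level. The relative efficiency per cycle and the cycle number are invented parameters, not measured values.

```python
def methylation_level(c_reads, t_reads):
    """Methylation level at a site as C / (C + T)."""
    total = c_reads + t_reads
    if total == 0:
        raise ValueError("no informative reads at this site")
    return c_reads / total


def observed_after_bias(true_level, relative_efficiency, cycles):
    """Observed C / (C + T) if C-retaining fragments amplify
    `relative_efficiency` times better per cycle than converted ones.

    Hypothetical model: a single constant efficiency ratio per cycle.
    """
    c = true_level * relative_efficiency ** cycles
    t = 1.0 - true_level
    return c / (c + t)


true_level = 0.10
biased = observed_after_bias(true_level, relative_efficiency=1.05, cycles=15)
unbiased = observed_after_bias(true_level, relative_efficiency=1.0, cycles=15)
print(f"true = {true_level:.2f}, unbiased = {unbiased:.2f}, biased = {biased:.2f}")
```

With no amplification step the efficiency ratio plays no role, which is why a PCR-free library allows the C/(C+T) formula to be applied directly.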

Methylation in the Plasmodium genome

Using the PCR-free data we found 76,205 methylated loci (FDR corrected P-value, P < 0.01), representing 1.87% of the total genomic cytosine sites. The global level of unconverted cytosines was 0.70%, with single sites reaching a maximum of 21% methylation. The global value conforms to the low global 5mC we detected by LC-MS/MS (0.33% 5mC/total C). We were able to confidently quantify such low levels of methylation due to the high depth of sequencing we obtained: a combined 600x depth across replicates (excluding the mitochondria and apicoplast where coverage reached 25000x). The number of sites, and the global methylation level detected, was substantially higher in the PCR-BS dataset (Fig. 15c). Furthermore the PCR-BS dataset showed a clear correlation between the percent methylation and read count. Increasing read counts should increase the quantitative power of bisulfite sequencing, so that, for example, lower methylation levels can be detected. Conversely, the quantitative power of bisulfite sequencing is low where the read counts are low. (Top) As ReBuilT exhibits even coverage, all regions have sufficient read depth for accurate methylation calling. (Bottom) Due to the uneven coverage of PCR-BS, many regions have low read counts, and in these regions the observed methylation % is suspiciously high.

We found the context of the methylated loci to be 92% CHH (where H represents any nucleotide except G), with the remaining sites being 3.6% CG, 2.2% CHG, and 2.1% CC (Fig. 15a). Within the CHH set there was a strong preference for CAH, with 68.7% of all methylation loci being found in this context. The genomic context of all cytosines shows a preference for adenine in the +1 position, but this preference is significantly increased for methylated loci. In contrast, cytosine and guanine bases are generally depleted around methylated loci, in agreement with previously reported data from P. falciparum (Ponts, N. et al. Genome-wide mapping of DNA methylation in the human malaria parasite Plasmodium falciparum. Cell Host Microbe 14, 696-706 (2013)).
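The sequence contexts discussed above are assigned from the two bases following each cytosine. The sketch below shows the conventional three-way classification (CG, CHG, CHH) for a cytosine on the forward strand; it is an illustrative helper, not the code used in the experiments, and it does not reproduce the separate "CC" category reported above.

```python
from collections import Counter


def cytosine_context(seq, pos):
    """Classify the cytosine at seq[pos] as 'CG', 'CHG' or 'CHH'.

    H is any base except G. Returns None when the downstream bases are
    missing or ambiguous (for example at the end of the sequence).
    """
    seq = seq.upper()
    if seq[pos] != "C":
        raise ValueError("position does not hold a cytosine")
    downstream = seq[pos + 1:pos + 3]
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) < 2 or any(b not in "ACGT" for b in downstream):
        return None
    return "CHG" if downstream[1] == "G" else "CHH"


# Tally contexts over a short made-up sequence.
example = "ACATTCGACAGTCAA"
tally = Counter(
    cytosine_context(example, i) for i, base in enumerate(example) if base == "C"
)
print(dict(tally))
```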

The genomic location of methylated loci is shown in Figure 15b. We found the majority of methylated loci (42%) were located in exonic regions. However, as exons make up 55% of the Plasmodium genome, and are relatively cytosine rich elements, methylation is in fact underrepresented. This is confirmed when visualizing the methylation profile in and around exons across the genome (Fig. 15c).
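The under-representation of methylation in exons follows from comparing the share of methylated loci with the share of the genome occupied by exons. A minimal sketch of that comparison is shown below; it ignores the higher cytosine content of exons, so the true ratio per available cytosine would be lower still.

```python
def representation_ratio(fraction_of_loci, fraction_of_genome):
    """Ratio below 1 means the feature holds fewer loci than its size predicts."""
    return fraction_of_loci / fraction_of_genome


# 42 % of methylated loci lie in exons; exons make up 55 % of the genome.
exon_ratio = representation_ratio(0.42, 0.55)
print(f"exon representation ratio = {exon_ratio:.2f}")
```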

Interestingly, there is a change in methylation levels across intron/exon boundaries, which by analogy to mammalian systems could be involved in transcript splicing (Gelfman, S., Cohen, N., Yearim, A. & Ast, G. DNA-methylation effect on cotranscriptional splicing is dependent on GC architecture of the exon-intron structure. Genome Res. 23, 789-99 (2013)).

Apicomplexan parasites, of which Plasmodium is one, have a non-photosynthetic relict plastid called the apicoplast that codes for proteins that participate in lipid biosynthesis and iron metabolism. This organelle contains multiple copies of a 35 kb genome, which has been suggested to be unmethylated (Ponts, N. et al. Genome-wide mapping of DNA methylation in the human malaria parasite Plasmodium falciparum. Cell Host Microbe 14, 696-706 (2013)). We determined the average number of genome copies to be 5.5, and detected significant methylation along its sequence.

Discussion

The data generated from the ReBuilT method provides compelling evidence for the key benefits of PCR-free methylation analysis. We show that this approach results in increased uniformity of coverage, a lower duplication rate and substantially reduced sequence context biases as compared to a standard BS approach that employs PCR amplification. Consequently, the methylation calls more accurately represent the true methylome of the organism. We were able to generate amplification free methylomes from 50 ng input quantities of P. berghei genomic DNA. The data generated was of high quality, with a greater fraction of raw reads surviving bioinformatic processing than for a comparable traditional BS-seq data set.

By employing our method in conjunction with high depth next-generation sequencing, we have confidently quantified low levels of methylation in the P. berghei genome. We found global methylation levels were low, yet occurred predominantly in the asymmetric CAH context. The methylation profile is similar to those seen when non CG methylation is studied in other eukaryotes, in that levels are low and the context is asymmetric. Additionally, there was a clear decrease in methylation levels across intron-exon boundaries.

In conclusion, our approach enables the study of methylomes previously intractable to BS-seq, as exemplified by the malarial parasite P. berghei.

Experiment 1

Initial studies were made on a synthetic single stranded oligomer that was biotinylated on the 3' end and contained primer extension sites (see sequence below, primer sites are highlighted in bold).

5'-CTCACCCACAACCACAAACAGGCCGCTCAATTGGTCGTAGACAGCTCTAGCACCGCTTAAACGCACGTACGCGCTGTTTAACCGCCAAGGGGTTGGATGGTAGATGGTGA[BtnTag]-3'

SEQ ID NO: 8

2 μg of the oligomer was treated with bisulfite (Cambridge Epigenetix TrueMethyl oxidative bisulfite kit) and then split into two samples, each containing 200 ng of total DNA in a volume of 15 μL. One sample was then directly analysed by qPCR to assess the quantity of amplifiable DNA fragments, which require the entire intact strand with primer sites on both the 5' and 3' ends. Therefore the sample was diluted 200 fold and 1 μL (= 1 ng total DNA) was used per measurement. The other sample was used to recover fragmented DNA that still contained the 3' biotinylated primer site by doing a primer extension step that resulted in 5' blunt end double stranded DNA. After end repair and A-tailing, the 5' adaptor was ligated onto the DNA fragments. Starting from the primer extension step, all the steps were done on streptavidin coated magnetic beads, facilitating the purification steps and thus minimizing loss of DNA. After the last wash step, the beads were suspended in 15 μL water, as for the control sample. qPCR was performed on 1 μL of a 200 fold diluted sample (Figure 7) and post qPCR DNA was run on a Tapestation (Figure 8) to evaluate the fragment distribution.

The Ct value (Figure 7) can be used as a measurement of the quantity of amplifiable fragments in the sample, i.e., fragments that contain intact primer sites. The control bisulfite sample has a significantly higher (15.63 cycles) Ct value than the modified protocol sample (3.57 cycles), which shows that the control contains ~1000 fold fewer amplifiable fragments. N.B. only the amplifiable fragments are sequenceable (i.e. will generate clusters on the flowcell).

Rather than seeing one sharp peak due to only undamaged DNA fragments (100 bp), which is what would be seen for a standard bisulfite sample, a range of fragments can be seen below 100 bp (Figure 6). These are generated by the repair of fragments that would otherwise not be amplifiable.

Experiment 2

500ng of human genomic DNA was prepared and BS treated using the PCR free method described above. 35 bp paired ends were sequenced on an Illumina Miseq instrument.

25 million reads passed filter and 13 million reads were left after trimming. Of these, ~50% of the reads aligned uniquely to the reference. ~40% aligned but did not map uniquely to the reference. These were good reads but mapped to multiple places in the genome, for example because of repeat regions etc. ~10% of the filtered and trimmed reads did not map to the reference at all.

The approximately 12 million reads that pass filter, but not trimming, resulted from fragmentation in the 3' adaptor region.

These fragments were then recovered during our recovery prep. This shows that the most significant damage induced by PCR-free bisulfite treatment is damage to the ends of the DNA fragments, which are the 3' adaptor regions. It is possible to prevent these fragmented adaptors in the library occurring with the use of a differently designed primer in the single primer extension step. Furthermore, the prevalence of DNA damage to adaptors may be reduced during bisulfite treatment through the use of adaptors comprising modified nucleotides or nucleotide mimetics, as described herein.

Experiment 3

500 ng of E. coli genomic DNA was prepared and BS treated i) using the PCR free method described above and ii) using a standard BS preparation with 15 cycles PCR to produce indexed libraries. The two libraries of 50 bp paired ends were sequenced in one run using an Illumina Miseq instrument.

The small E. coli genome was chosen to exemplify the drastic changes in sequence coverage that are induced by PCR amplification. Changes in coverage across the genome are important because biases affect the quantitative power of BS-sequencing. BS quantitation is determined by the 5mC basecalls as a percentage of the total reads covering that base. Therefore, uneven coverage leads to incorrect quantitation.

Figures 9 and 10 illustrate the advantages of PCR free library preparation as described herein in improving genomic coverage compared to standard BS treatment methods.

The boxed sequence in Figure 9 has a highly skewed base composition. This bisulfite-converted region is classified as hugely AT-rich, due to having 91% AT composition. The average coverage in this AT-rich region was over 10x higher in the PCR free sample. AT-rich regions are known to be poorly amplified by PCR; the greater the AT skew, the poorer the observed amplification and hence the fewer reads in the PCR amplified library.

The boxed post-bisulfite sequence in Figure 10 has a highly skewed base composition of 88% AT. A 20 base pair region in the centre of the box has zero aligned reads in the traditional BS prep; however, there is little change from the average coverage in the new recovery protocol.

Coverage was summed in 100 base pair windows along the E. coli genome, and normalized to correct for differences in mapped reads. Figure 11 shows the log 2 ratio of (old method/new method). It is evident from Figure 11 that there are many more gaps in the negative region, corresponding to gaps in the traditional bisulfite sample. Measurement of methylation levels in these regions will be inaccurate and not truly quantitative.
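The windowed comparison described above can be sketched as follows. This is an illustrative outline only: the per-base depth lists, the mapped-read totals and the pseudocount (added so that windows with zero coverage do not produce an undefined logarithm) are assumptions and are not taken from the experiments.

```python
import math


def window_sums(depth, window=100):
    """Sum per-base depth in consecutive windows of `window` bases."""
    return [sum(depth[i:i + window]) for i in range(0, len(depth), window)]


def log2_ratio(old_depth, new_depth, old_mapped, new_mapped,
               window=100, pseudocount=1.0):
    """log2(old / new) per window after scaling by total mapped reads.

    The pseudocount is an assumption of this sketch; it keeps windows
    with no coverage finite instead of undefined.
    """
    ratios = []
    for old, new in zip(window_sums(old_depth, window),
                        window_sums(new_depth, window)):
        old_norm = (old + pseudocount) / old_mapped
        new_norm = (new + pseudocount) / new_mapped
        ratios.append(math.log2(old_norm / new_norm))
    return ratios


# Toy example: the second window is a coverage gap in the old method.
old = [5] * 100 + [0] * 100
new = [5] * 200
print(log2_ratio(old, new, old_mapped=1000, new_mapped=1000))
```

Negative values mark windows that are under-represented in the old (PCR-amplified) library relative to the new one.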

Experiment 4

The degradation of DNA adapted with Y-shaped and hairpin adaptors by bisulfite treatment was compared. The Y-shaped adapted DNA was almost completely degraded by bisulfite treatment, as evidenced by the very faint bands observed on a high sensitivity TapeStation gel (Figure 12B). The hairpin adapted DNA showed significantly stronger bands post bisulfite, indicating that hairpin adapted DNA suffered less damage from bisulfite treatment.

Sequences

AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

SEQ ID NO: 1 Universal Adaptor for Illumina sequencing (Truseq™)

GATCGGAAGAGCACACGTCTGAACTCCAGTCACXXXXXXATCTCGTATGCCGTCTTCTGCTTGTCT

SEQ ID NO: 2 Adaptor for Illumina sequencing (Truseq™) with index sequence underlined.

Experiment 5

Comparison of GC and AT rich genomes vs prior art PCR based methods.

P. berghei culture conditions and DNA extraction

6-10 week old Theiler's Original mice were injected intraperitoneally with 0.2 mL of 6 mg/mL phenylhydrazine, and three days later infected with 5×10^7 parasites (Plasmodium berghei ANKA strain, clone 233). From day 6 onwards tail smears were taken to assay parasitaemia. Mice were bled by cardiac puncture and parasitic DNA extracted following standard protocols (Doolan, D. L. Methods in Molecular Medicine Volume 72: Malaria Methods and Protocols ed. Humana Press, Inc., Totowa, New Jersey, USA, pp. 25-40).

Sonication of genomic DNA

500 ng DNA (10 mM Tris-HCl pH 8, 1 mM EDTA) was sheared by sonication with the Covaris M220 Focused-ultrasonicator to give an average fragment length of 250 base pairs (Peak incident power 50 W, Duty Factor 20%, 200 Cycles per Burst, 120 s treatment time). The amount of DNA was quantified using the Qubit dsDNA BR assay and the fragmentation confirmed with the Agilent 2200 Tapestation using D1000 screentapes and reagents.

DNA digestion and LC-MS/MS analysis

250 ng of genomic DNA was digested using DNA degradase (Zymo Research) according to manufacturer's instructions, with stable isotope labelled nucleotides (dC+3, m5C+3, hm5dC+3 and N6mdA+3) spiked in at 25 nM final concentration. A dilution series (0.0125 - 15000 nM) of the unlabelled reference standards (dC, m5C, m5hmC and N6mA; Sigma Aldrich, Carbosynth Ltd) were mixed with the stable isotope labelled nucleosides.

Quantitative LC-MS/MS analysis was carried out using an Agilent 1290 Infinity UHPLC coupled to a Thermo Q-exactive mass spectrometer. LC was performed on a Waters Acquity UPLC HSS T3 column (100 x 2.1 mm, 1.8 μm particle size) kept at 50°C, applying a gradient starting at 100% of 0.1% formic acid in water followed by increasing proportions of 0.1% formic acid in acetonitrile up to 30%, at a flow rate of 350 μL/min over 3 minutes. The MS was operated in positive ion mode.

Generating synthetic oligomers and adapters

Oligonucleotide sequences:

GCTCTTCCGATC(ddT)

SEQ ID NO: 3 ODN1a

GAT5GGAAGAG5A5A5GT5TGAA5T55AGT5ACTGA55AAT5T5GTATG55GT5TT5TG5TTG-(biotin)

SEQ ID NO: 4 ODN1b

AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

SEQ ID NO: 5 ODN2a

GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

SEQ ID NO: 6 ODN2b

CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

SEQ ID NO: 7 ODN3

5 = 5-methylcytosine

Adapter pair 1 = ODN1a + ODN1b

Adapter pair 2 = ODN2a + ODN2b

ODN1a was obtained by employing terminal deoxynucleotidyl transferase (NEB) to end label a 12-mer (5'-GCT CTT CCG ATC-3') with dideoxythymidine triphosphate (following the manufacturer's protocol). The resulting 3' blocked oligomer was purified by ethanol precipitation, and resuspended in 10 mM Tris, 50 mM NaCl. The underlined six-nucleotide portion of oligomer ODN1b was varied to give different adapter barcodes. All cytosines in ODN1b were replaced with 5mC to retain the adapter sequence following bisulfite conversion. Adapter pairs were annealed in a thermocycler (95 °C for 10 minutes, cooling to 70 °C over 10 minutes, holding at 70 °C for 10 minutes and then slowly cooling to RT at 0.1 °C/s) to give 25 μM solutions in 10 mM Tris-HCl pH 7. , 50 mM NaCl. Annealing ODN1a and ODN1b generated adapter pair 1; annealing ODN2a + ODN2b generated adapter pair 2.
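The reason for replacing cytosines with 5mC in the first adapter can be shown with a simple in-silico conversion. The sketch below uses the notation of the sequence listing ("5" for 5-methylcytosine) and models an idealised, complete bisulfite reaction in which every unmodified C becomes U and every 5mC is protected and reads as C. The fragment used is a short prefix of the adapter strand as listed; the function name is hypothetical.

```python
def bisulfite_readout(seq):
    """Return the sequence as it would read after ideal bisulfite treatment.

    'C' (unmodified cytosine) is deaminated to 'U';
    '5' (5-methylcytosine) is protected and reads as 'C'.
    Complete conversion is assumed, which is an idealisation.
    """
    converted = []
    for base in seq:
        if base == "C":
            converted.append("U")
        elif base == "5":
            converted.append("C")
        else:
            converted.append(base)
    return "".join(converted)


methylated_prefix = "GAT5GGAAGAG5"    # first 12 bases, with 5 = 5mC
unmethylated_prefix = "GATCGGAAGAGC"  # same bases without methylation

print(bisulfite_readout(methylated_prefix))    # adapter sequence retained
print(bisulfite_readout(unmethylated_prefix))  # primer site lost to C-to-U changes
```

The methylated adapter keeps its original sequence, so a primer designed against the unconverted adapter can still hybridise after bisulfite treatment.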

ReBuilT library prep method

50 ng of P. berghei DNA was end repaired (NEBNext End Repair Module) and dA-tailed (NEBNext dA-tailing module), before ligation of custom adapter pair A (NEBNext Quick Ligation Module). Bisulfite conversion was achieved with the Zymo EZ DNA Methylation-Gold kit, following the manufacturer's instructions. To recover damaged fragments, 5 μL of 10 mM primer (5'-CAA GCA GAA GAC GGC ATA CGA GAT TGG TCA GTG ACT GGA GTT CAG ACG TGT GCT CTT CCG ATC T-3'), 200 μM dNTPs, 10 μL VeraSeq Buffer II (Enzymatics) and 1 U VeraSeq Ultra (Enzymatics) were added. Following incubation at 95 °C for 3 minutes and annealing at 54 °C for 45 seconds, extension at 72 °C was carried out for 30 minutes. The reaction mixture was incubated with 60 μg of streptavidin coated magnetic beads (Magnasphere Paramagnetic Particles, Promega) in 2x binding buffer (10 mM Tris-HCl pH 7.4, 1 mM EDTA, 2 M NaCl, 0.1% Tween 20) for 20 minutes at room temperature. Beads were washed three times with 400 μL binding buffer before being end repaired. Beads were again washed three times with 400 μL binding buffer before dA-tailing, and a further three times with 400 μL binding buffer before ligation of adapter pair B. Finally, three washes with 400 μL binding buffer were followed by elution of the A,T,G,C strand with 50 mM NaOH at 60 °C for 15 minutes.

Sequencing

Libraries were quantified for sequencing with the KAPA Universal Library Quantification Kits on a BioRad CFX384 Touch Real-Time PCR Detection System. The PCR-free samples were eluted with NaOH (50 mM), so did not require denaturation, only direct dilution to appropriate concentrations (3 pM for NextSeq 500) with HT1 buffer. PCR-amplified samples were diluted with ultrapure water, denatured with NaOH, neutralized with 200 mM Tris-HCl pH 7 and diluted to working concentration with HT1 buffer. Paired-end 75 or 100 base reads were obtained on an Illumina NextSeq 500.

Sequence alignment

Raw reads were trimmed to remove adapter contamination and low quality bases using trim_galore version 0.3.7 (bioinformatics.babraham.ac.uk/projects/trim_galore) and cutadapt version 1.4.2 with option --stringency 3 and other arguments as default (Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10 (2011)). Trimmed reads were aligned using bwameth.py (Pedersen, B. S., Eyring, K., De, S., Yang, I. V. & Schwartz, D. A. Fast and accurate alignment of long bisulfite-seq reads. (2014)). The reference sequence for alignment was mouse genome version mm9 concatenated to Plasmodium berghei genome version 11. After alignment, the mapping quality of reads mapped with more than 10% of mismatches was reset to 0 using resetHighMismatchReads.py (code.google.com/p/bioinformatics-misc/source/browse). Overlapping read pairs were clipped using clipOverlap in the BamUtil suite version 1.0.12. Genomic data manipulations were facilitated by samtools, BEDTools, Picard (broadinstitute.github.io/picard/) and deepTools (Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009); Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842 (2010); Ramírez, F., Dündar, F., Diehl, S., Grüning, B. A. & Manke, T. deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Research 42, W187-91 (2014)).

Methylation calling

The counts of converted and unconverted cytosine, i.e. the methylation status, in the P. berghei genome were obtained from the alignment files using bam2methylation.py (code.google.com/p/bioinformatics-misc/source/browse). Only reads with mapping quality 15 or above were considered and read bases with quality less than 13 were excluded from methylation calling. In addition, at each cytosine position the number of mismatches, i.e. the number of reads not A or C, was recorded.
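The two quality thresholds described above can be applied as in the following sketch. It is not the bam2methylation.py implementation: it operates on plain (base, mapping quality, base quality) tuples for a single forward-strand cytosine position, and the choice to tally every base that is neither C nor T under "other" is an assumption made for illustration.

```python
MIN_MAPPING_QUALITY = 15  # reads below this are ignored
MIN_BASE_QUALITY = 13     # bases below this are ignored


def count_calls(observations):
    """Tally calls at one reference cytosine on the forward strand.

    `observations` is an iterable of (base, mapping_quality, base_quality).
    'C' counts as unconverted (methylated), 'T' as converted; anything
    else is tallied as 'other' (an assumption of this sketch).
    """
    counts = {"unconverted": 0, "converted": 0, "other": 0}
    for base, mapq, baseq in observations:
        if mapq < MIN_MAPPING_QUALITY or baseq < MIN_BASE_QUALITY:
            continue
        if base == "C":
            counts["unconverted"] += 1
        elif base == "T":
            counts["converted"] += 1
        else:
            counts["other"] += 1
    return counts


# Hypothetical pile-up at a single position.
pileup = [("C", 60, 30), ("T", 60, 35), ("T", 10, 35), ("C", 60, 5), ("G", 40, 20)]
calls = count_calls(pileup)
print(calls)
```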

Methylation levels at individual cytosines were assessed independently for each library. At each position a Fisher test was applied to test the hypothesis that the count of unconverted cytosines exceeded the number of mismatches found at that position. The p-values from the three PCR-BS and three ReBuilT libraries were combined via Stouffer's method, where the p-values from the individual libraries were weighted by the respective read depth. The combined p-values thus obtained were corrected for multiple testing by applying the false discovery rate procedure (Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57, 289-300 (1995)). Data analysis was performed in R 3.1.2 (R Core Team. R: A Language and Environment for Statistical Computing. (2014)).
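The combination and correction steps can be sketched as follows. The analysis described here was carried out in R; the Python below is only an illustration using the standard library. It takes the per-library p-values as given (the Fisher test itself is not reproduced), and it assumes that the read depth is used directly as the Stouffer weight, which the text does not specify in more detail. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import List, Sequence

_STD_NORMAL = NormalDist()


def stouffer_weighted(p_values: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Combine one-sided p-values with weighted Stouffer Z-scores."""
    if not p_values or len(p_values) != len(weights):
        raise ValueError("need equally long, non-empty inputs")
    # Clamp p-values so the inverse normal CDF stays finite.
    clamped = [min(max(p, 1e-300), 1.0 - 1e-16) for p in p_values]
    # By symmetry, the Z-score for an upper-tail p-value is -inv_cdf(p).
    z_scores = [-_STD_NORMAL.inv_cdf(p) for p in clamped]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sqrt(sum(w * w for w in weights))
    return _STD_NORMAL.cdf(-numerator / denominator)


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values and read depths for one cytosine in six libraries.
    p_per_library = [0.04, 0.10, 0.02, 0.30, 0.05, 0.01]
    depth_per_library = [120, 80, 150, 40, 100, 200]
    print(stouffer_weighted(p_per_library, depth_per_library))
    # Hypothetical combined p-values for four positions.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

In this sketch, libraries with greater depth contribute more to the combined Z-score, so a well-covered library can outweigh several shallow ones.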

Segmenting methylation

Runs of methylated cytosines were detected by segmenting the signal of combined p-values. To this end, the vector of combined p-values was first converted to a vector of discrete observations as follows: '0' if p > 0.1, '1' if 0.05 < p ≤ 0.1, '2' if 0.001 < p ≤ 0.05 and '3' if p ≤ 0.001. Then a two-state hidden Markov model (HMM) was fitted to the recoded p-values to partition the signal into segments of high and low evidence of methylation. The R package RHmm was used for model fitting (Taramasco, O. & Bauer, S. RHmm: Hidden Markov Models simulations and estimations. (2013)).
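The recoding and segmentation can be illustrated with a short Python sketch. It is not the RHmm workflow: no model is fitted here, and the start, transition and emission probabilities below are hypothetical values chosen for illustration. The sketch assumes that each boundary value falls into the more significant bin, and it uses Viterbi decoding to assign each position to the low-evidence state (0) or the high-evidence state (1).

```python
from math import log
from typing import List, Sequence


def discretize(p: float) -> int:
    """Recode a combined p-value into one of four observation symbols."""
    if p > 0.1:
        return 0
    if p > 0.05:
        return 1
    if p > 0.001:
        return 2
    return 3


def viterbi(observations: Sequence[int], start: Sequence[float],
            trans: Sequence[Sequence[float]],
            emit: Sequence[Sequence[float]]) -> List[int]:
    """Return the most probable state path, computed in log space."""
    if not observations:
        return []
    n_states = len(start)
    scores = [log(start[s]) + log(emit[s][observations[0]])
              for s in range(n_states)]
    back_pointers = []
    for obs in observations[1:]:
        prev = scores
        pointers, scores = [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + log(trans[r][s]))
            pointers.append(best)
            scores.append(prev[best] + log(trans[best][s]) + log(emit[s][obs]))
        back_pointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical parameters; a fitted model would supply these instead.
START = [0.5, 0.5]
TRANS = [[0.9, 0.1], [0.1, 0.9]]
EMIT = [[0.70, 0.15, 0.10, 0.05],   # state 0: low evidence of methylation
        [0.05, 0.10, 0.15, 0.70]]   # state 1: high evidence of methylation

if __name__ == "__main__":
    p_values = [0.5, 0.3, 0.8, 0.0002, 0.0001, 0.0005, 0.4, 0.9, 0.6]
    symbols = [discretize(p) for p in p_values]
    print(symbols)
    print(viterbi(symbols, START, TRANS, EMIT))
```

With these illustrative parameters, state changes are penalised, so an isolated significant position is less likely to open a segment than a run of several significant positions.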

Various further aspects and embodiments of the present invention will be apparent to those skilled in the art in view of the present disclosure.

Other aspects and embodiments of the invention provide the aspects and embodiments described above with the term "comprising" replaced by the term "consisting of" and the aspects and embodiments described above with the term "comprising" replaced by the term "consisting essentially of".

It is to be understood that the application discloses all combinations of any of the aspects and embodiments described above with each other, unless the context demands otherwise.

Similarly, the application discloses all combinations of the preferred and/or optional features either singly or together with any of the other aspects, unless the context demands otherwise.

Modifications of the above embodiments, further embodiments and modifications thereof will be apparent to the skilled person on reading this disclosure, and as such these are within the scope of the present invention.

All documents and sequence database entries mentioned in this specification are incorporated herein by reference in their entirety for all purposes.

"and/or" where used herein is to be taken as specific disclosure of each of the two specified features or components with or without the other. For example "A and/or B" is to be taken as specific disclosure of each of (i) A, (ii) B and (iii) A and B, just as if each is set out individually herein.