Title:
METHODS FOR PREPARING NUCLEIC ACID MOLECULES FOR SEQUENCING
Document Type and Number:
WIPO Patent Application WO/2019/117714
Kind Code:
A1
Abstract:
The invention relates to means and methods for preparing double stranded target DNA molecules for sequencing. In embodiments double stranded backbone DNA molecules comprising 5' and 3' ends are provided that are: ligation compatible with 5' and 3' ends of said target DNA; form a first restriction enzyme recognition site when self-ligated; in a form that enables self-ligation. Methods may comprise providing, if not already present, said target DNA with 5' and 3' ends that are in a form that prevents self-ligation and that are ligation compatible with said backbone DNA 5' and 3' ends. Methods may further comprise ligating said target DNA to said backbone DNA in the presence of a ligase and a first restriction enzyme that cuts said first restriction enzyme recognition site, thereby producing at least one DNA circle comprising a backbone DNA molecule and a target DNA molecule. Linear DNA may be removed at this time and subsequently a concatemer DNA molecule comprising an ordered array of copies of said at least one DNA circle through rolling circle amplification is produced that can be sequenced.

Inventors:
KLOOSTERMAN WIGARD (NL)
DE RIDDER JEROEN (NL)
MARCOSSI ALESSIO (NL)
Application Number:
PCT/NL2018/050831
Publication Date:
June 20, 2019
Filing Date:
December 11, 2018
Assignee:
UMC UTRECHT HOLDING BV (NL)
International Classes:
C12Q1/6806
Domestic Patent References:
WO2000015779A22000-03-23
Foreign References:
US20160376647A12016-12-29
US20100069263A12010-03-18
Other References:
GOODWIN ET AL., NATURE REVIEWS GENETICS, vol. 17, 2016, pages 333 - 351
MOHSEN; KOOL, ACC CHEM RES., vol. 49, no. 11, 2016, pages 2540 - 2550
KELMAN ET AL., STRUCTURE, vol. 6, 1998, pages 121 - 125
BLANCO ET AL., J. BIOL. CHEM., vol. 264, no. 15, 1999, pages 8935 - 40
LOOSE ET AL.: "Real-time selective sequencing using nanopore technology", NATURE METHODS, 2016
BACK, THOMAS: "Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms", 1996, OXFORD UNIVERSITY PRESS
COELLO COELLO; CARLOS A.; GARY B. LAMONT: "Applications of Multi-Objective Evolutionary Algorithms", 2004, WORLD SCIENTIFIC
HWANG; GI-HYUN; WON-TAE JANG: "An Adaptive Evolutionary Algorithm Combining Evolution Strategy and Genetic Algorithm (Application of Fuzzy Power System Stabilizer", ADVANCES IN EVOLUTIONARY ALGORITHMS, 2008
LOBO, F. J.; CLAUDIO F. LIMA; ZBIGNIEW MICHALEWICZ: "Parameter Setting in Evolutionary Algorithms", 2007, SPRINGER SCIENCE & BUSINESS MEDIA
MENCONI; GIULIA; ANDREA BEDINI; ROBERTO BARALE; ISABELLA SBRANA: "Global Mapping of DNA Conformational Flexibility on Saccharomyces Cerevisiae", PLOS COMPUTATIONAL BIOLOGY, vol. 11, no. 4, 2015, pages e1004136
RAN, F. ANN; PATRICK D. HSU; JASON WRIGHT; VINEETA AGARWALA; DAVID A. SCOTT; FENG ZHANG: "Genome Engineering Using the CRISPR-Cas9 System", NATURE PROTOCOLS, vol. 8, no. 11, 2013, pages 2281 - 2308, XP002772991, DOI: doi:10.1038/nprot.2013.143
ARBEITHUBER; BARBARA; KATERYNA D. MAKOVA; IRENE TIEMANN-BOEGE: "Artifactual Mutations Resulting from DNA Lesions Limit Detection Levels in Ultrasensitive Sequencing Applications", DNA RESEARCH: AN INTERNATIONAL JOURNAL FOR RAPID PUBLICATION OF REPORTS ON GENES AND GENOMES, vol. 23, no. 6, 2016, pages 547 - 59
BELGRANO; FABRICIO S.; ISABEL C. DE ABREU DA SILVA; FRANCISCO M. BASTOS DE OLIVEIRA; MARCELO R. FANTAPPIE; RONALDO MOHANA-BORGES: "Role of the Acidic Tail of High Mobility Group Protein B1 (HMGB1) in Protein Stability and DNA Bending", PLOS ONE, vol. 8, no. 11, 2013, pages e79572
BRESLAUER, K. J.; R. FRANK; H. BLOCKER; L. A. MARKY: "Predicting DNA Duplex Stability from the Base Sequence", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES, vol. 83, no. 11, 1986, pages 3746 - 50, XP002034050, DOI: doi:10.1073/pnas.83.11.3746
DEPRISTO; MARK A.; ERIC BANKS; RYAN POPLIN; KIRAN V. GARIMELLA; JARED R. MAGUIRE; CHRISTOPHER HARTL; ANTHONY A. PHILIPPAKIS ET AL.: "A Framework for Variation Discovery and Genotyping Using next-Generation DNA Sequencing Data", NATURE GENETICS, vol. 43, no. 5, 2011, pages 491 - 98, XP055046798, DOI: doi:10.1038/ng.806
DIAZ-CANO, SALVADOR J: "Are PCR Artifacts in Microdissected Samples Preventable?", HUMAN PATHOLOGY, vol. 32, no. 12, 2001, pages 1415
DOWTHWAITE; GARY; JO PICKFORD: "PCR-Based DNA Enrichment Enhances Detection of Mutations in Oncology", MLO: MEDICAL LABORATORY OBSERVER, vol. 47, no. 11, 2015, pages 18,20
EDGAR, ROBERT C: "MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput", NUCLEIC ACIDS RESEARCH, vol. 32, no. 5, 2004, pages 1792 - 97, XP008137003, DOI: doi:10.1093/nar/gkh340
"MUSCLE: A Multiple Sequence Alignment Method with Reduced Time and Space Complexity", BMC BIOINFORMATICS, vol. 5, August 2004 (2004-08-01), pages 113
HOOGSTRAAT; MARLOUS; MIRJAM S. DE PAGTER; GEERT A. CIRKEL; MARKUS J. VAN ROOSMALEN; TIMOTHY T. HARKINS; KAREN DURAN; JENNIFER KREE: "Genomic and Transcriptomic Plasticity in Treatment-Naive Ovarian Cancer", GENOME RESEARCH, vol. 24, no. 2, 2014, pages 200 - 211
KIELBASA; SZYMON M.; RAYMOND WAN; KENGO SATO; PAUL HORTON; MARTIN C. FRITH: "Adaptive Seeds Tame Genomic Sequence Comparison", GENOME RESEARCH, vol. 21, no. 3, 2011, pages 487 - 93
KIVIOJA; TEEMU; ANNA VAHARAUTIO; KASPER KARLSSON; MARTIN BONKE; MARTIN ENGE; STEN LINNARSSON; JUSSI TAIPALE: "Counting Absolute Numbers of Molecules Using Unique Molecular Identifiers", NATURE METHODS, vol. 9, no. 1, 2011, pages 72 - 74, XP055401382, DOI: doi:10.1038/nmeth.1778
KOU, RUQIN; HAM LAM; HAIRONG DUAN; LI YE; NARISRA JONGKAM; WEIZHI CHEN; SHIFANG ZHANG; SHIHONG LI: "Benefits and Challenges with Applying Unique Molecular Identifiers in Next Generation Sequencing to Detect Low Frequency Mutations", PLOS ONE, vol. 11, no. 1, 2016, pages e0146638, XP055469818, DOI: doi:10.1371/journal.pone.0146638
LI, CHENHAO; KERN REI CHNG; ESTHER JIA HUI BOEY; AMANDA HUI QI NG; ANDREAS WILM; NIRANJAN NAGARAJAN: "INC-Seq: Accurate Single Molecule Reads Using Nanopore Sequencing", GIGASCIENCE, vol. 5, no. 1, 2016, pages 34, XP021268736, DOI: doi:10.1186/s13742-016-0140-7
NEWMAN, AARON M.; ALEXANDER F. LOVEJOY; DANIEL M. KLASS; DAVID M. KURTZ; JACOB J. CHABON; FLORIAN SCHERER; HENNING STEHR ET AL.: "Integrated Digital Error Suppression for Improved Detection of Circulating Tumor DNA", NATURE BIOTECHNOLOGY, vol. 34, no. 5, 2016, pages 547 - 55, XP055464348, DOI: doi:10.1038/nbt.3520
QUACH, NANCY; MYRON F. GOODMAN; DARRYL SHIBATA: "In Vitro Mutation Artifacts after Formalin Fixation and Error Prone Translesion Synthesis during PCR", BMC CLINICAL PATHOLOGY, vol. 4, no. 1, 2004, XP021006451, DOI: doi:10.1186/1472-6890-4-1
SARAI, A.; J. MAZUR; R. NUSSINOV; R. L. JERNIGAN: "Sequence Dependence of DNA Conformational Flexibility", BIOCHEMISTRY, vol. 28, no. 19, 1989, pages 7842 - 49
SHORE, D.; J. LANGOWSKI; R. L. BALDWIN: "DNA Flexibility Studied by Covalent Closure of Short Fragments into Circles", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, vol. 78, no. 8, 1981, pages 4833 - 37, XP055278377
SHULDINER, ALAN R.; AJAY NIRULA; JESSE ROTH: "Hybrid DNA Artifact from PCR of Closely Related Target Sequences", NUCLEIC ACIDS RESEARCH, vol. 17, no. 11, 1989, pages 4409 - 4409
WONG, KOON HO; YI JIN; ZARMIK MOQTADERI: "Current Protocols in Molecular Biology", 2013, article "Multiplex Illumina Sequencing Using DNA Barcoding"
Attorney, Agent or Firm:
JANSEN, C.M. (NL)
Claims:

1. A method for preparing double stranded target DNA molecules for sequencing comprising

- providing double stranded backbone DNA molecules comprising 5’ and 3’ ends that are:

- ligation compatible with 5’ and 3’ ends of said target DNA;

- form a first restriction enzyme recognition site when self-ligated;

- in a form that enables self-ligation; and

- providing, if not already present, said target DNA with 5’ and 3’ ends that are in a form that prevents self-ligation and that are ligation compatible with said backbone DNA 5’ and 3’ ends;

said method further comprising

- ligating said target DNA to said backbone DNA in the presence of a ligase and a first restriction enzyme that cuts said first restriction enzyme recognition site, thereby producing at least one DNA circle comprising a backbone DNA molecule and a target DNA molecule;

- optionally removing linear DNA;

- producing a concatemer DNA molecule comprising an ordered array of copies of said at least one DNA circle through rolling circle amplification; and

- sequencing said at least one concatemer.

2. The method of claim 1, wherein said concatemers are sequenced by long read sequencing.

3. The method of claim 1 or claim 2, wherein two or more backbones are provided.

4. The method of claim 3, wherein at least two backbones comprise a unique identifier sequence (barcode).

5. The method of claim 1 or claim 2, wherein said backbones comprise a linker.

6. The method of claim 5, wherein said linker comprises a sequence of 20 - 900 nucleotides.

7. The method of claim 5 or claim 6, wherein the sequence of said linker does not have a repeated DNA motif of more than 5 nucleotides; or does not have a self complementary motif of more than 6 nucleotides separated by less than 10 nucleotides or a combination thereof.

8. The method of any one of claims 1-7, wherein said specific ligation compatible 5’ and 3’ ends are blunt ends.

9. The method of claim 8, wherein the ligation of an end of a target DNA to an end of a backbone creates a target-backbone junction with a sequence that cannot be recognized/cut by the restriction enzyme that cuts the restriction enzyme site that is formed by self-ligation of a backbone.

10. The method of any one of claims 1-9, wherein said form that prevents self-ligation is a 5'-hydroxyl of one DNA terminus and 3'-hydroxyl of another and said form that allows self-ligation is a 5’-phosphate group of one DNA terminus and 3'-hydroxyl of another.

11. The method of any one of claims 1-10, wherein the target DNA is 20-400 base pairs.

12. A method to assess the insert capture efficiency of a backbone comprising the steps of claim 1 and further comprising comparing insert capture efficiency between different backbones.

13. A collection of linear DNA molecules (backbones) of a length of 20 - 1000 nucleotides that comprise 5’ ends that comprise a part of a first restriction enzyme recognition site at the extreme end and 3’ ends that comprise the other part of a first restriction enzyme recognition site at the extreme end, and which 5’ and 3’ ends are ligation compatible with each other and form a restriction enzyme recognition (first restriction enzyme) site when self-ligated and wherein each of said backbones comprises:

a linker;

an identifier sequence that differs from the sequence of identifiers of other backbones in the collection (barcode); and

optionally a restriction site for a nicking enzyme.

14. The collection of backbones of claim 13, wherein the backbones further comprise a restriction enzyme site for a type II restriction enzyme that can create non-palindromic overhangs (Golden Gate cloning site).

15. The collection of backbones of claim 13 or claim 14, wherein the linker comprises a sequence of 30 - 900 nucleotides; has a high overall complexity; does not have a repeated DNA motif of more than 5 nucleotides; or does not have a self complementary motif of more than 3 nucleotides separated by less than 10 nucleotides or a combination thereof.

16. The collection of backbones of any one of claims 13-15, further comprising nucleic acid molecules (captured nucleic acid molecule) in said first restriction site.

17. The collection of backbones of any one of claims 13-16, comprising a library of captured nucleic acid molecules.

18. A method for determining the sequence of a collection of nucleic acid molecules comprising

- providing double stranded target DNA molecules that have a recombinase recognition site specific for a target site specific recombinase at the 5’ and the 3’ ends;

- providing a backbone comprising said recognition sites separated by DNA comprising a linker;

- incubating said target DNA molecules with said backbones in the presence of said target site specific recombinase, preferably a Cre recombinase, a FLP recombinase or a bacteriophage lambda integrase, thereby producing DNA circles comprising a backbone and a target DNA molecule;

- optionally removing linear DNA;

- producing concatemers comprising an ordered array of copies of at least two of said DNA circles through rolling circle amplification; and

- sequencing said concatemers.

19. A kit comprising a collection of linear DNA molecules of any one of claims 13-17.

20. The kit of claim 19, further comprising a polymerase with high processivity and optionally one or more polymerization primers.

21. The kit of claim 19 or claim 20, further comprising a ligase and said first restriction enzyme; and/or said target site specific recombination enzyme.

22. The kit of any one of claims 19-21, further comprising a DNA exonuclease.

Description:
Title: Methods for preparing nucleic acid molecules for sequencing

The invention relates to means and methods for determining the sequence of nucleic acid molecules. In particular the invention relates to methods that leverage rolling circle amplification of the nucleic acid molecules of which the sequence is to be determined.

Sequencing methods have evolved over time. The old Sanger sequencing method has been replaced by the now common next generation sequencing (NGS) methods. These methods have recently been reviewed in Goodwin et al (2016; Nature Reviews Genetics Volume 17: pp 333-351; doi: 10.1038/nrg.2016.49). The most common NGS methods rely on the sequencing of short stretches of DNA.

Sequencing techniques for short stretches of DNA suffer from inherent error profiles. Errors are reduced by independently sequencing multiple copies of the same target sequence. However, for each individual sequence read it is impossible to determine whether a change represents an error or a true mutation. The cumulative evidence across several independent sequence reads allows for the filtering of mutations introduced during amplification and errors in sequencing. Longer target DNAs can also be sequenced with short read methods. This is typically done by sequencing overlapping fragments that can be aligned to create an assembled longer sequence. This so-called short read paired end technique has been very successful in the sequencing of large target nucleic acids and has been instrumental in the various genome projects. The genome projects have revealed that genomes are highly complex with many long repetitive elements, copy number alterations and structural variations. Many of these elements are so long that short-read paired-end technologies are insufficient to resolve them. Long-read sequencing delivers reads in excess of several kilobases and allows for the resolution of these large structural features in whole genomes. Two popular platforms for long read sequencing are the Pacific Biosciences systems (the RSII and the Sequel) and the Oxford Nanopore systems (MK1 MinION and PromethION). Both are single-molecule sequencers. Both platforms allow reads in excess of 55 kb. However, these systems have even higher error rates than next (second) generation sequencers. These errors can be reduced by increasing the number of times the same target nucleic acid is sequenced (Goodwin et al 2016; doi: 10.1038/nrg.2016.49).
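The effect of repeated sequencing on the error rate can be illustrated with a simple majority-vote model. The sketch below is not taken from the cited review. It assumes that errors in each copy are independent, that they occur at a per-base rate `p`, and that a position is called wrong only when more than half of the `n` copies are wrong. The function name `majority_error` and the example rate of 10% are illustrative assumptions.

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that more than half of n independent copies are wrong at one position."""
    k_min = n // 2 + 1  # smallest number of wrong copies that outvotes the correct ones
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


if __name__ == "__main__":
    # Assumed per-base error rate of 10%, odd copy numbers to avoid ties.
    for n in (1, 3, 5, 9):
        print(n, majority_error(0.10, n))
```

Under these assumptions the residual error falls quickly with the number of copies. Real sequencing errors are often systematic rather than independent, so the model gives an optimistic bound.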

The present invention provides novel solutions for the preparation of nucleic acid molecules for sequencing.

SUMMARY OF THE INVENTION

In an embodiment the invention provides a method for preparing double stranded target DNA molecules for sequencing, comprising

- providing double stranded backbone DNA molecules comprising 5’ and 3’ ends that are:

- ligation compatible with 5’ and 3’ ends of said target DNA;

- form a first restriction enzyme recognition site when self-ligated;

- in a form that enables self-ligation; and

- providing, if not already present, said target DNA with 5’ and 3’ ends that are in a form that prevents self-ligation and that are ligation compatible with said backbone DNA 5’ and 3’ ends;

said method further comprising

- ligating said target DNA to said backbone DNA in the presence of a ligase and a first restriction enzyme that cuts said first restriction enzyme recognition site, thereby producing at least one DNA circle comprising a backbone DNA molecule and a target DNA molecule;

- optionally removing linear DNA;

- producing a concatemer DNA molecule comprising an ordered array of copies of said at least one DNA circle through rolling circle amplification; and

- sequencing said at least one concatemer.

Also provided is a collection of DNA molecules (backbones) of a length of 50 - 1000 nucleotides that comprise 5’ ends that comprise a part of a first restriction enzyme recognition site at the extreme end and 3’ ends that comprise the other part of a first restriction enzyme recognition site at the extreme end, and which 5’ and 3’ ends are ligation compatible with each other and may form a restriction enzyme recognition (first restriction enzyme) site when self-ligated and wherein each of said backbones comprises:

a linker;

optionally an identifier sequence that differs from the sequence of identifiers of other backbones in the collection (barcode);

optionally a second identifier that is unique for a collection of backbone molecules;

and optionally a restriction site for a nicking enzyme.

Further provided is a method for determining the sequence of a collection of nucleic acid molecules comprising

- providing double stranded target DNA molecules that have 5’ and 3’ ends with a protruding adenine residue at the 3'-end of both strands of the DNA molecules;

- providing a collection of double stranded backbone DNA molecules that comprise 5’ and 3’ ends that are ligation compatible with the 5’ and 3’ ends of the target DNA;

said method further comprising

- ligating said target DNA to said backbones in the presence of a ligase, thereby producing DNA circles comprising a backbone and a target DNA molecule;

- optionally removing linear DNA;

- producing concatemers comprising an ordered array of copies of at least two of said DNA circles through rolling circle amplification; and

- sequencing said concatemers.

Further provided is a method for determining the sequence of a collection of nucleic acid molecules comprising

providing double stranded target DNA molecules that have a recombinase recognition site specific for a target site specific recombinase at the 5’ and the 3’ ends;

providing a backbone comprising said recognition sites separated by DNA comprising a linker;

incubating said target DNA molecules with said backbones in the presence of said target site specific recombinase, preferably a Cre recombinase, a FLP recombinase or a bacteriophage lambda integrase, thereby producing DNA circles comprising a backbone and a target DNA molecule;

optionally removing linear DNA;

producing concatemers comprising an ordered array of copies of at least two of said DNA circles through rolling circle amplification; and

sequencing said concatemers.

In a preferred embodiment the backbone is a circle comprising two recombinase recognition sites separated on one side by DNA comprising a linker and separated on the other side by DNA coding for a restriction enzyme recognition site, and wherein said restriction site is the only recognition site for said restriction enzyme in said backbone. In this embodiment the method preferably further comprises digesting said DNA after said recombination with said restriction enzyme and subsequently removing linear DNA, prior to producing said concatemers.

Further provided is a method for determining the sequence of a collection of nucleic acid molecules comprising

- providing double stranded target DNA molecules that have a recombinase recognition site specific for a target site specific recombinase at the 5’ and the 3’ ends;

- providing a collection of double stranded circular backbone DNA molecules that comprise said recombinase recognition site and a linker;

said method further comprising

- incubating said target DNA with said backbones in the presence of a target site specific recombinase for said recognition sites, thereby producing DNA circles comprising a backbone and a target DNA molecule;

- optionally removing linear DNA;

- producing concatemers comprising an ordered array of copies of at least two of said DNA circles through rolling circle amplification; and

- sequencing said concatemers.

Further provided is a kit comprising one or more backbones.

DETAILED DESCRIPTION OF THE INVENTION

Means and methods as described herein can determine the sequence of the same target DNA molecule multiple times. This can be used as a means to correct errors. This is different from classical second generation sequencing methods, which correct errors by sequencing multiple independent molecules covering the same genomic locus. In such cases each read typically represents one sequencing event of one molecule. With a method of the invention a single (target) molecule is copied over and over so one read represents multiple sequencing events of the same molecule.
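How one read can carry several sequencing events of the same molecule can be sketched as follows. This is a minimal illustration, not the analysis pipeline of the invention. It assumes that the backbone sequence occurs without errors in the read, so that the read can be split on exact matches, and that the copies of the insert differ only by substitutions. The names `split_concatemer` and `majority_consensus` and the example sequences are hypothetical.

```python
from collections import Counter


def split_concatemer(read: str, backbone: str) -> list[str]:
    """Return the insert copies that lie between exact backbone occurrences."""
    parts = read.split(backbone)
    # The first and last pieces may be truncated copies, so only interior pieces are kept.
    return [p for p in parts[1:-1] if p]


def majority_consensus(copies: list[str]) -> str:
    """Column-wise majority vote over copies of the most common length (no indel handling)."""
    if not copies:
        return ""
    length = Counter(len(c) for c in copies).most_common(1)[0][0]
    usable = [c for c in copies if len(c) == length]
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*usable))


if __name__ == "__main__":
    backbone = "CATACTATCATG"   # stand-in backbone sequence (assumption)
    insert = "ACGTTGCAAG"       # stand-in target sequence (assumption)
    with_error = "ACGTTGCTAG"   # same insert with one substitution
    read = backbone.join([insert, insert, with_error, insert, "ACG"])
    copies = split_concatemer(read, backbone)
    print(copies)
    print(majority_consensus(copies))
```

A practical implementation would locate backbone copies by alignment rather than exact matching and would build the consensus from a multiple sequence alignment, because long reads also contain insertions and deletions.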

Target nucleic acid is typically double stranded DNA. Single stranded DNA or RNA of which the sequence needs to be determined can easily be converted into double stranded DNA by methods known in the art. Such methods include but are not limited to cDNA synthesis; reverse-transcriptase (RT) polymerase chain reaction (PCR); PCR; random prime extension and the like. The target DNA is linear or is made linear prior to performing the method.

A backbone is typically double stranded DNA. In methods that utilize a restriction enzyme to ligate target DNA into the backbone the backbones are typically linear or are made linear prior to or during the method. In methods that utilize a target site specific recombinase to insert target DNA into the backbone the backbones can be linear or are made circular prior to or during the method.

Self-ligation is herein defined as ligation of the 5’ end to the 3’ end of one and the same nucleic acid molecule.

The 5’ and 3’ ends of target DNAs are chosen such that they are ligation compatible with the 5’ and 3’ ends of backbones used in the reactions. Ligation compatible means that ligation of the ends to each other yields a double stranded DNA with correctly paired nucleotides without nicks in the ligation junction. Nicks can of course be introduced later to allow initiation of the RCA reaction. Blunt ends are ligation compatible with other blunt ends. DNA molecules with sticky (also referred to as “cohesive”) ends are ligation compatible with other sticky ends if the protruding strands of DNA can be annealed together without leaving unpaired bases. Such is typically the case when the ends have a complementary sequence. “Ligation compatible ends” are in the art also referred to as “compatible ends”, “compatible cohesive ends” or “compatible sticky ends”.
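The compatibility rule for sticky ends can be expressed as a short check. The sketch below is an illustration under simplifying assumptions: both overhangs are written 5' to 3', both have the same polarity (both 5' overhangs or both 3' overhangs), and a blunt end is represented by an empty string. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def overhangs_compatible(a: str, b: str) -> bool:
    """True if two overhangs of the same polarity anneal without unpaired bases.

    An empty string stands for a blunt end, so two blunt ends are compatible.
    """
    return len(a) == len(b) and a == reverse_complement(b)


if __name__ == "__main__":
    print(overhangs_compatible("AATT", "AATT"))  # two EcoRI-type overhangs
    print(overhangs_compatible("", ""))          # blunt with blunt
    print(overhangs_compatible("AATT", ""))      # overhang with blunt
    print(overhangs_compatible("A", "T"))        # single-base 3' overhangs
```

The check covers only the annealing of the protruding strands. Whether a ligase can seal the junction also depends on the presence of a 5’-phosphate, which this sketch does not model.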

Double stranded target DNA molecules comprise the sequence of the nucleic acid molecules of which the sequence is to be determined. The nucleic acid molecules of which the sequence is to be determined can already be double stranded DNA with 5’ and 3’ ends that are ligation compatible with the 5’ and 3’ ends of the backbone(s) to be used. Sometimes the nucleic acid needs to be made double stranded DNA, for instance in the case of cDNA or mRNA. The target DNA molecules can already have suitable 5’ and 3’ ends, for instance a variety of polymerases produce blunt-end fragments. Such blunt end fragments are ligation compatible with backbones that have blunt 5’ and 3’ ends. The target nucleic acid can also be provided with suitable 5’ and 3’ ends, for instance through digestion with an appropriate restriction enzyme or enzymes, or by addition of deoxynucleotides through terminal transferase. Suitable 5’ ends and 3’ ends can also be introduced through the insertion of a restriction enzyme site, recombinase recognition sites and/or homology regions, for instance by ligating an adaptor containing the site(s) to the target DNA or by amplifying the target DNA with primers that contain the restriction enzyme site, recombinase recognition site and/or homology region.

Enzymes are available that leave ends that are ligation compatible with the ends of the backbone but that differ in the nucleotide(s) in the region immediately adjacent to the protruding ends. In this embodiment it is preferred that the recognition site(s) of the enzyme(s) is/are not the same as the restriction enzyme site of said first restriction enzyme. In this way ligation of the compatible ends does not yield a site that can be cut by said first restriction enzyme. If a restriction enzyme is used to provide the target nucleic acid with appropriate ends, it is preferred that the enzyme is a blunt end producing enzyme. In one embodiment the target DNA molecules are provided with 5’ and 3’ ends that are ligation compatible with the 5’ and 3’ ends of the backbones to be used, by digestion with one or more restriction enzymes.

In one embodiment the ligation of an end of a target DNA to an end of a backbone creates a target-backbone junction with a sequence that cannot be recognized/cut by the restriction enzyme that cuts the (first) restriction enzyme site that is formed by self-ligation of a backbone.

In a preferred embodiment said form that prevents self-ligation is a 5'-hydroxyl of one DNA terminus and 3'-hydroxyl of another and said form that allows self-ligation is a 5’-phosphate group of one DNA terminus and 3'-hydroxyl of another. Ligation requires the presence of a 5’-phosphate group. Removal by an appropriate phosphatase on both 5’ ends of a nucleic acid molecule prevents self-ligation and ligation to other DNA molecules similarly treated. Ligation is prevented even if the ends have ligation compatible ends.

In one embodiment the backbone comprises a recognition site for a nicking enzyme.

Target DNA molecules have 5’ and 3’ ends that are in a form that prevents self-ligation. Preferably the target DNA is in a form that prevents ligation to other target DNA molecules. Both requirements can be met by providing the ends in dephosphorylated form or by addition of nucleotides (3’ overhang) at the 3’ end of said target DNA molecules.

Self-ligation is inherently prevented when the 5’ and 3’ ends of the target DNA are ligation incompatible. Also in these cases, however, it is preferred that ligation to other target molecules is prevented. Thus also in these circumstances it is preferred that the ends are provided in dephosphorylated form. Incompatible ends are for instance but not limited to blunt ends and overhang ends, or overhang ends wherein the protruding nucleotides (overhangs) of the ends are not compatible.

Prevention of self-ligation and/or prevention of ligation to other target DNA molecules does not have to be absolute. The processes can/will occur at some level. This can be tolerated in a method of the invention. Good reads can be obtained even with low ligation efficiencies.

The 5’ and 3’ end of the backbone DNA can be ligation compatible with each other. In such embodiments it is preferred that the 5’ and 3’ ends of the target DNA are also ligation compatible with each other. It is preferred that self-ligation of the ends of a backbone is not prevented. It is preferred that the 5’ ends of the target DNA are dephosphorylated. It is preferred that the ligation is performed in the presence of a restriction enzyme that recognizes and cuts said first restriction enzyme site.

In embodiments where double stranded target DNA is captured the backbone is a double-stranded nucleic acid molecule. Such backbones comprise 5’ and 3’ ends that are ligation compatible with the 5’ and 3’ ends of the target DNA. The 5’ and 3’ ends of the backbone may also be ligation compatible with each other.

In embodiments a backbone includes one or more of the following parts:

A 5’ end coding for a first part of a first restriction site, preferably a first half of a first restriction site (see for instance 1 in the schematic example below),

One or more sites that allow nicking of the double-stranded backbone sequence (see for instance 2 below),

One or more type 1 or type 2 restriction sites (see for instance 3 below),

A secondary cloning site (see for instance 4 below),

A flexible DNA stretch that enables efficient circularization (bending) of the backbone molecule (see for instance 5 below),

A unique molecular barcode (identifier) sequence to tag each individual backbone molecule (see for instance 6 below),

A 3’ end coding for the other part, preferably the other half of the mentioned first restriction site.

Phosphorylation at the 5’ ends of the backbone molecule and a hydroxyl group at the 3’ ends of the backbone.

A secondary barcode sequence that can be used to identify individual samples.

Schematic example of a double stranded backbone sequence:

   (1)       (2)        (3)            (4)              (5)         (6)    (1)
5'-GGGC .. CCTCAGC .. ATTTAAAT .. GTCTTCGAGAAGAC .. CATACTATCATG .. (N) .. GCCC-3'
3'-CCCG .. GGAGTCG .. TAAATTTA .. CAGAAGCTCTTCTG .. GTATGATAGTAC .. (N) .. CGGG-5'

The dots represent 0 nucleotides; 1 nucleotide; 2 nucleotides or more.

The sequences GGGC and GCCC stand for halves of a restriction enzyme site. Together they constitute an SrfI site, but another restriction enzyme site will also work. In the case of SrfI (GCCC | GGGC) an advantage is that it is a blunt end site. Another advantage is that it recognizes an 8-bases-long site while most of the commercially available alternatives recognize 6-bases-long sites.

It is preferred that the first restriction enzyme site does not occur elsewhere in the backbone sequence.
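Both properties of the schematic backbone, the reconstitution of the site on self-ligation and its absence elsewhere, can be checked with a few lines of code. The sketch below joins the elements of the schematic top strand without spacers (the dots taken as 0 nucleotides) and uses a placeholder of eight N bases for the barcode. The barcode length and all names in the code are assumptions made for illustration.

```python
SRFI_SITE = "GCCCGGGC"  # SrfI recognition sequence as given in the text (GCCC | GGGC)

# Top strand of the schematic backbone, elements (1) to (6) and the closing (1).
# "NNNNNNNN" stands in for the barcode (N); its length is an assumption.
ELEMENTS = ["GGGC", "CCTCAGC", "ATTTAAAT", "GTCTTCGAGAAGAC",
            "CATACTATCATG", "NNNNNNNN", "GCCC"]
BACKBONE = "".join(ELEMENTS)


def self_ligation_junction(top_strand: str, half: int = 4) -> str:
    """Sequence formed when the 3' end of a strand is joined to its own 5' end."""
    return top_strand[-half:] + top_strand[:half]


def count_sites(seq: str, site: str) -> int:
    """Count occurrences of site in seq, including overlapping ones."""
    return sum(seq.startswith(site, i) for i in range(len(seq) - len(site) + 1))


if __name__ == "__main__":
    print(self_ligation_junction(BACKBONE) == SRFI_SITE)  # site formed on self-ligation
    print(count_sites(BACKBONE, SRFI_SITE))               # occurrences in the linear form
```

Because the site is palindromic, checking the top strand is sufficient. A non-palindromic site would also have to be searched on the reverse complement.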

Ligation of ligation compatible ends can create a restriction site. This is the case if the ends and flanking sequences (if any) code for the restriction enzyme site when ligated to each other. As an example, the end of a double stranded DNA molecule that has a single stranded end with the sequence 5’-AATT.... is ligation compatible with a double stranded DNA molecule that has a single stranded end with the sequence ....TTAA-5’, where the dots indicate the double stranded part and the indications 5’ or 3’ the free end of the respective molecules. Ligation of the two ends yields a molecule with the double stranded sequence:

...AATT...

...TTAA...

The overhang is identical to the overhang that is created by the EcoRI restriction enzyme. Ligation creates the EcoRI restriction site only in some of the cases, i.e. in the case where the nucleotides flanking the AATT core have the indicated bases:

...GAATTC...

...CTTAAG...

EcoRI cannot cut when the flanking nucleotides have different bases. The following sequences are for instance not cut by EcoRI:

...CAATTC... ; or ...AAATTC... ; or ...GAATTA...
...GTTAAG... ;    ...TTTAAG... ;    ...CTTAAT...

The sequence of the ends of the target DNA thus determines whether the ligation junction formed by ligation of compatible ends can be digested by the enzyme that cuts said first restriction enzyme site.
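The EcoRI example can be enumerated directly. The sketch below builds every junction that can form across an annealed AATT overhang and reports which of them contain the EcoRI site. It considers the top strand only, which is sufficient because GAATTC is palindromic. The names `junction` and `cuttable` are illustrative.

```python
ECORI_SITE = "GAATTC"


def junction(left_flank: str, right_flank: str) -> str:
    """Top-strand sequence across an annealed AATT overhang with the given flanking bases."""
    return left_flank + "AATT" + right_flank


# Map each pair of flanking bases to whether the junction can be cut by EcoRI.
cuttable = {
    left + right: ECORI_SITE in junction(left, right)
    for left in "ACGT"
    for right in "ACGT"
}

if __name__ == "__main__":
    for flanks, is_cut in cuttable.items():
        print(junction(flanks[0], flanks[1]), "cut" if is_cut else "not cut")
```

Of the sixteen combinations only the one flanked by G on the left and C on the right is cut. When one flank is fixed by the backbone, one of the four possible target bases recreates the site.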

In embodiments a backbone can be optimized for insert capture efficiency, wherein greater efficiency is reflected by greater efficiency in circularization and rolling circle amplification (RCA) product formation. Insert capture efficiency of a backbone can be estimated by the amount of multimers that can be formed.

In methods for sequencing target DNA as described herein it is preferred that the ligation of target DNA to backbone DNA does not yield a first restriction enzyme recognition site in the target/backbone DNA junction. In the present invention it is preferred that self-ligation of the backbone yields a first restriction enzyme site and ligation of the backbone to target DNA does not yield said site. A preferred first restriction enzyme site is a site for an enzyme that allows for the most sequence variation in the ligation junction. As the sequence of the backbone has one part, and preferably a half, of the recognition sequence of the first restriction enzyme site, the variation comes from the sequence of the target end. In case the first restriction enzyme site is an EcoRI site the backbone sequence that codes for the first restriction enzyme site has a 5’ end with the sequence 5’-AATTC. The junction with target DNA can have one of four different sequences depending on the base of the nucleotide that flanks the overhang in the target DNA. Only when the target sequence has an end with the sequence 5’-AATTC.. is the ligation junction with the backbone digestible with EcoRI. Junctions with other sequences are not digestible with EcoRI. Variation in junctions is improved by selecting enzymes that create small or no overhangs and by selecting enzymes that require more specific bases in the recognition site. The first restriction enzyme site preferably comprises 6, more preferably 8, and preferably more bases. The enzyme that cuts said first restriction enzyme site is therefore preferably at least a 6-cutter, more preferably at least a 7-cutter, more preferably an 8-cutter. The number indicates the number of bases in the recognition site of the enzyme. For example, EcoRI is a 6-cutter; AluI, recognizing AGCT, is a 4-cutter. There are also 5-cutters (e.g. AvaII), 7-cutters (e.g. BbvCI), 8-cutters (e.g. NotI), and even other restriction enzymes. Together with the preference for a small or no overhang, this ensures a high potential for sequence variation in the ligation junction, which lowers the chance that the junction of a target sequence with a backbone sequence is a first restriction enzyme site. First restriction enzymes with more nucleotides in the recognition site are preferred also because such enzymes can allow for bigger target nucleic acid inserts.

The methods are suitable for a large variety of target nucleic acid sources. Methods of the invention can be performed with two or more backbones that have different first restriction enzyme sites. In this way more target molecules can be captured into DNA circles. In case a target DNA has two first restriction enzyme sites that are close together, the intervening sequence can efficiently be sequenced, for instance by capturing it with the backbone with the other first restriction enzyme site. The reference to first, in the context of the restriction site, refers to the position of the (halves of the) site on the backbone. Restriction enzyme recognition sites at other positions in the backbone will be referred to as second, third etc. restriction enzyme recognition sites.
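The preference for enzymes with longer recognition sites can be illustrated with a simple estimate. The sketch below assumes a random sequence in which the four bases are equally likely and independent, an idealisation that is not part of the description; under that assumption a fully specified site of n bases is expected about once every 4^n positions. The function name is illustrative only.

```python
def expected_site_spacing(n_bases: int) -> int:
    """Average distance (in bases) between occurrences of a fully specified
    n-base recognition site, assuming equiprobable, independent bases."""
    if n_bases < 1:
        raise ValueError("n_bases must be a positive integer")
    return 4 ** n_bases


for name, n in [("4-cutter", 4), ("6-cutter", 6), ("8-cutter", 8)]:
    print(f"{name}: about one site per {expected_site_spacing(n)} bases")
```

Under this idealisation an 8-cutter site is expected roughly once per 65,536 bases, which is consistent with the statement that enzymes with longer recognition sites can accommodate bigger inserts.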

Preferred first restriction enzyme recognition sites are sites for the restriction enzymes SrfI (GCCC | GGGC), PmeI (GTTT | AAAC) and SwaI (ATTT | AAAT). A particularly preferred first restriction enzyme site is the site for the restriction enzyme SrfI.
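As a minimal sketch of the junction logic described above, the following checks whether a ligation junction contains the SrfI site. The site is written as it reads across a self-ligated backbone junction (3’ end GCCC joined to 5’ end GGGC). The toy backbone is assembled from the 5’ and 3’ termini shared by the backbone sequences listed in this description; the target fragments and all function names are hypothetical.

```python
SRFI_SITE = "GCCCGGGC"  # SrfI site; the enzyme cuts in the middle and leaves blunt ends


def junction_has_site(upstream: str, downstream: str, site: str = SRFI_SITE) -> bool:
    """Return True if `site` spans the junction formed by ligating the 3' end
    of `upstream` to the 5' end of `downstream` (top strand, 5'->3')."""
    flank = len(site) - 1
    # Each flank is shorter than the site, so any match must cross the junction.
    window = upstream.upper()[-flank:] + downstream.upper()[:flank]
    return site in window


# Toy backbone: shared 5' terminus + shared 3' terminus of the listed backbones.
backbone = "GGGCATGCACAGATGTAGATGCTGCATGACATAGCCC"
target = "ACGTTGCAAGCTTACG"          # hypothetical blunt-ended target
unlucky_target = "GGGCAAATTTACGT"    # hypothetical target that happens to start with GGGC

print(junction_has_site(backbone, backbone))        # backbone self-ligation
print(junction_has_site(backbone, target))          # backbone 3' end joined to target
print(junction_has_site(target, backbone))          # target joined to backbone 5' end
print(junction_has_site(backbone, unlucky_target))  # rare target end that recreates the site
```

Only the self-ligated junction and the deliberately chosen target starting with GGGC recreate the site; with a blunt 8-cutter this requires four specific bases at the target end.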

A 5’ end of a backbone comprises a part of the first restriction enzyme recognition site at the extreme end. It can but does not need to contain additional nucleotides on the inside. The number of nucleotides of the end may vary. A 5’ end typically has between 2-15 nucleotides, preferably 2-10, preferably 2-8, more preferably 2, 3, 4, 5, 6, 7 or 8 nucleotides. In some embodiments the 5’ end is 3 or 4 nucleotides.

A 3’ end of a backbone comprises a part of the first restriction enzyme recognition site at the extreme end. It can but does not need to contain additional nucleotides on the inside. The number of nucleotides of the end may vary. A 3’ end typically has between 2-15 nucleotides, preferably 2-10, preferably 2-8, more preferably 2, 3, 4, 5, 6, 7 or 8 nucleotides. In some embodiments the 3’ end is 3 or 4 nucleotides.

The 5’ and 3’ ends of target DNA are preferably blunt ends. They can also be sticky ends that can be ligated together if self-ligation is not otherwise prevented. The 5’ and 3’ ends of target DNA are preferably provided in dephosphorylated form to prevent self-ligation. The 5’ and 3’ ends of target DNA can also be sticky ends that cannot be ligated together, such as adenine overhangs added by terminal transferase enzymes.

Ligation is preferably performed in the presence of a ligase and a restriction enzyme (first restriction enzyme) that cuts said first restriction enzyme site.

Ligation of the ends of a backbone to the ends of a target DNA creates a double stranded DNA circle. Self-ligation of backbones is often not prevented in methods of the invention. In the presence of a ligase, ligation of the two ends of the backbone to each other or to ends of other backbones can hamper the capture of target nucleic acid by the backbones. Ligation of backbone ends is counteracted by the presence of the first restriction enzyme. As such ligations typically (re)create the first restriction enzyme site, the backbone is linearized and/or deconcatemerized. The ligation reaction is performed using buffer conditions that support both efficient ligation and efficient cutting by the first restriction enzyme.

Methods of the invention are particularly suited to produce DNA circles with one backbone and one target nucleic acid. In embodiments of the invention, linear DNA, if any, is preferably removed prior to the rolling circle amplification. Performing a rolling circle amplification after removal of linear DNA typically produces more high molecular weight concatemers of backbone and target DNA.

Methods include subjecting DNA circles that are produced in the ligation reaction to rolling circle amplification (RCA). Rolling circle amplification produces an ordered array of copies of at least two of said DNA circles. Rolling circle amplification produces DNA molecules of high molecular weight, which are suited for sequencing, particularly for long read sequencing.

Rolling circle amplification has recently been reviewed by Mohsen and Kool (2016) Acc Chem Res. Vol 49(11): pp 2540-2550; Published online 2016 Oct 24. doi: 10.1021/acs.accounts.6b00417. The terms rolling circle amplification and rolling circle replication are sometimes used interchangeably in the art. In other instances rolling circle replication is used to refer to replication of naturally occurring plasmid and virus genomes. The terms refer to a similar underlying principle, i.e. the repeated copying of the same circular DNA producing a longer nucleic acid molecule with an ordered array of backbone-target nucleic acid copies. Present techniques for rolling circle amplification enable the production of large arrays containing many copies of the produced DNA circles. Concatemers can have 2 or more copies, preferably 4 or more copies of the produced circles.

Rolling circle amplification is performed by a polymerase and requires the usual priming sequence to generate the start. Particular polymerases with high processivity are available to produce concatemers of considerable length.

Polymerases with high processivity are polymerases that can polymerize a thousand nucleotides or more without dissociating from the DNA template. They can preferably polymerize two, three or four thousand nucleotides or more without dissociating from the DNA template. Polymerases with high processivity are discussed, among others, in Kelman et al; 1998: Structure Vol 6; pp 121-125. Rolling circle amplification can yield very high molecular weight concatemers using polymerases with high processivity and strand-displacement capacity such as phi29 polymerase. This polymerase can polymerize 10 kb or more. High processivity polymerases are therefore preferably polymerases that polymerize 10 kb or more without dissociating from the DNA template (Blanco et al; 1999. J. Biol. Chem. 264 (15): 8935-40). The polymerization can be started on a nick in the double strand DNA or the DNA can be melted and annealed in the presence of one or more suitable primers. Examples of suitable primers are random hexamer primers, one or more backbone specific primers, one or more target nucleic acid specific primers or a combination thereof. Random primers are typically preferred when target nucleic acid sequences are not known or when a variety of target nucleic acid sequences are to be sequenced. One or more specific primers can be used to sequence specific target nucleic acids of which the base sequence is known. A variant is one or more primers that are specific for the backbone. Such primers can be used in different situations, such as but not limited to high throughput systems with optimized backbones.

An advantage of having double-stranded circular DNA is that one of the strands can be used as a template for the rolling circle amplification, for example by using a strand specific primer to initiate the RCA reaction. Data analysis of Oxford Nanopore sequencing results allowed the base-calling and variant calling accuracy to be determined for each of the strands separately. In particular, we noticed that C and A bases are often difficult to distinguish due to the similar intensity of their raw current signal. However, the current signal coming from a T is substantially different from that of all the other bases and is easy to classify correctly. For example, if an A is expected to be mutated in the forward strand, sequencing of the reverse strand would lead to much cleaner results since the A in the forward strand could be miscalled as a G. Thus, specific enrichment of the reverse strand would be advantageous in such a scenario. Thus in a preferred embodiment the rolling circle initiation primer is a strand selective primer.
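The strand argument above can be made concrete with a small sketch: an A at a given position of the forward strand corresponds to a T on the reverse strand, which is the base reported above as easiest to classify. The fragment and the function names below are hypothetical and serve only as an illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (5'->3')."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def base_on_reverse_strand(forward: str, pos: int) -> tuple[int, str]:
    """Map a 0-based position on the forward strand to the matching
    position and base on the reverse strand (both read 5'->3')."""
    reverse = reverse_complement(forward)
    reverse_pos = len(forward) - 1 - pos
    return reverse_pos, reverse[reverse_pos]


forward = "GGCATAC"  # hypothetical fragment; position 3 is an A
print(forward[3], base_on_reverse_strand(forward, 3))
```

A strand selective primer that enriches the reverse strand would therefore present the position of interest to the sequencer as a T rather than an A.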

Further optimizations in obtaining strand-specific sequences may involve the (additional) use of real-time selective sequencing methods, such as those described in prior work (PMID: 27 54285) (Loose et al. 2016. Nature methods. Real-time selective sequencing using nanopore technology).

Backbones are preferably 20-1000 nucleotides long, preferably 20-800, preferably 50-800, more preferably 100-600 nucleotides, preferably 200-600 nucleotides. Target nucleic acid is preferably 40-15000 nucleotides long depending on the application.

DNA that circulates free or that is associated to cellular particles in the blood or other bodily fluid samples is typically smaller than 400 nucleotides. Target nucleic acid molecules of such lengths are particularly suited in methods of the invention. Other samples with relatively small nucleic acid molecules are some types of forensic samples, fossil samples, samples of nucleic acid isolated from environments that are inherently hostile to nucleic acid molecule integrity such as stool samples, surface water samples, and other samples rich in microbial organisms. For small target DNAs (smaller than 100 nucleotides) it is preferred to use the larger backbones as disclosed herein. Target nucleic acid can also be double-stranded circulating tumor DNA (ctDNA) or cell free DNA (cfDNA) present in liquid biopsies including but not limited to blood, saliva, pleural fluid or ascites fluid. Target nucleic acid can also be double-stranded or single-stranded cDNA derived from messenger RNA, microRNA, CRISPR RNA, non-coding RNA, viral RNA, or other sources of RNA. Target nucleic acid can also be double-stranded DNA derived from genomic DNA, PCR products, plasmid DNA, viral DNA, or other sources of double-stranded DNA. The means and methods of the present invention are particularly suited to capture small DNA. Preferably, target DNA of 400 base pairs or smaller is captured in a backbone of the invention. This captured DNA is also called an insert or target DNA. Target DNA is preferably 400 base pairs or less, more preferably 300 base pairs or less, more preferably 200 base pairs or less, more preferably 150 base pairs or less. The lower limit of the target DNA is preferably 20 base pairs, more preferably 30 base pairs, more preferably 40 base pairs and more preferably 50 base pairs. Any lower limit can be combined with any upper limit.

The size of DNA fragments is given in nucleotides here. This refers of course to the number in one strand. The size could also be given in base pairs for double stranded DNA. So a DNA that is 400 nucleotides is 400 base pairs long.

Produced concatemers can be sequenced with a variety of different methods. Of these the long read sequencing methods are preferred. Various long read sequencing methods are available to the skilled person. They all share the feature that reads of more than 200 nucleotides are produced in the sequencing reactions, typically more than 500 nucleotides and even several thousands of nucleotides long. Two presently available platforms for long read, real-time, single molecule sequencing are the Pacific Biosciences systems (the RSII and the Sequel) and the Oxford Nanopore systems (MK1 MinION, GridION and PromethION). These allow reads of 55 kb and longer (Goodwin et al 2016; doi: 10.1038/nrg.2016.49). Long-read systems are preferably single-molecule real-time sequencing systems. Single molecule systems do not rely on a clonal population of amplified DNA fragments to generate detectable signals. These systems fix the sequence determining protein at a specific location and allow the strand of nucleic acid to progress through the protein. The present Pacific Biosciences systems use a polymerase whereas the Oxford Nanopore systems presently use a membrane channel protein. In a preferred embodiment the sequencing method is a single molecule real-time (SMRT) sequencing method. Produced concatemers have an ordered array of copies of at least one of said DNA circles, preferably at least two, three, four or preferably at least 5 of said DNA circles.

In some embodiments of the invention backbones have identifiers. Such identifiers are also referred to as barcodes. The identifiers or barcodes are stretches of nucleic acid of which the sequence can vary between backbones. Barcoding can be used to group sequencing results of particular DNA circles. A barcode can identify a DNA circle. The barcode can be used to group sequencing results of fragments of the ordered array of concatemers produced by RCA of a DNA circle. The barcode as such can be used to identify particular DNA circles. Methods using backbones with barcodes typically have one or more collections of backbones wherein backbones in a collection have unique barcodes in otherwise similar or identical backbones. Two or more collections of backbones can be used, for instance to accommodate the different first restriction enzyme sites mentioned herein above, or to identify sequencing results of different samples. Barcodes between collections can be identical because sequence differences in other parts of the backbones identify the collections. Backbone collections may comprise more than one copy of a particular barcode containing backbone. The combination of a barcode with a particular overall target sequence can also positively identify a nucleic acid as being derived from a particular DNA circle, for instance, when the target nucleic acid is complex and/or the number of identical barcodes is low in a collection of backbones. Sequencing results of a group of sequences of a DNA circle can be used to filter out errors, such as amplification or polymerase errors. This is exemplified schematically in figures 1A and 1B. Backbones used in a method as disclosed herein preferably comprise at least two backbones with unique identifiers.
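The grouping and error filtering described above can be sketched as follows. This is a simplified illustration, not the analysis pipeline of the invention: it assumes that each concatemer has already been split into copies, that each copy has been reduced to a (barcode, insert) pair, and that the copies within a group are aligned and of equal length. The barcodes, inserts and function name are hypothetical.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(copies):
    """Group (barcode, insert) pairs by barcode and return a per-position
    majority-vote consensus of the inserts for each barcode."""
    groups = defaultdict(list)
    for barcode, insert in copies:
        groups[barcode].append(insert)

    consensus = {}
    for barcode, inserts in groups.items():
        columns = zip(*inserts)  # one tuple of bases per aligned position
        consensus[barcode] = "".join(
            Counter(column).most_common(1)[0][0] for column in columns
        )
    return consensus


copies = [
    ("ACGTAC", "TTGACCA"),
    ("ACGTAC", "TTGACCA"),
    ("ACGTAC", "TTGTCCA"),  # one copy carrying a sequencing error
    ("GGATCC", "CATGGAT"),
    ("GGATCC", "CATGGAT"),
]
print(consensus_by_barcode(copies))
```

An error present in a single copy is outvoted by the other copies of the same DNA circle, whereas a true variant would be present in every copy.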

The DNA circles are produced in the ligation step. Longer molecules are typically more efficiently circularized. Flexible molecules are more easily circularized than rigid molecules. Small target nucleic acid (20-200 nucleotides) can be captured more efficiently by larger backbones. For small target nucleic acids backbones preferably have 200 or more nucleotides, preferably 300 or more, more preferably 400 or more nucleotides, preferably between 450-650 nucleotides. The smaller backbones typically allow for more concatemers per DNA circle. The average length of the target nucleic acid and the length of the backbone(s) in a DNA circle is preferably 90-16,000 nucleotides, preferably 200-12,000 nucleotides, preferably 300-8,000 nucleotides, preferably 400-4,000 nucleotides, preferably 500-2,000 nucleotides. The average length of target nucleic acid plus backbone nucleic acid is preferably about 1,000 nucleotides.
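The trade-off between backbone length and the number of repeat units can be illustrated with simple arithmetic. The sketch below reads the statement about smaller backbones as "more complete copies of the circle fit into a concatemer of a given length", which is an interpretation; the 150-nucleotide target and the 10 kb concatemer length are assumed values, and the function names are illustrative.

```python
def circle_length(backbone_len: int, target_len: int) -> int:
    """Length of a DNA circle holding one backbone and one target."""
    return backbone_len + target_len


def full_copies(concatemer_len: int, backbone_len: int, target_len: int) -> int:
    """Number of complete circle copies that fit in a concatemer."""
    return concatemer_len // circle_length(backbone_len, target_len)


target_len = 150  # assumed target length in nucleotides
for backbone_len in (243, 557):  # lengths of two of the listed backbones
    unit = circle_length(backbone_len, target_len)
    copies = full_copies(10_000, backbone_len, target_len)
    print(f"backbone {backbone_len} nt: circle {unit} nt, {copies} copies per 10 kb")
```

With these assumed numbers the 243-nucleotide backbone gives 25 complete copies per 10 kb and the 557-nucleotide backbone gives 14.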

A backbone DNA molecule preferably comprises the sequence of:

>BB1 (199bp)

GGGCATGCACAGATGTACACGTACGATCATGTACGTCACGCGAGTGCA

CGTCGTCATAGCTGTCGAGTACTGTACTGACTGTCTCGAGCCTCAGCGAGTAT

TTAAATCTACGTAGAGTACGACTGCGCAGATGTGATCAGTGACTACGTGACAC

TGTACATCAGCACGATCGATGACTAGATGCTGCATGACATAGCCC;

>BB2 (259bp)

GGGCATGCACAGATGTACACGTACGATCATGTACGTCACGCGAGTGCA

CGTCGTCATAGCTGTCGAGTACTGTACTGACTGTCTCGAGCCTCAGCGAGTAT

TTAAATCTACGTCACCGGGTCTTCGAGAAGACCTGTTTAGAGTACGACTGCAA

ATGGCTCTAGAGGTACCCGTTACATAACTTACGCAGATGTGATCAGTGACTAC

GTGACACTGTACATCAGCACGATCGATGACTAGATGCTGCATGACATAGCCC;

>BB2_100 (341)

GGGCATGCACAGATGTACACGTACGATCATGTACGTCACGCGAGTGCA

CGTCGTCATAGCTGTCGAGTACTGTACTGACTGTCTCGAGCCTCAGCGAGTAT

TTAAATCTACGTCACCATATATATGGATATATATATGGATATATATATATATGG

ATATATGGATATATATATATATATATGGATATGTATGGATATATATATATATGG

ATATGGATGTTTAGAGTACGACTGCAAATGGCTCTAGAGGTACCCGTTACATA

ACTTACGCAGATGTGATCAGTGACTACGTGACACTGTACATCAGCACGATCGA

TGACTAGATGCTGCATGACATAGCCC; or

>BBpX2 (557bp)

GCG( ATCCACACATGTACACGAA( CCCAGCAACGCGCCCTTTTTACGG TTCX TGGCXITTTTGCTGGCX TTTTGCTCACATGTGAGGGCCTATTTCCCATGAT TCCTTCATATTTGCATATACGATACAAGGCTGTTAGAGAGATAATTGGAATTAA TTT G ACT GT AAAC AC AAAG AT ATT AGT AC AAAAT AC GT GAC GT AG AAAGT AAT AATTTCTTGGGTAGTTTGCAGTTTTAAAATTATGTTTTAAAATGGACTATCATA TGCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGA AAGGACGAAACACCGGGTCTTCGAGAAGACCTGTTTTAGAGCTAGAAATAGCA AGTTAAAATAAGGCTAGTCXX4TTATCAACTTGAAAAAGTGGCACCGAGTCGGT GCTTTTTTGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTTT TAGCGCGTGCGCCAATTCTGCAGACAAATGGCTCTAGAGGTACCCGTTACATA ACTTATAGATGCTGCATGACATAGCCC.

>BB100_1 (143 bp)

GGGCATGCACAGATGTACACGATTCCCAACACACCGTGCGGGCCATCGACCTA

TGCATACCGTACATATCATATATAAATCACATAATTTATTATACGTATGTCGCG

CGGGTGGCTGTGGGTAGATGCTGCATGACATAGCCC

>BB100_2 (143 bp)

GGGCATGCACAGATGTACACGCACTACATGCCAATGCCCAAGCAGTGCGCATA

TCACGTATCATATCTAATATATTATAATATTATGATAATGAGTATTTATTTAATT

TGTTTGTGTGAGGTAGATGCTGCATGACATAGCCC

>BB100_3 (143 bp)

GGGCATGCACAGATGTACACGCATTGGCCGTCTGTGCTGTCCATGGATCGTCT

GATTGATATGATATCATATATTATAATTATACAGTAAGGTGATTGGGTATTGAG

GGTTGTGTGGTTGGTAGATGCTGCATGACATAGCCC

>BB100_4 (145 bp)

GGGCATGCACAGATGTACACGGTAGACATGCGAAGCGTGCGATGACAATCGA

TGTGGACATCATGCATATATATGTTGTATAATTAAACAAATATGTGTAGTGTGT

GAGGTGGGTGTAGGAAGTAGATGCTGCATGACATAGCCC

>BB100_5 (143 bp)

GGGCATGCACAGATGTACACGTTGTCATGGGAATTTGTGGTTATGAAATGAGT

ATGCGACGAATATGTATACATATATATTAAATTATAGAGTGATGTATGAGTTTG

TGATGTGTGGTGTATAGATGCTGCATGACATAGCCC

>BB200_1 (243 bp)

GGGCATGCACAGATGTACACGGCGGCGCAAGATGATGTGCCGAACCTGACAT

GGCATCGACTGGTATGGATCAATACTGATGCGATATCGATACCGGATAAATCA

TATATGCATAATATCACATTATATTAATTATAATACATCGGCGTACATATACAC

GTACGCATCATTTCACTATCTATCGGTACTATACGTAGTGCCGGTCTGTTGGC

CGGGCGACATAGATGCTGCATGACATAGCCC

>BB200_2 (244 bp)

GGGCATGCACAGATGTACACGTGACGCAACGATGATGTTAGCTATTTGTTCAA

TGACAAATCTGGTATGATCAATACCGATGCGATATTGATATCTGATAACTCATA

TATGTAGAATATCACATTATATTTATTATAATACATCGTCGAACATATACACAA

TGCATCTTATCTATACGTATCGGGATAGCGTTGGCATAGCACTGGATGGCATG

ACCCTGATTAGATGCTGCATGAGATAGCCC

>BB200_3 (244 bp)

GGGCATGCACAGATGTACACGAGACCGCAAGATGATGTTCATTCTTGAACATG

AGATCGGATGGGTATGGATCAATACCGATGCGATATGATAACTGATAAATCAT

ATATCTATAATATCACATTATATTAATTATAATACAGGATCGTTACATGCATAC

ACAATGTATACTATACGTATTCGGTAGTTAGTGTACGGTCGGAATGGAGGTGG

TGGCGGTGATAGATGCTGCATGACATAGCCC

>BB200_4 (243 bp)

GGGCATGCACAGATGTACACGAATCCCGAAGATGTTGTCCATTCATTGAATAT

GAGATCTCATGGTATGATCAATATCGGATGCGATATTGATACTGATAAATCAT

ATATGCATAATCTCACATTATATTTATTATAATAAATCATCGTAGATATACACA

ATGTGAATTGTATACAATGGATAGTATAACTATCCAATTTCTTTGAGCATTGGC

CTTGGTGTAGATGCTGCATGACATAGCCC

>BB200_5 (243 bp)

GGGCATGCACAGATGTACACGAATCCGTGAGATGACTATCTTATTTGTGACAT

TCATCGATCTGGATATGATCAATACCATGCGATATTGATTACTGATAAATCATA

TATGTAGAATATCACATTATATTAATTATAATAAATCGTCGTACATATACATCC

ACAATTAGCTATGTATACTATCTATAGAGATGGTGCATCATCGTACTCCACCAT

TCCCACTAGATGCTGCATGACATAGCCC

>BB300_1 (348 bp)

GGGCATGCACAGATGTACACGCATAAGACCACAGGGTGCAAATCTGGATTGC

GGCATGGATGATTCATCATCGTGGCATATTCGCTATGGATATATCCATCATAAT

ACATTGATACGTCATGCGTATAATCGCATTATATGTCGATATTGGTCATAGGG

ATACATCCGTGTATACTATCGTATATGCGTGCAATGTAGCCATGTTAATCATGC

TATAACCATAACATAAATATAATATATACAGATGGTGTATCTCTACTTATGTAT

GCTTGTATAGTAATGTCGATACTGATGGGTCTCCGGCCCACTACACCACCTGG

CCGCTCTAGATGCTGCATGACATAGCCC

>BB300_2 (343 bp)

GGGCATGCACAGATGTACACGGGCAATCCGCCAGGGTTCAAATATGGATATGT

GATGATCGATTCAACATGCACATATGCACGATATCATATATTACTCCAGATGTC

ATCATCGTCGTGCGTATATGAGATATGTATTTATGCATATAATCCACCATACAT

GGTAGCGATATTATAGTGCGATTATGTGTATATGACTATCATGGCTATTGTTAA

TATATAAATCATAACCATACCACTTCCACGCCTGGTATGGCGTATAGTATAGA

GATATTGTGTGATGCCCTATGTCGACCATGATGTGCCGTTGTACTGCCAATCC

TAGATGCTGCATGACATAGCCC

>BB300_3 (344 bp)

GGGCATGCACAGATGTACACGTATCCATGCAGCTTATTGTAACTAGCGCATGC

ACGTGGTGATTCATCACATCTATATATACGATATGATATATTACACATATTTGC

ATAGTATCATCCGGTGTGATATCATCCGATATGCTCATACTTATTCATTGGTAG

CATTGCATTGATGGATCAATAGTTATTATGACATCATGGCATGTACAATTATAA

ATAATACAACATACATAAATATACTATACACATCGTGTATGTGTTATACAGATC

TGTGTGATGTATGATAATGTAATGGCGTCGAACACCACAAGGCAGTCCTATAA

TAGATGCTGCATGACATAGCCC

>BB300_4 (344 bp)

GGGCATGCACAGATGTACACGGTCCATTACAATCGAATCTATATCCCAATGTG

TATCGATTATCACCACAATGACATAATACGATATCATATATTACTCCATATGCC

TTACGTCAGATCGTTATATGAGATATGTATTCATGCATATGATATCCCACAGTA

CACGTCGTCTAATGCCATCATGAATGTATGACATATCTAGTCGATTATACATAA

TATAACATACCAATATAACAATATCTATACACATTTGATGGCGTATAGTATAAA

GATATTGTGGCAATGCCCATACACCACTGACTGTCGCCGATCATTCCTACCAC

TAGATGCTGCATGACATAGCCC

>BB300_5 (344 bp)

GGGCATGCACAGATGTACACGACCGACCGTGAAAGTGATTCAGAATGATGTGC

ATGAATGTTATCATGACATGATTTATGATGCACTGATATATGCATATTATAATA

TTGTACAATGTCGTATATACGACATATCTATACTATGAATTATGGCATCATGGA

CAATAGATGGTAAGGTATAGTACGATCTATATAGCATGTTGAAATGGGATATA

AATTATCATAAACATACATACTTAACTAATATCAAGATGATATGTGTATGACAT

CAGAATGATAGTAGTAATGAGTATTGTCAGATGTATGTACGAATATCACACGA

TTAGATGCTGCATGACATAGCCC

A backbone DNA molecule most preferably comprises the sequence of:

>BB200_4 (243 bp)

GGGCATGCACAGATGTACACGAATCCCGAAGATGTTGTCCATTCATTGAATAT

GAGATCTCATGGTATGATCAATATCGGATGCGATATTGATACTGATAAATCAT

ATATGCATAATCTCACATTATATTTATTATAATAAATCATCGTAGATATACACA

ATGTGAATTGTATACAATGGATAGTATAACTATCCAATTTCTTTGAGCATTGGC

CTTGGTGTAGATGCTGCATGACATAGCCC

Flexibility of backbones of a fixed length can be modulated by tailoring the sequence of the backbone. Different DNA molecules have different flexibilities depending on the particular sequence of the molecules. Different sequences can be provided by choosing different first restriction enzyme sites, different barcode sequences and different sequences for other elements in the backbone. The flexibility is preferably adjusted by tailoring the sequence of a dedicated part of the backbone sequence. Such a dedicated part is further referred to as“the linker”. The linker preferably comprises 20-900 nucleotides, preferably 25-900 nucleotides, preferably 30-900, preferably 30-800, preferably 50-700; preferably 100-600, preferably 150-500 nucleotides. A linker can be one consecutive sequence or divided into two, three, four or more consecutive sequences in the backbone. A linker is preferably one consecutive sequence or divided into two, three, four consecutive sequences, preferably one, two or three, preferably one or two, and more preferably one consecutive sequence in the backbone.

The free energy values of each base-pair (Breslauer et al. 1986) and the deviation of the twist angle (degrees) (Sarai et al. 1989) can be used to compute the flexibility of any given DNA sequence. An example of such a calculation is:

Flexibility calculation

A Python implementation of the TwistFlex algorithm (http://margalit.huji.ac.il/TwistFlex/) (Menconi et al. 2015) can be used to compute DNA flexibility at the twist angle of the input sequence. The flexibility of each individual dinucleotide is calculated based on the following table of angular degrees:

Subsequently, the mean flexibility of the entire sequence is considered for the selection in the evolutionary algorithm for backbone optimization. The mean flexibility of a DNA sequence is calculated as the sum of all dinucleotide angular degrees divided by the total number of dinucleotides. The flexibility score for suitable backbones is 10 or more, preferably 11 or more, preferably 12 or more, preferably 12.5 or more (dinucleotide angular degrees/dinucleotides) in the backbone. Flexibilities of more than 14 are usually not required.

Entropy calculation for determining sequence complexity

The Shannon entropy of a string is defined as the minimum average number of bits per symbol required for encoding the string. The formula to compute the Shannon entropy is:

$$H = -\sum_{i} p_i \log_2 p_i$$

where p_i is the probability of character number i appearing in the sequence. The calculation can also be performed through:

http://www.shannonentropy.netmark.pl/

The above formula was implemented with the following python code:

    import math

    def quick_entropy(sequence):
        alphabet = set(sequence)  # set of distinct symbols in the sequence

        # Frequency of each symbol in the sequence
        frequencies = []
        for symbol in alphabet:
            frequencies.append(sequence.count(symbol) / len(sequence))

        # Shannon entropy as in https://en.wiktionary.org/wiki/Shannon_entropy
        ent = 0.0
        for freq in frequencies:
            ent -= freq * math.log(freq, 2)

        return ent

Preferred backbones have a Shannon entropy value of 1.5 Sh or higher. Preferably 1.5 or higher, preferably 2.5 or higher, more preferably 3.5 or higher.

Self-complementarity

Backbone core sequences preferably do not have 8 or more consecutive bases self-complementary in the same strand. The exception is the intentional insertion of one or more restriction enzyme sites or one or more other functional sequences. Such sequences can occasionally introduce self-complementary bases in the same strand. If possible more than 8 of such bases are avoided, but they can be tolerated in functional backbones. Nevertheless, in designing new backbones such sequences are preferably avoided if possible. The same is true for the kmers discussed herein below.
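One way to screen a candidate sequence for such stretches is sketched below. It reports every stretch of k bases whose reverse complement also occurs in the same strand, with k = 8 taken from the preference stated above. The function names and the example sequence are hypothetical, and this is only one possible reading of the criterion.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def self_complementary_stretches(seq: str, k: int = 8):
    """Return sorted (i, j, kmer) tuples where the k-mer starting at i has its
    reverse complement starting at j >= i in the same strand."""
    seq = seq.upper()
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)

    hits = []
    for kmer, starts in positions.items():
        for j in positions.get(reverse_complement(kmer), []):
            hits.extend((i, j, kmer) for i in starts if j >= i)
    return sorted(hits)


# Hypothetical sequence containing GATTACAG and its reverse complement CTGTAATC.
candidate = "TTGATTACAGTTTTCTGTAATCAA"
for hit in self_complementary_stretches(candidate):
    print(hit)
```

An empty result means that no stretch of 8 or more consecutive self-complementary bases was found; palindromic sites such as intentionally inserted restriction sites would be reported with i equal to j if they are at least k bases long.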

Absence of repeated motifs (kmers)

Backbone core sequences preferably do not have motifs of 6 bases repeated more than twice in the sequence. The flexibility of the backbone and the Shannon index of a backbone can be modulated by including a linker in the backbone. The influence of a particular sequence of the linker on the flexibility and complexity scores of the backbone can easily be calculated.
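As a minimal sketch of such a calculation, the mean flexibility defined above (sum of the dinucleotide values divided by the number of dinucleotides) can be computed with and without a linker. The dinucleotide values in `toy_table` are placeholders invented for this illustration; they are not the angular degrees used in the invention and must be replaced by a real table before any score is interpreted. The sequences and names are hypothetical as well.

```python
def mean_flexibility(seq: str, table: dict) -> float:
    """Sum of the dinucleotide values divided by the number of dinucleotides."""
    seq = seq.upper()
    steps = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not steps:
        raise ValueError("sequence must contain at least two bases")
    return sum(table[step] for step in steps) / len(steps)


# PLACEHOLDER values for illustration only, not taken from the description.
toy_table = {a + b: 10.0 for a in "ACGT" for b in "ACGT"}
toy_table["AT"] = 20.0
toy_table["TA"] = 15.0

core = "GGGCGCGCCC"    # hypothetical backbone without linker
linker = "ATATATAT"    # hypothetical linker
with_linker = core[:5] + linker + core[5:]

print(mean_flexibility(core, toy_table))
print(mean_flexibility(with_linker, toy_table))
```

Comparing the two scores shows how inserting a given linker shifts the mean flexibility of the whole backbone; the same comparison can be made for the Shannon entropy.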

A linker preferably has one or more of the following features: (i) the overall complexity of the linker sequence is preferably high. The above mentioned Shannon entropy formula is a method to determine a value for the complexity of a given sequence; (ii) duplications of DNA motifs longer than 5 bases are preferably not present more than twice in the linker sequence, preferably no more than once. In a preferred embodiment the linker does not comprise a duplication of a DNA motif longer than 5 bases (i.e. the linker sequence does not contain a repeated motif where the motif is more than 6 consecutive bases); (iii) a linker preferably does not comprise more than two, preferably not more than one and preferably no self-complementary sequence of more than 6 nucleotides (an inverted repeat) separated by less than 10 nucleotides. The mentioned criteria aid in avoiding, in general, the presence of a complex secondary structure in a single stranded version of the linker. The likelihood and the strength of the secondary structure can also be calculated by other means.
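Criterion (ii) can be checked with a simple k-mer count. The sketch below uses motifs of 6 bases and reports those that occur more than twice, counting overlapping occurrences; both choices are assumptions about how the criterion is applied, and the example linker is hypothetical.

```python
from collections import Counter


def repeated_motifs(seq: str, k: int = 6, max_occurrences: int = 2) -> dict:
    """Return {motif: count} for every k-base motif that occurs more than
    `max_occurrences` times in `seq` (overlapping occurrences are counted)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {motif: n for motif, n in counts.items() if n > max_occurrences}


# Hypothetical linker in which ACGTGA occurs three times.
linker = "ACGTGATTACGTGACCACGTGA"
print(repeated_motifs(linker))
print(repeated_motifs(linker, max_occurrences=3))
```

Setting `max_occurrences=1` applies the stricter preference that no motif is duplicated at all.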

A backbone preferably comprises a GC content of 30-60%; preferably 40 - 60%; preferably 40 - 50%, preferably 45 - 55%.
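The GC content of a candidate backbone can be computed directly from its sequence. The sketch below uses the broadest range given above (30-60%) as the default; the function names and example sequences are illustrative.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_in_range(seq: str, low: float = 30.0, high: float = 60.0) -> bool:
    """True if the GC content lies within [low, high] percent."""
    return low <= gc_content(seq) <= high


print(gc_content("GGCCAATT"), gc_in_range("GGCCAATT"))
print(gc_content("AAAAAAAT"), gc_in_range("AAAAAAAT"))
```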

A backbone preferably has one or more of the following features. Said first restriction enzyme site is preferably for a restriction enzyme that produces blunt ends. It has been observed that this improves the capture of target nucleic acid.

The backbone preferably comprises a recognition site for a DNA nicking enzyme that is used to generate the priming site for rolling circle amplification. Additional restriction enzyme recognition sites can be used to perform sequential ligation of multiple short DNA molecules into one circular DNA. The backbone preferably comprises a molecular identifier that enables the discrimination of original captured nucleic acids and their subsequent sequencing reads.

A method of the invention can be used for the ordered capture of two or more target nucleic acids per backbone. A single capture step can, on occasion, capture two target nucleic acid molecules at the same time. The chance of this happening is intentionally low because of the measures that are taken to prevent self-ligation. The ordered capture of two or more target nucleic acids can be a desired feature. Additional restriction enzyme sites can be incorporated into the backbone. Once a first target nucleic acid is captured, the method can be repeated by adding a restriction enzyme that cuts the additional restriction enzyme site. The DNA circle is cut and linearized by the second restriction enzyme and ready to be ligated to the target nucleic acid. If the second restriction enzyme produces the same type of ends as said first restriction enzyme (for example blunt ends), the reaction can be continued to capture target nucleic acid not captured in the first iteration of the method. Alternatively, the DNA circles can be purified (for instance by removing linear DNA) and new target nucleic acid with ends that are ligation compatible to the ends of the backbone produced by the second restriction enzyme can be added. This step can, of course, be repeated for the ordered capture of a third, a fourth and so forth target nucleic acid by adding further restriction enzyme sites to the backbone. When more than one target nucleic acid is to be captured it is preferred that the second and first restriction enzyme sites are sites for enzymes that cut infrequently. Such enzymes are preferably 8-cutters or more. The enzymes are preferably blunt end producing enzymes. This ordered capture allows the simultaneous sequencing of more than one target nucleic acid. The different target nucleic acids can be identified on the basis of their location in the backbone, i.e. on the basis of the flanking backbone sequences into which they are inserted.

Additionally, the backbones serve as a control sequence during data analysis. Based on the backbone sequence reads, the error-rate of each sequencing read can be inferred, enabling accurate estimation of the likelihood of genetic variations within captured nucleic acid sequences.

Side products can be produced in a method of the invention. The amount of single backbone, single target DNA containing DNA circles is influenced for instance by the backbone/sample molar ratio, which should promote the formation of molecules with backbone and insert, rather than unwanted side products (Figure 2), such as (i) linear DNA formed by random concatemerization of backbone and sample DNA, (ii) circular DNAs containing only backbone or only sample DNA, (iii) circular DNA containing an excess of backbones or sample DNAs.

In embodiments of the invention the molar ratio of backbone molecules to target nucleic acid molecules preferably ranges from 1:10 to 10:1. Preferably a ratio range of 1:5 to 5:1 is maintained, preferably a ratio of 1:2 to 2:1 is maintained. An average ratio of 1:1 is preferred.
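Because backbone and target usually differ in length, a molar ratio does not correspond to the same mass ratio. The sketch below converts between mass and molar amount using an average mass of 650 g/mol per base pair of double-stranded DNA, a commonly used approximation that is an assumption here and not a value from the description. The input amounts and function names are hypothetical.

```python
AVG_BP_MASS = 650.0  # g/mol per base pair of dsDNA (approximation, assumed)


def pmol_dsdna(mass_ng: float, length_bp: int) -> float:
    """Molar amount (pmol) of a dsDNA of given mass (ng) and length (bp)."""
    return mass_ng * 1000.0 / (length_bp * AVG_BP_MASS)


def backbone_mass_for_ratio(target_ng: float, target_bp: int,
                            backbone_bp: int, ratio: float = 1.0) -> float:
    """Mass (ng) of backbone that gives `ratio` backbone molecules per
    target molecule."""
    backbone_pmol = ratio * pmol_dsdna(target_ng, target_bp)
    return backbone_pmol * backbone_bp * AVG_BP_MASS / 1000.0


# Hypothetical input: 10 ng of 150 bp target, 243 bp backbone, 1:1 molar ratio.
print(round(pmol_dsdna(10, 150), 4), "pmol target")
print(round(backbone_mass_for_ratio(10, 150, 243), 2), "ng backbone")
```

For equal molar amounts the required backbone mass scales with the length ratio, here 243/150 of the target mass.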

The methods as described herein including the rolling circle amplification are preferably performed without switching containers. Produced concatemers can be sequenced in the same container or a different container.

A method of the invention preferably produces concatemers as long (>10 kb) linear dsDNA formed by multiple units consisting of target nucleic acid-backbone copies. The concatemerization/multimerization of such a unit is advantageous to discriminate the detection of a real genetic variation from a sequencing error. In fact, in the case of rare genetic variations that occur at less than 1% frequency within a pool of DNA molecules, direct sequencing, e.g. short-read sequencing, cannot be applied anymore, because the sequencing error rate is higher than the mutation frequency. Using a method as described herein, the same rare sequence (genetic variation) is represented multiple times in long concatemers, which provides high confidence about mutation presence, even if the mutation frequency is low in the original pool of nucleic acid molecules.
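The benefit of reading the same position several times can be quantified with a simple binomial model. The sketch below assumes that errors in the individual copies are independent and occur with the same probability, and it counts any erroneous copy as a vote against the true base, so the result is an upper bound on the chance of a wrong majority call. The 10% per-copy error rate is an assumed value chosen for illustration, and the function name is hypothetical.

```python
from math import comb


def consensus_error_probability(per_copy_error: float, copies: int) -> float:
    """Probability that a strict majority of `copies` is erroneous at one
    position, assuming independent errors with equal probability."""
    threshold = copies // 2 + 1  # strict majority; ties are not counted
    return sum(
        comb(copies, k)
        * per_copy_error ** k
        * (1.0 - per_copy_error) ** (copies - k)
        for k in range(threshold, copies + 1)
    )


for copies in (1, 3, 5, 9):
    print(copies, consensus_error_probability(0.10, copies))
```

Under these assumptions five copies reduce a 10% per-copy error to below 1% at the consensus level, which is the regime needed to call variants present at less than 1% frequency.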

A backbone comprises a 3’ sequence coding for one part of a first restriction enzyme recognition site (restriction enzyme site) and a 5’ sequence coding for the other part of the first restriction site. A backbone may contain further elements, such as one or more of (i) one or more sites that allow nicking of the double-stranded backbone sequence; (ii) one or more Type I or Type II restriction sites; (iii) a secondary cloning site; (iv) a flexible DNA stretch (linker) that enables efficient circularization (bending) of the backbone molecule; and (v) a unique molecular barcode sequence to tag each individual backbone molecule.

A backbone preferably has 5’-phosphorylation at both ends of the backbone molecule.

The invention also provides a collection of linear DNA molecules (backbones) of a length of 20 - 1000 nucleotides that comprise 5’ ends that comprise a part of a first restriction enzyme recognition site at the extreme end and 3’ ends that comprise the other part of a first restriction enzyme recognition site at the extreme end, and which 5’ and 3’ ends are ligation compatible with each other and form a restriction enzyme recognition (first restriction enzyme) site when self-ligated and wherein each of said backbones comprises:

a linker;

an identifier sequence that differs from the sequence of identifiers of other backbones in the collection (barcode); and

optionally a restriction site for a nicking enzyme.

The backbones are backbones that are preferred in a method as described herein. Said first and said second part of said first restriction site together form a complete recognition site for said first restriction enzyme and are in positions on the molecule that allow operable linkage of the two parts to form said first restriction site. Operable linkage in this context refers to availability for cutting by said first restriction enzyme. The backbones preferably further comprise a second restriction site which is a type I or type II restriction enzyme site. The backbones preferably further comprise a restriction enzyme site for a type II restriction enzyme that can create non-palindromic overhangs (Golden-Gate cloning site). The linker is preferably a linker as described herein above. The backbones preferably comprise a nucleic acid molecule (captured nucleic acid molecule) in said first restriction site. The backbones preferably comprise a library of captured nucleic acid molecules.
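The way the two end parts form a complete site only upon self-ligation can be sketched as follows. The example uses the SrfI recognition sequence GCCCGGGC, split in the middle, as the first restriction site; the backbone and insert sequences are hypothetical and serve only to show the junctions.

```python
SITE = "GCCCGGGC"  # SrfI recognition sequence, cut in the middle (blunt)


def junction(left_molecule: str, right_molecule: str, k: int = len(SITE)) -> str:
    """Sequence spanning a ligation junction: end of left joined to start of right."""
    return left_molecule[-k:] + right_molecule[:k]


# Hypothetical backbone: 5' end carries one half-site, 3' end the other half
backbone = "GGGC" + "ATTCAGGTCAACTGATCC" + "GCCC"
insert = "TTGACCATGGTACGTAAC"  # hypothetical target fragment

self_ligated = junction(backbone, backbone)   # backbone closed on itself
backbone_insert = junction(backbone, insert)  # backbone 3' end joined to insert
insert_backbone = junction(insert, backbone)  # insert joined to backbone 5' end

print(SITE in self_ligated)     # True: site restored, circle can be re-cut
print(SITE in backbone_insert)  # False: junction is not recognized
print(SITE in insert_backbone)  # False: junction is not recognized
```

Since the recognition sequence is palindromic, checking one strand is sufficient. A self-ligated backbone is cut open again by the first restriction enzyme and can re-enter the reaction, while a backbone that has captured an insert is no longer a substrate.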

Further provided is a kit comprising a backbone as described herein.

The kit preferably comprises a collection of backbone molecules as described herein. The kit preferably further comprises a polymerase with high processivity and optionally one or more polymerization primers. The kit preferably further comprises a ligase and said first restriction enzyme; and/or said target site specific recombination enzyme. The kit preferably further comprises a DNA exonuclease. The latter enzyme is suitable for removing linear DNA prior to producing concatemers of said DNA circles.

In one aspect the invention provides a method for determining the sequence of a collection of nucleic acid molecules the method comprising

- providing double stranded target DNA molecules that have 5’ and 3’ ends with a protruding adenine residue at the 3’-end of both strands of the DNA molecules;

- providing a collection of double stranded backbone DNA molecules that comprise 5’ and 3’ ends that are ligation compatible with the 5’ and 3’ ends of the target DNA;

said method further comprising

- ligating said target DNA to said backbones in the presence of a ligase, thereby producing DNA circles comprising a backbone and a target DNA molecule;

- optionally removing linear DNA;

- producing concatemers comprising an ordered array of copies of at least two of said DNA circles through rolling circle amplification; and

- sequencing said concatemers.

Ends that are ligation compatible with a protruding 3’ adenine are ends that have a 5’ protruding thymidine base or analogue thereof. The method is different from the methods described herein above in that inter- or intra-target molecule ligation is inherently inhibited as all ends have a 3’-protruding adenine base. Self-ligation of target nucleic acid or ligation of one end to another target nucleic acid molecule is thus inherently not possible. A protruding base or bases are nucleotides that are at the end of a nucleic acid molecule and that are not base paired with a base on an opposing strand. There is no opposing base for the protruding base.

Such protrusions are also referred to as sticky ends, or cohesive ends. The same is true for the backbones. They are inherently prevented from self-ligation. In this embodiment the backbones do not have to have parts of a first restriction enzyme site at the extreme end. The ends thus do not ligate to create a first restriction enzyme site. Thus, the ligation does not have to be performed in the presence of said first restriction enzyme. The remainder of the steps and the definitions can be the same as described elsewhere herein.

Further provided is a method for determining the sequence of a collection of nucleic acid molecules comprising

providing double stranded target DNA molecules that have a recombinase recognition site specific for a target site specific recombinase at the 5’ and the 3’ ends;

providing a backbone comprising said recognition sites separated by DNA comprising a linker; incubating said target DNA molecules with said backbones in the presence of said target site specific recombinase, preferably a Cre recombinase, a FLP recombinase or a bacteriophage lambda integrase, thereby producing DNA circles comprising a backbone and a target DNA molecule;

optionally removing linear DNA;

producing concatemers comprising an ordered array of copies of at least two of said DNA circles through rolling circle amplification; and

sequencing said concatemers.

In a preferred embodiment the backbone is a circle comprising two recombinase recognition sites separated on one side by DNA comprising a linker and separated on the other side by DNA coding for a further restriction enzyme recognition site, and wherein said further restriction site is the only recognition site for said restriction enzyme in said backbone. In this embodiment the method preferably further comprises digesting said DNA after said recombination with said restriction enzyme and subsequently removing linear DNA, prior to producing said concatemers. Said further restriction site is preferably a 6 or more cutter, preferably a 7 or more cutter, preferably an 8 cutter. The ends produced by the digestion do not have to be blunt ends. In a preferred embodiment the further restriction enzyme is not a blunt end cutter.
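A plausible reason for preferring a longer recognition site is that it is less likely to occur by chance in the captured target DNA. The estimate below is a simple sketch that assumes random sequence with equal base frequencies and a fully specified recognition sequence; real genomes deviate from this, so the numbers are indicative only.

```python
def expected_spacing_bp(site_length: int) -> int:
    """Expected distance (bp) between occurrences of a fully specified site
    in random sequence with equal base frequencies."""
    return 4 ** site_length


for n in (6, 7, 8):
    print(f"{n} cutter: about one site per {expected_spacing_bp(n):,} bp")
```

Under these assumptions an 8 cutter is expected about once per 65,536 bp, compared with once per 4,096 bp for a 6 cutter.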

Target site specific recombinases

A target site specific recombinase is a genetic recombination enzyme. Target site specific DNA recombinases are widely used in multicellular organisms to manipulate the structure of genomes, and to control gene expression. These enzymes, derived from bacteria and fungi, catalyze directionally sensitive DNA exchange reactions between short (30-40 nucleotides) target site sequences that are specific to each recombinase. These reactions enable four basic functional modules: excision/insertion, inversion, translocation and cassette exchange. Non-limiting examples of recombinases are Cre recombinase; Hin recombinase; Tre recombinase and FLP recombinase. Cre recombinase was one of the first widely used recombinases. It is a tyrosine recombinase enzyme derived from the P1 bacteriophage. The enzyme uses a topoisomerase I-like mechanism to carry out site specific recombination events. The enzyme (38 kDa) is a member of the integrase family of site specific recombinases and it is known to catalyze the site specific recombination event between two DNA recognition sites (loxP sites). This 34 base pair (bp) loxP recognition site consists of two 13 bp palindromic sequences which flank an 8 bp spacer region. The products of Cre-mediated recombination at loxP sites are dependent upon the location and relative orientation of the loxP sites. Two separate DNA species both containing loxP sites can undergo fusion as the result of Cre-mediated recombination. DNA sequences found between two loxP sites are said to be "floxed".
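The 13 + 8 + 13 bp layout of the loxP site can be checked with a few lines of code. The sketch below uses the commonly cited wild-type loxP sequence, which is not given in this description and is included here as an assumption; the helper names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def split_lox_site(site: str):
    """Split a 34 bp site into left arm (13 bp), spacer (8 bp), right arm (13 bp)."""
    if len(site) != 34:
        raise ValueError("expected a 34 bp site")
    return site[:13], site[13:21], site[21:]


# Commonly cited wild-type loxP sequence (assumption, not from this description)
LOXP = "ATAACTTCGTATA" "GCATACAT" "TATACGAAGTTAT"

left, spacer, right = split_lox_site(LOXP)
print(right == reverse_complement(left))     # True: the arms are inverted repeats
print(spacer == reverse_complement(spacer))  # False: the spacer is asymmetric
```

The asymmetric spacer gives the site an orientation, which is why the outcome of recombination depends on the relative orientation of the two sites.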

Red/ET recombination

Recombineering exploits the phage derived protein pairs, either RecE/RecT from the Rac phage or Redα/Redβ from the λ phage, to assist in the cloning or subcloning of fragments of DNA into vectors without the need of restriction enzyme sites or ligases. The RecE/RecT, Redα/Redβ and other similar protein pairs are herein further referred to as Red/ET protein pairs. A limitation of the original homologous recombination technique was that bacterial RecBCD nuclease degrades linear DNA, so that initially the event had to be studied in RecBCD-deficient strains (7). This was overcome by the discovery that Redα and Redβ were assisted by Redγ, which inhibits RecBCD nuclease activity, making it possible to use the technique in E. coli and other commonly used bacterial strains. In addition, the recombination efficiency was increased 10-100 times. The combination of these three enzymes (α, β and γ, or E, T and γ) in one vector was named Red/ET recombination. The basic principles of the method are that it requires two homology regions of >15, preferably >20, preferably >30 and preferably >42 bp in a linear fragment, double strand breaks (DSBs) at both ends, and another linear or circular plasmid in order for recombination to take place. Directional insertion is possible using two different homology regions to flank the target DNA and the insertion site. DSBs are essential so that RecE or Redα can bind and degrade one chain of the DNA (5' to 3') and at the same time load RecT or Redβ onto the single strand chain that is exposed. The single DNA strand loaded with the RecT or Redβ recombinase finds a perfect match sequence and joins the two sequences by either chain invasion or annealing.

Insertion of homology regions (HRs) is typically achieved by including them in the oligonucleotides that are used for amplification of the products used as linear substrates for the recombination event. If longer fragments of DNA are needed for the procedures then the HRs may be inserted with conventional restriction/ligation techniques using plasmids or adaptors.

Restriction enzyme recognition site

Restriction enzyme recognition sites are often also simply referred to as restriction enzyme sites, restriction sites or restriction recognition sites. They are locations on a DNA molecule containing specific sequences of nucleotides, which are recognized by restriction enzymes. These are generally palindromic sequences. A particular restriction enzyme may cut the sequence between two nucleotides within its recognition site, or somewhere nearby. The enzymes typically cut both strands of the DNA molecule, which is typically followed by separation of the ends. So-called nicking enzymes also recognize restriction sites but cut only one of the two strands. The resulting DNA molecule remains associated but one of the two strands has a nick.

Restriction enzyme types

Naturally occurring restriction endonucleases (restriction enzymes) are categorized into four groups (Types I, II, III, and IV) based on their composition and enzyme cofactor requirements, the nature of their target sequence, and the position of their DNA cleavage site relative to the target sequence. DNA sequence analysis of restriction enzymes however shows great variation, indicating that there are more than four types. All types of enzymes recognize specific short DNA sequences and carry out the endonucleolytic cleavage of DNA to give specific fragments with terminal 5'-phosphates.

Type I enzymes (EC 3.1.21.3) cleave at sites remote from a recognition site and require both ATP and S-adenosyl-L-methionine to function. They are multifunctional in that they have both restriction and methylase (EC 2.1.1.72) activities.

Type II enzymes (EC 3.1.21.4) cleave within or at short specific distances from a recognition site. Most type II enzymes require magnesium. They typically have a single function (restriction).

DNA Phosphorylation

Single- or double-stranded DNA with a 5'-hydroxyl terminus has to be provided with a 5’ phosphate group for efficient ligation. 5’ ends without such phosphate groups can be phosphorylated prior to ligation. A number of polynucleotide kinases, including T4 PNK (NEB #M0201) and T4 PNK (3’ phosphatase minus) (NEB #M0236), can be used to transfer the γ-phosphate of ATP to a 5' terminus of DNA.

DNA Dephosphorylation

Digested DNA typically possesses a 5' phosphate group that is required for ligation. In order to prevent self-ligation, the 5' phosphate can be removed prior to ligation. Dephosphorylation of the 5' end prohibits self-ligation, enabling the artisan to manipulate the DNA as desired before re-ligating. Dephosphorylation can be accomplished using any of a number of phosphatases, including the Quick Dephosphorylation Kit (NEB #M0508), Shrimp Alkaline Phosphatase (rSAP) (NEB #M0371), Calf Intestinal Alkaline Phosphatase (CIP) (NEB #M0290) and Antarctic Phosphatase (NEB #M0289).

DNA Ligation

Ligation of DNA is a central step in many modern molecular biology workflows. DNA ligases catalyze the formation of a phosphodiester bond between the 3' hydroxyl and 5' phosphate of adjacent DNA residues. In the lab, this reaction is used to join dsDNA fragments with blunt or cohesive ends to form recombinant DNA plasmids, to add bar-coded adapters to fragmented DNA during next-generation sequencing and many other applications. The DNA ligase from bacteriophage T4 is the ligase most commonly used. It can ligate cohesive or "sticky" ends of DNA, oligonucleotides, as well as RNA and RNA-DNA hybrids. It can also ligate blunt-ended DNA with great efficiency. Single stranded DNA can be ligated efficiently with CircLigase™ II ssDNA Ligase (Epicentre). This is a thermostable enzyme that catalyzes intramolecular ligation (i.e. circularization) of ssDNA templates having a 5'-phosphate and a 3'-hydroxyl group. CircLigase II ssDNA Ligase ligates ends of ssDNA in the absence of a complementary sequence. The enzyme is therefore useful for making circular ssDNA molecules from linear ssDNA. Circular ssDNA molecules can be used as substrates for rolling-circle replication or rolling-circle transcription.

For the purpose of clarity and a concise description it is here mentioned that where a step is performed on or with one or more substrate(s) and which step is catalyzed by one or more enzymes, this step is performed by contacting the substrate(s) with the enzyme(s). This is typically done by adding the enzyme(s) to the substrate(s) in an appropriate buffer.

For the purpose of clarity and a concise description features are described herein as part of the same or separate embodiments, however, it will be appreciated that the scope of the invention may include embodiments having combinations of all or some of the features described.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1. A) Schematic representation of a method of capturing small nucleic acid molecules and producing concatemers by using a backbone and rolling circle amplification. B) Schematic representation of a sequencing reaction using short reads and without a backbone and long read sequencing using backbones.

Figure 2. Examples of possible linear and circular byproducts of the circularization reaction indicated in Figure 1. The shading of the big circle in the top left of the circular byproduct figure indicates backbone sequences. The other shadings are all target sequences.

Figure 3. Schematic example of a double stranded backbone sequence.

(1) Indicates a 5’ end and 3’ end sequence that together code for a first restriction enzyme recognition site.

(2) Is a restriction site for the nicking enzyme BbvCI. Any other nicking site would work as well, an advantage of using BbvCI however is that two forms of that enzyme are commercially available, one nicks the DNA at the plus strand and the other at the minus strand. A nicked DNA is a valid priming site for a Rolling Circle Amplification (RCA) reaction. Depending on the case, we may want to use nicked DNA instead of DNA-primers to initiate the polymerization.

(3) Is an accessory blunt restriction site, in the example case it is the recognition site of Swel. A second blunt restriction site allows the capture of a second DNA fragment in a further circularization reaction.

(4) Is a cloning site, a double-inverted BbsI site in the example, that can be used for easy extension of the backbone via Golden-Gate or other types of cloning.

(5) Represents a flexible DNA stretch (linker). It can vary in length and aids efficient circularization.

(6) The capital N indicates a stretch of nucleic acids that code for a unique identifier. It is a barcode-like sequence. It can code for one or more (random) barcodes of any suitable size.

The elements (1) are located at the extremities, the elements 2-6 can have any order and can be present or not depending on the case.

Figure 4. We have developed a method to detect gene fusions based on targeted cDNA synthesis, single-stranded DNA (ssDNA) circularization and targeted rolling circle amplification. The cDNA that is produced by a reverse transcriptase step is grey-shaded in the left hand panel. The bottom cDNA is a fusion gene DNA and has two shades indicating the part from one gene and the part of another gene. It is clear that the RCA assay yields concatemers of the fusions. The method can of course also be used to determine the sequence of one or more cDNA that are not the result of a fusion of genes.

Figure 5. Schematic representation of the MIP probes and method

Figure 6. The DNA content per band was plotted as well as the predicted value.

Figure 7. Determining the efficiency of circularization: A) comparison of insert before and after the reaction. B) comparison of circularized product and unreacted product.

Figure 8. Results of proof-of-concept experiment. (A) Gel picture indicating the RCA product that was used for subsequent sequencing on nanopore MinION. (B) Nanopore read length distribution of MinION R9.4 run that was performed with the sample indicated in (A) as input. (C) Pattern score distribution for 2,083 reads larger than 10 kb. (D) Schematic outline of a nanopore sequence read with alternating insert (green) and backbone (red), alignment of the insert sequences and generation of a consensus from the aligned inserts. (E) Consensus accuracy.

Figure 9. Circularization of Backbone 2 (BB2) and Backbone 3 (BB3) with insert 17.2 at 3:1 ratio. A) comparison of BB2 and BB3. Red asterisk: correct circularized product. Multiple bands in condition 3: linear versus circular products, ligation of multiple backbones. Additional band in condition 4: circularized backbone. B) Successful circularization using Backbone 2 (BB2). Yellow asterisks: correct circularized product. In this gel the whole reaction was loaded on each lane. Lane 1 and 2 represent the circularization of BB2 and insert 17.2 at 1:1 ratio before and after PlasmidSafe digestion. Lane 3 and 4 represent the same circularization reaction using BB2 and insert 17.2 at 3:1 ratio.

Figure 10. Efficiency of backbone circularization with varying backbone-insert ratios. Ligation products of varying backbone (BB3) to insert (17.2) ratios were examined qualitatively and quantitatively. (A) Agarose gel displaying circularization input and reaction products. PlasmidSafe digestion was used to remove remaining linear products after the circularization reaction. Red asterisks: correct circularized product. Yellow asterisks: remainder of the insert input. (B) Quantification of the circularization efficiency. The circularization efficiency was defined as P/I × 100, where P is the amount of correct backbone-insert product (in moles, red asterisks) and I is the amount of input insert (in moles). The intensity and surface area of the bands were measured using the software ImageJ (https://en.wikipedia.org/wiki/ImageJ). The data was normalized using the GeneRuler 50bp DNA ladder as a reference. See also Materials and Methods, section 10.
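The efficiency defined in this legend, P/I × 100 with both amounts in moles, can be sketched as a short calculation. The conversion from band mass to moles below assumes about 660 g/mol per base pair, a common approximation that is not stated in this description; the band masses and lengths in the example are hypothetical.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair; common approximation (assumption)


def pmol_from_ng(mass_ng: float, length_bp: int) -> float:
    """Convert a dsDNA band mass (ng) into picomoles."""
    return mass_ng / (length_bp * AVG_BP_MASS) * 1000.0


def circularization_efficiency(product_ng: float, product_bp: int,
                               input_insert_ng: float, insert_bp: int) -> float:
    """Efficiency = P / I * 100, with P and I expressed in moles."""
    p = pmol_from_ng(product_ng, product_bp)
    i = pmol_from_ng(input_insert_ng, insert_bp)
    return p / i * 100.0


# Hypothetical band quantities: 20.05 ng of a 401 bp product, 15.8 ng of a 158 bp insert
print(circularization_efficiency(20.05, 401, 15.8, 158))  # close to 50 (percent)
```

Working in moles rather than mass matters because the product is longer than the insert, so a heavier product band can still correspond to fewer molecules than were put in.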

Figure 11. The efficiency of circularization of BB2_100 (orange bars) with and without addition of SrfI and HMGB1. Ligation was performed using the indicated backbone:insert ratios. The blue bars represent the control experiments with BB2 and BB3 ligated with the same insert without addition of SrfI or HMGB1. Circularization efficiency was quantified as described above (Figure 10 legend).

Figure 12. Visual display of reaction products of the circularization with backbone BB2_100 and insert 17.2. Red asterisk: correct product. Orange box: predicted position of residual insert after circularization. The insert was completely ligated as shown by the full disappearance of the insert band after ligation (Circularized 1). Circularized 1: before Plasmid Safe DNAse treatment. Circularized 2: after Plasmid Safe DNAse treatment.

Figure 13. The effect of the addition of the restriction enzyme SrfI in the circularization reaction. A circularization reaction was performed using BB3 together with insert 17.2. The reaction was performed in presence and absence of SrfI and plasmid safe DNAse.

Figure 14. Barcoding strategies useful with the described technology. (A) Use of unique molecular identifiers to tag individual DNA molecules for improving mutation discovery. (B) Use of sample-specific barcodes to label individual samples for pooling on a sequencing run.

Figure 15. RCA products using a variety of DNA templates. RCA was performed using circular DNA templates derived from a variety of sources. (A) cell-free DNA circularized with backbone BB2; (B) plasmid pX_Zeo; (C) ss-cDNA self-circularized using CircLigase II (Epicentre #CL9021K); (D) the PCR product 17.2 cloned into the plasmid pJET. As a reference, a long-range 1Kb ladder was used. The highest band of the ladder is 10Kb long; the RCA products are estimated to be between 20 and 100 kb long.

Figure 16. Number of reads containing 17.1 and 17.2. The ratio between the reads containing 17.1 and those containing 17.2 is 1:14, indicating a stark enrichment of the target region due to site-directed RCA.

Figure 17. Overview of reaction products of steps from a one-pot reaction design. An insert (17.2) and backbone (BB2_100) were circularized, yielding the products indicated with (1). Linear DNA products were digested using Plasmid Safe DNAse as indicated by (2). An RCA reaction product is formed based on (2) as input, as indicated by (3).

Figure 18. Overview of consensus calling methods for short read sequencing (left panel) and long read sequencing (right panel).

Figure 19. (A) Example of mapped inserts with TP53 mutation derived from one Cyclomics sequencing read (red box). (B) Plot showing fraction of inserts that support a non-reference allele for 588 reads with >4 inserts. Four reads show a high fraction of non-reference allele and these contain inserts with the expected chr17:7578265, A->T mutation.

Figure 20. Capture of DNA with a target site specific recombinase. The recognition sites for the target site specific recombinase are indicated by the letters A and B. The target DNA is indicated by the wording “insert”. The sites A and B can be introduced in various ways, such as by ligating adaptors with the sites to the insert DNA or by amplifying the insert with primers that comprise a sequence coding for said sites A and B. The backbone is indicated by the term “backbone”. In the figure the backbone is a circular molecule comprising DNA between the two sites A and B. This intervening DNA comprises a restriction site that is unique to the entire backbone. The arrows indicate that the insert and the backbone are first recombined by adding the recombinase and that subsequently the restriction enzyme is added. The restriction enzyme will cut only unreacted backbone and backbones in which the linker is replaced by the insert. Linearized DNA can be removed by adding an appropriate exonuclease.

Figure 21. Comparison between ligation reaction products and efficiency of different backbone designs. Left-side: ligation of different backbones with a 250bp PCR amplicon. Right-side: remaining circular product after digestion of linear DNA with plasmid-safe DNAse. Ligation to all members of the BB200 series showed a high circularization efficiency, as demonstrated by the formation of circular product consisting of PCR product and backbone.

Figure 22. Comparison of ligation efficiency of backbones from the BB200 series. Left-side: gel showing the ligation product of the 3 backbones. BB200_4 showed brighter bands indicating more product formed during the reaction. Right-side: measurements of the brightness of the bands. From top to bottom: BB200_2, BB200_4 and BB200_5.

Figure 23. Differences in number of sequencing reads derived from RCA products formed by ligation with backbones of the BB200 series. Percent of reads coming from different backbones in two independent experiments (red and blue). The backbones were initially mixed at 1:1:1 ratio. The higher number of reads having BB200_4 is consistent with the higher ligation efficiency shown in Figure 22.

Figure 24. Base inference of a particular position of the gene TP53 (GRCh37 17:7577518).

The Y-axis represents the distance (median fit score) between a modeled nanopore signal corresponding to a reference sequence and the signal derived from an experimental sequence. The greater the distance the more difficult it is to infer the correct base. On the X-axis are the number of inserts found in a read-segment. The inferred bases are indicated with different colors. The signal coming from the forward strand is less clear than the one measured on the reverse strand. This makes it difficult to distinguish the correct base (A, in blue) versus other possible bases even when the calculated distance is low.

Figure 25. Agarose gel depicting the product of a circularization reaction (S) between backbone and insert. The negative control is designated as C-. The band corresponding to the Circular BB-I product was isolated from gel.

Figure 26. Agarose gel showing example product after rolling circle amplification.

Figure 27. BB200_4 (243bp, indicated as BB in the figure) and S1_WT (158bp, indicated as I in the figure) were circularized and amplified by RCA. When digesting concatemers made by BB-I we expect a band around 400bp, while if the concatemer consists of only BB, the resulting band should be around 250bp. Concatemers formed by only I would not be digested, leaving the RCA band visible.
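The expected band sizes in the Figure 27 legend follow from the unit lengths. The sketch below assumes that the enzyme used cuts exactly once in each backbone copy and not in the insert, as the legend implies; the function name is illustrative and end fragments are ignored.

```python
UNIT_BP = {"BB": 243, "I": 158}  # lengths given for BB200_4 and S1_WT


def internal_fragments(units, cut_unit="BB"):
    """Lengths (bp) of fragments lying between consecutive cut sites.

    `units` lists the repeating elements of a linear concatemer in order.
    Each `cut_unit` is assumed to carry one cut site at the same internal
    position, so the distance between two neighbouring sites equals the
    summed length of the units from one cut_unit up to the next.
    """
    cut_indices = [i for i, u in enumerate(units) if u == cut_unit]
    fragments = []
    for a, b in zip(cut_indices, cut_indices[1:]):
        fragments.append(sum(UNIT_BP[u] for u in units[a:b]))
    return fragments


print(internal_fragments(["BB", "I"] * 4))  # [401, 401, 401]
print(internal_fragments(["BB"] * 4))       # [243, 243, 243]
print(internal_fragments(["I"] * 4))        # []: no cut sites, nothing released
```

The 401 bp and 243 bp fragments correspond to the bands expected around 400bp and 250bp, respectively.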

EXAMPLES

Example 1

Materials and Methods

Methods are described herein that allow detection of genetic variations by sequencing of nucleic acid sequences in a pool of nucleic acid molecules (DNA) like ctDNA (circulating tumor DNA), cfDNA (cell-free DNA), genomic DNA, RNA, products of the polymerase chain reaction (PCR) or other products. In this process, a "product" obtained is a long (>10Kb) linear dsDNA formed by multiple units consisting of nucleic-acid-backbone copies. The concatemerization/multimerization of such a unit is necessary to discriminate the detection of a real genetic variation from a sequencing error. In fact, in the case of rare genetic variations, that occur in less than 1% frequency within a pool of DNA molecules, direct sequencing, e.g. short-read sequencing, cannot be applied anymore, because the sequencing error rate is higher than the mutation frequency. Using the method described above, the same rare sequence (genetic variation) is represented multiple times in long concatemers, which provides very high confidence about mutation presence, even if the mutation frequency is low in the original pool of nucleic acid molecules.

The design of backbones to capture nucleic acids brings several advantages for obtaining nucleic acids with high efficiency and specificity and is crucial for computational analysis of sequencing data.

Preferred features of the backbone molecules are:

1) blunt restriction sites coded at the extremities of the backbone serve for improved ligation efficiency of the short DNA.

2) recognition sites for DNA nicking enzymes that are used to generate the "single-stranded template" for rolling circle amplification.

3) restriction enzyme target sites that can be used to perform sequential ligation of multiple short DNA molecules into one circular DNA.

4) molecular identifiers that enable discrimination of original captured nucleic acids and their subsequent sequencing reads.

Additionally, the backbones serve as a control sequence during data analysis.

Based on the backbone sequence reads, the error-rate of each sequencing read can be inferred, enabling accurate estimation of the likelihood of genetic variations within captured nucleic acid sequences.
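Because the backbone sequence is known in advance, the backbone copies within a read can be compared to it to estimate how noisy that read is. The sketch below shows one simple way this could be done, assuming each backbone copy has already been extracted and aligned to the reference at equal length, with "-" marking a deleted base; it is an illustration and not the analysis used in the Examples. The sequences are made up.

```python
def mismatch_rate(reference: str, observed_copies) -> float:
    """Fraction of positions that differ from the known backbone sequence,
    pooled over all backbone copies found in one read."""
    total = 0
    errors = 0
    for copy in observed_copies:
        if len(copy) != len(reference):
            raise ValueError("copies must be aligned to the reference length")
        total += len(reference)
        errors += sum(1 for r, o in zip(reference, copy) if r != o)
    return errors / total if total else float("nan")


# Hypothetical backbone reference and three aligned copies from one read
reference = "ACGTACGTAC"
copies_in_read = ["ACGTACGTAC", "ACGAACGTAC", "ACGTAC-TAC"]
print(round(mismatch_rate(reference, copies_in_read), 3))  # 0.067
```

A read with a high backbone error rate can then be given less weight when a candidate variant in its inserts is evaluated.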

2. Materials and Methods for Cyclomics technology

2.1 Backbones Design

We went through an iterative approach of alternating backbone design followed by experimental testing to find the best backbones that would allow most efficient circularization, i.e. capturing of double-stranded nucleic acid molecules.

The basal design of our backbones can include one or more of the following parts:

1) A 3’ sequence coding for half of a blunt restriction site

2) One or more sites that allow nicking of the double-stranded backbone sequence

3) One or more Type I or Type II restriction sites

4) A secondary cloning site

5) A flexible DNA stretch that enables efficient circularization (bending) of the backbone molecule

6) Unique molecular barcode sequences to tag each individual backbone molecule

7) A 5’ sequence coding for the other half of the same blunt restriction site used in 1

8) Phosphorylation at the 3’ and 5’ end of the backbone molecule

The flexible DNA stretches have been designed with the help of a custom-made evolutionary algorithm imposing various selection criteria among which:

1) a high overall complexity of the sequence

2) absence of repeated DNA motifs longer than 5 bases

3) absence of self-complementary sequences of more than 5 nucleotides

4) at each design iteration cycle the most flexible sequences are selected

Following the above design, each sequence was manually checked using the mFold server (http://unafold.rna.albany.edu/) and modified to reduce as much as possible the formation of hairpins and, in general, complex secondary structure.
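Two of the selection criteria listed above, the absence of repeated motifs longer than 5 bases and the absence of self-complementary stretches of more than 5 nucleotides, can be expressed as simple sequence filters. The sketch below is one possible reading of those criteria and is not the custom evolutionary algorithm itself: a repeat is taken to be any 6-mer occurring more than once, and self-complementarity is taken to mean that the reverse complement of a 6-mer also occurs in the sequence. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq: str, k: int):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def has_repeated_motif(seq: str, k: int = 6) -> bool:
    """True if any k-mer occurs more than once (a repeat longer than 5 bases)."""
    seen = set()
    for kmer in kmers(seq, k):
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def has_self_complement(seq: str, k: int = 6) -> bool:
    """True if the reverse complement of any k-mer also occurs in the sequence."""
    present = set(kmers(seq, k))
    return any(reverse_complement(kmer) in present for kmer in present)


def passes_filters(seq: str) -> bool:
    return not has_repeated_motif(seq) and not has_self_complement(seq)


print(passes_filters("ACGTTGACGTTG"))  # False: ACGTTG occurs twice
print(passes_filters("GGATCCAAAA"))    # False: GGATCC pairs with itself
print(passes_filters("AAACAAGACT"))    # True
```

In an iterative design such filters would be applied to each candidate before the most flexible sequences are selected; a secondary structure check such as the mFold step remains necessary because these filters only catch exact matches.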

2.2 Preparation of DNA templates for Rolling Circle Amplification

The following protocols are meant for the preparation of a template suitable for an RCA reaction. Any circular DNA is a suitable template, and different protocols are available and known in the art that can handle either dsDNA or ssDNA.

Examples of dsDNA that we can circularize are: cfDNA, ctDNA, sheared genomic DNA, PCR amplicons.

Examples of ssDNA include: cDNA, viral DNA.

2.3 dsDNA Circularization Reaction

A dephosphorylated dsDNA molecule, here called “insert”, is ligated to a phosphorylated backbone at both ends, forming a circular dsDNA product.

The reaction is carried out with the simultaneous use of a DNA ligase and a restriction enzyme in the appropriate buffer conditions.

The buffer conditions have been optimized to allow ligation, digestion and Plasmid-Safe treatment in a one-pot reaction, without intermediate DNA purification steps. Considering the backbone in the Example, having SrfI half-sites at the extremities, the components of the reaction are the following:

Buffer 1X

50 mM Potassium Acetate

20 mM Tris-acetate

10 mM Magnesium Acetate

100 µg/ml BSA

1 mM ATP

10 mM DTT

DNA and enzymes

Backbone + Insert in a 3:1 molar ratio

1 unit T4 DNA Ligase

1 unit SrfI

1 unit HMGB1

H2O was added to a final volume of 20 to 50 µl (depending on the DNA load), followed by 1 h incubation at 22°C and subsequent heat inactivation for 15 min at 65°C.

The presence of the restriction enzyme increases the overall yield of the reaction, avoiding the accumulation of backbone concatemers, while the concatemerization of the inserts is avoided by preventive dephosphorylation. HMGB1 (high-mobility group protein 1) is used to facilitate bending of short DNA, thus increasing circularization efficiency.

The most abundant product of the above reaction is a circular dsDNA containing one backbone and one insert.

Removal of linear DNA

To remove residual linear dsDNA our templates are treated with 1 µl of Plasmid-Safe DNase for 15 min at 37°C, followed by heat inactivation for 30 min at 70°C.

3. Materials and Methods for detection of fusion genes

We have developed a protocol that employs circularization and rolling circle amplification for the detection of fusion genes, based on RNA extracted from human cells. In this case, ssDNA (as opposed to dsDNA) is used as input for the circularization and amplification reaction. The protocol can be generalized to sequence any RNA of interest. The first parts of the protocol involve standard procedures for “RNA extraction” and “cDNA amplification”, e.g. Trizol-based RNA isolation followed by polyT primer cDNA synthesis using reverse transcriptase.

After digestion of RNA from RNA-DNA hybrids, we are left with linear ssDNA.

At this point, we use an ssDNA ligase to self-circularize the input DNA.

Different ssDNA ligases are available on the market. We have used CircLigase II (Epicentre) to perform proof-of-principle experiments. Circular ssDNA obtained following the vendor protocol has been successfully used as a template for the RCA reaction, using specific primers to direct the amplification of the fusion gene of interest.

The following protocol describes in detail all the steps that come right after RNA isolation.

3.1 Removal of residual DNA

Buffer, enzyme and inactivation reagent were purchased from Thermo Fisher (TURBO DNase kit)

In a 0.5 ml tube mix:

10x Reaction Buffer 1 µl

Extracted RNA 1 µg

TURBO DNase 0.5 µl

H2O to 10 µl final volume

Next, the solution was mixed and incubated for 30 min at 37°C. The reaction was inactivated by adding 2 µl of inactivation reagent and mixing for 5 min.

3.2 cDNA synthesis

We used the SuperScript II kit from Invitrogen; any other kit for cDNA synthesis may be used instead.

To the previous reaction was added:

Primers 2 µM (random hexamers or specific; primers are phosphorylated) 1 µl

dNTPs (10 mM each) 1 µl

Incubated at 65°C for 5min then put on ice for 5min. The primers were annealed to the template during this step.

Next, the following was added:

5x First Strand buffer 4 µl

100 mM DTT 1 µl

Incubated at 42°C for 2 min, then added 1 µl of SSII enzyme and incubated at 42°C for 45 min. Finally, we inactivated the reaction at 70°C for 15 min.

3.3 Removal of RNA

RNaseH was purchased from ThermoFisher.

To the previous reaction 1 µl of RNaseH enzyme was added and incubated at 37°C for 20 min, then heat-inactivated at 70°C for 10 min.

3.4 Chelation of divalent metal ions

We added this step to lower the concentration of free Mg2+ that would otherwise inhibit the CircLigase II used in the next reaction.

To complex all free Mg2+, to the previous reaction was added:

50 mM EDTA stock (0.9 g EDTA + 50 ml H2O) 2 µl

3.5 ssDNA circularization

For this reaction we used the ssDNA CircLigase II kit from Epicentre.

10x Reaction Buffer 2 µl

MnCl2 (manganese (Mn) is not to be confused with magnesium (Mg)) 1 µl

Betaine 4 µl

CircLigase II 1 µl

ss-cDNA 10 pmoles

H2O to 20 µl final volume

Incubated at 60°C for 1-2 hours, then heat-inactivated the reaction at 70°C for 10 min.

At this point the reaction was treated with PlasmidSafe (optional) and used as a template for the RCA reaction (following steps).

3.6 Primer annealing

Depending on the case, we used random primers, backbone-specific primers or target-specific primers.

In this step, the template DNA could also be single-stranded circular DNA, as in the case of self-circularized cDNA.

The primers have two 3'-terminal phosphorothioate (PTO) modified nucleotides that are resistant to the 3'→5' exonuclease activity of proofreading DNA polymerases, such as phi29 DNA Polymerase. They also have 5'- and 3'-hydroxyl ends.

If the circular DNA is in water, for example in the case of a previous purification, then add 11% of the volume of 10X Annealing Buffer.

10X Annealing Buffer

100 mM Tris, pH 7.5 - 8.0

500 mM NaCl

10 mM EDTA

Concentrated primers (50-100 µM) were added to the reaction to a final concentration of 5 µM.

The reaction was brought to 98°C and subsequently left to cool down slowly to room temperature.

Rolling Circle Amplification

The following volumes were calculated for a 50 µl reaction.

When the template was in Annealing buffer, 20 µl of the template was taken and to it was added:

5 µl Phi29 Buffer (10x)

1 µl BSA

1 µl dNTPs (10 mM)

2 µl Pyrophosphatase

1 µl Phi29 DNA Polymerase

H2O to 50 µl

When the template was in Circularization buffer, 46 µl of the template was taken and to it was added:

1 µl dNTPs (10 mM)

2 µl Pyrophosphatase

1 µl Phi29 DNA Polymerase

Optionally, to the reaction was added:

0.5 µl Uracil-DNA glycosylase (to remove any deaminated cytosines from DNA)

0.5 µl Formamidopyrimidine-DNA glycosylase (to remove 8-oxo-guanine products)

Reaction condition

Depending on the amount of input DNA (template) the reaction was run:

- 3 h at 30°C if the template is 10-50 ng

- 6 h at 30°C if the template is 5-10 ng

- 12 h at 30°C if the template is 0.5-5 ng
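The incubation times listed above can be expressed as a small lookup. This is an illustrative sketch only; the function name is hypothetical, and because the listed ranges share their end points (5 ng and 10 ng), assigning a boundary value to the shorter incubation is an assumption.

```python
def rca_incubation_hours(template_ng):
    """Return the RCA incubation time (hours at 30°C) for a given template amount in ng.

    Assumption: boundary values (5 ng, 10 ng) are assigned to the shorter incubation.
    """
    if not 0.5 <= template_ng <= 50:
        raise ValueError("template amount outside the 0.5-50 ng range covered by the protocol")
    if template_ng >= 10:
        return 3
    if template_ng >= 5:
        return 6
    return 12


for amount in (20, 7, 1):
    print(f"{amount} ng template: {rca_incubation_hours(amount)} h at 30°C")
```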

4. Materials and Methods for targeted Cyclomics

To enable ultra-accurate targeted sequencing of any double stranded DNA molecule, we have designed a workflow based on existing molecular-inversion-probe (MIP) technology. Unique aspects are the design of the MIP capture backbone (minimization of backbone size, addition of unique molecular barcodes, probe specificity and distance) and the combination of the assay with rolling circle amplification.

4.1 Generation of Probes

Amplify off-array oligonucleotides (MIP precursors) using PCR: (2.5 hrs)

1. Array-derived MIP precursor oligonucleotides (mixture of 100-mers obtained from Agilent) were dissolved to a final concentration of 100 nM in Tris-EDTA buffer with a pH of 8 and 0.1 % Tween.

2. The following 400 µl PCR mix was prepared in a 1.5 ml centrifuge tube. It was split into 8 x 50 µl reactions in 0.2 ml PCR tubes. One PCR preparation yielded around 1.5 µg of amplified DNA.

3. The following PCR cycling program was used on a real-time thermocycling instrument such as the Biorad MJ Mini.

1) 98 °C for 30 seconds

2) 98 °C for 10 seconds

3) 60 °C for 30 seconds

4) 72 °C for 30 seconds (read plate)

5) repeat steps 2 to 4 for 25 cycles

6) 4 °C indefinitely

4. PCR reactions were combined and cleaned up on one column using the QIAquick PCR purification kit following the manufacturer’s instructions. Eluted with 90 µl elution buffer.

5. Used a Qubit High Sensitivity dsDNA Assay Kit to quantify 1 µl of the amplified DNA.

6. Analyzed 1 µl amplified DNA on a 6 % TBE PAGE gel (Invitrogen) to verify amplification. Product appeared as a single band at 110 bp, as the primers added an additional 10 bp.

Digest PCR product with nicking restriction endonucleases to generate 70-mer MIPs HAhrs):

1. Added 10 µl of NEB-2 (10x) and 5 µl of Nt.AlwI (10 U/µl; NEB) to 85 µl of PCR product (total volume of 100 µl)

2. Mixed and split into two tubes of 50 µl each. Incubated at 37 °C for 3 hours, followed by 80 °C for 20 minutes in a thermocycler

3. The temperature was dropped to 65 °C for at least 1 minute. Added 2.5 µl of Nb.BsrDI (2 U/µl; NEB) to each of the 50 µl reactions

4. Left at 65 °C for 3 hours, followed by 80 °C for 20 minutes

5. Purified two 50 µl digestion reactions on one column using reagents from the QIAquick Nucleotide Removal Kit. Eluted each column in 30 µl elution buffer. We have observed yields of 80-90 % for this step.

Quantify usable probe using a denaturing gel (2 hrs):

1. Accurate quantification of usable MIP inside the digested probe mix is important as it determines how much probe mix to add to the capture reaction.

2. Prepared two-fold dilutions of a NEB 100 bp DNA ladder (we used dilutions from 500 ng to 62 ng).

3. Mixed 2x TBE-Urea sample buffer (Invitrogen) with 1 µl digested probe and the dilutions made above.

4. We denatured DNA by heating to 95 °C for 5 minutes and immediately transferring to ice.

5. Samples were run on a precast 6 % TBE-urea denaturing PAGE gel (Invitrogen) for 1 hr at 160 V.

6. The amount of usable MIP was quantified in the digested mixture by comparing the intensity of ladder dilutions with the intensity of the 70 bp band. We used this MIP concentration when determining the volume of probe mix to add to a capture reaction.

4.2 Capturing Exons with Molecular Inversion Probes

Hybridize probes to genomic DNA (37 hrs):

1. For each sample to capture, we added the following reagents in a 0.2 ml PCR tube. The final capture reaction volume was 25 µl. Because there is no size selection of the 70 bp MIP, the volume of probe mix to add was based on the concentration of usable MIP.

2. Denatured at 95 °C for 10 minutes.

3. Incubated at 60 °C for at least 36 hours to hybridize MIPs to gDNA.

Circularize captured exons: (1 day)

1. Prepared a mix of ligase and polymerase enzymes to add to each capture reaction:

Prepared this mix on ice, and kept it cold before adding 4.7 µl into the capture reaction.

2. Incubated at 60 °C for an additional 24 hours to allow for gap-fill and ligation to circularize captured regions.

Exonuclease select for circularized product: (1 hr)

1. Prepared a mix of exonucleases to add to each capture reaction in order to remove uncaptured gDNA, excess probe and blocking oligonucleotide:

2. Reduced the temperature of the capture reaction to 37 °C and allowed it to incubate for at least one minute before adding 4 µl of exonuclease mix.

3. Incubated for 15 minutes at 37 °C.

4. Inactivated exonuclease enzymes by heating reaction at 95 °C for 2 minutes.

5. Used 100 ng of the reaction product as the template for rolling circle amplification.

4.3 Rolling circle amplification

1. Vacuum dried 20 µl 2X annealing buffer to 1 µl

2. Added 40 µl circular DNA (around 10 ng)

3. Added 4 µl 50 µM random primers

4. Incubated the reaction for 5 min at 90°C and slowly cooled it down to room temperature

5. Added the following reagents: 10 µl 10x Phi29 buffer, 2 µl 100x BSA, 2 µl 10 mM dNTPs, 2 µl Phi29 polymerase, 4 µl Pyrophosphatase 0.1 U/µl, 40 µl water

6. Incubated for 19 hours at 30°C, followed by 10 min at 65°C

7. Cleaned the reaction products using Ampure XP beads (0.4 volumes)

The cleaned reaction product was used for any long read sequencing protocol.

5. List of backbones designed and tested

>BB1 (199bp)

GGGCATGCACAGATGTACACGTACGATCATGTACGTCACGCGAGTGCACGTCGTCATAGC TGTC

GAGTACTGTACTGACTGTCTCGAGCCTCAGCGAGTATTTAAATCTACGTAGAGTACG ACTGCGC

AGATGTGATCAGTGACTACGTGACACTGTACATCAGCACGATCGATGACTAGATGCT GCATGAC

ATAGCCC

>BB2 (259bp)

GGGCATGCACAGATGTACACGTACGATCATGTACGTCACGCGAGTGCACGTCGTCATAGC TGTC

GAGTACTGTACTGACTGTCTCGAGCCTCAGCGAGTATTTAAATCTACGTCACCGGGT CTTCGAG

AAGACCTGTTTAGAGTACGACTGCAAATGGCTCTAGAGGTACCCGTTACATAACTTA CGCAGAT

GTGATCAGTGACTACGTGACACTGTACATCAGCACGATCGATGACTAGATGCTGCAT GACATAG

CCC

>BB2_100 (341)

GGGCATGCACAGATGTACACGTACGATCATGTACGTCACGCGAGTGCACGTCGTCATAGCTGTC

GAGTACTGTACTGACTGTCTCGAGCCTCAGCGAGTATTTAAATCTACGTCACCATATATATGGA

TATATATATGGATATATATATATATGGATATATGGATATATATATATATATATGGATATGTATG

GATATATATATATATGGATATGGATGTTTAGAGTACGACTGCAAATGGCTCTAGAGGTACCCGT

TACATAACTTACGCAGATGTGATCAGTGACTACGTGACACTGTACATCAGCACGATCGATGACT

AGATGCTGCATGACATAGCCC

>BB3 (514bp)

AACGCCAGCAACGCGGCCTTTTTACGGTTCCTGGCCTTTTGCTGGCCTTTTGCTCACATG TGAG

GGCCTATTTCCCATGATTCCTTCATATTTGCATATACGATACAAGGCTGTTAGAGAG ATAATTG

GAATTAATTTGACTGTAAACACAAAGATATTAGTACAAAATACGTGACGTAGAAAGT AATAATT

TCTTGGGTAGTTTGCAGTTTTAAAATTATGTTTTAAAATGGACTATCATATGCTTAC CGTAACT

TGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAACACCGGG TCTTCGA

GAAGACCTGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAA CTTGAAA

AAGTGGCACCGAGTCGGTGCTTTTTTGTTTTAGAGCTAGAAATAGCAAGTTAAAATA AGGCTAG

TCCGTTTTTAGCGCGTGCGCCAATTCTGCAGACAAATGGCTCTAGAGGTACCCGTTA CATAACT

TA

>BBpX2 (557bp)

GGGCATGCACAGATGTACACGAACGCCAGCAACGCGGCCTTTTTACGGTTCCT

GGCCTTTTGCTGGCCTTTTGCTCACATGTGAGGGCCTATTTCCCATGATTCCTT

CATATTTGCATATACGATACAAGGCTGTTAGAGAGATAATTGGAATTAATTTGA

CTGTAAACACAAAGATATTAGTACAAAATACGTGAGGTAGAAAGTAATAATTT

CTTGGGTAGTTTGCAGTTTTAAAATTATGTTTTAAAATGGACTATCATATGCTT

ACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGA

CGAAACACCGGGTCTTCGAGAAGACCTGTTTTAGAGCTAGAAATAGCAAGTTA

AAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTT

TTTGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTTTTAGCG

CGTGCGCCAATTCTGCAGACAAATGGCTCTAGAGGTACCCGTTACATAACTTA

TAGATGCTGCATGACATAGCCC

6. Additional details on the generation of BB2_100 and BBpX2

BB2 was optimized by inserting flexible sequences using the BbsI cloning site present in BB2. The insert consisted of a 100 bp long DNA stretch obtained by in-silico design (see section 8), and a BbsI restriction site (bold in the sequences below) was added at the extremities for cloning purposes. The full insert was obtained by annealing two shorter oligonucleotides (sense and antisense). The oligonucleotides were ordered as single-strand oligonucleotides from IDT DNA Technologies. The forward and reverse strands were annealed and the annealing product, now a dsDNA with sticky ends, was resolved on an agarose gel. Subsequently, the insert was cloned into BB2 with a Golden-Gate cloning reaction similar to the one described by Ran et al. (2013).

The full insert sequence and the oligos used to produce it are the following:

>insert BB2_100

CACCATATATATGGATATATATATGGATATATATATATATGGATATATGGATAT

ATATATATATATATGGATATGTATGGATATATATATATATGGATATGGATGTTT

>sense oligo

CACCATATATATGGATATATATATGGATATATATATATATGGATATATGGATAT

ATATATATATATATGGATATGTATGGATATATATATATATGGATATGGAT

>antisense oligo

AAACATCCATATCCATATATATATATATCCATACATATCCATATATATATATAT

ATATCCATATATCCATATATATATATATCCATATATATATCCATATATAT

BBpX2 was obtained by addition of the SrfI half-sites (GGGC) and the rest of the Universal Primer sequences (underlined in the sequences below) at the extremities of a PCR amplicon. BB3 was used as a template for the PCR reaction.

The sequences above have the template-annealing part in lowercase and a flanking region in uppercase. The SrfI half-sites are highlighted in orange; the rest of the uppercase sequence is part of a constant sequence present at the extremities of all our backbones. The constant sequence is not essential for the backbone but is useful to standardize their amplification during the production steps. The following primers are indeed able to amplify any backbone made so far:

> Universal SrfI-BB - F

GGGCATGCACAGATGTACACG

> Universal SrfI-BB - R

GGGCTATGTCATGCAGCATCTA
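The relationship between the universal primers, the backbone ends and the SrfI site can be checked with a few lines of code. This is a minimal sketch with hypothetical helper names; the forward primer is written as it appears at the start of BB1, the toy backbone is invented for illustration, and GCCCGGGC is the SrfI recognition sequence.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

SRFI_SITE = "GCCCGGGC"
UNIVERSAL_F = "GGGCATGCACAGATGTACACG"
UNIVERSAL_R = "GGGCTATGTCATGCAGCATCTA"


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def self_ligation_junction(backbone, flank=4):
    """Sequence formed when the 3' end of a backbone is joined to its own 5' end."""
    return backbone[-flank:] + backbone[:flank]


def is_amplifiable(backbone):
    """True if the backbone starts with the F primer and ends with the R primer site."""
    return backbone.startswith(UNIVERSAL_F) and backbone.endswith(reverse_complement(UNIVERSAL_R))


# Toy backbone (hypothetical core sequence) flanked by the universal primer sites.
toy_backbone = UNIVERSAL_F + "ACGTACGTTTGACA" + reverse_complement(UNIVERSAL_R)
print(is_amplifiable(toy_backbone))
print(self_ligation_junction(toy_backbone) == SRFI_SITE)
```

A self-ligated backbone therefore recreates a full SrfI site at the junction, whereas a backbone-insert junction does not unless the insert end happens to supply the other half-site.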

7. List of insert sequences

>Insert 17.1 (TP53, chr17:7576971-7577132)

TAACTGCACCCTTGGTCTCCTCCACCGCTTCTTGTCCTGCTTGCTTACCTCGCTTAGTGC TCCC

TGGGGGCAGCTCGTGGTGAGGCTCCCCTTTCTTGCGGAGATTCTCTTCCTCTGTGCG CCGGTCT

CTCCCAGGACAGGCACAAACACGCACCTCAAAG

>Insert 17.2 (TP53, chr17:7578161-7578394)

CAGTTGCAAACCAGACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGAAAAGTGT TTCT

GTCATCCAAATACTCCACACGCAAATTTCCTTCCACTCGGATAAGATGCTGAGGAGG GGCCAGA

CCTAAGAGCAATCAGTGAGGAATCAGAGGCCTGGGGACCCTGGGCAACCAGCCCTGT CGTCTCT

CCAGCCCCAGCTGCTCACCATCGCTATCTGAGCAGCGCTCAT

8. In-silico design of flexible DNA sequences

Flexible DNA sequences were used to improve the flexibility of BB2 by addition of a sequence of 100 bp that was specifically designed using a simple genetic algorithm. The same approach was used to design whole backbone core sequences from scratch. To these backbone core sequences, restriction sites, barcodes or primer sites can be added as described elsewhere herein.

The optimization of backbone core sequences was done based on an evolutionary selection algorithm that optimizes the sequence for the following components:

1) High molecular flexibility

2) High sequence entropy

3) GC content between 30 and 60 percent, ideally closer to 50%

4) Absence of long, self-complementary stretches

5) Absence of long oligo polymers (NNNNNNN)

6) Absence of repeated motifs (kmers)

Flexibility calculation

A Python implementation of the TwistFlex algorithm (http://margalit.huji.ac.il/TwistFlex/) (Menconi et al. 2015) was used to compute DNA flexibility at the twist angle of the input sequence. The flexibility of each individual dinucleotide is calculated based on the following table of angular degrees:

Subsequently, the mean flexibility of the entire sequence was considered for the selection in the evolutionary algorithm for backbone optimization. The mean flexibility of a DNA sequence is calculated as the sum of all dinucleotide angular degrees divided by the total number of dinucleotides. The flexibility threshold for our backbones was a mean of 12.5 angular degrees. Any sequence with a mean flexibility lower than 12.5 angular degrees was discarded.

Entropy calculation for determining sequence complexity

The Shannon entropy of a string is defined as the minimum average number of bits per symbol required for encoding the string.

The formula to compute the Shannon entropy is $H = -\sum_{i} p_i \log_2 p_i$, where $p_i$ is the probability of character number i appearing in the sequence. The calculation can also be performed through:

http://www.shannonentropy.netmark.pl/

The above formula was implemented with the following python code:

import math

def quick_entropy(sequence):
    alphabet = set(sequence)  # distinct symbols in the sequence

    # Frequency of each symbol in the sequence
    frequencies = []
    for symbol in alphabet:
        frequencies.append(sequence.count(symbol) / len(sequence))

    # Shannon entropy as in https://en.wiktionary.org/wiki/Shannon_entropy
    ent = 0.0
    for freq in frequencies:
        ent -= freq * math.log(freq, 2)
    return ent

The minimum entropy value required by our backbone core sequences is 1.5 Sh. Each sequence with a lower entropy value was discarded.

Self-complementarity

The selected backbone core sequences were filtered for the presence of self-complementary stretches of 8 bases. A backbone having 8 or more consecutive self-complementary bases in the same strand is discarded.

Absence of repeated motifs (kmers)

Backbone core sequences containing motifs of 6 bases repeated more than twice in the sequence, were filtered out.
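The filters described in this section can be sketched as follows. This is an illustrative reading of the criteria, not the original implementation: the function names are hypothetical, a "self-complementary stretch" is assumed to mean an 8-mer whose reverse complement also occurs in the same strand, repeated motifs are counted with overlapping windows, and the homopolymer criterion is left out for brevity.

```python
import math
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def shannon_entropy(seq):
    """Shannon entropy in bits per symbol."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())


def has_self_complementary_stretch(seq, k=8):
    """True if some k-mer and its reverse complement both occur in seq (assumed definition)."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return any(reverse_complement(kmer) in kmers for kmer in kmers)


def has_repeated_kmer(seq, k=6, max_count=2):
    """True if any k-mer occurs more than max_count times (overlapping windows)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return any(count > max_count for count in counts.values())


def passes_filters(seq, min_entropy=1.5):
    """Apply the GC, entropy, self-complementarity and repeated-motif filters."""
    return (0.30 <= gc_content(seq) <= 0.60
            and shannon_entropy(seq) >= min_entropy
            and not has_self_complementary_stretch(seq)
            and not has_repeated_kmer(seq))


print(passes_filters("ATATATATATAT"))  # pure TA repeat: rejected on GC content
```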

Evolutionary algorithm for design of newer backbones (beyond BB2)

Newer backbones were composed of flexible DNA plus a pair of fixed sequences at the extremities (Universal SrfI-BB F/R described in paragraph 6) that serve as primer-annealing sites for PCR amplification of the backbones and to add the half restriction sites. Like any genetic algorithm (GA), the Cyclomics GA is composed of a main loop where a pool of sequences is scored and selected. The selected ones are then used as input (parents) for the generation of new sequences (children). Both parents and children are grouped in a new pool ready for the next iteration. The pseudocode of such a loop is the following:

for each iteration:
    filter(pool)  # discard unwanted sequences
    score(pool)  # assign a score to each sequence
    parents = select(pool)  # select the best sequences
    children = mate(parents)  # generate new sequences
    pool = parents + children  # combine parents and children in a new pool

The algorithm is fully implemented in Python. The mating and mutation operators, as well as the main loop, were implemented from scratch following the general guidelines found in the literature (Hwang and Jang 2008; Back 1996; Coello Coello and Lamont 2004; Lobo, Lima, and Michalewicz 2007). The mate operator acts on strings (the sequences) and performs a single crossing-over at a random position. The mutation operator adds random mutations to the parent or child sequences; such mutations may include small deletions and duplications. The filtering step is used to prune the pool of sequences having unbalanced GC content, low sequence entropy, or unwanted repeated kmers before the selection step. The selection itself simply collects the best sequences scored by flexibility. Selected sequences are used as parents to produce children using the mating operator. To calculate sequence flexibility, existing code (Menconi et al. 2015) was adapted to fit our purposes.
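The loop described above can be sketched in runnable form. This is a simplified illustration under stated assumptions, not the original code: all names are hypothetical, the mutation operator only makes substitutions (no deletions or duplications), the scoring and filtering functions are passed in by the caller, and the toy score used in the example (number of TA dinucleotides) merely stands in for the flexibility score.

```python
import random

BASES = "ACGT"


def mate(parent_a, parent_b, rng):
    """Single crossing-over at a random position."""
    cut = rng.randint(1, min(len(parent_a), len(parent_b)) - 1)
    return parent_a[:cut] + parent_b[cut:]


def mutate(seq, rng, rate=0.01):
    """Substitute each base with a random base with probability `rate` (simplification)."""
    return "".join(rng.choice(BASES) if rng.random() < rate else base for base in seq)


def evolve(pool, score, keep, n_parents, iterations, rng):
    """Filter, score, select, mate; parents and children form the next pool."""
    for _ in range(iterations):
        pool = [seq for seq in pool if keep(seq)]             # discard unwanted sequences
        parents = sorted(pool, key=score, reverse=True)[:n_parents]  # select the best
        if len(parents) < 2:
            break                                             # not enough sequences to mate
        children = [mutate(mate(*rng.sample(parents, 2), rng), rng) for _ in parents]
        pool = parents + children
    return pool


# Toy run: random 20-mers, scored by TA dinucleotide count (placeholder score).
seed_rng = random.Random(42)
start_pool = ["".join(seed_rng.choice(BASES) for _ in range(20)) for _ in range(10)]
final_pool = evolve(start_pool, score=lambda s: s.count("TA"), keep=lambda s: True,
                    n_parents=4, iterations=5, rng=random.Random(1))
print(len(final_pool))
```

Because the parents are carried over unchanged into each new pool, the best score in the pool never decreases from one iteration to the next.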

9. Results

PCR was performed using primers 17.2-F (CAGTTGCAAACCAGACCTCA) and 17.2-R (ATGAGCGCTGCTCAGATAG) to obtain a PCR product with a length of 234 bp covering a coding exon of TP53 (chr17:7578161-7578394, GRCh37). The PCR product, referred to as 17.2, was ligated into pJET (Thermo Fisher) according to standard procedures. The ligation products were transformed into E. coli Top10 cells and one colony was picked for collection of a (clonally propagated) pJET-17.2 plasmid. The sequence of 17.2 was verified by Sanger sequencing and found to be the same as the reference genome (GRCh37). Phosphorothioate (PTO)-modified primer 17.2-R (ATGAGCGCTGCTCAGATA*G*, where * is the PTO modification) (5 µM) was annealed to 50 ng of pJET-17.2 in the presence of 5 mM EDTA in a final volume of 20 µL. This reaction mixture was heated to 95°C for 5 min followed by cooling to 4°C.

The 20 µL annealing reaction was supplemented with 0.2 U inorganic pyrophosphatase (Thermo Fisher), 10 U Phi29 (NEB), 1 µL of 10 mM dNTPs, 1 µL 100X BSA solution (NEB, 20 mg/mL) and 5 µL Phi29 10X reaction buffer (NEB). The resulting reaction mixture was incubated at 30°C for 3 h followed by 10 min at 65°C. The amplified high-molecular weight DNA was purified using Ampure beads (Agencourt), followed by 1D nanopore library preparation (Oxford Nanopore Technologies, SQK-LSK108). The resulting library was run on a MinION flowcell (FLO-MIN106, R9.4 chemistry) for 48 h.

10. DNA quantification by gel densitometry

Agarose-gel densitometry is a method used to quantify DNA by image-analysis of gel bands by comparison of pixel brightness 1) between the ladder and the band of interest or 2) between the input band and the product band.

10.1 using the ladder as a reference

Given the picture of an agarose gel containing a known amount of DNA ladder, we used the software ImageJ to estimate the brightness intensity of the bands. The correct circular products were quantified and compared to the input to calculate the efficiency of the circularization. The Measure function of ImageJ was used to determine the area and mean intensity of the bands on each gel. The mean intensity of the background of the image, as close as possible to the band in question, was determined and subtracted from the band intensity. The resulting intensity was multiplied by the area of the band (referred to as level). To create a reference level, the level was also calculated for the band corresponding to 400 base pairs in the GeneRuler 50 bp DNA ladder (Thermo Fisher Scientific) in each image, which represents the intensity for fifteen nanograms of DNA. To calculate the DNA content of each band in nanograms, the calculated level was divided by the reference level and multiplied by fifteen. The DNA content in moles was determined using the Promega DNA conversions tool dsDNA: µg to pmol. The efficiency of the circularization was calculated by dividing the correct product in moles by the input of insert in moles and multiplying by 100 percent. To validate this approach, the DNA content of the other bands in the DNA ladder was calculated and compared to the predicted DNA content. The DNA content per band was plotted for the predicted as well as the calculated value (figure 6).
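The ladder-based calculation described above can be written out as a short sketch. The function names and the example numbers are hypothetical, and the mass-to-mole conversion assumes an approximate average of 660 g/mol per base pair rather than reproducing any particular online tool.

```python
def band_level(mean_intensity, background_intensity, area):
    """Background-corrected intensity multiplied by the band area."""
    return (mean_intensity - background_intensity) * area


def band_ng(level, reference_level, reference_ng=15.0):
    """DNA content of a band, scaled to the 15 ng reference band of the ladder."""
    return level / reference_level * reference_ng


def dsdna_ng_to_pmol(mass_ng, length_bp, bp_mass=660.0):
    """Convert ng of dsDNA to pmol (assumes ~660 g/mol per base pair)."""
    return mass_ng * 1000.0 / (length_bp * bp_mass)


def circularization_efficiency(product_pmol, insert_input_pmol):
    """Correct product relative to input insert, in percent."""
    return product_pmol / insert_input_pmol * 100.0


# Made-up measurements for illustration only.
reference = band_level(mean_intensity=180.0, background_intensity=30.0, area=400.0)
product = band_level(mean_intensity=105.0, background_intensity=30.0, area=400.0)
product_ng = band_ng(product, reference)
print(f"product: {product_ng:.2f} ng")
```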

10.2 direct comparison of input and product bands

An alternative procedure to estimate the efficiency of circularization via gel-image analysis requires at least two lanes on the gel, one with the input DNA and one with the product. Using ImageJ we can estimate the ratio between the insert before the reaction (input) and the insert left after the reaction (unreacted).

The brightness of the bands inside the yellow rectangle (input DNA on the left band and unreacted DNA on the right one) is measured and compared. In this case, the ratio between input and unreacted is 66:33. We can conclude that 50% of the initial DNA has reacted (Figure 7A).

Next we compare the product bands (Figure 7B). We know that the top band represents the desired product while the band underneath is the non-circularized product. From the ratio between these two bands we can establish the circularization efficiency, defined as the amount of input DNA that was correctly circularized into the final product. If A is the fraction of input that reacted and B is the fraction of correct product, then the efficiency is given by A*B. In this case, 50% * 50% = 25%.
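The A*B calculation can be reproduced from band intensities as follows. This is a small sketch with hypothetical function names; the intensities 66 and 33 come from the example above, while equal intensities for the two product bands are assumed here to match the stated 50%.

```python
def reacted_fraction(input_intensity, unreacted_intensity):
    """Fraction of the input insert that was consumed in the reaction (A)."""
    return 1.0 - unreacted_intensity / input_intensity


def correct_product_fraction(circular_intensity, non_circular_intensity):
    """Fraction of the product that is the desired circular form (B)."""
    return circular_intensity / (circular_intensity + non_circular_intensity)


def overall_efficiency(a, b):
    """Circularization efficiency as the product A*B."""
    return a * b


a = reacted_fraction(66.0, 33.0)
b = correct_product_fraction(40.0, 40.0)  # assumed equal band intensities
print(f"efficiency: {overall_efficiency(a, b):.0%}")
```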

References

- Back, Thomas. 1996. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press on Demand.

- Coello Coello, Carlos A., and Gary B. Lamont. 2004. Applications of Multi-Objective Evolutionary Algorithms. World Scientific.

- Hwang, Gi-Hyun, and Won-Tae Jang. 2008. “An Adaptive Evolutionary Algorithm Combining Evolution Strategy and Genetic Algorithm (Application of Fuzzy Power System Stabilizer).” In Advances in Evolutionary Algorithms.

- Lobo, F. J., Claudio F. Lima, and Zbigniew Michalewicz. 2007. Parameter Setting in Evolutionary Algorithms. Springer Science & Business Media.

- Menconi, Giulia, Andrea Bedini, Roberto Barale, and Isabella Sbrana. 2015. “Global Mapping of DNA Conformational Flexibility on Saccharomyces Cerevisiae.” PLoS Computational Biology 11 (4): e1004136.

- Ran, F. Ann, Patrick D. Hsu, Jason Wright, Vineeta Agarwala, David A. Scott, and Feng Zhang. 2013. “Genome Engineering Using the CRISPR-Cas9 System.” Nature Protocols 8 (11): 2281-2308.

Example 2

Sequencing of concatenated DNA molecules

We cloned a PCR product covering position 17:7578265 of the TP53 gene into pJET (Materials and Methods, section 9). Following bacterial transformation, a single colony was picked and plasmid DNA was isolated to confirm the presence of the TP53 insert (data not shown). Next, we performed a rolling-circle amplification (RCA) on the isolated plasmid using phi29 polymerase and random hexamer primers (Materials and Methods, section 9). We obtained a high-molecular weight RCA product with a size > 20kb as estimated by gel-electrophoresis (Figure 8A).

The product was used as input for a 1D library preparation for sequencing on the Oxford Nanopore Technologies (ONT) MinION instrument and the resulting library was sequenced for 48 h according to the manufacturer’s specifications. A total of 16,248 sequencing reads was generated for this sample, with an average read length of 5.7 kb (Figure 8B) and 2,083 reads longer than 10 kb. Nanopore/MinION sequence reads were mapped to the human reference genome (GRCh37, augmented with the pJET sequence) using LAST (Kielbasa et al. 2011). The subset of 2,083 reads >10 kb was examined for an alternating backbone (pJET) and insert (TP53 fragment) configuration (referred to as BI), which would be expected from the circular template that was used as input. We observed that all of the 2,083 reads have multiple BI copies. For all the reads longer than 10 kb we computed a “pattern score”, a number between 1 and 100 representing the regularity of the BI repetitions, calculated as BI/([B + I]/2)*100, where BI is the number of pJET-17.2 segments, B is the number of pJET segments and I is the number of 17.2 segments in a nanopore read. The majority of the long reads have a pattern score of 100, indicating the correctness of the RCA product, i.e. repeats of BI units (Figure 8C). We used these reads to extract the insert sequences (17.2 - TP53 fragment) and subsequently aligned the inserts from each read using Muscle (Edgar 2004a, 2004b) (Figure 8D). For each read, we applied a majority voting scheme to the aligned 17.2 segments to derive a consensus sequence. The consensus was compared to the reference sequence to determine the accuracy as a function of the number of BI copies in the read (Figure 8E). This experiment demonstrates proof of concept for obtaining accurate consensus reads based on nanopore sequencing of multiple copies of a DNA molecule (Li et al. 2016).
In a next step, the backbone design was improved to reduce the amount of sequencing throughput spent on sequencing of the backbone (pJET, 3 kb, in the above experiment).
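The pattern score and the majority-voting consensus described for the concatemer reads can be sketched as follows. This is an illustrative reading, not the original analysis code: the function names are hypothetical, the aligned segments are assumed to be equal-length strings with "-" for gaps, and ties in a column are resolved arbitrarily by first occurrence.

```python
from collections import Counter


def pattern_score(n_bi, n_b, n_i):
    """Regularity of backbone-insert repeats: BI / ((B + I) / 2) * 100."""
    return n_bi / ((n_b + n_i) / 2) * 100


def majority_consensus(aligned_segments):
    """Column-wise majority vote over aligned copies; columns won by a gap are dropped."""
    consensus = []
    for column in zip(*aligned_segments):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


def accuracy(consensus, reference):
    """Fraction of matching positions (simplification: no realignment to the reference)."""
    matches = sum(a == b for a, b in zip(consensus, reference))
    return matches / max(len(consensus), len(reference))


copies = ["ACGT", "ACGA", "AC-T"]  # three aligned copies of a toy insert
consensus = majority_consensus(copies)
print(consensus, accuracy(consensus, "ACGT"), pattern_score(5, 5, 5))
```

With more copies per read, isolated sequencing errors are outvoted in each column, which is why accuracy is reported as a function of the number of BI copies.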

Optimization of a linear dsDNA backbone

Backbone design principles

As a first step towards optimizing capture and circularization of short DNA molecules, we tested different designs of a backbone sequence mediating this process. We considered three parameters. First, the length of the backbone: longer backbones are thought to be easier to circularize, but they waste sequencing information because the majority of each read then consists of backbone sequence. Second, DNA molecules that are shorter than ~200 bp are thought to be difficult to circularize because of their relative stiffness (Shore, Langowski, and Baldwin 1981). Third, the flexibility of a given DNA molecule also depends on its base composition and sequence.

The first generation of backbones are BB1 and BB2 (Materials and Methods, sequences in section 5). They were designed following the general principles highlighted in Materials and Methods, section 2. The aim of designing these backbones was to serve as basic building blocks upon which we could improve. These backbones contain a combination of several elements that can help in capturing DNA molecules and in subsequent amplification and sequencing, such as restriction sites, barcode sequences and/or nicking enzyme sites. We detected circularization using BB1, but it was not very efficient and it was not taken along for further testing.

Shore, Langowski, and Baldwin (1981) proposed that circularization of short blunt-ended DNA molecules can be inefficient. We next generated a longer backbone that would allow more efficient circularization in a short period of time (~1 h). The resulting backbone, named BB3, is a 514 bp long dsDNA fragment generated by PCR amplification of part of plasmid pX330 (Materials and Methods, sections 5 and 6). BB3 did not contain a restriction enzyme site at the extremities, and it was used only to test different ligation conditions (see below).

We also generated a modified version of BB3 by adding SrfI sites, resulting in BBpX2.

We used the free energy values of each base-pair (Breslauer et al. 1986) and the deviation of the twist angle (degrees) (Sarai et al. 1989) to compute the flexibility of any given DNA sequence. We used a genetic algorithm to generate a population of sequences that are selected for high flexibility, short length and optimization of other parameters like the GC content, the presence of repeated motifs and sequence self-complementarity. A more detailed description of the genetic optimization algorithm as well as details of backbone structures is given in the Materials and Methods, section 8. To improve the circularization efficiency of the backbone while keeping its sequence short, we have designed several stretches of flexible DNA in-silico. We have used such stretches to improve the flexibility of the backbones. Backbone 2 (BB2) was modified by including a 100 bp long flexible sequence in the middle of its sequence. The resulting backbone (BB2_100) is 341 bp long and it contains SrfI half-restriction-sites at its extremities.

Measuring reaction products of BB2 and BB3 backbones in circularization reactions

We tested BB2 and BB3 in a circularization reaction together with a PCR product as insert, 17.2 (Materials & Methods, sections 2, 5 and 7). A first experiment was performed to establish the best reaction conditions to achieve optimal circularization products. The reactions were performed with and without the addition of Plasmid-Safe DNase, to obtain a clear view of linear and circular reaction products (Figure 9A). BB2 showed consistent results in circularization efficiency, but the circular reaction product was not very abundant. A circularization reaction with BB3 resulted in a visible circular product (Figure 9A), while a clear reaction product was not visible in the case of BB2. However, by running the entire reaction mixture on a gel we were able to observe a weak band following Plasmid-Safe digestion, indicating a correct circularized product consisting of BB2 and insert 17.2 (Figure 9B).

The effect of different backbone-insert ratios on circularization efficiency

Next, we evaluated the effect of different backbone-insert ratios on the specificity and efficiency of formation of the circular backbone-insert product that we aimed for. To this end, we used BB3 in a circularization experiment together with a 234 bp PCR product (17.2, Materials and Methods section 7) (Figure 10).

The best circularization efficiency was obtained with a 3:1 molar ratio between backbone and insert. This setup was kept for further characterization. Note that the strategy used here is profoundly different from the one used for standard plasmid-based cloning. In standard cloning, the plasmid is usually dephosphorylated to avoid self-circularization and an excess of phosphorylated insert is added to the reaction. For Cyclomics technology, the backbone is phosphorylated and in excess while the insert is dephosphorylated. It was observed that this avoided ligation-dependent concatemerization of the target and improved backbone-target ligation efficiency.

The effect of flexible stretches on backbone circularization

In a subsequent experiment, we tested the effect of the addition of a flexible DNA stretch on backbone circularization. To this end, we compared the circularization efficiency of BB2 with that of BB2_100 (see above) and BB3 (Figure 11), in a circularization reaction with insert 17.2. The flexible region in BB2_100 is rich in TA repeats but still complex enough to be unambiguously mapped to a reference sequence. During this test, we also evaluated the effect of HMGB1 and SrfI on the circularization reaction of BB2_100. HMGB1 is a known DNA bending protein, which could potentially improve circularization (Belgrano et al. 2013). We observed improved circularization efficiencies for BB2_100 compared to BB2, particularly at a backbone:insert ratio of 3:1. Thus, we conclude that backbone design can be optimized by the addition of flexible DNA stretches to promote circularization efficiency. A greater circularization efficiency, estimated to be around 26%, was achieved with an overnight circularization of BB2_100 and 17.2 (Figure 12), demonstrating that better reaction performance can be obtained by modulating both backbone design and reaction conditions.

The effect of SrfI on formation of backbone circularization products

One essential part of our backbones is the presence of a split restriction site at the extremities. If the backbone self-circularizes without insert, the full restriction site is reconstituted, making the backbone susceptible to the corresponding restriction enzyme. In the following example, SrfI (GCCC | GGGC) half-restriction-sites were added at the extremities of BB3, generating a new backbone that we called BBpX2, and SrfI nuclease was added to the reaction mixture together with T4 Ligase. SrfI has the advantage of recognizing an 8-base-long site, while most of the commercially available alternatives recognize 6-base-long sites. Other sequences we are evaluating are PmeI (GTTT | AAAC) and SwaI (ATTT | AAAT). If the ligation reaction is performed in the presence of SrfI, any self-circularized backbone will be cleaved by the restriction enzyme and thus return to its original linear form.

The effect of the restriction enzyme is clearly visible by comparing the first two lanes (Figure 13). When SrfI is present, the linear backbone (thick bold band) is maintained and the overall reaction leads to very few byproducts. In the absence of SrfI (first lane), the majority of the backbone is wasted in the formation of several byproducts. The effect of SrfI can be further appreciated from the Plasmid-Safe DNase treatment (last two lanes), which leads to degradation of linear DNA. If SrfI is added to the reaction (last lane), only the expected product is formed. In contrast, without the addition of SrfI (third lane), a number of undesired circular byproducts are produced.
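The split-site logic can be illustrated with a minimal Python sketch. The helper name `circle_has_site` and the toy sequences are assumptions made for illustration; only the GGGC and GCCC ends mirror the backbone design described here. Because the SrfI site is palindromic, scanning one strand is sufficient.

```python
SRFI_SITE = "GCCCGGGC"  # SrfI recognition sequence (cut between GCCC and GGGC)

def circle_has_site(circle: str, site: str = SRFI_SITE) -> bool:
    """Return True if `site` occurs anywhere on a circular molecule.

    The first len(site) - 1 bases are appended so that a site spanning
    the closing junction of the circle is also detected.
    """
    return site in circle + circle[: len(site) - 1]

# Toy sequences (hypothetical): only the GGGC start and GCCC end of the
# backbone reflect the half-sites described in the text.
backbone = "GGGCATGCACAGATGTAGATGCTGCATGACATAGCCC"
insert = "AGGCTGGGGCACAGCGCGCCTT"

self_ligated = backbone          # backbone closed on itself
with_insert = backbone + insert  # backbone closed around one insert

print(circle_has_site(self_ligated))  # junction GCCC + GGGC rebuilds the site
print(circle_has_site(with_insert))   # both junctions are interrupted by the insert
```

Only the self-ligated circle carries the full site, so only that circle would be re-linearized by the enzyme during ligation.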

Dephosphorylation of inserts

To avoid self-polymerization of the inserts, we perform enzymatic dephosphorylation using Antarctic Phosphatase, which ensures high reactivity at low temperatures and can be fully inactivated at 65°C in just five minutes.

Barcoding strategies

Molecular barcoding is a strategy to tag individual DNA molecules, in order to classify the sequencing reads resulting from the DNA molecules. Barcodes can be used to classify sequencing reads (bioinformatically) by sample, thus allowing the pooling of multiple samples on a single sequencing run (Wong, Jin, and Moqtaderi 2013). In that case only a limited number of unique barcodes are used, one for each sample.

Additionally, barcodes can be used to label each DNA molecule separately; such barcodes are often referred to as unique molecular identifiers (UMIs). In this case, a large number of unique barcodes/UMIs (random sequences) is used to make the chance as low as possible that any two unrelated molecules receive the same barcode. UMIs can be used to obtain absolute quantification of individual sequences (Kivioja et al. 2011).

Another application of UMIs is the detection and quantification of low-frequency mutations (Kou et al. 2016), for example in cancer samples. This involves labeling of individual DNA molecules, followed by PCR amplification and deep sequencing. Subsequently, sequence reads can be grouped by UMI sequence and possible mutations can be detected and discriminated from sequencing errors. An elegant application of UMIs for mutation detection in ctDNA is outlined by Newman et al (Newman et al. 2016).

We envision the design of backbone sequences with both sample-specific barcodes and UMIs (Figure 14). Such a strategy enables pooled sequencing of multiple independent samples as well as enhanced mutation detection power. Sample-specific barcodes will be 5-20 nucleotides in length and can be placed anywhere in the backbone sequence, provided that they do not influence backbone flexibility (and thus ligation efficiency). Random strings of 5-20 nucleotides, representing UMIs, will also be added to backbones for labeling of individual DNA molecules. The UMIs can be used to improve mutation detection by requiring that a mutation is observed in at least two distinct molecules, i.e. in molecules that each carry a different UMI.
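The UMI filter described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `supported_variants`, the read representation as `(umi, variant)` pairs, and the example UMIs and the second variant label are hypothetical and not part of the described method.

```python
from collections import defaultdict

def supported_variants(reads, min_umis=2):
    """Keep variants observed in at least `min_umis` distinct molecules.

    `reads` is an iterable of (umi, variant) pairs; `variant` is None for
    reads that match the reference. Reads sharing a UMI are treated as
    copies of the same original molecule and are counted once.
    """
    umis_per_variant = defaultdict(set)
    for umi, variant in reads:
        if variant is not None:
            umis_per_variant[variant].add(umi)
    return {v: len(u) for v, u in umis_per_variant.items() if len(u) >= min_umis}

# Hypothetical example reads.
reads = [
    ("AACGT", "chr17:7578265A>T"),
    ("AACGT", "chr17:7578265A>T"),  # same molecule, counted once
    ("GGTCA", "chr17:7578265A>T"),  # second, independent molecule
    ("TTGAC", "chr17:7578300C>G"),  # seen in a single molecule only
    ("CCATG", None),
]
print(supported_variants(reads))
```

A variant supported by reads from a single UMI is discarded, since it may stem from an error introduced in one molecule.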

Rolling circle amplification from circularized DNA molecules

A circular DNA product obtained by the circularization reaction of backbone and insert can serve as a template for the generation of concatemers via rolling circle amplification (RCA). We have tested RCA using DNA (inserts) from very different sources including cfDNA, PCR amplicons, plasmids and cDNA (Figure 15), using random hexamer primers.

Site-directed RCA

In addition to the canonical RCA reaction, which uses random hexamers to initiate the amplification, we devised the use of specific primers to direct the amplification toward the region of interest. This method is called site-directed RCA. Such an approach could be of use when only specific genes should be sequenced rather than the whole genome. The current way to accomplish this is via PCR enrichment of the gene of interest (Dowthwaite and Pickford 2015). However, PCR amplification is known to introduce errors in the amplicons (Shuldiner, Nirula, and Roth 1989; Diaz-Cano 2001), and even a single amplification error occurring early during the PCR reaction can bias the final results (Diaz-Cano 2001; Quach, Goodman, and Shibata 2004; Arbeithuber, Makova, and Tiemann-Boege 2016).

To test whether we can obtain site-specific enrichment of a target region without the use of PCR, we have coupled the Cyclomics assay with site-directed RCA.

Briefly, two distinct regions of the TP53 gene, 17.1 and 17.2 (Materials and Methods section 7), were cloned into the pJET vector. The modified pJET vectors were used in a 1:1 molar ratio as a template for an RCA reaction in which a specific primer (17.2-R, Materials and Methods section 9) targeting 17.2, but not 17.1, was used instead of the random hexamers. The reaction product was sequenced using a nanopore MinION instrument and the numbers of reads containing 17.1 and 17.2 were compared (Figure 16). We observed that sequencing reads containing insert 17.2 occurred at a 14x excess compared to reads containing 17.1, demonstrating that target-selective RCA can be achieved using specific primers.

One-pot reaction design

To facilitate the use of Cyclomics technology, we have focused on the development of a streamlined experimental procedure. Thus, we have limited time-consuming and laborious steps such as DNA purification, concentration and gel electrophoresis as much as possible. To this end, we have designed a protocol that consists of three simple consecutive steps that can be performed in a single tube, limiting the need for purification or buffer exchanges.

The steps are: 1) circularization, 2) removal of linear DNA and 3) Rolling Circle Amplification (Figure 17).

The first reaction of the Cyclomics protocol involves the insert DNA (I) and the backbone (BB), which are mixed together in the presence of T4 DNA Ligase and the restriction enzyme SrfI. The mixture is left at room temperature for 1 to 4 hours, followed by heat inactivation of the enzymes at 70 °C for 30 minutes. The second step of the Cyclomics protocol is performed by adding the PlasmidSafe enzyme to the reaction mixture, together with its buffer and 1 mM ATP. The mixture is incubated at 37 °C for 30 minutes and inactivated again. Before proceeding with the rolling circle amplification (reaction 3), RCA primers are added to the mixture and a quick annealing step is performed by heating the reaction to 98 °C for 5 minutes. After the mixture has cooled to room temperature, Phi29, Pyrophosphatase, and the other components of the RCA reaction are added. The reaction is then incubated at 30 °C for at least 3 hours.

Consensus calling

In order to detect mutations from long reads with concatemers, a consensus of the target sequence is produced (Figure 18). To this end, the long reads are split into backbone sequences and target sequences based on a LAST split-read mapping to the reference genome (Kielbasa et al. 2011). Target sequences are passed to the GATK UnifiedGenotyper for variant calling (DePristo et al. 2011). Post-hoc filtering is applied based on variant confidence scores to optimize sensitivity and specificity.
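The idea of deriving a consensus from repeated copies of one target can be illustrated with a simplified per-position majority vote. This sketch is not the LAST and GATK pipeline described above; it assumes the target copies have already been extracted from a read and aligned to equal length, and the function name `majority_consensus` is hypothetical.

```python
from collections import Counter

def majority_consensus(copies):
    """Return the per-position majority base over aligned target copies.

    `copies` is a list of equal-length strings, one per repeat of the
    target found in a single concatemer read.
    """
    if not copies:
        raise ValueError("at least one copy is required")
    length = len(copies[0])
    if any(len(c) != length for c in copies):
        raise ValueError("copies must be aligned to the same length")
    consensus = []
    for column in zip(*copies):
        # most_common(1) returns the base with the highest count in this column
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

# Three copies of the same target; the second carries one sequencing error.
print(majority_consensus(["ACGT", "ACGA", "ACGT"]))
```

An error present in a single copy is outvoted, whereas a true variant is present in every copy of the molecule and survives.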

Examples of application of Cyclomics technology

Targeted sequencing of a TP53 mutation in genomic DNA from ovarian cancer

We have tested the Cyclomics method on three tumor biopsies with a known mutation in TP53 (chr17:7578265, A->T, hg19) at variable frequency (1%, 9%, 14%), as previously assessed using short-read targeted Ion Torrent sequencing (Hoogstraat et al. 2014). In short, we performed PCR on the targeted locus and ligated the resulting products to a specifically designed and optimized backbone that promotes efficient capture of the short DNA products. The ligation products were then amplified and concatenated to form long DNA molecules with repeated copies of target/insert and backbone. Long DNA molecules were sequenced for a few hours using a nanopore MinION instrument (1D ligation-based library prep). We obtained a total of 206,048 sequence reads for all three samples, which were processed by mapping with LAST (Kielbasa et al. 2011) and a custom algorithm for consensus calling (Figure 18). Next, we estimated the mutation frequency from the consensus reads and observed a frequency for the TP53 mutation of 0.5%, 7.6% and 14%, providing proof-of-concept for detection of low-frequency somatic mutations in cancer DNA using Cyclomics technology (Figure 19).

References

Arbeithuber, Barbara, Kateryna D. Makova, and Irene Tiemann-Boege. 2016. “Artifactual Mutations Resulting from DNA Lesions Limit Detection Levels in Ultrasensitive Sequencing Applications.” DNA Research: An International Journal for Rapid Publication of Reports on Genes and Genomes 23 (6): 547-59.

Belgrano, Fabricio S., Isabel C. de Abreu da Silva, Francisco M. Bastos de Oliveira, Marcelo R. Fantappie, and Ronaldo Mohana-Borges. 2013. “Role of the Acidic Tail of High Mobility Group Protein B1 (HMGB1) in Protein Stability and DNA Bending.” PloS One 8 (11): e79572.

Breslauer, K. J., R. Frank, H. Blocker, and L. A. Marky. 1986. “Predicting DNA Duplex Stability from the Base Sequence.” Proceedings of the National Academy of Sciences 83 (11): 3746-50.

DePristo, Mark A., Eric Banks, Ryan Poplin, Kiran V. Garimella, Jared R. Maguire, Christopher Hartl, Anthony A. Philippakis, et al. 2011. “A Framework for Variation Discovery and Genotyping Using Next-Generation DNA Sequencing Data.” Nature Genetics 43 (5): 491-98.

Diaz-Cano, Salvador J. 2001. “Are PCR Artifacts in Microdissected Samples Preventable?” Human Pathology 32 (12): 1415.

Dowthwaite, Gary, and Jo Pickford. 2015. “PCR-Based DNA Enrichment Enhances Detection of Mutations in Oncology.” MLO: Medical Laboratory Observer 47 (11): 18, 20.

Edgar, Robert C. 2004a. “MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput.” Nucleic Acids Research 32 (5): 1792-97.

Edgar, Robert C. 2004b. “MUSCLE: A Multiple Sequence Alignment Method with Reduced Time and Space Complexity.” BMC Bioinformatics 5 (August): 113.

Hoogstraat, Marlous, Mirjam S. de Pagter, Geert A. Cirkel, Markus J. van Roosmalen, Timothy T. Harkins, Karen Duran, Jennifer Kreeftmeijer, et al. 2014. “Genomic and Transcriptomic Plasticity in Treatment-Naive Ovarian Cancer.” Genome Research 24 (2): 200-211.

Kielbasa, Szymon M., Raymond Wan, Kengo Sato, Paul Horton, and Martin C. Frith. 2011. “Adaptive Seeds Tame Genomic Sequence Comparison.” Genome Research 21 (3): 487-93.

Kivioja, Teemu, Anna Vaharautio, Kasper Karlsson, Martin Bonke, Martin Enge, Sten Linnarsson, and Jussi Taipale. 2011. “Counting Absolute Numbers of Molecules Using Unique Molecular Identifiers.” Nature Methods 9 (1): 72-74.

Kou, Ruqin, Ham Lam, Hairong Duan, Li Ye, Narisra Jongkam, Weizhi Chen, Shifang Zhang, and Shihong Li. 2016. “Benefits and Challenges with Applying Unique Molecular Identifiers in Next Generation Sequencing to Detect Low Frequency Mutations.” PloS One 11 (1): e0146638.

Li, Chenhao, Kern Rei Chng, Esther Jia Hui Boey, Amanda Hui Qi Ng, Andreas Wilm, and Niranjan Nagarajan. 2016. “INC-Seq: Accurate Single Molecule Reads Using Nanopore Sequencing.” GigaScience 5 (1): 34.

Newman, Aaron M., Alexander F. Lovejoy, Daniel M. Klass, David M. Kurtz, Jacob J. Chabon, Florian Scherer, Henning Stehr, et al. 2016. “Integrated Digital Error Suppression for Improved Detection of Circulating Tumor DNA.” Nature Biotechnology 34 (5): 547-55.

Quach, Nancy, Myron F. Goodman, and Darryl Shibata. 2004. “In Vitro Mutation Artifacts after Formalin Fixation and Error Prone Translesion Synthesis during PCR.” BMC Clinical Pathology 4 (1). doi: 10.1186/1472-6890-4-1.

Sarai, A., J. Mazur, R. Nussinov, and R. L. Jernigan. 1989. “Sequence Dependence of DNA Conformational Flexibility.” Biochemistry 28 (19): 7842-49.

Shore, D., J. Langowski, and R. L. Baldwin. 1981. “DNA Flexibility Studied by Covalent Closure of Short Fragments into Circles.” Proceedings of the National Academy of Sciences of the United States of America 78 (8): 4833-37.

Shuldiner, Alan R., Ajay Nirula, and Jesse Roth. 1989. “Hybrid DNA Artifact from PCR of Closely Related Target Sequences.” Nucleic Acids Research 17 (11): 4409-4409.

Wong, Koon Ho, Yi Jin, and Zarmik Moqtaderi. 2013. “Multiplex Illumina Sequencing Using DNA Barcoding.” Current Protocols in Molecular Biology / Edited by Frederick M. Ausubel ... [et Al.] Chapter 7: Unit 7.11.

Example 3

Materials and methods

Circularization and RCA amplification of short PCR oligos.

Materials

Backbone (BB)           BB2.4 with barcode   10-50 ng/µl   243-244 bp
Insert (I)              blunt PCR amplicon   10-50 ng/µl   100-250 bp
CutSmart Buffer         10X        (supplied with NEB#R0629)
ATP                     10 mM      (NEB#P0756)
dNTPs                   10 mM      (ThermoFisher#R0192)
T4 Ligase               400 U/µl   (NEB#M0202S)
SrfI (Restr. Enz.)      20 U/µl    (NEB#R0629)
Plasmid-Safe Buff.      10X        (Lucigen#E3101K)
Plasmid-Safe Enz.       10 U/µl    (Lucigen#E3101K)
Annealing Buffer        5X         (50 mM Tris @ pH 7.5-8.0, 250 mM NaCl, 5 mM EDTA)
Phi29 Buffer            10X        (supplied with ThermoFisher#EP0091)
BSA                     10 mg/ml   (NEB#B9001)
Pyrophosphatase         0.1 U/µl   (ThermoFisher#EF0221)
Phi29 DNA Polym.        10 U/µl    (ThermoFisher#EP0091)
Exo-Res. RND Primers    500 µM     (ThermoFisher#SO181)

Wizard SV Gel and PCR Clean-Up System (Promega#A9282)

The backbone must be phosphorylated, either by producing it via PCR with phosphorylated primers or by treating a non-phosphorylated PCR product or synthetic DNA duplex with T4 Polynucleotide Kinase (PNK). The insert must be dephosphorylated, either by PCR amplification with non-phosphorylated primers or by treatment with Antarctic Phosphatase.

Both insert and backbone must be blunt. The preferred way to achieve this is PCR with Phusion Polymerase, which leaves blunt-ended amplicons.

Both insert and backbone must be buffer-free by column or bead purification.

In case the PCR reaction used to produce BB or I yielded more than one product, then gel-purification of the expected product is necessary.

If the template used for the amplification of I or BB is circular (a plasmid for example), then gel-purification of the PCR product is necessary.

Methods

Circularization:

Reaction Mix (1X): (BB:I molar ratio should be 3:1)

CutSmart Buffer (10X)   5 µl
ATP (10 mM)             10 µl (2 mM final concentration)
H2O                     to 46 µl
T4 Ligase               2 µl
SrfI (Restr. Enz.)      2 µl
TOTAL                   50 µl

Prepare the above Reaction Mix on ice and in PCR tubes.

Vortex and spin

Put in a thermocycler and run the following program: (16°C x 10’ » 37°C x 10’) x 8 » 70°C x 20’

Add 1 µl of SrfI and run the following program (this step digests any residual BB-BB): 37°C x 15’ » 70°C x 20’

The suggested maximum amount of DNA that should be used in this reaction (considering both I and BB) is 400 ng in a 50 µl reaction. The BB:I ratio should not change.

Example ratio calculation: ( len(X) = length of X, in base pairs ), wherein: len(I) = 130 bp; len(BB) = 245 bp; len(BB)/len(I) = 245/130 = 1.88; starting with 50 ng of I, then 50*1.88*3 = 282 ng of BB are needed to reach the 3:1 ratio.
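The ratio calculation above can be written as a small Python helper. The function names and the 400 ng check are illustrative assumptions. The unrounded result is about 282.7 ng; the 282 ng in the example comes from rounding the length ratio to 1.88 first.

```python
def backbone_mass_ng(insert_ng, insert_bp, backbone_bp, molar_ratio=3.0):
    """Mass of backbone (ng) needed for a given BB:I molar ratio.

    For double-stranded DNA, mass per mole scales with length, so equal
    moles of BB and I differ in mass by len(BB) / len(I).
    """
    return insert_ng * (backbone_bp / insert_bp) * molar_ratio

def within_dna_limit(insert_ng, backbone_ng, max_total_ng=400.0):
    """Check the suggested maximum total DNA for a 50 µl reaction."""
    return insert_ng + backbone_ng <= max_total_ng

bb_ng = backbone_mass_ng(insert_ng=50, insert_bp=130, backbone_bp=245)
print(round(bb_ng, 1))              # backbone mass for a 3:1 ratio
print(within_dna_limit(50, bb_ng))  # total stays below 400 ng
```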

Linear DNA removal

Take 4 µl of the circularization reaction out as a negative control for the gel that will be run later.

To the rest of the circularization reaction (46 µl) add:

- ATP 10 mM 6 µl
- Plasmid-Safe Buffer 10X 6 µl
- Plasmid-Safe Enzyme 2 µl

Incubate at 37°C for 30’

Inactivate at 70°C for 30’

Run the whole reaction (S), together with the negative control (C-) in a 1.7% agarose gel.

Gel-purify the band corresponding to the Circular BB-I (Figure 25).

Elute twice with 30 µl of H2O

Rolling circle amplification

To the purified Circular BB-I (around 50 µl at this point) add:

- Annealing Buffer (5X) 12 µl
- Exo-Res. RND Primers (500 µM) 1 µl

Heat the solution at 98°C for 5’, then let it cool down slowly to R.T.

Add:

- Phi29 Buffer (10X) 10 µl
- BSA 2 µl
- dNTPs 10 µl
- Pyrophosphatase 4 µl
- Phi29 Polymerase 2 µl
- H2O to 100 µl

Incubate the reaction at 30°C for at least 3 h.

Inactivate at 70°C for 10’

Run 5 µl in a 0.5% agarose gel.

Running the RCA reaction overnight will yield more product. However, it is not yet clear if the quality of the concatemers will be affected.

Quality check

The following procedure allows for a rough estimation of the amount of BB-I vs BB-only monomers present in the RCA product. Leveraging the presence of a restriction site (BglII in the following example) in the backbone, the RCA product can be digested and the resulting band pattern can be used to estimate the composition of the RCA product.

As shown in Figure 27, BB200_4 (243bp) and S1_WT (158bp) were circularized and amplified by RCA. When digesting concatemers made of BB-I we expect a band around 400bp, while if the concatemer consists of BB only, the resulting band should be around 250bp. Concatemers formed by I only would not be digested, leaving the RCA band visible.
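The expected band pattern can be reproduced with a short in-silico digest. This is a sketch with toy sequences (assumptions, not the real BB200_4 or S1_WT sequences); the function name `digest_lengths` is hypothetical. BglII recognizes AGATCT and cuts after the first A.

```python
def digest_lengths(concatemer, site="AGATCT", cut_offset=1):
    """Fragment lengths after cutting a linear concatemer at every `site`.

    `cut_offset` is the cut position within the site (1 for BglII, A^GATCT).
    """
    cuts = []
    pos = concatemer.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = concatemer.find(site, pos + 1)
    bounds = [0] + cuts + [len(concatemer)]
    return [end - start for start, end in zip(bounds, bounds[1:])]

# Toy repeat units (hypothetical): an 18 bp backbone with one BglII site
# and a 10 bp insert without a site.
toy_backbone = "GGGCTTAGATCTTTGCCC"
toy_insert = "ACGTACGTAC"

print(digest_lengths((toy_backbone + toy_insert) * 4))  # internal fragments = len(BB) + len(I)
print(digest_lengths(toy_backbone * 4))                 # internal fragments = len(BB)
print(digest_lengths(toy_insert * 4))                   # no site: one undigested fragment
```

With one site per backbone copy, every internal fragment equals the repeat unit length, which for the lengths quoted above gives 243 + 158 = 401 bp for BB-I and 243 bp for BB-only concatemers.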

Library prep

DNA purification:

- Add an equal volume of Dynabeads, gently mix and incubate for 5 min at room temperature.
- Insert the tube in the magnetic rack and wait 5 min to allow the beads to cluster on the wall.
- Remove the buffer.
- Gently wash with 700 µl of 70% ethanol.
- Remove the ethanol and repeat the washing step once more.
- Let residual ethanol evaporate.
- Remove the tube from the magnetic rack.
- Elute the DNA from the beads with 100 µl of ultrapure water.

Resolve branched DNA:

- Add 4 µl of T7 Endonuclease (NEB#M0302S)
- Incubate at 37°C for 1 h

Library prep:

- Proceed with Nanopore library prep, either 1D ligation prep or rapid prep.

List of Backbone and Insert sequences used

Backbone properties:

- len = backbone length in base pairs
- mean_flex = mean value of the DNA flexibility computed over all consecutive segments of 50 base pairs contained in the sequence
- max_flex = maximum DNA flexibility computed for a segment of 50 base pairs in the sequence
- entropy = Shannon entropy of the DNA sequence
- GC% = percentage of GC bases in the backbone
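The sequence-derived properties can be computed along the following lines. This is a sketch under assumptions: entropy is taken as the Shannon entropy of the single-base composition in bits (maximum 2.0), which is consistent with the listed values but not confirmed by the text, and the flexibility score itself is not reproduced. `window_stats` only shows the windowing over 50 bp segments and accepts any user-supplied scoring function.

```python
import math
from collections import Counter

def gc_percent(seq):
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def shannon_entropy(seq):
    """Shannon entropy (bits) of the base composition; assumed definition."""
    counts = Counter(seq.upper())
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def window_stats(seq, score, window=50):
    """Mean and max of score(segment) over all consecutive windows."""
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    values = [score(seq[i:i + window]) for i in range(len(seq) - window + 1)]
    return sum(values) / len(values), max(values)

print(gc_percent("ACGT"), shannon_entropy("ACGT"))
# GC content is used here only as a stand-in score for the flexibility measure.
print(window_stats("AACCGG", gc_percent, window=2))
```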

>BB100_1 (len: 143 mean_flex: 12.89 max_flex:14.71 entropy:2.0 GC%:48.25)
GGGCATGCACAGATGTACACGATTCCCAACACACCGTGCGGGCCATCGACCTA
TGCATACCGTACATATCATATATAAATCACATAATTTATTATACGTATGTCGCG
CGGGTGGCTGTGGGTAGATGCTGCATGACATAGCCC

>BB100_2 (len: 143 mean_flex: 13.29 max_flex: 14.95 entropy: 1.96 GC%:37.76)
GGGCATGCACAGATGTACACGCACTACATGCCAATGCCCAAGCAGTGCGCATA
TCACGTATCATATCTAATATATTATAATATTATGATAATGAGTATTTATTTAATT
TGTTTGTGTGAGGTAGATGCTGCATGACATAGCCC

>BB100_3 (len: 143 mean_flex: 12.78 max_flex:14.1 entropy: 1.95 GC%:44.06)
GGGCATGCACAGATGTACACGCATTGGCCGTCTGTGCTGTCCATGGATCGTCT
GATTGATATGATATCATATATTATAATTATACAGTAAGGTGATTGGGTATTGAG
GGTTGTGTGGTTGGTAGATGCTGCATGACATAGCCC

>BB100_4 (len: 145 mean_flex: 12.89 max_flex: 14.06 entropy: 1.95 GC%:44.14)
GGGCATGCACAGATGTACACGGTAGACATGCGAAGCGTGCGATGACAATCGA

TGTGGACATCATGCATATATATGTTGTATAATTAAACAAATATGTGTAGTGTGT

GAGGTGGGTGTAGGAAGTAGATGCTGCATGACATAGCCC

>BB100_5 (len: 143 mean_flex: 13.27 max_flex: 14.34 entropy: 1.9 GC%:37.76)
GGGCATGCACAGATGTACACGTTGTCATGGGAATTTGTGGTTATGAAATGAGT
ATGCGACGAATATGTATACATATATATTAAATTATAGAGTGATGTATGAGTTTG
TGATGTGTGGTGTATAGATGCTGCATGACATAGCCC

>BB200_1 (len:243 mean_flex: 13.0 max_flex: 14.9 entropy: 1.99 GC%:44.86)

GGGCATGCACAGATGTACACGGCGGCGCAAGATGATGTGCCGAACCTGACAT

GGCATCGACTGGTATGGATCAATACTGATGCGATATCGATACCGGATAAATCA

TATATGCATAATATCACATTATATTAATTATAATACATCGGCGTACATATACAC

GTACGCATCATTTCACTATCTATCGGTACTATACGTAGTGCCGGTCTGTTGGC

CGGGCGACATAGATGCTGCATGACATAGCCC

>BB200_2 (len:244 mean_flex: 13.15 max_flex: 14.69 entropy: 1.96 GC%:38.52)
GGGCATGCACAGATGTACACGTGACGCAACGATGATGTTAGCTATTTGTTCAA
TGACAAATCTGGTATGATCAATACCGATGCGATATTGATATCTGATAACTCATA
TATGTAGAATATCACATTATATTTATTATAATACATCGTCGAACATATACACAA
TGCATCTTATCTATACGTATCGGGATAGCGTTGGCATAGCACTGGATGGCATG
ACCCTGATTAGATGCTGCATGAGATAGCCC

>BB200_3 (len:244 mean_flex: 13.06 max_flex: 14.9 entropy: 1.96 GC%:39.75)

GGGCATGCACAGATGTACACGAGACCGCAAGATGATGTTCATTCTTGAACATG

AGATCGGATGGGTATGGATCAATACCGATGCGATATGATAACTGATAAATCAT

ATATCTATAATATCACATTATATTAATTATAATACAGGATCGTTACATGCATAC

ACAATGTATACTATACGTATTCGGTAGTTAGTGTACGGTCGGAATGGAGGTGG

TGGCGGTGATAGATGCTGCATGACATAGCCC

>BB200_4 (len:243 mean_flex: 13.29 max_flex: 14.44 entropy: 1.93 GC%:34.57)

GGGCATGCACAGATGTACACGAATCCCGAAGATGTTGTCCATTCATTGAATAT

GAGATCTCATGGTATGATCAATATCGGATGCGATATTGATACTGATAAATCAT

ATATGCATAATCTCACATTATATTTATTATAATAAATCATCGTAGATATACACA

ATGTGAATTGTATACAATGGATAGTATAACTATCCAATTTCTTTGAGCATTGGC

CTTGGTGTAGATGCTGCATGACATAGCCC

>BB200_5 (len:243 mean_flex: 13.37 max_flex: 14.52 entropy: 1.94 GC%:35.8)
GGGCATGCACAGATGTACACGAATCCGTGAGATGACTATCTTATTTGTGACAT
TCATCGATCTGGATATGATCAATACCATGCGATATTGATTACTGATAAATCATA
TATGTAGAATATCACATTATATTAATTATAATAAATCGTCGTACATATACATCC
ACAATTAGCTATGTATACTATCTATAGAGATGGTGCATCATCGTACTCCACCAT
TCCCACTAGATGCTGCATGACATAGCCC

>BB300_1 (len:348 mean_flex: 13.12 max_flex: 14.77 entropy: 1.98 GC%:41.67)

GGGCATGCACAGATGTACACGCATAAGACCACAGGGTGCAAATCTGGATTGC

GGCATGGATGATTCATCATCGTGGCATATTCGCTATGGATATATCCATCATAAT

ACATTGATACGTCATGCGTATAATCGCATTATATGTCGATATTGGTCATAGGG

ATACATCCGTGTATACTATCGTATATGCGTGCAATGTAGCCATGTTAATCATGC

TATAACCATAACATAAATATAATATATACAGATGGTGTATCTCTACTTATGTAT

GCTTGTATAGTAATGTCGATACTGATGGGTCTCCGGCCCACTACACCACCTGG

CCGCTCTAGATGCTGCATGACATAGCCC

>BB300_2 (len:343 mean_flex: 13.26 max_flex: 14.34 entropy: 1.98 GC%:40.82)

GGGCATGCACAGATGTACACGGGCAATCCGCCAGGGTTCAAATATGGATATGT

GATGATCGATTCAACATGCACATATGCACGATATCATATATTACTCCAGATGTC

ATCATCGTCGTGCGTATATGAGATATGTATTTATGCATATAATCCACCATACAT

GGTAGCGATATTATAGTGCGATTATGTGTATATGACTATCATGGCTATTGTTAA

TATATAAATCATAACCATACCACTTCCACGCCTGGTATGGCGTATAGTATAGA

GATATTGTGTGATGCCCTATGTCGACCATGATGTGCCGTTGTACTGCCAATCC

TAGATGCTGCATGACATAGCCC

>BB300_3 (len:344 mean_flex: 13.47 max_flex: 14.8 entropy: 1.95 GC%:36.34)
GGGCATGCACAGATGTACACGTATCCATGCAGCTTATTGTAACTAGCGCATGC
ACGTGGTGATTCATCACATCTATATATACGATATGATATATTACACATATTTGC
ATAGTATCATCCGGTGTGATATCATCCGATATGCTCATACTTATTCATTGGTAG
CATTGCATTGATGGATCAATAGTTATTATGACATCATGGCATGTACAATTATAA
ATAATACAACATACATAAATATACTATACACATCGTGTATGTGTTATACAGATC
TGTGTGATGTATGATAATGTAATGGCGTCGAACACCACAAGGCAGTCCTATAA
TAGATGCTGCATGACATAGCCC

>BB300_4 (len:344 mean_flex: 13.37 max_flex: 14.57 entropy: 1.94 GC%:37.5)
GGGCATGCACAGATGTACACGGTCCATTACAATCGAATCTATATCCCAATGTG
TATCGATTATCACCACAATGACATAATACGATATCATATATTACTCCATATGCC
TTACGTCAGATCGTTATATGAGATATGTATTCATGCATATGATATCCCACAGTA

CACGTCGTCTAATGCCATCATGAATGTATGACATATCTAGTCGATTATACATAA

TATAACATACCAATATAACAATATCTATACACATTTGATGGCGTATAGTATAAA

GATATTGTGGCAATGCCCATACACCACTGACTGTCGCCGATCATTCCTACCAC

TAGATGCTGCATGACATAGCCC

>BB300_5 (len:344 mean_flex: 13.51 max_flex: 14.89 entropy: 1.91 GC%:33.43)

GGGCATGCACAGATGTACACGACCGACCGTGAAAGTGATTCAGAATGATGTGC

ATGAATGTTATCATGACATGATTTATGATGCACTGATATATGCATATTATAATA

TTGTACAATGTCGTATATACGACATATCTATACTATGAATTATGGCATCATGGA

CAATAGATGGTAAGGTATAGTACGATCTATATAGCATGTTGAAATGGGATATA

AATTATCATAAACATACATACTTAACTAATATCAAGATGATATGTGTATGACAT

CAGAATGATAGTAGTAATGAGTATTGTCAGATGTATGTACGAATATCACACGA

TTAGATGCTGCATGACATAGCCC

>Insert S1_WT (TP53, chr17:7577450-7577649)

AGGCTGGGGCACAGCAGGCCAGTGTGCAGGGTGGCAAGTGGCTCCTGACCTG

GAGTCTTCCAGTGTGATGATGGTGAGGATGGGCCTCCGGTTCATGCCGCCCAT

GCAGGAACTGTTACACATGTAGTTGTAGTGGATGGTGGTACAGTCAGAGCCAA

CCTAGGAGATAACACAGGCCCAAGATGAGGCCAGTGCGCCTT

>Insert 17.2 (TP53, chr17:7578161-7578394)

CAGTTGCAAACCAGACCTCAGGCGGCTCATAGGGCACCACCACACTATGTCGA

AAAGTGTTTCTGTCATCCAAATACTCCACACGCAAATTTCCTTCCACTCGGATA

AGATGCTGAGGAGGGGCCAGACCTAAGAGCAATCAGTGAGGAATCAGAGGCC

TGGGGACCCTGGGCAACCAGCCCTGTCGTCTCTCCAGCCCCAGCTGCTCACCA

TCGCTATCTGAGCAGCGCTCAT

Bioinformatics related to Figure 24.

An expected reference signal for every possible insert (one for every possible basepair at the target position) was generated using Tombo’s DNA model (Fasta -> raw), both forward and reverse (https://github.com/nanoporetech/tombo). A forward and reverse expected signal were created for the backbone as well.

Using Dynamic Time Warping (DTW), the expected backbone signals were mapped to a read. If expected backbone signals overlap in their alignment with the read, the best result is kept and less optimal results are removed. The read is then cut into segments based on the direction of the fitted backbone.

Subsequently, all possible expected insert signals are mapped to the read using DTW. Again, overlapping results are removed and only the best results are kept. Per read, the most optimal fits (lowest DTW error) are kept. The number of times a particular insert (representing a specific base at the target position) gives the best fit determines the most likely base for this read at the target position.
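The matching step can be sketched with a textbook DTW implementation. This is an illustration only: it aligns two complete signals (global DTW), whereas the procedure above maps expected signals onto regions of a longer read, and it does not use Tombo. The function names and the example signal values are hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def best_candidate(observed, expected_by_base):
    """Return the key whose expected signal has the lowest DTW error."""
    return min(expected_by_base, key=lambda base: dtw_distance(observed, expected_by_base[base]))

# Hypothetical expected signals for two candidate bases at the target position.
expected = {"A": [1.0, 2.0, 3.0], "C": [3.0, 2.0, 1.0]}
print(best_candidate([1.0, 2.1, 3.0], expected))
```

Counting how often each candidate wins across the insert copies of one read then gives the most likely base for that read.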

Results

Circularization efficiencies of different backbones

To be able to experimentally assess the efficiency of different backbones to circularize short DNA amplicons, a PCR amplicon of 234 bp (Insert 17.2) was ligated with backbones derived from 3 different backbone series: BB100_1/2/3/4/5, BB200_2/4/5 and BB300.

The backbone sequences and physical properties are reported below. The detailed protocol is disclosed in Materials and Methods.

The product of the circularization reaction is shown in Figure 21 (left-side).

Following circularization, the reaction was supplemented with an enzyme blend (Plasmid Safe Lucigen#E3101K) in order to digest the linear DNA. The residual product (circular DNA) is visible in Figure 21 (right-side). The BB200 series showed the best efficiency so far. To further characterize the efficiency of BB200_2/4/5, the 3 backbones were ligated with the same amplicon in the absence of the restriction enzyme SrfI. The rationale behind this experiment is that the ligation efficiency of a backbone can be estimated from the amount of multimers formed in the reaction. As can be observed in Figure 22, BB200_4 shows a remarkably higher ligation efficiency compared to BB200_2 and BB200_5.

The greater efficiency in ligation of BB200_4 is reflected in a greater efficiency in circularization and RCA product formation. In Figure 23, sequencing read counts are plotted, coming from 2 independent experiments (blue and red) in which an equimolar mixture of BB200_2, BB200_4 and BB200_5 was used to produce concatemers. The sequencing results agree with the previous experiment showing that the great majority of the reads sequenced contains BB200_4.

New (optimized) barcode sequences that are better in terms of ligation efficiency

BB200_4 is the most efficient backbone tested so far in a circularization reaction.

Strand-specific mutation calling coupled to the possibility for strand-specific rolling circle amplification

The Cyclomics method produces a double-stranded DNA circle. One advantage of having a double-stranded circle is that one of the strands can be used preferentially as a template for the RCA, for example by using a strand-specific primer to initiate the reaction, following known procedures (https://www.sciencedirect.com/science/article/pii/S0042682212002814). In this way, the Cyclomics method enables selective amplification of the sense or the antisense strand of a given DNA sequence. Such strand-specific amplification is not possible using the smartbell method, but it has major benefits for obtaining accurate variant calls efficiently from nanopore sequencing data.

In Figure 24 we show an example case in which the rate of detection of the correct base differs between the data coming from the two strands of a DNA molecule. The data are derived from an experiment in which a 200 bp amplicon (Insert S1_WT) was circularized with BB200_4 and amplified as specified in the reported protocol.

Analysis of the sequencing results allowed us to determine the base-calling accuracy for each of the strands. In particular, we noticed that C and A bases are often difficult to distinguish due to the similar intensity of their raw signal. However, the signal coming from a T is quite different from that of all the other bases and is easy to classify correctly. For example, if an A is expected to be mutated in the forward strand, sequencing of the reverse strand would lead to much cleaner results, since the A in the forward strand could be miscalled as a G. Thus, specific enrichment of the reverse strand would be advantageous in such a scenario.

The data highlighted in Figure 24 show one example of differences in discriminating bases on either the forward or the reverse strand. Note how the correct base can be inferred on the reverse strand data by using a simple cut-off over the Y-axis (Y < 0.3). The same approach would not work with the forward strand. Thus, in this case, the amplification and sequencing of both strands would lead to a waste of data and, more problematically, to a misleading mutation detection on that particular position with a high false positive rate. In contrast, a strand-specific enrichment would lead to higher sensitivity (the majority of the reads would come from the best strand) and no false positive calls.