Title:
OPTIMISED SET OF OLIGONUCLEOTIDES FOR BULK RNA BARCODING AND SEQUENCING
Document Type and Number:
WIPO Patent Application WO/2023/025784
Kind Code:
A1
Abstract:
The present invention relates generally to the field of nucleic acid sequencing and provides oligonucleotide molecules and barcodes contained therein. These oligonucleotide molecules and barcode molecules are useful in sequencing to identify and resolve errors.

Inventors:
DAINESE RICCARDO (CH)
ALPERN DANIEL (FR)
GARDEUX VINCENT (FR)
DEPLANCKE BART (CH)
Application Number:
PCT/EP2022/073458
Publication Date:
March 02, 2023
Filing Date:
August 23, 2022
Assignee:
ECOLE POLYTECHNIQUE FED LAUSANNE EPFL (CH)
International Classes:
C12Q1/6853; C12N15/10; C12Q1/6806
Domestic Patent References:
WO2016040476A12016-03-17
WO2015164212A12015-10-29
WO2016010856A12016-01-21
WO2019131470A12019-07-04
Foreign References:
US20200157600A12020-05-21
Other References:
PANU SOMERVUO ET AL: "BARCOSEL: a tool for selecting an optimal barcode set for high-throughput sequencing", BMC BIOINFORMATICS, BIOMED CENTRAL LTD, LONDON, UK, vol. 19, no. 1, 5 July 2018 (2018-07-05), pages 1 - 6, XP021258232, DOI: 10.1186/S12859-018-2262-7
YANG IN SEOK ET AL: "Development of a program for in silico optimized selection of oligonucleotide-based molecular barcodes", PLOS ONE, vol. 16, no. 2, 18 January 2021 (2021-01-18), pages e0246354, XP055915055, DOI: 10.1371/journal.pone.0246354
ZIEGENHAIN ET AL., MOL. CELL, vol. 65, 2017, pages 631 - 643
KILPINEN ET AL., SCIENCE, vol. 342, 2013, pages 744 - 747
WASZAK ET AL., CELL, vol. 162, 2015, pages 1039 - 1050
PRADHAN ET AL., SCI. REP., vol. 7, 2017, pages 42130
HASHIMSHONY ET AL., GENOME BIOL., vol. 17, 2016, pages 77
ISLAM ET AL., NAT. PROTOC., vol. 7, 2012, pages 813 - 828
SOUMILLON ET AL., BIORXIV, 2014, pages 003236
BUSH ET AL., NAT. COMMUN., vol. 8, 2017, pages 105
YE ET AL., NAT. COMMUN., vol. 9, 2018, pages 1 - 9
SHOLDER ET AL., BMC GENOMICS, vol. 21, 2020, pages 64
PANDEY ET AL., NAT. PROTOC., vol. 15, 2020, pages 1459 - 1483
ALPERN ET AL., GENOME BIOL., vol. 20, 2019, pages 71
Attorney, Agent or Firm:
KATZAROV S.A. (CH)
Claims:
CLAIMS

1. One or more oligonucleotide molecules comprising, from 5' to 3', a) a sequencing adapter, b) a barcode sequence consisting of 9 to 15 nucleotides, preferably 12 nucleotides, and, c) an mRNA capture sequence.

2. The one or more oligonucleotide molecules of claim 1, further comprising d) a unique molecular identifier (UMI).

3. The one or more oligonucleotide molecules of claim 1 or 2, wherein the sequencing adapter comprises, or consists of, CTA CAC GAC GCT CTT CCG ATC (SEQ ID No. 97).

4. The one or more oligonucleotide molecules of claim 1, 2 or 3, wherein the barcode sequence is selected from the group comprising

SEQ ID NO. 1 CTCGAGTAGCAG,

SEQ ID NO. 2 CAGCACACGTCA,

SEQ ID NO. 3 ACAGCGATCGAC,

SEQ ID NO. 4 TAGTGTACGACA,

SEQ ID NO. 5 TAGTCGTCTAGC,

SEQ ID NO. 6 CATCAGCTGCAC,

SEQ ID NO. 7 TAGTAGCACGCA,

SEQ ID NO. 8 CAGTCAGCTGAC,

SEQ ID NO. 9 CAGCAGTCTACG,

SEQ ID NO. 10 CAGCTAGAGCAC,

SEQ ID NO. 11 CTAGCATGACGA,

SEQ ID NO. 12 ACTCTACGCGAC,

SEQ ID NO. 13 CTGTCGAGCTGA,

SEQ ID NO. 14 ACAGACGAGTCA,

SEQ ID NO. 15 CTATGATCTACG,

SEQ ID NO. 16 CTCAGAGCAGAC,

SEQ ID NO. 17 ACAGAGACTACG,

SEQ ID NO. 18 CTCTGCACTAGC,

SEQ ID NO. 19 ACTAGTGACGAC,

SEQ ID NO. 20 TACGATGCGTAC,

SEQ ID NO. 21 ACGAGACATCAC,

SEQ ID NO. 22 CATCACTGCACA,

SEQ ID NO. 23 CTGACATCACAG,

SEQ ID NO. 24 TAGTACGACTAC,

SEQ ID NO. 25 CACGCAGAGTCA,

SEQ ID NO. 26 CACACGCATAGC,

SEQ ID NO. 27 ACGTATGTCTAG,

SEQ ID NO. 28 CATCTCACTAGA,

SEQ ID NO. 29 ATCGTCATACGA,

SEQ ID NO. 30 TCTAGCACGTGC,

SEQ ID NO. 31 TCAGACTGTCAC,

SEQ ID NO. 32 TCAGTAGTCTAC,

SEQ ID NO. 33 TACTGACACGAC,

SEQ ID NO. 34 CATACTCATGAG,

SEQ ID NO. 35 CACATGCAGTCG,

SEQ ID NO. 36 CACAGATCGAGC,

SEQ ID NO. 37 TCGTGACTCAGC,

SEQ ID NO. 38 CAGCACGATAGC,

SEQ ID NO. 39 CTACAGCACACG,

SEQ ID NO. 40 CTGTACGCATGC,

SEQ ID NO. 41 CTACACACAGCG,

SEQ ID NO. 42 ACTCGTGCGAGA,

SEQ ID NO. 43 TAGCGCATGCAC,

SEQ ID NO. 44 ACGAGCATCGCA,

SEQ ID NO. 45 CTCGCGTACACA,

SEQ ID NO. 46 CTGTAGCATCGC,

SEQ ID NO. 47 TCTACGTATCGC,

SEQ ID NO. 48 CTGAGCTGTACA,

SEQ ID NO. 49 CATGCACACAGA,

SEQ ID NO. 50 TCATCGACATAG,

SEQ ID NO. 51 CACACACTGCGA,

SEQ ID NO. 52 CACTGCTAGACA,

SEQ ID NO. 53 ACGCGTAGCTAC,

SEQ ID NO. 54 CACGTCTATCGC,

SEQ ID NO. 55 CAGCATACACGC,

SEQ ID NO. 56 CACAGTCGACAC,

SEQ ID NO. 57 ACGACGCGTAGA,

SEQ ID NO. 58 ATGCGACAGACG,

SEQ ID NO. 59 ATCTACTAGTGC,

SEQ ID NO. 60 ATCATGAGCAGA,

SEQ ID NO. 61 CATCATAGTGAC,

SEQ ID NO. 62 CTCGACAGCTCA,

SEQ ID NO. 63 ACGCTATCGAGC,

SEQ ID NO. 64 CATGCGTCTCAG,

SEQ ID NO. 65 ACTGTAGACAGC,

SEQ ID NO. 66 TCGCTGCATAGC,

SEQ ID NO. 67 CAGTAGCGTGAG,

SEQ ID NO. 68 ACTGTCTGTGCA,

SEQ ID NO. 69 ACGACAGCATGC,

SEQ ID NO. 70 CTCTACATAGAC,

SEQ ID NO. 71 CAGCGAGTGACA,

SEQ ID NO. 72 CACGACATCAGC,

SEQ ID NO. 73 TCATGACGAGAG,

SEQ ID NO. 74 ACATCGTAGTCG,

SEQ ID NO. 75 CTAGAGATGCGA,

SEQ ID NO. 76 TATCAGACATCG,

SEQ ID NO. 77 ACACTATGCACA,

SEQ ID NO. 78 CAGAGTAGTGCG,

SEQ ID NO. 79 CATGACTAGTCG,

SEQ ID NO. 80 CTACGTATATGC,

SEQ ID NO. 81 CTACGCTCGTAG,

SEQ ID NO. 82 CTCGTAGCTAGA,

SEQ ID NO. 83 CTACTAGACGCA,

SEQ ID NO. 84 TCATGCTCGCGA,

SEQ ID NO. 85 ACACGTAGTGCA,

SEQ ID NO. 86 ACTGAGCAGCGA,

SEQ ID NO. 87 TCACGCGTAGCA,

SEQ ID NO. 88 TCATGCACTGCG,

SEQ ID NO. 89 ACTGCTATCTCG,

SEQ ID NO. 90 CAGACTCTGACG,

SEQ ID NO. 91 ACACGCGATACG,

SEQ ID NO. 92 TCGACGATGACA,

SEQ ID NO. 93 ATACATCGACGA,

SEQ ID NO. 94 CACATCACTGAC,

SEQ ID NO. 95 TCAGCATACTCA,

SEQ ID NO. 96 TATCGTCGCACG, or a combination of one or more thereof.

5. The one or more oligonucleotide molecules of any one of the preceding claims 2 to 4, wherein the UMI consists of a sequence (N)n(V)m, wherein N is any nucleotide selected from A, T, C and G; V is any nucleotide selected from A, C and G; n is an integer selected from 1 to 20, and m is an integer selected from 1 to 20.

6. The one or more oligonucleotide molecules of any one of the preceding claims, wherein the mRNA capture sequence is a poly-T sequence followed by at least one V and one N, wherein N is any nucleotide selected from A, T, C and G and V is any nucleotide selected from A, C and G.

7. A set of oligonucleotide molecules consisting of 96 oligonucleotide molecules, each molecule comprising, from 5' to 3', a) a sequencing adaptor, b) a barcode sequence independently selected from the group consisting of SEQ ID NO. 1 to 96, c) an mRNA capture sequence and, d) optionally a UMI.

8. A barcode oligonucleotide sequence selected from the group comprising

SEQ ID NO. 1 CTCGAGTAGCAG

SEQ ID NO. 2 CAGCACACGTCA

SEQ ID NO. 3 ACAGCGATCGAC

SEQ ID NO. 4 TAGTGTACGACA

SEQ ID NO. 5 TAGTCGTCTAGC

SEQ ID NO. 6 CATCAGCTGCAC

SEQ ID NO. 7 TAGTAGCACGCA

SEQ ID NO. 8 CAGTCAGCTGAC

SEQ ID NO. 9 CAGCAGTCTACG

SEQ ID NO. 10 CAGCTAGAGCAC

SEQ ID NO. 11 CTAGCATGACGA

SEQ ID NO. 12 ACTCTACGCGAC

SEQ ID NO. 13 CTGTCGAGCTGA

SEQ ID NO. 14 ACAGACGAGTCA

SEQ ID NO. 15 CTATGATCTACG

SEQ ID NO. 16 CTCAGAGCAGAC

SEQ ID NO. 17 ACAGAGACTACG

SEQ ID NO. 18 CTCTGCACTAGC

SEQ ID NO. 19 ACTAGTGACGAC

SEQ ID NO. 20 TACGATGCGTAC

SEQ ID NO. 21 ACGAGACATCAC

SEQ ID NO. 22 CATCACTGCACA

SEQ ID NO. 23 CTGACATCACAG

SEQ ID NO. 24 TAGTACGACTAC

SEQ ID NO. 25 CACGCAGAGTCA

SEQ ID NO. 26 CACACGCATAGC

SEQ ID NO. 27 ACGTATGTCTAG

SEQ ID NO. 28 CATCTCACTAGA

SEQ ID NO. 29 ATCGTCATACGA

SEQ ID NO. 30 TCTAGCACGTGC

SEQ ID NO. 31 TCAGACTGTCAC

SEQ ID NO. 32 TCAGTAGTCTAC

SEQ ID NO. 33 TACTGACACGAC

SEQ ID NO. 34 CATACTCATGAG

SEQ ID NO. 35 CACATGCAGTCG

SEQ ID NO. 36 CACAGATCGAGC

SEQ ID NO. 37 TCGTGACTCAGC

SEQ ID NO. 38 CAGCACGATAGC

SEQ ID NO. 39 CTACAGCACACG

SEQ ID NO. 40 CTGTACGCATGC

SEQ ID NO. 41 CTACACACAGCG

SEQ ID NO. 42 ACTCGTGCGAGA

SEQ ID NO. 43 TAGCGCATGCAC

SEQ ID NO. 44 ACGAGCATCGCA

SEQ ID NO. 45 CTCGCGTACACA

SEQ ID NO. 46 CTGTAGCATCGC

SEQ ID NO. 47 TCTACGTATCGC

SEQ ID NO. 48 CTGAGCTGTACA

SEQ ID NO. 49 CATGCACACAGA

SEQ ID NO. 50 TCATCGACATAG

SEQ ID NO. 51 CACACACTGCGA

SEQ ID NO. 52 CACTGCTAGACA

SEQ ID NO. 53 ACGCGTAGCTAC

SEQ ID NO. 54 CACGTCTATCGC

SEQ ID NO. 55 CAGCATACACGC

SEQ ID NO. 56 CACAGTCGACAC

SEQ ID NO. 57 ACGACGCGTAGA

SEQ ID NO. 58 ATGCGACAGACG

SEQ ID NO. 59 ATCTACTAGTGC

SEQ ID NO. 60 ATCATGAGCAGA

SEQ ID NO. 61 CATCATAGTGAC

SEQ ID NO. 62 CTCGACAGCTCA

SEQ ID NO. 63 ACGCTATCGAGC

SEQ ID NO. 64 CATGCGTCTCAG

SEQ ID NO. 65 ACTGTAGACAGC

SEQ ID NO. 66 TCGCTGCATAGC

SEQ ID NO. 67 CAGTAGCGTGAG

SEQ ID NO. 68 ACTGTCTGTGCA

SEQ ID NO. 69 ACGACAGCATGC

SEQ ID NO. 70 CTCTACATAGAC

SEQ ID NO. 71 CAGCGAGTGACA

SEQ ID NO. 72 CACGACATCAGC

SEQ ID NO. 73 TCATGACGAGAG

SEQ ID NO. 74 ACATCGTAGTCG

SEQ ID NO. 75 CTAGAGATGCGA

SEQ ID NO. 76 TATCAGACATCG

SEQ ID NO. 77 ACACTATGCACA

SEQ ID NO. 78 CAGAGTAGTGCG

SEQ ID NO. 79 CATGACTAGTCG

SEQ ID NO. 80 CTACGTATATGC

SEQ ID NO. 81 CTACGCTCGTAG

SEQ ID NO. 82 CTCGTAGCTAGA

SEQ ID NO. 83 CTACTAGACGCA

SEQ ID NO. 84 TCATGCTCGCGA

SEQ ID NO. 85 ACACGTAGTGCA

SEQ ID NO. 86 ACTGAGCAGCGA

SEQ ID NO. 87 TCACGCGTAGCA

SEQ ID NO. 88 TCATGCACTGCG

SEQ ID NO. 89 ACTGCTATCTCG

SEQ ID NO. 90 CAGACTCTGACG

SEQ ID NO. 91 ACACGCGATACG

SEQ ID NO. 92 TCGACGATGACA

SEQ ID NO. 93 ATACATCGACGA

SEQ ID NO. 94 CACATCACTGAC

SEQ ID NO. 95 TCAGCATACTCA

SEQ ID NO. 96 TATCGTCGCACG, or a combination of one or more thereof.

9. Use of one or more oligonucleotide molecules of any one of claims 1 to 6, or of a set of oligonucleotide molecules of claim 7, or of a barcode oligonucleotide sequence, or a combination of one or more thereof, of claim 8, in a sequencing method.

10. A method for providing a cDNA library, the method comprising the steps of a) Providing a plurality of RNA samples obtained from a biological sample; b) Contacting separately each RNA sample with one or more oligonucleotide molecules of any one of claims 1 to 6, or of a library of claim 7, or of a barcode oligonucleotide sequence or a combination of one or more thereof, of claim 8, under annealing conditions; c) Incubating separately each sample under reverse transcription reaction conditions; d) Pooling together all the cDNA:RNA samples; e) Proceeding to second strand synthesis under synthesis conditions; and f) Proceeding with tagmentation and/or end-repair and ligation and amplification under suitable conditions so as to obtain a cDNA library.

11. The method for providing a cDNA library of claim 10, wherein the second strand synthesis is generated by a method selected from the group comprising PCR amplification and nick translation, or a combination thereof.

12. A method for sequencing RNA, the method comprising the steps of a) Providing a cDNA library obtained by the method of claim 10 or 11; and b) Proceeding to the sequencing under suitable conditions.

13. A method for selecting barcode oligonucleotides for multiplexed nucleic acid sequencing, said method comprising selecting one or more barcode oligonucleotides which has/have a Shannon’s first order entropy of at least 1.5 and a second order entropy of at least 2.5; and wherein

the GC content of the barcode oligonucleotides is between 35% and 65%; the first two nucleotides are not Gs; the last two nucleotides are not Ts; the Hamming distance between any two barcodes of the list is at least 5; and said one or more barcode oligonucleotides contain homopolymers of at most 2 nucleotides.

14. The method of claim 13, wherein said method is a computer implemented method.

15. A kit comprising

i) one or more oligonucleotide molecules of any one of claims 1 to 6, or a set of oligonucleotide molecules of claim 7, or a barcode oligonucleotide sequence, or a combination of one or more thereof of claim 8, ii) a support for sample preparation, such as a 96-well plate, and iii) reagents for sequencing.

16. Use of the kit of claim 15, or of the one or more oligonucleotide molecules of any one of claims 1 to 6, or of a library of claim 7, or of a barcode oligonucleotide sequence or a combination of one or more thereof, of claim 8, in a single-cell RNA profiling method.

Description:
Optimised set of oligonucleotides for bulk RNA barcoding and sequencing

FIELD OF THE INVENTION

The present invention relates generally to the field of nucleic acid sequencing and provides oligonucleotide molecules and barcodes contained therein. These oligonucleotide molecules and barcode molecules are useful in sequencing to identify and resolve errors.

BACKGROUND OF THE INVENTION

RNA sequencing has become the method of choice for genome-wide transcriptomic analyses as its price has substantially decreased over the last years. Nevertheless, the high cost of standard RNA library preparation and the complexity of the underlying data analysis still prevent this approach from becoming as routine as quantitative PCR (qPCR), especially when many samples need to be analyzed.

To alleviate this high cost, the emerging single-cell transcriptomics field implemented the sample barcoding/early multiplexing principle. This reduces both the RNA-seq cost and preparation time by allowing the generation of a single sequencing library that contains multiple distinct samples/cells (Ziegenhain et al., 2017, Mol. Cell 65, 631-643.e4).

Such a strategy could also be of value to reduce the cost and processing time of bulk RNA sequencing of large sets of samples (Kilpinen et al., 2013, Science 342, 744-747; Waszak et al., 2015, Cell 162, 1039-1050; Pradhan et al., 2017, Sci. Rep. 7, 42130). However, there have been surprisingly few efforts to explicitly adapt and validate the early-stage multiplexing protocols for reliable and affordable profiling of bulk RNA samples.

Early multiplexing protocols designed for single-cell RNA profiling (CEL-seq2, SCRB-seq, and STRT-seq) provide a great capacity for transforming large sets of samples into a unique sequencing library (Hashimshony et al., 2016, Genome Biol. 17, 77; Islam et al., 2012, Nat. Protoc. 7, 813-828; Soumillon et al., 2014, bioRxiv, 003236, doi: 10.1101/003236). This is achieved by introducing a sample-specific barcode during the RT reaction using a “molecular tag” carried by either the oligo-dT or the template switch oligo (TSO). After individual samples have been “tagged”, they are pooled together, and the remaining steps are performed in bulk, thus shortening the time and cost of library preparation. Since the tag is introduced to the terminal part of the transcript prior to fragmentation, the reads solely cover the 3' or 5' end of the transcripts. The 3’DGE approach for bulk RNA profiling has been adopted in several recent studies, such as PLATE-seq (Bush et al., 2017, Nat. Commun. 8, 105), DRUG-seq (Ye et al., 2018, Nat. Commun. 9, 1-9), 3’POOL-seq (Sholder et al., 2020, BMC Genomics 21, 64), PME-seq (Pandey et al., 2020, Nat. Protoc. 15, 1459-1483) and BRB-seq (Alpern et al., 2019, Genome Biol. 20, 71). These techniques have two main commonalities: i) using barcoded DNA oligos to “tag” poly-adenylated RNA molecules during first strand synthesis and ii) pooling all the tagged samples together in one tube after the barcoding step.

The overarching goal of these techniques is to decrease the costs and increase the throughput associated with mRNA sequencing library preparation of bulk samples.

This is achieved by reducing reagents, consumables and personnel time through pooling in one solution several barcoded samples. In simple terms, it is much more cost-effective and simpler to process e.g. 100 samples in one tube than 100 samples in 100 tubes.

One of the main challenges of RNA barcoding applied to bulk samples is the ability to guarantee a uniform distribution of sequencing reads across all samples. This challenge is due to the fact that the “molecular barcodes” used during the RNA barcoding step are a functional portion of the “reverse transcription primer” and, as such, different barcodes (i.e. barcodes with different sequences) can have a significant effect on the efficiency of the overall workflow. For example, empirical experimental evidence highlights the following potential issues: i) barcodes can lead to unwanted secondary structures that interfere with or prevent efficient priming; ii) completely random barcodes may end up having very similar or repeating sequences, which are then difficult to resolve at the sequencing stage; iii) certain barcodes can be preferentially amplified within the same pool; and, last but not least, iv) certain barcodes preferentially bind mitochondrial transcripts, which then appear as an unwanted bias in the sequencing results.

Therefore, there is still a need for (i) more accurate, dependable sequencing tools and methods that can eliminate barcoding sample-to-sample variation and (ii) methods that use them to improve various barcoding approaches, including the barcoding-mediated high-accuracy sequencing method.

SUMMARY OF THE INVENTION

The present invention provides one or more oligonucleotide molecules comprising, from 5' to 3', a) a sequencing adapter, b) a barcode sequence consisting of 9 to 15 nucleotides, preferably 12 nucleotides, and, c) an mRNA capture sequence.

Further provided is a set of oligonucleotide molecules consisting of 96 oligonucleotide molecules, each molecule comprising, from 5' to 3', a) a sequencing adaptor, b) a barcode sequence independently selected from the group consisting of SEQ ID NO. 1 to 96, and, c) an mRNA capture sequence.

Further provided is a barcode oligonucleotide sequence selected from the group comprising SEQ ID NO: 1 to SEQ ID NO: 96.

Further provided is the use of one or more oligonucleotide molecules of the invention, or of a set of oligonucleotide molecules of the invention, or of a barcode oligonucleotide sequence, or a combination of one or more thereof, of the invention, in a sequencing method.

Further provided is a method for providing a cDNA library, the method comprising the steps of a) Providing a plurality of RNA samples obtained from a biological sample; b) Contacting separately each RNA sample with one or more oligonucleotide molecules of the invention, or of a library of the invention, or of a barcode oligonucleotide sequence or a combination of one or more thereof, of the invention, under annealing conditions; c) Incubating separately each sample under reverse transcription reaction conditions; d) Pooling together all the cDNA:RNA samples; e) Proceeding to second strand synthesis under synthesis conditions; and f) Proceeding with tagmentation and amplification under suitable conditions so as to obtain a cDNA library.

Further provided is a method for sequencing RNA, the method comprising the steps of a) Providing a cDNA library obtained by the method of the invention; and b) Proceeding to the sequencing under suitable conditions.

Also provided is a method for selecting barcode oligonucleotides for multiplexed nucleic acid sequencing, said method comprising selecting one or more barcode oligonucleotides which has/have a Shannon’s first order entropy of at least 1.5 and a second order entropy of at least 2.5; and wherein

the GC content of the barcode oligonucleotides is between 35% and 65%; the first two nucleotides are not Gs; the last two nucleotides are not Ts; the Hamming distance between any two barcodes of the list is at least 5; and said one or more barcode oligonucleotides contain homopolymers of at most 2 nucleotides.
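By way of non-limiting illustration, the selection criteria above may be sketched in Python as follows. This is a minimal sketch and not the implementation used to obtain SEQ ID NO. 1 to 96. It assumes that the first order entropy is computed in bits over single-nucleotide frequencies and the second order entropy over overlapping dinucleotide frequencies, that "not Gs" and "not Ts" mean that neither of the two positions carries that base, and that barcodes are accepted greedily in input order. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq: str, order: int = 1) -> float:
    """Entropy in bits of the overlapping k-mers of length `order` (assumed definition)."""
    kmers = [seq[i:i + order] for i in range(len(seq) - order + 1)]
    total = len(kmers)
    return -sum((n / total) * log2(n / total) for n in Counter(kmers).values())


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def passes_sequence_filters(seq: str) -> bool:
    """Per-barcode criteria: entropy, GC content, end composition, homopolymers."""
    return (
        shannon_entropy(seq, 1) >= 1.5
        and shannon_entropy(seq, 2) >= 2.5
        and 0.35 <= gc_fraction(seq) <= 0.65
        and "G" not in seq[:2]      # assumed reading: no G at positions 1 and 2
        and "T" not in seq[-2:]     # assumed reading: no T at the last two positions
        and longest_homopolymer(seq) <= 2
    )


def select_barcodes(candidates, min_distance: int = 5):
    """Greedily keep candidates that pass the filters and are far from all kept ones."""
    selected = []
    for seq in candidates:
        if passes_sequence_filters(seq) and all(
            hamming(seq, kept) >= min_distance for kept in selected
        ):
            selected.append(seq)
    return selected
```

For instance, calling `select_barcodes` on a list containing SEQ ID NO. 1, a single-base variant of it, and SEQ ID NO. 2 would keep the first and the third and reject the variant on the Hamming distance criterion. A greedy pass is only one possible strategy; the result depends on the order of the candidates.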

Also provided are a kit and the use of the kit in a single-cell RNA profiling method.

DESCRIPTION OF THE FIGURES

Figure 1 shows a “bad” example of sequencing read distribution for a suboptimal set of 96 barcodes. As can be seen in the circled area, barcodes can systematically underperform as compared to the others.

Figure 2 shows the read distribution of the optimal set of 96 barcodes of the invention in which all barcodes are functional and obtain a similar number of sequencing reads.

Figure 3 shows a “bad” example of mitochondrial read distribution for a suboptimal set of 96 barcodes. As can be seen in the circled area, barcodes can systematically obtain more mitochondrial reads as compared to the others.

Figure 4 shows the mitochondrial read distribution of the optimal set of 96 barcodes of the invention in which all barcodes are functional and obtain a similar number of mitochondrial reads.

Figure 5 shows a schematic overview of the method described herein.

DESCRIPTION OF THE INVENTION

Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, suitable methods and materials are described below. All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety. The publications and applications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. In addition, the materials, methods, and examples are illustrative only and are not intended to be limiting.

In the case of conflict, the present specification, including definitions, will control. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as is commonly understood by one of skill in the art to which the subject matter herein belongs. As used herein, the following definitions are supplied in order to facilitate the understanding of the present invention.

The term "comprise/comprising" is generally used in the sense of include/including, that is to say permitting the presence of one or more features or components. The terms "comprise(s)" and "comprising" also encompass the more restricted ones "consist(s)", "consisting" as well as "consist/consisting essentially of", respectively.

As used in the specification and claims, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise.

As used herein, "one or more" includes "two or more", "three or more", etc. For example, one or more oligonucleotide molecules refers to one oligonucleotide molecule, two oligonucleotide molecules, three oligonucleotide molecules, etc.

The present invention is based on the discovery of an optimal set of 96 barcoded oligonucleotides for multiplexed RNA sequencing. These oligonucleotides contain barcodes that are 12 base pairs long and have therefore been selected from a pool of 4^12 = 16,777,216 potential candidates. This large pool has been filtered twice, first computationally and then experimentally, as disclosed herein. The goal was to obtain an optimized set of barcodes in order to

1) uniquely demultiplex the samples, with tolerance for sequencing errors, and

2) adapt the barcodes for potential technical bias such as overrepresentation of polyT sequences due to preferential amplification of certain sequences.
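As a simple illustration of the size of the candidate pool, the following Python sketch enumerates all 12-nucleotide sequences lazily and applies an arbitrary computational filter. The function names and the example filter are hypothetical and are given only to show the principle of a first, computational filtering pass.

```python
from itertools import islice, product


def pool_size(length: int = 12, alphabet: str = "ACGT") -> int:
    """Number of distinct sequences of the given length over the alphabet."""
    return len(alphabet) ** length


def candidate_barcodes(length: int = 12, alphabet: str = "ACGT"):
    """Lazily yield every sequence of the given length, without storing the pool."""
    for letters in product(alphabet, repeat=length):
        yield "".join(letters)


def filter_candidates(candidates, keep):
    """Yield only the candidates for which the predicate `keep` returns True."""
    return (seq for seq in candidates if keep(seq))


print(pool_size())                            # 16777216
print(list(islice(candidate_barcodes(), 3)))  # ['AAAAAAAAAAAA', 'AAAAAAAAAAAC', 'AAAAAAAAAAAG']
```

Because the generator never materialises the full pool, a predicate can be streamed over all 16,777,216 candidates with constant memory.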

In one aspect, the invention provides one or more oligonucleotide molecules comprising, from 5' to 3', a) a sequencing adapter, b) a barcode sequence consisting of 9 to 15 nucleotides, preferably 12 nucleotides, and, c) an mRNA capture sequence.

In one aspect, the one or more oligonucleotide molecules further comprise a unique molecular identifier (UMI).

“Oligonucleotide” or “polynucleotide,” which are used synonymously, means a linear polymer of natural or modified nucleosidic monomers linked by phosphodiester bonds or analogs thereof. The term “oligonucleotide” usually refers to a shorter polymer, e.g., comprising from about 3 to about 100 monomers, and the term “polynucleotide” usually refers to longer polymers, e.g., comprising from about 100 monomers to many thousands of monomers, e.g., 10,000 monomers, or more. Oligonucleotides and polynucleotides may be natural or synthetic. Oligonucleotides and polynucleotides include deoxyribonucleosides, ribonucleosides, and non-natural analogs thereof, such as anomeric forms thereof, peptide nucleic acids (PNAs), and the like, provided that they are capable of specifically binding to a target genome by way of a regular pattern of monomer-to-monomer interactions, such as Watson-Crick type of base pairing, base stacking, Hoogsteen or reverse Hoogsteen types of base pairing, or the like.

The terms “peptide,” “protein,” and “polypeptide” are used interchangeably to refer to a natural or synthetic molecule comprising two or more amino acids linked by the carboxyl group of one amino acid to the alpha amino group of another.

The term “nucleic acid” refers to a natural or synthetic molecule comprising a single nucleotide or two or more nucleotides linked by a phosphate group at the 3' position of one nucleotide to the 5' end of another nucleotide. The nucleic acid is not limited by length, and thus the nucleic acid can include deoxyribonucleic acid (DNA) or ribonucleic acid (RNA).

“Sequencing” refers to determining the order of nucleotides (base sequences) in a nucleic acid sample, e.g. DNA or RNA. Many techniques are available such as Sanger sequencing and High Throughput Sequencing technologies (HTS). Sanger sequencing may involve sequencing via detection through (capillary) electrophoresis, in which up to 384 capillaries may be sequence analysed in one run. High throughput sequencing involves the parallel sequencing of thousands or millions or more sequences at once. HTS can be defined as Next Generation sequencing, i.e. techniques based on solid phase pyrosequencing or as Next-Next Generation sequencing based on single molecule real time sequencing (SMRT). HTS technologies are available such as offered by Roche, Illumina and Applied Biosystems (Life Technologies). Further high throughput sequencing technologies are described by and/or available from Helicos, Pacific Biosciences, Complete Genomics, Ion Torrent Systems, Oxford Nanopore Technologies, Nabsys, ZS Genetics, GnuBio. Each of these sequencing technologies has its own way of preparing samples prior to the actual sequencing step. Depending on the sequencing technology used, amplification steps may be omitted.

As used herein, the term “barcode” refers to a unique oligonucleotide sequence that allows a corresponding nucleic acid base and/or nucleic acid sequence to be identified. In certain aspects, the nucleic acid base and/or nucleic acid sequence is located at a specific position on a larger polynucleotide sequence (e.g., a polynucleotide covalently attached to a bead). In certain aspects, barcodes can each have a length within a range of from 4 to 150 nucleotides. The barcode technology (or barcoding) has been a particularly powerful technique for studying the genetic and functional variations of the target pool and for high-accuracy target DNA sequencing. Each barcode can comprise deoxyribonucleotides, optionally all of the nucleotides in a barcode region are deoxyribonucleotides. One or more of the deoxyribonucleotides may be a modified deoxyribonucleotide (e.g. a deoxyribonucleotide modified with a biotin moiety or a deoxyuracil nucleotide). The barcodes may comprise one or more degenerate nucleotides or sequences. The barcode regions may not comprise any degenerate nucleotides or sequences.

In one aspect, the barcode sequence of the invention consists of 9 to 15 nucleotides, preferably 12 nucleotides. More preferably, the barcode sequence is selected from the group comprising SEQ ID NO. 1 to SEQ ID NO. 96, or a combination of one or more thereof.

As used herein, the term “sequencing adapter” refers to an oligonucleotide sequence that can be used in subsequent sequencing steps. Alternatively, primers that are used to amplify a subset of fragments prior to sequencing may contain parts within their sequence that introduce sections that can later be used in the sequencing step, for instance by introducing, through an amplification step, a sequencing adapter or a capturing moiety in an amplicon that can be used in a subsequent sequencing step. Depending also on the sequencing technology used, amplification steps may be omitted.

Any commercially available sequencing adapter can be used. In one aspect, the sequencing adapter comprises, or consists of, CTA CAC GAC GCT CTT CCG ATC (SEQ ID No. 97).

As used herein, a "unique molecular identifier" or UMI is a complex index added to sequencing libraries before any PCR amplification step, enabling the accurate bioinformatic identification, and thus the removal, of PCR duplicates. In one aspect, the UMI is an oligonucleotide sequence consisting of a sequence (N)n(V)m, wherein N is any nucleotide selected from A, T, C and G; V is any nucleotide selected from A, C and G; n is an integer selected from 1 to 20, and m is an integer selected from 1 to 20. Preferably, the UMI comprises, or consists of, SEQ ID NO. 98 (NNNNNNNNNNNVVVVV).
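To illustrate how a UMI of the form (N)n(V)m may be used to identify PCR duplicates, the following Python sketch validates each observed UMI against a degenerate pattern and counts distinct UMIs per sample barcode and gene. It is a simplified sketch under stated assumptions: reads are assumed to be already assigned to a barcode and a gene, identical UMIs are treated as duplicates without any correction of sequencing errors in the UMI, and all names are hypothetical.

```python
import re
from collections import defaultdict

# Degenerate codes used in the UMI definition: N = A/C/G/T, V = A/C/G.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]", "V": "[ACG]"}


def iupac_to_regex(pattern: str):
    """Compile a degenerate pattern such as 'NNVV' into a regular expression."""
    return re.compile("".join(IUPAC[ch] for ch in pattern))


def count_molecules(reads, umi_pattern: str):
    """Count distinct valid UMIs per (sample barcode, gene).

    `reads` is an iterable of (barcode, gene, umi) tuples; reads sharing the
    same triple are counted once, i.e. treated as PCR duplicates.
    """
    umi_re = iupac_to_regex(umi_pattern)
    seen = defaultdict(set)
    for barcode, gene, umi in reads:
        if umi_re.fullmatch(umi):  # discard UMIs that do not fit (N)n(V)m
            seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}
```

With a short illustrative pattern `NNVV`, the reads `ACGA`, `ACGA`, `TTCC` and `ACGT` for one barcode and gene would give a count of 2: the second read duplicates the first, and `ACGT` is rejected because its last base is a T.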

As used herein, an "mRNA capture sequence" is an oligonucleotide sequence that specifically hybridizes to mRNAs. In one aspect, the mRNA capture sequence is a poly-T sequence (10 to 40 T). In a preferred aspect, the mRNA capture sequence is a poly-T sequence (e.g. comprising 10 to 40 T) followed by at least one V and one N, wherein N is any nucleotide selected from A, T, C and G and V is any nucleotide selected from A, C and G. Preferably, the mRNA capture sequence comprises, or consists of, SEQ ID NO. 99 (TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTVN).
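The 5' to 3' arrangement of the elements defined above can be pictured with the following Python sketch, which concatenates a sequencing adapter, a barcode, an optional UMI and a poly-T capture sequence ending in VN into one degenerate oligonucleotide string. The placement of the UMI between the barcode and the capture sequence and the default poly-T length of 30 are assumptions made for illustration only, and the function name is hypothetical.

```python
ADAPTER = "CTACACGACGCTCTTCCGATC"  # SEQ ID No. 97 written without spaces


def build_oligo(barcode: str, umi: str = "", poly_t_len: int = 30) -> str:
    """Return adapter + barcode + optional UMI + poly-T + VN, read 5' to 3'.

    The UMI position (after the barcode) is an assumption of this sketch.
    """
    if not 9 <= len(barcode) <= 15:
        raise ValueError("barcode must consist of 9 to 15 nucleotides")
    if not 10 <= poly_t_len <= 40:
        raise ValueError("poly-T length must be between 10 and 40")
    return ADAPTER + barcode + umi + "T" * poly_t_len + "VN"


# Example with the barcode of SEQ ID NO. 1 and a short illustrative UMI pattern.
oligo = build_oligo("CTCGAGTAGCAG", umi="NNNNVV")
print(len(oligo))  # 71
```

The degenerate letters N and V are kept as placeholders in the returned string, as they would be in a synthesis order for a mixed-base oligonucleotide.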

“Multiplex sequencing” refers to a sequencing technique that allows for processing a large number of samples on a high-throughput instrument. For multiplex sequencing, individual “barcode” sequences of the invention are added to each sample so that nucleotide sequences from different samples can be distinguished by the unique barcode sequences embedded in each sample. With this technique, multiple DNA or RNA samples can be pooled, processed, sequenced, and analyzed simultaneously.
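The way in which reads from pooled samples can be assigned back to their sample of origin, while tolerating sequencing errors in the barcode, may be sketched in Python as follows. The sketch assumes a barcode set with a pairwise Hamming distance of at least 5, in which case a read barcode with up to 2 mismatches can match at most one known barcode. Only the first three barcodes (SEQ ID NO. 1 to 3) are listed for brevity, and the function names are hypothetical.

```python
KNOWN_BARCODES = [
    "CTCGAGTAGCAG",  # SEQ ID NO. 1
    "CAGCACACGTCA",  # SEQ ID NO. 2
    "ACAGCGATCGAC",  # SEQ ID NO. 3
]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read_barcode: str, known=KNOWN_BARCODES, max_mismatches: int = 2):
    """Return the single known barcode within `max_mismatches`, or None.

    Reads of the wrong length, reads too far from every barcode, and
    ambiguous reads (more than one candidate) are left unassigned.
    """
    hits = [
        bc for bc in known
        if len(bc) == len(read_barcode) and hamming(read_barcode, bc) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


print(assign_barcode("CTCGAGTAGCTT"))  # CTCGAGTAGCAG (two mismatches corrected)
print(assign_barcode("GGGGGGGGGGGG"))  # None
```

The tolerance of 2 mismatches follows from the minimum distance: with a distance of at least 5 between any two barcodes, no sequence can lie within 2 mismatches of two different barcodes.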

The present invention further provides a set of oligonucleotide molecules consisting of 96 oligonucleotide molecules, each molecule comprising, from 5' to 3', a) a sequencing adaptor, b) a barcode sequence independently selected from the group consisting of SEQ ID NO. 1 to 96, and, c) an mRNA capture sequence, as described herein.

In one aspect, the set of oligonucleotide molecules consisting of 96 oligonucleotides further comprises a UMI.

Also provided is the use of one or more oligonucleotide molecules of the invention, or of a set of oligonucleotide molecules of the invention, or of a barcode oligonucleotide sequence, or a combination of one or more thereof, of the invention, in a sequencing method.

The one or more oligonucleotide molecules of the invention, the set of oligonucleotide molecules of the invention, the barcode oligonucleotide sequence, or a combination of one or more thereof, may be linked by attachment to a solid support (e.g. a bead). A solution of soluble beads (e.g. superparamagnetic beads or styrofoam beads) may be functionalized to enable attachment of two or more oligonucleotide molecules of the invention, sets of oligonucleotide molecules of the invention, barcode oligonucleotide sequences, or a combination of one or more thereof. This functionalization may be enabled through chemical moieties (e.g. carboxylated groups), and/or protein-based adapters (e.g. streptavidin) on the beads. The functionalized beads may be brought into contact with a solution of the above-described molecules under conditions which promote the attachment of two or more molecules to each bead in the solution. Optionally, the molecules are attached through a covalent linkage, or through a (stable) non-covalent linkage such as a streptavidin-biotin bond, or a (stable) oligonucleotide hybridization bond.

The present invention further encompasses a method for providing a cDNA library, the method comprising a step (a) of providing a plurality of RNA samples obtained from a biological sample.

As used herein, the term “biological sample” refers to a tissue (e.g., tissue biopsy), organ, cell (including a cell maintained in culture), cell lysate (or lysate fraction), biomolecule derived from a cell or cellular material (e.g. a polypeptide or nucleic acid), or body fluid from a subject. Non-limiting examples of body fluids include blood, urine, plasma, serum, tears, lymph, bile, cerebrospinal fluid, interstitial fluid, aqueous or vitreous humor, colostrum, sputum, amniotic fluid, saliva, anal and vaginal secretions, perspiration, semen, transudate, exudate, and synovial fluid.

The RNA samples can be obtained by any technique known in the art. According to a particular aspect, the RNA samples are mRNA samples that can be obtained from cell lysates, total DNA/RNA eluate, blood and FFPE tissues.

The method further comprises a step (b) of contacting separately each RNA sample with one or more oligonucleotide molecules of the invention, or of a library of the invention, or of a barcode oligonucleotide sequence or a combination of one or more thereof, of the invention, under annealing conditions.

For example, RNA samples are thawed on ice and transferred to the corresponding wells of the Oligo-dT primer plate; the plate is then sealed with the AluSeal, placed in a thermocycler at 65°C for 5 min and immediately put on ice.

The method further comprises a step (c) of incubating separately each sample under reverse transcription reaction conditions.

For example, the RT reaction mix is prepared according to the manufacturer’s instructions. Any commercially available RT enzyme and buffer can be used, such as Lucigen’s ERT12910K, ThermoFisher’s 18064014 and NEB’s M0368S, among others. For example:

• Incubate RT reaction mix in thermocycler with the following program:

The method further comprises a step (d) of pooling together all the cDNA:RNA samples.

The method further comprises a step (e) of proceeding to second strand synthesis under suitable synthesis conditions. The second strand can be generated by any method known in the art. In one aspect, the second strand synthesis method is selected from the group comprising PCR amplification and nick translation, or a combination thereof.

The method further comprises a step (f) of proceeding with tagmentation, and/or end-repair and ligation and amplification under suitable conditions such as, e.g. the conditions described in the examples, so as to obtain a cDNA library.

Examples of RNAs include but are not limited to: mRNA, amplicons, rRNA, tRNA, nRNA, siRNA, snRNA, snoRNA, scaRNA, microRNA, dsRNA, ncRNA (e.g. lncRNA), ribozyme, riboswitch and viral RNA (e.g., retroviral RNA).

The present invention further encompasses a method for sequencing RNA, the method comprising the steps of a) Providing a cDNA library obtained by the method described herein; and b) Proceeding to the sequencing under suitable conditions such as, e.g. those defined by NGS sequencing providers, which are also known in the art.

The present invention further encompasses a method for selecting barcode oligonucleotides for multiplexed nucleic acid sequencing, said method comprising selecting one or more barcode oligonucleotides which has/have a Shannon’s first order entropy of at least 1.5 and a second order entropy of at least 2.5; and wherein the GC content of the barcode oligonucleotides is between 35% and 65%; the first two nucleotides are not G’s; the last two nucleotides are not T’s; the Hamming distance between any two barcodes of the list is at least 5; and said one or more barcode oligonucleotides contain homopolymers of at most 2 nucleotides.

The present method is aimed at enhancing the complexity of the barcode sequences, avoiding barcodes with repetitive patterns, and reducing internal hairpin propensity.

Moreover, guanine bases at the beginning of the barcode were removed to avoid the GGC and GGT Illumina sequencing patterns, which are nucleotide combinations known to be prone to signal-to-noise decline during sequencing.

Also, thymine bases at the 3’ end were removed to avoid preferential selection/amplification of these barcodes.

Finally, the Hamming distance between any two barcodes of the list was set to at least 5 to enhance the efficacy of later demultiplexing of these barcodes and especially to be able to correct for potential sequencing errors in the barcodes.
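The error-correction benefit of the minimum Hamming distance can be sketched as follows. For a barcode list with pairwise Hamming distance of at least d, nearest-barcode assignment can unambiguously correct up to floor((d - 1) / 2) substitution errors, i.e. up to 2 substitutions for d = 5. The function names and the example sequences in the usage note are hypothetical and are not taken from SEQ ID NO: 1 to 96; this is an illustration under those assumptions, not the demultiplexing procedure of any particular software.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed, whitelist, min_distance=5):
    """Return the whitelist barcode matching `observed`, or None.

    A read is assigned only if it lies within floor((min_distance - 1) / 2)
    substitutions of a barcode; within that radius the match is unique
    provided the whitelist really has the stated minimum pairwise distance.
    """
    max_errors = (min_distance - 1) // 2
    best = min(whitelist, key=lambda bc: hamming(observed, bc))
    return best if hamming(observed, best) <= max_errors else None
```

For instance, with the hypothetical list `["ACTGCATCGACA", "CAGTATCAGCTC"]`, an observed read `"ACTGCATCGATT"` (two substitutions) is assigned to the first barcode, whereas `"ACTGCATCGTTT"` (three substitutions) is left unassigned.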

Examples of nucleic acids include but are not limited to: DNA, RNA, amplicons, cDNA, dsDNA, ssDNA, plasmid DNA, cosmid DNA, high Molecular Weight (MW) DNA, chromosomal DNA, genomic DNA, viral DNA, bacterial DNA, mtDNA (mitochondrial DNA), mRNA, rRNA, tRNA, nRNA, siRNA, snRNA, snoRNA, scaRNA, microRNA, dsRNA, ribozyme, riboswitch and viral RNA (e.g., retroviral RNA). Preferably, the nucleic acid is an RNA selected from the group comprising mRNA, amplicons, rRNA, tRNA, nRNA, siRNA, snRNA, snoRNA, scaRNA, microRNA, dsRNA, ncRNA (e.g. lncRNA), ribozyme, riboswitch and viral RNA (e.g., retroviral RNA).

In one aspect, the method for selecting barcode oligonucleotides for multiplexed nucleic acid sequencing provided herein is a computer implemented method.
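A computer implementation of the selection criteria could look like the following minimal sketch. Several points are assumptions made for illustration and are not specified in the text: entropies are computed in bits, the first order entropy over single nucleotides and the second order entropy over overlapping dinucleotides; "the first two nucleotides are not G’s" and "the last two nucleotides are not T’s" are read strictly (no G in the first two positions, no T in the last two); and the list is built greedily in candidate order. All function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy (bits) of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def second_order_entropy(seq):
    """Entropy over overlapping dinucleotides (assumed definition)."""
    return shannon_entropy([seq[i:i + 2] for i in range(len(seq) - 1)])


def max_homopolymer(seq):
    """Length of the longest run of identical nucleotides."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def passes_filters(seq):
    """Per-barcode criteria: entropy, GC content, end bases, homopolymers."""
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return (
        shannon_entropy(seq) >= 1.5
        and second_order_entropy(seq) >= 2.5
        and 0.35 <= gc <= 0.65
        and "G" not in seq[:2]      # strict reading of the 5' rule
        and "T" not in seq[-2:]     # strict reading of the 3' rule
        and max_homopolymer(seq) <= 2
    )


def select_barcodes(candidates, min_distance=5):
    """Greedily keep candidates that pass the filters and are at least
    `min_distance` away from every barcode already selected."""
    selected = []
    for seq in candidates:
        if passes_filters(seq) and all(
            hamming(seq, kept) >= min_distance for kept in selected
        ):
            selected.append(seq)
    return selected
```

A greedy pass is simple but order-dependent: it does not guarantee the largest possible list, so candidates may be shuffled or sorted by a quality score before selection if a larger set is needed.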

The one or more oligonucleotide molecules of the invention may be linked by attachment to a solid support (e.g. a bead).

Also contemplated are one or more kits for performing one or more methods according to the invention. The one or more kits comprise i) a set of oligonucleotide molecules consisting of 96 oligonucleotide molecules, each molecule comprising, from 5' to 3', a) a sequencing adaptor, b) a barcode sequence independently selected from the group consisting of SEQ ID NO: 1 to 96, c) optionally a UMI and, d) an mRNA capture sequence, ii) a support, such as a 96-well plate, and iii) reagents for sequencing.

The kit can comprise various molecular biology reagents, including DNA polymerases, RNA polymerases, Reverse-transcriptases, DNA ligases, RNA ligases, transposases, viral integrase, CRISPR/Cas9, zinc finger nucleases, transcription activator-like effector nucleases, exonucleases, endonucleases, Polynucleotide Kinases, nucleotides, oligonucleotides, modified oligonucleotides and optimized buffers.

Further contemplated is the use of the kit of the invention, or of the one or more oligonucleotide molecules of the invention, or of a library of the invention, or of a barcode oligonucleotide sequence or a combination of one or more thereof, of the invention, in a single-cell RNA profiling method. Preferably, the single-cell RNA profiling method is similar to the Bulk RNA Barcoding and sequencing (BRB-seq) method described in Alpern et al., 2019 Genome Biol. 20, 71.

Those skilled in the art will appreciate that the invention described herein is susceptible to variations and modifications other than those specifically described. It is to be understood that the invention includes all such variations and modifications without departing from the spirit or essential characteristics thereof. The invention also includes all of the steps, features, compositions and compounds referred to or indicated in this specification, individually or collectively, and any and all combinations of any two or more of said steps or features. The present disclosure is therefore to be considered as in all aspects illustrative and not restrictive, the scope of the invention being indicated by the appended Claims, and all changes which come within the meaning and range of equivalency are intended to be embraced therein. Various references are cited throughout this Specification, each of which is incorporated herein by reference in its entirety. The foregoing description will be more fully understood with reference to the following Examples.

EXAMPLES

Second-strand synthesis

Double-stranded cDNA was generated by either PCR amplification (indicated as PCR in the text) or nick translation (indicated as SSS in the text) [24]. The PCR was performed in 50 μL total reaction volume using 20 μL of pooled and ExoI-treated first-strand reaction, 1 μL of 10 μM LA oligo (Microsynth) primer, 1 μL of dNTP (0.2 mM), 1 μL of Advantage 2 Polymerase Mix (Clontech, #639206), 5 μL of Advantage 2 PCR buffer, and 22 μL of water following the program (95 °C for 1 min; 10 cycles: 95 °C for 15 s, 65 °C for 30 s, 68 °C for 6 min; final elongation at 72 °C for 10 min). Alternatively, the second strand was synthesized following the nick translation method. For that, a mix containing 2 μL of RNAse H (NEB, #M0297S), 1 μL of Escherichia Polymerase (NEB, #M0209L), 1 μL of dNTP (0.2 mM), 10 μL of 5x Second Strand Buffer (100 mM Tris-HCl (pH 6.9) (AppliChem, #A3452); 25 mM MgCl2 (Sigma, #M2670); 450 mM KCl (AppliChem, #A2939); 0.8 mM β-NAD; 60 mM (NH4)2SO4 (Fisher Scientific Acros, #AC20587)), and 11 μL of water was added to 20 μL of ExoI-treated first-strand reaction on ice. The reaction was incubated at 16 °C for 2.5 h or overnight. Full-length double-stranded cDNA was purified with 30 μL (0.6x) of AMPure XP magnetic beads (Beckman Coulter, #A63881) and eluted in 20 μL of water.
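As a quick consistency check of the volumes given for the PCR route, the listed components can be summed and the bead-to-sample ratio recomputed. The sketch below is only bookkeeping; the dictionary keys and function names are illustrative, and it assumes the 0.6x bead ratio refers to the 50 μL reaction volume.

```python
# Component volumes (in microlitres) of the second-strand PCR mix, as listed above.
pcr_mix_ul = {
    "pooled, exonuclease-treated first-strand reaction": 20,
    "LA oligo primer": 1,
    "dNTP": 1,
    "Advantage 2 Polymerase Mix": 1,
    "Advantage 2 PCR buffer": 5,
    "water": 22,
}


def total_volume(mix):
    """Sum of all component volumes in a reaction mix."""
    return sum(mix.values())


def bead_ratio(bead_volume_ul, sample_volume_ul):
    """Bead-to-sample volume ratio used for clean-up (e.g. 0.6 means 0.6x)."""
    return bead_volume_ul / sample_volume_ul


print(total_volume(pcr_mix_ul))                     # total reaction volume in uL
print(bead_ratio(30, total_volume(pcr_mix_ul)))     # ratio for 30 uL of beads
```

The components add up to the stated 50 μL, and 30 μL of beads on a 50 μL reaction corresponds to the stated 0.6x ratio.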

Library preparation and sequencing

The sequencing libraries were prepared by tagmentation of 1-50 ng of full-length double-stranded cDNA. Tagmentation was done either with Illumina Nextera XT kit (Illumina, #FC-131-1024) following the manufacturer’s recommendations or with in-house produced Tn5 preloaded with dual (Tn5-A/B) or same adapters (Tn5-B/B) under the following conditions: 1 μL (11 μM) Tn5, 4 μL of 5x TAPS buffer (50 mM TAPS (Sigma, #T5130), and 25 mM MgCl2 (Sigma, #M2670)) in 20 μL total volume. The reaction was incubated for 10 min at 55 °C followed by purification with DNA Clean & Concentrator-5 kit (Zymo Research) and elution in 21 μL of water. After that, the tagmented library (20 μL) was PCR amplified using 25 μL NEBNext High-Fidelity 2X PCR Master Mix (NEB, #M0541L), 2.5 μL of P5 BRB primer (5 μM, Microsynth), and 2.5 μL of oligo bearing Illumina index (Idx7N5, 5 μM, IDT) using the following program: incubation at 72 °C for 3 min, denaturation at 98 °C for 30 s; 10 cycles: 98 °C for 10 s, 63 °C for 30 s, 72 °C for 30 s; final elongation at 72 °C for 5 min. The fragments ranging from 200 to 1000 bp were size-selected using AMPure beads (Beckman Coulter, #A63881) (first round 0.5x beads, second 0.7x). The libraries were profiled with High Sensitivity NGS Fragment Analysis Kit (Advanced Analytical, DNF-474) and measured with Qubit dsDNA HS Assay Kit (Invitrogen, #Q32851) prior to pooling and sequencing on the Illumina NextSeq 500 platform using a custom ReadOne primer (IDT) and the High Output v2 kit (75 cycles) (Illumina, #FC-404-2005). The library loading concentration was 2.2 pM. Read 1 sequencing was performed for 6-21 cycles and read 2 for 54-70 cycles depending on the experiment.
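The working concentrations in the Tn5 tagmentation reaction follow from a simple dilution calculation. The sketch below assumes that the composition given in parentheses (50 mM TAPS, 25 mM MgCl2) describes the 5x stock buffer rather than the final reaction; under that assumption, 4 μL of stock in a 20 μL reaction is a five-fold dilution. The function name is illustrative.

```python
def final_concentration(stock_conc, stock_volume_ul, total_volume_ul):
    """Concentration after dilution, from C1 * V1 = C2 * V2.

    The result is in the same unit as `stock_conc`.
    """
    return stock_conc * stock_volume_ul / total_volume_ul


# 4 uL of 5x TAPS buffer in a 20 uL tagmentation reaction.
buffer_fold = final_concentration(5, 4, 20)    # buffer strength, in "x"
taps_mm = final_concentration(50, 4, 20)       # TAPS, mM (assumes 50 mM in the stock)
mgcl2_mm = final_concentration(25, 4, 20)      # MgCl2, mM (assumes 25 mM in the stock)
print(buffer_fold, taps_mm, mgcl2_mm)
```

Under the stated assumption the reaction contains 1x buffer, i.e. 10 mM TAPS and 5 mM MgCl2.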




 