

Title:
TRANSCRIPTIONAL RECORDING BY CRISPR SPACER ACQUISITION FROM RNA
Document Type and Number:
WIPO Patent Application WO/2020/053299
Kind Code:
A1
Abstract:
The present invention relates to a method for recording a transcriptome of a cell, the method comprising the steps of: providing a test cell comprising: a first transgene nucleic acid sequence encoding a fusion protein comprising a reverse transcriptase polypeptide and a Cas1 polypeptide and a second transgene nucleic acid sequence encoding a Cas2 polypeptide, wherein said first transgene nucleic acid sequence and said second transgene nucleic acid sequence are under transcriptional control of an inducible promoter sequence, and a third transgene nucleic acid sequence comprising a CRISPR direct repeat (DR) sequence; wherein said CRISPR direct repeat sequence is specifically recognizable by a RT-Cas1-Cas2 complex formed by the expression products of said first transgene nucleic acid sequence and said second transgene nucleic acid sequence, exposing said test cell to conditions under which expression of said first transgene nucleic acid sequence and said second transgene nucleic acid sequence is induced, wherein said RT-Cas1-Cas2 complex formed by expression products of said first transgene nucleic acid sequence and said second transgene nucleic acid sequence acquires protospacers from RNA molecules and integrates spacers into said third transgene nucleic acid sequence yielding a modified third transgene nucleic acid sequence, isolating said modified third transgene nucleic acid sequence from said test cell yielding an isolated third transgene nucleic acid sequence, and sequencing said isolated modified third transgene nucleic acid sequence.

Inventors:
PLATT RANDALL JEFFREY (CH)
SCHMIDT FLORIAN (CH)
Application Number:
PCT/EP2019/074267
Publication Date:
March 19, 2020
Filing Date:
September 11, 2019
Assignee:
ETH ZUERICH (CH)
International Classes:
C12N9/22; C12N15/10; C12N15/62; C12Q1/6806
Domestic Patent References:
WO2017142999A2 2017-08-24
WO2016205728A1 2016-12-22
WO2018191525A1 2018-10-18
WO2018201010A1 2018-11-01
Foreign References:
US20170275665A1 2017-09-28
EP18193881A 2018-09-11
Other References:
S. SILAS ET AL: "Direct CRISPR spacer acquisition from RNA by a natural reverse transcriptase-Cas1 fusion protein", SCIENCE, vol. 351, no. 6276, 25 February 2016 (2016-02-25), pages aad4234 - 1, XP055543958, ISSN: 0036-8075, DOI: 10.1126/science.aad4234
JEFF NIVALA ET AL: "Spontaneous CRISPR loci generation in vivo by non-canonical spacer integration", NATURE MICROBIOLOGY, vol. 3, no. 3, 1 March 2018 (2018-03-01), pages 310 - 318, XP055593848, DOI: 10.1038/s41564-017-0097-z
SETH L. SHIPMAN ET AL: "Molecular recordings by directed CRISPR spacer acquisition", SCIENCE, vol. 353, no. 6298, 9 June 2016 (2016-06-09), pages aaf1175, XP055406400, ISSN: 0036-8075, DOI: 10.1126/science.aaf1175
SETH L. SHIPMAN ET AL: "CRISPR-Cas encoding of a digital movie into the genomes of a population of living bacteria", NATURE, vol. 547, no. 7663, 1 July 2017 (2017-07-01), London, pages 345 - 349, XP055593969, ISSN: 0028-0836, DOI: 10.1038/nature23017
RAVI U. SHETH ET AL: "Multiplex recording of cellular events over time on CRISPR biological tape", SCIENCE, vol. 358, no. 6369, 15 December 2017 (2017-12-15), pages 1457 - 1461, XP055587702, ISSN: 0036-8075, DOI: 10.1126/science.aao0958
S. D. PERLI ET AL: "Continuous genetic recording with self-targeting CRISPR-Cas in human cells", SCIENCE, vol. 353, no. 6304, 18 August 2016 (2016-08-18), pages aag0511 - aag0511, XP055309113, ISSN: 0036-8075, DOI: 10.1126/science.aag0511
F. FARZADFARD ET AL: "Genomically encoded analog memory with precise in vivo DNA writing in living cell populations", SCIENCE, vol. 346, no. 6211, 14 November 2014 (2014-11-14), pages 1256272 - 1256272, XP055256180, ISSN: 0036-8075, DOI: 10.1126/science.1256272
SCHMIDT FLORIAN ET AL: "Transcriptional recording by CRISPR spacer acquisition from RNA", NATURE, MACMILLAN JOURNALS LTD., LONDON, vol. 562, no. 7727, 3 October 2018 (2018-10-03), pages 380 - 385, XP036614322, ISSN: 0028-0836, [retrieved on 20181003], DOI: 10.1038/S41586-018-0569-1
SHIPMAN ET AL., SCIENCE, vol. 353, no. 6298, 2016, pages aaf1175
SHIPMAN ET AL., NATURE, vol. 547, 2017, pages 346 - 349
SHETH ET AL., SCIENCE, 10.1126/SCIENCE.AAO0958, 2017
SMITH AND WATERMAN, ADV. APPL. MATH., vol. 2, 1981, pages 482
NEEDLEMAN AND WUNSCH, J. MOL. BIOL., vol. 48, 1970, pages 443
PEARSON AND LIPMAN, PROC. NAT. ACAD. SCI., vol. 85, 1988, pages 2444
ALTSCHUL ET AL., J. MOL. BIOL., vol. 215, 1990, pages 403 - 410
Attorney, Agent or Firm:
SCHULZ JUNGHANS PATENTANWÄLTE PARTGMBB (DE)
Claims:

1. A method for recording a transcript, particularly for recording a transcriptome, of a cell, the method comprising the steps of:

providing a test cell comprising:

• a first transgene nucleic acid sequence encoding a fusion protein comprising a reverse transcriptase polypeptide and a Cas1 polypeptide and a second transgene nucleic acid sequence encoding a Cas2 polypeptide, wherein said first transgene nucleic acid sequence and said second transgene nucleic acid sequence are under transcriptional control of an inducible promoter sequence, and

• a third transgene nucleic acid sequence comprising a CRISPR direct repeat (DR) sequence; wherein said CRISPR direct repeat sequence is specifically recognizable by an RT-Cas1-Cas2 complex formed by the expression products of said first transgene nucleic acid sequence and said second transgene nucleic acid sequence,

in an exposure step, exposing said test cell to conditions under which expression of said first transgene nucleic acid sequence and said second transgene nucleic acid sequence is induced, wherein said RT-Cas1-Cas2 complex formed by expression products of said first transgene nucleic acid sequence and said second transgene nucleic acid sequence

o acquires at least one protospacer, particularly more than one protospacer, from one or more nucleic acid molecules, more particularly one or more RNA molecules, and

o integrates said protospacer as spacer into said third transgene nucleic acid sequence

yielding a modified third transgene nucleic acid sequence comprising at least one integrated spacer,

isolating said modified third transgene nucleic acid sequence from said test cell yielding an isolated modified third transgene nucleic acid sequence, and sequencing said isolated modified third transgene nucleic acid sequence.

2. The method according to claim 1, wherein said third transgene nucleic acid sequence further comprises a CRISPR leader sequence, wherein said CRISPR leader sequence is specifically recognizable by said RT-Cas1-Cas2 complex formed by the expression products of said first transgene nucleic acid sequence and said second transgene nucleic acid sequence.

3. The method according to claim 1 or 2, wherein said third transgene nucleic acid sequence does not comprise any further CRISPR direct repeat sequence.

4. The method according to any one of the preceding claims, wherein said test cell additionally comprises

a fourth transgene nucleic acid sequence encoding a sensor, wherein said sensor will be activated when contacted with an analyte molecule yielding an activated sensor, wherein said activated sensor will induce the expression of a record gene inside the cell;

and wherein in said exposure step, if said analyte molecule is present, said activated sensor induces the expression of a record gene inside the cell and RNA derived from said record gene is acquired as a spacer.

5. The method according to any one of the preceding claims, wherein said CRISPR leader sequence and/or said CRISPR direct repeat sequence are specifically recognizable by an RT-Cas1-Cas2 complex of F. saccharivorans, Candidatus Accumulibacter, Eubacterium saburreum, Bacteroides fragilis, Campylobacter fetus, Teredinibacter turnerae, Woodsholea maritima, Desulfarculus baarsii, Azospirillum lipoferum, Cellulomonas bogoriensis, Micromonospora rosaria, Tolypothrix campylonemoides, Oscillatoriales cyanobacterium, or Rivularia sp., or an RT-Cas1-Cas2 complex originating therefrom.

6. The method according to any one of the preceding claims, wherein said test cell is an E. coli cell.

7. The method according to any one of the preceding claims, wherein said third transgene nucleic acid sequence is comprised within a vector, particularly an expression vector.

8. The method according to claim 7, wherein said first transgene nucleic acid sequence and said second transgene nucleic acid sequence are comprised within said vector.

9. The method according to any one of the preceding claims, wherein said conditions, under which expression of said first transgene nucleic acid sequence and said second transgene nucleic acid sequence is induced, lead to an overexpression of said first transgene nucleic acid sequence and said second transgene nucleic acid sequence.

10. The method according to any one of the preceding claims, wherein said conditions, under which expression of said first transgene nucleic acid sequence and said second transgene nucleic acid sequence is induced,

comprise contacting said test cell with an inducer compound, particularly IPTG, lactose, arabinose, rhamnose or anhydrotetracycline; or

comprise anaerobic conditions and said inducible promoter is an anaerobically inducible promoter.

11. The method according to any one of the preceding claims, wherein

said third transgene nucleic acid sequence comprises an endonuclease recognition site sequence downstream of or within said CRISPR direct repeat, and said endonuclease recognition site sequence is specifically recognizable by a site-specific endonuclease, particularly a restriction endonuclease, wherein particularly said CRISPR direct repeat and said restriction site sequence are separated by 20 bps to 0 bps, and said site-specific endonuclease is particularly a Type IIS or Type IIG restriction endonuclease, particularly FaqI, BsmFI, BslFI, FinI, or BpuSI, and

said isolated modified third transgene nucleic acid sequence is contacted with said specific endonuclease before said sequencing, wherein said (full length) CRISPR direct repeat (adjacent to said endonuclease site) is cleaved into a truncated CRISPR direct repeat sequence.

12. The method according to claim 11, wherein said sequencing comprises the use of a PCR primer, wherein said PCR primer comprises a nucleic acid sequence being essentially complementary to part of a full length CRISPR direct repeat sequence, but not fully complementary to said truncated CRISPR direct repeat sequences resulting from said endonuclease cleavage, within said modified third nucleic acid sequence, wherein said full length CRISPR direct repeat sequence results from or is formed by at least one spacer acquisition event.

13. The method according to any one of the preceding claims, wherein said first transgene nucleic acid sequence encoding a fusion protein comprising a reverse transcriptase polypeptide and a Cas1 polypeptide comprises or essentially consists of a sequence selected from SEQ ID NO 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, and 31, or a sequence at least 85% identical, particularly >90%, >93%, >95%, >98% or >99% identical to SEQ ID NO 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, or 31 and encoding a polypeptide having substantially the same biological functionality as the polypeptide encoded by SEQ ID NO 7.

14. The method according to any one of the preceding claims, wherein said second transgene nucleic acid sequence encoding a Cas2 polypeptide comprises or essentially consists of a sequence selected from SEQ ID NO 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, and 32, or a sequence at least 85% identical, particularly >90%, >93%, >95%, >98% or >99% identical to SEQ ID NO 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, or 32, and encoding a polypeptide having substantially the same biological functionality as the polypeptide encoded by SEQ ID NO 8.

15. The method according to any one of the preceding claims, wherein said first transgene nucleic acid sequence and said second transgene nucleic acid sequence together comprise or essentially consist of a sequence of SEQ ID NO 34, or a sequence at least 85% identical, particularly >90%, >93%, >95%, >98% or >99% identical to SEQ ID NO 34 and encoding polypeptides having substantially the same biological functionality as the polypeptides encoded by SEQ ID NO 34.

16. The method according to any one of the preceding claims, wherein said third transgene nucleic acid sequence comprising a CRISPR direct repeat (DR) sequence comprises or essentially consists of a sequence selected from SEQ ID NO 35 to 103.

17. An isolated nucleic acid molecule comprising:

a CRISPR direct repeat (DR),

wherein said isolated nucleic acid molecule does not comprise any further CRISPR direct repeat sequence.

18. The isolated nucleic acid molecule according to claim 17 additionally comprising a CRISPR leader sequence.

19. The isolated nucleic acid molecule according to claim 18, wherein said CRISPR leader sequence and said CRISPR direct repeat sequence are separated by 10 to 0 bp.

20. The isolated nucleic acid molecule according to any one of claims 17 to 19, further comprising an endonuclease recognition site sequence downstream of or within said CRISPR direct repeat, wherein said endonuclease recognition site sequence is specifically recognizable by a site-specific endonuclease, particularly a site-specific restriction endonuclease, and wherein particularly said CRISPR direct repeat and said restriction site sequence are separated by 20 bps to 0 bps, particularly by 10 bps to 0 bps.

21. The isolated nucleic acid molecule according to claim 20, wherein said site-specific endonuclease is a Type IIS or Type IIG restriction endonuclease, particularly FaqI, BsmFI, BslFI, FinI, or BpuSI.

22. The isolated nucleic acid molecule according to any one of claims 17 to 21, wherein said CRISPR leader sequence and/or said CRISPR direct repeat sequence are specifically recognizable by an RT-Cas1-Cas2 complex of F. saccharivorans, Candidatus Accumulibacter, Eubacterium saburreum, Bacteroides fragilis, Campylobacter fetus, Teredinibacter turnerae, Woodsholea maritima, Desulfarculus baarsii, Azospirillum lipoferum, Cellulomonas bogoriensis, Micromonospora rosaria, Tolypothrix campylonemoides, Oscillatoriales cyanobacterium, or Rivularia sp., or an RT-Cas1-Cas2 complex originating therefrom.

23. An expression vector comprising the following sequence elements:

a first nucleic acid sequence encoding a fusion protein of a reverse transcriptase and a Cas1 polypeptide, and a second nucleic acid sequence encoding a Cas2 polypeptide, wherein said first nucleic acid sequence and said second nucleic acid sequence are under transcriptional control of an inducible promoter sequence, and

a CRISPR array sequence comprising a CRISPR direct repeat (DR) sequence, wherein said CRISPR direct repeat sequence is specifically recognizable by a RT-Cas1-Cas2 complex formed by the expression products of said first nucleic acid sequence and said second nucleic acid sequence.

24. The expression vector according to claim 23, wherein said CRISPR array sequence further comprises a CRISPR leader sequence, wherein said CRISPR leader sequence and said CRISPR direct repeat sequence are separated by 10 to 0 bp.

25. The expression vector according to claim 23 or 24, wherein said CRISPR array sequence does not comprise any further CRISPR repeat sequence specifically recognizable by said RT-Cas1-Cas2 complex.

26. The expression vector according to any one of claims 23 to 25, further comprising an endonuclease recognition site sequence downstream of or within said CRISPR direct repeat, wherein said endonuclease recognition site sequence is specifically recognizable by a site-specific endonuclease, particularly a site-specific restriction endonuclease, and said CRISPR direct repeat and said restriction site sequence are separated by 10 bps to 0 bps.

27. The expression vector according to claim 26, wherein said site-specific endonuclease is a Type IIS or Type IIG restriction endonuclease, particularly FaqI, BsmFI, BslFI, FinI, or BpuSI.

28. The expression vector according to any one of claims 23 to 27, wherein said CRISPR leader sequence, said CRISPR direct repeat sequence, said first nucleic acid sequence and said second nucleic acid sequence originate from F. saccharivorans, Candidatus Accumulibacter, Eubacterium saburreum, Bacteroides fragilis, Campylobacter fetus, Teredinibacter turnerae, Woodsholea maritima, Desulfarculus baarsii, Azospirillum lipoferum, Cellulomonas bogoriensis, Micromonospora rosaria, Tolypothrix campylonemoides, Oscillatoriales cyanobacterium, or Rivularia sp.

29. The expression vector according to any one of claims 23 to 28, wherein said inducible promoter sequence is operable in E. coli and is particularly selected from T7 promoter, lac promoter, tac promoter, Ptet promoter, Pc promoter and PBAD promoter.

30. The expression vector according to any one of claims 23 to 29, wherein said first transgene nucleic acid sequence and said second transgene nucleic acid sequence are codon-optimized for E. coli.

31. A cell comprising

a first transgene nucleic acid sequence encoding a fusion protein of a reverse transcriptase and a Cas1 polypeptide, and

a second transgene nucleic acid sequence encoding a Cas2 polypeptide, wherein said first transgene nucleic acid sequence and said second transgene nucleic acid sequence are under transcriptional control of an inducible promoter sequence, and

a transgene nucleic acid molecule according to any one of claims 15 to 20, wherein said first transgene nucleic acid sequence, said second transgene nucleic acid sequence and said transgene nucleic acid molecule are comprised in an expression vector according to any one of claims 23 to 30 or integrated into the genome of said cell.

32. The cell according to claim 31, additionally comprising

a fourth transgene nucleic acid sequence encoding a fourth transgene product, wherein said fourth transgene product is capable of modulating the expression of a record gene inside the cell, and wherein such modulating the expression of said record gene is dependent on the presence or absence of an analyte molecule.

33. The cell according to claim 32, wherein said fourth transgene product is a sensor which will be activated when contacted with a molecule of interest yielding an activated sensor, wherein said activated sensor will induce the expression of a record gene inside the cell.

34. A method for monitoring of a diet of a patient or for diagnosis of a disease of a patient, particularly of a digestive or gastrointestinal disorder of a patient, said method comprising the steps of

collecting a cell according to any one of claims 31 to 33 from a feces sample collected from said patient, wherein said cell has been previously applied orally to said patient,

isolating the transgene nucleic acid sequence from said cell yielding an isolated transgene nucleic acid sequence, and

sequencing said isolated transgene nucleic acid sequence

thereby recording one or more transcripts of said cell produced in the environment of the gastrointestinal tract.

35. An apparatus for conducting the method of claim 34.

Description:
Transcriptional recording by CRISPR spacer acquisition from RNA

The present invention relates to a method and means for recording changes in the transcriptome of a cell.

This application claims the right to the priority of EP application No. EP18193881.2, filed on 11 September 2018, the contents of which are incorporated herein by reference.

Background

A central challenge in biology is to understand how the molecular components of a cell function and integrate to enable complex cell behaviors. This challenge has fueled the creation of increasingly sophisticated technologies facilitating detailed intracellular observations at the level of DNA, RNA, protein, and metabolites. In particular, RNA sequencing technologies enable transcriptome quantification within multiple or single cells, revealing the molecular signatures of cell behaviors, states, and types with unprecedented detail. Despite the power of these technologies, they require destructive methods and therefore observations are limited to a few snapshots in time or select asynchronous cellular processes. One provocative solution to this is to introduce synthetic memory devices within cells that enable encoding, storage, and retrieval of transcriptional information.

The bacterial adaptive immune system CRISPR-Cas embodies the ideal molecular recorder. Molecular memories of plasmid or viral infections are stored within CRISPR arrays in the form of short nucleic acid segments (spacers) separated by direct repeats (DRs). New memories are acquired via the action of Cas1 and Cas2, which as a complex integrate new spacers ahead of old spacers (that is, next to or proximal to the leader sequence) within the CRISPR array, thereby providing a temporal memory of molecular events. The prototype Type I-E CRISPR acquisition system from E. coli was recently leveraged to store arbitrary information and quantifiable records of defined stimuli within bacterial populations (Shipman et al., Science, vol. 353(6298), (2016), aaf1175; Shipman et al., Nature, vol. 547, (2017), 346-349; and Sheth et al., Science, 10.1126/science.aao0958, (2017)). These systems elegantly demonstrate the potential of using CRISPR spacer acquisition as a molecular recorder, but they are currently limited by the need to electroporate chemically synthesized nucleotides or, analogous to prior technologies, the availability of inducible promoters. Moreover, these systems acquire spacers derived from DNA but not RNA, and therefore do not globally reflect the transcriptional history of a cell.

Based on this background, it is the objective of the present invention to provide a method and means for recording changes in the expression pattern of RNAs within the living cell without destroying the cell. This objective is attained by the subject matter of the claims of the present specification.

Terms and definitions

The term CRISPR is an abbreviation for clustered regularly interspaced short palindromic repeats.

In the context of the present specification, the term spacer relates to polynucleotides that are inserted into a CRISPR array. The complex of Cas1 and Cas2 cuts the DNA inside the CRISPR array and integrates spacers at that position. Spacers are integrated upstream of a direct repeat sequence.

In the context of the present specification, the term CRISPR array refers to a nucleic acid sequence, in which acquired spacers are inserted or integrated by a Cas1-Cas2 complex.

In the context of the present specification, the term protospacer relates to the precursor of a spacer before being integrated into the CRISPR array as spacer. If the protospacer is a single-stranded RNA, the RNA is first integrated into the CRISPR array and then reverse-transcribed into DNA.
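The relationship between leader, direct repeat and spacers described above can be illustrated with a toy data model. The following Python sketch is not part of the specification; the class name `CrisprArray`, its fields and the short placeholder sequences used with it are hypothetical, and it assumes a simplified layout in which each integration adds one spacer and one repeat copy at the leader-proximal end. It does not model the reverse-transcription step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CrisprArray:
    """Toy model: a leader followed by direct repeats (DR) that flank spacers."""
    leader: str
    direct_repeat: str
    # Index 0 is the leader-proximal (most recently acquired) spacer.
    spacers: List[str] = field(default_factory=list)

    def integrate(self, spacer: str) -> None:
        """Insert a new spacer at the leader-proximal end of the array."""
        self.spacers.insert(0, spacer)

    def sequence(self) -> str:
        """Return leader, then one DR-spacer unit per spacer, then a closing DR."""
        units = "".join(self.direct_repeat + s for s in self.spacers)
        return self.leader + units + self.direct_repeat
```

Because new spacers are always placed next to the leader, reading `spacers` from index 0 onwards walks from the most recent to the oldest acquisition event, which is the temporal memory referred to in the background section.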

In the context of the present specification, the term transgene or transgenic relates to a gene or coding sequence, partially or fully originating from a different organism than the host organism, in relation to which the sequence is a transgene sequence.

In the context of the present specification, the term codon-optimized relates to a change of nucleotide sequence without changing the amino acid sequence it encodes. Every organism has a certain codon usage and by optimizing the codons with respect to the host organism, the efficiency of expression may be increased.
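As an illustration of this definition, the following Python sketch replaces codons by synonymous ones and leaves the encoded amino acid sequence unchanged. It is a minimal sketch and not the procedure used for the sequences of the invention: the function names `translate` and `recode` are hypothetical, the standard genetic code is assumed, and the `preferred` mapping passed by the caller is an illustrative placeholder rather than a measured codon usage table.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def _codons(dna: str):
    """Split a coding sequence into codons; the length must be a multiple of 3."""
    if len(dna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def translate(dna: str) -> str:
    """Translate a DNA coding sequence into one-letter amino acid codes."""
    return "".join(CODON_TABLE[codon] for codon in _codons(dna))


def recode(dna: str, preferred: dict) -> str:
    """Swap each codon for the caller's preferred synonymous codon, if any."""
    for aa, codon in preferred.items():
        if CODON_TABLE[codon] != aa:
            raise ValueError(f"{codon} does not encode {aa}")
    return "".join(preferred.get(CODON_TABLE[c], c) for c in _codons(dna))
```

A call such as `recode("ATGTTACGACTA", {"L": "CTG", "R": "CGT"})` changes the nucleotide sequence while `translate` returns the same protein for the input and the output.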

In the context of the present specification, the term overexpression relates to the expression of an artificially introduced gene, which is higher than the expression of a constitutively expressed gene such as a housekeeping gene of the host organism, particularly two-fold higher, more particularly 5-fold higher, even more particularly 10-fold higher.

In the context of the present specification, the term transcriptome relates to the set of all RNAs inside the host or test cell, particularly the set of all mRNAs inside the host or test cell.

In the context of the present specification, the term leader sequence relates to a nucleic acid sequence that is located immediately before or after the first or last CRISPR direct repeat sequence of a CRISPR array or locus.

In the context of the present specification, the terms sequence identity and percentage of sequence identity refer to a single quantitative parameter representing the result of a sequence comparison determined by comparing two aligned sequences position by position. Methods for alignment of sequences for comparison are well-known in the art. Alignment of sequences for comparison may be conducted by the local homology algorithm of Smith and Waterman, Adv. Appl. Math. 2:482 (1981), by the global alignment algorithm of Needleman and Wunsch, J. Mol. Biol. 48:443 (1970), by the search for similarity method of Pearson and Lipman, Proc. Nat. Acad. Sci. 85:2444 (1988) or by computerized implementations of these algorithms, including, but not limited to: CLUSTAL, GAP, BESTFIT, BLAST, FASTA and TFASTA. Software for performing BLAST analyses is publicly available, e.g., through the National Center for Biotechnology Information (http://blast.ncbi.nlm.nih.gov/).
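The position-by-position comparison of two aligned sequences can be sketched as follows. This Python example is illustrative only and is not the BLAST implementation: the function name `percent_identity` is hypothetical, the two inputs are assumed to be already aligned to equal length with "-" as gap character, and the choice of denominator (all alignment columns that are not gap-only) is an assumption, since different tools use different conventions.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns in which both sequences have a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == gap and b == gap)]
    if not columns:
        raise ValueError("alignment contains no comparable columns")
    # A column counts as identical only if the residues match and are not gaps.
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)
```

A value computed this way can be compared against an identity threshold such as the 85% named in the claims, for example `percent_identity(a, b) >= 85.0`.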

One example for comparison of amino acid sequences is the BLASTP algorithm that uses the default settings: Expect threshold: 10; Word size: 3; Max matches in a query range: 0; Matrix: BLOSUM62; Gap Costs: Existence 11, Extension 1; Compositional adjustments: Conditional compositional score matrix adjustment. One such example for comparison of nucleic acid sequences is the BLASTN algorithm that uses the default settings: Expect threshold: 10; Word size: 28; Max matches in a query range: 0; Match/Mismatch Scores: 1, -2; Gap costs: Linear. Unless stated otherwise, sequence identity values provided herein refer to the value obtained using the BLAST suite of programs (Altschul et al., J. Mol. Biol. 215:403-410 (1990)) using the above identified default parameters for protein and nucleic acid comparison, respectively.

Detailed description of the invention

A first aspect of the invention relates to a method for recording a transcript, particularly a transcriptome, of a cell, the method comprising the steps of:

providing a test cell comprising:

• a first transgene nucleic acid sequence encoding a fusion protein comprising a reverse transcriptase polypeptide and a Cas1 polypeptide and a second transgene nucleic acid sequence encoding a Cas2 polypeptide, wherein the first transgene nucleic acid sequence and the second transgene nucleic acid sequence are under transcriptional control of an inducible or constitutive promoter sequence, and

• a third transgene nucleic acid sequence comprising a CRISPR direct repeat (DR) sequence; wherein the CRISPR direct repeat sequence is specifically recognizable by an RT-Cas1-Cas2 complex formed by the expression products of the first transgene nucleic acid sequence and the second transgene nucleic acid sequence,

exposing the test cell to conditions under which expression of the first transgene nucleic acid sequence and the second transgene nucleic acid sequence is induced, wherein the RT-Cas1-Cas2 complex formed by the expression products of the first transgene nucleic acid sequence and the second transgene nucleic acid sequence

o acquires at least one protospacer, particularly more than one protospacer, from one or more nucleic acid molecules, particularly one or more intracellular nucleic acid molecules, more particularly one or more RNA molecules, and

o integrates said protospacer as spacer into said third transgene nucleic acid sequence,

isolating the modified third transgene nucleic acid sequence from the test cell yielding an isolated modified third transgene nucleic acid sequence, and

sequencing the isolated modified third transgene nucleic acid sequence.

Acquisition of protospacers is performed by RT-Cas1 and Cas2 forming a complex which associates itself with nucleic acid molecules, particularly with RNA molecules. RT-Cas1 and Cas2 encoded by the first and second transgene nucleic acid sequence form a stable, functional complex that is able to acquire protospacers, particularly from RNA, integrate them into CRISPR arrays and reverse-transcribe them. Thus, protospacers are transformed into spacers, which are pieces of DNA inside the CRISPR array. These spacers can be isolated and sequenced to elucidate the sequence of the protospacers, which are derived from the transcriptome.
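The final read-out, recovering spacer sequences from a sequenced CRISPR array, can be sketched as follows. This Python example is a simplified illustration and not the analysis pipeline of the invention: the function name `extract_spacers` is hypothetical, the read is assumed to cover the array on a single strand, and direct repeats are located by exact string matching, whereas real sequencing data would call for error-tolerant matching.

```python
from typing import List


def extract_spacers(read: str, direct_repeat: str) -> List[str]:
    """Return the sequences lying between consecutive direct repeats in a read."""
    if not direct_repeat:
        raise ValueError("direct_repeat must not be empty")
    # Collect the start position of every non-overlapping repeat occurrence.
    positions = []
    start = read.find(direct_repeat)
    while start != -1:
        positions.append(start)
        start = read.find(direct_repeat, start + len(direct_repeat))
    # A spacer is whatever sits between the end of one repeat and the next repeat.
    spacers = []
    for left, right in zip(positions, positions[1:]):
        spacer = read[left + len(direct_repeat):right]
        if spacer:
            spacers.append(spacer)
    return spacers
```

If the read starts at the leader, the returned list is ordered from the most recently acquired spacer to the oldest. Each recovered spacer could then be aligned against the transcript sequences of the test cell to identify the RNA it was derived from.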

Alternatively, the first and second transgene nucleic acid sequence may be under transcriptional control of a constitutive promoter or a promoter expressed under auxotrophic conditions such as hypoxic or anaerobic conditions.

The protospacer acquired by the RT-Cas1-Cas2 complex encoded by the first and second transgene nucleic acid may originate from endogenous nucleic acids of the host cell or from transgene nucleic acid sequences or from exogenous nucleic acids from horizontal gene transfer or from exogenous synthetic nucleic acids introduced into the host cell.

In certain embodiments, said test cell additionally comprises

a fourth transgene nucleic acid sequence encoding a sensor, wherein said sensor will be activated when contacted with an analyte molecule yielding an activated sensor, wherein said activated sensor will induce the expression of a record gene inside the cell;

and wherein in said exposure step, if said analyte molecule is present, said activated sensor induces the expression of a record gene inside the cell and RNA derived from said record gene is acquired as a spacer.

Thus, in certain embodiments, the host cell further comprises a fourth transgene nucleic acid sequence under transcriptional control of an inducible promoter sequence or a constitutive promoter sequence. The inducible or constitutive promoter sequence may be equal to or different from the inducible or constitutive promoter sequence, which controls the expression of the first and second transgene nucleic acid sequence. Preferably, the fourth transgene nucleic acid sequence is under transcriptional control of a synthetic promoter sequence.

Advantageously, specific arbitrary sequences may be expressed and acquired as protospacers that are indicative of a specific stimulus (e.g. the inducing compound). For example, an E. coli cell is engineered to express a specific receptor for a biomarker of a human disease present in the gastrointestinal tract. The recording E. coli records, by the method of the invention, the downstream intracellular events enacted by the sensor (such as the expression of an arbitrary sequence like a transgene). This allows the recording E. coli cells to be equipped with multiple diagnostic sensors. Adding transcriptional recording on top of the sensors will aid in further distinguishing disease types or states. Non-limiting examples for suitable biomarkers include sfGFP, Rluc, Fluc. Additionally, non-limiting examples for suitable biomarkers include arbitrary sequences, that is, any composition of DNA nucleotides that is, for example, optimized to be preferentially integrated by the RT-Cas1-Cas2 complex and that is uniquely paired to the biomarker.

Particularly, the test cell may be a prokaryotic cell or a eukaryotic cell, depending in particular on the environment or conditions whose impact on the transcription of the test cell is to be determined.

In certain embodiments, the third transgene nucleic acid sequence further comprises a CRISPR leader sequence, wherein the CRISPR leader sequence is specifically recognizable by the RT-Cas1-Cas2 complex formed by the expression products of the first transgene nucleic acid sequence and the second transgene nucleic acid sequence. Particularly, the CRISPR direct repeat sequence and the CRISPR leader sequence are in immediate vicinity to each other, e.g. separated by not more than 10 to 0 bp.

Direct repeat sequences and leader sequences may appear in both possible orientations. Accordingly, the third transgene nucleic acid sequence comprising the direct repeat sequence and optionally the leader sequence may be on the sense or anti-sense strand of the DNA of the host organism, irrespective of whether the third transgene nucleic acid is integrated in the genome of the test cell or is comprised within a vector.

In certain embodiments, the third transgene nucleic acid sequence does not comprise any further CRISPR direct repeat sequence.

In certain embodiments, the CRISPR leader sequence and/or the CRISPR direct repeat sequence are specifically recognizable by a RT-Cas1-Cas2 complex of F. saccharivorans, Candidatus Accumulibacter (particularly sp. BA-91 or sp. SK-02), Eubacterium saburreum (particularly DSM 3986), Bacteroides fragilis (particularly strain S14), Campylobacter fetus (particularly subspecies fetus), Teredinibacter turnerae (particularly T8412), Woodsholea maritima, Desulfarculus baarsii (particularly DSM 2075), Azospirillum lipoferum (particularly 4B), Cellulomonospora bogoriensis (particularly 69B4), Micromonospora rosaria, Tolypothrix campylonemoides, Oscillatoriales cyanobacterium, or Rivularia sp. (particularly PCC 7116), or a RT-Cas1-Cas2 complex originating therefrom.

Particularly, an RT-Cas1-Cas2 originating from any one of the above-mentioned species also encompasses functionally equivalent polypeptides (RT-Cas1 and Cas2) having an amino acid or nucleic acid sequence identity of at least 70%, 75%, 80%, 85%, 90%, 95% or 99% to any one of the RT-Cas1-Cas2 complexes of the above-mentioned species. Likewise, an RT-Cas1-Cas2 originating from any one of the above-mentioned species also encompasses polypeptides with identical amino acid sequences but codon-optimized nucleic acid sequences encoding RT-Cas1 and/or Cas2.

In certain embodiments, the first and second transgene nucleic acid sequences comprise or essentially consist of one of the nucleic acid sequences characterized by SEQ ID NO 1 to 34, respectively, or a nucleic acid sequence encoding a functional equivalent with an identity of at least 70%, 80%, 85%, 90%, 95% or 98% to one of SEQ ID NO 1 to 34.

In certain embodiments, the third transgene nucleic acid sequence comprises or essentially consists of a nucleic acid sequence characterized by SEQ ID NO 35 to 103 or a nucleic acid sequence encoding a functional equivalent with an identity of at least 70%, 80%, 85%, 90%, 95% or 98% to one of SEQ ID NO 35 to 103.
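The identity thresholds recited above can be illustrated with a short calculation. The text does not specify how identity is computed, so the following sketch rests on an assumption: the two sequences have already been globally aligned (gaps written as "-"), and identity is the share of matching columns among all alignment columns. The function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two pre-aligned sequences of equal length.

    Columns in which both sequences carry a gap are ignored; a column
    counts as a match only if both residues are identical and not a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("no alignment columns to compare")
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def meets_threshold(aligned_a: str, aligned_b: str,
                    threshold: float = 70.0) -> bool:
    """True if the identity reaches the given threshold (in percent)."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Other conventions (for example, dividing by the length of the shorter sequence) give different values for the same pair, so the convention used should be stated alongside any reported percentage.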

In certain embodiments, the test cell is an E. coli cell. In certain embodiments, the test cell is an E. coli K12 strain or an E. coli B strain. In certain embodiments, the test cell is an E. coli strain selected from the list of BL21 (DE3), BL21AI, NovaBlue(DE3), BW25113, Stbl3, MG1655, JM83, Top10, Nissle 1917, and NGF-1.

In certain embodiments, the third transgene nucleic acid sequence is comprised within a vector. In certain embodiments, said first transgene nucleic acid sequence and said second transgene nucleic acid sequence are comprised within the vector, particularly an expression vector.

Alternatively, the third transgene nucleic acid sequence (CRISPR array) and/or the first and second transgene nucleic acid sequences (RT-Cas1-Cas2) can be integrated in the genome of the test cell.

In certain embodiments, the conditions, under which expression of the first transgene nucleic acid sequence and the second transgene nucleic acid sequence is induced, result in an overexpression of the first transgene nucleic acid sequence and the second transgene nucleic acid sequence.

In certain embodiments, the conditions, under which expression of the first transgene nucleic acid sequence and the second transgene nucleic acid sequence is induced, comprise contacting the test cell with an inducer compound and said inducible promoter is a promoter inducible by the inducer compound; or comprise anaerobic conditions and said inducible promoter is an anaerobically inducible promoter.

In certain embodiments, the inducer compound is IPTG, lactose, arabinose, rhamnose or anhydrotetracycline.

A promoter that is only active in the oxygen-poor (anaerobic) environment of the gut, and not in the oxygen-rich environment outside of the body, is called an anaerobically inducible promoter.

Alternatively, the inducible promoter may be induced by changes in the environment surrounding the test cell or by a changed environment, such as for example temperature, pH value, inflammation, micronutrients, macronutrients, or occurring hypoxic or anaerobic conditions.

In certain embodiments, the third transgene nucleic acid sequence comprises an endonuclease recognition site sequence downstream of or within the CRISPR direct repeat, wherein the endonuclease recognition site sequence is specifically recognizable by a site-specific endonuclease, particularly a site-specific restriction endonuclease. In certain embodiments, the CRISPR direct repeat and the restriction site sequence are separated by 10 bp to 0 bp. In certain embodiments, the site-specific endonuclease is a Type IIS restriction endonuclease, particularly FaqI, BsmFI, BslFI, FinI, or BpuSI.

In certain embodiments, the isolated modified third transgene nucleic acid sequence is contacted with the specific endonuclease before sequencing, wherein the full length CRISPR direct repeat adjacent to said endonuclease site is cleaved into a truncated CRISPR direct repeat sequence.

Advantageously, the site-specific restriction endonuclease truncates the direct repeat sequence most distant to the leader sequence. As the direct repeat sequence is duplicated upon spacer acquisition, modified third transgene nucleic acid sequences comprising at least one acquired spacer will still comprise a full length CRISPR direct repeat after digestion with the above named site-specific endonuclease, while unmodified third transgene nucleic acids (without acquired spacer) will comprise only a truncated CRISPR direct repeat sequence after digestion with the site-specific endonuclease.

In certain embodiments, the sequencing comprises the use of a PCR primer, wherein the PCR primer comprises a nucleic acid sequence being essentially complementary to a full length CRISPR direct repeat sequence within the modified third nucleic acid sequence, wherein the full length CRISPR direct repeat sequence results from or is formed by at least one spacer acquisition event, particularly the portion of said restriction site sequence that is cleaved away upon digestion with said site-specific restriction endonuclease.

The above-mentioned preferred PCR primer binds to this region, but not to the truncated CRISPR direct repeat within an unmodified third transgene nucleic acid sequence without acquired spacer. Thus, arrays with only a truncated single DR (i.e. no newly acquired spacers) have no primer binding sequence and are therefore not exponentially amplified. Thus, the site-specific restriction endonuclease site and the preferred primer advantageously enable preferentially amplifying arrays with a new spacer.
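The selection logic described above can be sketched as a toy model. All sequences below are hypothetical placeholders, and the cut offset is an illustrative assumption that does not reproduce the actual cleavage geometry of FaqI or any other Type IIS enzyme. Primer binding is modelled simply as the presence of an intact full-length direct repeat on the digested fragment.

```python
def build_array(leader: str, dr: str, spacers: list, site: str) -> str:
    """Assemble leader-DR-(spacer-DR)*n followed by the recognition site."""
    seq = leader + dr
    for spacer in spacers:
        seq += spacer + dr
    return seq + site


def digest(array: str, site: str, cut_offset: int) -> str:
    """Return the leader-proximal fragment after a Type IIS-like cut.

    The cut is placed cut_offset bases upstream of the recognition site,
    i.e. inside the direct repeat most distant from the leader.
    """
    pos = array.rfind(site)
    if pos < 0:
        return array  # no recognition site, nothing is cut
    return array[: max(pos - cut_offset, 0)]


def is_exponentially_amplifiable(fragment: str, dr: str) -> bool:
    """True if a full-length direct repeat (the primer site) remains."""
    return dr in fragment


# Hypothetical placeholder sequences, for illustration only.
LEADER = "TTGACA"
DR = "GTTTCAGACCGA"
SITE = "GGGAC"
SPACER = "ATGCATGCATGCATGCATGC"

parental = digest(build_array(LEADER, DR, [], SITE), SITE, cut_offset=5)
expanded = digest(build_array(LEADER, DR, [SPACER], SITE), SITE, cut_offset=5)
```

In this model the parental array retains only a truncated repeat and is rejected, whereas the expanded array keeps the leader-proximal repeat duplicated by spacer acquisition and is accepted.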

In certain embodiments, said first transgene nucleic acid sequence encoding a fusion protein comprising a reverse transcriptase polypeptide and a Cas1 polypeptide comprises or essentially consists of a sequence selected from SEQ ID NO 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, and 31, or a sequence at least 85% identical, particularly >90%, >93%, >95%, >98% or >99% identical to SEQ ID NO 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, or 31, and encoding a polypeptide having substantially the same biological functionality as the polypeptide encoded by SEQ ID NO 7.

In certain embodiments, said second transgene nucleic acid sequence encoding a Cas2 polypeptide comprises or essentially consists of a sequence selected from SEQ ID NO 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, and 32, or a sequence at least 85% identical, particularly >90%, >93%, >95%, >98% or >99% identical to SEQ ID NO 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, or 32, and encoding a polypeptide having substantially the same biological functionality as the polypeptide encoded by SEQ ID NO 8.

In certain embodiments, said first transgene nucleic acid sequence and said second transgene nucleic acid sequence together comprise or essentially consist of a sequence of SEQ ID NO 34, or a sequence at least 85% identical, particularly >90%, >93%, >95%, >98% or >99% identical to SEQ ID NO 34 and encoding polypeptides having substantially the same biological functionality as the polypeptides encoded by SEQ ID NO 34.

SEQ ID NO 34 can be described as a multi-gene encoding nucleic acid molecule or a synthetic operon, wherein both the first and the second polypeptide are under transcriptional control of the same promoter. The distinct protein coding sequences of the first and the second polypeptide are separated by an RBS (ribosomal binding site), which results in two distinct protein products.

In certain embodiments, said third transgene nucleic acid sequence comprising a CRISPR direct repeat (DR) sequence comprises or essentially consists of a sequence selected from SEQ ID NO 35 to 103.

A second aspect of the invention relates to an isolated nucleic acid molecule comprising: a CRISPR direct repeat (DR) sequence,

wherein the isolated nucleic acid molecule does not comprise any further CRISPR direct repeat sequence.

In certain embodiments, the isolated nucleic acid molecule additionally comprises a CRISPR leader sequence, wherein the CRISPR leader sequence may be upstream or downstream of the CRISPR direct repeat sequence. Particularly, the CRISPR direct repeat sequence and the CRISPR leader sequence are in immediate vicinity to each other, e.g. separated by not more than 10 to 0 bp.

In certain embodiments, the isolated nucleic acid molecule further comprises an endonuclease recognition site sequence downstream of or within said CRISPR direct repeat, wherein the endonuclease recognition site sequence is specifically recognizable by a site-specific endonuclease, particularly a site-specific restriction endonuclease. In certain embodiments, the CRISPR direct repeat and the endonuclease recognition site sequence are separated by 10 bp to 0 bp.

In certain embodiments, the CRISPR leader sequence and/or the CRISPR direct repeat sequence are specifically recognizable by a RT-Cas1-Cas2 complex of F. saccharivorans, Candidatus Accumulibacter (particularly sp. BA-91 or sp. SK-02), Eubacterium saburreum (particularly DSM 3986), Bacteroides fragilis (particularly strain S14), Campylobacter fetus (particularly subspecies fetus), Teredinibacter turnerae (particularly T8412), Woodsholea maritima, Desulfarculus baarsii (particularly DSM 2075), Azospirillum lipoferum (particularly 4B), Cellulomonospora bogoriensis (particularly 69B4), Micromonospora rosaria, Tolypothrix campylonemoides, Oscillatoriales cyanobacterium, or Rivularia sp. (particularly PCC 7116), or a RT-Cas1-Cas2 complex originating therefrom.

In certain embodiments, the site-specific endonuclease is a Type IIS restriction endonuclease, particularly FaqI, BsmFI, BslFI, FinI, or BpuSI.

In certain embodiments, the isolated nucleic acid molecule comprises or essentially consists of a nucleic acid sequence characterized by SEQ ID NO 35 to 103 or a nucleic acid sequence encoding a functional equivalent with an identity of at least 70%, 80%, 85%, 90%, 95% or 98% to one of SEQ ID NO 35 to 103.

A third aspect of the invention relates to an expression vector comprising the following sequence elements:

a first nucleic acid sequence encoding a fusion protein of a reverse transcriptase and a Cas1 polypeptide, and a second nucleic acid sequence encoding a Cas2 polypeptide, wherein the first nucleic acid sequence and the second nucleic acid sequence are under transcriptional control of an inducible promoter sequence, and

a CRISPR array sequence comprising a CRISPR direct repeat (DR) sequence, wherein the CRISPR direct repeat sequence is specifically recognizable by a RT-Cas1-Cas2 complex formed by the expression products of the first nucleic acid sequence and the second nucleic acid sequence.

In certain embodiments, the expression vector does not comprise any further CRISPR direct repeat sequences recognizable by the RT-Cas1-Cas2 complex encoded by the first and second transgene nucleic acid sequence.

In certain embodiments, the expression vector further comprises a CRISPR leader sequence, wherein the CRISPR leader sequence is specifically recognizable by the RT-Cas1-Cas2 complex formed by the expression products of the first nucleic acid sequence and the second nucleic acid sequence, and wherein particularly the CRISPR leader sequence and the CRISPR direct repeat sequence are separated by 10 to 0 bp.

In certain embodiments, the expression vector further comprises an endonuclease recognition site sequence downstream of or within said CRISPR direct repeat. In certain embodiments, the endonuclease recognition site sequence is specifically recognizable by a site-specific endonuclease, particularly a site-specific restriction endonuclease. In certain embodiments, said CRISPR direct repeat and said restriction site sequence are separated by 10 bp to 0 bp.

In certain embodiments, said site-specific endonuclease is a Type IIS restriction endonuclease, particularly FaqI, BsmFI, BslFI, FinI, or BpuSI.

In certain embodiments, the CRISPR leader sequence and/or the CRISPR direct repeat sequence are specifically recognizable by a RT-Cas1-Cas2 complex of F. saccharivorans, Candidatus Accumulibacter (particularly sp. BA-91 or sp. SK-02), Eubacterium saburreum (particularly DSM 3986), Bacteroides fragilis (particularly strain S14), Campylobacter fetus (particularly subspecies fetus), Teredinibacter turnerae (particularly T8412), Woodsholea maritima, Desulfarculus baarsii (particularly DSM 2075), Azospirillum lipoferum (particularly 4B), Cellulomonospora bogoriensis (particularly 69B4), Micromonospora rosaria, Tolypothrix campylonemoides, Oscillatoriales cyanobacterium, or Rivularia sp. (particularly PCC 7116), or a RT-Cas1-Cas2 complex originating therefrom.

In certain embodiments, the first and second transgene nucleic acid sequences comprise or essentially consist of one of the nucleic acid sequences characterized by SEQ ID NO 1 to 34, respectively, or a nucleic acid sequence encoding a functional equivalent with an identity of at least 70%, 80%, 85%, 90%, 95% or 98% to one of SEQ ID NO 1 to 34.

In certain embodiments, the CRISPR array sequence comprises or essentially consists of one of the nucleic acid sequences characterized by SEQ ID NO 35 to 103 or a nucleic acid sequence encoding a functional equivalent with an identity of at least 70%, 80%, 85%, 90%, 95% or 98% to one of SEQ ID NO 35 to 103.

In certain embodiments, said inducible promoter sequence is operable in E. coli and is particularly selected from T7 promoter, lac promoter, tac promoter, Ptet promoter, Pc promoter or PBAD promoter.

In certain embodiments, the first and second transgene nucleic acid sequences are codon-optimized for expression in E. coli.

A fourth aspect of the invention relates to a cell comprising an expression vector according to the third aspect or comprising

a first transgene nucleic acid sequence encoding a fusion protein of a reverse transcriptase and a Cas1 polypeptide, and a second transgene nucleic acid sequence encoding a Cas2 polypeptide, wherein the first transgene nucleic acid sequence and said second transgene nucleic acid sequence are under transcriptional control of an inducible promoter sequence, and

a transgene nucleic acid molecule according to the above aspect or any embodiment thereof,

wherein the first transgene nucleic acid sequence, the second transgene and the transgene nucleic acid molecule are

comprised in an expression vector according to the third aspect, or

integrated into the genome of said cell.

In certain embodiments, the cell additionally comprises

a fourth transgene nucleic acid sequence encoding a fourth transgene product, particularly a polypeptide sensor or a nucleic acid sensor, wherein said fourth transgene product is capable of modulating [directly or indirectly] the expression of a record gene inside the cell, and wherein such modulating the expression of said record gene is dependent on the presence or absence of an analyte molecule;

wherein said molecule of interest is selected from any molecule in the environment or inside of said cell, particularly a small molecule, and wherein said record gene is not expressed under conditions in which no activated sensor is present.

A small molecule in the context of the invention is a molecule with a molecular weight of below 800 Da.

In certain embodiments, said fourth transgene product is a sensor which will be activated when contacted with a molecule of interest yielding an activated sensor, wherein said activated sensor will induce [directly or indirectly] the expression of a record gene inside the cell.

Direct modulation of gene expression is achieved when the fourth transgene product is a transcription factor, which is able to induce expression directly.

Indirect modulation of gene expression is achieved when the fourth transgene product is a receptor, which, when activated, starts a signal cascade leading to a modulation of gene expression.

A fifth aspect of the invention relates to a method for monitoring of a diet of a patient or for diagnosis of a disease of a patient, particularly of a digestive or gastrointestinal disease of a patient, said method comprising the steps of

collecting a cell from a feces sample collected from said patient, wherein the cell comprises an expression vector comprising

• a first nucleic acid sequence encoding a fusion protein of a reverse transcriptase and a Cas1 polypeptide, and a second nucleic acid sequence encoding a Cas2 polypeptide, wherein the first nucleic acid sequence and the second nucleic acid sequence are under transcriptional control of an inducible promoter sequence, and

• a transgene nucleic acid molecule comprising a CRISPR array sequence comprising a CRISPR direct repeat (DR) sequence, wherein the CRISPR direct repeat sequence is specifically recognizable by a RT-Cas1-Cas2 complex formed by the expression products of the first nucleic acid sequence and the second nucleic acid sequence;

or the cell comprises

• a first transgene nucleic acid sequence encoding a fusion protein of a reverse transcriptase and a Cas1 polypeptide, and a second transgene nucleic acid sequence encoding a Cas2 polypeptide, wherein the first transgene nucleic acid sequence and said second transgene nucleic acid sequence are under transcriptional control of an inducible promoter sequence, and

• a transgene nucleic acid molecule comprising a CRISPR array sequence comprising a CRISPR direct repeat (DR) sequence, wherein the CRISPR direct repeat sequence is specifically recognizable by a RT-Cas1-Cas2 complex formed by the expression products of the first nucleic acid sequence and the second nucleic acid sequence,

wherein the first transgene nucleic acid sequence, the second transgene and the transgene nucleic acid molecule are integrated into the genome of said cell; wherein said cell has been previously applied orally to said patient, and wherein the inducible promoter sequence is active in the gastrointestinal tract of said patient, isolating the transgene nucleic acid sequence from said cell yielding an isolated transgene nucleic acid sequence, and

sequencing said isolated transgene nucleic acid sequence

thereby recording one or more transcripts of said cell produced in the environment of the gastrointestinal tract.

The activity of the inducible promoter sequence in the gastrointestinal tract of the patient is achieved either by using promoters that specifically induce expression under hypoxic or anaerobic conditions or by administering an inducing compound such as anhydrotetracycline to the patient.

Advantageously, the test cell can be utilized as a sentinel cell for capturing information describing the extracellular environment within the gastrointestinal tract. For that purpose, test cells comprising the CRISPR machinery as described above may be administered to a patient. Changes in the transcriptome of the test cell, due to conditions or changes in the gastrointestinal environment, may be determined with the method of the invention.

Afterwards, the test cells may be collected from feces or gastrointestinal contents, wherein the therein comprised CRISPR array (third transgene nucleic acid sequence) may be sequenced revealing changes in the transcriptome of the test cell, serving as a proxy measurement of the extracellular environment within the gastrointestinal tract.

Bacterial cells, for example E. coli cells, are known to change their transcriptome depending on their environment. Under a certain diet or upon a certain digestive or gastrointestinal disease, the sentinel test cells within the gastrointestinal tract will capture changes in their transcriptome that reflect the extracellular environment within the gastrointestinal tract. Test cell transcriptome changes could be induced by numerous extracellular signals, including e.g. micronutrients, macronutrients, bile acids, inflammatory markers, autoregulatory molecules, and any other molecule naturally sensed by bacteria. Furthermore, the test cell can be equipped with a biosensor for specific intestinal molecules of interest, including e.g. tetrathionate and nitrate/nitrite, which are markers for intestinal inflammation. The inventors have shown that these transcriptome changes in E. coli grown in culture, e.g. upon oxidative stress, acid stress or herbicide exposure, may be observed with the method of the invention from the transcripts which act as protospacers and which are captured by the CRISPR machinery. Furthermore, the inventors have shown that these transcriptome changes in E. coli within the mouse gastrointestinal tract, e.g. in mice fed different diets or in mouse models of colitis, may be observed with the method of the invention from the transcripts which act as protospacers and which are captured by the CRISPR machinery. Within the gastrointestinal tract, expression of the first and the second transcript within the sentinel cells takes place leading to the assembly of the RT-Cas1-Cas2 protein complex. This complex integrates RNA within the E. coli cell into the CRISPR array. The RNA is converted into DNA and stored within the CRISPR array for sequencing. This way, an indirect observation of the transcriptome of the E. coli cell within the gastrointestinal tract, providing a proxy measurement of the extracellular environment within the gastrointestinal tract, is possible.

A sixth aspect of the invention relates to an apparatus for conducting the method of the fifth aspect.

Description of the Figures:

Fig. 1 shows the transcriptional recording by CRISPR spacer acquisition from RNA:

a) Expression of RT-Cas1-Cas2 leads to the acquisition of intracellular RNAs, providing a molecular memory of transcriptional events stored within DNA; and b) Comparison of RNA sequencing (RNA-seq) and CRISPR acquisition-mediated recording of RNA followed by deep sequencing (Record-seq). RNA-seq captures the transcriptome of a population of cells at a single point in time, providing a transient snapshot of cellular events. In contrast, Record-seq permanently stores information about prior transcriptional events in a CRISPR array, providing a molecular record for reconstructing transcriptional events that occurred over time.

Fig. 2 shows the characterization of spacers acquired by FsRT-Cas1-Cas2; a) Schematic of Record-seq experimental workflow (Fig. 7); b) Coverage of spacers aligning to the E. coli genome (scale bar 250 kb) and a representative locus (scale bar 100 bp). Identical alignments represent recurrent spacers acquired in independent biological samples (n=14). The sense/antisense orientation label is with respect to the RNA; c) Length distribution of genome-aligning spacers; d) GC content distribution of genome-aligning spacers. Dotted line represents 50% GC content; e) Nucleotide probabilities of the 5' (left) or 3' (right) end of the spacer, along with the respective flanking sequence. The spacer (blue) and flanking (grey) nucleotides are shown. Data represent spacers merged across n=14 independent biological samples; f) Gene body coverage of spacer alignments along transcripts. Relative position represents percentiles of coding sequence lengths +/- 300 bp of adjacent genomic regions. Values are mean normalized coverage ± s.d., n=14 independent biological samples. Values in c-e are mean percent of genome-aligning spacers ± s.e.m., n=14 independent biological samples.
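The spacer length and GC content distributions named in this figure description can be computed from a list of spacer sequences. The following is a minimal sketch under stated assumptions: spacers are supplied as plain DNA strings, and the function names are hypothetical rather than taken from the analysis actually used for the figure.

```python
from collections import Counter


def gc_content(spacer: str) -> float:
    """Percent of G and C bases in a single spacer sequence."""
    seq = spacer.upper()
    if not seq:
        raise ValueError("empty spacer")
    return 100.0 * sum(1 for base in seq if base in "GC") / len(seq)


def length_distribution(spacers: list) -> dict:
    """Percent of spacers at each observed length, ordered by length."""
    counts = Counter(len(s) for s in spacers)
    total = sum(counts.values())
    return {length: 100.0 * n / total for length, n in sorted(counts.items())}
```

A spacer with a GC content of exactly 50% would fall on the dotted line mentioned in panel d).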

Fig. 3 shows that the inventive system FsRT-Cas1-Cas2 acquires spacers directly from RNA according to abundance; a) Schematic of td intron-containing constructs and representative spacers aligning to the td intron splice junction; b) Quantification of spacers derived from the td intron splice junction. Values are mean td intron spacers per million reads ± s.e.m., n=3 independent biological samples. The sum of raw sequencing counts is shown below; c) Experimental workflow depicting MS2 recording; d) Quantification of MS2-derived RNA spacers. Values are mean MS2-aligning spacers per million reads ± s.e.m., n=3 (no MS2) and 4 (MS2) biologically independent samples; e) Coverage of spacers aligning to the MS2 genome. Data represent alignments merged across samples. Sense or antisense orientation is given with respect to the (+)-strand MS2 RNA; scale bar 200 bp; f) Schematic and quantification of transcriptional recording of arbitrary sequences. Values are mean relative spacer count ± s.e.m., n=10 independent biological samples. The constitutively expressed KanR selection marker was used as a control; g) Schematic and quantification of orthogonal transcriptional recording. Values are mean relative spacer count ± s.e.m., n=10 (treated) and 9 (untreated) independent biological samples.

Fig. 4 shows the transcriptome-scale recording and analysis of complex cellular behaviors; a) Workflow for comparing Record-seq with RNA-seq; b) Clustering of Record-seq data from untreated (grey) and oxidative stress treated (green) E. coli populations, performed using Pearson correlation, n=12 (untreated) and n=11 (treated) independent biological samples; c) Clustering of Record-seq data from untreated (grey boxes) and acid stress treated (orange boxes) E. coli populations, performed using Pearson correlation, n=10 independent biological samples; d) PCA of Record-seq data from untreated (grey) and oxidative stress treated (green) E. coli populations, n=12 (untreated) and n=11 (treated) independent biological samples; e) PCA of Record-seq data from untreated (grey) and acid stress treated (orange) E. coli populations, n=10 independent biological samples; f) Clustering of Record-seq data for signature differentially expressed genes under oxidative stress; g) Clustering of Record-seq data for signature differentially expressed genes under acid stress.
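The Pearson correlation used for the clustering in panels b) and c) can be illustrated for a pair of samples. This is a minimal sketch, assuming each sample is represented by a vector of per-gene spacer counts in the same gene order; the function name is hypothetical and the sketch covers only the pairwise similarity, not the clustering itself.

```python
import math


def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)
```

Computing this coefficient for every pair of samples yields a symmetric similarity matrix, which a hierarchical clustering routine can then take as input.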

Fig. 5 shows sentinel cells for recording of dose-dependent and transient herbicide exposure; a) Clustering of Record-seq data from untreated (grey), 10 mM paraquat treated (red) and 1 mM paraquat treated (green) E. coli populations, performed using Pearson correlation, n=15 independent biological samples; b) PCA of Record-seq data from untreated (grey), 10 mM paraquat treated (red) and 1 mM paraquat treated (green) E. coli populations, n=15 independent biological samples; c) Clustering of Record-seq data for signature differentially expressed genes; d) Workflow for comparing Record-seq with RNA-seq upon transient paraquat exposure; e) PCA of RNA-seq data from unexposed (grey), transient paraquat exposed (turquoise) and constantly paraquat exposed (red) E. coli populations, n=6 independent biological samples; f) PCA of Record-seq data from unexposed (grey), transient paraquat exposed (turquoise) and constantly paraquat exposed (red) E. coli populations, n=6 independent biological samples.

Fig. 6 shows the RT-Cas1 ortholog search and screening; a) Experimental workflow involving the identification of 121 RT-Cas1 orthologs, overexpression in E. coli from the plasmid carrying a minimal CRISPR array, containing leader-DR-spacer1-DR-spacer2-DR, followed by deep sequencing of expanded CRISPR arrays, and analysis as well as characterization of identified spacers; b) A comparison of the 14 disparate RT-Cas1 proteins selected for functional testing. Indicated on the left is the host species followed by a neighbor-joining phylogenetic tree built using Jukes-Cantor genetic distances of a MUSCLE multiple sequence alignment. The large “Unknown Domain” is highlighted in green, Cas6 homology domain in pink, RT domain in purple, and Cas1 in yellow; c) Detection frequency of newly acquired spacers after overnight growth and induction of RT-Cas1-Cas2 in E. coli BL21 (DE3) in different induction media. Shown is the sum of spacer counts per 1 million sequencing reads, n=1 biological sample; d) Representative alignments of 200 spacers sequenced from F. saccharivorans array 1 to the corresponding overexpression plasmid; e) Representative alignments of 200 spacers sequenced from F. saccharivorans array 2 to the corresponding overexpression plasmid.

Fig. 7 shows the SENECA workflow and assessment of Record-seq efficiency in different culture conditions; a) SENECA relies on a plasmid containing a minimal CRISPR array consisting of the leader sequence followed by a single DR and a recognition sequence for the restriction enzyme FaqI. The SENECA workflow for the (left) parental and (right) expanded array are shown. In a Golden Gate reaction, FaqI cleaves within the DR (I/II) introducing sticky ends for ligation to an Illumina P7 3’ adapter (III). For the parental array this results in a single truncated DR (IVa). For the expanded array this results in a truncated DR as well as an intact DR and spacer (IVb). PCR with primers binding to the full-length DR and the Illumina P7 3’ adapter results in linear amplification of the parental array (Va) and exponential amplification of the expanded array (Vb); b) Sequencing reads obtained from E. coli BL21 (DE3) cells transformed with FsRT-Cas1-Cas2 encoding plasmid with or without IPTG induction; c) Same as b) but in E. coli BL21AI; d) Same as b) but in E. coli NovaBlue(DE3), a K12 substrain of E. coli; e) Comparison of the percent of sequencing reads from induced samples containing newly acquired spacers; f) Spacers per million sequencing reads obtained from cultures at an OD600 of 0.4, 0.8 or upon saturation; g) CRISPR arrays with two spacers per million sequencing reads obtained from cultures at an OD600 of 0.4, 0.8 or upon saturation. Values in b-g are mean ± s.e.m., n=3 independent biological samples.
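The readouts "spacers per million sequencing reads" and "mean ± s.e.m." used in this figure description follow from simple arithmetic. The sketch below rests on two assumptions that the text does not state: normalization divides the spacer count by the total number of sequencing reads of the same sample, and the s.e.m. is the sample standard deviation (n-1 denominator) divided by the square root of n. The function names are hypothetical.

```python
import math
import statistics


def spacers_per_million(spacer_count: int, total_reads: int) -> float:
    """Normalize a spacer count to one million sequencing reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return spacer_count / total_reads * 1_000_000


def mean_sem(values: list) -> tuple:
    """Return (mean, standard error of the mean) for replicate values.

    Assumes the sample standard deviation (n-1 denominator).
    """
    if len(values) < 2:
        raise ValueError("need at least two replicates")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem
```

For example, 50 spacers detected among 2,000,000 reads correspond to 25 spacers per million reads.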

Fig. 8 shows the Record-seq-based screen of RT-Cas1 orthologs and CRISPR array directionalities; a) Schematic of the F. saccharivorans CRISPR locus depicting the selection of CRISPR arrays and directionalities for Record-seq analysis. CRISPR arrays within each locus were identified and cloned into plasmids encoding corresponding RT-Cas1-Cas2 coding sequences. Arrays were tested in both possible directionalities, forward and reverse with a 150 bp leader. In cases of insufficient genomic data, arrays were only tested in one directionality; b) Record-seq readout of RT-Cas1 orthologs and CRISPR array directionalities. Acquisition efficiency for forward (fw) and reverse complement (rc) directionality of each array are plotted in blue and orange, respectively. Values are genome-aligning spacers per million sequencing reads, n=1 biological sample.

Fig. 9 shows the characterization of spacers acquired by FsRT-Cas1-Cas2 and comparison of SENECA and classic spacer acquisition readouts; a) Nucleotide probabilities determined using plasmid-aligning spacers merged across n=14 independent biological samples, prepared analogous to Fig. 2f; b) Histogram of spacer GC content for all spacers or spacers acquired internal to the body of the transcript (‘gene body internal’). Values represent mean percent of genome-aligning spacers ± s.e.m., n=3 independent biological samples; c) Percent of spacers aligning to either the sense or antisense strand of coding genes. The sense or antisense orientation label is with respect to the RNA, prepared analogous to Fig. 2c; d) Length distribution of genome-aligning spacers, prepared analogous to Fig. 2d; e) GC-content distribution of genome-aligning spacers. The dotted line represents a balanced (50%) GC content, prepared analogous to Fig. 2e; f) Nucleotide probabilities for classic acquisition readout, prepared analogous to Fig. 2f; g) Nucleotide probabilities for SENECA acquisition readout, prepared analogous to Fig. 2f; h) Gene body coverage. For each gene the spacer coverage was determined and transformed into percentiles for comparison. Values are mean normalized coverage, n=1 pooled sample, containing 5798 spacers. Values in c-g are mean percent of genome-aligning spacers, n=1 pooled sample, containing 5798 spacers.

Fig. 10 shows the characterization of spacers acquired by FsRT-Cas1-Cas2; a) Experimental workflow for determining the specificity of FsRT-Cas1-Cas2 for RNA using the td intron splice junction to detect RNA-derived spacers. Genomic DNA (gDNA) was extracted from an independent culture and subjected to targeted deep sequencing of the td intron insertion site; b) Quantification of td intron splice junctions. The splice junction is specific to RNA-derived spacers and not genomic DNA or cDNA copies generated by alternative RTs in the E. coli genome. Values represent mean td intron splice junction counts per million sequencing reads ± s.e.m., n=3 independent biological samples; c) Number of spacers aligned to plasmid, E. coli genome, and MS2 genome, showing CRISPR acquisition from an RNA virus. The total number and percent of spacers aligning to each reference are shown. Values represent the sum of MS2-aligning spacers across replicates, n=64 technical replicates from n=2 biological samples, representing 22 million spacers; d) Number of MS2-aligned spacers from c) that align to the overexpression plasmid, E. coli and MS2 genome, showing that MS2-aligned spacers are specific to the MS2 genome. The total number and percent of MS2-aligned spacers that subsequently align to each reference are shown, n=64 technical replicates from n=2 biological samples, representing 22 million spacers; e) Total number of spacers aligning to features of the MS2 genome, n=64 technical replicates from n=2 biological samples, representing 22 million spacers; f) Scatter plot of transcript counts from the MS2 and E. coli genomes. Each dot represents the mean spacer count for each transcript, n=4 independent biological samples. The horizontal black bars are mean genome-aligning spacer count across all transcripts ± s.e.m.

Fig. 11 shows the quantitative analysis of arbitrary RNA sequence recording using qRT-PCR and Record-seq; a) Coverage of spacers from Fig. 3f aligning to sfGFP or Rluc. Arrow and dotted line reflect the transcription start site (TSS), black octagon indicates the transcriptional terminator. For each nucleotide position, the sum spacer coverage per million sequencing reads is shown, n=10 independent biological samples; b) Absolute quantification of sfGFP mRNA measured by qRT-PCR. Samples from Fig. 3f. Values are mean copy number per 6 x 10⁹ cells, normalized by 16S rRNA copy number, ± s.e.m., n=10 independent biological samples; c) Analogous to b, but for Rluc; d) Scatter plot depicting the correlation between absolute sfGFP mRNA copy number and the number of transcript-aligning spacers from Fig. 3f. Linear regression fit, coefficient of determination (R²), and Pearson linear correlation coefficient (P), n=10 independent biological samples; e) Analogous to d, but for Rluc; f) Comparison of spacer counts for arbitrary sfGFP sequence and endogenous transcripts. Each dot represents the mean spacer count for each transcript, horizontal black bars are mean genome-aligning spacer count ± s.e.m., n=10 independent biological samples; g) Dose-response relationship between sfGFP-aligning spacers and inducer concentration for different numbers of recorded spacers. These data represent the average number of sfGFP-aligning spacers ± s.e.m., n=10 independent biological samples; h) Relative spacer count of spacers mapping to the Fluc transcript after 3OC6-HSL induction. Values are the normalized mean number of spacers per million sequencing reads ± s.e.m. with n=6 independent biological samples; i) Absolute quantification of Fluc mRNA measured by qRT-PCR. Data were obtained from the same bacterial cultures as in Fig. 3g. Values are mean copy number per 6 x 10⁹ cells, normalized by 16S rRNA copy number, ± s.e.m., n=10 independent biological samples; j) The same as in g, but for Rluc.

Fig. 12 shows that Record-seq reveals cumulatively highly expressed genes; a) Scatter plots depicting Record-seq correlation between n=3 independent biological replicates shown in b and c. Linear regression fit, coefficient of determination (R²), and Pearson linear correlation coefficient (P) are shown for each comparison. Data represent log2-normalized transcript quantification counts; b) Spacers are preferentially acquired from highly expressed genes. Record-seq spacer counts for plasmid and E. coli genes (top) or only E. coli genes (bottom) according to decreasing RNA-seq-based gene expression values. Monte Carlo bounds reflect simulated spacers with no transcriptional bias. Mean cumulative normalized spacer count and Monte Carlo bounds are shown, n=3 independent biological samples; c) Assessing the correlation between an RNA-seq stationary phase snapshot and a Record-seq transcriptional record. RNA-seq and Record-seq were performed on the same population of E. coli BL21 (DE3) in stationary phase growth, induced to express FsRT-Cas1-Cas2 overnight. The correlations between all (top left), stationary-phase (top right), log-phase (bottom left), and plasmid-borne (bottom right) genes are shown. The linear regression fit, coefficient of determination (R²), and Pearson linear correlation coefficient (P) are shown for each comparison. The data represent the log2-normalized transcript quantification counts averaged across replicates, n=3 independent biological samples; d) Correlation of Record-seq with log and stationary-phase genes over long-term cultivation. These data represent the R² value calculated as described for b for either stationary or logarithmic phase gene sets using different E. coli culture time points as inputs with n=3 independent biological samples; e) Comparison of transcript-aligning spacer counts with and without normalizing for gene expression level. Each dot represents the mean normalized number of counts per transcript with n=3 independent biological samples. The horizontal black bars are mean genome-aligning spacer count ± s.e.m.

Fig. 13 shows the determination of the minimum number of cells required for assessing complex cellular behaviors using Record-seq and PCA; a) Using the acid stress response data set shown in Fig. 4, PCA was performed on the entire data set as well as on progressively and randomly down-sampled data. These data show that Record-seq appropriately classifies the acid stress response samples with 7% of the original data (corresponding to 314 spacers or 6.1 x 10⁶ E. coli cells), n=10 independent biological samples.

Fig. 14 shows the determination of the minimum number of cells required for assessing complex cellular behaviors using Record-seq and differentially expressed signature gene analysis; Using the acid stress response data set shown in Fig. 4e, f, g, differentially expressed signature genes were identified for the entire data set as well as for progressively and randomly down-sampled data. The plots depict hierarchically clustered signature gene heatmaps. These data show that with 10% of the original data (corresponding to 448 spacers or 8.8 x 10⁶ E. coli cells) the signature genes can appropriately classify the samples, n=10 independent biological samples.

Fig. 15 shows the optimization of CRISPR spacer acquisition efficiency and detection of signature genes corresponding to Record-seq-compatible sentinel cells for encoding transient herbicide exposure; a) Plasmid and genome-aligning spacers obtained from E. coli BL21 (DE3) transformed with FsRT-Cas1-Cas2 encoding plasmid using the original coding sequence (CDS) (light blue) or optimized CDS (dark blue) under the indicated IPTG concentrations; b) Plasmid and genome-aligning spacers obtained from E. coli BL21 (DE3) transformed with FsRT-Cas1-Cas2 encoding plasmid using the optimized coding sequence under transcriptional control of either the PT7lac, PtetA, or PrhaB promoter, induced with the indicated concentrations of IPTG, aTc, or rhamnose, respectively; c) Unsupervised hierarchical clustering of RNA-seq cumulative expression profiles for signature differentially (cumulatively) expressed genes. Signature genes represent the union between the top 20 most differently expressed genes identified by DESeq2, edgeR, and baySeq, n=6 independent biological samples; d) Unsupervised hierarchical clustering of Record-seq cumulative expression profiles for signature differentially (cumulatively) expressed genes. Signature genes represent the union between the top 20 most differently expressed genes identified by DESeq2, edgeR, and baySeq, n=6 independent biological samples. Data in a, b are mean ± s.e.m., n=3 independent biological samples.

Fig. 16 Shows a schematic of the general Record-seq workflow in the mouse gut. E. coli BL21 (DE3) or MG1655 cells are transformed with a plasmid encoding FsRT-Cas1-Cas2 under transcriptional control of an inducible promoter (in this case PtetA). Furthermore, the vector encodes the SENECA-compatible version of an Fs CRISPR array. E. coli cells are grown first on solid culture after transformation, and then in liquid culture from individual colonies. Subsequently, germ-free mice are gavaged with E. coli cells; maintenance of the plasmid and expression of FsRT-Cas1-Cas2 are ensured by addition of antibiotics (matching the resistance marker of the FsRT-Cas1-Cas2 plasmid) as well as inducers of FsRT-Cas1-Cas2 expression (in this case anhydrotetracycline). The E. coli cells colonize the gut of the germ-free mouse and FsRT-Cas1-Cas2 records spacers into plasmid-borne CRISPR arrays during the passage of cells through the gut. E. coli cells are then collected from feces of the animals or contents of the gut at different sites. Plasmid DNA is extracted from E. coli and subjected to SENECA followed by deep sequencing to retrieve the recorded spacers and infer the intestinal environment.

Fig. 17 Shows acquisition of spacers detected by SENECA and deep-sequencing after oral gavage of mice with E. coli BL21 (DE3) cells. Anhydrotetracycline (aTc) was supplied through the drinking water at indicated concentrations. Acquisition of spacers increased with increasing aTc concentration.

Fig. 18: Shows acquisition of spacers detected by SENECA and deep-sequencing after oral gavage of mice with E. coli BL21 (DE3) cells. Anhydrotetracycline (aTc) was supplied through the drinking water at indicated concentrations. Acquisition of multiple spacers increased with increasing aTc concentration.

Fig. 19: Shows acquisition of spacers detected by SENECA and deep-sequencing after oral gavage of mice with E. coli BL21 (DE3) cells. Plasmid DNA was isolated from E. coli cells from small intestine, cecum, colon and feces. Spacer acquisition occurs in all tested anatomical sections of the gut.

Fig. 20: Shows acquisition of spacers detected by SENECA and deep-sequencing after oral gavage of mice with E. coli BL21 (DE3) cells. Plasmid DNA was isolated from E. coli cells from feces of animals at days 2, 5 and 9 and spacer acquisition was shown to increase over time.

Fig. 21: Shows a PCA for Record-seq data derived from C57BL/6 mice gavaged with FsRT-Cas1-Cas2 expressing E. coli BL21 (DE3) cells as outlined in Figure 16 and treated with either water (H2O) or 1, 2 or 3% (w/v) colitis-inducing dextran sulfate sodium (DSS) in their drinking water.

Fig. 22: Shows a PCA for Record-seq data derived from C57BL/6 mice gavaged with FsRT-Cas1-Cas2 expressing E. coli BL21 (DE3) cells as outlined in Figure 16 and fed with either a chow or starch-based diet.

Fig. 23: Shows a heatmap depicting unsupervised hierarchical clustering for the top differentially expressed genes for Record-seq data derived from C57BL/6 mice gavaged with FsRT-Cas1-Cas2 expressing E. coli BL21 (DE3) cells as outlined in Figure 16 and treated with either water (H2O) or 1, 2 or 3% (w/v) colitis-inducing dextran sulfate sodium (DSS) in their drinking water. Genome-aligning spacer counts transformed by variance stabilizing transformation (vst) were used.

Fig. 24: Shows a heatmap depicting unsupervised hierarchical clustering for the top differentially expressed genes for Record-seq data derived from C57BL/6 mice gavaged with FsRT-Cas1-Cas2 expressing E. coli BL21 (DE3) cells as outlined in Figure 16 and fed with either a chow or starch-based diet. Genome-aligning spacer counts transformed by variance stabilizing transformation (vst) were used.

Fig. 25: Shows a PCA plot for Record-seq data derived from C57BL/6 mice gavaged with FsRT-Cas1-Cas2 expressing E. coli MG1655 cells as outlined in Figure 16 and fed with either a chow, starch or fat-based diet.

Examples

The inventors hypothesized that direct CRISPR spacer acquisition from RNA could be leveraged to store transcriptional records in CRISPR arrays within living cells. Therefore, several orthologous RT-Cas1-containing CRISPR-Cas systems were characterized. The inventors identified one from Fusicatenibacter saccharivorans to be capable of acquiring RNA spacers heterologously in E. coli. Leveraging F. saccharivorans RT-Cas1 and Cas2 (FsRT-Cas1-Cas2), the inventors developed Record-seq, a method enabling transcriptome-scale molecular recordings into populations of cells. Transcriptional events are recorded according to RNA abundance, stored in CRISPR arrays within DNA, and can be leveraged to describe continuous as well as transient complex cellular behaviors.

CRISPR spacer acquisition by FsRT-Cas1-Cas2

The inventors set out to identify an RT-Cas1-Cas2 CRISPR acquisition complex with the ability to acquire spacers directly from RNA upon heterologous expression in E. coli. The inventors identified 121 RT-Cas1 orthologs (Table 1), and selected 14 representatives for functional characterization (Fig. 6a, b). The inventors overexpressed corresponding RT-Cas1 and Cas2 proteins from a plasmid additionally containing their predicted CRISPR array (Fig. 6a). Using a previously established spacer acquisition assay, the inventors discovered that the ortholog of F. saccharivorans actively acquired new spacers (Fig. 6c). The endogenous F. saccharivorans locus contains two CRISPR arrays, and the inventors observed that novel spacers derived from the overexpression plasmid as well as from the E. coli genome were acquired into either array (Fig. 6c-e).

Selective amplification of expanded CRISPR arrays

Using the previously established spacer acquisition assay, the inventors obtained approximately 1300 newly acquired spacers per 1 million deep sequencing reads for FsRT-Cas1-Cas2 (Fig. 6c). To improve detection of novel spacers, the inventors developed Selective amplification of expanded CRISPR arrays (SENECA), a method to selectively amplify CRISPR arrays that acquired new spacers (Fig. 2a, Fig. 7a). A typical SENECA-assisted Record-seq experiment uses an input of ~180 ng of plasmid DNA extracted from an overnight culture of E. coli overexpressing FsRT-Cas1-Cas2, and yields 950,000 total spacers aligning to the plasmid or host genome for every 1 million sequencing reads (Fig. 2a, Fig. 7b-e). This marks an improvement of several thousand-fold compared to recent reports. Using Record-seq, the inventors readily demonstrated in vivo activity of FsRT-Cas1-Cas2 in various E. coli strains and throughout growth phases (Fig. 7b-g).
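By way of illustration only, the per-million normalization used to report acquisition efficiency can be sketched as follows. The function name is hypothetical and not part of the described method; the two inputs in the usage lines are the figures quoted in the preceding paragraph, and the resulting ratio merely compares those two readouts (it is not the comparison to other reports mentioned above).

```python
def spacers_per_million(spacer_count: int, total_reads: int) -> float:
    """Normalize a raw spacer count to spacers per one million sequencing reads.

    Hypothetical helper for illustration; not taken from the source.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return spacer_count * 1_000_000 / total_reads


# Figures quoted in the text: ~1300 spacers per million reads with the
# previously established readout, 950,000 per million with SENECA.
classic = spacers_per_million(1300, 1_000_000)
seneca = spacers_per_million(950_000, 1_000_000)

# Ratio between the two readouts quoted in this section.
fold_enrichment = seneca / classic
```

Normalizing to reads rather than to cells keeps samples of different sequencing depth comparable, which is why the figure legends report spacers per million sequencing reads.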

The inventors then employed Record-seq to rescreen their initial selection of RT-Cas1 orthologs (Fig. 7b). Furthermore, the inventors included all potential CRISPR arrays present in their endogenous loci in both possible directionalities in order to overcome the challenges associated with predicting these a priori (Fig. 8a). Due to the improved sensitivity of Record-seq compared to the classic readout, the inventors readily detected newly acquired spacers for the majority of orthologs upon RT-Cas1-Cas2 expression (Fig. 8b). Only a few orthologs exhibited a preferred directionality of the CRISPR array (i.e., specificity for an upstream leader sequence). Consistent with the classic readout, FsRT-Cas1-Cas2 outperformed all other orthologs in terms of spacer acquisition efficiency and was chosen for further characterization. The concepts employed by Record-seq may also be applied to characterize spacer acquisition in other CRISPR-Cas systems that have been intractable due to low spacer acquisition efficiencies.

Characteristics of FsRT-Cas1-Cas2 spacer acquisition

In order to better understand the properties of FsRT-Cas1-Cas2, the inventors extensively characterized newly acquired spacers by performing Record-seq on populations of E. coli overexpressing FsRT-Cas1-Cas2 (Fig. 2a). The inventors observed that genome-aligning spacers were preferentially acquired with a specific ‘antisense’ orientation, whereby spacers were complementary to the originating RNA (Fig. 2b, c). The median spacer length was 39 bp, with a distribution biased towards longer lengths (Fig. 2d). The median GC content was 36%, showing a strong bias towards AT-rich spacers (Fig. 2e). In line with previously described Type III CRISPR systems, the inventors did not find a sequence preference within or adjacent to newly adapted spacers acquired from either plasmid (Fig. 9a) or genome (Fig. 2f), implying that the FsRT-Cas1-Cas2 complex exhibits no protospacer adjacent motif (PAM). While observing spacer alignments to the E. coli genome, the inventors noted that many coverage peaks were located near the termini of genes (Fig. 2b). Consistent with this observation, the inventors found that at the genome-wide level, most spacers were derived from the 5', and to a lesser extent, 3' ends of genes (Fig. 2g). This finding raised the possibility that the apparent bias towards AT-rich spacers might be caused by the AT-richness of RNA ends in E. coli. However, the bias towards AT-rich spacers persisted when only considering spacers derived from within the gene body (Fig. 9b). The inventors directly compared SENECA with the classic spacer readout to determine whether SENECA introduces additional biases but found no major differences (Fig. 9c-h). Taken together, these results reflect a process by which FsRT-Cas1-Cas2 selects AT-rich spacers based on sequences related to the beginning or end of a gene, such as the ends of an RNA molecule.
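The summary statistics discussed above (median spacer length and median GC content) could, for example, be computed along the following lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, spacers are assumed to be available as plain DNA strings after extraction from the sequencing reads, and the sequences in the usage lines are invented toy inputs, not data from the experiments.

```python
from statistics import median


def gc_content(seq: str) -> float:
    """Return the GC content of a DNA sequence as a percentage (0-100)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


def summarize_spacers(spacers: list[str]) -> dict[str, float]:
    """Median length (bp) and median GC content (%) over a list of spacers."""
    if not spacers:
        raise ValueError("no spacers given")
    return {
        "median_length": median(len(s) for s in spacers),
        "median_gc_percent": median(gc_content(s) for s in spacers),
    }


# Toy input, hypothetical sequences for illustration only.
toy_spacers = ["AAAA", "GGCC", "ATGC"]
summary = summarize_spacers(toy_spacers)
```

Restricting the input list to spacers that align within the gene body would give the 'gene body internal' variant of the same statistic.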

FsRT-Cas1-Cas2 acquires spacers directly from RNA

To determine whether FsRT-Cas1-Cas2 acquires spacers directly from RNA, the inventors utilized a self-splicing td group I intron. This intron is a functional ribozyme, catalyzing its own excision from the pre-mRNA, resulting in a characteristic splice junction that is not present at the DNA-level. The inventors constructed three intron-interrupted constructs based on genes that were highly sampled by spacers, namely cspA, rpoS and argR (Fig. 3a). Upon expression of these constructs followed by Record-seq the inventors observed unique spacers spanning the splice junctions (Fig. 3a, b). To exclude the possibility that splice junction-containing spacers were acquired from extended complementary DNA copies generated through unspecific RT activity in E. coli, the inventors performed targeted deep sequencing on genomic DNA extracted from td intron construct-expressing cultures (Fig. 10a), showing that the splice junction was absent at the DNA-level (Fig. 10a, b). Importantly, these results do not exclude the possibility of spacer acquisition from DNA. Taken together, FsRT-Cas1-Cas2 facilitates CRISPR spacer acquisition from RNA heterologously in E. coli.
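The logic of the splice junction test can be sketched as follows: a spacer is informative only if it covers bases on both sides of the exon-exon junction, which exists in the spliced RNA but not in the intron-interrupted DNA. The sketch below is illustrative only and makes several assumptions not taken from the source: exact string matching stands in for a real read aligner, the function names and the `min_overhang` parameter are hypothetical, and the exon sequences are invented toy inputs. Both orientations are checked because spacers may be stored complementary to the originating RNA.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def spans_splice_junction(spacer: str, exon1: str, exon2: str,
                          min_overhang: int = 3) -> bool:
    """True if the spacer (either orientation) matches the spliced sequence
    and covers at least `min_overhang` bases on each side of the junction."""
    spliced = exon1 + exon2
    junction = len(exon1)  # index of the first base of exon 2
    for candidate in (spacer, reverse_complement(spacer)):
        start = spliced.find(candidate)
        while start != -1:
            end = start + len(candidate)
            if start <= junction - min_overhang and end >= junction + min_overhang:
                return True
            start = spliced.find(candidate, start + 1)
    return False


# Toy exons, hypothetical sequences for illustration only.
exon1, exon2 = "ATGGCTAAAC", "GTTCCGATAG"
crossing = spans_splice_junction("TAAACGTTCC", exon1, exon2)
crossing_antisense = spans_splice_junction("GGAACGTTTA", exon1, exon2)
exon_only = spans_splice_junction("ATGGCTAAAC", exon1, exon2)
```

Requiring a minimum overhang on each side avoids counting spacers that merely end at the junction and could equally have come from unspliced sequence.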

To further validate this finding, the inventors utilized the Enterobacteria phage MS2. MS2 phages exist as both sense and antisense single-stranded RNAs during their lifecycle but have no DNA intermediates. Given that MS2 phages require the F pilus for cell entry, which is missing in E. coli BL21 (DE3) cells, the inventors turned to the E. coli K12 strain NovaBlue(DE3). Upon infection of FsRT-Cas1-Cas2 expressing cells with MS2 phage, the inventors could readily observe novel MS2-aligning spacers sampled from throughout the MS2 genome (Fig. 3c-e, Fig. 10c-f). The MS2-aligning spacers shared no sequence similarity with the plasmid or host genome, confirming their specificity (Fig. 10d). In sum, FsRT-Cas1-Cas2 enables spacer acquisition directly from a foreign RNA, thereby providing a molecular memory of an invading virus.

Recording of arbitrary transcripts using Record-seq

To assess the potential of FsRT-Cas1-Cas2 for quantitatively recording transcriptional events, the inventors utilized an inducible expression system to directly determine whether spacers were being acquired according to RNA abundance. The corresponding constructs contained super-folder GFP (sfGFP) or Renilla luciferase (Rluc) genes under transcriptional control of the anhydrotetracycline (aTc)-inducible PtetA promoter. The inventors introduced these into E. coli cultured in increasing levels of aTc and subsequently harvested both total RNA and plasmid DNA for qRT-PCR and Record-seq, respectively (Fig. 3f). The inventors observed that upon increasing induction of sfGFP or Rluc there was a concordant dose-dependent increase in the coverage of spacers aligning to the respective coding sequence (Fig. 11a). The inventors quantified this response and observed a linear relationship (R² value of 0.97) between spacer counts and absolute mRNA copy number (Fig. 11b-e) as well as aTc concentration in the media (Fig. 3f). Furthermore, sfGFP-aligning spacers were readily detected against the backdrop of genome-aligning spacers by almost an order of magnitude (Fig. 11f, g), which is in line with using a strong synthetic inducible promoter such as PtetA. Importantly, spacers aligning to the constitutively expressed KanR gene were not dependent on the aTc concentration (Fig. 3f).

To further generalize these findings, the inventors evaluated a second inducible expression system, placing the firefly luciferase (Fluc) gene downstream of the 3-oxohexanoyl-homoserine lactone (3OC6-HSL)-inducible PLuxR promoter. Induction led to a 4-fold increase in Fluc-aligning spacers (Fig. 11h). Furthermore, combining both the aTc-inducible PtetA and the 3OC6-HSL-inducible PLuxR transcription system enabled orthogonal recording of two independent stimuli in parallel (Fig. 3g, Fig. 11i, j). This suggests that Record-seq is compatible with seemingly any inducible expression system, thereby enabling recording of multiple orthogonal sets of defined stimuli within a population of living cells. Taken together, these results show that CRISPR spacer acquisition from RNA can generate a quantifiable record of cumulative transcript abundance, and also that the transcriptional records are efficiently retrieved using standard molecular and sequencing methods.
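For orientation, the relationship between spacer counts and mRNA copy number described in this section is summarized by a Pearson correlation coefficient and a coefficient of determination. The following is a minimal sketch of how such values could be computed for paired measurements. The function names are hypothetical, the numbers in the usage lines are invented toy values (not experimental data), and the sketch relies on the standard identity that, for a simple linear regression with one predictor, R² equals the square of the Pearson coefficient.

```python
def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson linear correlation coefficient between paired values."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / (sxx * syy) ** 0.5


def r_squared(x: list[float], y: list[float]) -> float:
    """Coefficient of determination of a simple (one-predictor) linear fit."""
    return pearson_r(x, y) ** 2


# Toy values, hypothetical and perfectly linear for illustration only.
mrna_copies = [1.0, 2.0, 3.0, 4.0]
spacer_counts = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(mrna_copies, spacer_counts)
r2 = r_squared(mrna_copies, spacer_counts)
```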

Record-seq shows cumulatively highly expressed genes

Considering that FsRT-Cas1-Cas2 acquired spacers directly from RNA in an abundance-dependent manner, the inventors investigated whether this could enable quantification of the cumulative cellular transcriptome. The inventors harvested both plasmid DNA for Record-seq and total RNA for RNA-seq from E. coli cultures overexpressing FsRT-Cas1-Cas2 (Fig. 4a). First, the inventors confirmed the reproducibility of Record-seq between biological replicates (Pearson Correlation = 0.996 to 0.999 and R² = 0.560 to 0.618) (Fig. 12a), and then assessed the influence of gene expression on spacer acquisition. The FsRT-Cas1-Cas2 spacers showed a strong bias towards highly transcribed genes (Fig. 12a) and correlated with RNA-seq-based gene expression values transcriptome-wide at various growth stages (Fig. 12b-d). While certain CRISPR-Cas subtypes possess active mechanisms for preferentially acquiring plasmid-derived spacers, the inventors did not observe the same after accounting for the high expression level of these genes (Fig. 12e). Taken together, spacers are systematically acquired from highly transcribed genes, and represent cumulative transcript expression.

Transcriptome-scale recording reveals cell behaviors

To determine whether Record-seq could be used to record and describe complex cellular behaviors, the inventors turned to the well-studied oxidative stress and acid stress responses in E. coli. The inventors performed Record-seq on oxidative and acid stress stimulated FsRT-Cas1-Cas2 expressing cultures and analyzed cumulative expression counts using unsupervised hierarchical clustering as well as principal component analysis (PCA). Both approaches were successful in distinguishing treatment conditions, suggesting that Record-seq captured the differential molecular histories (Fig. 4b-e). To identify the cumulatively differentially expressed genes the inventors leveraged standard differential expression (DE) analysis tools developed for RNA sequencing. To overcome specific biases and assumptions of individual tools, the inventors utilized three complementary tools, namely DESeq2, edgeR, and baySeq. After identifying DE genes with each tool, the inventors generated a set of signature genes for each stimulus based on the union of the top 20 DE genes from each analysis, which the inventors hierarchically clustered and plotted along with their expression values (Fig. 4f, g). Among the signature genes the inventors identified several that were expected to dominate the cellular responses for each stimulus. The inventors investigated the minimum number of cells required for assessing complex cellular behaviors by Record-seq, finding that 8.8 x 10⁶ cells are sufficient to appropriately classify treatment conditions (Fig. 13, 14). In sum, these data support the notion that the RNA-derived spacers stored within CRISPR arrays can be utilized to reconstruct the transcriptional response underlying a complex cellular behavior.
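The signature gene selection described above (the union of the top 20 DE genes from each of the three tools) can be illustrated with a short sketch. It assumes, purely for illustration, that each tool's output has already been exported as a list of gene identifiers ordered from most to least significant; the function name, the dictionary layout and the placeholder gene identifiers are hypothetical, and the differential expression tools themselves are not invoked here.

```python
def signature_genes(rankings: dict[str, list[str]], top_n: int = 20) -> list[str]:
    """Union of the top_n genes from each tool's ranked DE gene list.

    `rankings` maps a tool name to gene identifiers sorted from most to
    least significant. The result is sorted alphabetically for stable output.
    """
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    signature: set[str] = set()
    for ranked in rankings.values():
        signature.update(ranked[:top_n])
    return sorted(signature)


# Placeholder rankings with hypothetical gene identifiers; top_n is reduced
# to 2 so that the toy example stays readable.
toy_rankings = {
    "DESeq2": ["geneA", "geneB", "geneC"],
    "edgeR": ["geneB", "geneD", "geneA"],
    "baySeq": ["geneE", "geneA", "geneB"],
}
toy_signature = signature_genes(toy_rankings, top_n=2)
```

Taking the union rather than the intersection retains genes that only one tool ranks highly, which matches the stated aim of compensating for the biases and assumptions of individual tools.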

Sentinel cells encode transient herbicide exposure

To determine whether Record-seq could be leveraged for producing sentinel cells, the inventors utilized the herbicide paraquat and determined if Record-seq could capture dose-dependent and transient exposures. Paraquat is a bacteriostatic herbicide that results in superoxide anion production in microbes, and is banned in a number of countries due to its acute toxicity in humans and use in suicide cases.

Using an improved FsRT-Cas1-Cas2 expression construct (Fig. 15a, b) the inventors exposed E. coli cultures to increasing concentrations of paraquat and retrieved the transcriptional memories by Record-seq. Quantification of cumulative gene expression in the different treatment conditions showed that samples were readily classified into appropriate exposure groups using both unsupervised hierarchical clustering and PCA (Fig. 5a, b).

Moreover, the signature genes captured dose-responsive and canonical paraquat-exposure genes within E. coli (Fig 5c). For example, within the signature genes the inventors found ahpC and ahpF, which encode the two subunits of an alkyl hydroperoxide reductase previously shown to facilitate scavenging of reactive oxygen species (ROS) caused by paraquat. Additionally, the inventors identified a set of genes of the cys-regulon involved in cysteine metabolism, namely cysC, cysJ and cysK, which were previously shown to facilitate paraquat resistance in E. coli. The inventors next determined whether Record-seq was also capable of capturing transient paraquat exposure in a physiological range. After transiently stimulating cultures with paraquat (Fig. 5d), the inventors quantified cumulative gene expression and gene expression for Record-seq and RNA-seq data sets, respectively. Then, the inventors assessed whether the two methods were capable of capturing the transient paraquat exposure by PCA (Fig. 5e, f), and differentially expressed signature gene clustering (Fig. 15c, d). These analyses show that Record-seq, but not RNA-seq, was capable of capturing the transient paraquat exposure (Fig. 5e, f and Fig. 15c, d). Taken together, these results demonstrate that the memory of paraquat exposure was lost within the cellular transcriptome as assessed by RNA-seq, but preserved within the molecular memories stored within the DNA of the CRISPR arrays of the sentinel cells as investigated by Record-seq.

Sentinel cells recording the gut environment in mice

Microbes have evolved to adapt and survive in diverse environments, including intestinal niches with diverse micronutrient availabilities. The gene expression patterns of these microbes reflect the extracellular environment they inhabit and could therefore provide key information on the nutrients that enable colonization as well as maintenance of commensal and pathogenic microbes. This could provide a clear entry point for devising and testing clinical interventions that attempt to address dysbiosis of gut microbiota, which has been causally linked to inflammatory bowel diseases (IBD) such as Crohn’s disease and ulcerative colitis, as well as malnutrition, where supplementation with sugars and amino acids that are deficient in the diet has been demonstrated to be corrective in animal models and human infants. Unfortunately, microbial gene expression is transient and does not remain constant over time and throughout transit of microbes through the human intestine. Consequently, microbial gene expression patterns in intestinal niches are only accessible through highly invasive sample collection. The Record-seq technology presented by the inventors can address these limitations by creating sentinel cells that constantly record their environment as they transit through the mammalian intestine. It therefore has enormous potential to monitor human gut health and perturbations in the gut microbiome in a non-invasive manner, through collection of these sentinel cells from fecal sources, forming the basis for personalized medicine. Further, in combination with metagenomic data, Record-seq data from multiple sentinel microbes could help monitor changes in microbe-microbe and host-microbe interactions in the context of alterations in the gut.

The inventors investigated the potential of various strains of E. coli cells overexpressing FsRT-Cas1-Cas2 to function as transcriptional recorders (i.e. sentinel cells) when transiting through the murine gut. To this end the inventors monocolonized gnotobiotic C57BL/6 mice with BL21 (DE3) or MG1655 E. coli cells encoding an anhydrotetracycline-inducible FsRT-Cas1-Cas2 expression cassette through oral gavage. Expression of FsRT-Cas1-Cas2 was induced non-invasively via the administration of anhydrotetracycline through the drinking water of the animals along with kanamycin to ensure maintenance of the recording plasmid. Subsequently, these E. coli cells were longitudinally sampled from the feces of the mice as well as from different intestinal compartments at the endpoint of the experiment. Following plasmid DNA extraction, SENECA and deep-sequencing, the inventors could isolate newly acquired spacers (Figure 16).

Throughout their experiments, the inventors demonstrated that recording of new spacers increased when the concentration of aTc in the drinking water was raised, thus inducing stronger FsRT-Cas1-Cas2 expression (Figure 17 and Figure 18). Furthermore, spacers were recorded throughout the gastrointestinal tract, as evident from spacers accumulating from the small intestine to the cecum and colon of the mice (Figure 19). Finally, the inventors demonstrated that the number of spacers obtained from fecal samples increased over time, indicating that the bacteria robustly colonized the gut and continuously acquired new spacers throughout the experiment (Figure 20).

The inventors then assessed the potential of Record-seq to detect different microenvironments and disease conditions in the murine gut. In one example, the inventors induced colitis by administering 1%, 2% or 3% (w/v) dextran sulfate sodium (DSS) in the drinking water of the animals. The corresponding data can be used to classify the three treatment conditions using principal component analysis (PCA) merely by performing Record-seq on cells isolated from feces of the treated animals (Figure 21).

Similarly, in another experiment, the inventors were able to accurately distinguish whether animals were fed a starch-based or a chow-based diet (Figure 22). Together, these experiments indicate that Record-seq based sentinel cells can stratify treatment conditions as well as reveal distinct signatures of the luminal environment and thus could serve as a diagnostic device.

This was further bolstered by performing differential expression analysis on the respective Record-seq datasets to pinpoint the exact genes that were differentially expressed in response to different treatment conditions (Figure 23 and Figure 24). In the colitis experiment the inventors observed signatures of nitrite reduction - likely a consequence of host inflammatory NOS upregulation. Also, in the differential diet experiment the inventors observed that sugar acid catabolism genes were induced in mice fed a starch diet, whereas the Entner-Doudoroff pathway and methylglyoxal shunt genes were induced on a chow diet, likely due to the availability of plant cell wall glycosides. In additional experiments using E. coli MG1655 cells, the inventors confirmed that Record-seq could also readily distinguish three different diets, in this case based on chow, starch and fat (Figure 25).

Discussion

Here, the inventors describe Record-seq, a technology to encode transcriptome-scale events into DNA and assess the cumulative gene expression of populations of cells. The inventors demonstrate its potential by recording specific and complex transcriptional information. First, to improve upon existing spacer readout methods the inventors developed SENECA, resulting in a several thousand-fold improvement of spacer detection efficiency compared to recent reports, thereby enabling in-depth characterization of FsRT-Cas1-Cas2 and its application as a molecular recorder. The inventors’ results suggest that RNA-derived spacers are preferentially acquired from the ends of abundant transcripts from AT-rich regions with no PAM, and are broadly sampled at transcriptome-scale, enabling the parallelized quantification of cumulative transcript expression.

In a set of experiments, the inventors show that upon increasing induction of arbitrary sequences, spacers are acquired in an orthogonal, dose-dependent manner and highly correlate with the absolute mRNA copy number in the cell, thus demonstrating that the molecular record faithfully recapitulates the initial stimulus in a predictable way. This also paves the way for increasingly multiplexed and orthogonal molecular recording devices.

Upon inducing complex cellular behaviors, Record-seq provides a meaningful transcriptome-scale record of molecular events, which exceeds the capabilities of current molecular recording technologies that only record specific stimuli. Finally, the inventors use Record-seq to elucidate dose-dependent features of the complex cellular response to the bacteriostatic herbicide paraquat, and demonstrate that Record-seq, but not RNA-seq, is capable of recording transient paraquat stimulation.

Although additional work will greatly improve the capacity of Record-seq to encode richer and more dynamic expression and lineage information within fewer cells, the inventors’ proof-of-principle experiments introduce a powerful tool to record transcriptome-scale events permanently in DNA for later reconstructing complex molecular histories from populations of cells. The inventors show that the recorded transcriptional histories reflect the underlying gene expression changes and could therefore be used to interrogate biological or disease processes. In the long term, the inventors envision that CRISPR spacer acquisition components could be introduced into other cell types to record the molecular sequence of events, and lineage path, that gives rise to particular cell behaviors, cell states and types.

Methods

Ortholog discovery pipeline

The protein sequence of Arthrospira platensis RT-Cas1 (WP_006620498) was used as a seed sequence, and a JACKHMMER search was run against all NCBI non-redundant protein sequences using HMMER v3.1b2 (E-value cutoff of 1E-05). Proteins with both Cas1 and RT domains were subsequently identified using HMMSCAN (E-value cutoff of 1E-05). Genome sequence information for the candidate proteins was retrieved and further inspected for the presence of RT-Cas1, Cas2, and a CRISPR array using CRISPRdetect v2.0, CRISPRone, and HMMSCAN. From 121 candidate proteins, 14 CRISPR loci were selected and subsequently aligned using MUSCLE v3.8.31 to identify candidate domains and catalytic residues. Genetic distances were computed using the Jukes-Cantor method and a phylogenetic tree was built using the Nearest-Neighbour method.
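The Jukes-Cantor distance mentioned above can be illustrated with a minimal Python sketch. It is not the inventors' code: the function names are hypothetical, gap columns are simply skipped (an assumption), and the number of character states is left as a parameter (4 for nucleotides, 20 for amino acids) because the text does not state which alphabet was used. The correction is d = -b ln(1 - p/b), with b = (k - 1)/k for k states and p the observed fraction of differing sites.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Observed fraction of differing sites between two aligned sequences.

    Columns containing a gap ('-') in either sequence are skipped (assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def jukes_cantor(p: float, n_states: int = 4) -> float:
    """Jukes-Cantor corrected distance for an observed difference fraction p.

    n_states is 4 for nucleotides or 20 for amino acids.
    Returns infinity when p reaches the saturation limit (n_states - 1) / n_states.
    """
    b = (n_states - 1) / n_states
    if p >= b:
        return math.inf
    return -b * math.log(1 - p / b)
```

For two aligned sequences, `jukes_cantor(p_distance(a, b))` gives the corrected pairwise distance; a matrix of such distances is the usual input to a distance-based tree-building method.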

Bacterial strains and culture conditions

Escherichia coli strains used in this study were Stbl3 (Thermo Fisher Scientific) for cloning purposes as well as BL21(DE3) Gold (Agilent Technologies), BL21AI (Invitrogen) and NovaBlue(DE3) (EMD Millipore) as a K12 strain for acquisition assays. All strains were made competent using the Mix & Go E. coli Transformation Kit & Buffer Set (Zymo Research) following the manufacturer’s protocol with growth in ZymoBroth at 19 °C directly from fresh colonies. After transformation, cells were grown at 37 °C on lysogenic broth (LB) (Difco) 1.5% agar plates containing 50 µg/mL kanamycin and 1% glucose (w/v) to reduce background expression from the T7lac system. Liquid cultures for plasmid isolation were grown in TB media (24 g/L yeast extract, 20 g/L tryptone, 4 mL/L glycerol, 17 mM KH2PO4, 72 mM K2HPO4) containing 1% glucose (w/v).

Generation of Golden Gate compatible pET30 overexpression vector

All standard PCRs for cloning were performed using Phusion Flash High-Fidelity PCR Master Mix (Thermo Scientific) or KAPA HiFi HotStart ReadyMix (Roche). Oligonucleotides and gBlocks were ordered from Integrated DNA Technologies. Primers are listed in Table 6.

pET30b(+) (kind gift from Markus Jeschek) was PCR amplified as five fragments using primers FS_151/FS_152, FS_153/FS_154, FS_155/FS_156, FS_157/FS_158 and FS_159/FS_160, respectively, in order to remove the five undesired BbsI restriction sites present in the backbone. The resulting PCR fragments were assembled using 2x HiFi DNA Assembly Mastermix (NEB), yielding pFS_0012. Subsequently, oligos FS_380 and FS_381 were annealed to generate a double stranded DNA (dsDNA) fragment encoding the T7 terminator and cloned into pFS_0012 using XhoI/CsiI, yielding pFS_0013 - a pET30 derived overexpression vector harboring two Golden Gate cloning sites and thus facilitating parallel cloning of RT-Cas1, Cas2 as well as a corresponding CRISPR array. Nucleotide sequences of all RT-Cas1 and Cas2 orthologs tested in this study along with their corresponding CRISPR arrays are listed under Sequences.

Golden Gate Assembly of RT-Cas1-Cas2 overexpression vectors for ortholog screen

RT-Cas1, Cas2 and CRISPR array sequences were ordered from Twist Biosciences and Genscript. Putative CRISPR arrays were ordered as sequences consisting of the leader sequence followed by DR-nativespacer1-DR-nativespacer2-DR. Furthermore, each fragment was flanked by BbsI restriction sites generating overhangs facilitating Golden Gate Assembly into pFS_0013. Briefly, 40 fmol per fragment (RT-Cas1, Cas2, corresponding CRISPR array, pFS_0013 acceptor vector), 1 µL ATP/DTT mix (10 mM each), 0.25 µL T7 DNA Ligase (Enzymatics), 0.75 µL BpiI (Thermo Scientific) and 1 µL buffer green, made up to 10 µL with PCR grade H2O, were subjected to 99 cycles of 37 °C for 3 min and 16 °C for 5 min, followed by 80 °C for 10 min. Subsequently, 5 µL of this mixture were transformed into 50 µL Stbl3 cells and recovered in SOC media for 30 min at 37 °C, 1000 rpm before spreading on plates.

Spacer acquisition

Acquisition assays were performed at 37 °C, 300 rpm in bacterial culture tubes containing 3 mL of TB media supplied with 100 mM isopropyl-β-D-thiogalactopyranoside (IPTG) (Sigma Aldrich) for BL21(DE3) Gold and NovaBlue(DE3). For E. coli BL21AI, L-(+)-arabinose (Sigma Aldrich) was additionally added to 0.2% (w/v). Each culture was inoculated with 2 colonies of bacteria stored no longer than 14 days at 4 °C upon transformation and overnight growth at 37 °C. When cultures reached saturation (typically 12-14 h post inoculation), 2 mL of bacterial culture were harvested and plasmids containing CRISPR arrays were isolated by standard plasmid Mini-Prep procedures to serve as a template for preparation of deep sequencing libraries.

Amplification of CRISPR arrays for classical acquisition readout by deep sequencing

Leader proximal spacers were PCR amplified from 3 ng of plasmid DNA per µL of PCR reaction using NEBNext High-Fidelity 2X PCR Master Mix (NEB) with a forward primer binding in the leader sequence of the respective CRISPR array and a reverse primer binding in the first native spacer (Primer Design Note 1 and Table 2 for primer design and binding sites of individual CRISPR arrays, respectively). For each biological replicate, 12 individual PCR reactions of 10 µL were performed with an extension time of 15 sec for 16 cycles. The individual 10-µL reactions belonging to the same biological sample were then pooled, and residual primers removed using homemade AMPure beads at a PCR to bead ratio of 1:1.5 (v/v), eluting the PCR product in 60 µL of buffer TE. Subsequently, 500 ng of first round PCR product per biological sample was run on a 3% LAB agarose gel (300 V, 55 min, cooling the gel-chamber in an ice-water bath during the run) and purified by blind excision of gel slices at 211 to 300 bp, avoiding the prominent DNA band corresponding to PCR products of the unexpanded array (i.e. no acquisition of novel spacers). Amplicons were then purified from the gel slices using the QIAquick Gel Extraction Kit (QIAGEN) and eluted into 22 µL of buffer EB. Illumina sequencing adaptors and indices were appended in a second round of PCR, using 6 µL of gel purified input DNA as a template in a 20 µL PCR reaction with universal second round deep sequencing primers attaching P5 and P7 handles for binding of PCR products to the flow cell in deep sequencing as well as barcoding the samples with (N)e barcodes corresponding to Illumina TruSeq HT indices (Primer Design Note 2 and Table 3 for primer design and indices, respectively). After this second round of PCR, products were purified using the QIAquick PCR Purification Kit (QIAGEN) and eluted in 22 µL buffer EB. Samples were then pooled and subjected to another round of gel purification using the same parameters as described above, this time excising products in the range of 280 to 350 bp.

Selective amplification of expanded CRISPR arrays (SENECA)

FsCRISPRArray2 was amplified from pFS_160 using FS_871/FS_904, generating a minimal Fs CRISPR array consisting of the leader sequence and a single DR followed by a FaqI restriction site (CTTCAG) on the bottom strand, resulting in plasmid pFS_0235 as the standard recording plasmid. This plasmid was transformed into chemocompetent BL21(DE3) Gold bacteria or NovaBlue(DE3) (EMD Millipore) and subjected to spacer acquisition as described above. Following plasmid extraction and quantification using the Quant-IT PicoGreen dsDNA Assay Kit (Thermo Scientific) read out with a Tecan M1000 Pro Microplate reader, plasmid DNA was subjected to SENECA-adapter ligation in a Golden Gate reaction.

Oligonucleotides FS_0963/FS_0964 were annealed (2.5 µL each of 100 µM oligo, 5 µL NEBuffer 2 (NEB), 40 µL PCR grade H2O) by heating to 95 °C for 5 min and cooling to 20 °C at 0.12 °C/sec. Annealed oligos were diluted 1:100 in TE buffer. Next, 40 fmol of plasmid DNA (180.3 ng for pFS_0235), 0.25 µL T7 Ligase (Enzymatics), 1 µL FastDigest FaqI, 0.5 µL of 20x SAM, 1 mM ATP, 1 mM DTT (all Thermo Scientific) and 1 µL of annealed, diluted oligonucleotides FS_0963/FS_0964 in 10 µL total volume were subjected to 99 cycles of 3 min at 37 °C and 3 min at 20 °C, followed by 15 min at 55 °C. First round deep sequencing PCR was performed using NEBNext High-Fidelity 2X PCR Master Mix (NEB) (forward primers: FS_0968 to FS_0974, reverse primer: FS_0911). For each biosample one 30 µL reaction containing 10.38 µL of adapter ligated plasmid DNA was performed (98 °C for 30 s; 22 cycles at 98 °C for 10 s, 57 °C for 30 s and 72 °C for 20 s, followed by 72 °C for 5 min), pooled and purified by magnetic beads (GE Healthcare) at a PCR to bead ratio of 1:1.6 (v/v), recovering the PCR product in 25 µL TE buffer (Primer Design Note 3 for details on primer design). Illumina sequencing adaptors and indices were appended in a second round of PCR (98 °C for 30 s, 8 cycles of 98 °C for 10 s, 65 °C for 30 s and 72 °C for 30 s, and 72 °C for 5 min) using 5 µL of first round PCR product as input in a 20 µL reaction (Primer Design Note 2 and Table 3 for primer design and indices, respectively). Samples were pooled, desalted using the QIAquick PCR Purification Kit (QIAGEN) and size selected on E-Gel EX Agarose Gels, 2% (Thermo Scientific), loading 200-500 ng of DNA per lane, extracted using the QIAquick Gel Extraction Kit and subjected to deep sequencing on Illumina MiSeq or NextSeq500 platforms using the MiSeq Reagent Kit v3 (150-cycle) or NextSeq 500/550 Mid/High Output v2 kit (150 cycles) (both Illumina), respectively. Libraries were loaded at a concentration of 1.4 to 1.6 pM as determined by qPCR using the KAPA Library Quantification Kit for Illumina® Platforms (Roche). PhiX was included at 5-10%.
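The relation between a molar amount of plasmid and its mass, as used for the 40 fmol input above, can be sketched as follows. This is an illustrative calculation only: the average mass of 650 g/mol per base pair is a common approximation and an assumption here, the function names are hypothetical, and the plasmid length is not stated in this section.

```python
# Average molar mass of one double-stranded base pair (g/mol).
# Common approximation, assumed here; not taken from the text.
AVG_BP_MASS = 650.0


def fmol_to_ng(fmol: float, length_bp: float, bp_mass: float = AVG_BP_MASS) -> float:
    """Mass in ng of `fmol` femtomoles of a dsDNA molecule of `length_bp` base pairs.

    fmol x (g/mol) gives femtograms; 1 ng equals 1e6 fg.
    """
    return fmol * length_bp * bp_mass / 1e6


def implied_length_bp(ng: float, fmol: float, bp_mass: float = AVG_BP_MASS) -> float:
    """Length in bp implied by a stated mass (ng) for a stated amount (fmol)."""
    return ng * 1e6 / (fmol * bp_mass)


if __name__ == "__main__":
    # Under the 650 g/mol assumption, the stated 180.3 ng per 40 fmol
    # corresponds to a plasmid of roughly this many base pairs.
    print(round(implied_length_bp(180.3, 40)))
```

Under this approximation the stated 180.3 ng per 40 fmol would correspond to a plasmid of about 6.9 kb; the same helper can be used to scale the input mass for recording plasmids of other sizes.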

SENECA based ortholog screen

For the SENECA based CRISPR array directionality screen, putative CRISPR arrays were extracted from genomic sequences, assuming a standard leader length of 150 nt followed by a single DR. The FaqI restriction site required for SENECA was appended downstream of the DR and sequences were flanked by universal adapters for amplification and cloning. The final array sequences including these features are depicted under Sequences 2 and were ordered from Twist Biosciences as linear DNA fragments. These were PCR amplified using primers FS_1406/FS_1407 and cloned into CsiI/NotI-digested plasmids containing their respective RT-Cas1-Cas2 ortholog using HiFi DNA Assembly (NEB). Upon transformation into E. coli BL21(DE3), these constructs were subjected to the standard spacer acquisition assay in TB media. Plasmid DNA was extracted and subjected to SENECA adapter ligation. The respective oligos to be annealed for each CRISPR array tested in this experiment are listed in Table 4. Following adapter ligation, a single 140 µL first round PCR reaction was prepared for each ortholog using NEBNext High-Fidelity 2X PCR Master Mix and containing the entire 20 µL SENECA adapter ligation as a template. First round PCR primers specific to the respective DR of each CRISPR array tested are listed in Table 5. The 140 µL PCR reaction was split into 12 reactions of 11 µL along the row of a 96-well plate. This plate was subjected to a gradient PCR (53 to 68 °C in an Eppendorf Mastercycler Gradient). This procedure was chosen because SENECA leverages the fact that, at a unique annealing temperature, a DR matching primer will only bind to the full DR resulting from an acquisition event but not to the truncated parental DR. By splitting the PCR reaction and subjecting it to a temperature gradient, it is ensured that, without a priori knowledge, at least one of the 12 reactions is subjected to the annealing temperature at which selective amplification of expanded CRISPR arrays occurs. PCR was performed for 30 cycles, upon which the 12 reactions performed along the temperature gradient were pooled again, purified using 1.85x Ampure beads and eluted in 25 µL TE buffer. Five µL of this elution were used as a template for a standard 20 µL second round PCR at 65 °C annealing temperature for 12 cycles as described above. Subsequently, PCR products were purified using 2.2x Ampure beads, eluted into 22 µL TE buffer, size selected as described in the standard SENECA protocol (E-Gel EX 2%, followed by gel extraction) and subjected to deep sequencing.
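The reasoning behind the 12-well gradient can be made concrete with a short sketch. It assumes an evenly spaced (linear) gradient between 53 and 68 °C, which is a simplification, since real gradient blocks may deviate from linearity; the function names are hypothetical.

```python
def gradient_temperatures(low_c: float = 53.0, high_c: float = 68.0, n_wells: int = 12) -> list:
    """Annealing temperature per well for an idealized linear gradient (assumption)."""
    step = (high_c - low_c) / (n_wells - 1)
    return [round(low_c + i * step, 2) for i in range(n_wells)]


def closest_well(target_c: float, temps: list) -> int:
    """Index of the well whose temperature is nearest to the target temperature."""
    return min(range(len(temps)), key=lambda i: abs(temps[i] - target_c))


if __name__ == "__main__":
    temps = gradient_temperatures()
    for well, temp in enumerate(temps, start=1):
        print(f"well {well:2d}: {temp:.2f} C")
```

With 12 wells spanning 15 °C, adjacent wells differ by about 1.4 °C under this idealization, so any selective annealing temperature within the range lies within roughly 0.7 °C of at least one reaction.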

Deep sequencing

Small scale targeted deep sequencing of CRISPR arrays for the ortholog screen was performed using the Illumina MiSeq v3 300 cycle kit on an Illumina MiSeq platform or the Illumina HiSeq High Output PE 200 cycle kit on an Illumina HiSeq 2500. Spacer libraries prepared using SENECA were sequenced using the NextSeq 500/550 High Output Kit v2 (150 cycles) on an Illumina NextSeq platform or the MiSeq Reagent Kit v3 (150-cycle) on a MiSeq.

Data analysis pipeline

FASTQ files were quality filtered and trimmed using trimmomatic (trimmomatic SE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:75) and subsequently converted to FASTA files using FASTX-Toolkit v0.0.14 (fastq_to_fasta) (http://hannonlab.cshl.edu/fastx_toolkit/). Using custom scripts written in Python 2.7, spacers were identified based on the identification of a 20-66 nucleotide sequence between two 10-nt DR segments, allowing for 2 and 3 mismatches in the first and second DR segment, respectively. Arrays with multiple spacers were identified based on the presence of a complete DR sequence, allowing for 3 mismatches. Only unique spacers (>1 mismatch) from a given sample were further processed. Spacers were aligned to a merged reference genome containing plasmid and E. coli sequences [E. coli BL21(DE3) Gold (NC_012947.1) genome, E. coli K12 (NC_000913.3)] using bowtie2 (bowtie2 --very-sensitive-local). In MS2 challenge experiments, the MS2 sequence [MS2 (NC_001417.2)] was also included in the merged reference genome. Identical alignments were collapsed using samtools v1.3, and alignments were visualized in Geneious v10.2.3. Basic statistics about numbers of reads or alignment features were calculated using standard bash commands, and compiled and visualized using Prism 7.0d. Gene body percentiles were calculated using RSeQC (geneBody_coverage.py v2.6.4). Nucleotide probabilities were determined and visualized using the weblogo webtool v2.8.2. Simulated spacer datasets were prepared using BEDtools v2.25 (bedtools random -n 500 -l 38). Transcript quantification for RNA-seq and Record-seq was performed using featureCounts v1.5.0. Using custom scripts written in Matlab v9.1.0, RNA-seq and Record-seq transcript counts were normalized using transcripts per million (TPM) and used to compute cumulative spacer sums, a linear regression fit, coefficient of determination (R^2), and Pearson linear correlation coefficient.
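The spacer identification step described above (a 20-66 nt sequence between two 10-nt DR segments, tolerating 2 and 3 mismatches, respectively) can be sketched in Python as follows. This is not the inventors' custom script: the function names are hypothetical, the DR segments are passed in by the caller, mismatches are counted as substitutions only (no indels), and the first and shortest qualifying spacer is returned, all of which are assumptions.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_spacer(read, dr_left, dr_right, min_len=20, max_len=66,
                max_mm_left=2, max_mm_right=3):
    """Return the first spacer flanked by two DR segments, or None.

    dr_left is the DR segment expected upstream of the spacer and
    dr_right the segment expected downstream (both 10 nt in the text).
    """
    k_left, k_right = len(dr_left), len(dr_right)
    # Last start position that still leaves room for a minimal spacer
    # and the downstream DR segment.
    last_start = len(read) - k_left - min_len - k_right
    for i in range(last_start + 1):
        if hamming(read[i:i + k_left], dr_left) > max_mm_left:
            continue
        start = i + k_left
        for length in range(min_len, max_len + 1):
            end = start + length
            if end + k_right > len(read):
                break
            if hamming(read[end:end + k_right], dr_right) <= max_mm_right:
                return read[start:end]
    return None
```

Because three mismatches in ten positions is a permissive threshold, a production script would likely add safeguards such as checking low-complexity DRs for shifted matches and deduplicating near-identical spacers, as the text describes for unique spacers.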

Record-seq datasets corresponding to oxidative or acid stress treatments were analyzed using custom scripts written in R v3.4.4. Briefly, transcripts with less than 5 counts across replicates were discarded. Heatmaps representing unsupervised hierarchical clustering of Pearson linear correlation with complete linkage (using raw transcript counts as inputs) were prepared using the ‘heatmap.2’, ‘hclust’, and ‘cor’ commands with default settings. Principal component analysis (PCA) was performed on log2 transformed data (raw counts plus one pseudocount to tolerate zeros) for the 50 most variable (standard deviation) genes using the ‘prcomp’ command with default settings. Differential expression analyses (using raw counts plus one pseudocount as input) were performed using DESeq2 v1.14.1, edgeR v3.16.5, and baySeq v2.8.0 encapsulations within R. Heatmaps representing unsupervised hierarchical clustering of signature differentially expressed genes were prepared using the ‘pheatmap’ command with default settings.
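The pre-processing that precedes the PCA (log2 transformation with one pseudocount, followed by selection of the most variable genes) can be illustrated with a standard-library Python sketch. The original analysis was performed in R; the function names below are hypothetical, and ranking by the standard deviation of the log2-transformed values (rather than of the raw counts) is an assumption, since the text does not specify this.

```python
import math
import statistics


def log2_pseudocount(counts: dict) -> dict:
    """log2(count + 1) for each gene across samples; the pseudocount tolerates zeros."""
    return {gene: [math.log2(c + 1) for c in values]
            for gene, values in counts.items()}


def most_variable_genes(log_counts: dict, n: int = 50) -> list:
    """Names of the n genes with the highest sample standard deviation.

    Requires at least two samples per gene.
    """
    ranked = sorted(log_counts,
                    key=lambda gene: statistics.stdev(log_counts[gene]),
                    reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    # Toy counts for three hypothetical genes across three samples.
    raw = {"geneA": [0, 0, 0], "geneB": [1, 3, 7], "geneC": [3, 3, 4]}
    logged = log2_pseudocount(raw)
    print(most_variable_genes(logged, n=2))
```

The resulting gene-by-sample matrix of log2 values for the selected genes is what would then be passed to a PCA routine.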

Code availability

The custom scripts used for the described data analysis are available on the Platt Lab website (platt.ethz.ch).

RNASeq of E. coli BL21(DE3)

RNA extraction from E. coli BL21(DE3) was performed after overnight growth under induction of FsRT-Cas1-Cas2 expression following the QIAGEN Supplementary Protocol: Purification of total RNA from bacteria using the RNeasy Mini Kit. To achieve the appropriate amount of input culture (corresponding to 5 x 10^8 cells), serial dilutions of the overnight culture were prepared to achieve an OD600 between 0.2 and 0.6 measured with a NanoDrop OneC (Thermo Scientific). Bacteria were lysed using acid-washed glass beads (G1277-10G, Sigma Aldrich). The additional on-column DNase digestion was performed using the RNase-Free DNase Set (QIAGEN). DNA-free RNA was submitted to the Genomics Facility Basel for ribosomal RNA (rRNA) depletion using the Ribo-Zero rRNA Removal Kit (Illumina), followed by library preparation and sequencing on an Illumina NextSeq platform using the NextSeq 500/550 High Output v2 kit (150 cycles).

td intron

The gBlock FS_gBlock_td_intron_acceptor (Sequences 3) was cloned into pFS_0235 using SphI/SgrAI, yielding pFS_0238. This gBlock encoded the BBa_J23104 promoter, the ribosome binding site from bacteriophage T7 gene 10 as well as the td intron sequence including flanking regions facilitating efficient splicing. Furthermore, a BbsI-mediated Golden Gate cloning site was placed downstream and upstream of the td intron sequence, allowing for seamless assembly of upstream and downstream exon sequences in a single one-pot reaction as described above. As the inventors previously noticed that the 5' end of transcripts was preferentially acquired by the FsRT-Cas1-Cas2 complex, the inventors introduced the td intron within the first 23 to 31 nucleotides of the respective transcripts. The inventors created intron-interrupted sequences of three E. coli genes: cspA, rpoS and argR (cold shock protein CspA, RNA polymerase sigma factor RpoS and arginine repressor, respectively). These were selected because they were well sampled by the FsRT-Cas1-Cas2 complex in preceding SENECA experiments. The flanking exon sequences were mutated in four to six positions to yield optimized sequences for td intron splicing, which also aided in unambiguously distinguishing the spliced and endogenous transcripts or DNA.

Accordingly, the inventors ordered complementary oligonucleotides for the fragment of the transcript to be cloned 5' of the td intron and annealed them prior to Golden Gate Assembly, while the fragment to be cloned 3' of the intron was amplified by PCR from genomic DNA. Oligonucleotides were FS_1054/1055 (5' of the intron, annealed) and FS_1056/1057 (3' of the intron, PCR) for CspA; FS_1038/1039 and FS_1040/1041 for RpoS; FS_1046/1047 and FS_1048/1049 for ArgR. The inventors ensured that mutating sequences of the respective genes to those of the td intron flanking sites did not generate a stop codon. The td intron containing FsRT-Cas1-Cas2 overexpression constructs were subjected to a standard acquisition assay followed by plasmid DNA extraction, SENECA and deep sequencing.

Presence of td intron splice sites in DNA outside of the FsCRISPR array was tested by extracting gDNA from td-ArgR transformed cultures using the GenElute Bacterial Genomic DNA Kit (Sigma Aldrich). Libraries containing the td intron insertion site were amplified using a two-round PCR strategy analogous to the ones described above using forward primers FS_1154 to FS_1157 and reverse primers FS_1158 to FS_1161 (Table 6). First-round PCR was performed at 57 °C annealing temperature and 20 sec elongation for 15 cycles. Second-round PCR was performed at 63 °C annealing temperature and 20 sec elongation for 8 cycles.

Infection with MS2 Phage

For infections with MS2 phage, the recording plasmid pFS_0235 was transformed into the F', and thus MS2 susceptible, NovaBlue(DE3) Competent Cells (EMD Millipore). Next morning, 15 mL of TB containing 100 mM of IPTG were inoculated with 10 colonies and grown at 37 °C, 150 rpm in an orbital shaker until an OD600 of 0.24. Then, MgSO4 was added to 5 mM final concentration. Aliquots of 3 mL were split into bacterial culture tubes, infected with 200 µL of high-titre MS2 phage suspension and incubated for 1 h at room temperature without shaking to allow infection by MS2. Next, culture tubes were transferred to the orbital shaker and incubated overnight at 30 °C, 80 rpm. Growth of E. coli in presence of MS2 phage at 30 °C rather than 37 °C prevents lysis of cells by productive MS2. Next morning, shaking was increased to 150 rpm. Another day later (~41 h post-infection), cultures were pelleted by centrifugation, plasmid DNA was extracted and subjected to SENECA followed by deep sequencing.

Synthetic recording of sfGFP and Rluc transcripts

The Pcat-tetR-term_PtetO encoding fragment was amplified with primers FS_1123/FS_1125 from pLP167 (kind gift from Luzi Pestalozzi), digested with BamHI/AgeI and cloned into AgeI/BbsI-digested pFS_0238 (see cloning of td intron constructs), yielding pFS_0270, which contains a BbsI-mediated Golden Gate cloning site immediately downstream of the PtetA promoter. Subsequently, sfGFP was amplified from pLP167 with primers FS_1134/FS_1135 and Rluc was amplified using FS_1136/FS_1137 from BBa_J52008 (registry of standard biological parts). Both fragments were cloned into pFS_0270 using BbsI-mediated Golden Gate Assembly, yielding pFS_0271 (sfGFP) and pFS_0272 (Rluc), respectively. LuxR promoter parts were amplified with primers FS_1584/FS_1585 from plG0046 and FS_1586/FS_1587 from plG0059 (registry of standard biological parts) and cloned into AgeI-digested pFS_0270 using NEBuilder HiFi DNA Assembly Master Mix (NEB), resulting in pFS_0399. Oligos FS_1588/FS_1589 were annealed and cloned into pFS_0399 digested with SalI/BamHI, yielding pFS_0400. The Fluc coding sequence was amplified from BBa_I712019 (registry of standard biological parts) using FS_1618/FS_1619, digested with BsaI and cloned into BbsI-digested pFS_0400, resulting in pFS_0412, which was used in RNA recording experiments. For each biological replicate, 50 mL of IPTG containing TB media were inoculated with 22 colonies of E. coli BL21(DE3) transformed with pFS_0271 (sfGFP), pFS_0272 (Rluc) or pFS_0412 (Fluc). When reaching an OD600 of 0.25, cells were split into 3 mL aliquots in bacterial culture tubes and induced with aTc in case of the PtetA promoter or N-(3-Oxododecanoyl)-L-homoserine lactone (3OC6-HSL) (Sigma) in case of the PLuxR promoter, and cultured in an orbital shaker for 12-14 hours at 300 rpm, followed by plasmid DNA extraction, SENECA and deep sequencing. Spacers aligning to sfGFP, Rluc and Fluc were quantified as described above (see “Data analysis pipeline”). The number of unique spacers detected per million sequencing reads was normalized by defining the total number of spacers per biological replicate as 100% and plotted using GraphPad Prism v7.0d. For RNA-recording with pFS_0271 and pFS_0272, RNA extraction from the same cultures was performed using the RNAsnap method followed by treatment with the TURBO DNA-free Kit (Thermo Scientific) using 1.5 µL of TURBO DNase to minimize DNA background. Reverse transcription was performed using qScript cDNA SuperMix (Quanta Bio) with 500 ng of RNA sample as a template. cDNA was diluted 1:4 and quantification was performed in 2 technical replicates by real-time PCR (qRT-PCR) using TaqMan Fast Advanced Master Mix (Life Technologies) in a Roche LightCycler 96 System. Primer and probe sequences are listed in Table 7. Absolute copy number was calculated using the standard curve method, and 16S rRNA was used as a housekeeping reference. The mRNA copy number corresponding to the number of cells in a single SENECA reaction (6 x 10^9) was calculated based on the average amount of 18700 16S rRNA transcripts per single E. coli cell (BNID 102992).

Orthogonal synthetic recording

The Rluc coding sequence was amplified using FS_1620/FS_1137 from pFS_0272 and cloned into pFS_0399 using BbsI-mediated Golden Gate Assembly, yielding pFS_0413. The Fluc coding sequence was amplified from BBa_I712019 (registry of standard biological parts) using FS_1621/FS_1619, digested with BbsI and cloned into BsaI-digested pFS_0413, resulting in pFS_0414, which was subsequently used in orthogonal synthetic recording experiments.

For each biological replicate, 50 mL of TB media containing 100 mM IPTG were inoculated with 33 colonies of E. coli BL21(DE3) transformed with pFS_0414, containing (3-Oxododecanoyl)-L-homoserine lactone (3OC6-HSL)-inducible Fluc and aTc-inducible Rluc coding sequences. When reaching an OD600 of 0.25, cells were split into 3 mL aliquots in bacterial culture tubes and induced with 75 ng/mL of anhydrotetracycline hydrochloride (aTc) (Cayman Chemical) or 10 mM of 3OC6-HSL (Sigma) or a combination of both and cultured in an orbital shaker for 12 hours at 300 rpm, followed by plasmid DNA extraction, SENECA, deep sequencing as well as parallelized RNA extraction from the same culture followed by reverse transcription and qPCR measurements. Data were analyzed as described above for recording of single synthetic transcripts.

Transcriptional response to oxidative stress

Per biological replicate, 36 mL of TB media containing 100 mM IPTG were inoculated with 24 colonies of E. coli BL21(DE3) transformed with pFS_0235 the evening before (resulting in 1 colony/1.5 mL) and shaken in a 250 mL baffled shaker flask until reaching an OD600 of 0.24 to 0.25. Then cultures were split into 3 mL aliquots in bacterial culture tubes (Greiner) and treated with H2O2 (30% w/w solution, Sigma Aldrich) to a final concentration of 1 mM or an equal volume of ddH2O. Growth was continued for 12 hours at 300 rpm, followed by harvesting of 2 mL of culture for plasmid DNA extraction, SENECA and deep sequencing. Data were analyzed as described above (see “Data analysis pipeline”).

Transcriptional response to acid stress

For pH-controlled growth, potassium-modified lysogenic broth (LB) (10 g/L tryptone, 5 g/L yeast extract, 7.45 g/L KCl) was buffered with 100 mM HOMOPIPES (Homopiperazine-1,4-bis(2-ethanesulfonic acid)). Subsequently, the pH of the medium was adjusted to either 5.0 (acid stress) or 7.0 (neutral) using KOH solution as described previously. For each biological replicate, 50 mL of pH adjusted, IPTG containing LB media were inoculated with 33 colonies of E. coli BL21(DE3) transformed with pFS_0235 (resulting in 1 colony/1.5 mL). Samples were harvested at an OD600 between 0.3 and 0.6 for plasmid DNA extraction, SENECA and deep sequencing. Data were analyzed as described above (see “Data analysis pipeline”).

Cloning of aTc-inducible FsRT-Cas1-Cas2 expression construct

For recording the transcriptional response to paraquat, an aTc-inducible FsRT-Cas1-Cas2 expression construct was generated. To this end, a fragment containing the tet repressor driven by a constitutive promoter as well as the PtetA promoter was amplified from pFS_0271 using FS_1574/1575 and digested with BglI/SphI. Furthermore, the N-terminus of FsRT-Cas1-Cas2 was amplified with FS_1576/1577 and digested with SphI/BglII. These two fragments were cloned into BglI/BglII-digested pFS_0235, yielding pFS_0393. The codon optimized FsRT-Cas1-Cas2 sequence was obtained from Genscript, amplified using FS_1641/1642 and cloned into pFS_0393 using XhoI/SphI, replacing the initial FsRT-Cas1-Cas2 coding sequence and yielding pFS_0453 (SEQ ID NO 334).

Transcriptional response to 1 mM or 10 mM paraquat

Paraquat dichloride hydrate (PESTANAL, Sigma Aldrich) was dissolved at 1 M in ddH2O. For each biological replicate, 75 mL of TB media containing 30 ng/mL aTc were inoculated with 50 colonies of E. coli BL21 (DE3) transformed with pFS_0393 and shaken in baffled shaker flasks until reaching an OD600 of 0.24 to 0.25. The cultures were then split into 3 mL aliquots in bacterial culture tubes, treated with either 1 mM or 10 mM paraquat and cultured for an additional 11-12 hours before harvesting of 2 mL of culture for plasmid DNA extraction, SENECA and deep sequencing. Data were analyzed as described above (see “Data analysis pipeline”).

Transcriptional response to transient paraquat exposure

For each biological replicate, two colonies of E. coli BL21 (DE3) transformed with pFS_0453 were inoculated into 3 mL of TB media containing 30 ng/mL aTc in standard bacterial culture tubes. For the first 12 h, all cultures were cultivated in the absence of paraquat (300 rpm, 37 °C). Then 2 mL of culture were aspirated, while the remaining 1 mL was spun down (2300 x g, 10 min); the supernatant was aspirated and the bacterial pellet was resuspended in 3 mL of fresh TB media containing 30 ng/mL of aTc. For both the transient and the permanent stimulus conditions, paraquat was added to a final concentration of 10 mM and the cultures were grown for an additional 12 h as above. Then 2 mL of culture were removed, and the remaining 1 mL was pelleted as above and resuspended in 3 mL of fresh TB media containing 30 ng/mL of aTc. Paraquat was added to 10 mM in the permanent stimulus condition and cultures were grown for an additional 12 h as above. Then 2 mL of culture were harvested for plasmid DNA extraction, SENECA and deep sequencing. Additionally, 100 µL of culture were harvested for RNA extraction by the RNAsnap protocol as described above, followed by treatment with the TURBO DNA-free Kit (Thermo Scientific) using 1.5 µL of TURBO DNase. Ribosomal RNA was depleted using the Ribo-Zero rRNA Removal Kit (Illumina), followed by library preparation using TruSeq Stranded mRNA (Illumina) and deep sequencing with a NextSeq 500/550 High Output v2 kit (75 cycles), sequencing each library at a depth of 4 million reads or greater.

Bacterial population inputs for Record-seq experiments and achieved recording efficiencies

Record-seq experiments were performed in standard 12 mL culture tubes filled with 3 mL of terrific broth (TB) media, of which 2 mL were used for subsequent plasmid DNA extraction. In early experiments the inventors determined that using 40 fmol (180 ng of plasmid DNA) as an input to SENECA gave consistent results and left enough plasmid for archiving samples and performing several additional SENECA reactions on the same sample if necessary.

Accordingly, 40 fmol can be used to contextualize the number of cells in a typical experiment. The construct depicted in Figure 2a (pFS_0235) has a size of 7293 bp, and 40 fmol of plasmid DNA was used as an input for a SENECA reaction. Using the formula [mass of dsDNA (g) = moles of dsDNA (mol) x ((length of dsDNA (bp) x 617.96 g/mol) + 36.04 g/mol)], this equals a mass of 180.3 ng of plasmid DNA. These 40 fmol of plasmid DNA equal a total of 2.4 x 10^10 plasmids (using Avogadro’s number, 1 mole being equal to 6.022 x 10^23 particles, and multiplying this by 40 x 10^-15 to account for the 40 fmol used). Assuming a copy number of ~20 for the pET origin, this corresponds to 1.2 x 10^9 cells used as a standard input per SENECA reaction.
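The input calculation above can be retraced with a short Python sketch. It uses only the values stated in the text (7293 bp, 40 fmol, a copy number of about 20); the function and variable names are illustrative and not part of the original disclosure.

```python
# Sketch of the plasmid-input arithmetic; names are hypothetical.
AVOGADRO = 6.022e23  # particles per mole, as quoted in the text


def dsdna_mass_g(moles: float, length_bp: int) -> float:
    """Mass of dsDNA in grams, using the formula quoted in the text."""
    return moles * (length_bp * 617.96 + 36.04)


plasmid_bp = 7293    # size of pFS_0235
input_mol = 40e-15   # 40 fmol of plasmid per SENECA reaction
copy_number = 20     # assumed copy number for the pET origin

mass_ng = dsdna_mass_g(input_mol, plasmid_bp) * 1e9
n_plasmids = input_mol * AVOGADRO
n_cells = n_plasmids / copy_number

print(f"{mass_ng:.1f} ng")            # 180.3 ng
print(f"{n_plasmids:.1e} plasmids")   # 2.4e+10 plasmids
print(f"{n_cells:.1e} cells")         # 1.2e+09 cells
```

The printed values agree with the 180.3 ng, 2.4 x 10^10 plasmids and 1.2 x 10^9 cells given in the text.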

A single SENECA reaction of pFS_0235 eventually yields ~6,126 spacers when the entire adapter-ligated plasmid DNA is used for PCR amplification (two 30 µL PCR reactions, each containing 10 µL of adapter-ligated plasmid DNA). Using the optimized FsRT-Cas1-Cas2 expression construct, which encodes an E. coli codon-optimized FsRT-Cas1-Cas2 coding sequence under transcriptional control of the aTc-inducible PtetA promoter (pFS_0453; Extended Data Fig. 10a, b), the efficiency increased ~10-fold to 61,462 spacers per SENECA reaction. Accordingly, 40 fmol of plasmid DNA acquired 61,462 spacers, which is equal to one in 390,485 plasmids acquiring a new spacer. Assuming the copy number of pET30b to be 20, this corresponds to one in 19,524 cells acquiring a new spacer.
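As a check on these figures, the acquisition frequency can be recomputed in a few lines of Python. The sketch assumes the rounded plasmid count of 2.4 x 10^10 per 40 fmol, which is what reproduces the quoted numbers; variable names are illustrative.

```python
# Sketch of the recording-efficiency arithmetic; names are hypothetical.
plasmids_in_reaction = 2.4e10      # 40 fmol of plasmid, rounded as in the text
spacers_pfs_0235 = 6_126           # spacers per SENECA reaction, pFS_0235
spacers_pfs_0453 = 61_462          # spacers per SENECA reaction, pFS_0453
copy_number = 20                   # assumed plasmid copy number per cell

fold_increase = spacers_pfs_0453 / spacers_pfs_0235
plasmids_per_spacer = plasmids_in_reaction / spacers_pfs_0453
cells_per_spacer = plasmids_per_spacer / copy_number

print(f"{fold_increase:.1f}-fold")                         # 10.0-fold
print(f"1 in {int(plasmids_per_spacer):,} plasmids")       # 1 in 390,485 plasmids
print(f"1 in {int(cells_per_spacer):,} cells")             # 1 in 19,524 cells
```

Using the unrounded plasmid count (40 fmol x 6.022 x 10^23 per mole) would shift the result slightly, to roughly one in 392,000 plasmids.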

Based on the number of spacers required to detect a specific stimulus, this calculation can be used to derive the number of cells needed as a minimal input for the respective recording. For example, the inventors determined that as few as 500 spacers are required for assessing an arbitrary sequence (sfGFP), which corresponds to 8.8 x 10^6 E. coli cells (Fig. 11g).

Likewise, the inventors estimated the number of spacers required to detect complex cellular behaviors to be 313 (7% of the original data; Fig. 13, 14). This equals 6.1 x 10^6 E. coli cells used as an input. The total number of spacers required to record a complex stimulus happens to be lower than that required to record a defined stimulus (sfGFP) because, in the complex case, spacers mapping to many different genes contribute to a ‘usable output’, whereas in the case of a defined stimulus only a subset of the required total of 500 spacers maps to the single gene of interest (sfGFP).
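The conversion from a required spacer count to a minimal cell input can be sketched as follows for the complex-behavior case. The sketch assumes the rate of one new spacer per 19,524 cells stated earlier for pFS_0453; the helper name is hypothetical.

```python
# Sketch: minimal cell input for a required number of spacers.
CELLS_PER_SPACER = 19_524  # one new spacer per 19,524 cells (pFS_0453)


def cells_required(n_spacers: int, cells_per_spacer: int = CELLS_PER_SPACER) -> int:
    """Number of cells expected to yield n_spacers newly acquired spacers."""
    return n_spacers * cells_per_spacer


complex_case = cells_required(313)
print(f"{complex_case:.1e} cells")   # 6.1e+06 cells
```

The helper is a plain proportionality, so it only holds as long as the acquisition rate is the same under the conditions being compared.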

Type III versus Type I CRISPR-Cas systems

Type III CRISPR-Cas systems, like that of F. saccharivorans, are generally several thousand-fold less efficient in spacer acquisition than the prototypical Type I systems (like the E. coli Type I-E). This necessitates multiple rounds of elaborate size selection procedures followed by deep sequencing to identify new spacers. Likewise, PCR products from extended CRISPR arrays cannot be detected on DNA gels (agarose or PAGE) due to their vanishingly low abundance. Taken together, while the classic spacer readout is applicable to highly efficient spacer acquisition systems, it precludes deep characterization of most CRISPR-Cas systems, which motivated the development of SENECA.

Assessing the correlation between RNA-seq and Record-seq

The inventors set out to assess the direct correlation between RNA-seq and Record-seq (Fig. 12b, c). However, given the distinct nature of the two techniques, namely RNA-seq being a snapshot in time and Record-seq being a cumulative record, the inventors expected the current transcript abundance (RNA-seq) to always precede its integration within a CRISPR array (Record-seq), thus leading to a weak correlation at any specific point in time. To investigate this potential asynchrony, the inventors performed RNA-seq and Record-seq on the same population of E. coli in stationary growth phase and assessed the correlation between the two in the context of all genes, logarithmic-phase genes, stationary-phase genes (ref. 63), and plasmid-borne genes. While a weak correlation was observed between the two datasets when considering all genes (Pearson correlation = 0.61, R^2 = 0.37), a much stronger correlation was observed when considering only logarithmic-phase genes (Pearson correlation = 0.72, R^2 = 0.52). In contrast, the correlation was weakest when considering only stationary-phase genes (Pearson correlation = 0.49, R^2 = 0.24), in which case the inventors expect that the spacers corresponding to stationary-phase growth have not yet been integrated. Performing this correlation analysis using stationary-phase or logarithmic-phase genes on Record-seq datasets obtained after 12, 24 and 36 hours of growth indeed revealed that the spacer repertoire shifted towards stationary-phase genes, while the correlation to logarithmic-phase genes decreased during extended growth (Fig. 7f, g), indicating that spacer acquisition is still active in stationary phase. Furthermore, the plasmid-borne genes expressed under strong synthetic promoters, which are expected to be less affected by the growth phase, show the highest correlation (Pearson correlation = 0.84, R^2 = 0.70).
Taken together, the differences between RNA-seq and Record-seq highlight the respective features of transcript measurement by both methods, namely that RNA-seq represents a snapshot of the cellular transcriptome at the time of cell harvest, and Record-seq reveals the cumulative transcriptome sampled by FsRT-Cas1-Cas2 in a population of cells over time (Fig. 1b).
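The relationship between the Pearson coefficients and R^2 values reported above can be illustrated with a minimal Python sketch. The per-gene values in `rna_seq` and `record_seq` are invented placeholders, not data from the experiments, and the actual analysis pipeline is not specified here; only the four reported coefficients are taken from the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical per-gene abundances (e.g. log-transformed counts).
rna_seq = [1.0, 2.0, 3.0, 4.0, 5.0]
record_seq = [1.2, 1.9, 3.2, 3.8, 5.1]
r = pearson_r(rna_seq, record_seq)
r_squared = r ** 2   # for a simple linear fit, R^2 is the square of r

# Pearson coefficients reported in the text, squared for comparison.
reported_r = {
    "all genes": 0.61,
    "logarithmic-phase genes": 0.72,
    "stationary-phase genes": 0.49,
    "plasmid-borne genes": 0.84,
}
squared = {label: round(value ** 2, 2) for label, value in reported_r.items()}
print(squared)
```

Squaring the rounded coefficients gives 0.37, 0.52 and 0.24 for the first three gene sets, matching the text; for plasmid-borne genes it gives 0.71 rather than the reported 0.70, which would be consistent with the reported R^2 having been derived from an unrounded coefficient.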

Analysis of complex cellular behaviors with Record-seq

The inventors set out to answer the following questions: (i) are the transcriptional-scale records broadly different between the treated and untreated conditions; (ii) do the most variable genes in the dataset distinguish the two populations; (iii) do standard RNA sequencing analysis tools identify genes that were cumulatively differentially expressed; (iv) are the cumulatively differentially expressed genes informative in the context of the initial stimulus; and (v) can the inventors unbiasedly classify the cellular populations into treated and untreated conditions based on broad, variable, or signature responses.

Questions (i-iv) are addressed in the main text, but here the inventors will elaborate on question (v). Among the signature genes the inventors identified several that were expected to dominate the cellular responses for each stimulus. For example, the inventors identified dps (DNA protection during starvation protein), which codes for a hallmark DNA damage repair protein, among the oxidative stress signature genes. Additionally, dps has previously been shown to be the top differentially expressed gene in response to oxidative stress.

Furthermore, the inventors identified three members of the SUF system (i.e., the sufABCDSE operon), which primarily operates under oxidative stress conditions to aid in the formation of iron-sulfur (Fe-S) clusters. Likewise, the inventors identified hallmark members of the acid stress response, including asr (acid-shock protein precursor) as well as several chaperones (e.g., dnaK and ibpB) and heat-shock proteins (e.g., grpE and ibpA) among the acid stress signature genes (ref. 35).

CRISPR spacer acquisition from RNA versus DNA

The inventors present multiple lines of evidence showing CRISPR spacer acquisition from RNA, including spacer acquisition from an RNA-only td intron splice junction (Fig. 3a, b and Fig. 8a-b), spacer acquisition from an RNA virus (Fig. 3c-e and Fig. 10c-f), and RNA abundance-dependent spacer acquisition (Fig. 3f, g, Fig. 11a-e and Fig. 12b-d). While these observations strongly suggest that FsRT-Cas1-Cas2 is capable of acquiring spacers directly from RNA, they do not exclude the possibility that spacers are also being acquired from DNA. While the distinction between spacer acquisition from RNA versus DNA is fundamental to understanding the molecular mechanism of FsRT-Cas1-Cas2-mediated spacer acquisition, it does not confound Record-seq interpretation, whereby acquired spacers are preferentially derived from highly transcribed genes, correlate with gene expression at the genome-wide level, and highly correlate with RNA abundance (Fig. 12b, c).

Benefits of Record-seq

The benefits of Record-seq include (i) the ability to heterologously express orthologous RT-Cas1-containing CRISPR acquisition systems in order to capture and store RNA species within DNA in an abundance-dependent process; (ii) the capacity to efficiently and scalably read out molecular histories permanently stored in DNA and reconstruct transcriptome-scale events; (iii) the application of this technology for recording specific inputs, such as virus infection or any single or orthogonal set of inducible expression systems; and (iv) the potential applications of this system for creating ‘sentinel’ cells for medical or biotechnology applications. Even if specific external stimuli cannot be recorded directly, the transcriptome-scale molecular signatures recorded within a bacterial population may be sufficient to report meaningful physiological states.

Mice experiments

For oral gavage, E. coli (BL21 (DE3) or MG1655) cells were transformed with pFS_0453 (SEQ ID NO 334), streaked on LB-agar plates containing 50 µg/mL kanamycin and grown overnight (12 h) at 37 °C. The plasmid pFS_0453 encodes FsRT-Cas1-Cas2 under transcriptional control of an anhydrotetracycline-inducible promoter (pTetA) as well as the FsCRISPR array 2 followed by a FaqI restriction site for the SENECA readout.

The following evening, a single colony was picked into 3 mL LB medium containing 50 µg/mL kanamycin under sterile conditions and grown overnight at 37 °C in a bacterial shaker (200-300 rpm). This culture was used to prepare a glycerol stock by mixing 500 µL of bacterial culture with 500 µL of sterile 50% (w/v) glycerol for long-term storage at -80 °C. For in vivo recording experiments, an overnight liquid culture was inoculated either directly from this glycerol stock or from single bacterial colonies obtained by streaking bacteria on an LB-agar plate containing 50 µg/mL kanamycin.

Gnotobiotic C57BL/6 mice were orally gavaged with 1 x 10^9 colony forming units (CFU) of E. coli BL21 (DE3) or MG1655 cells transformed with pFS_0453 in 500 µL PBS. Persistence of the plasmids was ensured by adding 100 µg/mL kanamycin sulfate (Sigma Aldrich) to the drinking water. Expression of FsRT-Cas1-Cas2 was induced by the addition of 10-30 µg/mL anhydrotetracycline (Cayman Chemical) to the drinking water.

For the DSS experiment, kanamycin (100 µg/mL) and anhydrotetracycline (30 µg/mL) were added to the drinking water of the germ-free C57BL/6 mice 24 hours prior to gavage.

Animals were maintained under germ-free conditions. A colony of E. coli BL21 (DE3) transformed with pFS_0453 was grown overnight in LB medium containing 50 µg/mL kanamycin. The resulting culture was pelleted and resuspended in 1 x PBS. This bacterial resuspension was used to orally gavage each animal with 1 x 10^9 colony forming units (CFU) of E. coli. Animals were maintained on water containing both kanamycin and anhydrotetracycline throughout the entire experiment. Fecal pellets were collected for 18 days starting 24 hours after the gavage. From day 5 to day 9 of the experiment, dextran sulfate sodium (DSS) (MPBio) was added at 1%, 2% or 3% (w/v) to the animals’ drinking water while maintaining kanamycin and anhydrotetracycline as described above. Animals were treated in groups of 3, and negative control animals received no DSS via the water.

The experiment was terminated on day 19, when colonic and cecal contents were also harvested for plasmid DNA extraction.

Plasmid DNA was extracted using the QIAprep Spin Miniprep Kit according to the manufacturer’s instructions; the volumes of buffers P1, P2 and N3 were increased to 500, 500 and 700 µL, respectively, to adjust for the increased biomass. Plasmid DNA was eluted in 150 µL of buffer EB and subsequently concentrated by precipitation. To this end, 15 µL of 3 M sodium acetate solution pH 5.2 (Sigma-Aldrich) and 105 µL isopropanol were added to each sample. Samples were incubated at -20 °C for at least 20 min. Following centrifugation to precipitate nucleic acids (20,000 x g, 30 min, 4 °C), the supernatant was removed and the DNA pellet was washed with 150 µL of 70% (v/v) ethanol by centrifugation (20,000 x g, 15 min, 4 °C). Ethanol was aspirated and DNA pellets were briefly dried at 55 °C, upon which the DNA pellet was resuspended in 15 µL of buffer EB. From this eluate, 7.5 µL were used for SENECA adapter ligation, with all subsequent steps of the SENECA protocol performed as described previously.

For the diet experiment comparing chow and starch diets, all animals were maintained on a chow-based diet (3307, Kliba Nafag) prior to the experiment. On day 1 of the experiment, 5 animals were continuously maintained on the chow-based diet, while a second group of 5 animals was switched to a starch-based diet (D12450Ji, Research Diets Inc.). On day 2 of the experiment, anhydrotetracycline and kanamycin sulfate were added to the drinking water (30 µg/mL and 100 µg/mL, respectively). On day 3 of the experiment, all animals were orally gavaged with 1 x 10^9 colony forming units (CFU) of E. coli BL21 (DE3) transformed with pFS_0453 as described above. Fecal pellets were collected from day 4 to day 9 of the experiment for the extraction of plasmid DNA as described above. Furthermore, on day 10 the animals were dissected to obtain cecal and colonic contents for plasmid DNA extraction as described above.

For the diet experiment comparing chow, starch and fat diets, all animals were maintained on a chow-based diet (3307, Kliba Nafag) prior to the experiment. On day 1 of the experiment, animals were put on either a chow-based diet (3307, Kliba Nafag), a starch-based diet (D12450Ji, Research Diets Inc.) or a fat-based diet (fat-enriched diet D12492i, Research Diets Inc.). On day 2 of the experiment, anhydrotetracycline and kanamycin sulfate were added to the drinking water (30 µg/mL and 100 µg/mL, respectively). On day 3 of the experiment, all animals were orally gavaged with 1 x 10^9 colony forming units (CFU) of E. coli MG1655 transformed with pFS_0453 as described above. Fecal pellets were collected from day 4 to day 10 of the experiment for the extraction of plasmid DNA as described above.

Furthermore, on day 10 the animals were dissected to obtain cecal and colonic contents for plasmid DNA extraction as described above.

Table 1: RT-Cas1 orthologs

Host strains and protein accession numbers of RT-Cas1 orthologs identified by HMMER-based protein sequence homology search

Host and protein accession number

Bacteroides salyersiae 494745665 ref WP_007481073.1

Leptolyngbya sp. PCC 7375 493562087 ref WP_006515493.1

Photobacterium aphoticum 837770314 ref WP_047875592.1

Millisia brevis 1055178592 ref WP_066909103.1

Calothrix parietina 505008919 ref WP_015196021.1

Bacteroides fragilis str. 3397 T10 595923015 gb EXY33263.1

Pelodictyon phaeoclathratiforme 501500885 ref WP_012509117.1

Arthrospira platensis 493670156 ref WP_006620498.1

Calothrix sp. PCC 7507 504941836 ref WP_015128938.1

Leptolyngbya sp. PCC 6406 495588276 ref WP_008312855.1

Lachnoanaerobaculum saburreum 987863574 ref WP_060932241.1

Candidatus Brocadia fulgida 816979878 gb KKO19838.1

Leptolyngbya sp. O-77 984539873 dbj BAU44853.1

Tistrella mobilis KA081020-065 388530577 gb AFK55773.1

Smithella sp. SC K08D1774562625H gb KIE18281.1

Lachnospiraceae bacterium oral taxon 082 497051594 ref WP_009447486.1

Psychrobacter lutiphocae 518502663 ref WP_019672870.1

Propionicicella superfundia 916602138 ref WP_051209229.1

Loktanella vestfoldensis 518800937 ref WP_019956891.1

Desulfovibrio hydrothermalis 505147525 ref WP_015334627.1

Oceanospirillum beijerinckii 654849652 ref WP_028302067.1

Fischerella muscicola 737152142 ref WP_035139015.1

Desulfobacca acetoxidans 503473041 ref WP_013707702.1

Hippea sp. KM1 643957755 ref WP_025270209.1

Chlorobium limicola 501442438 ref WP_012465887.1

Desulfarculus baarsii 503023536 ref WP_013258512.1

Thiocapsa sp. KS1 971091367 emb CRI67871.1

Candidatus Accumulibacter sp. SK-02 668684200 gb KFB76584.1

Candidatus Magnetoglobus multicellularis str. Araruama 571788307 gb ETR69258.1

Vibrio sinaloensis 740352375 ref WP_038188758.1

Campylobacter concisus 544653868 ref WP_021087740.1

Cellulomonas bogoriensis 917498396 ref WP_052104813.1

Teredinibacter turnerae 518435809 ref WP_019606016.1

Campylobacter fetus subsp. fetus 998762051 emb CZE46369.1

Gemmatimonadetes bacterium SCN 70-22 1063993205 gb ODT03821.1

Microcoleus sp. PCC 7113 504999115 ref WP_015186217.1

Micromonospora rosaria 1000329745 gb KXK58998.1

Candidatus Entotheonella sp. TSY2 575418691 gb ETX03376.1

Lachnoanaerobaculum sp. MSX33 570843978 gb ETO97675.1

Corynebacterium durum 492955761 ref WP_006063846.1

Anabaena cylindrica PCC 7122 428682296 gb AFZ61061.1

Pseudanabaena biceps 497311431 ref WP_009625648.1

Vibrio sp. MEBiC08052 972247703 gb KUI97421.1

Actinomyces johnsonii 545331217 ref WP_021604855.1

Microlunatus phosphovorus 503627960 ref WP_013862036.1

Kamptonema 494597365 ref WP_007355619.1

Skermania piniformis 1054700955 ref WP_066466672.1

Fischerella sp. NIES-3754 965689238 dbj BAU08380.1

Chlorobium phaeobacteroides 500067943 ref WP_011745868.1

Vibrio vulnificus 499466110 ref WP_011152750.1

Bacteroides fragilis 547947118 ref WP_022348096.1

Porphyromonas sp. COT-052 OH4946746iM965 ref WP_039428138.1

Kutzneria sp. 744 918333650 ref WP_052396493.1

Porphyromonas crevioricanis 565855908 ref WP_023938229.1

Rubrivivax benzoatilyticus 497541412 ref WP_009855610.1

Streptomyces sp. F-3 1026350507 dbj GAT81929.1

Campylobacter gracilis 492518353 ref WP_005873073.1

Fusicatenibacter saccharivorans 941895202 ref WP_055226073.1

uncultured Thiohalocapsa sp. PB-PSB1 557040601 gb ESQ17084.1

Porphyromonas gingivalis 492529527 ref WP_005874916.1

uncultured Thiohalocapsa sp. PB-PSB1 557029821 gb ESQ08042.1

Azospirillum lipoferum 503954719 ref WP_014188713.1

Teredinibacter sp. 991H.S.0a.06 797071444 ref WP_045826479.1

Tolypothrix campylonemoides 751570959 ref WP_041039832.1

Pseudoalteromonas rubra 800981085 ref WP_046007427.1

Rhodovulum sulfidophilum 985596740 ref WP_060836241.1

Teredinibacter turnerae 516642225 ref WP_018013804.1

Arcobacter thereius 1054172508 ref WP_066177132.1

Nocardiopsis baichengensis 516128787 ref WP_017559367.1

Arthrospira maxima 493720432 ref WP_006669920.1

Eubacteriaceae bacterium CHKCI004 1016807618 emb CVI70780.1

Frankia sp. BMG5.1 919937513 ref WP_052914180.1

Roseburia inulinivorans 937570588 emb CRL43259.1

Porphyromonas gingivalis 503581191 ref WP_013815267.1

Campylobacter fetus subsp. fetus 998759376 emb CZE50714.1

Microcystis aeruginosa 640538680 ref WP_024971209.1

Marinomonas mediterranea 503425197 ref WP_013659858.1

Candidatus Magnetomorum sp. HK-1 927673953 gb KPA10619.1

Campylobacter fetus subsp. fetus 998758141 emb CZE46264.1

Synechococcus sp. NKBG042902 780027826 ref WP_045442561.1

Chlorobaculum limnaeum 1071376969 ref WP_069809202.1

Nostoc sp. PCC 7107 764929206 ref WP_044499977.1

Arthrospira platensis 504041557 ref WP_014275551.1

Woodsholea maritima 518804695 ref WP_019960649.1

Actinomyces cardiffensis F0333 478776992 gb ENO18597.1

Mastigocladus laminosus 764662524 ref WP_044448019.1

Clostridium 916986069 ref WP_051592781.1

Rhodococcus sp. YH3-3 1033138899 ref WP_064444911.1

Rhodobacter capsulatus 940623611 gb KQB14189.1

Lachnoanaerobaculum saburreum 496026892 ref WP_008751399.1

Vibrio metoecus 941008961 ref WP_055043549.1

Porphyromonas gingivicanis 739003123 ref WP_036885018.1

Smithella sp. 7177 683425608 gb KFZ44108.1

Candidatus Accumulibacter sp. BA-91 668677118 gb KFB71594.1

Nodosilinea nodulosa 515871661 ref WP_017302244.1

Phormidesmis priestleyi Ana 938299454 gb KPQ33062.1

Vibrio mexicanus 823288127 ref WP_047044098.1

Photobacterium marinum 494733933 ref WP_007469744.1

Candidatus Brocadia fulgida 816977369 gb KKO17867.1

Desulfovibrio bastinii 652926624 ref WP_027180402.1

Candidatus Magnetoovum chiemensis 778249022 gb KJR40057.1

Azospirillum lipoferum 502738680 ref WP_012973664.1

Cyanothece sp. PCC 7822 503100147 ref WP_013334941.1

Clostridiales bacterium VE202-01 639695530 ref WP_024721321.1

Actinomycetaceae bacterium BA112 1032601389 ref WP_064231067.1

Bacteroides 495935708 ref WP_008660287.1

Candidatus Jettenia caeni 494421634 ref WP_007220853.1

Rhodobacter capsulatus SB 1003 294475643 gb ADE85031.1

Oscillatoriales cyanobacterium USR001 1049312742 gb OCQ91006.1

Nostoc sp. PCC 7120 499304863 ref WP_010995638.1

Vibrio metoecus 941038135 ref WP_055051199.1

Scytonema hofmanni UTEX B 657929289 ref WP_029630506.1

Arthrospira sp. PCC 8005 495324841 ref WP_008049584.1

Phormidium willei 1057444347 ref WP_068790073.1

Vibrio rotiferianus 742405863 ref WP_038884984.1

Thermodesulfovibrio sp. N1 1057568519 ref WP_068860870.1

Bacteroides fragilis 492341859 ref WP_005815836.1

Rhodovulum sp. PH10 750340320 ref WP_040622239.1

Porphyromonas gulae 807048030 ref WP_046200570.1

Arthrospira sp. TJSD091 809071417 ref WP_046320545.1

Streptomyces sp. AVP053U2 1057451804 gb ODA69832.1

Table 2: First round PCR primers for classic acquisition readout

Primer binding sites for first round PCR primers used to amplify CRISPR arrays for deep sequencing, related to the classical acquisition readout in Fig. 6. For each array, the forward primer binding site is shown on the top line and the reverse primer binding site on the bottom line. The design of the primers, including adapter sequences for the first round PCR, is described in detail in Primer Design Note 1 in the methods section of this paper.

Array Sequence (5'→3') (SEQ ID NO)

Bacteroides fragilis strain S14
TCAACACTTCATCTATCTAACTGAATAA (105)
TGTTATGAACGGCTACGCCT (106)

Campylobacter fetus subsp. fetus
CGCTCGAATTCAGCTCTCACAG (107)
AATTGCCAAATTCTGTTTCAATCC (108)

Cellulomonas bogoriensis 69B4
GTCAGCCCGGGGTCAAAAC (109)
GGAACTTTAAACCCTTTACATCCCC (110)

Fusicatenibacter saccharivorans array 1
TCAGAAAAACGATCGACCGAC (111)
AGAAGAAGCAATCGAAAAAGCG (112)

Fusicatenibacter saccharivorans array 2
AGAATCTGAAAACAGCGGAA (113)
ACGCTAGGGAATATGCAGCAA (114)

Candidatus Accumulibacter sp. SK-02
CCGAAAAGAGCCGTTAAATTCC (115)
CCTCAAAACGGTACCAAAGAAGC (116)

Micromonospora rosaria array 1
CACAGCACCTCTTCGCCACG (117)
CGATTCCGGTCCTCGGTTTC (118)

Micromonospora rosaria array 2
CTCAAGACCCACCGTTTTCG (119)
TTCAACAACGACGCCAACTATG (120)

Candidatus Accumulibacter sp. BA-91
GCAAGTCTCCGGCAAGTCAG (121)
TCACTTGAAGATTATATAGTGACTCTTTTCG (122)

Desulfarculus baarsii DSM 2075
TGGCAAACCATGTGGAAACAG (123)
AAAATGGCAACGCCGGG (124)

Woodsholea maritima
TGGAGCTGAATGTCACATCTTG (125)
GGAATCTCAAGCAGCGGAGAA (126)

Azospirillum lipoferum 4B array 1
CACAGGATGCGTGGAAAGG (127)
CTCAACGAACCGAAGCTGC (128)

Azospirillum lipoferum 4B array 2
CCGTTGGGAATTTTCCCGTT (129)
GACTCTTTTTCCCGGAGCCC (130)

Teredinibacter turnerae T8412
CCCAAACGGGGTTCTAGCAT (131)
GCGACAAAAGCATATTAAGGAGACT (132)

Tolypothrix campylonemoides
GCGCTGTAGAATTATTTCAGGGT (133)
ATGGGATGGAGGTTCGGGT (134)

Oscillatoriales cyanobacterium
GAGCTTGGGGCAAGGCTC (135)
GTCGAGAAGTAGCAGTTCACTTTCT (136)

Eubacterium saburreum DSM 3986 array 1
ACCTATCACAACGGCTTAAATG (137)
ATCACTGCTATGCAGCTTATTCG (138)

Eubacterium saburreum DSM 3986 array 2
AAAGCGAGGGCTTTCCCATA (139)
CTCATCAGAATGTGACGGTCG (140)

Table 3: Indices for deep sequencing

(N)8 barcodes corresponding to Illumina TruSeq HT indices used in this study

BC1 Sequence (5'→3') BC2 Sequence (5'→3')

AAGTAGAG CATGATCG

CATGCTTA AGGATCTA

GCACATCT GACAGTAA

TGCTCGAC CCTATGCC

AGCAATTC TCGCCTTG

AGTTGCTT ATAGCGTC

CCAGTTAG GAAGAAGT

TTGAGCCT ATTCTAGG

ACACGATC CGTTACCA

GGTCCAGA GTCTGATG

GTATAACA TTACGCAC

TTCGCTGA TTGAATAG

AACTTGAC TCCTTGGT

CACATCCT ACAGGTAT

TCGGAATG AGGTAAGG

AACGCATT AACAATGG

CGCGCGGT ACTGTATC

TCTGGCGA AGGTCGCA

CATAGCGA AGGTTATC

CAGGAGCC CAACTCTC

TGTCGGAT CCAACATT

ATTATGTT CTAACTCG

CCTACCAT ATTCCTCT

TACTTAGC CTACCAGG

Table 4: SENECA adapter oligos

Reverse oligos for adapter ligation during the SENECA procedure, sorted by their respective CRISPR array. Related to Fig. 7 and 8. Upon annealing with the universal reverse oligo FS_0963, the array-specific forward oligo (table below) creates a 4 bp overhang compatible with the plasmid overhang generated during the FaqI digest in SENECA.

Array Sequence (5'→3') (SEQ ID NO)

Bacteroides fragilis strain S14 Array 1 ATAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (141)

Bacteroides fragilis strain S14 Array 1 RC GAATGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (142)

Campylobacter fetus subsp. fetus Array 1 TAGGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (143)

Campylobacter fetus subsp. fetus Array 1 RC GAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (144)

Cellulomonas bogoriensis 69B4 Array 1 GAGGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (145)

Cellulomonas bogoriensis 69B4 Array 1 RC GCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (146)

Fusicatenibacter saccharivorans Array 1 TGAGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (147)

Fusicatenibacter saccharivorans Array 1 RC AGGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (148)

Fusicatenibacter saccharivorans Array 2 AAAGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (149)

Fusicatenibacter saccharivorans Array 2 RC AGGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (150)

Candidatus Accumulibacter sp. SK-02 Array 1 AAAGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (151)

Candidatus Accumulibacter sp. SK-02 Array 1 RC GGCTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (152)

Micromonospora rosaria Array 1 GCGGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (153)

Micromonospora rosaria Array 1 RC CTGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (154)

Micromonospora rosaria Array 2 GCGGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (155)

Micromonospora rosaria Array 2 RC CTGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (156)

Micromonospora rosaria Array 3 GGGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (157)

Candidatus Accumulibacter sp. BA-91 Array 1 AACAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (158)

Desulfarculus baarsii DSM 2075 Array 1 AAGCGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (159)

Desulfarculus baarsii DSM 2075 Array 1 RC GCATGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (160)

Desulfarculus baarsii DSM 2075 Array 2 AAGCGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (161)

Desulfarculus baarsii DSM 2075 Array 2 RC GCATGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (162)

Woodsholea maritima Array 1 GAGCGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (163)

Woodsholea maritima Array 1 RC GATTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (164)

Woodsholea maritima Array 2 GAGCGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (165)

Woodsholea maritima Array 2 RC GATGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (166)

Azospirillum lipoferum 4B Array 1 GAGCGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (167)

Azospirillum lipoferum 4B Array 1 RC GACAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (168)

Azospirillum lipoferum 4B Array 2 TAAGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (169)

Azospirillum lipoferum 4B Array 2 RC ATGTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (170)

Teredinibacter turnerae T8412 Array 1 GAATGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (171)

Teredinibacter turnerae T8412 Array 1 RC GAAGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (172)

Tolypothrix campylonemoides Array 1 GAATGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (173)

Tolypothrix campylonemoides Array 1 RC GAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (174)

Tolypothrix campylonemoides Array 2 GAATGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (175)

Tolypothrix campylonemoides Array 2 RC GAAGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (176)

Tolypothrix campylonemoides Array 3 AAATGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (177)

Tolypothrix campylonemoides Array 3 RC GAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (178)

Oscillatoriales cyanobacterium Array 1 AATTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (179)

Oscillatoriales cyanobacterium Array 1 RC TAAGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (180)

Oscillatoriales cyanobacterium Array 2 GATTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (181)

Oscillatoriales cyanobacterium Array 2 RC CCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (182)

Rivularia sp. PCC 7116 Array 1 GATTGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (183)

Rivularia sp. PCC 7116 Array 1 RC CCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (184)

Rivularia sp. PCC 7116 Array 2 TAAGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (185)

Rivularia sp. PCC 7116 Array 2 RC GGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (186)

Eubacterium saburreum DSM 3986 Array 1 TAAGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (187)

Eubacterium saburreum DSM 3986 Array 1 RC GGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (188)

Eubacterium saburreum DSM 3986 Array 2 ATAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (189)

Eubacterium saburreum DSM 3986 Array 2 RC GAATGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (190)
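All oligos in Table 4 share the same 33-nt sequence after their first four nucleotides. The Python sketch below makes that structure explicit by assembling an oligo from a 4-nt array-specific prefix and the shared sequence. Reading the first four nucleotides as the overhang-forming part is an interpretation of the table and its caption, not something the text states explicitly, and the function name is hypothetical.

```python
# Sketch: Table 4 oligos as a 4-nt array-specific prefix plus a shared sequence.
# The prefix/suffix split is inferred from the table (assumption).
COMMON_ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"  # shared by SEQ ID NO 141-190


def adapter_oligo(prefix: str) -> str:
    """Return the full oligo for a given 4-nt array-specific prefix."""
    prefix = prefix.upper()
    if len(prefix) != 4 or set(prefix) - set("ACGT"):
        raise ValueError("prefix must be exactly four nucleotides (A, C, G, T)")
    return prefix + COMMON_ADAPTER


# Bacteroides fragilis strain S14 Array 1 (SEQ ID NO 141)
seq_141 = adapter_oligo("ATAA")
# Fusicatenibacter saccharivorans Array 2 (SEQ ID NO 149)
seq_149 = adapter_oligo("AAAG")
print(seq_141)
print(seq_149)
```

Because several arrays share the same prefix (for example SEQ ID NO 149 and 151), the same oligo can serve more than one array.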

Table 5: First round PCR primers for SENECA acquisition readout

Primer binding sites for DR-specific SENECA forward amplification primers, sorted by their respective CRISPR arrays. Related to Fig. 8. During SENECA PCR, the forward primer was chosen to correspond to the respective CRISPR array, while FS_0911 serves as a universal reverse primer binding the Illumina adapter. Details on primer design are described in Primer Design Notes 1 and 2. For the CRISPR array directionality screen, staggering was conducted by ordering only two forward primers with different stagger lengths (NN and NNN) instead of the usual 7 forward primers described for Fusicatenibacter saccharivorans array 2.

Array Sequence (5' → 3') (SEQ ID NO)

Bacteroides fragilis strain S14 Array 1 CAGTATAATAAGGATTAAGAC (191)

Bacteroides fragilis strain S14 Array 1 RC ACTGGAATACATCTACAT (192)

Campylobacter fetus subsp. Fetus Array 1 ATTAGGGGATTGAAAC (193)

Campylobacter fetus subsp. Fetus Array 1 RC GGAGAAAGTGTCTAAAC (194)

Cellulomonas bogoriensis 69B4 Array 1 GAGGGCATTGAAAC (195)

Cellulomonas bogoriensis 69B4 Array 1 RC GCCATGGGTGGAAC (196)

Fusicatenibacter saccharivorans Array 1 CCTATGAGGAATTGAAAC (197)

Fusicatenibacter saccharivorans Array 1 RC CATAGGTAAGGTACAAC (198)

Fusicatenibacter saccharivorans Array 2 CCTAAAAGGAATTGAAAC (199)

Fusicatenibacter saccharivorans Array 2 RC TTTAGGTAAAGTACGAC (200)

Candidatus Accumulibacter sp. SK-02 Array 1 GATAAAGGGATTGAGAC (201)

Candidatus Accumulibacter sp. SK-02 Array 1 RC GGGCTTAGTTTTCAC (202)

Micromonospora rosaria Array 1 GCGGGCATAGAAAC (203)

Micromonospora rosaria Array 1 RC CTGTGGATGGCGAT (204)

Micromonospora rosaria Array 2 GCGGGCATAGAAAC (205)

Micromonospora rosaria Array 2 RC CTGTGGATGGCAAT (206)

Micromonospora rosaria Array 3 GGTGATGAGCGAC (207)

Candidatus Accumulibacter sp. BA-91 Array 1 GAACAGGCTTGAAAC (208)

Desulfarculus baarsii DSM 2075 Array 1 GAAGCGGATTGAAAC (209)

Desulfarculus baarsii DSM 2075 Array 1 RC GGCATCCCTCAATAG (210)

Desulfarculus baarsii DSM 2075 Array 2 GAAGCGGATTGAAAC (211)

Desulfarculus baarsii DSM 2075 Array 2 RC GGCATCCCTCAATAG (212)

Woodsholea maritima Array 1 CAGAGCTGATCAAAAC (213)

Woodsholea maritima Array 1 RC GATTCGAGCAGAGC (214)

Woodsholea maritima Array 2 GGAGCGGATTGAAAC (215)

Woodsholea maritima Array 2 RC GATGCCGTCGCGAC (216)

Azospirillum lipoferum 4B Array 1 GGAGCGGATTGAAAC (217)

Azospirillum lipoferum 4B Array 1 RC GACACCGGCGGAAC (218)

Azospirillum lipoferum 4B Array 2 GCTAAGGCTGTGAAAC (219)

Azospirillum lipoferum 4B Array 2 RC CTAATGTCGATTGCGAC (220)

Teredinibacter turnerae T8412 Array 1 AAGTTGAATTAATGGAAAC (221)

Teredinibacter turnerae T8412 Array 1 RC TTCCGAAGAAGTTTAAAG (222)

Tolypothrix campylonemoides Array 1 AAGTTGAATTAATGGAAAC (223)

Tolypothrix campylonemoides Array 1 RC GGGAGAAGTTTAACAG (224)

Tolypothrix campylonemoides Array 2 AAGTTGAATTAATGGAAAC (225)

Tolypothrix campylonemoides Array 2 RC TTCCGAAGAAGTTTAAAG (226)

Tolypothrix campylonemoides Array 3 AGTCAAATTAATGGAAAC (227)

Tolypothrix campylonemoides Array 3 RC CAGAGAAGTCGAGAAG (228)

Oscillatoriales cyanobacterium Array 1 GTCAAATTAATGGAAACA (229)

Oscillatoriales cyanobacterium Array 1 RC CCTAAGAAGTCGAAAG (230)

Oscillatoriales cyanobacterium Array 2 CGGATTAGTTGGAAAC (231)

Oscillatoriales cyanobacterium Array 2 RC CCCAATCGGTGGGG (232)

Rivularia sp. PCC 7116 Array 1 CGGATTAGTTGGAAAC (233)

Rivularia sp. PCC 7116 Array 1 RC CCCAATCGGTGGGG (234)

Rivularia sp. PCC 7116 Array 2 CCTATAAGGAATGGAAAC (235)

Rivularia sp. PCC 7116 Array 2 RC TTATAGGTAAGGTACTTAC (236)

Eubacterium saburreum DSM 3986 Array 1 CCTATAAGGAATGGAAAC (237)

Eubacterium saburreum DSM 3986 Array 1 RC TTATAGGTAAGGTACTTAC (238)

Eubacterium saburreum DSM 3986 Array 2 CAGTATAATAAGGATTAAGAC (239)

Eubacterium saburreum DSM 3986 Array 2 RC ACTGGAATACATCTACAT (240)
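
The Table 5 legend states that FS_0911 acts as a universal reverse primer binding the Illumina adapter. The following minimal sketch illustrates what that means at the sequence level, using the sequences listed for FS_0911, FS_0963 and FS_0964 in Table 6. It assumes, for illustration only, that FS_0964 carries the adapter sequence in the orientation that FS_0911 anneals to; the helper names `reverse_complement` and `anneals_to` are hypothetical, and only exact matches are considered.

```python
# Sequences copied from Table 6 of this document (5' -> 3').
FS_0911 = "GTGACTGGAGTTCAGACG"                      # universal reverse primer
FS_0963 = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC"
FS_0964 = "AAAGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"  # assumed adapter-bearing oligo

# Translation table for the four bases plus the degenerate base N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def anneals_to(primer: str, template: str) -> bool:
    """Hypothetical helper: True if the primer is perfectly complementary
    to some stretch of the template strand (exact match, no mismatches)."""
    return reverse_complement(primer) in template.upper()


if __name__ == "__main__":
    print(reverse_complement(FS_0911))
    print(anneals_to(FS_0911, FS_0964))
    # FS_0963 is the reverse complement of FS_0964 without its first four bases.
    print(reverse_complement(FS_0963) == FS_0964[4:])
```

Under these assumptions, the reverse complement of FS_0911 is found at the 3' end of FS_0964, which is consistent with a single reverse primer serving all arrays.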

Table 6: Miscellaneous Primers

Primers and oligonucleotides used for cloning purposes.

Primer ID Sequence (5' → 3') (SEQ ID NO)

FS_0151 ATGCTTCATGTCACCAGGTAGTCTTCCATCGACTTCAAAACTCGATCCAACATCCTGAAGACGCGGCCGCTATTCTTTTGATTTATAAGGGATTTTG (241)

FS_0152 CAACAACATGAATGATCTTCGGTTTCCGTGTTTCG (242)

FS_0153 CACGGAAACCGAAGATCATTCATGTTGTTGCTCAGGTC (243)

FS_0154 CGCCGCACTTATGACTATCTTCTTTATCATGCAACTCG (244)

FS_0155 GATAAAGAAGATAGTCATAAGTGCGGCGACG (245)

FS_0156 GATACCGAAGATAGCTCATGTTATATCCCGCCG (246)

FS_0157 GATATAACATGAGCTATCTTCGGTATCGTCGTATCC (247)

FS_0158 CTCCCATGAAGATGGTACGCGACTGGGC (248)

FS_0159 GTCGCGTACCATCTTCATGGGAGAAAATAATACTGTTG (249)

FS_0160 GAAGACTACCTGGTGACATGAAGCATCTCGAGGGTCTTCCTTGCCGGTGGTGCAGATGTTGAACAGAAGACCACATATGTATATCTCCTTCTTAAAGTTAAACAAAATTATTTC (250)

FSJ 80 TCGAGATCCGGCTGCTAACAAAGCCCGAAAGGAAGCTGAGTTGGCTGCTGCCACCGCTGAGCAATAACTAGCATAACCCCTTGGGGCCTCTAAACGGGTCTTGAGGGGTTTTTTGCTGAAAGGAGGAACTATATCCGGATA (251)

FSJ 81 CCTGGTATCCGGATATAGTTCCTCCTTTCAGCAAAAAACCCCTCAAGACCCGTTTAGAGGCCCCAAGGGGTTATGCTAGTTATTGCTCAGCGGTGGCAGCAGCCAACTCAGCTTCCTTTCGGGCTTTGTTAGCAGCCGGATC (252)

FS_0658 GCTCAGCATATGGACATCCTGATCAGAAACAAGAAG (253)

FS_0659 GCTCAGCATATGCAGTACTCCAACTGGCACGACTC (254)

FS_0660 GCTCAGCATATGTTCATCAACGGTCGTTACCACATC (255)

FS_0662 CCTACTCGCTTCTGGTGAATGTC (256)

FS_0871 CCGGATACCAGGTGAGAATTAAATTG (257)

FS_0904 GTTTAGCGGCCGCGGGACGTTTCAATTCCTCATAGGTAAGGTACAACATCAGCATTTCCGCTATTTTCAC (258)

FS_0911 GTGACTGGAGTTCAGACG (259)

FS_0963 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC (260)

FS_0964 AAAGGATCGGAAGAGCACACGTCTGAACTCCAGTCAC (261)

FS_0995 GATATACATATGTTCACTATAGACGAGATG (262)

FS_0996 ATATAGCTGCGGCGTATCTGATC (263)

FS_0997 AGATACGCCGCAGCTATATACATCTATATGGACAGCTACGAGAAG (264)

FS_0998 GTCGGATGTCTCTAAGATCTGG (265)

FS_1001 GCGAAATTAATACGACTCACTATAGG (266)

FS_1002 TACTCGCTTCTGGTGAATGTC (267)

FS_1003 GAGCTTTAGCCGCTAAGAGCATCATG (268)

FS_1004 CATGATGCTCTTAGCGGCTAAAGCTC (269)

FS_1005 GTTGCTGGCGGCAACAACCCC (270)

FS_1006 GGGGTTGTTGCCGCCAGCAAC (271)

FS_1007 GATGTCAGCAAAAGCCAGGTTAAGG (272)

FS_1008 CCTTAACCTGGCTTTTGCTGACATC (273)

FS_1009 GCTTGAAGATGGCAGCAAAATCC (274)

FS_1010 GGATTTTGCTGCCATCTTCAAGC (275)

FS_1011 CTATGACTATAGGCGCGAAGATGTCAGC (276)

FS_1012 GCTGACATCTTCGCGCCTATAGTCATAG (277)

FS_1054 ACGCATGTCCGGTAAAATGA (278)

FS_1055 CAAGTCATTTTACCGGACAT (279)

FS_1056 GCTCAGGAAGACTTTGCTTAAAATGGTTCAACGCTGACAAAG (280)

FS_1057 GTTTAGAAGACTTGATCTTACAGGCTGGTTACGTTACCAG (281)

FS_1038 ACGCATGAGTCAGAATACGCTGAAAGTT (282)

FS_1039 CAAGAACTTTCAGCGTATTCTGACTCAT (283)

FS_1040 GCTCAGGAAGACTTTGCTAATGAAGATGCGGAATTTGATG (284)

FS_1041 GTTTAGAAGACTTGATCTTACTCGCGGAACAGCGC (285)

FS_1046 ACGCATGCGAAGCTCGGCTAAGCAAGAAGAACTA (286)

FS_1047 CAAGTAGTTCTTCTTGCTTAGCCGAGCTTCGCAT (287)

FS_1048 GTTTAGAAGACTTTGCTTTTAAAGCATTACTTAAAGAAGAGAAATTTAGC (288)

FS_1049 GTTTAGAAGACTTGATCTTAAAGCTCCTGGTCGAACAG (289)

FS_1123 GCTCAGGAAGACTACCGGTGGCACGTAAGAGGTTCCAAC (290)

FS_1125 GTTTAGGATCCGATCGCGTCTTCTGATCGTTGGAATCGCCATGGGAAGTCGAATGGAAGACTACTCTAGTAGTGCTCAGTATCTCTATC (291)

FS_1134 GCTCAGGAAGACTTAGAGAAGCTTGCGGAGGAGCATGCATGAGCAAAGGAGAAGAACTTTTC (292)

FS_1135 GTTTAGAAGACTTGATCCTATCATTTGTAGAGTTCATCCATGCC (293)

FS_1136 GCTCAGGAAGACTTAGAGAAGCTTGCGGAGGAGCATGCATGGCTTCCAAGGTGTACG (294)

FS_1137 GTTTAGAAGACTTGATCTCATTACTGCTCGTTCTTCAGCAC (295)

FS_1154 ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNAGCTCGGCTAAGCAAGAAGA (296)

FS_1155 ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNAGCTCGGCTAAGCAAGAAGA (297)

FS_1156 ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNAGCTCGGCTAAGCAAGAAGA (298)

FS_1157 ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNAGCTCGGCTAAGCAAGAAGA (299)

FS_1158 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCNNGGTCAACATCCGCGAGACTT (300)

FS_1159 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCNNNGGTCAACATCCGCGAGACTT (301)

FS_1160 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCNNNNGGTCAACATCCGCGAGACTT (302)

FS_1161 GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCNNNNNGGTCAACATCCGCGAGACTT (303)

FS_1406 GCTGAAAGGAGGAACTATATCCG (304)

FS_1407 CAAAATCCCTTATAAATCAAAAGAATAGC (305)

FS_1584 CGCCGCAAGGAATGGTGCATGCAACTAGTATACAGTGACTCTTGGCGCGCCTTGACGGCTAGCTCAGTCCTAGGTACAGTGCTAGCTACTAGAGAAAGAGGAGAAATACTAGATGAAAAAC (306)

FS_1585 CGATCCTACAGGTGAATTCATGCCTTTAATTATAAACGCAGAAAG (307)

FS_1586 GGCATGAATTCACCTGTAGGATCGTACAGGTTTACGCAAGAAAATGGTTTGTTATAGTCGAATAAATACTGAGTCTTCACCACGACGATTTCCGGCAGTTTCTCCACAGAAGACAACGATTAAAGGCATCAAATAAAACGAAAG (308)

FS_1587 GAAAGTTGGAACCTCTTACGTGCCAGTCGACCCCAGCTGTCTAGGGCG (309)

FS_1588 TCGACCATTCGACTTCCCACGATTCCAACGATCAGG (310)

FS_1589 GATCCCTGATCGTTGGAATCGTGGGAAGTCGAATGG (311)

FS_1618 GCTCAGGGTCTCATACTAGAGAAAGAGGAGAAATACTAGATGGAAGATGCCAAAAACATAAAG (312)

FS_1619 GTTTAGGTCTCAATCGTCATTACACGGCGATCTTTCCG (313)

FS_1620 GCTCAAGAAGACAAAGAGATGGCTTCCAAGGTGTACG (314)

FS_1621 GCTCAGGGTCTCATACTATGGAAGATGCCAAAAACATAAAG (315)

FS_1574 GCTCAGGCCATGCCGGCGGCACGTAAGAGGTTCCAAC (316)

FS_1575 CTCCTTTGCTCATGCATGC (317)

FS_1576 GCTCAGGCATGCATGTTCACTATAGACGAGATGCTATC (318)

FS_1577 AAGTCGGATGTCTCTAAGATCTG (319)

FS_1641 GCGGAGGAGCATGCATGTTTACCATCGACGAGATG (320)

FS_1642 CAGCCGGATCTCGAGTTAG (321)
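
The staggered primers listed in Table 6 (SEQ ID NOs 296 to 299) share a constant 5' portion and a constant 3' portion and differ only in the number of degenerate N bases between them. As a minimal sketch of that layout, the snippet below rebuilds the four sequences from those parts. The split into `PREFIX` and `BINDING_SITE` is read off the listed sequences, and the function name `staggered_primers` is hypothetical and not part of the described method.

```python
# Constant parts read off SEQ ID NOs 296-299 in Table 6 (5' -> 3').
PREFIX = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"  # shared 5' portion
BINDING_SITE = "AGCTCGGCTAAGCAAGAAGA"         # shared 3' portion


def staggered_primers(prefix: str, binding_site: str, stagger_lengths) -> dict:
    """Hypothetical helper: build one primer per stagger length by placing
    that many degenerate N bases between the prefix and the binding site."""
    return {n: prefix + "N" * n + binding_site for n in stagger_lengths}


if __name__ == "__main__":
    # Stagger lengths 2 to 5 correspond to the four listed primers.
    for n, seq in staggered_primers(PREFIX, BINDING_SITE, range(2, 6)).items():
        print(n, seq)
```

Varying the stagger length shifts the position at which the constant binding site is read in each amplicon; the same construction with a different prefix and binding site reproduces SEQ ID NOs 300 to 303.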

Table 7: Primers and TaqMan probes used for qRT-PCR

Primer ID Sequence (5' → 3') (SEQ ID NO)

16S rRNA E. coli TaqMan Fw TGGCGCATACAAAGAGAAGC (322)

16S rRNA E. coli TaqMan Rv ACTCCAATCCGGACTACGAC (323)

16S rRNA E. coli TaqMan probe (5'FAM / 3'Black Hole Quencher 1) ACCTCGCGAGAGCAAGCGGACC (324)

sfGFP E. coli TaqMan Fw CGGATCACATGAAACGGCAT (325)

sfGFP E. coli TaqMan Rv CGTCTTGTAGGTCCCGTCAT (326)

sfGFP E. coli TaqMan probe (5'HEX / 3'Black Hole Quencher 1) ACCTTCGGGCATGGCACTCTTG (327)

Rluc E. coli TaqMan Fw AATGGGTAAGTCCGGCAAGA (328)

Rluc E. coli TaqMan Rv CGTGGCCCACAAAGATGATT (329)

Rluc E. coli TaqMan probe (5'HEX / 3'Black Hole Quencher 1) ACCTCACCGCTTGGTTCGAGCTGC (330)

Fluc E. coli TaqMan Fw GCTCCAACACCCCAACATCTTC (331)

Fluc E. coli TaqMan Rv GCTCCAAAACAACAACGGCG (332)

Fluc E. coli TaqMan probe (5'HEX / 3'Black Hole Quencher 1) CAGGTGTCGCAGGTCTTCCCGACGA (333)

Sequences 1 - RT-Cas1s, Cas2s and CRISPR arrays

Codon-mapped DNA sequences for the individual RT-Cas1 and Cas2 orthologs were ordered from Twist Biosciences or Genscript, along with their predicted CRISPR arrays, for the classical adaptation read-out in Figs. 6 and 7.

Bacteroides fragilis strain S14

Bacteroides fragilis strain S14 RT-Cas1 (SEQ ID NO 1)

Bacteroides fragilis strain S14 Cas2 (SEQ ID NO 2)

Bacteroides fragilis strain S14 Array (SEQ ID NO 102)

Campylobacter fetus subsp. Fetus

Campylobacter fetus subsp. Fetus RT-Cas1 (SEQ ID NO 3)

Campylobacter fetus subsp. Fetus Cas2 (SEQ ID NO 4)

Campylobacter fetus subsp. Fetus Array (SEQ ID NO 103)

Cellulomonas bogoriensis 69B4

Cellulomonas bogoriensis 69B4 RT-Cas1 (SEQ ID NO 5)

Cellulomonas bogoriensis 69B4 Cas2 (SEQ ID NO 6)

Cellulomonas bogoriensis 69B4 Array (SEQ ID NO 35)

Fusicatenibacter saccharivorans

Fusicatenibacter saccharivorans RT-Cas1 (SEQ ID NO 7)

Fusicatenibacter saccharivorans Cas2 (SEQ ID NO 8)

Fusicatenibacter saccharivorans Array 1 (SEQ ID NO 36)

Fusicatenibacter saccharivorans Array 2 (SEQ ID NO 37)

Candidatus Accumulibacter sp. SK-02

Candidatus Accumulibacter sp. SK-02 RT-Cas1 (SEQ ID NO 9)

Candidatus Accumulibacter sp. SK-02 Cas2 (SEQ ID NO 10)

Candidatus Accumulibacter sp. SK-02 Array (SEQ ID NO 38)

Micromonospora rosaria

Micromonospora rosaria RT-Cas1 (SEQ ID NO 11)

Micromonospora rosaria Cas2 (SEQ ID NO 12)

Micromonospora rosaria Array 1 (SEQ ID NO 39)

Micromonospora rosaria Array 2 (SEQ ID NO 40)

Candidatus Accumulibacter sp. BA-91

Candidatus Accumulibacter sp. BA-91 RT-Cas1 (SEQ ID NO 13)

Candidatus Accumulibacter sp. BA-91 Cas2 (SEQ ID NO 14)

Candidatus Accumulibacter sp. BA-91 Array (SEQ ID NO 41)

Desulfarculus baarsii DSM 2075

Desulfarculus baarsii DSM 2075 RT-Cas1 (SEQ ID NO 15)

Desulfarculus baarsii DSM 2075 Cas2 (SEQ ID NO 16)

Desulfarculus baarsii DSM 2075 Array (SEQ ID NO 42)

Woodsholea maritima

Woodsholea maritima RT-Cas1 (SEQ ID NO 17)

Woodsholea maritima Array (SEQ ID NO 43)

Azospirillum lipoferum 4B

Azospirillum lipoferum 4B RT-Cas1 (SEQ ID NO 19)

Azospirillum lipoferum 4B Cas2 (SEQ ID NO 20)

Azospirillum lipoferum 4B Array (SEQ ID NO 44)

Azospirillum lipoferum 4B Array 2 (SEQ ID NO 45)

Vibrio sinaloensis strain T08

Vibrio sinaloensis strain T08 RT-Cas1 (SEQ ID NO 21)

Vibrio sinaloensis strain T08 Cas2 (SEQ ID NO 22)

Vibrio sinaloensis strain T08 Array (SEQ ID NO 46)

Teredinibacter turnerae T8412

Teredinibacter turnerae T8412 RT-Cas1 (SEQ ID NO 23)

Teredinibacter turnerae T8412 Cas2 (SEQ ID NO 24)

Teredinibacter turnerae T8412 Array (SEQ ID NO 47)

Tolypothrix campylonemoides

Tolypothrix campylonemoides RT-Cas1 (SEQ ID NO 25)

Tolypothrix campylonemoides Cas2 (SEQ ID NO 26)

Tolypothrix campylonemoides Array (SEQ ID NO 48)

Oscillatoriales cyanobacterium

Oscillatoriales cyanobacterium RT-Cas1 (SEQ ID NO 27)

Oscillatoriales cyanobacterium Cas2 (SEQ ID NO 28)

Oscillatoriales cyanobacterium Array (SEQ ID NO 49)

Rivularia sp. PCC 7116

Rivularia sp. PCC 7116 Cas1 (SEQ ID NO 29)

Rivularia sp. PCC 7116 RT (SEQ ID NO 33)

Rivularia sp. PCC 7116 Cas2 (SEQ ID NO 30)

Rivularia sp. PCC 7116 Array 1 (SEQ ID NO 50)

Rivularia sp. PCC 7116 Array 2 (SEQ ID NO 51)

Eubacterium saburreum DSM 3986

Eubacterium saburreum DSM 3986 RT-Cas1 (SEQ ID NO 31)

Eubacterium saburreum DSM 3986 Cas2 (SEQ ID NO 32)

Eubacterium saburreum DSM 3986 Array 1 (SEQ ID NO 52)

Eubacterium saburreum DSM 3986 Array 2 (SEQ ID NO 53)

Sequences 2 - CRISPR array directionality screen

Sequences of putative arrays for the CRISPR array directionality screen related to Fig. 8b, sorted by their respective ortholog. All sequences are depicted with flanking adapter sites for Gibson Assembly into their respective RT-Cas1-Cas2 expression plasmids (RC = reverse complement).

Bacteroides fragilis strain S14

Bacteroides fragilis strain S14 Array 1 (SEQ ID NO 54)

Bacteroides fragilis strain S14 Array 1 RC (SEQ ID NO 55)

Campylobacter fetus subsp. Fetus

Campylobacter fetus subsp. Fetus Array 1 (SEQ ID NO 56)

Campylobacter fetus subsp. Fetus Array 1 RC (SEQ ID NO 57)

Cellulomonas bogoriensis 69B4

Cellulomonas bogoriensis 69B4 Array 1 (SEQ ID NO 58)

Cellulomonas bogoriensis 69B4 Array 1 RC (SEQ ID NO 59)

Fusicatenibacter saccharivorans

Fusicatenibacter saccharivorans Array 1 (SEQ ID NO 60)

Fusicatenibacter saccharivorans Array 1 RC (SEQ ID NO 61)

Fusicatenibacter saccharivorans Array 2 (SEQ ID NO 62)

Fusicatenibacter saccharivorans Array 2 RC (SEQ ID NO 63)

Candidatus Accumulibacter sp. SK-02

Candidatus Accumulibacter sp. SK-02 Array 1 (SEQ ID NO 64)

Candidatus Accumulibacter sp. SK-02 Array 1 RC (SEQ ID NO 65)

Micromonospora rosaria

Micromonospora rosaria Array 1A (SEQ ID NO 66)

Micromonospora rosaria Array 1 RC (SEQ ID NO 67)

Micromonospora rosaria Array 2A (SEQ ID NO 68)

Micromonospora rosaria Array 2 RC (SEQ ID NO 69)

Micromonospora rosaria Array 3A (SEQ ID NO 70)

Candidatus Accumulibacter sp. BA-91

Candidatus Accumulibacter sp. BA-91 Array 1 (SEQ ID NO 71)

Desulfarculus baarsii DSM 2075

Desulfarculus baarsii DSM 2075 Array 1 (SEQ ID NO 72)

Desulfarculus baarsii DSM 2075 Array 1 RC (SEQ ID NO 73)

Desulfarculus baarsii DSM 2075 Array 2 (SEQ ID NO 74)

Desulfarculus baarsii DSM 2075 Array 2 RC (SEQ ID NO 75)

Woodsholea maritima

Woodsholea maritima Array 1 (SEQ ID NO 76)

Woodsholea maritima Array 1 RC (SEQ ID NO 77)

Azospirillum lipoferum 4B

Azospirillum lipoferum 4B Array 1 (SEQ ID NO 78)

Azospirillum lipoferum 4B Array 1 RC (SEQ ID NO 79)

Azospirillum lipoferum 4B Array 2A (SEQ ID NO 80)

Azospirillum lipoferum 4B Array 2 RC (SEQ ID NO 81)

Teredinibacter turnerae T8412

Teredinibacter turnerae T8412 Array 1 (SEQ ID NO 82)

Teredinibacter turnerae T8412 Array 1 RC (SEQ ID NO 83)

Tolypothrix campylonemoides

Tolypothrix campylonemoides Array 1 (SEQ ID NO 84)

Tolypothrix campylonemoides Array 1 RC (SEQ ID NO 85)

Tolypothrix campylonemoides Array 2 (SEQ ID NO 86)

Tolypothrix campylonemoides Array 2 RC (SEQ ID NO 87)

Tolypothrix campylonemoides Array 3 (SEQ ID NO 88)

Tolypothrix campylonemoides Array 3 RC (SEQ ID NO 89)

Oscillatoriales cyanobacterium

Oscillatoriales cyanobacterium Array 1 (SEQ ID NO 90)

Oscillatoriales cyanobacterium Array 1 RC (SEQ ID NO 91)

Oscillatoriales cyanobacterium Array 2 (SEQ ID NO 92)

Oscillatoriales cyanobacterium Array 2 RC (SEQ ID NO 93)

Rivularia sp. PCC 7116

Rivularia sp. PCC 7116 Array 1 (SEQ ID NO 94)

Rivularia sp. PCC 7116 Array 1 RC (SEQ ID NO 95)

Rivularia sp. PCC 7116 Array 2 (SEQ ID NO 96)

Rivularia sp. PCC 7116 Array 2 RC (SEQ ID NO 97)

Eubacterium saburreum DSM 3986

Eubacterium saburreum DSM 3986 Array 1 (SEQ ID NO 98)

Eubacterium saburreum DSM 3986 Array 1 RC (SEQ ID NO 99)

Eubacterium saburreum DSM 3986 Array 2 (SEQ ID NO 100)

Eubacterium saburreum DSM 3986 Array 2 RC (SEQ ID NO 101)

Sequences 3 - Miscellaneous sequences

gBlock FS_gBlock_td_intron_acceptor (SEQ ID NO 104)

Human codon-optimized FsRT-Cas1-T7RBS-Cas2 (SEQ ID NO 34)

pFS_0453 plasmid (SEQ ID NO 334)