Title:
RECONSTRUCTION OF ANCESTRAL CELLS BY ENZYMATIC RECORDING
Document Type and Number:
WIPO Patent Application WO/2016/040594
Kind Code:
A1
Abstract:
Provided herein are compositions and methods for barcoding mammalian cells. The compositions and methods provided herein further provide methods for tracing such barcoded cells ex vivo or in vivo during the lifetime of an organism. In one aspect, a method of forming a barcoded cell is provided. The method includes expressing in a cell a heterologous cleaving protein complex including a sequence-specific DNA-binding domain and a nucleic acid cleaving domain. The sequence-specific DNA-binding domain targets the nucleic acid cleaving domain to a genomic nucleic acid sequence, thereby forming a genomic nucleic acid sequence bound to the heterologous cleaving protein complex.

Inventors:
MCMANUS MICHAEL T (US)
Application Number:
PCT/US2015/049375
Publication Date:
March 17, 2016
Filing Date:
September 10, 2015
Assignee:
UNIV CALIFORNIA (US)
International Classes:
C12N15/09
Domestic Patent References:
WO2013043638A12013-03-28
WO2014099750A22014-06-26
WO2012013717A12012-02-02
WO2010045526A12010-04-22
Foreign References:
US20140206546A12014-07-24
US20090100535A12009-04-16
US7122346B22006-10-17
US20030008290A12003-01-09
Attorney, Agent or Firm:
HINSCH, Matthew E. et al. (Two Embarcadero Center, Eighth Floor, San Francisco, California, US)
Claims:
WHAT IS CLAIMED IS:

1. A method of forming a barcoded cell said method comprising,

(i) expressing in a cell a heterologous cleaving protein complex comprising a sequence-specific DNA-binding domain and a nucleic acid cleaving domain; wherein said sequence-specific DNA-binding domain targets said nucleic acid cleaving domain to a genomic nucleic acid sequence, thereby forming a genomic nucleic acid sequence bound to said heterologous cleaving protein complex;

(ii) introducing a double-stranded cleavage site in said genomic nucleic acid sequence bound to said heterologous cleaving protein complex, thereby forming a double-stranded cleavage site in said genomic nucleic acid sequence; and

(iii) inserting random nucleotides at said double-stranded cleavage site, thereby forming said barcoded cell.

2. The method of claim 1, further comprising after said inserting step in (iii):

(iv) allowing said barcoded cell to divide, thereby forming a barcoded progeny of cells;

(v) collecting said barcoded progeny;

(vi) nucleotide sequencing said barcoded nucleic acid sequence; and

(vii) correlating said barcoded nucleic acid sequence.

3. The method of claims 1 or 2, further comprising after said inserting step in (iii) and before said allowing step in (iv), (iii.i) ligating the ends of said double-stranded cleavage site.

4. The method of any one of the preceding claims, wherein said sequence-specific DNA-binding domain comprises an RNA molecule.

5. The method of claim 4, wherein said RNA molecule is a guide RNA.

6. The method of claim 4, wherein said RNA molecule comprises a nucleic acid cleaving domain recognition site.

7. The method of any one of claims 1 to 6, wherein said nucleic acid cleaving domain comprises a Cas9 domain or functional portion thereof.

8. The method of any one of claims 1 to 7, wherein said genomic nucleic acid sequence comprises a guide RNA encoding sequence.

9. The method of claim 1 or 2, wherein said sequence-specific DNA-binding domain is a TAL effector DNA binding domain or functional portion thereof.

10. The method of claim 1 or 2, wherein said sequence-specific DNA-binding domain is a zinc finger domain or functional portion thereof.

11. The method of claims 9 or 10, wherein said nucleic acid cleaving domain comprises a restriction enzyme or functional portion thereof.

12. The method of claim 11, wherein said restriction enzyme is MmeI or FokI.

13. The method of any one of the preceding claims, wherein said inserting comprises targeting a recombinant DNA editing protein to said double-stranded cleavage site.

14. The method of any one of claims 1-12, wherein said inserting comprises targeting an endogenous DNA editing protein to said double-stranded cleavage site.

15. The method of claim 13, wherein said recombinant DNA editing protein is a heterologous DNA editing protein.

16. The method of claim 15, wherein said recombinant DNA editing protein comprises a sequence-specific DNA-binding domain and a terminal deoxynucleotidyl transferase (TdT) domain.

17. The method of claim 16, wherein said sequence-specific DNA-binding domain is a TAL effector DNA binding domain or functional portion thereof.

18. The method of claim 16, wherein said sequence-specific DNA-binding domain is a zinc finger domain or functional portion thereof.

19. A recombinant cleaving ribonucleoprotein complex comprising, (i) a sequence-specific DNA-binding RNA molecule; and (ii) a nucleic acid cleaving domain; wherein said RNA molecule comprises a nucleic acid cleaving domain recognition site.

20. The recombinant cleaving ribonucleoprotein complex of claim 19, wherein said RNA molecule is a guide RNA.

21. The recombinant cleaving ribonucleoprotein complex of claim 19, wherein said RNA molecule comprises a nucleic acid cleaving domain recognition site.

22. The recombinant cleaving ribonucleoprotein complex of any one of claims 19 to 21, wherein said nucleic acid cleaving domain comprises a Cas9 domain or functional portion thereof.

23. The recombinant cleaving ribonucleoprotein complex of any one of claims 19 to 22, further comprising a recombinant DNA editing protein.

24. The recombinant cleaving ribonucleoprotein complex of claim 23, wherein said recombinant DNA editing protein comprises a terminal deoxynucleotidyl transferase domain.

25. The recombinant cleaving ribonucleoprotein complex of claim 23, wherein said recombinant DNA editing protein comprises a sequence-specific DNA-binding domain.

26. A nucleic acid encoding a recombinant cleaving ribonucleoprotein complex of any one of claims 19-25.

27. A cell comprising the nucleic acid of claim 26.

28. The cell of claim 27, further comprising a promoter operably linked to the nucleic acid.

29. A non-human animal comprising the cell of claims 27 or 28.

30. A method of forming a barcoded cell said method comprising:

(i) expressing in a cell a recombinant cleaving ribonucleoprotein complex of any one of claims 19-25; wherein said sequence-specific DNA-binding RNA molecule targets said nucleic acid cleaving domain to a genomic nucleic acid sequence, thereby forming a genomic nucleic acid sequence bound to said recombinant cleaving ribonucleoprotein complex;

(ii) introducing a double-stranded cleavage site in said genomic nucleic acid sequence bound to said recombinant cleaving ribonucleoprotein complex, thereby forming a double-stranded cleavage site in said genomic nucleic acid sequence; and

(iii) targeting said recombinant DNA editing protein to said double-stranded cleavage site such as said recombinant DNA editing protein inserts a barcoded nucleic acid sequence into said double-stranded cleavage site; thereby forming said barcoded cell.

31. The method of claim 30, further comprising after said targeting step in (iii):

(iv) allowing said barcoded cell to divide, thereby forming a barcoded progeny of cells;

(v) collecting said barcoded progeny;

(vi) nucleotide sequencing said barcoded nucleic acid sequence; and

(vii) correlating said barcoded nucleic acid sequence.

32. The method of claims 30 or 31, further comprising after said inserting step in (iii) and before said allowing step in (iv), (iii.i) ligating the ends of said double-stranded cleavage site.

33. A recombinant DNA editing protein comprising:

(i) a sequence-specific DNA-binding domain; and

(ii) a terminal deoxynucleotidyl transferase domain.

34. The recombinant DNA editing protein of claim 33, wherein said sequence-specific DNA-binding domain comprises an RNA molecule.

35. The recombinant DNA editing protein of claim 34, wherein said RNA molecule is a guide RNA.

36. The recombinant DNA editing protein of claim 34, wherein said RNA molecule comprises a nucleic acid cleaving domain recognition site.

37. The recombinant DNA editing protein of claim 33, wherein said sequence-specific DNA-binding domain is a TAL effector DNA binding domain or functional portion thereof.

38. The recombinant DNA editing protein of claim 37, wherein said sequence-specific DNA-binding domain is a zinc finger domain or functional portion thereof.

39. The recombinant DNA editing protein of any one of claims 33 to 38, further comprising a nucleic acid cleaving domain.

40. The recombinant DNA editing protein of claim 39, wherein said nucleic acid cleaving domain is a restriction enzyme.

41. The recombinant DNA editing protein of claim 40, wherein said restriction enzyme is MmeI or FokI.

42. A nucleic acid encoding a recombinant cleaving protein of any one of claims 33-41.

43. A recombinant cleaving protein comprising:

(i) a cell cycle regulated domain;

(ii) a sequence-specific DNA-binding domain; and

(iii) a DNA cleaving domain;

wherein said cell cycle regulated domain is operably linked to one end of said sequence-specific DNA-binding domain and said DNA cleaving domain is linked to the other end of said sequence-specific DNA-binding domain.

44. The recombinant cleaving protein of claim 1, wherein all of said domains are heterologous to each other.

45. The recombinant cleaving protein of claim 1, wherein said cell cycle regulated domain is a peptide domain.

46. The recombinant cleaving protein of claim 45, wherein said peptide domain is a Geminin peptide.

47. The recombinant cleaving protein of claim 1, wherein said sequence-specific DNA-binding domain is a TAL effector DNA binding domain.

48. The recombinant cleaving protein of claim 1, wherein said DNA cleaving domain comprises a cleaving agent dimer.

49. The recombinant cleaving protein of claim 48, wherein said cleaving agent dimer comprises a first cleaving agent and a second cleaving agent.

50. The recombinant cleaving protein of claim 49, wherein said first cleaving agent and said second cleaving agent are linked through a linker.

51. The recombinant cleaving protein of claim 50, wherein said first cleaving agent and said second cleaving agent are a FokI nuclease.

52. The recombinant cleaving protein of claim 50, wherein said first cleaving agent and said second cleaving agent are a MmeI nuclease.

53. A nucleic acid encoding a recombinant cleaving protein of any one of claims 43-52.

54. A recombinant DNA editing protein comprising:

(i) a cell cycle regulated domain;

(ii) a sequence-specific DNA-binding domain; and

(iii) a terminal deoxynucleotidyl transferase domain;

wherein said cell cycle regulated domain is operably linked to one end of said sequence-specific DNA-binding domain and said terminal deoxynucleotidyl transferase domain is linked to the other end of said sequence-specific DNA-binding domain.

55. A nucleic acid encoding a recombinant DNA editing protein of claim 54.

56. A cell comprising a recombinant cleaving protein of any one of claims 43-52, a recombinant DNA editing protein of claim 54 or both.

57. The cell of claim 56, wherein said cell is a zygote.

58. The cell of claim 56, wherein said cell forms part of an organism.

59. A method of forming a barcoded cell said method comprising:

(i) expressing in a cell a recombinant cleaving protein and a recombinant DNA editing protein in a cell cycle-dependent manner;

(ii) targeting said recombinant cleaving protein to a genomic nucleic acid sequence, thereby introducing a double-stranded cleavage site in said genomic nucleic acid sequence;

(iii) targeting said recombinant DNA editing protein to said double-stranded cleavage site such as said recombinant DNA editing protein inserts a barcoded nucleic acid sequence into said double-stranded cleavage site; thereby forming said barcoded cell.

60. A method of forming a barcoded cell said method comprising:

(i) expressing in a cell a recombinant cleaving protein of any one of claims 43-52 and a recombinant DNA editing protein of claim 54 in a cell cycle-dependent manner;

(ii) targeting said recombinant cleaving protein to a genomic nucleic acid sequence, thereby introducing a double-stranded cleavage site in said genomic nucleic acid sequence;

(iii) targeting said recombinant DNA editing protein to said double-stranded cleavage site such as said recombinant DNA editing protein inserts a barcoded nucleic acid sequence into said double-stranded cleavage site; thereby forming said barcoded cell.

61. The method of claim 59 or 60, further comprising after said targeting step in (iii):

(iv) allowing said barcoded cell to divide, thereby forming a barcoded progeny of cells;

(v) collecting said barcoded progeny;

(vi) nucleotide sequencing said barcoded nucleic acid sequence; and

(vii) correlating said barcoded nucleic acid sequence.

62. The method of claim 59 or 60, wherein said expressing in a cell cycle-dependent manner comprises expressing in S, G1, or M phase.

63. The method of claim 59 or 60, further comprising after said inserting step in (iii), ligating the ends of said double-stranded cleavage site.

Description:
RECONSTRUCTION OF ANCESTRAL CELLS BY ENZYMATIC RECORDING

CROSS-REFERENCE TO RELATED PATENT APPLICATIONS

[0001] The present application claims benefit of priority to US Provisional Patent Application No. 62/048,695, filed September 10, 2014, which is incorporated by reference for all purposes.

BACKGROUND OF THE INVENTION

[0002] One of the most fascinating aspects of multicellular life is the ability of cells to change their identity. Developmental biologists have spent decades trying to understand this process in plants, fungi, and worms. As early as 1929, Walter Vogt used "vital dyes" to label individual cells in Xenopus frog embryos. The tissue(s) to which the cells contribute would thus be labeled and visible in the adult organism. With this method, Vogt was able to discern the migration of particular cells to the tissues into which they ultimately integrated. The information Vogt gathered from his Xenopus tracing experiments was then used to develop early qualitative fate maps for a 32-cell blastula. In 1983, using microscopy, Sulston and colleagues reconstructed an entire C. elegans fate map, in which the lineage of its invariable 959 somatic cells was visibly charted. This was a tremendous milestone for the developmental biology field, and the Nobel Prize was awarded in 2002 for this achievement. Yet worms are transparent, and extending this brute-force fate mapping method to most other species is not possible.

[0003] In 2007, Jeff Lichtman and Joshua Sanes developed 'Brainbow' technology, based on transgenic animals harboring Cre recombinase and a multicolor cassette (FIG. 3). While earlier labeling techniques allowed for the mapping of only a handful of cells, Brainbow allows the generation of transgenic reporter mice in which more than 100 differently mapped neurons can be simultaneously and differentially illuminated. However, the use of Brainbow in the mouse is hampered by the incredible diversity of neurons of the CNS. The sheer cellular density, combined with the presence of long tracts of axons, makes viewing larger regions of the CNS with high resolution difficult. Although this cutting-edge technology is fantastic for microscopically visualizing subsets of related cells, it comes up short for simultaneously and definitively mapping large populations of cells in complex tissues.

[0004] Two of the main limitations of all lineage tracing approaches are granularity and depth. Granularity is a major limitation when one considers that cell development does not proceed along a linear path, but instead branches out, splaying to many cell types. DNA barcodes have been used to mark lineages, but they do not maintain a granular code between different cell types. For example, when a single hematopoietic stem cell is marked with a single DNA barcode, every hematopoietic cell in the entire lineage will contain that very same mark. Such an approach may be useful for comparing the competition for hematopoietic reconstitution, but it gives no granularity to the individual cells, much less the major and minor branched lineages. Currently there are no approaches for applying unique marks to individual cells in a way that would trace their individual fates. The methods and compositions provided herein solve this and other problems in the art.

BRIEF SUMMARY OF THE INVENTION

[0005] In one aspect, a method of forming a barcoded cell is provided. The method includes in step (i) expressing in a cell a heterologous cleaving protein complex including a sequence-specific DNA-binding domain and a nucleic acid cleaving domain. The sequence-specific DNA-binding domain targets the nucleic acid cleaving domain to a genomic nucleic acid sequence, thereby forming a genomic nucleic acid sequence bound to the heterologous cleaving protein complex. In step (ii) a double-stranded cleavage site is introduced in the genomic nucleic acid sequence bound to the heterologous cleaving protein complex, thereby forming a double-stranded cleavage site in the genomic nucleic acid sequence. In step (iii) random nucleotides are inserted at the double-stranded cleavage site, thereby forming the barcoded cell.

[0006] In another aspect, a recombinant cleaving ribonucleoprotein complex including (i) a sequence-specific DNA-binding RNA molecule and (ii) a nucleic acid cleaving domain is provided, wherein the RNA molecule includes a nucleic acid cleaving domain recognition site.

[0007] In another aspect, a method of forming a barcoded cell is provided. The method includes in step (i) expressing in a cell a recombinant cleaving ribonucleoprotein complex as provided herein including embodiments thereof. The sequence-specific DNA-binding RNA molecule targets the nucleic acid cleaving domain to a genomic nucleic acid sequence, thereby forming a genomic nucleic acid sequence bound to the recombinant cleaving ribonucleoprotein complex. In step (ii) a double-stranded cleavage site is introduced in the genomic nucleic acid sequence bound to the recombinant cleaving ribonucleoprotein complex, thereby forming a double-stranded cleavage site in the genomic nucleic acid sequence. In step (iii) the recombinant DNA editing protein is targeted to the double-stranded cleavage site such as the DNA editing protein inserts a barcoded nucleic acid sequence into the double-stranded cleavage site; thereby forming the barcoded cell.

[0008] In another aspect, a recombinant DNA editing protein is provided. The recombinant DNA editing protein includes (i) a sequence-specific DNA-binding domain and (ii) a terminal deoxynucleotidyl transferase domain.

[0009] In another aspect, a recombinant cleaving protein is provided. The recombinant cleaving protein includes (i) a cell cycle regulated domain, (ii) a sequence-specific DNA-binding domain and (iii) a DNA cleaving domain, wherein the cell cycle regulated domain is operably linked to one end of the sequence-specific DNA-binding domain and the DNA cleaving domain is linked to the other end of the sequence-specific DNA-binding domain.

[0010] In another aspect, a recombinant DNA editing protein is provided. The recombinant DNA editing protein includes (i) a cell cycle regulated domain, (ii) a sequence-specific DNA-binding domain and (iii) a terminal deoxynucleotidyl transferase domain, wherein the cell cycle regulated domain is operably linked to one end of the sequence-specific DNA-binding domain and the terminal deoxynucleotidyl transferase domain is linked to the other end of the sequence-specific DNA-binding domain.

[0011] In another aspect, a method of forming a barcoded cell is provided. The method includes (i) expressing in a cell a recombinant cleaving protein and a recombinant DNA editing protein in a cell cycle-dependent manner. In step (ii) the recombinant cleaving protein is targeted to a genomic nucleic acid sequence, thereby introducing a double-stranded cleavage site in the genomic nucleic acid sequence. In step (iii) the recombinant DNA editing protein is targeted to the double-stranded cleavage site such as the recombinant DNA editing protein inserts a barcoded nucleic acid sequence into the double-stranded cleavage site; thereby forming the barcoded cell.

[0012] In another aspect, a method of forming a barcoded cell is provided. The method includes in step (i) expressing in a cell a recombinant cleaving protein as provided herein including embodiments thereof and a recombinant DNA editing protein as provided herein including embodiments thereof in a cell cycle-dependent manner. In step (ii) the recombinant cleaving protein is targeted to a genomic nucleic acid sequence, thereby introducing a double-stranded cleavage site in the genomic nucleic acid sequence. In step (iii) the recombinant DNA editing protein is targeted to the double-stranded cleavage site such as the recombinant DNA editing protein inserts a barcoded nucleic acid sequence into the double-stranded cleavage site; thereby forming the barcoded cell.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] FIG. 1. The Cas9:gRNA complex. This image depicts the Cas9:gRNA complex targeting a stretch of DNA. Pairing of the 5'-gRNA sequence with cognate DNA (green) triggers Cas9 to induce double-stranded cleavage of the DNA. Cleavage occurs proximal to the PAM motif, in this case NGG (orange). Converting the gRNA stem base to two G:C pairs should result in a self-targeting gRNA which (if active) will destroy itself. Normally this is an unwanted activity, but it will allow Applicants to identify the active gRNAs by deep sequencing the gRNA sequence.

[0014] FIG. 2. Barcoding Schematics. A, Two plasmids were designed with the aim of introducing barcodes into cells. The first vector (left hand vector) contains puromycin, mCherry and Cas9 separated by T2A elements. The second vector (right hand vector) contains a self-editing guide RNA driven by a U6 vector, and a separate promoter driving a hygromycin T2A CD4 cassette. Cells expressing both plasmids will produce a charged Cas9 guide RNA complex. Pairing of the 5'-gRNA sequence with cognate DNA (green) triggers Cas9 to introduce a double-stranded break 3 nucleotides upstream of the PAM sequence in orange (NGG). The schematic displays the new PAM motif introduced into the guide RNA, which will be cut by Cas9, and barcodes will be introduced at this site.
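The cut-and-insert step described for FIG. 2 can be illustrated with a minimal sketch. It assumes a single DNA strand written 5' to 3', a 20-nucleotide protospacer immediately upstream of an NGG PAM, and uniformly random insertion; the function names `find_cut_sites` and `insert_random_nucleotides` are hypothetical and are not part of the application.

```python
import random


def find_cut_sites(seq, protospacer_len=20):
    """Return insertion indices for every NGG PAM with room for a protospacer.

    For a PAM starting at index p, the break is placed 3 nucleotides
    upstream of the PAM, i.e. between seq[p - 4] and seq[p - 3].
    protospacer_len = 20 is an illustrative assumption.
    """
    sites = []
    for p in range(protospacer_len, len(seq) - 2):
        if seq[p + 1:p + 3] == "GG":  # N can be any base, so only check GG
            sites.append(p - 3)
    return sites


def insert_random_nucleotides(seq, cut, n, rng):
    """Insert n random nucleotides at the cut index; return (edited, barcode).

    Uniform sampling is a simplification; a real enzymatic insertion
    need not be uniform.
    """
    barcode = "".join(rng.choice("ACGT") for _ in range(n))
    return seq[:cut] + barcode + seq[cut:], barcode


if __name__ == "__main__":
    target = "A" * 20 + "TGG" + "AAAA"
    cut = find_cut_sites(target)[0]
    edited, barcode = insert_random_nucleotides(target, cut, 5, random.Random(0))
    print(cut, barcode, edited)
```

The sketch only models sequence bookkeeping; it says nothing about cleavage efficiency or repair outcomes in a cell.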

[0015] FIG. 3. (A) Brainbow mouse. Different colors are generated upon random recombination of three spectrally distinct fluorescent proteins. Images show combinatorial expression in the brain (Livet et al., 2007). (B) Confetti mouse. A Brainbow construct modified such that Cre deletion removes a stop cassette, resulting in four possible recombination outcomes (image shows small intestine; Snippert et al., 2010b). Although fluorescence is the primary readout, the random recombination provides a short theoretical barcode. (C) Illustration that depicts how mixing fluorescent markers may result in a limited number of microscopically discernible cells.

[0016] FIG. 4. The tRACER concept. This overview schematic is described in the text. Note that the DNA binding domains of the TALEN:TYPER pair may be immediately side-by-side (proximal) or overlapping (competitive) as shown here. Also, the growing barcode extends away from the TALEN:TYPER pair. The cartoon displays 3mer barcodes, but Applicants will optimize for longer 10-20mer barcodes.
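The difference between 3mer and 10-20mer barcodes can be made concrete with a short calculation. This is a sketch under the assumption that every position is drawn independently and uniformly from the four nucleotides, which gives an upper bound on diversity; the helper name `barcode_diversity` is hypothetical.

```python
def barcode_diversity(length, alphabet_size=4):
    """Number of distinct barcodes of a given length over A, C, G, T.

    This is a theoretical maximum; biased or correlated insertion
    would lower the effective diversity.
    """
    return alphabet_size ** length


if __name__ == "__main__":
    for n in (3, 10, 20):
        print(f"{n}-mer: {barcode_diversity(n):,} possible barcodes")
```

Under these assumptions a 3mer distinguishes at most 64 states, whereas a 10mer allows about one million and a 20mer about 1.1 trillion.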

[0017] FIG. 5. Single-chain FokI can efficiently cleave DNA. (left) Schematic representation of AZP-scFokI. (right) In vitro activity of an AZP-scFokI variant containing a flexible (GGGGS) * linker; lane 1: Ctrl DNA substrate, lane 2: incubation with AZP-scFokI. Site-specific cleavage by AZP-scFokI produces 0.9- and 2-kbp DNA fragments (indicated as P1 and P2, respectively). S: a plasmid substrate. FIG. adapted after Mino et al.

[0018] FIG. 6. Modified TALEN and TYPER enzymes. This figure depicts schematics for some of the constructs Applicants have created and are now testing. CC, cell cycle peptide; TAL, TAL effector DNA binding domain; arm, extension peptide; RE, restriction enzyme; SCL, single-chain linker; TdT, terminal deoxynucleotidyl transferase.

[0019] FIG. 7. Examples of TdT activity in cultured cells. These preliminary data are derived from transient transfection of cells with a Cas9 targeting nuclease without (control, Ctrl) and with a wild-type TdT cDNA vector (TdT). Image shows a PCR product smear that appears only in TdT transfected cells. The PCR products were cloned and sequenced (alignment, see right). Green nucleotides are non-templated additions. The control reactions have deletions but no additions.

[0020] FIG. 8. Characterization of a Fluorescent Indicator for Cell-Cycle Progression. (A) A fluorescent probe that labels individual G1 phase nuclei in red and S/G2/M phase nuclei green. (F) Typical fluorescence images of HeLa cells expressing mKO2-hCdt1(30/120) and mAG-hGem(1/110) and immunofluorescence for incorporated BrdU at G1, G1/S, S, G2, and M phases. The scale bar represents 10 µm. Figure and legend adapted from Miyawaki et al.

[0021] FIG. 9. The tRACER concept is based on naturally occurring phenomena. VDJ recombination (left) and RNA editing (right) both use cascades of cleavage, terminal transferase activities, and ligation.

[0022] FIG. 10. tRACER path. This grossly simplified tracing of the lineage path of a single cell depicts nascent barcodes across the initial eight generations.

[0023] FIG. 11. New technologies offer tRACER a chance to profile specific cell types in biological settings. LEFT: In situ deep sequencing. Image adapted from Ke et al. RIGHT: Merged brightfield and fluorescence image of microfluidic "cell drops", showing successful detection of PTPRC via TaqMan probe (red) detection of j I (green), but not PC3 cells (blue). These are cutting-edge methods that will be married to tRACER, providing spatial resolution and cell identity to complex single-cell phylogenetic mapping experiments.
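The lineage path of FIG. 10, in which a barcode grows at each generation, can be mimicked with a toy simulation. This sketch assumes that each daughter cell appends one fresh random k-mer to the barcode inherited from its parent and that barcodes are never deleted or edited in the middle; the names `divide`, `grow` and `shared_generations` are hypothetical and are not taken from the application.

```python
import random


def divide(barcode, rng, k=3):
    """Return two daughter barcodes, each the parent's plus a new random k-mer."""
    return [
        barcode + "".join(rng.choice("ACGT") for _ in range(k))
        for _ in range(2)
    ]


def grow(generations, rng, k=3):
    """Start from one unmarked cell and return all barcodes after n divisions."""
    cells = [""]
    for _ in range(generations):
        cells = [daughter for cell in cells for daughter in divide(cell, rng, k)]
    return cells


def shared_generations(a, b, k=3):
    """Count the leading k-mer blocks two barcodes have in common.

    Under the append-only assumption this estimates how many generations
    two cells shared before diverging. Sisters that draw the same k-mer
    by chance would inflate the estimate, which is one reason longer
    barcodes are attractive.
    """
    n = 0
    while n * k < min(len(a), len(b)) and a[n * k:(n + 1) * k] == b[n * k:(n + 1) * k]:
        n += 1
    return n


if __name__ == "__main__":
    cells = grow(3, random.Random(1))
    print(len(cells), "cells")
    print(cells[0], cells[1], shared_generations(cells[0], cells[1]))
```

Pairwise values from `shared_generations` could feed a tree-building step, but that step is outside the scope of this sketch.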

[0024] FIG. 12: Schematic representation of embodiments of recombinant DNA editing proteins. Outlined are all constructs that will be generated, including combinations of DNA editing enzymes coupled to fluorescent markers, DNA polymerases and ligases.

[0025] FIG. 13: Schematic representation of a method of forming a barcoded cell.

[0026] FIG. 14: Evidence of Barcoding in vitro. A, HEK 293 cells were stably transduced with a lentiviral construct expressing the self-editing guide RNA. Cells were selected for 1 week with hygromycin (100 µg/ml). Cells were transduced with a lentiviral construct expressing TdT and selected with Zeomycin for 1 week (100 µg/ml). Finally, cells were transduced with a lentiviral construct expressing Cas9 followed by selection for 1 week with blasticidin (10 µg/ml). B, Following 2 weeks of blasticidin selection of the HEK 293/Cas9/self-editing guide/TdT cells, genomic DNA was extracted and PCR was carried out to amplify the region of interest (left panel). The 250bp band was gel extracted and TOPO cloned. Colonies were sequenced and barcodes were identified (right panel).

[0027] FIG. 15: Evidence of Barcoding in vitro. A, HEK 293 cells were stably transduced with a lentiviral construct expressing the self-editing guide RNA. Cells were selected for 1 week with hygromycin (100 µg/ml). Cells were transiently transfected with a construct expressing Cas9 fused to GFP and linked with TdT. B, 9 days following transfection, HEK 293/self-editing guide cells were sorted by level of GFP expression. Genomic DNA was extracted from GFP-positive cells and PCR was carried out to amplify the region of interest (left panel). The 250bp band was gel extracted and TOPO cloned. Colonies were sequenced and barcodes were identified (right panel).

[0028] FIG. 16A displays dsDNA break at a conventional DNA locus. FIG. 16B displays a self-editing gRNA (segRNA) locus.

[0029] FIG. 17 displays exemplary sequencing results of barcode insertions from terminal transferase.

[0030] FIG. 18 depicts constructs introduced into 293T cells.

DEFINITIONS

[0031] Unless defined otherwise, all technical and scientific terms used herein generally have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Generally, the nomenclature used herein and the laboratory procedures in cell culture, molecular genetics, organic chemistry, and nucleic acid chemistry and hybridization described below are those well known and commonly employed in the art. Standard techniques are used for nucleic acid and peptide synthesis. The techniques and procedures are generally performed according to conventional methods in the art and various general references (see generally, Sambrook et al. MOLECULAR CLONING: A LABORATORY MANUAL, 2d ed. (1989) Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., which is incorporated herein by reference), which are provided throughout this document. The nomenclature used herein and the laboratory procedures in analytical chemistry and organic synthesis described below are those well known and commonly employed in the art.

[0032] "Nucleic acid" refers to deoxyribonucleotides or ribonucleotides and polymers thereof in either single- or double-stranded form, and complements thereof. The term encompasses nucleic acids containing known nucleotide analogs or modified backbone residues or linkages, which are synthetic, naturally occurring, and non-naturally occurring, which have similar binding properties as the reference nucleic acid, and which are metabolized in a manner similar to the reference nucleotides. Examples of such analogs include, without limitation, phosphorothioates, phosphoramidates, methyl phosphonates, chiral-methyl phosphonates, 2-O-methyl ribonucleotides, and peptide-nucleic acids (PNAs).

[0033] Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions) and complementary sequences, as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues (Batzer et al., Nucleic Acid Res. 19:5081 (1991); Ohtsuka et al., J. Biol. Chem. 260:2605-2608 (1985); Rossolini et al., Mol. Cell. Probes 8:91-98 (1994)). The term nucleic acid is used interchangeably with gene, cDNA, mRNA, oligonucleotide, and polynucleotide.

[0034] The terms "identical" or percent "identity," in the context of two or more nucleic acids or polypeptide sequences, refer to two or more sequences or subsequences that are the same or have a specified percentage of amino acid residues or nucleotides that are the same (i.e., about 60% identity, preferably 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or higher identity over a specified region, when compared and aligned for maximum correspondence over a comparison window or designated region) as measured using a BLAST or BLAST 2.0 sequence comparison algorithm with default parameters described below, or by manual alignment and visual inspection (see, e.g., NCBI web site or the like). Such sequences are then said to be "substantially identical." This definition also refers to, or may be applied to, the complement of a test sequence. The definition also includes sequences that have deletions and/or additions, as well as those that have substitutions. As described below, the preferred algorithms can account for gaps and the like. Preferably, identity exists over a region that is at least about 25 amino acids or nucleotides in length, or more preferably over a region that is 50-100 amino acids or nucleotides in length.
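As an illustration of the percent identity measure defined above, the following minimal sketch computes identity over two sequences that have already been aligned. It assumes gaps are written as "-" and that the denominator is the number of alignment columns, which is only one of several conventions in use; the function name `percent_identity` is hypothetical and this is not the BLAST implementation.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences carry the same residue.

    Both inputs must be pre-aligned strings of equal length. Columns that
    are gaps in both sequences are ignored; a gap opposite a residue
    counts as a non-identical column.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (x, y) for x, y in zip(aligned_a, aligned_b)
        if not (x == gap and y == gap)
    ]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != gap)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))
    print(percent_identity("AC-T", "ACGT"))
```

Other conventions divide by the length of the shorter sequence or exclude gapped columns, so reported identities should state which denominator was used.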

[0035] For sequence comparison, typically one sequence acts as a reference sequence, to which test sequences are compared. When using a sequence comparison algorithm, test and reference sequences are entered into a computer, subsequence coordinates are designated, if necessary, and sequence algorithm program parameters are designated. Preferably, default program parameters can be used, or alternative parameters can be designated. The sequence comparison algorithm then calculates the percent sequence identities for the test sequences relative to the reference sequence, based on the program parameters.
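As a minimal illustration of the percent identity calculation described above, the following Python sketch scores a test sequence against a reference sequence. It assumes the two sequences have already been aligned to equal length with "-" marking gaps, and it divides by the number of aligned columns; the function name `percent_identity` and this choice of denominator are assumptions made for illustration and are not the BLAST implementation.

```python
def percent_identity(reference: str, test: str) -> float:
    """Percent of aligned columns in which both sequences carry the same residue.

    Assumes both inputs are already aligned to equal length, with '-' as the
    gap character. Columns that are gaps in both sequences are ignored.
    """
    if len(reference) != len(test):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (r, t)
        for r, t in zip(reference.upper(), test.upper())
        if not (r == "-" and t == "-")
    ]
    if not columns:
        return 0.0
    # A column counts as a match only if the residues agree and are not gaps.
    matches = sum(1 for r, t in columns if r == t and r != "-")
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))
```

Other conventions (for example, dividing by the length of the shorter sequence) give different percentages, so the convention used should be stated alongside any reported value.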

[0036] A "comparison window", as used herein, includes reference to a segment of any one of the number of contiguous positions selected from the group consisting of from 20 to 600, usually about 50 to about 200, more usually about 100 to about 150 in which a sequence may be compared to a reference sequence of the same number of contiguous positions after the two sequences are optimally aligned. Methods of alignment of sequences for comparison are well-known in the art. Optimal alignment of sequences for comparison can be conducted, e.g., by the local homology algorithm of Smith & Waterman, Adv. Appl. Math. 2:482 (1981), by the homology alignment algorithm of Needleman & Wunsch, J. Mol. Biol. 48:443 (1970), by the search for similarity method of Pearson & Lipman, Proc. Nat'l. Acad. Sci. USA 85:2444 (1988), by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Genetics Computer Group, 575 Science Dr., Madison, WI), or by manual alignment and visual inspection (see, e.g., Current Protocols in Molecular Biology (Ausubel et al., eds. 1995 supplement)).
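The local homology approach of Smith & Waterman referenced above can be sketched as a short dynamic program. The example below computes only the best local alignment score and uses a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) and the function name `smith_waterman_score` are illustrative assumptions, not parameters taken from this disclosure.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Recovering the aligned segments themselves would require keeping the full matrix and tracing back from the highest-scoring cell, which is omitted here for brevity.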

[0037] Preferred examples of algorithms that are suitable for determining percent sequence identity and sequence similarity are the BLAST and BLAST 2.0 algorithms, which are described in Altschul et al., Nuc. Acids Res. 25:3389-3402 (1997) and Altschul et al., J. Mol. Biol. 215:403-410 (1990), respectively. BLAST and BLAST 2.0 are used, with the parameters described herein, to determine percent sequence identity for the nucleic acids and proteins. Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information, as known in the art. This algorithm involves first identifying high scoring sequence pairs (HSPs) by identifying short words of length W in the query sequence, which either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold (Altschul et al., supra). These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Cumulative scores are calculated using, for nucleotide sequences, the parameters M (reward score for a pair of matching residues; always > 0) and N (penalty score for mismatching residues; always < 0). For amino acid sequences, a scoring matrix is used to calculate the cumulative score. Extension of the word hits in each direction is halted when: the cumulative alignment score falls off by the quantity X from its maximum achieved value; the cumulative score goes to zero or below, due to the accumulation of one or more negative-scoring residue alignments; or the end of either sequence is reached. The BLAST algorithm parameters W, T, and X determine the sensitivity and speed of the alignment. The BLASTN program (for nucleotide sequences) uses as defaults a wordlength (W) of 11, an expectation (E) of 10, M=5, N=-4 and a comparison of both strands. For amino acid sequences, the BLASTP program uses as defaults a wordlength of 3, an expectation (E) of 10, and the BLOSUM62 scoring matrix (see Henikoff & Henikoff, Proc. Natl. Acad. Sci. USA 89:10915 (1992)) alignments (B) of 50, expectation (E) of 10, M=5, N=-4, and a comparison of both strands.
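The three halting conditions for word-hit extension can be illustrated with a simplified, ungapped, rightward extension in Python. This is a sketch under stated assumptions and not the NCBI BLAST code: M=5 and N=-4 follow the nucleotide defaults named above, while the drop-off value X=20, the function name `extend_right`, and the restriction to one direction are choices made only for illustration.

```python
def extend_right(query: str, subject: str, q_pos: int, s_pos: int,
                 start_score: int, M: int = 5, N: int = -4, X: int = 20):
    """Extend a seed hit to the right without gaps.

    q_pos and s_pos are the first positions after the seed in each sequence,
    and start_score is the score of the seed itself. Returns a tuple
    (best_score, best_length), where best_length is the number of extended
    positions at which the maximum cumulative score was reached.
    """
    score = start_score
    best_score, best_len = start_score, 0
    length = 0
    # Halting condition 3: the end of either sequence is reached.
    while q_pos + length < len(query) and s_pos + length < len(subject):
        same = query[q_pos + length] == subject[s_pos + length]
        score += M if same else N
        length += 1
        if score > best_score:
            best_score, best_len = score, length
        # Halting condition 2: cumulative score goes to zero or below.
        # Halting condition 1: score falls off by X from its maximum.
        if score <= 0 or best_score - score >= X:
            break
    return best_score, best_len


if __name__ == "__main__":
    # Hypothetical sequences; the seed "ACGT" (4 matches, score 20) is at 0..3.
    print(extend_right("ACGTACGTAAAA", "ACGTACGTCCCC", 4, 4, 20, X=10))
```

A smaller X stops the extension sooner after a run of mismatches, which is faster but can truncate alignments that would have recovered; this is the sensitivity and speed trade-off attributed to W, T, and X above.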

[0038] The terms "polypeptide," "peptide" and "protein" are used interchangeably herein to refer to a polymer of amino acid residues. The terms apply to amino acid polymers in which one or more amino acid residues are an artificial chemical mimetic of a corresponding naturally occurring amino acid, as well as to naturally occurring amino acid polymers and non-naturally occurring amino acid polymers.

[0039] The term "amino acid" refers to naturally occurring and synthetic amino acids, as well as amino acid analogs and amino acid mimetics that function in a manner similar to the naturally occurring amino acids. Naturally occurring amino acids are those encoded by the genetic code, as well as those amino acids that are later modified, e.g., hydroxyproline, γ-carboxyglutamate, and O-phosphoserine. Amino acid analogs refers to compounds that have the same basic chemical structure as a naturally occurring amino acid, i.e., an α carbon that is bound to a hydrogen, a carboxyl group, an amino group, and an R group, e.g., homoserine, norleucine, methionine sulfoxide, methionine methyl sulfonium. Such analogs have modified R groups (e.g., norleucine) or modified peptide backbones, but retain the same basic chemical structure as a naturally occurring amino acid. Amino acid mimetics refers to chemical compounds that have a structure that is different from the general chemical structure of an amino acid, but that function in a manner similar to a naturally occurring amino acid.

[0040] Amino acids may be referred to herein by either their commonly known three letter symbols or by the one-letter symbols recommended by the IUPAC-IUB Biochemical Nomenclature Commission. Nucleotides, likewise, may be referred to by their commonly accepted single-letter codes.

[0041] "Conservatively modified variants" applies to both amino acid and nucleic acid sequences. With respect to particular nucleic acid sequences, conservatively modified variants refers to those nucleic acids which encode identical or essentially identical amino acid sequences, or where the nucleic acid does not encode an amino acid sequence, to essentially identical sequences. Because of the degeneracy of the genetic code, a large number of functionally identical nucleic acids encode any given protein. For instance, the codons GCA, GCC, GCG and GCU all encode the amino acid alanine. Thus, at every position where an alanine is specified by a codon, the codon can be altered to any of the corresponding codons described without altering the encoded polypeptide. Such nucleic acid variations are "silent variations," which are one species of conservatively modified variations. Every nucleic acid sequence herein which encodes a polypeptide also describes every possible silent variation of the nucleic acid. One of skill will recognize that each codon in a nucleic acid (except AUG, which is ordinarily the only codon for methionine, and TGG, which is ordinarily the only codon for tryptophan) can be modified to yield a functionally identical molecule. Accordingly, each silent variation of a nucleic acid which encodes a polypeptide is implicit in each described sequence with respect to the expression product, but not with respect to actual probe sequences.
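The notion of silent variations can be made concrete with the standard genetic code. The Python sketch below builds the standard codon table, lists the synonymous codons for a given codon, and counts how many coding sequences (including the original) encode the same polypeptide. The helper names `synonymous_codons` and `count_silent_variants`, and the short example sequence, are hypothetical and serve only to illustrate the definition.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, ordered by first, second, then third base over UCAG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def synonymous_codons(codon: str) -> list:
    """All codons (sorted) that encode the same amino acid as the given codon."""
    aa = CODON_TABLE[codon]
    return sorted(c for c, a in CODON_TABLE.items() if a == aa)


def count_silent_variants(coding_rna: str) -> int:
    """Number of coding sequences, including the input, with the same translation."""
    if len(coding_rna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    total = 1
    for i in range(0, len(coding_rna), 3):
        total *= len(synonymous_codons(coding_rna[i:i + 3]))
    return total


if __name__ == "__main__":
    print(synonymous_codons("GCA"))          # the four alanine codons
    print(count_silent_variants("AUGGCAUGG"))  # Met-Ala-Trp
```

Consistent with the paragraph above, AUG and UGG each have no synonymous alternative, so only the alanine codon contributes variation in the Met-Ala-Trp example.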

[0042] As to amino acid sequences, one of skill will recognize that individual substitutions, deletions or additions to a nucleic acid, peptide, polypeptide, or protein sequence which alters, adds or deletes a single amino acid or a small percentage of amino acids in the encoded sequence is a "conservatively modified variant" where the alteration results in the substitution of an amino acid with a chemically similar amino acid. Conservative substitution tables providing functionally similar amino acids are well known in the art. Such conservatively modified variants are in addition to and do not exclude polymorphic variants, interspecies homologs, and alleles.

[0043] The following eight groups each contain amino acids that are conservative substitutions for one another: 1) Alanine (A), Glycine (G); 2) Aspartic acid (D), Glutamic acid (E); 3) Asparagine (N), Glutamine (Q); 4) Arginine (R), Lysine (K); 5) Isoleucine (I), Leucine (L), Methionine (M), Valine (V); 6) Phenylalanine (F), Tyrosine (Y), Tryptophan (W); 7) Serine (S), Threonine (T); and 8) Cysteine (C), Methionine (M) (see, e.g., Creighton, Proteins (1984)).
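The eight groups listed above can be encoded directly as sets of one-letter codes, which gives a simple test for whether a given substitution is conservative under this definition. The function name `is_conservative` is an assumption for illustration; note that methionine (M) belongs to two groups, so the check looks for any shared group.

```python
# The eight conservative substitution groups, as one-letter amino acid codes.
CONSERVATIVE_GROUPS = [
    set("AG"),    # 1) Alanine, Glycine
    set("DE"),    # 2) Aspartic acid, Glutamic acid
    set("NQ"),    # 3) Asparagine, Glutamine
    set("RK"),    # 4) Arginine, Lysine
    set("ILMV"),  # 5) Isoleucine, Leucine, Methionine, Valine
    set("FYW"),   # 6) Phenylalanine, Tyrosine, Tryptophan
    set("ST"),    # 7) Serine, Threonine
    set("CM"),    # 8) Cysteine, Methionine
]


def is_conservative(original: str, replacement: str) -> bool:
    """True if the two different residues share at least one group."""
    a, b = original.upper(), replacement.upper()
    if a == b:
        return False  # identical residues are not a substitution
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)


if __name__ == "__main__":
    print(is_conservative("D", "E"))
    print(is_conservative("A", "W"))
```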

[0044] The "active-site" of a protein or polypeptide refers to a protein domain that is structurally, functionally, or both structurally and functionally, active. For example, the active-site of a protein can be a site that catalyzes an enzymatic reaction, i.e., a catalytically active site. A catalytically active site refers to a domain that includes amino acid residues involved in binding of a substrate for the purpose of facilitating the enzymatic reaction. Optionally, the term active site refers to a protein domain that binds to another agent, molecule or polypeptide. For example, the active sites of SENP1 include sites on SENP1 that bind to or interact with SUMO. A protein may have one or more active-sites.

[0045] Nucleic acid is "operably linked" when it is placed into a functional relationship with another nucleic acid sequence. For example, DNA for a presequence or secretory leader is operably linked to DNA for a polypeptide if it is expressed as a preprotein that participates in the secretion of the polypeptide; a promoter or enhancer is operably linked to a coding sequence if it affects the transcription of the sequence; or a ribosome binding site is operably linked to a coding sequence if it is positioned so as to facilitate translation. Generally, "operably linked" means that the DNA sequences being linked are near each other, and, in the case of a secretory leader, contiguous and in reading phase. However, enhancers do not have to be contiguous. Linking is accomplished by ligation at convenient restriction sites. If such sites do not exist, the synthetic oligonucleotide adaptors or linkers are used in accordance with conventional practice.

[0046] The term "gene" means the segment of DNA involved in producing a protein; it includes regions preceding and following the coding region (leader and trailer) as well as intervening sequences (introns) between individual coding segments (exons). The leader, the trailer as well as the introns include regulatory elements that are necessary during the transcription and the translation of a gene. Further, a "protein gene product" is a protein expressed from a particular gene.

[0047] The word "expression" or "expressed" as used herein in reference to a gene means the transcriptional and/or translational product of that gene. The level of expression of a DNA molecule in a cell may be determined on the basis of either the amount of corresponding mRNA that is present within the cell or the amount of protein encoded by that DNA produced by the cell. The level of expression of non-coding nucleic acid molecules (e.g., siRNA) may be detected by standard PCR or Northern blot methods well known in the art. See, Sambrook et al., 1989 Molecular Cloning: A Laboratory Manual, 18.1-18.88.

[0048] The term "recombinant" when used with reference, e.g., to a cell, or nucleic acid, protein, or vector, indicates that the cell, nucleic acid, protein or vector, has been modified by the introduction of a heterologous nucleic acid or protein or the alteration of a native nucleic acid or protein, or that the cell is derived from a cell so modified. Thus, for example, recombinant cells express genes that are not found within the native (non-recombinant) form of the cell or express native genes that are otherwise abnormally expressed, under expressed or not expressed at all. Transgenic cells and plants are those that express a heterologous gene or coding sequence, typically as a result of recombinant methods.

[0049] The term "exogenous" refers to a molecule or substance (e.g., a compound, nucleic acid or protein) that originates from outside a given cell or organism. For example, an "exogenous promoter" as referred to herein is a promoter that does not originate from the plant it is expressed by. Conversely, the term "endogenous" or "endogenous promoter" refers to a molecule or substance that is native to, or originates within, a given cell or organism.

[0050] As used herein, the term "about" means a range of values including the specified value, which a person of ordinary skill in the art would consider reasonably similar to the specified value. In embodiments, the term "about" means within a standard deviation using measurements generally acceptable in the art. In embodiments, about means a range extending to +/- 10% of the specified value. In embodiments, about means the specified value.

[0051] "Heterologous", when used with reference to portions of a protein, indicates that the protein comprises two or more domains that are not found in the same relationship (e.g., do not occur in the same polypeptide) to each other in nature. Such a protein, e.g., a fusion protein, contains two or more domains from unrelated proteins arranged to make a new functional protein. Similarly, when used in the context of two substances (e.g., nucleic acids, cells, proteins), the two substances are not found in the same relationship to each other in nature. As an example, a "cell expressing a heterologous protein" refers to a cell that expresses a protein that does not naturally occur in the cell.

[0052] "Domain" refers to a unit of a protein or protein complex, comprising a polypeptide subsequence, a complete polypeptide sequence, or a plurality of polypeptide sequences where that unit has a defined function.

[0053] For specific proteins described herein (e.g., Cas9, FokI, MmeI), the named protein includes any of the protein's naturally occurring forms, or variants that maintain the protein transcription factor activity (e.g., within at least 50%, 80%, 90%, 95%, 96%, 97%, 98%, 99% or 100% activity compared to the native protein). In some embodiments, variants have at least 90%, 95%, 96%, 97%, 98%, 99% or 100% amino acid sequence identity across the whole sequence or a portion of the sequence (e.g. a 50, 100, 150 or 200 continuous amino acid portion) compared to a naturally occurring form. In other embodiments, the protein is the protein as identified by its NCBI sequence reference. In other embodiments, the protein is the protein as identified by its NCBI sequence reference or functional fragment thereof.

[0054] The term "Cas9" as provided herein includes any of the CRISPR-associated protein 9 naturally occurring forms, homologs or variants that maintain the RNA-guided DNA nuclease activity (e.g., within at least 50%, 80%, 90%, 95%, 96%, 97%, 98%, 99% or 100% activity compared to the native protein). In some embodiments, variants have at least 90%, 95%, 96%, 97%, 98%, 99% or 100% amino acid sequence identity across the whole sequence or a portion of the sequence (e.g. a 50, 100, 150 or 200 continuous amino acid portion) compared to a naturally occurring form. In embodiments, the Cas9 protein is the protein as identified by the NCBI sequence reference GI:672234581. In embodiments, the Cas9 protein is the protein as identified by the NCBI sequence reference KJ796484 (GI:672234581) or functional fragment thereof. In embodiments, the Cas9 protein includes the sequence identified by the NCBI sequence reference GI:669193786. In embodiments, the Cas9 protein has the sequence of SEQ ID NO:1. In embodiments, the Cas9 protein is encoded by a nucleic acid sequence corresponding to Gene ID KJ796484 (GI:672234581).

[0055] The zinc finger motif will include a Cys2His2 motif (X2-C-X2-4-C-X12-H-X3,4,5-H, where X is any amino acid).
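The Cys2His2 motif given above, read as two residues, a cysteine, two to four residues, a cysteine, twelve residues, a histidine, three to five residues, and a histidine, translates naturally into a regular expression. The Python sketch below scans a protein sequence for that pattern; the function name `find_c2h2_motifs` and the example sequence are hypothetical and chosen only to show one matching arrangement.

```python
import re

# X2-C-X2-4-C-X12-H-X3,4,5-H, where X is any amino acid (one-letter codes).
C2H2_PATTERN = re.compile(r"[A-Z]{2}C[A-Z]{2,4}C[A-Z]{12}H[A-Z]{3,5}H")


def find_c2h2_motifs(protein: str) -> list:
    """Return non-overlapping substrings that match the Cys2His2 pattern."""
    return [m.group(0) for m in C2H2_PATTERN.finditer(protein.upper())]


if __name__ == "__main__":
    # Hypothetical sequence containing one finger-like segment.
    print(find_c2h2_motifs("MAYKCPECGKSFSQKSNLQKHQRTHTG"))
```

Because `finditer` reports non-overlapping matches, tandem fingers that share flanking residues may need a lookahead-based pattern instead; that refinement is left out of this sketch.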

DETAILED DESCRIPTION OF THE INVENTION

[0056] Provided herein are compositions and methods for barcoding mammalian cells. The compositions and methods provided herein further provide means for tracing such barcoded cells in vivo during the life time of an organism. For example, in the methods provided a fusion protein including a sequence-specific DNA-binding domain (e.g., a guide RNA or a TAL effector DNA binding domain) and a nucleic acid cleaving domain (e.g., a restriction enzyme) is targeted to a site in the cellular genome to insert a cleavage site in the genome. A DNA editing protein may then be targeted to said cleavage site to insert random nucleotides (barcode) at the site. The DNA editing enzyme could be endogenous or heterologous. When progeny cells are formed, the process of cleavage and random nucleotide insertion is repeated due to the constitutive or cell cycle-specific expression of the sequence-specific DNA-binding domain and nucleic acid cleaving domain. Every time a progeny cell is formed, additional random nucleotides are inserted at the original cleavage site thereby adding new nucleotides to the existing barcode. The newly formed barcode is longer than the original maternal barcode and is specific for each progeny cell. Since the barcode includes the nucleotides of the maternal barcode it can be used to trace back the maternal source of an individual cell thereby characterizing its ancestral lineage.
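The lineage-tracing logic described above can be modeled in a few lines of Python. The sketch makes several simplifying assumptions that are not part of the disclosure: new nucleotides are appended to the end of the maternal barcode, every division adds a fixed number of nucleotides (`insert_len`), and the function names `divide`, `is_ancestor`, and `lineage` are hypothetical. Under these assumptions a maternal barcode is always a proper prefix of each progeny barcode, which is what allows ancestry to be read back from a single sequenced cell.

```python
import random

NUCLEOTIDES = "ACGT"


def divide(barcode: str, rng: random.Random, insert_len: int = 4) -> tuple:
    """Return two daughter barcodes: the maternal barcode plus new random bases."""
    def extended() -> str:
        return barcode + "".join(rng.choice(NUCLEOTIDES) for _ in range(insert_len))
    return extended(), extended()


def is_ancestor(candidate: str, barcode: str) -> bool:
    """True if candidate is a proper prefix of barcode (an earlier generation)."""
    return len(candidate) < len(barcode) and barcode.startswith(candidate)


def lineage(barcode: str, founder_len: int, insert_len: int = 4) -> list:
    """Ancestral barcodes of a cell, ordered from the founder to its parent."""
    return [barcode[:i] for i in range(founder_len, len(barcode), insert_len)]


if __name__ == "__main__":
    rng = random.Random(0)            # seeded only for reproducibility
    founder = "ACGT"                  # hypothetical founder barcode
    daughter, _ = divide(founder, rng)
    granddaughter, _ = divide(daughter, rng)
    print(lineage(granddaughter, founder_len=len(founder)))
```

With short insertions two sister cells can receive the same random bases by chance, so a practical design would favor insertions long enough to make such collisions unlikely; variable-length insertions would additionally require alignment rather than fixed-width slicing.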

A. Cleaving protein complex

[0057] The cleaving protein complex provided herein is a heterologous protein complex including a sequence-specific DNA-binding domain and a nucleic acid cleaving domain. The cleaving protein complex may be a fusion protein where the sequence-specific DNA-binding domain and the nucleic acid cleaving domain are directly joined at their amino- or carboxy-terminus via a peptide bond. Alternatively, an amino acid linker sequence may be employed to separate the sequence-specific DNA-binding domain and nucleic acid cleaving domain polypeptide components by a distance sufficient to ensure that each polypeptide folds into its secondary and tertiary structures. Such an amino acid linker sequence is incorporated into the fusion protein using standard techniques well known in the art. Suitable peptide linker sequences may be chosen based on the following factors: (1) their ability to adopt a flexible extended conformation; (2) their inability to adopt a secondary structure that could interact with the first and second polypeptides; and (3) the lack of hydrophobic or charged residues that might react with the first and second polypeptides. Typical peptide linker sequences contain Gly, Ser, Val and Thr residues. Other near neutral amino acids, such as Ala can also be used in the linker sequence. Amino acid sequences which may be usefully employed as linkers include those disclosed in Maratea et al. (1985) Gene 40:39-46; Murphy et al. (1986) Proc. Natl. Acad. Sci. USA 83:8258-8262; U.S. Pat. Nos. 4,935,233 and 4,751,180, each of which is hereby incorporated by reference in its entirety for all purposes and in particular for all teachings related to linkers. The linker sequence may generally be from 1 to about 50 amino acids in length, e.g., 3, 4, 6, or 10 amino acids in length, but can be 100 or 200 amino acids in length. Linker sequences may not be required when the first and second polypeptides have non-essential N-terminal amino acid regions that can be used to separate the functional domains and prevent steric interference. In some embodiments, linker sequences of use in the present invention comprise an amino acid sequence according to (GGGGS)n. In embodiments, linker sequences of use in the present invention include a protein encoded by the nucleotide sequence of SEQ ID NO:4. In embodiments, linker sequences of use in the present invention include a protein having the sequence of SEQ ID NO:5.

[0058] Other chemical linkers include carbohydrate linkers, lipid linkers, fatty acid linkers, polyether linkers, e.g., PEG, etc. For example, poly(ethylene glycol) linkers are available from Shearwater Polymers, Inc. Huntsville, Ala. These linkers optionally have amide linkages, sulfhydryl linkages, or heterobifunctional linkages.

[0059] Other methods of joining two heterologous domains include ionic binding by expressing negative and positive tails and indirect binding through antibodies and streptavidin-biotin interactions. See, e.g., Bioconjugate Techniques, Hermanson, Ed., Academic Press (1996).

[0060] Nucleic acids encoding the polypeptide fusions can be obtained using routine techniques in the field of recombinant genetics. Basic texts disclosing the general methods of use in this invention include Sambrook and Russell, Molecular Cloning, A Laboratory Manual (3rd ed. 2001); Kriegler, Gene Transfer and Expression: A Laboratory Manual (1990); and Current Protocols in Molecular Biology (Ausubel et al., eds., 1994-1999). Such nucleic acids may also be obtained through in vitro amplification methods such as those described herein and in Berger, Sambrook, and Ausubel, as well as Mullis et al. (1987) U.S. Pat. No. 4,683,202; PCR Protocols A Guide to Methods and Applications (Innis et al., eds) Academic Press Inc. San Diego, Calif. (1990) (Innis); Arnheim & Levinson (Oct. 1, 1990) C&EN 36-47; The Journal Of NIH Research (1991) 3:81-94; Kwoh et al. (1989) Proc. Natl. Acad. Sci. USA 86:1173; Guatelli et al. (1990) Proc. Natl. Acad. Sci. USA 87:1874; Lomeli et al. (1989) J. Clin. Chem. 35:1826; Landegren et al. (1988) Science 241:1077-1080; Van Brunt (1990) Biotechnology 8:291-294; Wu and Wallace (1989) Genomics 4:560; and Barringer et al. (1990) Gene 89:117, each of which is incorporated by reference in its entirety for all purposes and in particular for all teachings related to amplification methods.

[0061] Alternatively, the sequence-specific DNA-binding domain and the nucleic acid cleaving domain are expressed as individual proteins encoded by separate nucleic acids and the cleaving protein complex is formed through protein-protein interaction.

[0062] The term "nucleic acid cleaving domain" as provided herein refers to a restriction enzyme or nuclease or functional fragment thereof. The terms "restriction enzyme" or "nuclease" have the same ordinary meaning in the art and can be used interchangeably throughout. A nuclease is an enzyme capable of cleaving the phosphodiester bonds between the nucleotide subunits of nucleic acids. Nucleases are usually further divided into endonucleases and exonucleases, although some of the enzymes may fall in both categories. Non-limiting examples of nucleases are deoxyribonuclease and ribonuclease. In embodiments, the nucleic acid cleaving domain includes or is a Cas9 domain or functional portion thereof. In embodiments, the nucleic acid cleaving domain includes or is a restriction enzyme (e.g., MmeI, FokI) or functional portion thereof. Where the nucleic acid cleaving domain includes a restriction enzyme, the nucleic acid cleaving domain may be a restriction enzyme dimer, wherein two restriction enzymes or functional portions thereof are connected through a single-chain linker. In embodiments, the single-chain linker is encoded by a nucleic acid of SEQ ID NO:6. In embodiments, the single-chain linker has the sequence of SEQ ID NO:7.

[0063] The sequence-specific DNA-binding domain as provided herein may include a polypeptide or nucleic acid capable of binding a genomic nucleic acid sequence. Where the DNA-binding domain includes or is a nucleic acid, the nucleic acid may be an RNA molecule capable of hybridizing to the genomic nucleic acid sequence. The RNA molecule may be a guide RNA and the genomic nucleic acid sequence may form part of the gene encoding said guide RNA (guide RNA encoding sequence). Therefore, in embodiments, the guide RNA provided herein binds to a part, or entirety of its own gene. In embodiments, the guide RNA includes a nucleic acid cleaving domain recognition site. The term "nucleic acid cleaving domain recognition site" refers to a nucleotide sequence, which forms part of the guide RNA and which is recognized by a nucleic acid cleaving domain (e.g., a nuclease). Where the DNA-binding domain includes a polypeptide, the DNA-binding domain may be a TAL (transcription activator-like) effector DNA binding domain or a zinc finger domain.

B. Recombinant DNA editing proteins

[0064] As described above, the cleaving protein complex as provided herein is targeted to a genomic nucleic acid sequence by sequence-specific DNA binding and inserts a cleavage site at the binding site or in close vicinity thereto. Random nucleotides may be subsequently inserted at the cleavage site by further targeting a DNA editing protein to the cleavage site. A DNA editing protein as provided herein is a polypeptide including a terminal deoxynucleotidyl transferase (TdT) activity. A "terminal deoxynucleotidyl transferase" refers to a specialized DNA polymerase, which catalyzes the addition of nucleotides to the 3' terminus of a DNA molecule. Unlike most DNA polymerases, it does not require a template. The preferred substrate of terminal deoxynucleotidyl transferase is a 3'-overhang, but it can also add nucleotides to blunt or recessed 3' ends. In embodiments, the terminal deoxynucleotidyl transferase is the protein as identified by the NCBI sequence reference NM_004088.3. In embodiments, the DNA editing protein is an endogenous DNA editing protein. Where the DNA editing protein is an endogenous DNA editing protein, the DNA editing protein is native to, or originates within, a given cell or organism. In embodiments, the DNA editing protein is a recombinant DNA editing protein. The DNA editing protein as provided herein may include a sequence-specific DNA binding domain and a DNA transferase domain. Where the DNA editing protein includes a sequence-specific DNA binding domain and a DNA transferase domain, the DNA editing protein may be a heterologous protein. The DNA transferase domain may include a terminal deoxynucleotidyl transferase or functional fragment thereof. In embodiments, the DNA transferase domain is a terminal deoxynucleotidyl transferase or functional fragment thereof. The sequence-specific DNA binding domain may be as described above, for example an RNA molecule (e.g., a guide RNA), a TAL (transcription activator-like) effector DNA binding domain or a zinc finger domain.

[0065] To provide for regulated expression and activity of the protein cleaving complex and the recombinant DNA editing proteins during cell division, they may be operably linked to a cell-cycle regulated domain. A cell-cycle regulated domain may be a peptide that is proteolytically cleaved in a cell-cycle dependent manner to ensure the timely accumulation during the appropriate phase of the cell cycle. Alternatively, the cell-cycle regulated domain is a nucleotide sequence which controls the transcription or RNA turnover of the polynucleotide it is operably linked to. Coupling the protein cleaving complex and the recombinant DNA editing proteins provided herein to cell-cycle regulatory elements provides that barcodes will be added in a temporal manner during cell division. In embodiments, the cell-cycle regulatory element is operably linked to the N-terminal end of the sequence-specific DNA binding domain.

C. Fusion proteins

[0066] As described above, the sequence-specific DNA binding domain and the nucleic acid cleaving domain forming the cleaving protein complex may be separately expressed or may form part of a fusion protein. Similarly, the sequence-specific DNA binding domain and the DNA transferase domain forming the DNA editing protein may be separately expressed or may form part of a fusion protein. In embodiments, the fusion protein includes a TAL effector DNA binding domain operably linked to a nucleic acid cleaving domain (e.g., two FokI domains separated by a single chain linker). In further embodiments, the N-terminal end of the TAL effector DNA binding domain is operably linked to a cell-cycle regulated domain and the C-terminal end of the TAL effector DNA binding domain is connected through an extension peptide to the nucleic acid cleaving domain.

[0067] In embodiments, the fusion protein includes a TAL effector DNA binding domain operably linked to a DNA transferase domain. In further embodiments, the N-terminal end of the TAL effector DNA binding domain is operably linked to a cell-cycle regulated domain and the C-terminal end of the TAL effector DNA binding domain is connected through an extension peptide to the DNA transferase domain. In embodiments, the fusion protein includes a zinc finger binding domain operably linked to a DNA transferase domain. The fusion protein provided herein may further include a non-specific DNAse domain connecting the DNA binding domain with the DNA transferase domain. In embodiments, the non-specific DNAse domain is a dimer. Alternatively, the cleaving protein complex and the recombinant DNA editing protein may form a fusion protein. Thus, in embodiments, a fusion protein is formed that includes a Cas9 protein and a terminal deoxynucleotidyl transferase, wherein the Cas9 protein is bound to a guide RNA.

D. Methods of barcoding a cell

[0068] The compositions and methods provided may be used for barcodmg mammalian cells. The compositions and methods provided herein further provide means for tracing such barcoded cells in vivo during the life time of an organism or in vitro in a cell (e.g., cell in a cell culture). For example, in the methods provided a fusion protein including a sequence- specific DNA-binding domain (e.g., a guide RNA or a TAL effector DNA binding domain) and a nucleic acid cleaving domain (e.g., a restriction enzyme) is targeted to a site in the cellular genome to insert a cleavage site in the genome. A DNA editing protein may then be targeted to said cleavage site to insert random nucleotides (barcode) at the site. The DNA editing enzyme could be endogenous or heterologous. When progeny cells are formed, the process of cleavage and random nucleotide insertion is repeated due to the constitutive or cell cycle-specific expression of the sequence-specific DNA-binding domain and nucleic acid cleaving domain. Every time a progeny cell is formed, additional random nucleotides are inserted at the original cleavage site thereby adding new nucleotides to the existing barcode. The newly formed barcode is longer than the original maternal barcode and is specific for each progeny cell. Using sequencing methodologies well known in the art (e.g., deep sequencing) the barcode sequence of each cell can be identified and its maternal origin determined. Further, applying deconvolution methodology well known in the art and referred to herein, the maternal source of an individual cell can be traced back thereby characterizing its ancestral lineage. References disclosing the general methods of deconvolution include Vogt W. et al. Gastrulation und Mesodermbildung hei Urodelen nd An ren. II. Teil. W. R.oux Arch Entwicklungsmech Org 120384-706. Keller RE (1986) Developmental Biology; 1929; Sulston JE et al. 
The embryonic cell lineage of the nematode Caenorhabditis elegans Developmental Biology 1983 Nov;100(1):64-119; Livet J et al. Transgenic strategies for combinatorial expression of fluorescent proteins in the nervous system Nature. 2007; Snippert HJ et al. Intestinal Crypt Homeostasis Results from Neutral Competition between Symmetrically Dividing Lgr5 Stem Cells Cell 2010 Oct;143(1):134-44; Mino T et al. Efficient double-stranded DNA cleavage by artificial zinc-finger nucleases composed of one zinc finger protein and a single-chain FokI dimer Journal of Biotechnology 2009 Mar;140(3-4):156-61; Sakaue-Sawano A et al. Visualizing Spatiotemporal Dynamics of Multicellular Cell-Cycle Progression Cell 2008 Feb;132(3):487-98; Ke R et al. In situ sequencing for RNA analysis in preserved tissue and cells Nature Methods 2013 Sep;10(9):857-60; Batzer MA et al. Amplification dynamics of human-specific (HS) Alu family members Nucleic Acids Res. Oxford University Press; 1991 Jul 11;19(13):3619-23; Ohtsuka E et al. An alternative approach to deoxyoligonucleotides as hybridization probes by insertion of deoxyinosine at ambiguous codon positions Journal of Biological Chemistry. American Society for Biochemistry and Molecular Biology; 1985 Mar 10;260(5):2605-8; Rossolini GM et al. Use of deoxyinosine-containing primers vs degenerate primers for polymerase chain reaction based on ambiguous sequence information Molecular and Cellular Probes 1994 Apr;8(2):91-8; Maratea D et al. Deletion and fusion analysis of the phage φX174 lysis gene E. Gene 1985 Jan;40(1):39-46; Murphy JR et al. Genetic construction, expression, and melanoma-selective cytotoxicity of a diphtheria toxin-related alpha-melanocyte-stimulating hormone fusion protein Proc Natl Acad Sci USA. National Acad Sciences; 1986 Nov;83(21):8258-62; Kwoh DY et al. Transcription-based amplification system and detection of amplified human immunodeficiency virus type 1 with a bead-based sandwich hybridization format Proc Natl Acad Sci USA. National Acad Sciences; 1989 Feb;86(4):1173-7; Guatelli JC et al. Isothermal, in vitro amplification of nucleic acids by a multienzyme reaction modeled after retroviral replication Proc Natl Acad Sci USA. National Acad Sciences; 1990 Mar;87(5):1874-8; Lomeli H et al. Quantitative assays based on the use of replicatable hybridization probes Clinical Chemistry. American Association for Clinical Chemistry; 1989 Sep;35(9):1826-31; Landegren U et al. A ligase-mediated gene detection technique Science. American Association for the Advancement of Science; 1988 Aug 26;241(4869):1077-80; Wu DY et al. The ligation amplification reaction (LAR): amplification of specific DNA sequences using sequential rounds of template-dependent ligation. Genomics 1989 May;4(4):560-9; Barringer KJ et al. Blunt-end and single-strand ligations by Escherichia coli ligase: influence on an in vitro amplification scheme Gene. 1990 Apr;89(1):117-22; Jimenez JI et al. Comprehensive experimental fitness landscape and evolutionary network for small RNA Proc Natl Acad Sci USA. National Acad Sciences; 2013 Sep 10;110(37):14984-9; Schloss PD et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities Appl Environ Microbiol. American Society for Microbiology; 2009 Dec;75(23):7537-41; Li W et al. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences Bioinformatics 2006; each of which is incorporated by reference in its entirety for all purposes and in particular for all teachings related to amplification methods.

[0069] The methods of barcoding a cell provided herein, including embodiments thereof, may further include a step of ligating the ends of the double-stranded cleavage site. The ligation enzymes used for this ligation step may be endogenous DNA ligation enzymes (e.g., a ligase that naturally occurs in the cell being barcoded). In embodiments, the ligation enzyme is a heterologous DNA ligation complex. A heterologous DNA ligation complex as provided herein includes a sequence-specific DNA-binding domain and a nucleic acid ligation domain. In further embodiments, the heterologous DNA ligation complex includes a DNA editing domain. A DNA editing domain as provided herein includes a protein having terminal deoxynucleotidyl transferase (TdT) activity. Thus, in embodiments, the method further includes, after step (iii) of inserting random nucleotides, a step (iii.i) of ligating the ends of the double-stranded cleavage site. In embodiments, the ligating is achieved by contacting the double-stranded cleavage site with an endogenous DNA ligase. In embodiments, the ligating is achieved by contacting the double-stranded cleavage site with a heterologous DNA ligation complex. In embodiments, the heterologous DNA ligation complex includes a sequence-specific DNA-binding domain and a nucleic acid ligation domain.

[0070] It is understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application and scope of the appended claims. All publications, patents, and patent applications cited herein are hereby incorporated by reference in their entirety for all purposes.

EXAMPLES

Example 1

[0071] Cas9-based systems potentially represent a significant advance. The prokaryotic CRISPR adaptive immune system has led to the development of custom nucleases whose sequence specificity can be programmed by small RNAs. CRISPR loci are composed of an array of repeats, each separated by 'spacer' sequences that match the genomes of bacteriophages and other mobile genetic elements. This array is transcribed as a long precursor and processed within the repeat sequences to generate small CRISPR RNA (crRNA) that specifies the target double-stranded DNA (dsDNA) to be cleaved. An essential feature is the protospacer-adjacent motif (PAM) that is required for efficient target cleavage (FIG. 1). Cas9 is a dsDNA endonuclease that uses the crRNA as a guide to specify the cleavage site. To change the target, one only needs to alter the small guiding RNA sequence, a key advantage over TALENs, ZFNs, and meganucleases. For this reason, Applicants' main approach is to develop the Cas9 system for efficient high-throughput gene targeting.

[0072] A new approach is provided for tracing the evolutionary history of cells at the most granular level possible, that of individual cells. Applicants take advantage of new technologies (deep sequencing and TALENs), combining them to create a single-cell lineage tracer in which each cell contains a unique barcode. This system comprises a synthetic "TYPER" genetic circuit, which can be introduced into cells via homologous recombination or, more conveniently, via a retrovirus. Once the circuit is created, Applicants' vision is to introduce it into fertilized zygotes, from which mouse lines will be developed. In essence, every cell in a TYPER mouse will contain a unique barcode, and each barcode will contain information on its previous lineage, starting with the fertilized zygote. This technology, the Reconstruction of Ancestral Cells by Enzymatic Recording (tRACER), is accomplished using two custom enzymes that Applicants have built and are currently optimizing for the digital tracing of cell lineages.

[0073] Applicants' first goal is to tangibly realize the concept described in FIG. 4. The foundation of this concept is the development of two distinct enzymes: a modified TALEN and a novel 'TYPER'. Applicants have recently built these two enzymes and are currently characterizing their activity in vitro and in vivo.

[0074] Modified TALENs. Transcription activator-like effector nucleases (TALENs) are essentially artificial restriction enzymes generated by fusing a TAL effector DNA binding domain to a DNA cleavage domain. A simple code between amino acid sequences in the TAL effector DNA binding domain and the DNA recognition site allows for protein engineering applications. This code has been used to design a number of specific DNA binding protein fusions.

[0075] TALENs are typically used in pairs, where each TALEN cleaves only a single strand. In genome engineering applications, TALEN binding sites are designed juxtaposed and proximal, producing double-stranded DNA (dsDNA) cleavage. Notably, this offers a higher level of specificity, requiring a collectively longer recognition site. Most importantly, each TALEN is composed of a TAL effector DNA binding domain linked to the FokI restriction enzyme, and the FokI enzyme requires dimerization to produce a dsDNA cleavage.

[0076] Applicants have recently synthesized novel TALENs designed to cleave both strands. These unique nucleases are composed of the traditional TAL effector DNA binding domain fused to a single nuclease domain that nicks one DNA strand. However, Applicants have engineered the FokI enzyme as a dimer using a flexible single-chain linker, allowing it to cleave dsDNA. Synthetic FokI dimers based on zinc finger DNA binding domains (i.e., not TAL effectors) have been created and show robust activity in vitro (FIG. 5). Applicants have created 1) a TAL effector fused to a single-chain FokI, and 2) a TAL effector fused to a single-chain MmeI (FIG. 6). The main difference between these TALENs is the overhang that is produced: FokI produces a four-nt 5'-overhang and MmeI produces a two-nt 3'-overhang. Applicants' goal is to test and optimize several restriction enzymes when coupled to TAL effector DNA binding domains. Only one enzyme will be needed for the tRACER platform. The ideal enzyme will exhibit maximal activity and specificity on its DNA target site, allowing for robust enzymatic machinations with a novel 'TYPER' enzyme that Applicants describe below.

FIG. 5. Single-chain FokI can efficiently cleave DNA. (left) Schematic representation of AZP-scFokI. (right) In vitro activity of an AZP-scFokI variant containing a flexible (GGGGS)12 linker; lane 1: control DNA substrate; lane 2: incubation with AZP-scFokI. Site-specific cleavage by AZP-scFokI produces 0.9- and 2-kbp DNA fragments (indicated as P1 and P2, respectively). S: a plasmid substrate. Adapted from Mino et al.

[0077] A novel TYPER enzyme. Applicants have constructed a unique enzyme fusion between a TAL effector DNA binding domain and a terminal deoxynucleotidyl transferase (TdT) (FIG. 6). TdT is a nuclear enzyme responsible for the non-templated addition of nucleotides at gene segment junctions of developing lymphocytes4. For B cells and T cells, TdT is a key component of their development, participating in somatic recombination of variable gene segments. Regulated rearrangement of lymphocyte receptor gene segments through recombination expands the diversity of antigen-specific receptors. TdT binds to specific DNA sites, adding non-templated A, T, G, and C nucleotides to the 3'-end of the DNA cleavage site, and is critical for antigen-specific receptor diversity. The ability of TdT to randomly incorporate nucleotides greatly aids in the generation of the ~10^14 different immunoglobulins and ~10^18 unique T cell antigen receptors.

[0078] TdT is perhaps the most enigmatic of DNA polymerases, as it bends many of the general rules: not only does it not require a template strand, it does not appear to be processive. Activity at VDJ junctions is limited, typically adding 4-6 nucleotides in a highly regulated process; however, overexpression in non-lymphoid cell lines can yield large insertions (>100 nt) 5, and the recombinant TdT enzyme can robustly add thousands of nucleotides under unregulated conditions. In non-optimized limited cleavage assays, Applicants have found that it readily adds up to 4-8 residues to Cas9-induced breakpoints (FIG. 7) and hypothesize that it may help 'lock in' Cas9 dsDNA cleavage. A different number of nucleotides may be added when TdT is 'tethered' near a DNA 3'-end using a TAL effector DNA binding domain. Applicants hypothesize that the length of the linker may limit the number of nucleotides added; if so, Applicants will modify the linker domain as needed to change barcode length.

[0079] Cell cycle regulation. One aspect of the tRACER system is that it is active during cell division, such that barcodes will be added in a temporal manner. This is not an essential feature of the tRACER technology but may desirably restrict tRACER activity. The cell cycle is a carefully regulated process that ensures DNA replication occurs only once per cycle. In higher eukaryotes such as humans, proteolysis and Geminin (hGem)-mediated inhibition of the licensing factor hCdt1 are essential for preventing DNA re-replication. Due to cell cycle-dependent proteolysis, protein levels of hGem and hCdt1 oscillate inversely, with hCdt1 levels being high during G1, while hGem levels are highest during the S, G2, and M phases. Their regulation is governed by proteolytic rather than transcriptional controls or RNA turnover to ensure timely accumulation during the appropriate phase. Consistent with this mode of regulation, hGem and hCdt1 peptides can be added onto proteins to regulate their expression in a robust cell cycle-dependent manner. This strategy has been very successful for developing fluorescent markers that definitively illuminate cell cycle progression. To accomplish this, Applicants will conjugate hGem peptide sequences onto both the TYPER and TALEN enzymes to pulse-restrict their expression during the cell cycle. If further restriction is needed, Applicants may be able to harness other cell cycle regulatory elements, such as APC Cdc20 regulation, which is active during M-phase. The general concept is to trigger tRACER TALEN cleavage and TYPER activity only when cells divide. In some embodiments, one can employ cell cycle proteolytic regulation. Optionally, one may also test cell cycle-dependent transcriptional activation/repression or RNA turnover. If needed, these regulatory processes may be combined to achieve finer restriction of tRACER activity. In some embodiments, an inducible tRACER apparatus could be immensely valuable in pulse-type experiments. This could be made possible by coupling the enzymes to ERT2 or possibly by placing them under optogenetic regulation.

[0080] As a general concept, it is worth noting that regulated cycles of nucleic acid cleavage, terminal transferase, and ligation occur in different cell types among different species, including the evolutionarily ancient Trypanosomes (FIG. 9). Another striking example (not depicted here) of regulated retention of DNA 'barcodes' at a specific locus is the prokaryotic CRISPR array that provides phage immunity and a long history (many years) of each species subtype.

[0081] Bioinformatic considerations. Although Applicants retain flexibility for barcode length, some practical aspects should be considered when optimizing for enzyme activity. A first consideration is that extremely short barcodes may limit the number of cell types that can be analyzed in parallel. However, one must consider that if one begins the tRACE with a small number of cells, the second barcode adds to the complexity and allows deconvolution using traditional cladistic analysis (via Bayesian inference of phylogeny). Bayesian inference of phylogeny is based upon the posterior probability distribution of fate map trees, which is the probability of a given phylogenetic tree conditioned on a deep sequencing dataset. Because the posterior probability distribution of trees is impossible to calculate analytically, Markov chain Monte Carlo simulation may be used to approximate the posterior probabilities of trees.

[0082] Applicants expect that phylogenetic nonconformities and interesting mapping patterns may result from biological origins, including asymmetric cell division and limited barcoding activity occurring outside the context of cell division. Similarly, Applicants expect nonconformities that result from technical origins such as barcode loss or mutation during the experiment and sample preparation. Notably, Applicants do not necessarily need to capture 100% of barcoded cells to reconstruct the cell division tree and assemble testable fate map models. In fact, the resolution depends on the number of cells and the complexity of the trees; a <1% capture rate may be sufficient in many applications, and even less when large numbers of cells are examined.

[0083] In some embodiments, one can optimize the lengths of the barcodes. While minimal lengths are technically desirable, one should ensure that the information content is long enough to uniquely map to a specific cell. In determining the minimal barcode length, a relevant consideration is the number of cells present at the outset of the experiment. Here Applicants define n as the starting number of unique barcoded cells. Because the barcode history contributes to the growing complexity, in theory a single nucleotide added at each cell doubling would be wholly sufficient, provided one starts from a single cell (FIG. 10). However, in practice, limited exonucleolytic trimming during DNA repair would complicate the results. Hence, one goal can be to optimize barcode lengths to between 15 and 20 bp, which gives some buffer for potential trimming and allows one to initiate experiments with extremely large numbers of cells. Limited exonucleolytic trimming of the barcode will simply generate additional uniqueness and should not negatively affect data interpretation.
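The relationship between barcode length and the number of cells that can be told apart can be sketched numerically. The following is a minimal illustration only, assuming each inserted nucleotide is drawn uniformly and independently from A, C, G, and T (an idealization that a real TdT-mediated insertion may not satisfy); the function names are hypothetical and are not part of the described system.

```python
import math


def min_barcode_length(n_cells: int, alphabet: int = 4) -> int:
    """Smallest length L for which alphabet**L >= n_cells, i.e. the shortest
    barcode that could in principle give every starting cell a distinct label."""
    length = 0
    capacity = 1
    while capacity < n_cells:
        capacity *= alphabet
        length += 1
    return length


def collision_probability(n_cells: int, length: int, alphabet: int = 4) -> float:
    """Birthday-problem approximation of the chance that at least two of
    n_cells independently drawn random barcodes of this length coincide."""
    space = alphabet ** length
    return 1.0 - math.exp(-n_cells * (n_cells - 1) / (2.0 * space))


if __name__ == "__main__":
    # Shortest barcode able to label one million starting cells.
    print(min_barcode_length(1_000_000))
    # Chance of any duplicate among 1,000 starting cells at 15 and 20 nt.
    print(collision_probability(1_000, 15))
    print(collision_probability(1_000, 20))
```

Under these assumptions, the collision probability falls rapidly with each added nucleotide, which is consistent with choosing a barcode somewhat longer than the theoretical minimum to leave room for trimming.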

[0084] Statistical considerations. In some embodiments, one can use the Illumina HiSeq 2500, a platform having two general considerations: read length and number of reads. The maximal confidence read length is approximately 200 nt (2 x 100 bp); hence the combinations of barcodes and their lengths cannot exceed what can be physically read by Illumina sequencing. Depending on barcode length, 200 nt can accommodate 10-50 cell doublings. The Illumina platform has a high output (nearly 3 billion reads per full run), which is sufficient for focused experiments but would be no match for the trillions of reads needed to deconvolute an entire mouse, particularly given the need for read redundancy. With these limitations, it can be assumed that tRACER could fate map at least approximately 10^7 cells in a single Illumina run, assuming 300-fold sequence coverage.
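The sequencing budget described above reduces to two integer divisions. The sketch below simply restates that arithmetic using the figures given in the text (about 200 nt of readable length, nearly 3 billion reads per run, 300-fold coverage); the per-division barcode lengths of 4 and 20 nt are assumed values chosen only because they reproduce the stated 10-50 doubling range, and the function names are hypothetical.

```python
def max_doublings(read_length_nt: int, nt_per_division: int) -> int:
    """Number of successive barcode blocks that fit within one read."""
    return read_length_nt // nt_per_division


def cells_per_run(reads_per_run: int, coverage: int) -> int:
    """Number of cells that can be sequenced at the requested redundancy."""
    return reads_per_run // coverage


if __name__ == "__main__":
    print(max_doublings(200, 20))             # 10 doublings at 20 nt per division
    print(max_doublings(200, 4))              # 50 doublings at 4 nt per division
    print(cells_per_run(3_000_000_000, 300))  # 10000000 cells per run
```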

[0085] Another consideration is that many parallel internal tRACER 'biological replicates' can be obtained in some experimental settings. For example, introducing the construct into mouse ES cells and letting them divide several times in culture will establish 'pre-barcoded' cells. Co-injecting 10-12 pre-barcoded tRACER ES cells into a single blastocyst might act as internal replicates, with the potential caveat that some cells may not fully contribute to all lineages. Given the numbers of cells present at gastrulation and shortly thereafter, tRACER is ideal for mapping early and portions of mid-stage mouse embryos.

[0086] Tracing space and time. With any DNA modification system, a potential caveat is whether the expression of DNA modifying enzymes would promote tumorigenesis when present in the animal. This has not been observed with TALEN or CRISPR systems but remains a formal possibility. If tumors do appear, their tRACER phylogenetic analysis could prove very interesting in its own right. In fact, the contribution of stem cells to cancer remains a matter of debate. It is unknown whether cancer stem cells are the origin of all malignant cells in the body, and whether they are responsible for the existence of drug-resistant and metastatic cancer cells. tRACER offers a unique opportunity to definitively mark the cell-of-origin for any cancer type.

[0087] Once tRACER is optimized, Applicants' goal is to integrate spatial and cell-type information. tRACER barcodes do not identify specific cell types but instead generate testable models for uncovering new or pathologically diverged lineages in an ultra-high-throughput fashion. However, there are a number of already-developed downstream technologies that allow both spatial and cell-type information to be integrated with tRACER. In some embodiments, one can evaluate whether laser capture of tRACER barcodes from immunohistochemically stained embryonic pancreatic islet cells can inform maps of cell fate and origin. Such a focused approach will provide both barcode identification and confirmation of specific cell types and their lineages. Second, multiplex FISH will allow probing tissue sections with LNAs against the barcodes. This would allow large numbers of barcodes to be probed simultaneously (using quantum dot or other markers), perhaps in three-dimensional space using whole embryos or whole-mount tissues. Third, an in situ tissue deep sequencing method was recently developed, paving the way for tRACEing hundreds of thousands to millions of immunohistochemically stained cells (FIG. 11, left panel).

[0088] Another goal is to integrate tRACER with a novel ultrahigh-throughput platform that combines droplet-based microfluidic techniques and PCR to define cell types (FIG. 11, right panel). Applicants' goal is to sort individual cells based on their tRACER barcode and generate RNA-seq libraries. These single-cell RNA-seq libraries can be barcoded and pooled to analyze true single-cell gene expression for large numbers of cell types. These systems will give Applicants an unprecedented view of gene expression, digitizing cell identity over developmental space and time.

[0089] The adult human body is composed of trillions of cells that all originated from a single fertilized egg cell. In the adult, most tissues are in a state of constant flux, where old cells die and new cells are created from resident populations of stem cells. Diseases such as cancer emerge when cells lose their directions and divide in an uncontrolled manner, losing their identities. Other diseases are hallmarked by a loss of cells, triggered by unwanted self-elimination such as apoptosis or autoimmunity. The fluidity of cell populations persists from the moment a being is conceived until the being's final breath of life. Multicellular life dances to the music of a highly ordered process, directed by a score that is not well understood.

[0090] Cell heterogeneity, the inherent differences between individual cells in a given tissue or tumor, is one of the biggest challenges in research today. Current techniques are greatly limited in their ability to mark individual cells while retaining their ancestry. tRACER offers a substantial leap forward. Heterogeneity is a natural consequence of biology, fostering the evolutionary adaptation that hampers cancer treatment.

[0091] Using current technologies, it is practically impossible to map the origin of the initial rogue cancer cell that causes a tumor. In essence, using tRACER technology, Applicants will be able to probe the cell of origin of any cancer by deep sequencing the barcodes within a given tumor. Specifically, each cell in that tumor would contain a barcoded digital DNA record of its evolutionary path. Moreover, sequencing barcodes from metastatic cells will trace the cells back to their original tumor and, in turn, to their wild-type healthy cell-of-origin, whether that be a stem cell, a mid-stage progenitor, or a fully differentiated nondividing cell type. Likewise, tracing cell death and amplification in the context of drug treatment may provide information about the evolution of a tumor during treatment. The origin of cancer heterogeneity has been controversial, with good data to support epigenetic and genetic heterogeneity models. New tools are needed to better understand the origin, development, and evolution of cancers, and the ability to describe tumors at the resolution of single cells could transform one's ability to plot the best treatment options and to anticipate disease outcome.

[0092] Currently there are no technologies that can delineate cell ancestries on such a large scale. Applicants' proposed concept takes advantage of the growing power of deep sequencing, as Applicants have the power to sequence billions of reads, potentially tracing hundreds of millions of cells or more. This represents a tremendous step forward from the scale at which fate mapping is currently done (typically qualitatively hundreds of cells).

[0093] Derivation and use of a self-editing gRNA for tRACER.

[0094] Concept and mechanism of activity. Applicants have developed a novel mechanism for the self-destruction of a gRNA, namely the inclusion of a PAM motif within the context of an actual gRNA (which Applicants name a self-editing gRNA, or segRNA). Conceptually, PAM motifs within the gRNA should be absolutely avoided in natural prokaryotic CRISPR settings, as self-destruction would cause loss of CRISPR function and, worse, genome instability. However, Applicants have found that the tracr portion of the gRNA can be altered to include a PAM motif; Applicants have discovered that the DNA encoding that specific gRNA can then be recognized by the gRNA that it encodes. In this way, the PAM motif causes self-destruction of the gRNA guiding portion. A precept of the segRNA is that it does not necessarily destroy the upstream promoter that transcribes it, nor the downstream tracr portion of the gRNA that is important for Cas9 binding.
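The self-targeting condition can be illustrated with a small check on the gRNA gene itself. This is a sketch under stated assumptions: it assumes an NGG PAM (the motif commonly associated with Streptococcus pyogenes Cas9; the text does not specify the PAM used), it treats the gRNA gene as a spacer immediately followed by the scaffold of SEQ ID NO:2, and the altered scaffold and spacer shown are purely hypothetical and are not the sequences Applicants constructed.

```python
# Wild-type guide RNA scaffold as listed in SEQ ID NO:2.
WT_SCAFFOLD = (
    "GTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAA"
    "AAAGTGGCACCGAGTCGGTGCTTTTTT"
)


def has_pam_after_spacer(spacer: str, scaffold: str) -> bool:
    """Return True if the three nucleotides immediately 3' of the spacer in
    the gRNA gene match NGG (assumed PAM), so the gene could be targeted by
    the gRNA it encodes."""
    gene = (spacer + scaffold).upper()
    pam = gene[len(spacer):len(spacer) + 3]
    return len(pam) == 3 and pam[1:] == "GG"


if __name__ == "__main__":
    spacer = "ACGTACGTACGTACGTACGT"  # hypothetical 20-nt guiding portion
    # The wild-type scaffold begins with GTT, which is not an NGG motif.
    print(has_pam_after_spacer(spacer, WT_SCAFFOLD))
    # Hypothetical altered scaffold whose first three bases read GGG.
    print(has_pam_after_spacer(spacer, "GGG" + WT_SCAFFOLD[3:]))
```

The check only captures the sequence condition; whether an altered scaffold still supports Cas9 binding is an experimental question that this sketch does not address.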

[0095] Definition of self-editing. Self-editing occurs when the gRNA has successfully cut its own gene. In the tRACER system, the TdT will add nucleotides to the cut site, resulting in a change in the DNA guiding portion of the gRNA (depicted in green in FIG. 1). One or more nucleotides may be added, but importantly, enough nucleotides should be added to specify the cell lineages within a given experiment.

[0096] Promoter and relevance of transcription. In principle, the promoter can be pol II or pol III or perhaps pol I. The key element to consider is that the gRNA gene, once self-edited, will continue to be transcribed, allowing new gRNAs to be created that destroy the new self-edited gRNA gene. It is in fact an ever-changing process in which repeating cycles of self-editing give rise to new gRNA genes, which give rise to new gRNA transcripts that self-edit.

[0097] Length of barcode. Applicants expect that each cycle of self-editing will cause multiple nucleotides to be added within a given cell. Applicants are working on regulating the cell-cycle nature of this process, but reason that it does not necessarily need to be cell cycle regulated. The important concept is that the nascent barcodes are unique for a given cell, no matter how or when they are added. Since the barcodes are not 'forgotten', new cell divisions give rise to new barcodes which extend the length of the barcode array (FIG. 4).
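The idea that an append-only barcode array records ancestry can be sketched with a toy simulation. This is an illustration only: it assumes one insertion event per division, 4-8 uniformly random nucleotides per event, no trimming or barcode loss, and known boundaries between successive blocks (real sequencing reads would yield a single concatenated string). All names are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"


def divide(barcode: tuple, rng: random.Random, min_nt: int = 4, max_nt: int = 8) -> list:
    """Return two daughter barcode arrays: the mother's array plus one new
    random block appended independently in each daughter."""
    daughters = []
    for _ in range(2):
        n = rng.randint(min_nt, max_nt)
        block = "".join(rng.choice(NUCLEOTIDES) for _ in range(n))
        daughters.append(barcode + (block,))
    return daughters


def grow(generations: int, seed: int = 0) -> list:
    """Grow a population from one unbarcoded founder for a number of divisions."""
    rng = random.Random(seed)
    cells = [()]
    for _ in range(generations):
        cells = [d for c in cells for d in divide(c, rng)]
    return cells


def ancestors(barcode: tuple) -> list:
    """Barcode arrays of every ancestor, from the founder to the mother."""
    return [barcode[:i] for i in range(len(barcode))]


def most_recent_common_ancestor(a: tuple, b: tuple) -> tuple:
    """Longest shared prefix of two barcode arrays."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return a[:shared]


if __name__ == "__main__":
    cells = grow(3)
    print(len(cells))                                        # 8 cells after 3 divisions
    print(most_recent_common_ancestor(cells[0], cells[1]))   # prefix shared by two sister cells
```

Because each array is a prefix-extension of its mother's array, a lineage tree can be read off from shared prefixes alone; the statistical methods mentioned elsewhere in the text would be needed once trimming, loss, or chance collisions are allowed.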

[0098] Applicants' current system allows for the barcode array to be compact, allowing for sequencing of the array by Illumina sequencing, effectively giving billions of reads. Longer reads can be achieved by PacBio technologies.

[0099] Terminal deoxynucleotidyl transferase (TdT) was determined to efficiently add nucleotides to a Cas9-induced dsDNA break. In these experiments, 293T cells were treated with either Cas9 alone or Cas9 and TdT, as depicted in FIG. 18. In the absence of TdT, genomic deletions prevailed. In the presence of TdT, insertions were visualized as added nucleotides at the site of the dsDNA break. FIG. 16A displays a dsDNA break at a conventional DNA locus. FIG. 16B displays a self-editing gRNA (segRNA) locus. Example sequencing results are displayed in FIG. 17.

INFORMAL SEQUENCE LISTING

[0100] SEQ ID NO:1

MDYKDDDDKDYKDDDDK.MAPKXKRKVGIHGVPA, DKKYSIGLDIGTNSVGWAVI TDEYKVPSKKFKVLGNTDRHSIKKNLiGAL

ICYLQEIFSNEMAKVDDSFFH LEESFLVEEDKKHERHPIFGNIVDEVAYHE YPTIYH

LRKKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQ

LFEENPINASGVDA AILSARLS SRRLENLIAQLPGE KNGLFGNLIALSLGLTPNFK

SNFDLAEDAKLQLS DTYDDDLD LLAQIGDQYADLFLAA LSDAILLSDILRVNT

EITK,APLSASMIKRY T DEHHQDLTLLKALVRQQLPEKYKEIFFDQSKNGYAGYIDGGA

SQEE^ FI PILE MDGTEELLV LNREDLLRKQRTFDNGSIPHQIHLGELHAILRRQ

EDFYPFL DNREKIE ILTFRIPYYVGPLARG SRFAWMTRKSEET1TP\T^FEEVVDK

GASAQSFIERMTNFDK^LPNE VLPKHSLLYEYFWY ELTKVKYVTEGMRKPAFLS

GEQKKAIVDLLFKTNRKVTVKQL EDYFK IECFDSVEISGVEDRFNASLGTYHDLL

KIIKDKDFLDNEENEDILEDIVLTLTLFEDREMIEERLKTY T AHLFDDK\^M QLKRRRY

TGWGR1SRKLINGIR QSG TILDFL SDGFANRNFMQLIHDDSLTF EDIQKAQVS

GQGD S L HEHIAN LAG S P AIKKG I

LQTV VVDELV VMGRHKPENIVIEMARENQTTQKGQKNSRERMKRIEEGIKELGS

QILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQELDINRLSDYDVDHIVPQSFLKD

DSIDNKVLTRSD NRGKSDNVPSEEVVKK NY\T^QLLTS ! AKLITQR FDNLT AER

GGLSELDKAGFIK.RQLVETRQITKHVAQILDSRJviNTKYDENDKLIREVKVITLK SKLV

S i R !}FQFY V! :!\XY]!HAHI)AYEAAVV{n A!.i KYP !J:Sh! VY(H)Y VY])VR V! I A K S ! · Q Γ.1 G A Ί ' A K ' If Y SXITvIXIl !!·. IFF WC.H JRK i . Π . Γ X ( Ϊ i · ' . ' f ' C ] \ i V \V I ) ( j .

DFATVRKVL

SMPQVNIVK TEVQTGGFSKESILP RNSD LIARK LDWDP KYGGFDSPTVAYSVL

\^VAK\^EKGKSKKL SVKELLGITIMERSSFEKNPIDFLEAKGYKEVKKDLIIKLP Y T S

LFELENGRXRMLASAGELQKGNELALPS YVNFLYLASHYEKL GSPEDNEQ QLF

VEQHKHYLDE11EQISEFSKRV1IADANLDKVLSAYNKHRDKPIREQAENIIHLFTL TN

LGAPAAF YFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQLGG

DKRPAATK AGQAK LK

[0101] SEQ ID NO:2 (WT guide RNA sequence):

GTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTATCAACTTGAA

AAAGTGGCACCGAGTCGGTGCTTTTTT

[0102] SEQ ID NO:3 (GST-TAL-FokI-linker-FokI)

gcttaagcggt.cgacggaicgggagatctcccgatcccciatggtgcactctcagtaca atc

atcigctccctgcttgtgtgttggaggtcgcigagtagtgcgcgagcaaaatttaag ctacaacaaggcaaggcttgaccgacaaiigc atgaagaaicigcttagggttaggcgttttgcgctgcttcgcgatgtacgggccagatat acgcgttgacattgattattgactagttattaa

ccaacgacccccgcccattgacgtcaataaigacgtaigttccxatagtaacgccaa tagggactitccattgacgtca^^ aitacggiaaactgcccacttggcagiacaicaagtgiatcatatgccaagtacgccccc tattgacgcaagacggtaaatggcccg cctggeattaigcccagiacaigaccilatgggactto^ tttggcagtaeateaatgggcgtggatagcggtttgactcacgggg

gcaccaaaa caacgggactttccaaaaigtcg aacaactccgccccattgacgcaaa gggcggtaggcgtgtacggtgggaggt c†atataagcagcgcgttttgcetgtac¾^

gcttaagcctcaataaagcttgccttgagtgcttcaagtagigtgtgcccgicigtt gigtgactctggtaactagagatccctcagaccct tttagtcagtgtggaaaatctctagcag ggcgcccgaacagggact gaaagcgaaagggaaaccagaggagctctctcgacgca ggactcggcttgcigaagcgcgcacggcaagaggcgaggggcggcgaciggigagiacgc caaaaattttgaciagcggaggcia gaaggagagagatgggtgcgagagcgtcagtaitaagcgggggagaaitagaicgcgatg ggaaaaaaticggttaaggccaggg ggaaagaaaaaaiaiaaattaaaaeatatagiaigggcaageag

agaaggctgtagacaaatactgggacagciacaaccatcccticagacaggatcaga agaacttagatcattataiaatacagtagcaa ccctctaitgigtgcatcaaaggatagagataaaagacaccaaggaagctttagacaaga iagaggaagagcaaaacaaaagtaaga ccaccgcacagcaagcggccggccgcgctgatcttcagacctggaggaggagatatgagg gacaattggagaagtgaa†.tatataa aiaiaaagiagtaaaaattgaaccattaggagiagcacccaccaaggcaaagagaagagi ggigcagagagaaaaaagagcagtgg gaaiaggagciitg icc igggtictigggagcagcaggaagcaciatgggcgcagcgicaaigacgcigacggiaca ggccagac aattattgtctggtatagigcageagcagaaoaatt^

tcaagcagctccaggcaagaaicctggcigtggaaagatacctaaaggatcaacagc tcciggggattiggggtigciciggaaaaci caiitgeaccacigcigtgeciiggaatgc^

cagagaaattaacaaiiaGacaagcitaaiacactcciiaaiigaagaaicgcaaaa ccagcaagaaaagaaigaacaagaattaitgg aattagataaatgggcaagtttgtggaattggtttaacataacaaattggctgiggiaia iaaaatiaitcaiaatgatagtaggaggcttgg taggittaagaaiagttittgctgtact^

gaggggacGCgacaggcGCgaaggaaiagaagaagaaggiggagagagagacagaga cagaiccattcgattagigaacggaic ggcactgcgtgcgccaattcigcagacaaatggcagtattcatccacaattttaaaagaa aaggggggattggggggiacagtgcag gggaaagaatagiagacaiaatageaacagacaiacaaaeiaaa^

cagggacagcagagatccagtttggttagtaccgggccctagagatcacgagactag cctcgagagatctgaicataaicagccatac cacaiitgtagaggiittacttgcii.taaaaaacctcccacacctcccccigaacciga aacataaaatgaaigcaaii.gttg tttattgcagcitataaigg tacaaataaggcaatagcaicacaaatitcacaaataaggcatitititcaGtgcatici agitttggtttgtcc aaactcatcaaigtatcttatcaigtctggaicicaaatccctcggaagctgcgcctgtc aicgaattcctgcagcccggtgcatgactaa gctagiaccggtiaggatgcaigciagcicagitagcctcccccatctcte

GCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACG TAAACGGCCACAAGITCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACG

GCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCC CACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGAC

CACATGAAGCAGCACGACTTcm:AAGTCcc ^

GAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTG

/\AGTTCriAGGGCGAC VCCC GGlOAACCGCATCGAGCTG/\AGGGCATCGACTrC

AAGGAGGACGGCAACATCCTGGGGCACAAGCIXKJAGTACAACTACAACAGCCA

CAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGTGAACTTCAA GATCCGC ACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCA GAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAG CACCCAGTCCGCCCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCT

GCTGGAGTTCGIXJACCGCCGCXXXXIAIXACTCTCGG

gtgg cgageggaggetggateggteccggtgteitcM^

actaaacgagctctgcttatataggccicccaccgiacacgcciaccctcgagaagc tigatatcactagagctctagTGTGCCC

GTCAGTCGGCAGAGCGCACATC^

TCGGCAATTGAACCGGTGCCTAGAGAAGGTGGCGCGGGGTAAACTGGGAAAGTG ATGTCGTGTACTCGCTCCOCCL TRCCCGAGGGLOGGGGAGAACCGTATATAAG TCCAGTAGTCGCCGTGAACGTTCTT^^

CTAGCgciaccggtcgccaccCCTAGGATGTCCCCTATACTAGGTTATTGGAAAATTAAG G GCCrrGTGCAACCCAClOGACriOl rTGGAATATCTTGAAGA-AAAATATGAAGA GCATTTGTATGAGCGCGATGAAGGTGATAAATGGCGAAACAAAAAGTTTGAATT

GGGTTTGGAGTTTCCCAATCTTCCTTATTATATTGATGGTGATGTTAAATTAACAC AGTCTATGGCCATCATACGTTATATAGCTGACAAGCACAACATGTTGGGTGGTTG

TCCAAAAGAGCGTGCAGAGATrrCAATGCTTGAAGGAGCGGmTGGATATrAG

ATACGGTGTTTCGAGAATTGCATATAGTAAAGACTTTGAAACTCTCAAAGTTGAT

TI CITAGCAAGCTACC GAAATCCTGAAAATGTTCGAAGATCGITTATCTCATA

AAACATATTTAAATCGTGATCATGT^

TCTTGATGTTGTTTTATACATGGACCCAATGTGCCTGGATGCGTTCCCAAAATTAG TTTGTTTTAAAAAACGTATIXJAAGCT

C AGC A A GT AT A TAG C ATG GCCTTTGC AG GG CTGG C A A GCC ACGTTTG GTGGTGGC

GACCATCCTCCAAAATCGGATCTGGTTCCGCGTGGATCCGGCGGTAGTTTAAACat ggcttcctcccctccaaagaaaaagagaaaggtiagttggaaggacgcaagtggttggtc tagagtggatctacgcacgctcggctac agtcagcagcagcaagagaagaicaaaccgaaggtgcgitcgacagtggcgcagcaccac gaggcactggtgggccai cacacgcgcacatcgttgcgctcagccaacacccggcagcgttagggaccgtcgctgtca cgtatcagcacataatcacggcgtt cagaggcgacacacgaagaca cgttggcgtcggcaaacagtggtccggcgcacgcgccctggaggccttgctcacggatgc gg gggagttgagaggiccgccgitacagitggacacaggccaacttgtgaagattgcaaaac gtggcggcgtgaccgcaatggaggca gtgcatgcaicgcgcaatgcacigacgggtgcccccctgaacCTCACC ^CGGAC ^AAGTGGTGGCTATCG

CCAGO\ACAATGGC^

TGCTGTGCCAGGACCATGGCCTGACCCCGGACCAAGTGGTGGCTATCGCCAGCA

ACGGTGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTi ' GCCGGTGCTGT

GCCAGGACCATGGCCTGACCCCGGACCAAGTGGTGGCTATCGCCAGCAACAATG

GCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCAGG

ACCATGGCCTGACCCCGGACCAAGTGGIXKiCTATCGCCAGCAACA ' n KjCGGCA

AGCAAGCGCTCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCAGGA.ee ATG

GCCTGACCCCOGACCAAGTGGTGGCTATCGCCAGCAACAATGGCGGCAAGCAAG

CGCTCGAAACGGT(X:ACK:GGCTGrF(X:^

CTCCGGACCAAGTGGTGGCTATCGCCAGCCACGATGGCGGCAAGCAAGCGCTCG AAACGGTGCAGCGGCTCTTGCCGGTGCTGTGCCAGGACCATGGCCTGACCCCGG ACCAAGTGGTGGCTATCGCCAGCAACATTGGCGGCAAGCAAGCGCTCGAAACGG

TGCAGCGGCTGTTGCCGGTGCTGTGCCAGGACCATGGCCTGACTCCGGACCAAGT

GGI ' GGCIATCGCCAGCCACGATGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCG

GCTGTTGCCGGTGCTGTGCCAGGACCATGGCCTGACTCCGGACCAAGTGGTGGCT

ATCGCCAGCCACGATGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTTG

CCGGTGCTG1XK CAGGACCATGGCCTGACTCCGGACCAAGTGG1XKJCTATC(K

AGCCACGA.TGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTTGCCGGTG

CTGlOCCAGGACCArGGCCTGACCCCGGACCAAGl ' GGl ' GGCTATCGCCAGCAAC

AT!XKJCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTCTTGCCGGTGCTGTCK:

CAGGACCATGGCCTGACCCCGGACCAAGTGGTGGCTATCGCCAGCAACAATGGC

GGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGI GCCGGTGCTGTGCCAGGAC

CATGGCCTGACTCCGGACCAAGTGGTGGCTATCGCCAGCCACGATGGCGGCAAG

CAAGCGCTCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCAGGACCATGGC

CTGACCCCGGACCAAGTGGTGGCTATCGCCAGCAACAATGGCGGCAAGCAAGCG

CTCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCAGGA.CCATGGCCTGACC

CCGGACCAAGTGGI ' GGCIATCGCCAGCAACAATGGCGGCAAGCAAGCGCT ' CGAA

ACGGTGCAGCGGCTGTrGCCGGTGCTG1XKX:AGGACCATGGCCTGACCCCGGAC

CAAGTGGTGGCTATCGCCAGCAACATTGGCGGCAAGCAAGCGCTCGAAACGGTG

CAGCGGC:rGTTGCCGGTGCTGTGCCAGGACCATGGCCl ' GACl ' CCGGACCAAGTG

GTGGCTATCGCCAGCCACGAlXKiCGGCAAGCAAGCGCTCGAAACGGTGCAGCGG

CTGTTGCCGGTGCTGTGCCAGGACCATGGCCTGACTCCGGACCAAGTGGTGGCTA

TCGCCAGCCACGATGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTTGC

CGGTGCTGTGCCAGGACCATGGCCTGACCCCGGACCAAGTGGTGGCTATCGCCA

GCAACGGTGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGTTGCCGGTGC TGTGCCAGGACCATGGCCTGACTCCGGACCAAGTGGTGGCTATCGCCAGCCACG ATGGCGGCAAGCAAGCGCTCGAAACGGTGCAGCGGCTGI GCCGGTCCTGTCKX AGGACCATGGCCTGACCCCGGACCAAGTGGTGGCTATCGCCAGCCACGATGGCG GCAAGCAAGCGCTCGAAACXiGTGCAGCGGC GTrGCCGGTGCTGTGCCAGGACC ATCGCCTXIACCCCGGACCA^

AAGCGCTCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCAGGACCATGGCC TGACTCCGGACCAAGTGGTGGC ATCGCCAGCCACGATGGCGGCAAGCAAGCGC TCGAAACGGTGCAGCGGCTGTTGCCGGTGCTGTGCCAGGACCATGGCctgaccccggac eaagtggtggctatcgceagcaaeggtggeggeaagcaagegctcgaaagcatt^

ggccgcgttgaccaacgaccacctcgicgccttggccigcctcggcggacgtcctgc catggaigcagigaaaaagggattgccgc acgcgccggaattgatcagaagagtcaatcgccgtattggcgaacgcacgtcccatcgcg ttgcctctagatcccagCCTGCAG

GTTCCCAACTAGTCAAAAGTGAACTGGAGGAGAAGAAATCTGAACTTCGTCATA

AATrGAAAl rGTGCCTCATGAATATATTGAAlTAArrGAAATrGCCAGAAATrC

CACTCAGGATAGAATTCTTGAAATGAAGGTAATGGAATTTTTTA.TGAAAGTTTAT

GGATATAGAGGTAAACATrTGGGTGGATCAAGGAAACCGGACGGAGCAAT fAT

ACTGTCGGATCTCCTATTGATTACGGTGTGATCGTGGATACTAAAGCrrATAGCG

GAGGTTATAATCTGCCAATTGGCCAAGCAGATGAAATGCAACGATATGTCGAAG

AAAATCAAACACG.AAACAAACATATCAACCCTAATGAATGGTGGAAAGTCTATC

CATCTTCTGTAACGGAATTTAAGTTTTTATTTGTGAGTGGTCACTTTAAAGGAAAC

TACAAAGCTCAGCTTACACGATTAAATCATATCACTAATTGTAATGGAGCTGTTC

TI GTGTAGAAGAGCl n ^ AATTGGTGGAGAAATGATTAAAGCCGGCACA ' ITAAC

CTTAGAGGAAGTGAGACGGAAATTTAATAACGGCGAGATAAACTTTggcgcgcctgg c ggaggtggaagigcaggigciggaiccggiagiggcicaggiggiggiggcggticagci ggcgctggaagiggiicaggtagigg aggaggaggcggctc gcaggagcaggctciggctccggatctggaggaggtggcggaagcgciggigcaggc ccggaagcg gaagiggagcgatcgc icGcagciagigaaaicigaaiiggaagagaagaaaiGigaacitagaca.taaa.tiga aa.tatgigccaca.t gaaiaiaiigaaiigaiigaaaicgcaagaaaiicaacicaggatagaaicciigaaaig aaggtgaiggagiiciiiaigaaggittaiggt ta†£gtggtaaacai†tggg†ggatcaaggaaaccagacggagcaaittaia^ ^

taaggcaiaticaggaggttaiaaicttGcaatiggicaagGagatgaaaigcaaag aiaigicgaagagaaicaaacaagaaacaagc ataicaaccciaaigaaiggiggaaagtciatccatciicagiaacagaaiiiaagtict tgiiigigagtggtcaiiicaaaggaaaciaca aagctcagcttacaagatigaaicalatca^

ciggiacaitgacacttgaggaagigagaaggaaaittaaiaacggigagaiaaact ttTAGttaattaagaaiicgtcgagggaccta alaacttcgtatagca acattatacgaagttatacatgttlaagggttccggttccaciaggtacaattcgatatc aagctta cgataatca acctctggattacaaaatttgtgaaagattg^

caigciaiigcttcccgiaiggciiicattiiciccicciigiaiaaaicciggiig cigiciciiiaigaggagiigiggcccgiigicaggcaa cgtggcgtggtgtgcacigtgtttgctgacgcaacccccaciggitggggcattgccacc acctgtcagctcctttccgggactitcgcit tccccciccc attgccacggcggaacicaicgccgGCigcciigcccgGtgciggacaggggctcggctg ttgggcactgacaattG egtggigttgteggggaaatcatcgtcctttecttggctgetc^

cttcggccctcaatccagcgga^

agtcggaiciCGCttigggccgcctccccgcaicgaiaccgtcgacctcgaicgaga cctagaaaaacaiggagcaaicacaagtagc aatacagcagctaccaatgctgattgtgcctggctagaagcacaagaggaggaggaggtg ggltttccagtcacacctcaggtaccttt aagaccaatgacttacaaggcagctgfcagatcttagcca^

agacaagataiccttgatctgiggatctaccacacacaaggciacticcctgattgg cagaactacacaccagggccagggatcagata tccac gacctttggatggtgc acaagctagtaccagttgagcaagagaagg agaagaagccaatgaaggagagaacacccgctt gttacaccctgtgagcctgcatgggatggatgacccggagagagaagtattagagtggag gtttgacagccgcctagcatticatcac atggcccgagagctgcatccggactgtacigggtciciciggttagaccagatcigagcc tgggagciciciggctaactagggaacc cac gcttaagcctcaataaagcttgccttgag gcttcaagtagtg gtgcccgtcigttgtg gactctggtaaclagag cccttttagicagtgiggaaaaictctagcagcatgtgagGaaaaggcGagcaaaaggcc aggaaccgiaaaaaggccgcgitgcig gcgttiticcaiaggciccgcccccctgacgagcatcacaaaaatcgacgctcaagtcag aggtggcgaaacccgacaggaciaiaa agaiacoaggegitiecccoiggaagc^

cgggaagcgtggegetrtctcatagctcacgctgtaggiata^ cceccgttcagcecgaccgetgegccttatccggm^

gccactggtaacaggattagcagagcgaggiatg aggcggtgctacagagttcttgaagtggtggcctaactacggctacaciagaa gaacagiait igg iaie igegci c tgetgaag^

gtagcggtggtttttttgtttgcaagcagcagatta^

gctcaglggaacgaaaacicacgiiaaggga tiiggtcaigaga iaicaaaaaggaicitcacciagatccittiaaattaaaaa tttiaaaicaaiciaaagiataiatgagiaaac iggicigacagttaccaaigcttaaiGagtgaggcacctatcicagcgaicigtctattt c gttcatccatagttgccigactccccgtcgtgtagataactacgaiacgggagggcitac catctggccccagtgctgcaatgataccgc gagaceeacgcicaceggciccagattiatca

ccgcctccatccagtc1¾.itaattgtigccgggaagctagagtaagta.gticgc cagitaaiagttt

ggcatcgiggigtcacgctcgtcgtttggiaiggcttcattcagctccggticccaa cgatcaaggcgagtiacatgatcccccatgitgt gcaaaaaagcggttagciccttcggtoctccgatcgttgicagaagtaagiiggccgcag tgte

ataattetcttactgtcatgceatecgtaagatg^^

cgagttgctcttgcccggcgtcaatacgggataataccgcgccacatagcagaactt taaaagtgctcatcattggaaaacgttcttcgg ggcgaaaactctcaaggateitaecgctgtigaga^

accagcgtttctgggtgagcaaaaacaggaaggcaaaatgccgcaaaaaagggaaia agggcgacacggaaatgttgaatacicat actcttectttttcaatattattgaagcatttatcagggttattgtctcatgagcggata catatttgaatglattlag

ggttccgcgcacatttccccgaaaagtgccacctgac

[0103] SEQ ID NO:4: (Linker)

CCTAGGGGGGGAGGGTCCGGCGGCGGTTCCGGCGGAGGATCGGGTGGAGGGTCA

GGTGGAGGCTCAGGCGGTGGATCAGGAGGAGGGAGCGGTGGCGGGAGCGGCGG

AGGGTCGGGAGGAGGTTCGGGCGGAGGCTCGGGCGGTGGGTCCGGAGGTGGCTC

α( ; αΛα( : ;ΛΛ< : ; Λ( : ;ο( π ] θ{ ;( ::ΛΊχ AC iGAGi iCAC -GAG

GAGGATCAGGTGGCGGAAGCGGAGGCGGCTCCGGAGGAGGCTCCGGCGGTGGA AGi X SGTGG AGGA AGi iC XX ·( X IGATi SGGAi SGTC iGGTC X}

[0104] SEQ ID NO:5: (Protein sequence of linker)

F R G G G S G G G S G G G S G G G S G G G S G G G S G G G S G G G S G G G S G G G S G G G S G G (5 S (5 G G S G G G S G G G S G G (5 S G G G S G G G S G G G S G G G S G G G S G G G S G G G S G G G S G G G S

[0105] SEQ ID NO:6: (Linker sequence)

ggcggaggtggaagtgcaggtgctggatccggtagtggctcaggtggtggtggcggttcagctggcgctggaagtggttcaggtagtggaggaggaggcggctctgcaggagcaggctctggctccggatctggaggaggtggcggaagcgctggtgcaggctccggaagcggaagtgga
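As an illustrative consistency check (not part of the original specification), the linker nucleotide sequence of SEQ ID NO:6 can be translated in reading frame 1 with the standard genetic code and compared with the linker protein sequence listed as SEQ ID NO:7. The sketch below assumes that the whitespace in both listings is a typesetting artifact and that translation begins at the first nucleotide; the names `translate`, `SEQ_ID_NO_6`, and `SEQ_ID_NO_7` are hypothetical and chosen only for this example.

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 1, ignoring whitespace and case."""
    seq = "".join(dna.split()).upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


# SEQ ID NO:6 as printed in the listing (segment breaks kept as in the text).
SEQ_ID_NO_6 = " ".join((
    "ggcggaggtggaagtgcaggtgctggatccggtagtggctcaggtggtggtggcggttca",
    "gctggcgctggaagtggttcaggtag",
    "tggaggaggaggcggctctgcaggagcaggctctggctccggatctggaggaggtggcgg",
    "aagcgctggtgcaggctccggaag",
    "cggaagtgga",
))

# SEQ ID NO:7 as printed in the listing (one-letter codes separated by spaces).
SEQ_ID_NO_7 = (
    "G G G G S A G A G S G S G S "
    "G G G G G S A G A G S G S G S "
    "G G G G G S A G A G S G S G S "
    "G G G G G S A G A G S G S G S G"
)

protein = translate(SEQ_ID_NO_6)
listed = "".join(SEQ_ID_NO_7.split())
print(protein)
print("matches SEQ ID NO:7:", protein == listed)
```

Under these assumptions the 180-nucleotide linker encodes 60 residues, and the comparison reports whether the two listings agree. The same helper could in principle be applied to the other nucleotide listings, although those appear too damaged by text extraction to translate reliably.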

[0106] SEQ ID NO:7: (Linker protein sequence)

G G G G S A G A G S G S G S G G G G G S A G A G S G S G S G G G G G S A G A G S G S G S G G G G G S A G A G S G S G S G

REFERENCES

[0107] 1 Sakaue-Sawano, A. et al. Visualizing spatiotemporal dynamics of multicellular cell-cycle progression. Cell 132, 487-498, doi:10.1016/j.cell.2007.12.033 (2008).

[0108] 2 Ke, R. et al. In situ sequencing for RNA analysis in preserved tissue and cells. Nat Methods 10, 857-860, doi:10.1038/nmeth.2563 (2013).

[0109] 3 Mino, T., Aoyama, Y. & Sera, T. Efficient double-stranded DNA cleavage by artificial zinc-finger nucleases composed of one zinc-finger protein and a single-chain FokI dimer. Journal of Biotechnology 140, 156-161, doi:10.1016/j.jbiotec.2009.02.004 (2009).

[0110] 4 Komori, T., Okada, A., Stewart, V. & Alt, F. W. Lack of N regions in antigen receptor variable region genes of TdT-deficient lymphocytes. Science 261, 1171-1175 (1993).

[0111] 5 Boubakour-Azzouz, I., Bertrand, P., Claes, A., Lopez, B. S. & Rougeon, F. Terminal deoxynucleotidyl transferase requires KU80 and XRCC4 to promote N-addition at non-V(D)J chromosomal breaks in non-lymphoid cells. Nucleic Acids Res 40, 8381-8391, doi:10.1093/nar/gks585 (2012).

[0112] 6 Eastburn, D. J., Sciambi, A. & Abate, A. R. Ultrahigh-throughput mammalian single-cell reverse-transcriptase polymerase chain reaction in microfluidic drops. Anal Chem 85, 8016-8021, doi:10.1021/ac402057q (2013).

[0113] Vogt W. ... Vitalfärbung. II. Teil. Gastrulation und Mesodermbildung bei Urodelen und Anuren. W. Roux Arch Entwicklungsmech Org 120:384-706. Keller RE (1986) .... Developmental Biology; 1929.

[0114] Sulston JE, Schierenberg E, White JG, Thomson JN. The embryonic cell lineage of the nematode Caenorhabditis elegans. Developmental Biology. 1983 Nov;100(1):64-119.

[0115] Livet J, Weissman TA, Kang H, Draft RW, Lu J. Transgenic strategies for combinatorial expression of fluorescent proteins in the nervous system. Nature. 2007.

[0116] Snippert HJ, van der Flier LG, Sato T, van Es JH, van den Born M, Kroon-Veenboer C, et al. Intestinal Crypt Homeostasis Results from Neutral Competition between Symmetrically Dividing Lgr5 Stem Cells. Cell. 2010 Oct;143(1):134-44.

[0117] Mino T, Aoyama Y, Sera T. Efficient double-stranded DNA cleavage by artificial zinc-finger nucleases composed of one zinc-finger protein and a single-chain FokI dimer. Journal of Biotechnology. 2009 Mar;140(3-4):156-61.

[0118] Sakaue-Sawano A, Kurokawa H, Morimura T, Hanyu A, Hama H, Osawa H, et al. Visualizing Spatiotemporal Dynamics of Multicellular Cell-Cycle Progression. Cell. 2008 Feb;132(3):487-98.

[0119] Ke R, Mignardi M, Pacureanu A, Svedlund J, Botling J, Wahlby C, et al. In situ sequencing for RNA analysis in preserved tissue and cells. Nature Methods. 2013 Sep;10(9):857-60.

[0120] Batzer MA, Gudi VA, Mena JC, Foltz DW, Herrera RJ, Deininger PL. Amplification dynamics of human-specific (HS) Alu family members. Nucleic Acids Res. Oxford University Press; 1991 Jul 11;19(13):3619-23.

[0121] Ohtsuka E, Matsuki S, Ikehara M, Takahashi Y, Matsubara K. An alternative approach to deoxyoligonucleotides as hybridization probes by insertion of deoxyinosine at ambiguous codon positions. Journal of Biological Chemistry. American Society for Biochemistry and Molecular Biology; 1985 Mar 10;260(5):2605-8.

[0122] Rossolini GM, Cresti S, Ingianni A, Cattani P, Riccio L, Satta G. Use of deoxyinosine-containing primers vs degenerate primers for polymerase chain reaction based on ambiguous sequence information. Molecular and Cellular Probes. 1994 Apr;8(2):91-8.

[0123] Maratea D, Young K, Young R. Deletion and fusion analysis of the phage φX174 lysis gene E. Gene. 1985 Jan;40(1):39-46.

[0124] Murphy JR, Bishai W, Borowski M, Miyanohara A, Boyd J, Nagle S. Genetic construction, expression, and melanoma-selective cytotoxicity of a diphtheria toxin-related alpha-melanocyte-stimulating hormone fusion protein. Proc Natl Acad Sci USA. National Acad Sciences; 1986 Nov;83(21):8258-62.

[0125] Kwoh DY, Davis GR, Whitfield KM, Chappelle HL, DiMichele LJ, Gingeras TR. Transcription-based amplification system and detection of amplified human immunodeficiency virus type 1 with a bead-based sandwich hybridization format. Proc Natl Acad Sci USA. National Acad Sciences; 1989 Feb;86(4):1173-7.

[0126] Guatelli JC, Whitfield KM, Kwoh DY, Barringer KJ, Richman DD, Gingeras TR. Isothermal, in vitro amplification of nucleic acids by a multienzyme reaction modeled after retroviral replication. Proc Natl Acad Sci USA. National Acad Sciences; 1990 Mar;87(5):1874-8.

[0127] Lomeli H, Tyagi S, Pritchard CG, Lizardi PM, Kramer FR. Quantitative assays based on the use of replicatable hybridization probes. Clinical Chemistry. American Association for Clinical Chemistry; 1989 Sep;35(9):1826-31.

[0128] Landegren U, Kaiser R, Sanders J, Hood L. A ligase-mediated gene detection technique. Science. American Association for the Advancement of Science; 1988 Aug 26;241(4869):1077-80.

[0129] Wu DY, Wallace RB. The ligation amplification reaction (LAR)— Amplification of specific DNA sequences using sequential rounds of template-dependent ligation. Genomics. 1989 May;4(4):560-9.

[0130] Barringer KJ, Orgel L, Wahl G, Gingeras TR. Blunt-end and single-strand ligations by Escherichia coli ligase: influence on an in vitro amplification scheme. Gene. 1990 Apr;89(1):117-22.

[0131] Jimenez JI, Xulvi-Brunet R, Campbell GW, Turk-MacLeod R, Chen IA. Comprehensive experimental fitness landscape and evolutionary network for small RNA. Proc Natl Acad Sci USA. National Acad Sciences; 2013 Sep 10;110(37):14984-9.

[0132] Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. American Society for Microbiology; 2009 Dec;75(23):7537-41.

[0133] Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006.

[0134] In the claims appended hereto, the term "a" or "an" is intended to mean "one or more." The term "comprise" and variations thereof such as "comprises" and "comprising," when preceding the recitation of a step or an element, are intended to mean that the addition of further steps or elements is optional and not excluded. All patents, patent applications, and other published reference materials cited in this specification are hereby incorporated herein by reference in their entirety. Any discrepancy between any reference material cited herein or any prior art in general and an explicit teaching of this specification is intended to be resolved in favor of the teaching in this specification. This includes any discrepancy between an art-understood definition of a word or phrase and a definition explicitly provided in this specification of the same word or phrase.