Title:
TRANSCRIPT OPTIMIZED EXPRESSION ENHANCEMENT FOR HIGH-LEVEL PRODUCTION OF PROTEINS AND PROTEIN DOMAINS
Document Type and Number:
WIPO Patent Application WO/2013/071295
Kind Code:
A2
Abstract:
The present invention relates to a system for high-level production of recombinant proteins and protein domains.

Inventors:
ACTON THOMAS B (US)
ANDERSON STEPHEN (US)
HUANG YUANPENG JANET (US)
MONTELIONE GAETANO (US)
Application Number:
PCT/US2012/064836
Publication Date:
May 16, 2013
Filing Date:
November 13, 2012
Assignee:
UNIV RUTGERS (US)
ACTON THOMAS B (US)
ANDERSON STEPHEN (US)
HUANG YUANPENG JANET (US)
MONTELIONE GAETANO (US)
International Classes:
C12N15/09
Domestic Patent References:
WO2006097945A2 2006-09-21
WO1989000604A1 1989-01-26
Other References:
MYERS; MILLER, CABIOS, vol. 4, 1988, pages 11
SMITH ET AL., ADV. APPL. MATH., vol. 2, 1981, pages 482
NEEDLEMAN; WUNSCH, JMB, vol. 48, 1970, pages 443
PEARSON; LIPMAN, PROC. NATL. ACAD. SCI. USA, vol. 85, 1988, pages 2444
KARLIN; ALTSCHUL, PROC. NATL. ACAD. SCI. USA, vol. 87, 1990, pages 2264
KARLIN; ALTSCHUL, PROC. NATL. ACAD. SCI. USA, vol. 90, 1993, pages 5873
HIGGINS ET AL., CABIOS, vol. 5, 1989, pages 151
CORPET ET AL., NUCL. ACIDS RES., vol. 16, 1988, pages 10881
HUANG ET AL., CABIOS, vol. 8, 1992, pages 155
PEARSON ET AL., METH. MOL. BIOL., vol. 24, 1994, pages 307
ALTSCHUL ET AL., JMB, vol. 215, 1990, pages 403
GOEDDEL: "Gene Expression Technology: Methods in Enzymology", 1990, ACADEMIC PRESS, pages: 185
MILSTEIN ET AL., METHODS ENZYMOL., vol. 73, 1981, pages 3 - 46
HUSTON ET AL., PROC. NATL. ACAD. SCI. USA, vol. 85, 1988, pages 5879 - 5883
CO ET AL., J. IMMUNOL., vol. 152, 1994, pages 2968 - 2976
BETTER ET AL., METHODS ENZYMOL., vol. 178, 1989, pages 476 - 496
PLUCKTHUN ET AL., METHODS ENZYMOL., vol. 178, 1989, pages 497 - 515
LAMOYI, METHODS ENZYMOL., vol. 121, 1986, pages 652 - 663
ROUSSEAUX ET AL., METHODS ENZYMOL., vol. 121, 1986, pages 663 - 669
BIRD ET AL., TRENDS BIOTECHNOL., vol. 9, 1991, pages 132 - 137
"Antibodies: a laboratory manual.", 1988, COLD SPRING HARBOR LABORATORY
"Strategies for Protein Purification and Characterization: A laboratory Course Manual.", 1996, COLD SPRING HARBOR LABORATORY PRESS
"Methods in Enzymology", vol. 493, ACADEMIC PRESS, pages: 21 - 60
EVERETT, J.K.; ACTON, T.B.; MONTELIONE, G.T.J., STRUCT. FUNCT. GENOMICS, vol. 5, 2004, pages 13 - 21
ACTON, T. B. ET AL.: "Preparation of protein samples for NMR structure, function, and small-molecule screening studies", METHODS ENZYMOL., vol. 493, 2011, pages 21 - 60, XP009169372, DOI: doi:10.1016/B978-0-12-381274-2.00002-9
AGATON ET AL., MOLECULAR & CELLULAR PROTEOMICS, vol. 2, 2003, pages 405 - 414
BINDEWALD, E. ET AL.: "CyloFold: secondary structure prediction including pseudoknots", NUCLEIC ACIDS RES., vol. 38, pages W368 - 72
BRODSKII, L. I. ET AL.: "GeneBee-NET: An Internet based server for biopolymer structure analysis", BIOKHIMIIA, vol. 60, 1995, pages 1221 - 30
CROWE, J. ET AL.: "6xHis-Ni-NTA chromatography as a superior technique in recombinant protein expression/purification", METHODS MOL BIOL., vol. 31, 1994, pages 371 - 87
DING, Y. ET AL.: "Sfold web server for statistical folding and rational design of nucleic acids", NUCLEIC ACIDS RES., vol. 32, 2004, pages W135 - 41
DO, C. B. ET AL.: "CONTRAfold: RNA secondary structure prediction without physics-based models", BIOINFORMATICS, vol. 22, 2006, pages E90 - 8
GONZALEZ DE VALDIVIA, E. I.; ISAKSSON, L. A.: "A codon window in mRNA downstream of the initiation codon where NGG codons give strongly reduced gene expression in Escherichia coli", NUCLEIC ACIDS RES., vol. 32, 2004, pages 5198 - 205
GRUBER, A. R. ET AL.: "The Vienna RNA websuite", NUCLEIC ACIDS RES., vol. 36, 2008, pages W70 - 4
HAMADA, M. ET AL.: "Predictions of RNA secondary structure by combining homologous sequence information", BIOINFORMATICS, vol. 25, 2009, pages I330 - 8
JANSSON, M. ET AL.: "High-level production of uniformly 15N- and 13C-enriched fusion proteins in Escherichia coli.", B. J. BIOMOL. NMR., vol. 7, 1996, pages 131 - 141
KAPUST, R. B. ET AL.: "The P1' specificity of tobacco etch virus protease", BIOCHEM BIOPHYS RES COMMUN., vol. 294, 2002, pages 949 - 55, XP002275331, DOI: doi:10.1016/S0006-291X(02)00574-0
KUDLA, G. ET AL.: "Coding-sequence determinants of gene expression in Escherichia coli", SCIENCE, vol. 324, 2009, pages 255 - 8, XP055059425, DOI: doi:10.1126/science.1170160
LAMLA, T.; ERDMANN, V. A.: "The Nano-tag, a streptavidin-binding peptide for the purification and detection of recombinant proteins", PROTEIN EXPR PURIF., vol. 33, 2004, pages 39 - 47, XP004475011, DOI: doi:10.1016/j.pep.2003.08.014
LUI ET AL.: "Loopy proteins appear conserved in evolution", J MOL BIOL., vol. 322, 2002, pages 53 - 64, XP004449827, DOI: doi:10.1016/S0022-2836(02)00736-2
MARKHAM, N. R.; ZUKER, M.: "UNAFold: software for nucleic acid folding and hybridization", METHODS MOL BIOL., vol. 453, 2008, pages 3 - 31
MATHEWS, D. H. ET AL.: "Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure", PROC NATL ACAD SCI U S A., vol. 101, 2004, pages 7287 - 92
NETZER; HARTL: "Recombination of protein domains facilitated by co-translational folding in eukaryotes", NATURE, vol. 358, 1997, pages 343 - 9
NOMURA, M. ET AL.: "Influence of messenger RNA secondary structure on translation efficiency", NUCLEIC ACIDS SYMP SER., 1984, pages 173 - 6
QUAN, J. ET AL.: "Parallel on-chip gene synthesis and application to optimization of protein expression", NAT BIOTECHNOL., vol. 29, 2011, pages 449 - 52, XP055162152, DOI: doi:10.1038/nbt.1847
REEDER, J. ET AL.: "pknotsRG: RNA pseudoknot folding including near-optimal structures and sliding windows", NUCLEIC ACIDS RES., vol. 35, 2007, pages W320 - 4
RIVAS, E.; EDDY, S. R.: "A dynamic programming algorithm for RNA structure prediction including pseudoknots", J MOL BIOL., vol. 285, 1999, pages 2053 - 68, XP004464314, DOI: doi:10.1006/jmbi.1998.2436
ROCHA, E. P. ET AL.: "Translation in Bacillus subtilis: roles and trends of initiation and termination, insights from a genome analysis", NUCLEIC ACIDS RES., vol. 27, 1999, pages 3567 - 76, XP055053042, DOI: doi:10.1093/nar/27.17.3567
SHARP, P. M.; LI, W. H.: "The codon Adaptation Index-a measure of directional synonymous codon usage bias, and its potential applications", NUCLEIC ACIDS RES., vol. 15, 1987, pages 1281 - 95, XP001122356
SCHOLLE, M. D. ET AL.: "In vivo biotinylated proteins as targets for phage-display selection experiments", PROTEIN EXPR PURIF., vol. 37, 2004, pages 243 - 52, XP004523962, DOI: doi:10.1016/j.pep.2004.05.012
SCHROEDER, S. J. ET AL.: "Ensemble of secondary structures for encapsidated satellite tobacco mosaic virus RNA consistent with chemical probing and crystallography constraints", BIOPHYS J., vol. 101, 2011, pages 167 - 75, XP028236055, DOI: doi:10.1016/j.bpj.2011.05.053
VOSS, B. ET AL.: "Complete probabilistic analysis of RNA shapes", BMC BIOL., vol. 4, 2006, pages 5, XP021019039, DOI: doi:10.1186/1741-7007-4-5
XAYAPHOUMMINE, A. ET AL.: "Kinefold web server for RNA/DNA folding path and structure prediction including pseudoknots and knots", NUCLEIC ACIDS RES., vol. 33, 2005, pages W605 - 10
XAYAPHOUMMINE, A. ET AL.: "Prediction and statistics of pseudoknots in RNA structures using exactly clustered stochastic simulations", PROC NATL ACAD SCI U S A., vol. 100, 2003, pages 15310 - 5
ZUKER, M.; STIEGLER, P.: "Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information", NUCLEIC ACIDS RES., vol. 9, 1981, pages 133 - 48, XP003010680, DOI: doi:10.1093/nar/9.1.133
Attorney, Agent or Firm:
KRUEGER, Katherine A.D. et al. (Suite 670, Bloomington, Minnesota, US)
Claims:
CLAIMS

WHAT IS CLAIMED IS:

1. A method of preparing an expression vector, wherein the expression vector comprises, in order of position: a first nucleic acid sequence encoding a 5' untranslated region of an expressed mRNA that comprises a ribosome binding site (RBS); a second nucleic acid sequence encoding a polypeptide tag; and a cloning site, wherein the cloning site enables a target protein coding sequence to be inserted into the vector in-frame with the second nucleic acid sequence to encode a fusion protein comprising the polypeptide tag and the target protein; and wherein the method comprises specifically modifying the nucleic acid sequence encoding (i) the 5' untranslated region and (ii) the adjacent polypeptide tag to minimize RNA secondary structure both within and/or between these two regions of the mRNA.

2. The method of claim 1, further comprising specifically modifying the second nucleic acid sequence to reduce the presence of rare codons.

3. The method of any one of claims 1-2, wherein nucleotides within about the last 100 nucleotides of the first nucleic acid sequence are modified.

4. The method of any one of claims 1-3, wherein nucleotides within about the first 90 nucleotides of the second nucleic acid sequence are modified.

5. The method of any one of claims 1-4, wherein the expression vector further comprises a target protein coding sequence inserted into the vector in-frame with the nucleic acid tag sequence to encode a fusion protein comprising the polypeptide tag and the target protein.

6. The method of claim 5, wherein the target protein coding sequence is not modified to minimize RNA secondary structure.

7. The method of any one of claims 5-6, wherein the target protein coding sequence is not modified to reduce the presence of rare codons.

8. The method of any one of claims 1-7, wherein the second nucleic acid sequence encodes at least one affinity purification tag.

9. The method of any one of claims 1-8, wherein the second nucleic acid sequence encodes more than one affinity purification tag.

10. The method of claim 9, wherein the second nucleic acid sequence encodes two affinity purification tags.

11. The method of any one of claims 8-10, wherein the encoded affinity purification tag(s) is/are selected from a Streptavidin binding moiety, a maltose binding protein moiety, and a HIS tag.

12. The method of claim 11, wherein the Streptavidin binding moiety is a Nano-tag or a biotinylated Avi-tag.

13. The method of any one of claims 1-12, wherein the second nucleic acid sequence encodes at least one solubility enhancement tag.

14. The method of any one of claims 1-13, wherein the second nucleic acid sequence encodes more than one solubility enhancement tag.

15. The method of claim 14, wherein the second nucleic acid sequence encodes two solubility enhancement tags.

16. The method of any one of claims 13-15, wherein the encoded solubility enhancement tag(s) is/are selected from a maltose binding protein tag, a protein G B1 domain tag, and a myxococcus protein S tag.

17. The method of any one of claims 1-16, wherein the second nucleic acid sequence further encodes at least one protease recognition site.

18. The method of claim 17, wherein the protease recognition site is a Tobacco Etch Virus (TEV), Thrombin, Factor Xa or a HRV 3C protease recognition site.

19. The method of any one of claims 5-18, wherein the target protein coding sequence encodes a transcription factor, a transcription factor domain, an epigenetic regulatory factor, or an epigenetic regulatory factor domain.

20. The method of any one of claims 5-19, wherein the target protein coding sequence encodes a polypeptide sequence as described in Table 2.

21. The method of any one of claims 5-20, wherein the target protein coding sequence encodes a protein antigen for producing an affinity capture reagent.

22. The method of claim 21, wherein the affinity capture reagent is an antibody, an antibody fragment, or an aptamer.

23. The method of any one of claims 5-20, wherein the target protein coding sequence encodes a protein antigen for producing an antibody or Fab by phage display.

24. The method of any one of claims 5-23, wherein the expression of the target protein is 1.5 fold greater than the expression of a target protein generated from an expression vector that was not modified as described in claim 1 or 2.

25. An expression vector prepared using the method of any one of claims 1-24.

26. An expression vector comprising, in order of position: a first nucleic acid sequence encoding a 5' untranslated region of an expressed mRNA that comprises a ribosome binding site (RBS); a second nucleic acid sequence encoding a polypeptide tag; and a cloning site, wherein the cloning site enables a target protein coding sequence to be inserted into the vector in-frame with the second nucleic acid sequence to encode a fusion protein comprising the polypeptide tag and the target protein; and wherein the nucleic acid sequence encoding (i) the 5' untranslated region and (ii) the adjacent polypeptide tag has been specifically modified to minimize RNA secondary structure both within and/or between these two regions of the mRNA.

27. The expression vector of claim 26, wherein the second nucleic acid sequence has been specifically modified to reduce the presence of rare codons.

28. The expression vector of any one of claims 26-27, wherein nucleotides within about the last 100 nucleotides of the first nucleic acid sequence have been modified.

29. The expression vector of any one of claims 26-28, wherein nucleotides within about the first 90 nucleotides of the second nucleic acid sequence have been modified.

30. The expression vector of any one of claims 26-29, further comprising a target protein coding sequence inserted into the vector in-frame with the nucleic acid tag sequence to encode a fusion protein comprising the polypeptide tag and the target protein.

31. The expression vector of claim 30, wherein the target protein coding sequence has not been modified to minimize RNA secondary structure.

32. The expression vector of any one of claims 30-31, wherein the target protein coding sequence has not been modified to eliminate rare codons.

33. The expression vector of any one of claims 26-32, wherein the second nucleic acid sequence encodes at least one affinity purification tag.

34. The expression vector of any one of claims 26-33, wherein the second nucleic acid sequence encodes more than one affinity purification tag.

35. The expression vector of claim 34, wherein the second nucleic acid sequence encodes two affinity purification tags.

36. The expression vector of any one of claims 33-35, wherein the encoded affinity purification tag(s) is/are selected from a Streptavidin binding moiety, a maltose-binding protein moiety and a HIS tag.

37. The expression vector of claim 36, wherein the Streptavidin binding moiety is a Nano-tag or a biotinylated Avi-tag.

38. The expression vector of any one of claims 26-37, wherein the second nucleic acid sequence encodes at least one solubility enhancement tag.

39. The expression vector of any one of claims 26-38, wherein the second nucleic acid sequence encodes more than one solubility enhancement tag.

40. The expression vector of claim 39, wherein the second nucleic acid sequence encodes two solubility enhancement tags.

41. The expression vector of any one of claims 38-40, wherein the encoded solubility enhancement tag(s) is/are selected from a maltose binding protein tag, a protein G B1 domain tag, and a myxococcus protein S tag.

42. The expression vector of any one of claims 26-41, wherein the second nucleic acid sequence further encodes at least one protease recognition site.

43. The expression vector of claim 42, wherein the protease recognition site is a Tobacco Etch Virus (TEV), Thrombin, Factor Xa or a HRV 3C protease recognition site.

44. The expression vector of any one of claims 30-43, wherein the target protein coding sequence encodes a transcription factor, a transcription factor domain, an epigenetic regulatory factor, or an epigenetic regulatory factor domain.

45. The expression vector of any one of claims 30-44, wherein the target protein coding sequence encodes a polypeptide sequence as described in Table 2.

46. The expression vector of any one of claims 30-45, wherein the target protein coding sequence encodes a protein antigen for producing an affinity capture reagent.

47. The expression vector of claim 46, wherein the affinity capture reagent is an antibody, an antibody fragment, or an aptamer.

48. The expression vector of any one of claims 30-45, wherein the target protein coding sequence encodes a protein antigen for producing an antibody or Fab by phage display.

49. The expression vector of any one of claims 30-48, wherein the target protein is expressed at a 1.5-fold higher level than a target protein generated from an expression vector that was not modified as described in claim 26 or 27.

50. A host cell comprising the expression vector of any of claims 30-49.

51. A method for expressing a target protein in a host cell, comprising culturing the host cell of claim 50 for a period of time under conditions permitting expression of the target protein.

52. The method of claim 51, wherein the target protein is a protein antigen for producing an affinity capture reagent.

53. The method of claim 52, wherein the affinity capture reagent is an antibody, an antibody fragment, or an aptamer.

54. The method of claim 51, wherein the target protein is a protein antigen for producing an antibody or Fab by phage display.

Description:
TRANSCRIPT OPTIMIZED EXPRESSION ENHANCEMENT FOR HIGH-LEVEL PRODUCTION OF PROTEINS AND PROTEIN DOMAINS

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application claims priority from United States Provisional Application Number 61/558277, filed November 10, 2011, which application is herein incorporated by reference.

STATEMENT OF GOVERNMENT SUPPORT

This invention was made with government support under Grant # U54-GM074958 awarded by the National Institute of General Medical Sciences Protein Structure Initiative and Grant # U01-DC011485 awarded by the National Institute on Deafness and other Communication Disorders under the auspices of the NIH Common Fund. The government has certain rights in the invention.

BACKGROUND

The production of recombinant proteins and protein domains as reagents is extremely valuable to biomedical researchers and the entire biotechnology industry. Escherichia coli expression systems are the most cost effective and widely utilized expression systems for this task. However, production of certain proteins can be challenging in this bacterial system. Often proteins or protein domains fail to express at sufficient levels to allow for the purification of the protein reagents. This is especially true of the protein coding sequences derived from higher eukaryotes (such as humans). For example, using a standard pET E. coli expression system (Acton et al., 2011), nearly one-third of human protein targets produced in a large scale screen of protein expression had no detectable expression levels.

Thus, there is a need for agents and methods for high-level production of recombinant proteins and protein domains that do not require RNA optimization for each individual target gene.

SUMMARY OF CERTAIN EMBODIMENTS OF THE INVENTION

This invention relates to a system for high-level production of recombinant proteins and protein domains that does not require RNA optimization for each individual target gene.

Certain embodiments of the invention provide a method of preparing an expression vector, wherein the expression vector comprises, in order of position: a first nucleic acid sequence encoding a 5' untranslated region of an expressed mRNA that comprises a ribosome binding site (RBS); a second nucleic acid sequence encoding a polypeptide tag; and a cloning site, wherein the cloning site enables a target protein coding sequence to be inserted into the vector in-frame with the second nucleic acid sequence to encode a fusion protein comprising the polypeptide tag and the target protein; and wherein the method comprises specifically modifying the nucleic acid sequence encoding (i) the 5' untranslated region and (ii) the adjacent polypeptide tag to minimize RNA secondary structure both within and/or between these two regions of the mRNA.

Certain embodiments of the invention provide an expression vector designed using the methods described herein.

Certain embodiments of the invention provide an expression vector comprising, in order of position: a first nucleic acid sequence encoding a 5' untranslated region of an expressed mRNA that comprises a ribosome binding site (RBS); a second nucleic acid sequence encoding a polypeptide tag; and a cloning site, wherein the cloning site enables a target protein coding sequence to be inserted into the vector in-frame with the second nucleic acid sequence to encode a fusion protein comprising the polypeptide tag and the target protein; and wherein the nucleic acid sequence encoding (i) the 5' untranslated region and (ii) the adjacent polypeptide tag has been specifically modified to minimize RNA secondary structure both within and/or between these two regions of the mRNA.

Certain embodiments of the invention provide a host cell comprising an expression vector as described herein.

The details of one or more embodiments of the invention are set forth in the description below. Other features, objects, and advantages of the invention will be apparent from the description and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a set of diagrams showing sequences of Avi-tag and Nano-tag based Transcript-Optimized Expression Enhancement Technology (TOEET) expression vectors. The pNESG_Avi6HT Avi-tag sequence (top) (DNA, RNA and protein sequence), the His-tag sequences and the TEV Protease Recognition Site sequences are shown as indicated. Similarly, for pNESG_Nano6HT (bottom) the Nano-tag sequences, the His-tag sequences and TEV Protease Recognition Site sequences are shown as indicated. The T7 RNA transcript produced by each vector is shown under each vector with untranslated sequences indicated with brackets. The Multiple Cloning Site (MCS) is also shown after the tag sequences, including the positions and identity of restriction sites available for cloning.

FIG. 2 is a diagram showing the predicted mRNA secondary structure resulting from T7-RNA Polymerase based transcription off of the pNESG_Avi6HT T7 promoter. Numbering of the transcript from nucleotides 1-156 is indicated; negative numbers (in italics) show the estimated strength, in kcal/mole, of the predicted base-paired regions. The arrow indicates a predicted open structure (lack of base pairing) at the RBS/translation initiation region. RNA secondary structure predictions were done using GeneBee-NET (http://www.genebee.msu.su/services/rna2_reduced.html).

FIG. 3 is a set of photographs showing representative SDS-PAGE analysis of expression and solubility for two human protein domains cloned into each of the three vectors pET15_NESG, pNESG_Nano6HT and pNESG_Avi6HT. Left Panel shows the expression and solubility of HR7724C (HUGO ID: ZNF281) residues 291-374. Right Panel shows the expression and solubility of HR8241 (HUGO ID: NR4A21) residues 261-342. Total cell lysate (Tot) and the soluble portion (Sol) of the cell lysate are run in adjacent lanes for each of the two protein domains and the three expression vectors. An asterisk (*) indicates an overexpressed band of the correct size. Note the lack of protein expression in the case of pET15_NESG constructs.

FIG. 4. Wild-Type and TOEET-Optimized Pyrococcus furiosus (PfR) Maltose Binding Protein (MBP). The sequences at the top correspond to the first 30 residues of the wild-type PfR-MBP DNA sequence lacking the native secretion signal. The protein open reading frame (DNA sequence) is shown above the corresponding protein sequence. Directly below is the T7 RNA polymerase mediated RNA transcript resulting from the cloning of the PfR-MBP into the pET15_NESG backbone. The Ribosome Binding Site (RBS) is underlined and highlighted in bold; the translation initiation codon is shown in bold-italics. The lower set of sequences corresponds to TOEET-optimized PfR-MBP. Bold nucleotides with arrows indicate positions where silent mutations were introduced for codon optimization, for a predicted decrease in RNA secondary structure in the regions of the RBS and translation initiation codon, or both. The RNA transcript for the TOEET-optimized sequence is also shown following the parameters outlined above. The silent mutations were introduced using primers incorporating the nucleotide changes and 5 successive rounds of PCR, negating the need for expensive total gene synthesis.

FIG. 5. The predicted mRNA secondary structure resulting from T7-RNA Polymerase based transcription off of the pET15_NESG vector backbone with Pyrococcus furiosus (PfR) Maltose Binding Protein (MBP) without TOEET optimization. The arrows indicate significant secondary structure (base pairing) at both the Ribosome Binding Site (RBS) and the translation initiation site (Initiation Codon). RNA secondary structure predictions were performed using GeneBee-NET (http://www.genebee.msu.su/services/rna2_reduced.html).

FIG. 6. The predicted mRNA secondary structure resulting from T7-RNA Polymerase based transcription off of the pET15_NESG vector backbone with Pyrococcus furiosus (PfR) Maltose Binding Protein (MBP) after TOEET optimization. The arrows indicate the Ribosome Binding Site (RBS) and the translation initiation site (Initiation Codon) and the prediction of significantly greater open structure (lack of base pairing) after TOEET optimization. RNA secondary structure predictions were done using GeneBee-NET (http://www.genebee.msu.su/services/rna2_reduced.html).

FIG. 7. Histogram plots comparing Expression scores (E ranging from 0 to 5) using the TOEET technology (E_TOEET) compared to expression scores for the same target protein using a pET vector lacking TOEET technology (E_pET). The data shown in Figure 7a is for 98 protein target genes cloned into the pNESG_Avi6HT TOEET vector compared with the exact same genes cloned into the pET15_NESG vector (lacking TOEET). The data shown in Figure 7b is for 94 protein target genes cloned into the pNESG_Nano6HT TOEET vector compared with the exact same genes cloned into pET15_NESG vector (lacking TOEET). In these histogram plots, a value E_TOEET - E_pET = 0 indicates that the expression levels for both vectors were identical; values E_TOEET - E_pET > 0 indicate that the TOEET technology provided higher level expression; values E_TOEET - E_pET < 0 indicate that the TOEET technology provided lower level expression.
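By way of non-limiting illustration, the per-target comparison plotted in FIG. 7 can be sketched as follows. The function names and the example scores are hypothetical and are not data from this application; the sketch assumes only that each target has one integer expression score (0 to 5) per vector.

```python
from collections import Counter

def score_differences(e_toeet, e_pet):
    """Tally E_TOEET - E_pET over paired targets (one histogram bin per difference)."""
    if len(e_toeet) != len(e_pet):
        raise ValueError("score lists must describe the same targets")
    return Counter(t - p for t, p in zip(e_toeet, e_pet))

def summarize(diffs):
    """Count targets with higher, equal, or lower expression in the TOEET vector."""
    return {
        "higher": sum(n for d, n in diffs.items() if d > 0),
        "equal": diffs.get(0, 0),
        "lower": sum(n for d, n in diffs.items() if d < 0),
    }

# Hypothetical scores for six targets (illustration only, NOT measured data).
example_toeet = [5, 3, 0, 4, 2, 1]
example_pet = [2, 3, 0, 1, 3, 0]
example_diffs = score_differences(example_toeet, example_pet)
example_summary = summarize(example_diffs)
```

A positive bin in `example_diffs` corresponds to a target for which the TOEET vector scored higher than the pET vector, matching the sign convention stated for the histograms.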

DETAILED DESCRIPTION

mRNA stem-loop structures often inhibit translation initiation and therefore reduce recombinant protein expression (Nomura et al., 1984). High-level expression of proteins is favored by a lack of mRNA secondary structure near the translation start site (Kudla et al., 2009; Rocha et al., 1999). In addition, rare codons present within the first ten residues of a protein have deleterious effects on protein expression levels (Gonzalez de Valdivia and Isaksson, 2004). E. coli, like all organisms, prefers to use a subset of the possible codons. The codons that an organism utilizes only infrequently are termed "rare codons" of that organism.
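As a minimal sketch of how rare codons near the start of translation might be flagged, the following scans the first ten codons of an open reading frame. The rare-codon set is an illustrative assumption based on codons commonly cited as infrequent in E. coli; it is not a list taken from this application, and the function name and example sequence are hypothetical.

```python
# Illustrative set of codons often cited as rare in E. coli (an assumption,
# not a list defined in this application).
RARE_CODONS_ECOLI = {"AGG", "AGA", "CGA", "CTA", "ATA", "CCC"}

def rare_codons_near_start(orf_dna, n_codons=10, rare=RARE_CODONS_ECOLI):
    """Return (codon_index, codon) for each rare codon among the first n_codons.

    Index 0 is the initiation codon. RNA input is accepted (U is read as T).
    """
    seq = orf_dna.upper().replace("U", "T")
    hits = []
    for i in range(min(n_codons, len(seq) // 3)):
        codon = seq[3 * i:3 * i + 3]
        if codon in rare:
            hits.append((i, codon))
    return hits

# Hypothetical 11-codon reading frame used only to exercise the function.
example_orf = "ATGAGGCTAAAAGGTCCCGAAATAGCTTGGAGA"
example_hits = rare_codons_near_start(example_orf)
```

Positions reported by such a scan would be candidates for silent (synonymous) substitution; the choice of replacement codon is outside the scope of this sketch.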

Heterologous genes from other organisms, which generally have a different codon bias, often contain E. coli rare codons. Minimizing mRNA secondary structure near the Ribosome Binding Site (RBS) and translation initiation site and, separately, avoiding rare codons near the start of translation are both important for high-level E. coli protein expression (Gonzalez de Valdivia and Isaksson, 2004; Kudla et al., 2009). However, the DNA coding sequence of a target gene destined for heterologous expression in E. coli has evolved under different conditions and may intrinsically contain deleterious rare codons and mRNA secondary structure when cloned into an expression vector. Deleterious rare codons and mRNA secondary structure features are particularly problematic when expressing domains or specific segments of target proteins; e.g., gene segments coding for fragments other than the native N-terminal region of the protein have not evolved to provide for efficient translation initiation. Total gene synthesis, or the chemical synthesis of a protein coding region, may address these problems to some extent, since the DNA sequence can be optimized to reduce these issues (Quan et al., 2011). However, the costs of total gene synthesis are prohibitive for large sets of protein targets, and the approach is generally not suitable for large-scale screening or projects involving expression of many different proteins.

This invention is based, at least in part, on an unexpected discovery of a new methodology for achieving high-level production of recombinant proteins and protein domains. RNA sequence optimization is a well-known approach for improving protein expression. A feature of the system described herein is that RNA sequence optimization is required only in DNA comprising the vector backbone, including the DNA coding for the 5'-UTR and a common N-terminal polypeptide tag. Each target gene, coding for various target proteins, that is cloned into this vector backbone, need not be optimized individually. Hence, the optimized vector backbone can be used to enhance expression of many different target proteins without the need for target-protein-specific gene sequence optimization. Unlike certain previous methods, gene-by-gene RNA transcript sequence optimization is not required in certain embodiments of the methods described herein. The methodology includes, among others, jointly designing and optimizing sequences encoding 5' untranslated and 5' translated regions of the mRNA transcript produced by an expression vector so as to minimize RNA secondary structure and/or optimize codon usage in the mRNA transcript.

In one aspect, this invention addresses, among others, the problems associated with mRNA secondary structure and codon bias. Accordingly, the invention provides systems for high-level production of recombinant proteins and protein domains based on the Transcript-Optimized Expression Enhancement Technology (TOEET). As disclosed herein, TOEET is used to design expression vectors that produce mRNA transcripts with minimal RNA secondary structure and optimum codon usage in the nucleotide region around the Ribosomal Binding Site (RBS) and the translation initiation site, as well as minimal RNA secondary structure and optimal codon usage in a region of the transcript coding for an N-terminal polypeptide tag that is encoded directly downstream of the translation initiation site. Optimization can extend up to approximately 100 or more nucleotides on each of the 5' and 3' sides of the RBS. This generally will involve producing a protein with an N-terminal polypeptide tag, which is called an Expression Enhancement Tag (EET). This EET may be designed with other features that support protein production, such as solubility enhancing properties or affinity purification sequence motifs. Solubility enhancing tags known from the literature include the maltose-binding protein, the B1 domain of protein G, and a domain of myxococcus protein S, to name a few representative examples. Expression vectors designed with TOEET allow most genes of interest to be produced with enhanced expression.
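The secondary structure predictions referred to in the figures were made with GeneBee-NET. Purely as a crude stand-in for comparing candidate leader sequences, the following sketch counts the maximum number of nested base pairs a sequence can form using a Nussinov-style dynamic program. This is a swapped-in simplification that ignores folding energies and pseudoknots; the function name, the pairing rules (Watson-Crick plus G-U wobble), and the minimum hairpin loop of three bases are assumptions made for illustration.

```python
# Allowed pairs: Watson-Crick plus G-U wobble (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style dynamic program).

    Every hairpin must enclose at least min_loop unpaired bases.
    DNA input is accepted (T is read as U). No energies are computed.
    """
    s = rna.upper().replace("T", "U")
    n = len(s)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]  # dp[i][j]: best pair count within s[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]  # case: base i is left unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (s[i], s[k]) in PAIRS:  # case: base i pairs with base k
                    inside = dp[i + 1][k - 1]
                    outside = dp[k + 1][j] if k < j else 0
                    best = max(best, 1 + inside + outside)
            dp[i][j] = best
    return dp[0][n - 1]

# A perfect three-pair hairpin and an unstructured homopolymer.
hairpin_pairs = max_base_pairs("GGGAAACCC")
open_pairs = max_base_pairs("AAAAAAAAA")
```

Under these assumptions, a lower count over the region spanning the RBS and the start of the tag coding sequence would suggest fewer opportunities for base pairing; a thermodynamic folding program would be needed for an actual prediction.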

An advantage of the TOEET strategy over target gene optimization by total gene synthesis is that unless the 5' end of the synthetic gene is optimized in the context of the untranslated vector sequences, detrimental mRNA secondary structure may form near or around the RBS / translation initiation site. More specifically, even if the 5' translated region of the target gene is optimized by gene synthesis or by specific mutations, enhanced expression may not be realized unless the 5'-translated and 5'-untranslated regions of the transcript are jointly optimized, as described herein. Furthermore, by using a sufficiently long N-terminal EET tag, translated from an optimized RNA sequence that is encoded by the vector itself, there is no need to optimize the sequence of the target gene, avoiding the need for gene-specific synthesis or modification. This feature allows the TOEET technology to be used for target protein expression enhancement in high throughput applications, including expression screening studies and projects involving expression of many different proteins, where gene-specific synthesis or modification would be costly or impractical. The roughly 30 amino-acid residue (or larger) EETs effectively shift any deleterious RNA features of the target gene transcript significantly downstream of the RBS / translation initiation site, so that any potential RNA secondary structure formation with the 5' end of the transcript is avoided, and any RNA secondary structure within the RNA coded for by the target gene itself will likely have little or no effect on expression. This TOEET strategy, which is independent of the target gene sequence, could be used more generally to enhance the expression levels of proteins produced with almost any expression vector or system.
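The displacement described above follows from simple arithmetic: each residue of the tag is encoded by three nucleotides, so a tag of roughly 30 residues places the first nucleotide of the target gene about 90 nucleotides downstream of the start of the initiation codon. The sketch below makes this explicit; the function name is hypothetical, and it assumes the residue count includes the initiating methionine and that any additional linker nucleotides are supplied separately.

```python
CODON_LENGTH_NT = 3  # nucleotides per codon

def target_start_offset_nt(tag_residues, linker_nt=0):
    """Nucleotides from the first base of the initiation codon to the first
    base of the target coding sequence, for a tag of tag_residues residues."""
    if tag_residues < 0 or linker_nt < 0:
        raise ValueError("lengths must be non-negative")
    return tag_residues * CODON_LENGTH_NT + linker_nt

# A tag of roughly 30 residues, as mentioned in the text.
offset_30_residue_tag = target_start_offset_nt(30)
```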

Accordingly, certain embodiments of the invention provide a method of preparing an expression vector, wherein the expression vector comprises, in order of position: a first nucleic acid sequence encoding a 5' untranslated region (UTR) of an expressed mRNA that comprises a ribosome binding site (RBS); a second nucleic acid sequence encoding a polypeptide tag (i.e., at the N-terminal end of the expressed target protein); and a cloning site, wherein the cloning site enables a target protein coding sequence to be inserted into the vector in-frame with the second nucleic acid sequence to encode a fusion protein comprising the polypeptide tag and the target protein; and wherein the method comprises specifically modifying the nucleic acid sequence encoding (i) the 5' untranslated region and (ii) the adjacent polypeptide tag to minimize RNA secondary structure both within and/or between these two regions of the mRNA.

As used herein, a vector refers to a nucleic acid molecule capable of transporting another nucleic acid to which it has been linked. The vector can be capable of autonomous replication or integrate into a host DNA. Examples of the vector include a plasmid, cosmid, or viral vector. The vector of this invention includes a nucleic acid in a form suitable for expression of the nucleic acid in a host cell. Preferably the vector includes one or more regulatory sequences operatively linked to the nucleic acid sequence to be expressed. A "regulatory sequence" includes promoters, enhancers, repressor binding sites, and other expression control elements (e.g., polyadenylation signals). Regulatory sequences include those that direct constitutive expression of a nucleotide sequence, as well as tissue-specific regulatory and/or inducible sequences. For example, in certain embodiments of the invention, an expression vector described herein comprises a 5' upstream sequence encoding an operable promoter and associated regulatory sequences. The design of the expression vector can depend on such factors as the choice of the host cell to be transformed, the level of expression of protein desired, and the like.

As used herein, the 5'UTR of the encoded messenger RNA is transcribed from a promoter and includes a ribosome binding site several nucleotides preceding the start codon.

As used herein, a "cloning site" enables a sequence, such as, e.g., a target protein coding sequence, to be inserted into an expression vector. For example, the cloning site may be a multiple cloning site (MCS), also known as a polylinker, which is a short nucleic acid sequence that contains many restriction sites. For example, Figure 1 shows a multiple cloning site, comprising a series of restriction enzyme recognition sites. In certain embodiments, the sequence is inserted in-frame, enabling expression of the inserted sequence. In certain embodiments, after the sequence, such as, e.g., the target protein coding sequence, has been inserted into the cloning site of the vector, a portion of the cloning site remains as flanking sequence on one or both sides of the inserted sequence. In other embodiments, the cloning site no longer remains after the insertion of the sequence into the cloning site of the vector.

As described herein, the nucleic acid sequence encoding (i) the 5' untranslated region and (ii) the adjacent polypeptide tag may be specifically modified to minimize RNA secondary structure both within and/or between these two regions of the mRNA. In certain embodiments, one feature of the method described herein is that RNA optimization is required only in DNA comprising the vector backbone, including the DNA coding for the 5'-UTR and a common N-terminal polypeptide tag, and each gene coding for various target proteins, that is cloned into this vector backbone, need not be optimized individually. Accordingly, nucleic acids within the specific sequence encoding the 5' untranslated region and the adjacent polypeptide tag are replaced with different nucleic acids to minimize RNA secondary structure of the expressed mRNA as described herein. In particular, in certain embodiments, the RNA secondary structure is minimized in the region surrounding the RBS and/or translation initiation site of the expressed mRNA. For example, nucleic acids are replaced to reduce base pairing with the RBS and/or translation initiation site of the expressed mRNA. In certain embodiments, the nucleic acid sequence directly surrounding the RBS site and/or the translation initiation site (e.g., the consensus sequences and sequences between these two sites) is minimally modified or not modified. For example, after modification the RBS site and the translation initiation site remain functionally active. In certain embodiments, nucleotides within the nucleic acid sequence encoding the polypeptide tag are modified in a manner that results in silent mutations.
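As an illustration of the silent-mutation constraint, the following Python sketch checks whether a modified tag coding sequence still encodes the same polypeptide as the original. It assumes the standard genetic code; the function names `translate` and `is_silent` and the example sequences are hypothetical and are not taken from the disclosure.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate an in-frame DNA coding sequence into a one-letter protein string."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def is_silent(original: str, modified: str) -> bool:
    """Return True if both coding sequences encode the same polypeptide."""
    return translate(original) == translate(modified)
```

For example, under this sketch `is_silent("ATGCATCAT", "ATGCACCAC")` evaluates to `True`, because CAT and CAC both encode histidine, whereas changing a CAT codon to CAA (glutamine) would not be silent.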

Prediction of RNA secondary structure can be readily determined by one skilled in the art using techniques and tools known in the art. For example, a skilled artisan may use RNA structure prediction software, including CentroidFold (Hamada et al., 2009), CentroidHomfold (Hamada et al., 2009), CONTRAfold (Do et al., 2006), CyloFold (Bindewald et al.), KineFold (Xayaphoummine et al., 2005; Xayaphoummine et al., 2003), Mfold (Zuker and Stiegler, 1981), GeneBee-NET (Brodskii et al., 1995), Pknots (Rivas and Eddy, 1999), PknotsRG (Reeder et al., 2007), RNA123 (www.rna123.com), RNAfold (Gruber et al., 2008), RNAshapes (Voss et al., 2006), RNAstructure (Mathews et al., 2004), Sfold (Ding et al., 2004), UNAFold (Markham and Zuker, 2008), Crumple (Schroeder et al., 2011), and Sliding Windows & Assembly (Schroeder et al., 2011) among others.
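The tools listed above use thermodynamic or statistical models. Purely to illustrate the idea of scoring a transcript region for base-pairing potential, the following Python sketch counts the maximum number of nested base pairs with a Nussinov-style dynamic program. This is a crude proxy and not the method of any of the named programs; the function name `max_base_pairs`, the `min_loop` parameter, and the choice to allow G-U wobble pairs are assumptions made for this example.

```python
# Watson-Crick pairs plus G-U wobble pairs (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs, requiring at least `min_loop`
    unpaired nucleotides inside every hairpin loop."""
    rna = rna.upper().replace("T", "U")  # accept DNA input as well
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] holds the maximum pair count for the subsequence rna[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case: position i is unpaired
            for k in range(i + min_loop + 1, j + 1):
                if (rna[i], rna[k]) in PAIRS:  # case: i pairs with k
                    inside = best[i + 1][k - 1]
                    outside = best[k + 1][j] if k < j else 0
                    score = max(score, 1 + inside + outside)
            best[i][j] = score
    return best[0][n - 1]
```

A lower count for the window around the RBS and start codon suggests less opportunity for structure, but a real design would rely on one of the energy-based predictors named above.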

As described herein, a target protein may refer to any of the following non-limiting embodiments: a full-length naturally occurring protein, a polypeptide sequence corresponding to a fragment or domain of a naturally occurring protein sequence, a mutant or modified form of a full-length protein or protein fragment, or a polypeptide sequence coding for a non-natural protein, such as proteins that have been engineered or designed by artificial methods.

Certain embodiments of the invention provide a method of preparing an expression vector, wherein the expression vector comprises, in order of position, a 5' upstream sequence encoding an operable promoter and associated regulatory signals, a sequence encoding the 5' untranslated region of the messenger RNA transcribed from the promoter including a ribosome binding site several nucleotides preceding the translation start codon, a sequence beginning with the start codon encoding a polypeptide tag, and a cloning site that enables "target protein" coding sequences to be inserted into the vector in-frame with the polypeptide tag thus allowing their expression as fusions to the polypeptide tag, wherein the method comprises specifically modifying the entire sequence encoding the 5' untranslated region of the messenger RNA through and including the sequence encoding the polypeptide tag sequence in order to minimize RNA secondary structure upstream of the target insertion site.

In certain embodiments, the method further comprises specifically modifying the second nucleic acid sequence to reduce the presence of rare codons (i.e. mRNA codons for which the corresponding tRNAs are in low abundance in the host cell). For example, rare codons are replaced with high frequency codons to increase expression of any target protein expressed by the vector. Codons that are considered rare are dependent on the selected host cell that is used for expression of the vector and are known to and/or can be readily determined by one skilled in the art. For example, rare codons may be identified using computer software programs known in the art, for example, the Rare Codon Calculator (RaCC) for E. coli (http://nihserver.mbi.ucla.edu/RACC/), http://www.jcat.de/, or http://genomes.urv.es/OPTIMIZER/. In certain embodiments, the modified region of the nucleic acid sequence spans from the first 5' nucleotide in the expressed mRNA to the last nucleotide of the polypeptide tag.
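The rare-codon replacement described above can be sketched in Python as follows. The codon set and the replacement codons in `RARE_TO_COMMON` are illustrative assumptions for an E. coli host and are not taken from the disclosure; in practice the set would come from a codon usage table for the chosen host. The function names are likewise hypothetical.

```python
# Illustrative assumption: codons often treated as rare in E. coli, each mapped
# to a frequently used synonymous codon for the same amino acid.
RARE_TO_COMMON = {
    "AGG": "CGT",  # Arg
    "AGA": "CGT",  # Arg
    "CGA": "CGT",  # Arg
    "CTA": "CTG",  # Leu
    "ATA": "ATT",  # Ile
    "CCC": "CCG",  # Pro
}


def split_codons(cds: str) -> list[str]:
    """Split an in-frame coding sequence into complete codons."""
    cds = cds.upper()
    full = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, full, 3)]


def find_rare_codons(cds: str) -> list[tuple[int, str]]:
    """Return (codon_index, codon) for every rare codon in the sequence."""
    return [(i, c) for i, c in enumerate(split_codons(cds)) if c in RARE_TO_COMMON]


def replace_rare_codons(cds: str) -> str:
    """Swap each rare codon for its synonymous replacement; other codons and
    any trailing partial codon are left unchanged."""
    cds = cds.upper()
    full = len(cds) - len(cds) % 3
    return "".join(RARE_TO_COMMON.get(c, c) for c in split_codons(cds)) + cds[full:]
```

Because every replacement is synonymous, the encoded polypeptide tag is unchanged; only the codon choice differs.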

In certain embodiments, nucleotides within about the last 20 nucleotides of the first nucleic acid sequence are modified (i.e., from the nucleotide that directly precedes the encoded start codon to 20 nucleotides upstream). In certain embodiments, nucleotides within about the last, e.g., 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 525, 550, 575, 600, 625, 650, 675, 700, 725, 750, 775, 800, 825, 850, 875, 900, 925, 950, 975 or 1,000 nucleotides of the first nucleic acid sequence are modified.

In certain embodiments, nucleotides within about the first 20 nucleotides of the second nucleic acid sequence are modified (i.e., from the first nucleotide within the encoded start codon to 20 nucleotides downstream). In certain embodiments, nucleotides within about the first, e.g., 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 525, 550, 575, 600, 625, 650, 675, 700, 725, 750, 775, 800, 825, 850, 875, 900, 925, 950, 975 or 1,000 nucleotides of the second nucleic acid sequence are modified.
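The two windows described above (the last nucleotides of the first sequence and the first nucleotides of the second sequence) can be extracted with a short helper. This is a minimal Python sketch; the function name `initiation_window`, its parameters, and the example sequences are hypothetical.

```python
def initiation_window(utr5: str, tag_cds: str, upstream: int = 20, downstream: int = 20) -> str:
    """Return the last `upstream` nucleotides of the 5' UTR joined to the
    first `downstream` nucleotides of the tag coding sequence, which begins
    at the start codon."""
    up = utr5[-upstream:] if upstream > 0 else ""
    return up + tag_cds[:downstream]


# Hypothetical example: a UTR with a Shine-Dalgarno-like AGGAGG motif and a
# start codon followed by repeated histidine codons.
utr = "A" * 30 + "AGGAGG" + "TTTTTTTT"
tag = "ATG" + "CAT" * 10
window = initiation_window(utr, tag)
```

The resulting window is the region that would be passed to a structure predictor or rare-codon check when only the nucleotides nearest the start codon are modified.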

In certain embodiments, the expression vector further comprises a target protein coding sequence inserted into the vector in-frame with the nucleic acid tag sequence to encode a fusion protein comprising the polypeptide tag and the target protein.

In certain embodiments, the target protein coding sequence is not modified to minimize RNA secondary structure.

In certain embodiments, the target protein coding sequence is not modified to reduce the presence of rare codons.

In certain embodiments, the target protein coding sequence is modified to minimize RNA secondary structure.

In certain embodiments, the target protein coding sequence is modified to reduce the presence of rare codons.

As used herein, the second nucleic acid sequence encodes at least one polypeptide tag. In certain embodiments, the second nucleic acid sequence encodes more than one polypeptide tag. As used herein, when the second nucleic acid sequence encodes more than one polypeptide tag, the respective sequences that encode each polypeptide tag are joined in-frame to result in a fusion protein that comprises each polypeptide tag. In certain embodiments, the second nucleic acid sequence encodes, e.g., two, three, four, five, etc. polypeptide tags.

As used herein, the second nucleic acid sequence may encode any polypeptide tag appropriate to the particular chosen application or selected target protein (e.g., an affinity purification tag and/or a solubility enhancement tag). Polypeptide tags are known to those skilled in the art. For example, the encoded polypeptide tag may be an Avi-tag, Calmodulin-tag, FLAG-tag, HA-tag, His-tag, Myc-tag, S-tag, SBP-tag, Softag 1, Softag 3, V5 tag, Xpress tag, Isopeptag, Spy tag, BCCP, Glutathione-S-transferase-tag, Green fluorescent protein-tag, Maltose binding protein-tag, Nus-tag, Strep-tag, Thioredoxin-tag, TC tag, Ty tag, Nano-tag, Halo-tag, protein G B1 domain tag, a myxococcus protein S tag or Protein A tag.

Accordingly, in certain embodiments, the at least one encoded polypeptide tag is selected from an Avi-tag, Calmodulin-tag, FLAG-tag, HA-tag, His-tag, Myc-tag, S-tag, SBP-tag, Softag 1, Softag 3, V5 tag, Xpress tag, Isopeptag, Spy tag, BCCP, Glutathione-S-transferase-tag, Green fluorescent protein-tag, Maltose binding protein-tag, Nus-tag, Strep-tag, Thioredoxin-tag, TC tag, Ty tag, Nano-tag, Halo-tag, protein G B1 domain tag, a myxococcus protein S tag or Protein A tag.

In certain embodiments, the second nucleic acid sequence encodes at least one affinity purification tag.

In certain embodiments, the second nucleic acid sequence encodes more than one affinity purification tag.

In certain embodiments, the second nucleic acid sequence encodes two affinity purification tags.

In certain embodiments, the encoded affinity purification tag(s) is/are selected from a Streptavidin binding moiety, a maltose binding protein moiety, and a HIS tag.

In certain embodiments, the Streptavidin binding moiety is a Nano-tag or a biotinylated Avi-tag. In certain embodiments, the second nucleic acid sequence encodes no affinity purification tags.

In certain embodiments, the second nucleic acid sequence encodes at least one solubility enhancement tag.

In certain embodiments, the second nucleic acid sequence encodes more than one solubility enhancement tag.

In certain embodiments, the second nucleic acid sequence encodes two solubility enhancement tags.

In certain embodiments, the encoded solubility enhancement tag(s) is/are selected from a maltose binding protein tag, a protein G B1 domain tag, and a myxococcus protein S tag.

In certain embodiments, the second nucleic acid sequence encodes no solubility enhancement tags.

In certain embodiments, the second nucleic acid sequence further encodes at least one protease recognition site. In certain embodiments, the second nucleic acid sequence encodes more than one protease recognition site.

As used herein, when the second nucleic acid sequence further encodes a protease recognition site(s), the sequence that encodes this/these site(s) is/are inserted in-frame with the sequence(s) that encode the at least one polypeptide tag to result in a fusion protein that comprises the polypeptide tag(s) and the protease recognition site(s). In certain embodiments, the encoded protease recognition site(s) is/are downstream of the encoded polypeptide tag(s). In certain embodiments, the encoded protease recognition site is/are between a series of encoded polypeptide tag(s).

In certain embodiments, the protease recognition site(s) is/are a Tobacco Etch Virus (TEV), Thrombin, Factor Xa and/or a human rhinovirus (HRV) 3C (e.g., PreScission Protease, GE Healthcare Life Sciences, Pittsburgh, PA) protease recognition site.

As described herein, the PreScission Protease is a genetically engineered protein consisting of human rhinovirus 3C protease. It is often produced as a fusion protein with a hexaHis or GST affinity purification tag. It specifically cleaves between the Gln and Gly residues of the recognition sequence of LeuGluValLeuPheGln/GlyPro. In certain embodiments, the second nucleic acid sequence is at least about 21 nucleotides in length. In certain embodiments, the second nucleic acid sequence is at least about, e.g., 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96, 99, 102, 105, 108, 111, 114, 117, 120, 123, 126, 129, 132, 135, 138, 141, 144, 147, 150, 201, 252, 303, 354, 405, 456, 507, 558, 609, 660, 711, 762, 813, 864, 915, 966, or 1,017 nucleotides in length.
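To make the cleavage position concrete, the following Python sketch locates the HRV 3C recognition sequence LeuGluValLeuPheGln/GlyPro (one-letter code LEVLFQ/GP) in a fusion protein and splits the chain between the Gln and Gly residues. The function name `cleave_hrv3c` and the example fusion sequence are hypothetical, and exact string matching is a simplification of real protease specificity.

```python
HRV3C_SITE = "LEVLFQGP"  # LeuGluValLeuPheGln/GlyPro; the cut follows Gln
CUT_OFFSET = 6           # number of site residues on the N-terminal side of the cut


def cleave_hrv3c(protein: str) -> list[str]:
    """Split a protein at every exact HRV 3C site, cutting between Gln and Gly."""
    fragments = []
    start = 0
    pos = protein.find(HRV3C_SITE)
    while pos != -1:
        cut = pos + CUT_OFFSET  # index of the Gly that follows the cleaved bond
        fragments.append(protein[start:cut])
        start = cut
        pos = protein.find(HRV3C_SITE, cut)
    fragments.append(protein[start:])
    return fragments
```

With a hypothetical fusion such as `"MHHHHHHLEVLFQGPSTARGET"`, the sketch returns the tag fragment ending in Gln and a second fragment that begins with GlyPro, which are the residues left on the released protein after cleavage.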

In certain embodiments, the target protein coding sequence encodes a transcription factor, a transcription factor domain, an epigenetic regulatory factor, or an epigenetic regulatory factor domain.

In certain embodiments, the target protein coding sequence encodes a polypeptide sequence described in Table 2. As described herein, the target protein coding sequence may also encode a polypeptide sequence that has substantial identity to or is a functional equivalent of a polypeptide sequence described in Table 2.

In certain embodiments, the target protein coding sequence encodes a protein antigen for producing an affinity capture reagent.

In certain embodiments, the affinity capture reagent is an antibody, an antibody fragment, or an aptamer.

In certain embodiments, the target protein coding sequence encodes a protein antigen for producing an antibody or Fab by phage display.

In certain embodiments, the expression of the target protein is about 1.5 fold greater than the expression of a target protein generated from an expression vector that was not modified as described herein. In certain embodiments, the expression of the target protein is, e.g., about 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5, 12, 12.5, 13, 13.5, 14, 14.5, 15, 15.5, 16, 16.5, 17, 17.5, 18, 18.5, 19, 19.5, or 20, etc., fold greater than the expression of a target protein generated from an expression vector that was not modified as described herein.

As described herein, in certain embodiments, expression of a target protein from a vector that is not TOEET modified as described herein is undetectable, whereas expression of the same target protein from a vector that has been modified as described herein is detectable. Certain embodiments of the invention provide an expression vector prepared using a method as described herein.

Certain embodiments of the invention provide an expression vector (e.g., a target protein expression vector) comprising, in order of position: a first nucleic acid sequence encoding a 5' untranslated region of an expressed mRNA that comprises a ribosome binding site (RBS); a second nucleic acid sequence encoding a polypeptide tag; and a cloning site, wherein the cloning site enables a target protein coding sequence to be inserted into the vector in-frame with the second nucleic acid sequence to encode a fusion protein comprising the polypeptide tag and the target protein; and wherein the nucleic acid sequence encoding (i) the 5' untranslated region and (ii) the adjacent polypeptide tag has been specifically modified to minimize RNA secondary structure both within and/or between these two regions of the mRNA.

In certain embodiments, the second nucleic acid sequence has been specifically modified to reduce the presence of rare codons.

In certain embodiments, the modified region of the nucleic acid sequence spans from the first 5' nucleotide in the expressed mRNA to the last nucleotide of the polypeptide tag.

In certain embodiments, nucleotides within about the last 20 nucleotides of the first nucleic acid sequence have been modified (i.e., from the nucleotide that directly precedes the encoded start codon to 20 nucleotides upstream). In certain embodiments, nucleotides within about the last, e.g., 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 525, 550, 575, 600, 625, 650, 675, 700, 725, 750, 775, 800, 825, 850, 875, 900, 925, 950, 975 or 1,000 nucleotides of the first nucleic acid sequence have been modified.

In certain embodiments, nucleotides within about the first 20 nucleotides of the second nucleic acid sequence have been modified (i.e., from the first nucleotide within the encoded start codon to 20 nucleotides downstream). In certain embodiments, nucleotides within about the first, e.g., 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 525, 550, 575, 600, 625, 650, 675, 700, 725, 750, 775, 800, 825, 850, 875, 900, 925, 950, 975 or 1,000 nucleotides of the second nucleic acid sequence have been modified.

In certain embodiments, an expression vector as described herein further comprises a target protein coding sequence inserted into the vector in-frame with the nucleic acid tag sequence to encode a fusion protein comprising the polypeptide tag and the target protein.

In certain embodiments, the target protein coding sequence has not been modified to minimize RNA secondary structure.

In certain embodiments, the target protein coding sequence has not been modified to eliminate rare codons.

In certain embodiments, the target protein coding sequence has been modified to minimize RNA secondary structure.

In certain embodiments, the target protein coding sequence has been modified to eliminate rare codons.

In certain embodiments, the second nucleic acid sequence encodes at least one affinity purification tag.

In certain embodiments, the second nucleic acid sequence encodes more than one polypeptide tag. As used herein, when the second nucleic acid sequence encodes more than one polypeptide tag, the respective sequences that encode each polypeptide tag are joined in-frame to result in a fusion protein that comprises each polypeptide tag. In certain embodiments, the second nucleic acid sequence encodes, e.g., two, three, four, five, etc. polypeptide tags.

As used herein, the second nucleic acid sequence may encode any polypeptide tag appropriate to the particular chosen application or selected target protein (e.g., an affinity purification tag or a solubility enhancement tag).

Polypeptide tags are known to those skilled in the art. For example, the encoded polypeptide tag may be an Avi-tag, Calmodulin-tag, FLAG-tag, HA-tag, His-tag, Myc-tag, S-tag, SBP-tag, Softag 1, Softag 3, V5 tag, Xpress tag, Isopeptag, Spy tag, BCCP, Glutathione-S-transferase-tag, Green fluorescent protein-tag, Maltose binding protein-tag, Nus-tag, Strep-tag, Thioredoxin-tag, TC tag, Ty tag, Nano-tag, Halo-tag, protein G B1 domain tag, a myxococcus protein S tag or Protein A tag.

Accordingly, in certain embodiments, the at least one encoded polypeptide tag is selected from an Avi-tag, Calmodulin-tag, FLAG-tag, HA-tag, His-tag, Myc-tag, S-tag, SBP-tag, Softag 1, Softag 3, V5 tag, Xpress tag, Isopeptag, Spy tag, BCCP, Glutathione-S-transferase-tag, Green fluorescent protein-tag, Maltose binding protein-tag, Nus-tag, Strep-tag, Thioredoxin-tag, TC tag, Ty tag, Nano-tag, Halo-tag, protein G B1 domain tag, a myxococcus protein S tag or Protein A tag.

In certain embodiments, the second nucleic acid sequence encodes more than one affinity purification tag.

In certain embodiments, the second nucleic acid sequence encodes two affinity purification tags.

In certain embodiments, the encoded affinity purification tag(s) is/are selected from a Streptavidin binding moiety, a maltose binding protein moiety, and a HIS tag.

In certain embodiments the Streptavidin binding moiety is a Nano-tag or a biotinylated Avi-tag.

In certain embodiments, the second nucleic acid sequence encodes no affinity purification tags.

In certain embodiments, the second nucleic acid sequence encodes at least one solubility enhancement tag.

In certain embodiments, the second nucleic acid sequence encodes more than one solubility enhancement tag.

In certain embodiments, the second nucleic acid sequence encodes two solubility enhancement tags.

In certain embodiments, the encoded solubility enhancement tag(s) is/are selected from a maltose binding protein tag, a protein G B1 domain tag, and a myxococcus protein S tag.

In certain embodiments, the second nucleic acid sequence encodes at least one protease recognition site.

As used herein, when the second nucleic acid sequence further encodes a protease recognition site(s), the sequence that encodes this/these site(s) is/are inserted in-frame with the sequence(s) that encode the at least one polypeptide tag to result in a fusion protein that comprises the polypeptide tag(s) and the protease recognition site(s). In certain embodiments, the encoded protease recognition site(s) is/are downstream of the encoded polypeptide tag(s). In certain embodiments, the encoded protease recognition site is/are between a series of encoded polypeptide tag(s). In certain embodiments, the protease recognition site(s) is/are a Tobacco Etch Virus (TEV), Thrombin, Factor Xa and/or a HRV 3C protease recognition site.

In certain embodiments, the second nucleic acid sequence is at least about 21 nucleotides in length. In certain embodiments, the second nucleic acid sequence is at least about, e.g., 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96, 99, 102, 105, 108, 111, 114, 117, 120, 123, 126, 129, 132, 135, 138, 141, 144, 147, 150, 201, 252, 303, 354, 405, 456, 507, 558, 609, 660, 711, 762, 813, 864, 915, 966, or 1,017 nucleotides in length.

In certain embodiments, the target protein coding sequence encodes a transcription factor, a transcription factor domain, an epigenetic regulatory factor, or an epigenetic regulatory factor domain.

In certain embodiments, the target protein coding sequence encodes a polypeptide sequence described in Table 2. As described herein, the target protein coding sequence may also encode a polypeptide sequence that has substantial identity to or is a functional equivalent of a polypeptide sequence described in Table 2.

In certain embodiments, the target protein coding sequence encodes a protein antigen for producing an affinity capture reagent.

In certain embodiments, the affinity capture reagent is an antibody, an antibody fragment, or an aptamer.

In certain embodiments, the target protein coding sequence encodes a protein antigen for producing an antibody or Fab by phage display.

In certain embodiments, the target protein is expressed at about a 1.5 fold higher level than a target protein generated from an expression vector that was not modified as described herein. In certain embodiments, the target protein is expressed at about, e.g., a 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5, 12, 12.5, 13, 13.5, 14, 14.5, 15, 15.5, 16, 16.5, 17, 17.5, 18, 18.5, 19, 19.5, or 20, etc., fold higher level than a target protein generated from an expression vector that was not modified as described herein.

As described herein, in certain embodiments, expression of a target protein from a vector not modified as described herein is undetectable, whereas expression of the same target protein from a vector that has been modified as described herein is detectable.

Certain embodiments of the invention provide a host cell comprising the expression vector as described herein. Host cells are used for the expression of vectors and are known in the art. For example, a host cell may be a bacterial cell, such as E. coli.

Certain embodiments of the invention provide a method for expressing a target protein in a host cell, comprising culturing the host cell as described herein for a period of time under conditions permitting expression of the target protein.

In certain embodiments, the target protein is a protein antigen for producing an affinity capture reagent.

In certain embodiments, the affinity capture reagent is an antibody, an antibody fragment, or an aptamer.

In certain embodiments, the target protein is a protein antigen for producing an antibody or Fab by phage display.

In one aspect, the invention features a method of designing an expression vector for expressing a recombinant protein in a host cell, e.g., a bacterial cell (such as an E. coli cell). The method includes steps of: obtaining a first sequence encoding the recombinant protein; obtaining an expression vector containing an insertion site for the first sequence, wherein once inserted at the insertion site, the first sequence is joined in frame with a 5' sequence from the expression vector to form a first fusion sequence that encodes an RNA sequence, the RNA sequence having a Ribosomal Binding Site (RBS) and a translation initiation site; modifying the RNA sequence by (i) designing the RNA sequence so as to minimize RNA secondary structure in a region around the RBS site or translation initiation site, or (ii) optimizing codon usage in the RNA sequence based on codon usage of the host cell, to obtain a second fusion sequence; and cloning the second fusion sequence into the expression vector in such a way as to replace the first fusion sequence.

In one embodiment, the designing step or optimizing step is carried out using Transcript-Optimized Expression Enhancement Technology (TOEET) as shown and described herein. In another, the designing step or optimizing step is carried out by introducing a third sequence encoding a N-terminal polypeptide expression-enhancement tag (EET) directly downstream of the initiation site.

The expression-enhancement tag can be an affinity purification tag, such as one having the sequence of an Avi tag, a Nano-tag, or a 6xHis tag.

In a second aspect, the invention provides an expression vector that is designed using the method described above. In the expression vector, the second fusion sequence can have a sequence selected from the sequences shown in Figure 1. In one example, the expression vector is selected from the group consisting of pNESG_Avi6HT and pNESG_Nano6HT. The invention also provides a host cell having the expression vector.

In a third aspect, the invention features a method for increasing the expression and solubility of a recombinant protein in a host cell. The method includes obtaining the just described host cell; culturing the host cell in a culture for a period of time; and recovering the recombinant protein from the host cell or the culture. To that end, the recombinant protein can be a protein antigen for producing an affinity capture reagent (such as an antibody, an antibody fragment, or an aptamer) or a protein antigen for producing an antibody or Fab by phage display.

In a fourth aspect, the invention provides an immunogenic composition having the recombinant protein produced by the method described above. The composition can be administered to a subject in need thereof for generating an immune response in the subject.

In a fifth aspect, the invention provides a method of generating an antibody (either polyclonal or monoclonal) by, among others, administering to a subject the immunogenic composition described above.

The invention also provides an isolated polypeptide, a nucleic acid encoding it, a high throughput method for identifying a soluble protein or protein domain, and a high throughput method for isolating a soluble protein or protein domain substantially as shown and described herein.

The term "nucleic acid" refers to deoxyribonucleotides (DNA, e.g., a cDNA or genomic DNA), ribonucleotides (RNA, e.g., an mRNA), or a DNA or RNA analog and polymers thereof, in either single- or double-stranded form, but preferably is double-stranded DNA, made of monomers (nucleotides) containing a sugar, phosphate and a base that is either a purine or pyrimidine. A DNA or RNA analog can be synthesized from nucleotide analogs. Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides that have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions) and complementary sequences, as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues.

The term "nucleotide sequence" refers to a polymer of DNA or RNA which can be single-stranded or double-stranded, optionally containing synthetic, non-natural or altered nucleotide bases capable of incorporation into DNA or RNA polymers. The terms "nucleic acid," "nucleic acid molecule," or "polynucleotide" are used interchangeably.

Certain embodiments of the invention encompass isolated or substantially purified nucleic acid compositions. An "isolated nucleic acid" is a nucleic acid the structure of which is not identical to that of any naturally occurring nucleic acid or to that of any fragment of a naturally occurring genomic nucleic acid. The term therefore covers, for example, (a) a DNA which has the sequence of part of a naturally occurring genomic DNA molecule but is not flanked by both of the coding sequences that flank that part of the molecule in the genome of the organism in which it naturally occurs; (b) a nucleic acid incorporated into a vector or into the genomic DNA of a prokaryote or eukaryote in a manner such that the resulting molecule is not identical to any naturally occurring vector or genomic DNA; (c) a separate molecule such as a cDNA, a genomic fragment, a fragment produced by polymerase chain reaction (PCR), or a restriction fragment; and (d) a recombinant nucleotide sequence that is part of a hybrid gene, i.e., a gene encoding a fusion protein. Specifically excluded from this definition are nucleic acids present in mixtures of different (i) DNA molecules, (ii) transfected cells, or (iii) cell clones, e.g., as these occur in a DNA library such as a cDNA or genomic DNA library. The nucleic acid described above can be used to express a fusion protein of this invention. For this purpose, one can operatively link the nucleic acid to suitable regulatory sequences to generate an expression vector.

The following terms are used to describe the sequence relationships between two or more nucleotide sequences: (a) "reference sequence," (b) "comparison window," (c) "sequence identity," (d) "percentage of sequence identity," and (e) "substantial identity."

(a) As used herein, "reference sequence" is a defined sequence used as a basis for sequence comparison. A reference sequence may be a subset or the entirety of a specified sequence; for example, as a segment of a full-length cDNA or gene sequence, or the complete cDNA or gene sequence.

(b) As used herein, "comparison window" makes reference to a contiguous and specified segment of a polynucleotide sequence, wherein the polynucleotide sequence in the comparison window may comprise additions or deletions (i.e., gaps) compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. Generally, the comparison window is at least 20 contiguous nucleotides in length, and optionally can be 30, 40, 50, 100, or longer. Those of skill in the art understand that to avoid a high similarity to a reference sequence due to inclusion of gaps in the polynucleotide sequence a gap penalty is typically introduced and is subtracted from the number of matches.

Methods of alignment of sequences for comparison are well-known in the art. Thus, the determination of percent identity between any two sequences can be accomplished using a mathematical algorithm. Non-limiting examples of such mathematical algorithms are the algorithm of Myers and Miller (Myers and Miller, CABIOS, 4, 11 (1988)); the local homology algorithm of Smith et al. (Smith et al., Adv. Appl. Math., 2, 482 (1981)); the homology alignment algorithm of Needleman and Wunsch (Needleman and Wunsch, JMB, 48, 443 (1970)); the search-for-similarity method of Pearson and Lipman (Pearson and Lipman, Proc. Natl. Acad. Sci. USA, 85, 2444 (1988)); and the algorithm of Karlin and Altschul (Karlin and Altschul, Proc. Natl. Acad. Sci. USA, 87, 2264 (1990)), modified as in Karlin and Altschul (Karlin and Altschul, Proc. Natl. Acad. Sci. USA, 90, 5873 (1993)).
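As an illustration of the kind of mathematical algorithm referred to above, the following is a minimal sketch of a Needleman and Wunsch style global alignment by dynamic programming. It is not taken from any of the cited programs. The function name `needleman_wunsch`, the linear gap penalty, and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (illustrative scoring values)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, aligned_a, aligned_b = needleman_wunsch("ACGT", "AGT")
print(best, aligned_a, aligned_b)
```

The gap penalty plays the role described for the comparison window: each gap is subtracted from the score, so an alignment cannot reach a high score simply by inserting gaps.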

Computer implementations of these mathematical algorithms can be utilized for comparison of sequences to determine sequence identity. Such implementations include, but are not limited to: CLUSTAL in the PC/Gene program (available from Intelligenetics, Mountain View, Calif.); the ALIGN program (Version 2.0) and GAP, BESTFIT, BLAST, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Version 8 (available from Genetics Computer Group (GCG), 575 Science Drive, Madison, Wis., USA). Alignments using these programs can be performed using the default parameters. The CLUSTAL program is well described by Higgins et al. (Higgins et al., CABIOS, 5, 151 (1989)); Corpet et al. (Corpet et al., Nucl. Acids Res., 16, 10881 (1988)); Huang et al. (Huang et al., CABIOS, 8, 155 (1992)); and Pearson et al. (Pearson et al., Meth. Mol. Biol., 24, 307 (1994)). The ALIGN program is based on the algorithm of Myers and Miller, supra. The BLAST programs of Altschul et al. (Altschul et al., JMB, 215, 403 (1990)) are based on the algorithm of Karlin and Altschul, supra.

Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information. This algorithm involves first identifying high scoring sequence pairs (HSPs) by identifying short words of length W in the query sequence, which either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold. These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are then extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Cumulative scores are calculated using, for nucleotide sequences, the parameters M (reward score for a pair of matching residues; always >0) and N (penalty score for mismatching residues; always <0). For amino acid sequences, a scoring matrix is used to calculate the cumulative score. Extension of the word hits in each direction is halted when the cumulative alignment score falls off by the quantity X from its maximum achieved value, the cumulative score goes to zero or below due to the accumulation of one or more negative-scoring residue alignments, or the end of either sequence is reached.
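The extension step described above can be sketched as follows. This is a simplified illustration of ungapped extension in one direction for nucleotide sequences, not the NCBI implementation. The function name `extend_right`, the assumption that the seed is an exact match of length `word_len`, and the value X=20 are hypothetical; M=5 and N=-4 follow the BLASTN defaults quoted later in this section.

```python
def extend_right(query, subject, q_start, s_start, word_len, M=5, N=-4, X=20):
    """Extend a seed word hit to the right without gaps.

    Returns (best_score, best_length), where best_length counts residues
    from the start of the seed up to the highest-scoring extension point.
    """
    # The seed is assumed to be an exact match, so it scores M per residue.
    score = best = M * word_len
    best_len = word_len
    i, j = q_start + word_len, s_start + word_len
    # Stop when the end of either sequence is reached.
    while i < len(query) and j < len(subject):
        score += M if query[i] == subject[j] else N
        i += 1
        j += 1
        if score > best:
            best, best_len = score, i - q_start
        # Stop when the score has fallen X below its maximum,
        # or has dropped to zero or below.
        if best - score >= X or score <= 0:
            break
    return best, best_len


print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, word_len=4, X=8))
```

A full implementation would apply the same procedure leftward from the seed and combine the two extensions into a single HSP.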

In addition to calculating percent sequence identity, the BLAST algorithm also performs a statistical analysis of the similarity between two sequences. One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), which provides an indication of the probability by which a match between two nucleotide or amino acid sequences would occur by chance. For example, a test nucleic acid sequence is considered similar to a reference sequence if the smallest sum probability in a comparison of the test nucleic acid sequence to the reference nucleic acid sequence is less than about 0.1, less than about 0.01, or even less than about 0.001.

To obtain gapped alignments for comparison purposes, Gapped BLAST (in BLAST 2.0) can be utilized. Alternatively, PSI-BLAST (in BLAST 2.0) can be used to perform an iterated search that detects distant relationships between molecules. When utilizing BLAST, Gapped BLAST, or PSI-BLAST, the default parameters of the respective programs (e.g., BLASTN for nucleotide sequences, BLASTX for proteins) can be used. The BLASTN program (for nucleotide sequences) uses as defaults a wordlength (W) of 11, an expectation (E) of 10, a cutoff of 100, M=5, N=-4, and a comparison of both strands. For amino acid sequences, the BLASTP program uses as defaults a wordlength (W) of 3, an expectation (E) of 10, and the BLOSUM62 scoring matrix. Alignment may also be performed manually by inspection.

For purposes of the present invention, comparison of nucleotide sequences for determination of percent sequence identity to the promoter sequences disclosed herein may be made using the BlastN program (version 1.4.7 or later) with its default parameters or any equivalent program. By "equivalent program" is intended any sequence comparison program that, for any two sequences in question, generates an alignment having identical nucleotide or amino acid residue matches and an identical percent sequence identity when compared to the corresponding alignment generated by the program.

(c) As used herein, "sequence identity" or "identity" in the context of two nucleic acid or polypeptide sequences makes reference to a specified percentage of residues in the two sequences that are the same when aligned for maximum correspondence over a specified comparison window, as measured by sequence comparison algorithms or by visual inspection. When percentage of sequence identity is used in reference to proteins it is recognized that residue positions which are not identical often differ by conservative amino acid substitutions, where amino acid residues are substituted for other amino acid residues with similar chemical properties (e.g., charge or hydrophobicity) and therefore do not change the functional properties of the molecule. When sequences differ in conservative substitutions, the percent sequence identity may be adjusted upwards to correct for the conservative nature of the substitution. Sequences that differ by such conservative substitutions are said to have "sequence similarity" or "similarity." Means for making this adjustment are well known to those of skill in the art. Typically this involves scoring a conservative substitution as a partial rather than a full mismatch, thereby increasing the percentage sequence identity. Thus, for example, where an identical amino acid is given a score of 1 and a non-conservative substitution is given a score of zero, a conservative substitution is given a score between zero and 1. The scoring of conservative substitutions is calculated, e.g., as implemented in the program PC/GENE (Intelligenetics, Mountain View, Calif.).
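The partial-credit scoring described above can be sketched as follows. This is an illustrative example only and does not reproduce the PC/GENE scoring. The function names, the three residue groups used to decide what counts as conservative, and the partial score of 0.5 are assumptions; the inputs are assumed to be pre-aligned, ungapped, and of equal length.

```python
# Hypothetical groups of residues treated as conservative replacements
# for one another (one-letter amino acid codes).
CONSERVATIVE_GROUPS = [set("KRH"), set("DE"), set("AVLIPFMW")]


def residue_score(x, y, partial=0.5):
    """Score 1 for identity, `partial` for a conservative substitution, else 0."""
    if x == y:
        return 1.0
    if any(x in group and y in group for group in CONSERVATIVE_GROUPS):
        return partial
    return 0.0


def adjusted_identity(seq1, seq2, partial=0.5):
    """Percent identity adjusted upward for conservative substitutions."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    total = sum(residue_score(x, y, partial) for x, y in zip(seq1, seq2))
    return 100.0 * total / len(seq1)


print(adjusted_identity("KDAG", "RDAS"))               # K/R earns partial credit
print(adjusted_identity("KDAG", "RDAS", partial=0.0))  # strict identity
```

Setting `partial` to zero recovers plain percent identity, which makes the size of the upward adjustment easy to see for a given pair of sequences.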

(d) As used herein, "percentage of sequence identity" means the value determined by comparing two optimally aligned sequences over a comparison window, wherein the portion of the polynucleotide sequence in the comparison window may comprise additions or deletions (i.e., gaps) as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. The percentage is calculated by determining the number of positions at which the identical nucleic acid base or amino acid residue occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison, and multiplying the result by 100 to yield the percentage of sequence identity.
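The calculation in (d) can be written out directly. The sketch below assumes the two sequences have already been optimally aligned and padded with `-` for gaps, so that both strings span the same comparison window; the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_ref, aligned_test):
    """Matched positions divided by window length, multiplied by 100."""
    if len(aligned_ref) != len(aligned_test) or not aligned_ref:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    window = len(aligned_ref)
    # A position matches only when both sequences carry the same residue;
    # a gap character never counts as a match.
    matched = sum(1 for r, t in zip(aligned_ref, aligned_test)
                  if r == t and r != "-")
    return 100.0 * matched / window


print(percent_identity("ACGTACGT", "ACGAAC-T"))  # 6 of 8 positions match
```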

(e)(i) The term "substantial identity" of polynucleotide sequences means that a polynucleotide comprises a sequence that has at least 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, or 94%, or even at least 95%, 96%, 97%, 98%, or 99% sequence identity, compared to a reference sequence using one of the alignment programs described using standard parameters. One of skill in the art will recognize that these values can be appropriately adjusted to determine corresponding identity of proteins encoded by two nucleotide sequences by taking into account codon degeneracy, amino acid similarity, reading frame positioning, and the like. Substantial identity of amino acid sequences for these purposes normally means sequence identity of at least 70%, 80%, 90%, or even at least 95%.

Another indication that nucleotide sequences are substantially identical is if two molecules hybridize to each other under stringent conditions. Generally, stringent conditions are selected to be about 5°C lower than the thermal melting point (Tm) for the specific sequence at a defined ionic strength and pH.

However, stringent conditions encompass temperatures in the range of about 1°C to about 20°C, depending upon the desired degree of stringency as otherwise qualified herein. Nucleic acids that do not hybridize to each other under stringent conditions are still substantially identical if the polypeptides they encode are substantially identical. This may occur, e.g., when a copy of a nucleic acid is created using the maximum codon degeneracy permitted by the genetic code. One indication that two nucleic acid sequences are substantially identical is when the polypeptide encoded by the first nucleic acid is immunologically cross reactive with the polypeptide encoded by the second nucleic acid.

(e)(ii) The term "substantial identity" in the context of a peptide indicates that a peptide comprises a sequence with at least 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, or 94%, or even 95%, 96%, 97%, 98% or 99%, sequence identity to the reference sequence over a specified comparison window. In certain embodiments, optimal alignment is conducted using the homology alignment algorithm of Needleman and Wunsch (Needleman and Wunsch, JMB, 48, 443 (1970)). An indication that two peptide sequences are substantially identical is that one peptide is immunologically reactive with antibodies raised against the second peptide. Thus, a peptide is substantially identical to a second peptide, for example, where the two peptides differ only by a conservative substitution. Thus, certain embodiments of the invention provide nucleic acid molecules that are substantially identical to the nucleic acid molecules described herein.

For sequence comparison, typically one sequence acts as a reference sequence to which test sequences are compared. When using a sequence comparison algorithm, test and reference sequences are input into a computer, subsequence coordinates are designated if necessary, and sequence algorithm program parameters are designated. The sequence comparison algorithm then calculates the percent sequence identity for the test sequence(s) relative to the reference sequence, based on the designated program parameters. As noted above, another indication that two nucleic acid sequences are substantially identical is that the two molecules hybridize to each other under stringent conditions. The phrase "hybridizing specifically to" refers to the binding, duplexing, or hybridizing of a molecule only to a particular nucleotide sequence under stringent conditions when that sequence is present in a complex mixture (e.g., total cellular) DNA or RNA. "Bind(s) substantially" refers to complementary hybridization between a probe nucleic acid and a target nucleic acid and embraces minor mismatches that can be accommodated by reducing the stringency of the hybridization media to achieve the desired detection of the target nucleic acid sequence.

"Stringent hybridization conditions" and "stringent hybridization wash conditions" in the context of nucleic acid hybridization experiments such as Southern and Northern hybridizations are sequence dependent, and are different under different environmental parameters. Longer sequences hybridize specifically at higher temperatures. The thermal melting point (Tm) is the temperature (under defined ionic strength and pH) at which 50% of the target sequence hybridizes to a perfectly matched probe. Specificity is typically the function of post-hybridization washes, the critical factors being the ionic strength and temperature of the final wash solution. For DNA-DNA hybrids, the Tm can be approximated from the equation of Meinkoth and Wahl (1984): Tm = 81.5°C + 16.6 (log M) + 0.41 (%GC) - 0.61 (% form) - 500/L; where M is the molarity of monovalent cations, %GC is the percentage of guanosine and cytosine nucleotides in the DNA, % form is the percentage of formamide in the hybridization solution, and L is the length of the hybrid in base pairs. Tm is reduced by about 1°C for each 1% of mismatching; thus, Tm, hybridization, and/or wash conditions can be adjusted to hybridize to sequences of the desired identity. For example, if sequences with >90% identity are sought, the Tm can be decreased 10°C. Generally, stringent conditions are selected to be about 5°C lower than the Tm for the specific sequence and its complement at a defined ionic strength and pH. However, severely stringent conditions can utilize a hybridization and/or wash at 1, 2, 3, or 4°C lower than the Tm; moderately stringent conditions can utilize a hybridization and/or wash at 6, 7, 8, 9, or 10°C lower than the Tm; low stringency conditions can utilize a hybridization and/or wash at 11, 12, 13, 14, 15, or 20°C lower than the Tm.
Using the equation, hybridization and wash compositions, and desired temperature, those of ordinary skill will understand that variations in the stringency of hybridization and/or wash solutions are inherently described. If the desired degree of mismatching results in a temperature of less than 45°C (aqueous solution) or 32°C (formamide solution), the SSC concentration is increased so that a higher temperature can be used. Generally, highly stringent hybridization and wash conditions are selected to be about 5°C lower than the Tm for the specific sequence at a defined ionic strength and pH.
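The Meinkoth and Wahl approximation and the mismatch adjustment described above can be worked through numerically. The following is a minimal sketch: the function names and the example inputs (1.0 M monovalent cations, 50% GC, 50% formamide, a 500 base pair hybrid) are assumptions chosen for illustration, and "log" is taken as the base-10 logarithm.

```python
import math


def tm_meinkoth_wahl(molarity, percent_gc, percent_formamide, length_bp):
    """Approximate Tm (°C) of a DNA-DNA hybrid.

    Tm = 81.5 + 16.6*log10(M) + 0.41*(%GC) - 0.61*(% form) - 500/L
    """
    return (81.5
            + 16.6 * math.log10(molarity)
            + 0.41 * percent_gc
            - 0.61 * percent_formamide
            - 500.0 / length_bp)


def tm_for_identity(tm, percent_identity):
    """Lower Tm by about 1°C for each 1% of mismatching."""
    return tm - (100.0 - percent_identity)


tm = tm_meinkoth_wahl(molarity=1.0, percent_gc=50, percent_formamide=50,
                      length_bp=500)
print(round(tm, 1))                          # Tm for a perfectly matched hybrid
print(round(tm - 5.0, 1))                    # about 5°C below Tm (stringent)
print(round(tm_for_identity(tm, 90.0), 1))   # Tm if 90% identity is sought
```

With these example inputs the perfectly matched Tm is about 70.5°C, so a stringent wash would be run near 65.5°C and hybrids with 90% identity would melt near 60.5°C.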

An example of highly stringent wash conditions is 0.15 M NaCl at 72°C for about 15 minutes. An example of stringent wash conditions is a 0.2 x SSC wash at 65°C for 15 minutes. Often, a high stringency wash is preceded by a low stringency wash to remove background probe signal. An example medium stringency wash for a duplex of, e.g., more than 100 nucleotides, is 1 x SSC at 45°C for 15 minutes. For short nucleotide sequences (e.g., about 10 to 50 nucleotides), stringent conditions typically involve salt concentrations of less than about 1.5 M, less than about 0.01 to 1.0 M, Na ion concentration (or other salts) at pH 7.0 to 8.3, and the temperature is typically at least about 30°C for short probes and at least about 60°C for long probes (e.g., >50 nucleotides). Stringent conditions may also be achieved with the addition of destabilizing agents such as formamide. In general, a signal to noise ratio of 2 x (or higher) than that observed for an unrelated probe in the particular hybridization assay indicates detection of a specific hybridization. Nucleic acids that do not hybridize to each other under stringent conditions are still substantially identical if the proteins that they encode are substantially identical. This occurs, e.g., when a copy of a nucleic acid is created using the maximum codon degeneracy permitted by the genetic code.

Very stringent conditions are selected to be equal to the Tm for a particular probe. An example of stringent conditions for hybridization of complementary nucleic acids that have more than 100 complementary residues on a filter in a Southern or Northern blot is 50% formamide, e.g., hybridization in 50% formamide, 1 M NaCl, 1% SDS at 37°C, and a wash in 0.1 x SSC at 60 to 65°C. Exemplary low stringency conditions include hybridization with a buffer solution of 30 to 35% formamide, 1 M NaCl, 1% SDS (sodium dodecyl sulphate) at 37°C, and a wash in 1 x to 2 x SSC (20 x SSC = 3.0 M NaCl/0.3 M trisodium citrate) at 50 to 55°C. Exemplary moderate stringency conditions include hybridization in 40 to 45% formamide, 1.0 M NaCl, 1% SDS at 37°C, and a wash in 0.5 x to 1 x SSC at 55 to 60°C.
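Because the wash conditions above are stated as multiples of SSC, it can help to convert them to molar concentrations using the stated definition (20 x SSC = 3.0 M NaCl/0.3 M trisodium citrate). The sketch below does that arithmetic. The function name `ssc_molarity` is hypothetical, and the total sodium figure assumes that each trisodium citrate contributes three sodium ions.

```python
def ssc_molarity(fold):
    """Convert an SSC strength (e.g., 0.1 for 0.1 x SSC) to molar concentrations."""
    nacl = 3.0 * fold / 20.0      # 20 x SSC contains 3.0 M NaCl
    citrate = 0.3 * fold / 20.0   # 20 x SSC contains 0.3 M trisodium citrate
    return {
        "NaCl_M": nacl,
        "citrate_M": citrate,
        # Total Na+ = NaCl plus three Na+ per trisodium citrate.
        "Na_M": nacl + 3.0 * citrate,
    }


for fold in (0.1, 0.5, 1.0, 2.0):
    conc = ssc_molarity(fold)
    print(f"{fold} x SSC: {conc['NaCl_M']:.3f} M NaCl, {conc['Na_M']:.4f} M Na+")
```

On this arithmetic, 1 x SSC corresponds to 0.15 M NaCl, which matches the salt concentration given in the highly stringent wash example.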

In addition to the chemical optimization of stringency conditions, analytical models and algorithms can be applied to hybridization data-sets (e.g., microarray data) to improve stringency.

An expression vector as described herein can be introduced into host cells to produce a fusion protein of this invention. Also within the scope of this invention is a host cell that contains the above-described nucleic acid. Examples include E. coli cells, insect cells (e.g., using baculovirus expression vectors), yeast cells, plant cells, or mammalian cells. See e.g., Goeddel, (1990) Gene Expression Technology: Methods in Enzymology 185, Academic Press, San Diego, Calif. To produce a fusion protein of this invention, one can culture a host cell in a medium under conditions permitting expression of the protein encoded by a nucleic acid of this invention, and isolate the protein from the cultured cell or the medium of the cell. The presence of the fusion protein in an occlusion body allows one to prepare the protein from the host cell by simply separating the occlusion body from the host cell. Alternatively, the nucleic acid of this invention can be transcribed and translated in vitro, for example, using T7 promoter regulatory sequences and T7 polymerase.

The terms "peptide," "polypeptide," and "protein" are used herein interchangeably to describe the arrangement of amino acid residues in a polymer. A peptide, polypeptide, or protein can be composed of the standard 20 naturally occurring amino acids, in addition to rare amino acids and synthetic amino acid analogs. They can be any chain of amino acids, regardless of length or post-translational modification (for example, glycosylation or phosphorylation). The peptide, polypeptide, or protein "of this invention" includes recombinantly or synthetically produced fusion versions having the particular domains or portions that are soluble. The term also encompasses polypeptides that have an added amino-terminal methionine (useful for expression in prokaryotic cells).

A "recombinant" peptide, polypeptide, or protein refers to a peptide, polypeptide, or protein produced by recombinant DNA techniques; i.e., produced from cells transformed by an exogenous DNA construct encoding the desired peptide. A "synthetic" peptide, polypeptide, or protein refers to a peptide, polypeptide, or protein prepared by chemical synthesis. The term "recombinant" when used with reference, e.g., to a cell, or nucleic acid, protein, or vector, indicates that the cell, nucleic acid, protein or vector, has been modified by the introduction of a heterologous nucleic acid or protein or the alteration of a native nucleic acid or protein, or that the cell is derived from a cell so modified.

Within the scope of this invention are fusion proteins containing one or more of the afore-mentioned sequences and a heterologous sequence. A heterologous polypeptide, nucleic acid, or gene is one that originates from a foreign species, or, if from the same species, is substantially modified from its original form. Two fused domains or sequences are heterologous to each other if they are not adjacent to each other in a naturally occurring protein or nucleic acid.

An "isolated" peptide, polypeptide, or protein refers to a peptide, polypeptide, or protein that has been separated from other proteins, lipids, and nucleic acids with which it is naturally associated. The polypeptide/protein can constitute at least 10% (i.e., any percentage between 10% and 100%, e.g., 20%, 30%, 40%, 50%, 60%, 70%, 80%, 85%, 90%, 95%, and 99%) by dry weight of the purified preparation. Purity can be measured by any appropriate standard method, for example, by column chromatography, polyacrylamide gel electrophoresis, or HPLC analysis. An isolated polypeptide/protein described in the invention can be purified from a natural source, produced by recombinant DNA techniques, or by chemical methods.

A functional equivalent of a peptide, polypeptide, or protein of this invention refers to a polypeptide derivative of the peptide, polypeptide, or protein, e.g., a protein having one or more point mutations, insertions, deletions, truncations, a fusion protein, or a combination thereof. It retains substantially the activity of the corresponding unmodified peptide/polypeptide/protein (e.g., the activity of a transcription factor). The isolated polypeptide can contain a sequence of a protein as listed in Table 1 or 2 or a functional fragment thereof.

In general, the functional equivalent is at least 75% (e.g., any number between 75% and 100%, inclusive, e.g., 70%, 80%, 85%, 90%, 95%, and 99%) identical to the corresponding unmodified peptide/polypeptide/protein. The amino acid composition of the above-mentioned peptide/polypeptide/protein may vary without disrupting their biological activity, e.g., a transcription factor activity, i.e., ability to bind to a DNA element and/or trigger or inhibit the respective cellular response. For example, it can contain one or more conservative amino acid substitutions. A "conservative amino acid substitution" is one in which the amino acid residue is replaced with an amino acid residue having a similar side chain. Families of amino acid residues having similar side chains have been defined in the art. These families include amino acids with basic side chains (e.g., lysine, arginine, histidine), acidic side chains (e.g., aspartic acid, glutamic acid), uncharged polar side chains (e.g., glycine, asparagine, glutamine, serine, threonine, tyrosine, cysteine), nonpolar side chains (e.g., alanine, valine, leucine, isoleucine, proline, phenylalanine, methionine, tryptophan), β-branched side chains (e.g., threonine, valine, isoleucine) and aromatic side chains (e.g., tyrosine, phenylalanine, tryptophan, histidine). Thus, a predicted nonessential amino acid residue in a polypeptide is preferably replaced with another amino acid residue from the same side chain family.

Alternatively, mutations can be introduced randomly along all or part of the sequences, such as by saturation mutagenesis, and the resultant mutants can be screened for the respective biological activities.

A polypeptide described in this invention can be obtained as a recombinant polypeptide. To prepare a recombinant polypeptide, a nucleic acid encoding it can be linked to another nucleic acid encoding a fusion partner, e.g., the tags disclosed herein, glutathione-s-transferase (GST), 6x-His epitope tag (or Hexa-His), 8x-His (or Octa-His) epitope tag, or M13 Gene 3 protein. The resultant fusion nucleic acid expresses in suitable host cells a fusion protein that can be isolated by methods known in the art. The isolated fusion protein can be further treated, e.g., by enzymatic digestion (e.g., TEV protease digestion), to remove the fusion partner and obtain the recombinant polypeptide of this invention.

The peptide/polypeptide/protein of this invention covers chemically modified versions. Examples of chemically modified peptide/protein include those subjected to conformational change, addition or deletion of a sugar chain, and those to which a compound such as polyethylene glycol has been bound.

Once purified and tested by standard methods or according to the methods described in the examples below, the peptide/polypeptide/protein can be included in a composition, e.g., a pharmaceutical composition or an immunogenic composition.

The term "immunogenic" refers to a capability of producing an immune response in a host animal against an antigen or antigens. This immune response forms the basis of the protective immunity elicited by a vaccine against a specific infectious organism. "Immune response" refers to a response elicited in an animal, which may refer to cellular immunity (CMI), humoral immunity, or both. "Antigenic agent," "antigen," or "immunogen" means a substance that induces a specific immune response in a host animal. The antigen can be a protein described above, a vector encoding it, a cell having the vector or protein, or any combination thereof.

The term "animal" includes all vertebrate animals including humans. It also includes an individual animal in all stages of development, including embryonic and fetal stages. In particular, the term "vertebrate animal" includes, but is not limited to, humans, canines (e.g., dogs), felines (e.g., cats), equines (e.g., horses), bovines (e.g., cattle), porcines (e.g., pigs), as well as avians. The term "avian" refers to any species or subspecies of the taxonomic class Aves, such as, but not limited to, chickens (breeders, broilers and layers), turkeys, ducks, geese, quail, pheasants, parrots, finches, hawks, crows and ratites including ostrich, emu and cassowary.

The immunogenic composition can be used to generate antibodies against the peptide/polypeptide/protein of this invention. As used herein, "antibody" is used in the broadest sense and specifically covers monoclonal antibodies (including full length monoclonal antibodies), polyclonal antibodies, multispecific antibodies (e.g., bispecific antibodies), and antibody fragments so long as they exhibit the desired biological activity.

As used herein, "antibody fragments" may comprise a portion of an intact antibody, generally including the antigen binding or variable region of the intact antibody, the Fab region of the antibody, or the Fc region of an antibody which retains FcR binding capability. Examples of antibody fragments include linear antibodies; single-chain antibody molecules; and multispecific antibodies formed from antibody fragments. The antibody fragments preferably retain at least part of the hinge and optionally the CH1 region of an IgG heavy chain. More preferably, the antibody fragments retain the entire constant region of an IgG heavy chain, and include an IgG light chain.

As used herein, Affinity Capture Reagents are cognate molecules capable of recognizing and binding to a protein antigen, including protein antigens produced by TOEET-optimized expression vectors. Affinity Capture Reagents include (but are not limited to) monoclonal and polyclonal antibodies, Fab or Fab fragments generated by phage and related antigen display methods, RNA aptamers, and various protein binding scaffolds which can be used to generate antigen-recognizing molecules.

As used herein, the term "Fc fragment" or "Fc region" is used to define a C-terminal region of an immunoglobulin heavy chain. The "Fc region" may be a native sequence Fc region or a variant Fc region. Although the boundaries of the Fc region of an immunoglobulin heavy chain might vary, the human IgG heavy chain Fc region is usually defined to stretch from an amino acid residue at position Cys226, or from Pro230, to the carboxyl-terminus thereof.

A "native sequence Fc region" comprises an amino acid sequence identical to the amino acid sequence of an Fc region found in nature. A "variant Fc region" as appreciated by one of ordinary skill in the art comprises an amino acid sequence which differs from that of a native sequence Fc region by virtue of at least one "amino acid modification." Preferably, the variant Fc region has at least one amino acid substitution compared to a native sequence Fc region or to the Fc region of a parent polypeptide, e.g., from about one to about ten amino acid substitutions, and preferably from about one to about five amino acid substitutions in a native sequence Fc region or in the Fc region of the parent polypeptide. The variant Fc region herein will preferably possess at least about 80% homology with a native sequence Fc region and/or with an Fc region of a parent polypeptide, and more preferably at least about 90% homology therewith, more preferably at least about 95% homology therewith, even more preferably, at least about 99% homology therewith.

Within the scope of this invention is a composition that contains a suitable carrier and one or more of the agents described above. The composition can be a pharmaceutical composition that contains a pharmaceutically acceptable carrier. The term "pharmaceutical composition" refers to the combination of an active agent with a carrier, inert or active, making the composition especially suitable for diagnostic or therapeutic use in vivo or ex vivo. A "pharmaceutically acceptable carrier," after being administered to or upon a subject, does not cause undesirable physiological effects. The carrier in the pharmaceutical composition must be "acceptable" also in the sense that it is compatible with the active ingredient and can be capable of stabilizing it. One or more solubilizing agents can be utilized as pharmaceutical carriers for delivery of an active compound. Examples of a pharmaceutically acceptable carrier include, but are not limited to, biocompatible vehicles, adjuvants, additives, and diluents to achieve a composition usable as a dosage form. Examples of other carriers include colloidal silicon oxide, magnesium stearate, cellulose, and sodium lauryl sulfate.

As used herein, a "subject" refers to a human or a non-human animal. Examples of a non-human animal include all vertebrates, e.g., mammals, such as non-human mammals, non-human primates (particularly higher primates), dog, rodent (e.g., mouse or rat), guinea pig, cat, and rabbit, and non-mammals, such as birds, amphibians, reptiles, etc. In one embodiment, the subject is a human. In another embodiment, the subject is an experimental, non-human animal or animal suitable as a disease model.

The composition of this invention can include an adjuvant agent or adjuvant. As used herein, the term "adjuvant agent" or "adjuvant" means a substance added to an immunogenic composition or a vaccine to increase the immunogenic composition or the vaccine's immunogenicity. Examples of an adjuvant include a cholera toxin, Escherichia coli heat-labile enterotoxin, liposome, unmethylated DNA (CpG) or any other innate immune-stimulating complex. Various adjuvants that can be used to further increase the immunological response depend on the host species and include Freund's adjuvant (complete and incomplete), mineral gels such as aluminum hydroxide, surface-active substances such as lysolecithin, pluronic polyols, polyanions, peptides, oil emulsions, keyhole limpet hemocyanin, and dinitrophenol. Useful human adjuvants include BCG (bacille Calmette-Guerin) and Corynebacterium parvum.

Pharmaceutical compositions comprising an adjuvant and an antigen may be manufactured by means of conventional mixing, dissolving, granulating, dragee-making, levigating, emulsifying, encapsulating, entrapping or lyophilizing processes. Pharmaceutical compositions may be formulated in conventional manner using one or more physiologically acceptable carriers, diluents, excipients or auxiliaries which facilitate processing of the antigens of the invention into preparations which can be used pharmaceutically. Proper formulation is dependent upon the route of administration chosen.

A pharmaceutical composition of this invention can be administered parenterally, orally, nasally, rectally, topically, or buccally. The term "parenteral" as used herein refers to subcutaneous, intracutaneous, intravenous, intramuscular, intraarticular, intraarterial, intrasynovial, intrasternal, intrathecal, intralesional, or intracranial injection, as well as any suitable infusion technique. For injection, immunogenic or vaccine preparations may be formulated in aqueous solutions, preferably in physiologically compatible buffers such as Hanks' solution, Ringer's solution, phosphate buffered saline, or any other physiological saline buffer. The solution may contain formulatory agents such as suspending, stabilizing and/or dispersing agents. Alternatively, the peptides, polypeptides, or proteins may be in powder form for constitution with a suitable vehicle, e.g., sterile pyrogen-free water, before use.

Determination of an effective amount of the immunogenic or vaccine formulation for administration is well within the capabilities of those skilled in the art, especially in light of the detailed disclosure provided herein. An effective dose can be estimated initially from in vitro assays. For example, a dose can be formulated in animal models to achieve an induction of an immune response using techniques that are well known in the art. One having ordinary skill in the art could readily optimize administration to all animal species based on results described herein. Dosage amount and interval may be adjusted individually. For example, when used as a vaccine, the vaccine formulations of the invention may be administered in about 1 to 3 doses for a 1-36 week period. Preferably, 1 or 2 doses are administered, at intervals of about 3 weeks to about 4 months, and booster vaccinations may be given periodically thereafter.

Alternative protocols may be appropriate for individual animals. A suitable dose is an amount of the vaccine formulation that, when administered as described above, is capable of raising an immune response in an immunized animal sufficient to protect the animal from an infection for at least 4 to 12 months. In general, the amount of the antigen present in a dose ranges from about 1 pg to about 100 mg per kg of host, typically from about 10 pg to about 1 mg, and preferably from about 100 pg to about 1 µg. The suitable dose volume will vary with the route of injection and the size of the patient, but will typically range from about 0.1 ml to about 5 ml.

This invention also provides methods for making antibodies against the above-described proteins. The antibodies can be either polyclonal or monoclonal.

Polyclonal antibodies against a protein of the invention can be obtained as follows. After verifying that a desired serum antibody level has been reached, blood is withdrawn from the mammal sensitized with the antigen. Serum is isolated from this blood using well-known methods. The serum containing the polyclonal antibody may be used as the polyclonal antibody, or according to needs, the polyclonal antibody-containing fraction may be further isolated from the serum. For instance, a fraction of antibodies that specifically recognize the protein of the invention may be prepared by using an affinity column to which the protein is coupled. Then, the fraction may be further purified by using a Protein A or Protein G column in order to prepare immunoglobulin G or immunoglobulin M.

To obtain monoclonal antibodies, after verifying that the desired serum antibody level has been reached in the mammal sensitized with the above-described antigen, immunocytes are taken from the mammal and used for cell fusion. For this purpose, splenocytes are the preferred immunocytes. As parent cells fused with the above immunocytes, mammalian myeloma cells are preferably used. More preferably, the parent cells are myeloma cells that have acquired a feature allowing fused cells to be distinguished by selective agents.

The cell fusion between the above immunocytes and myeloma cells can be conducted according to known methods, for example, the method of Milstein et al. (Methods Enzymol., 73:3-46, 1981). The hybridoma obtained from cell fusion is selected by culturing the cells in a standard selective culture medium, for example, HAT culture medium (hypoxanthine, aminopterin, thymidine-containing culture medium). The culture in this HAT medium is continued for a period sufficient for cells other than the objective hybridoma (non-fusion cells) to perish, usually from a few days to a few weeks. Next, the usual limiting dilution method is carried out, and the hybridoma producing the objective antibody is screened and cloned.

Besides the above method of obtaining hybridomas by immunizing a non-human animal with the antigen, a hybridoma producing the objective human antibodies having the activity to bind to proteins can be obtained by sensitizing human lymphocytes (for example, human lymphocytes infected with the EB virus) in vitro with proteins, protein-expressing cells, or lysates thereof, and then fusing the sensitized lymphocytes with human-derived myeloma cells having a permanent cell division ability.

The obtained monoclonal antibodies can be purified by, for example, ammonium sulfate precipitation, protein A or protein G column, DEAE ion exchange chromatography, an affinity column to which the protein of the present invention is coupled, and so on. The antibody may be useful for the purification or detection of a protein of the invention. It may also be a candidate for an agonist or antagonist of the protein. Furthermore, it is possible to use it for the antibody treatment of diseases in which the protein is implicated. For in vivo administration (in such antibody treatment), human antibodies or humanized antibodies may be favorably used because of their reduced antigenicity.

For example, a human antibody against a protein can be obtained using hybridomas made by fusing myeloma cells with antibody-producing cells obtained by immunizing a transgenic animal comprising a repertoire of human antibody genes with an antigen such as a protein, protein-expressing cells, or a cell lysate thereof. Other than producing antibodies by using hybridoma, antibody-producing immunocytes, such as sensitized lymphocytes that are immortalized by oncogenes, may also be used.

Such monoclonal antibodies can also be obtained as recombinant antibodies produced by using the genetic engineering technique. Recombinant antibodies are produced by cloning the encoding DNA from immunocytes, such as hybridoma or antibody-producing sensitized lymphocytes, incorporating this into a suitable vector, and introducing this vector into a host to produce the antibody. The present invention encompasses such recombinant antibodies as well.

Moreover, the antibody of the present invention may be an antibody fragment or a modified antibody, so long as it binds to a protein of the invention. For example, Fab, F(ab')2, Fv, or single chain Fv in which the H chain Fv and the L chain Fv are suitably linked by a linker (scFv, Huston et al., Proc. Natl. Acad. Sci. USA, 85:5879-5883, 1988) can be given as antibody fragments. Specifically, antibody fragments are produced by treating antibodies with enzymes, for example, papain, pepsin, and such, or by constructing a gene encoding an antibody fragment, introducing this into an expression vector, and expressing this vector in suitable host cells (for example, Co et al., J. Immunol., 152:2968-2976, 1994; Better et al., Methods Enzymol., 178:476-496, 1989; Pluckthun et al., Methods Enzymol., 178:497-515, 1989; Lamoyi, Methods Enzymol., 121:652-663, 1986; Rousseaux et al., Methods Enzymol., 121:663-669, 1986; Bird et al., Trends Biotechnol., 9:132-137, 1991).

As modified antibodies, antibodies bound to various molecules such as polyethylene glycol (PEG) can be used. The antibody of the present invention encompasses such modified antibodies as well. To obtain such a modified antibody, chemical modifications are made to the obtained antibody. These methods are already established in the field.

The antibody of the invention may be obtained as a chimeric antibody, comprising a non-human antibody-derived variable region and a human antibody-derived constant region, or as a humanized antibody comprising a non-human antibody-derived complementarity determining region (CDR), a human antibody-derived framework region (FR), and a human antibody-derived constant region, by using conventional methods.

Antibodies thus obtained can be purified to uniformity. The separation and purification methods used in the present invention for separating and purifying the antibody may be any method usually used for proteins. For instance, column chromatography, such as affinity chromatography, filter, ultrafiltration, salt precipitation, dialysis, SDS-polyacrylamide gel electrophoresis, isoelectric point electrophoresis, and so on, may be appropriately selected and combined to isolate and purify the antibodies (Antibodies: a laboratory manual. Ed Harlow and David Lane, Cold Spring Harbor Laboratory, 1988), but the methods are not limited thereto. The concentration of the above-mentioned antibody can be assayed by measuring the absorbance, or by the enzyme-linked immunosorbent assay (ELISA), etc. A Protein A or Protein G column can be used for the affinity chromatography. The Protein A column may be, for example, Hyper D, POROS, Sepharose F.F., and so on.

Other chromatography may also be used, such as ion exchange chromatography, hydrophobic chromatography, gel filtration, reverse phase chromatography, and adsorption chromatography (Strategies for Protein Purification and Characterization: A Laboratory Course Manual. Ed. by Marshak D.R. et al., Cold Spring Harbor Laboratory Press, 1996). These may be performed on liquid chromatography such as HPLC or FPLC.

Examples of methods that assay the antigen-binding activity of the antibodies of the invention include measurement of absorbance, enzyme-linked immunosorbent assay (ELISA), enzyme immunoassay (EIA), radio immunoassay (RIA), or fluorescent antibody method. For example, when using ELISA, a protein of the invention is added to a plate coated with the antibodies of the invention, and next, the objective antibody sample, for example, culture supernatants of antibody-producing cells, or purified antibodies are added. Then, a secondary antibody recognizing the antibody, which is labeled by alkaline phosphatase or a similar enzyme, is added, the plate is incubated and washed, and the absorbance is measured to evaluate the antigen-binding activity after adding an enzyme substrate such as p-nitrophenyl phosphate. As the protein, a protein fragment, for example, a fragment comprising a C-terminus, or a fragment comprising an N-terminus may be used. To evaluate the activity of the antibody of the invention, BIAcore may be used.

The following non-limiting examples set forth herein below illustrate certain aspects of the invention.

EXAMPLE 1

This example describes two specific EET tags designed utilizing TOEET. These EETs were engineered and subcloned into the pET15_NESG expression vector (Acton et al., 2011). They contain dual tandem protein purification tags and a protease cleavage site to facilitate purification of the resulting proteins. These include the 6X-His tag (Crowe et al., 1994), and one of two Streptavidin binding moieties, either the Avi-tag (Scholle et al., 2004) or the Nano-tag (Lamla and Erdmann, 2004). The Nano-tag binds directly to streptavidin (Lamla and Erdmann, 2004); the Avi-tag is a substrate for the enzyme BirA, which can be used to catalyze the covalent attachment of biotin to the Avi-tag (Scholle et al., 2004). These tandem tags allow for two separate affinity purification steps: (i) Ni-based immobilized metal affinity chromatography (IMAC) and (ii) high-affinity Streptavidin-based chromatography. This dual purification strategy allows preparation of highly purified proteins using high-throughput affinity purification methods. The Tobacco Etch Virus (TEV) protease recognition site (Kapust et al., 2002) engineered into these EETs allows removal of the affinity tags, if required, after expression and purification of the protein target.

Briefly, in designing the DNA sequences coding for these EETs, the coding sequence of one of the two Streptavidin binding moieties, i.e., Avi-tag (SEQ ID NO:1 - MSGLNDIFEAQKIEWHE) or Nano-tag (SEQ ID NO:2 - MDVEAWLDERVPLVET) (Lamla and Erdmann, 2004; Scholle et al., 2004), a 6X-His tag (Crowe et al., 1994), and a TEV protease recognition site (Kapust et al., 2002) were fused in frame and optimized to have a high Codon Adaptation Index (Sharp and Li, 1987) (Fig. 1). The DNA sequence coding for the EET was optimized with TOEET, together with the 5'-untranslated region of the pET15_NESG expression vector, to generate the expression vectors pNESG_Avi6HT and pNESG_Nano6HT, shown in Fig. 1. These features functioned together to enhance translation initiation and protein expression levels.
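As an illustration of the Codon Adaptation Index referred to above, the following minimal Python sketch computes the index as the geometric mean of per-codon relative adaptiveness values, which is the general form of the Sharp and Li measure. The weight table `W`, the function name `cai`, and the two short test sequences are hypothetical and chosen only for illustration; real weights would be derived from a reference set of highly expressed genes of the expression host.

```python
import math

# Hypothetical relative adaptiveness values (w) for a few DNA codons.
# These numbers are illustrative assumptions, not measured E. coli values.
W = {
    "CTG": 1.00, "CTA": 0.05,   # Leu
    "AAA": 1.00, "AAG": 0.25,   # Lys
    "GAA": 1.00, "GAG": 0.30,   # Glu
}

def cai(cds, weights):
    """Return the geometric mean of the codon weights over a coding sequence."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    log_sum = sum(math.log(weights[c]) for c in codons)
    return math.exp(log_sum / len(codons))

# Both sequences encode Leu-Lys-Glu; only the codon choice differs.
rare = "CTAAAGGAG"
optimized = "CTGAAAGAA"
print("rare codons:     ", round(cai(rare, W), 3))
print("optimized codons:", round(cai(optimized, W), 3))
```

Under these assumed weights, the sequence built from the preferred codons scores higher than the synonymous sequence built from low-usage codons, which is the property the codon optimization step aims for.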

Using these expression vectors (Fig. 1), protein expression resulted in T7 RNA Polymerase mediated transcription producing an mRNA transcript consisting of (i) vector sequence (pET15_NESG 5'-untranslated region), (ii) nucleotides coding for the EET, and (iii) nucleotides coding for the target protein sequence. Both the untranslated region of the vector upstream of the EET-coding region, and the RNA coding for the EET itself were optimized to avoid secondary structure formation within and between these regions of the mRNA transcript. In this particular implementation, the length of the optimized nucleotide sequence coding for the EET was about 90 nucleotides. Together with the 70 upstream 5'-untranslated nucleotides of the transcript driven by the T7 promoter of the vector, the 5'-region of the transcript was optimized as a unit of about 160 nucleotides. Longer optimized nucleotide sequences, and potentially somewhat shorter optimized nucleotide sequences, may also be effective in creating TOEET-based expression-enhanced vectors. The optimized regions of the pNESG_Avi6HT and pNESG_Nano6HT based TOEET vectors are shown in Fig. 1. The figure shows the DNA sequences, RNA sequences, and the translated protein tag sequences (SEQ ID NO:3 - MSGLNDIFEAQKIEWHEHHHHHHENLYFQSH and SEQ ID NO:4 - MDVEAWLDERVPLVETHHHHHHENLYFQSH, respectively) of the expression vectors, along with the DNA sequence coding for the multiple cloning site (MCS), a series of restriction endonuclease sites used for cloning into the expression plasmids. Figure 2 shows, as an example, the predicted RNA secondary structure in transcripts generated from the pNESG_Avi6HT vector, highlighting the lack of predicted RNA secondary structure near the RBS / translation initiation site.
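The tag composition and the approximate lengths quoted above can be checked with a short Python sketch. It assembles each translated tag from the streptavidin-binding moiety, the 6X-His tag, and the remaining residues as they appear in SEQ ID NO:3 and SEQ ID NO:4, and then adds the 70 nucleotides of 5'-untranslated region. The names `eet`, `optimized_unit_nt`, and `TAIL` are hypothetical helpers introduced only for this illustration.

```python
AVI = "MSGLNDIFEAQKIEWHE"    # SEQ ID NO:1
NANO = "MDVEAWLDERVPLVET"    # SEQ ID NO:2
HIS6 = "H" * 6               # 6X-His tag
TAIL = "ENLYFQSH"            # residues following the His tag in SEQ ID NO:3 and NO:4
UTR_NT = 70                  # upstream 5'-untranslated nucleotides stated in the text

def eet(binding_moiety):
    """Assemble the translated tag: binding moiety, 6X-His, then the TEV-site region."""
    return binding_moiety + HIS6 + TAIL

def optimized_unit_nt(tag):
    """Nucleotides optimized as one unit: 5'-UTR plus three nucleotides per residue."""
    return UTR_NT + 3 * len(tag)

for name, moiety in (("pNESG_Avi6HT", AVI), ("pNESG_Nano6HT", NANO)):
    tag = eet(moiety)
    print(name, tag, len(tag), "residues,", 3 * len(tag), "coding nt,",
          optimized_unit_nt(tag), "nt including 5'-UTR")
```

The two tags come out at 31 and 30 residues, that is, 93 and 90 coding nucleotides and 163 and 160 nucleotides including the untranslated region, consistent with the "about 90" and "about 160" figures given in the text.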

A third vector comprising the Pyrococcus furiosus (PfR) Maltose Binding Protein (MBP) was also constructed and optimized using TOEET. The MBP from Pyrococcus furiosus is much more thermally stable than that of E. coli, and is expected to provide a more robust solubilization enhancement tag and affinity purification tag. Proteins that are expressed but not soluble in cell extracts can be solubilized and used successfully as antigens using various methods of solubilization, including urea and guanidine denaturants (Agaton et al., 2003). The PfR MBP provides improved purification of target proteins under such partially denaturing conditions or other harsh conditions. The sequences shown at the top of Figure 4 correspond to the first 30 residues of the wild-type PfR-MBP DNA sequence lacking the native secretion signal. The protein open reading frame (DNA sequence) is shown above the corresponding protein sequence, and directly below is the T7 RNA polymerase mediated RNA transcript resulting from the cloning of the PfR-MBP into the pET15_NESG backbone. The lower set of sequences shown in Figure 4 corresponds to TOEET optimized PfR-MBP. Silent mutations were introduced for codon optimization or to decrease the predicted RNA secondary structure in the regions of the RBS and translation initiation codon, or both. The silent mutations were introduced using primers incorporating the nucleotide changes and 5 successive rounds of PCR, negating the need for expensive total gene synthesis.

The predicted mRNA secondary structure resulting from T7-RNA Polymerase based transcription off of the pET15_NESG vector backbone with Pyrococcus furiosus (PfR) Maltose Binding Protein (MBP) without TOEET optimization is shown in Figure 5. Significant secondary structure (base pairing) at both the Ribosome Binding Site (RBS) and the translation initiation site (Initiation Codon) is predicted. The predicted mRNA secondary structure resulting from T7-RNA Polymerase based transcription off of the pET15_NESG vector backbone with Pyrococcus furiosus (PfR) Maltose Binding Protein (MBP) after TOEET optimization is shown in Figure 6. As illustrated by Figure 6, significantly greater open structure (lack of base pairing) after TOEET optimization is predicted.

EXAMPLE 2

The results obtained from expression studies with the above-described new vectors demonstrated that the TOEET strategy is both extremely successful and robust. In this example, similar expression and solubility studies were carried out using a high throughput methodology for the identification and isolation of soluble proteins and protein domains.

As mentioned above, the isolation of soluble, well-folded proteins and protein domains is of great use and importance to the biotechnology industry and biological researchers as a whole. However, the production of such protein reagents remains extremely challenging, especially in the cost effective, commonly used bacterial expression systems. These Escherichia coli expression systems are often successful in the production of simple bacterial proteins but are far less amenable to the production of eukaryotic, multidomain proteins or protein complexes, often resulting in no or low levels of expression and/or solubility (greatly complicating or thwarting their production as a protein reagent). A variety of reasons contribute to the lower success rate of these proteins in bacterial expression systems, including the fact that eukaryotic proteins are frequently multidomain in nature, which often results in misfolding when they are expressed using simple prokaryotic expression systems (Netzer and Hartl, 1997). Another major reason for the higher attrition rate relates to the increased levels of disordered regions in human and other eukaryotic proteins in comparison to simpler organisms (Lui et al., 2002). These disordered regions likely cause aggregation and misfolding in E. coli expression systems, leading to proteins or domains with low expression and/or solubility, again greatly interfering with their production. To circumvent these issues, the NESG Construct Optimization Software and High Throughput (HTP) Molecular Cloning and Expression Screening Platform and Automated Purification Pipeline methods were developed for assaying multiple alternative constructs to identify soluble proteins or domains (Methods in Enzymology, Vol. 493, Burlington: Academic Press, 2011, pp. 21-60).

Briefly, the NESG Construct Optimization Software used reports from the DisMeta Server (http://www-nmr.cabm.rutgers.edu/bioinformatics/disorder), a metaserver that generated a consensus analysis of eight sequence-based disorder predictors to identify protein regions that are likely to be disordered. In addition, secondary structure, transmembrane regions, and signal peptides, among other features, were also predicted. These data, along with multiple sequence alignments of homologous proteins, were used to predict possible structural domain boundaries. Based on this information, the NESG Construct Optimization Software generated nested sets of alternative constructs for full-length proteins, multidomain constructs, and single domain constructs. Primers for cloning were then designed using the software Primer Prim'r (Everett, J.K.; Acton, T.B.; Montelione, G.T. Primer Prim'r: A web based server for automated primer design. J. Struct. Funct. Genomics 2004, 5: 13-21). Thus, for a single targeted region, multiple open reading frames were generally designed, varying the N- and/or C-terminal sequences. These alternative constructs often possessed significantly better expression, solubility and biophysical behavior than their full-length parent sequences, increasing the possibility of successfully producing a protein reagent.
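The idea of a nested set of alternative constructs can be sketched as follows. This is not the NESG Construct Optimization Software; it is a minimal Python illustration that simply enumerates every combination of candidate N-terminal start and C-terminal end residues. The function name `nested_constructs`, the `min_len` cutoff, the target identifier `HR0000A`, and the boundary values are all hypothetical, and the `ID-start-end` naming is an assumption modeled on the construct identifiers listed in Table 1.

```python
from itertools import product

def nested_constructs(target_id, n_starts, c_ends, min_len=40):
    """Enumerate (name, start, end) for each N-/C-terminal boundary combination.

    Residue numbers are 1-based and inclusive. Constructs shorter than
    min_len residues are skipped (an arbitrary illustrative cutoff).
    """
    constructs = []
    for start, end in product(sorted(n_starts), sorted(c_ends)):
        if end - start + 1 >= min_len:
            constructs.append((f"{target_id}-{start}-{end}", start, end))
    return constructs

# Hypothetical predicted domain boundaries for a hypothetical target.
for name, start, end in nested_constructs("HR0000A", [18, 25], [83, 96, 109]):
    print(name, end - start + 1, "residues")
```

Each resulting open reading frame would then be passed to primer design and screened separately, so that boundary choices that happen to give poor expression or solubility do not eliminate the whole target.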

Although the NESG Construct Optimization Software identified protein subsequences that were more likely to produce soluble well-behaved samples, several variants of each were assayed to identify constructs amenable to protein sample production. Therefore, the high-throughput NESG Molecular Cloning and Expression Screening Platform was developed utilizing 96-well parallel cloning / E. coli expression and Qiagen BioRobot 8000-based liquid handling. Briefly, protein target sequences (constructs) were PCR amplified from Reverse Transcriptase (RT) generated cDNA pools or genomic DNA, gel purified and extracted in 96-well format (robotic liquid handling) and subcloned into pET NESG, a series of T7 based (Novagen) bacterial expression vectors generated at Rutgers, using InFusion (Clontech) Ligation Independent Cloning (LIC). The RT generated cDNA pools were derived from normal and disease tissue (tumor cells and cell lines), allowing for the isolation of wild-type and polymorphic proteins. Correct clones (containing the desired protein open reading frame) were identified using plate-based PCR assays. An automated DNA Miniprep Protocol isolated the nascent expression vectors, and a 96-well transformation protocol was used to introduce the plasmids into the BL21(DE3) pMgK E. coli expression strain. Following overnight growth, a single representative colony from each of the 96 wells was transferred to LB in a 96-well S-Block and incubated for 6 hours. Automated liquid handling was then utilized to produce a 500 microliter overnight subculture of each of the 96 constructs in a single 96-well S-Block. An aliquot of each well was then subcultured into the corresponding well of one of four 24-well blocks containing 2 ml of fresh media and incubated at 37 °C until mid-log phase growth. Protein expression was induced with IPTG (isopropyl β-D-1-thiogalactopyranoside) and the cultures were incubated overnight at 17 °C. The cells were harvested using automated liquid handling and sonicated in 96-well format. The expression and solubility of each construct were visualized by SDS-PAGE analysis, and constructs suitable for protein production were identified.

The soluble expression constructs were then fermented in large volume using a parallel fermentation system consisting of 2.5-L baffled Ultra Yield™ Fernbach flasks, low-cost platform shakers, controlled temperature rooms and specialized MJ9 media (Jansson et al., 1996). This generally produced 10 - 100 mg of protein per liter of culture. The resulting proteins were then purified using a high-throughput ÄKTAxpress-based parallel protein purification system. This consisted of a two-step automated procedure: Ni-affinity purification (pET NESG imparts a 6X-His tag) followed by gel filtration chromatography. The purified proteins were then analyzed for quality, including molecular weight validation by MALDI-TOF mass spectrometry, homogeneity analysis by SDS-PAGE, aggregation screening by analytical gel filtration with static light scattering, and finally concentration determination.

Together, the NESG Construct Optimization Software, Molecular Cloning and Expression Screening Platform and Automated Purification Pipeline allow for identification and isolation of large numbers of soluble well-behaved protein reagents in a time efficient and cost effective manner. Without this technology, many of the proteins would prove elusive in regard to production as a protein reagent.

In this process, target protein expression constructs were designed using proprietary bioinformatics methods, cloning was done using robotic methods and protocols, and Expression (E, ranging from 0 to 5) and Solubility (S, ranging from 0 to 5) screening were performed in a high throughput fashion and assessed using SDS-PAGE analysis. The readout (ES score = E score × S score, ranging from 0 to 25) provided a measure of the usability of a particular target construct and expression vector system combination for large-scale protein sample production. In general, constructs providing ES scores > 9 in this high throughput expression and solubility assay provided milligram-per-liter (or tens-of-milligrams-per-liter) quantities of protein samples in medium scale (0.5 - 3 L) shake flask fermentations.
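The ES readout can be written as a short Python sketch. The function names `es_score` and `suitable_for_scale_up` are hypothetical; the scoring ranges and the strict ES > 9 cutoff are taken from the description above.

```python
def es_score(e, s):
    """Return the ES score, the product of Expression (0-5) and Solubility (0-5)."""
    if not (0 <= e <= 5 and 0 <= s <= 5):
        raise ValueError("E and S must each be between 0 and 5")
    return e * s

def suitable_for_scale_up(e, s, threshold=9):
    """True when the ES score strictly exceeds the threshold (ES > 9 in the text)."""
    return es_score(e, s) > threshold

# Illustrative (E, S) pairs, not values taken from Table 1.
for e, s in [(5, 5), (5, 2), (3, 3), (5, 0)]:
    print(f"E={e} S={s} ES={es_score(e, s)} scale-up={suitable_for_scale_up(e, s)}")
```

Note that a construct with E = 3 and S = 3 gives ES = 9 and therefore falls just below the cutoff, while a highly expressed but insoluble construct (S = 0) always scores 0 regardless of E.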

As a demonstration of the TOEET technology, a set of approximately 96 human transcription factor genes and epigenetic regulatory factor genes were cloned into the pET15_NESG vector (Acton et al., 2011) lacking a TOEET sequence, and into both the pNESG_Avi6HT and pNESG_Nano6HT vectors. These expression vectors were constructed, and the expression and solubility of target proteins assessed, using the technology outlined above. The results of this study are summarized in Table 1.

It was found that, using the pET15_NESG vector, only 20 of 99 constructs provided expression and solubility levels that can support scale-up protein sample production (ES score > 9; highlighted in grey shade in Table 1). In contrast, using the pNESG_Nano6HT or pNESG_Avi6HT vector on this same set of target genes provided a significant increase in the number of highly-expressed and soluble targets suitable for scale-up production. As shown in Table 1, 42 of 98 tested, and 34 of 94 tested protein targets exhibited an ES score > 9 (highlighted in grey shade in Table 1) in the pNESG_Avi6HT and pNESG_Nano6HT vectors, respectively. Several SDS-PAGE gels illustrating these expression and solubility enhancements are shown in Figure 3. Not only were more of these 99 human protein target genes expressed using TOEET, but both expression levels and solubility were generally increased. For example, while about half of the 99 protein targets had expression value E = 0 (i.e., no detectable expression) in the pET15_NESG vector (lacking TOEET), 95 of the 99 protein targets had expression values E > 2 in either the pNESG_Nano6HT or the pNESG_Avi6HT vector (Table 1); many had E = 5 (the maximum level typically observed) in the expression vectors using TOEET.
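The counts reported above can be converted to success rates with a few lines of Python. The counts are those stated in the text; the dictionary layout and the function name `success_rate` are illustrative choices.

```python
# (constructs with ES > 9, constructs tested), as stated in the text.
counts = {
    "pET15_NESG": (20, 99),
    "pNESG_Avi6HT": (42, 98),
    "pNESG_Nano6HT": (34, 94),
}

def success_rate(hits, tested):
    """Percentage of tested constructs with ES > 9."""
    return 100.0 * hits / tested

for vector, (hits, tested) in counts.items():
    print(f"{vector}: {hits}/{tested} = {success_rate(hits, tested):.1f}%")
```

This gives roughly 20.2% for the vector lacking TOEET, against 42.9% and 36.2% for the pNESG_Avi6HT and pNESG_Nano6HT vectors, respectively.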

Construct designs for a larger set of more than 2,000 human transcription factor proteins and domains are listed in Table 2. A large number of the proteins listed in Table 2 have been cloned into vectors optimized by TOEET, such as the pNESG_Nano6HT and pNESG_Avi6HT vectors, and exhibit high levels of expression and solubility. Analysis of these data indicates that both the pNESG_Nano6HT and pNESG_Avi6HT vectors produced greater expression and solubility levels than a standard pET15_NESG vector that has not been optimized using the TOEET technology described in this disclosure.

Overall, TOEET allows for the production of a significantly greater number of human proteins and protein domains. The higher ES values obtained using TOEET also allow for simpler production and purification of the target proteins, since high ES scores mean that the cell extract has a larger amount of the target protein relative to background proteins.

The pNESG_Avi6HT vector also allows for the production of protein samples that can be readily biotinylated in the EET tag sequence. The pNESG_Nano6HT tag also provides a means for simple production of a streptavidin-binding protein (Scholle et al., 2004). Such biotinylated or Nano-tagged protein samples can be used for a variety of processes, including phage display antibody production, as well as for screening and discovering protein-protein and protein-nucleic acid interactions.

EXAMPLE 3

In certain applications, proteins that are expressed but not soluble in cell extracts can be solubilized and used successfully as antigens using various methods of solubilization, including urea and guanidine denaturants (Agaton et al., 2003). Accordingly, the ability to express a protein target, even if it is not soluble in the high throughput Expression-Solubility screen described above [NESG High Throughput (HTP) Molecular Cloning and Expression Screening Platform methods], is critical, since if the protein cannot be expressed at all it is not possible to generate a suitable antigen. Accordingly, a particularly important value of the TOEET technology is enhancement of protein expression (E), regardless of the resulting solubility. To illustrate this point, histogram plots are presented in Figures 7a and 7b comparing Expression scores (E ranging from 0 to 5) using the TOEET technology (E_TOEET) with expression scores for the same target protein using a pET vector lacking TOEET technology (E_pET). The data shown in Figure 7a are for 98 protein target genes cloned into the pNESG_Avi6HT TOEET vector compared with the exact same genes cloned into the pET15_NESG vector (lacking TOEET). The data shown in Figure 7b are for 94 protein target genes cloned into the pNESG_Nano6HT TOEET vector compared with the exact same genes cloned into the pET15_NESG vector (lacking TOEET). In these histogram plots, a value E_TOEET - E_pET = 0 indicates that the expression levels for both vectors were identical; values E_TOEET - E_pET > 0 indicate that the TOEET technology provided higher level expression; values E_TOEET - E_pET < 0 indicate that the TOEET technology provided lower level expression. For both target sets, the vast majority of genes exhibit much higher expression in the pNESG_Avi6HT TOEET and pNESG_Nano6HT TOEET vectors compared with the pET15_NESG vector (lacking TOEET). In many cases, E_TOEET - E_pET is 4 or 5, indicating that the expression in the non-TOEET vector was 0 or 1, which is too low to be useful for antigen production. Thus the TOEET vectors often provide high level expression of proteins which cannot otherwise be expressed at all, or which are otherwise expressed at such marginal levels as to be useless for antigen production.
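The quantity plotted in such a histogram can be tabulated with a short Python sketch. The score pairs below are hypothetical values invented for illustration and are not the data of Figures 7a and 7b; the function name `expression_difference_histogram` is likewise an assumption.

```python
from collections import Counter

def expression_difference_histogram(pairs):
    """Count occurrences of E_TOEET - E_pET over (E_TOEET, E_pET) score pairs."""
    return Counter(e_toeet - e_pet for e_toeet, e_pet in pairs)

# Hypothetical (E_TOEET, E_pET) pairs; each E is an integer from 0 to 5.
pairs = [(5, 0), (5, 1), (4, 0), (3, 3), (2, 3), (5, 0)]
hist = expression_difference_histogram(pairs)

# Differences can range from -5 to +5; print one text bar per bin.
for diff in range(-5, 6):
    print(f"{diff:+d} {'#' * hist[diff]}")
```

Bins to the right of zero correspond to targets for which the TOEET vector gave the higher expression score, and bins to the left of zero to targets for which it gave the lower score.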

EXAMPLE 4

A representative method for practicing certain embodiments of the invention is described below.

The first step in the method is to identify the residues of the chosen tag/protein and the corresponding DNA sequences to be modified, for example, the first 30 residues of the tag/protein (Step 1). Low usage codons are identified and are changed to optimal codons either manually or using servers such as http://www.jcat.de/ or http://genomes.urv.es/OPTIMIZER/, among others (Step 2). The transcription start site of the vector and the resulting 5' untranslated region are then identified (Step 3). The 5' UTR RNA sequence is fused in silico with the optimized RNA sequence encoding the tag/protein (e.g., the first 30 residues of the tag/protein) (Step 4). Various RNA secondary structure prediction methods may then be used to analyze the fused sequence, such as, for example, http://www.genebee.msu.su/services/rna2_reduced.html, http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi (Minimum Free Energy with partition function) or http://www.ncrna.org/centroidfold/ (Centroid Estimators, Statistical Decision Theory) (Step 5). The RBS and Initiation Codon (IC) are then identified in the secondary structure prediction, and the RNA positions in the first, e.g., 30 residues of the tag/protein that pair to the RBS/IC regions are determined (Step 6). Subsequently, alternative high frequency codons are substituted for the residues whose codons base pair with the RBS/IC, and the secondary structure is recalculated (Step 7). Steps 5 through 7 may be repeated until the secondary structure in the RBS/IC is minimized and there is general agreement between the prediction servers (e.g., multiple prediction servers may be used, such as the three servers listed above). This information is then used to design and produce the TOEET-optimized expression vector. Target proteins may then be cloned into and expressed in the resulting expression system using the NESG Construct Optimization Software and High Throughput (HTP) Molecular Cloning and Expression Screening Platform and Automated Purification Pipeline methods, as outlined above.
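The iterative loop of Steps 5 through 7 can be sketched in Python under strong simplifying assumptions. Instead of calling a secondary structure prediction server, the sketch uses a crude stand-in: the length of the longest stretch of the coding RNA that is perfectly Watson-Crick complementary (antiparallel) to the upstream RBS-containing region. The upstream sequence `RBS_REGION`, the synonym table `SYNONYMS`, and the function names are all hypothetical; a real application would replace `longest_helix` with an actual folding prediction and would use a host-specific table of high frequency codons.

```python
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def revcomp(rna):
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    return "".join(PAIR[b] for b in reversed(rna))

def longest_helix(region, coding):
    """Length of the longest stretch of `coding` perfectly complementary to `region`.

    Crude proxy for predicted base pairing; not a folding algorithm.
    """
    target = revcomp(region)
    best = 0
    for i in range(len(coding)):
        # Only look for stretches longer than the best one found so far.
        for j in range(i + best + 1, len(coding) + 1):
            if coding[i:j] in target:
                best = j - i
            else:
                break
    return best

def optimize(codons, residues, region, synonyms):
    """Greedily swap in synonymous codons while the pairing proxy decreases."""
    codons = list(codons)
    improved = True
    while improved:
        improved = False
        for k, aa in enumerate(residues):
            for alt in synonyms.get(aa, []):
                trial = codons[:k] + [alt] + codons[k + 1:]
                if (longest_helix(region, "".join(trial))
                        < longest_helix(region, "".join(codons))):
                    codons, improved = trial, True
    return codons

# Hypothetical upstream region containing a Shine-Dalgarno-like motif.
RBS_REGION = "AAGGAGAUAUACAU"
# Hypothetical alternative codons per residue (RNA alphabet).
SYNONYMS = {"S": ["AGC", "UCC"], "P": ["CCG", "CCA"]}

start = ["AUG", "UCU", "CCU"]          # Met-Ser-Pro
result = optimize(start, "MSP", RBS_REGION, SYNONYMS)
print("before:", "".join(start), longest_helix(RBS_REGION, "".join(start)))
print("after: ", "".join(result), longest_helix(RBS_REGION, "".join(result)))
```

In this toy case the serine codon UCU, which together with CCU is complementary to the AGGAGA motif, is replaced by the synonymous codon AGC, and the longest complementary stretch drops from 6 to 3 nucleotides without changing the encoded residues. The loop terminates because each accepted substitution strictly lowers the score.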

Table 1. Expression Results

pET15 NESG pNESG Avi6HT pNESG Nano6HT

13 HR6832 0 0 0 5

14 HR6832-1-194 3 0 0 5

15 HR6832A-33-194 3 0 0 5

16 HR6832A-72-207 5 0 0 5

17 HR6956-16-584 0 0 0 0

18 HR6956A-115-584 0 0 0

19 HR6956B-170-584 5 0 0 4

38 HR7133B-182-468 5 0 0 5

39 HR7133C-192-468 5 0 0 5

40 HR7224B-363-626 0 0 0 3

41 HR7224B-396-626 0 0 0 1

42 HR7224B-411-626 0 0 0 5

HR7224C-291-374 0 0

HR7224C-294-356 0 0 0

HR7370A-209-460 0 0 0

HR7372A-210-470 5 0 0

HR7372A-237-470 3 0 0

HR7372A-245-470 3 0 0

HR7378A-18-385 0 0 0

HR7378B-65-385 4 0 0

HR7378B-76-385 0 0 0

HR7378C-18-109 0 0 0

HR7378C-18-83 0 0 0

HR7378C-18-96 0 0

HR7469A-9-91 0 0

HR7469A-9-95 0 0

HR7870B-130-434 5 0

HR7870C-142-434 5 0

HR7906B-114-410 5 5 1

HR7906C-50-126 0 0

HR7993A-220-461 5 0

HR7993B-10-111 4 3 I

HR8028A-67-146 0 0

HR8028A-67-156 0 0

HR8028A-73-146 0 0

89 HR8028A-73-156 0 0 0 3 1 3 4 0 0

90 HR8028B-163-441 4 0 0 5 0 0 4 0 0

91 HR8234A-206-287 2 2 4 5 1 5

92 HR8241A-261-342 0 0 0 5

93 HR8241A-264-328 0 0 0 2

94 HR8241B-328-598 0 0 0 5

95 HR8278A-123-216 0 0 0 4

96 HR8341A-100-171 0 0 0 3

97 HR8341A-100-181 0 0 0 5

98 HR8341A-100-200 0 0 0 4

99 HR8341B-381-579 5 4 ip 5

E = Expression; E = 0-5 (no to high expression)

S = Solubility; S = 0-5 (no to high solubility)

ES = E * S (0-25); ES > 9 indicates usability (highlighted with grey fill)

ES > 9 typically results in ≥ 5 milligrams of protein per liter of fermentation
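As a minimal sketch of the scoring scheme used in Table 1, the product of the expression and solubility scores can be computed and compared against the usability threshold as follows. The function names and the range check are assumptions made for illustration; only the 0-5 scales, the product ES = E * S, and the ES > 9 criterion come from the table legend.

```python
def es_score(expression: int, solubility: int) -> int:
    """Return ES = E * S, where E and S are each scored from 0 to 5."""
    for name, value in (("expression", expression), ("solubility", solubility)):
        if not 0 <= value <= 5:
            raise ValueError("%s score must be between 0 and 5, got %r" % (name, value))
    return expression * solubility


def is_usable(expression: int, solubility: int, threshold: int = 9) -> bool:
    """A construct is treated as usable when ES is strictly greater than the threshold."""
    return es_score(expression, solubility) > threshold
```

For example, a construct scored E = 3 and S = 3 gives ES = 9 and is not counted as usable, whereas E = 5 and S = 2 gives ES = 10 and is.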

Table 2. Human transcription factor protein and domain constructs designed using the NESG Construct Optimization Software for production using TOEET technologies. Each line in the table describes a unique protein construct for RT-PCR cloning, defined by the NESG Vector ID, the HUGO protein identifier, the Uniprot protein identifier, the first 15 amino acid residues in the targeted construct, the last 15 amino acid residues in the targeted construct, and the length of the targeted gene. The actual length of the targeted gene obtained by RT-PCR may be shorter or longer than indicated in the table due to RNA splicing variations.

Vector HUGO Uniprot First 15aa Last 15aa Construct Length

HR8040A-384-551-AV6HT AKAP8L Q9ULX6 VERIQFVCSLCKYRT KKLERYLKGENPFTD

HR7916A-159-235-Av6HT ALX3 TFSTFQLEELEKVFQ RNPFTAAYDISVLPR

HR6941A-510-703-NHT ANAPC2 Q9UJX6 GSKDLFINEYRSLLA VALLRRRMSVWLQQG

HR6941B-732-822-Av6HT ANAPC2 Q9UJX6 SDDESDSGMASQADQ LVYSAGVYRLPKNCS

HR6941C-498-713-Av6HT ANAPC2 Q9UJX6 SSDIISLLVSIYGSK WLQQGVLREEPPGTF

HR8423A-486-593-Av6HT ANKZF1 AKAPGQPELWNALLA STRNEFRRFMEKNPD

HR5083A-1-319-14 APEX2 MLRVVSWNINGIRRP CPVGAVLSVSSVPAK

HR5083A-1-352-14 APEX2 MLRVVSWNINGIRRP ILRFLVPLEQSPVL 352

HR8294A-15-116-TEV APTX RVCWLVRQDSRHQRI HMVNELYPYIVEFEE 100

HR7542A-507-616-NHT ARID2 Q68CP9 QHVAPPPGIVEIDSE R Al P LP I QM YYQQQP

HR4394C-15 ARID3A Q99856 MPDHGDWTYEEQFKQ ELQAAIDSNRREGRR 135

HR4394C-218-351-TEV ARID3A Q99856 PDHGDWTYEEQFKQL ELQAAIDSNRREGRR

HR4572B-103-181-14 ATF3 P18847 MCRNKKKEKTECLQK RNLFIQQIKEGTLQS 80

UB 11 tu.nn./iac.Av/CUT OA71 A nQMDI ^ I DDITIC rnAI hAMl EC CI I CCCI ΤΛΙΕΠΛΙΛ

HR7019A-867-954-TEV CAMTA1 Q9Y6Y1 SGRVFMVTDYSPEWS NNQIISNSWFEY A 88

HR7688A-98-186-NHT CLOCK QDWKPTFLSNEEFTQ THLLESDSLTPEYLK

HR7654B-1987-2369-TEV CNOT1 A5YKK6 QLPYHRIFIMLLLEL SVAQCCMGQKQAQQV

HR2981-28-443-15 COPS2 MPNVDLENQYYNSKA NQLNSLNQAVVSKLA

HR2981-45-443-15 COPS2 P61201 MDDPKAALSSFQKVL NQLNSLNQAVVSKLA

HR2981B-339-418-14 COPS2 P61201 DDPFIREHIEELLRN QVNQLLELDHQKRGG

HR2981C-45-184-15 COPS2 MDDPKAALSSFQKVL KILRQLHQSCQTDDG

HR3016-1-411-14 COPS3 Q9UNS2 MASALEQFVNSVRQL ITVNPQFVQKSMGSQ

HR3016A-49-114-15 COPS3 Q9UNS2 LDVQEHSLGVLAVLF FAGLCHQLTNALVER

HR3016C-270-368-15 COPS3 Q9UNS2 NNPSELRNLVNKHSE KDGMVSFHDNPEKYN

HR3016C-270-411-15 COPS3 Q9UNS2 NNPSELRNLVNKHSE ITVNPQFVQKSMGSQ

HR3016D-358-409-15 COPS3 VSFHDNPEKYNNPAM QEITVNPQFVQKSMG

HR8371A-114-182-AV6HT DMRT2 Q9Y5R5 PRKLSRTPKCARCRN LRRQQATEDKKGLSG

HR6947A-318-361-NHT DMRTA1 Q5VZB9 SLPTVSSRPRDPLDI GILRFCKGDVVQAIE

HR7387A-44-84-Av6HT DMRTC2 Q8IXT2 RCRNHGVTAHLKGHK KCVLILERRRVMAAQ

HR8011A-205-293-15 DMTF1 MSTEPGDIVTQGVSW LAEGWSSVRSPQWLR

HR8011B-255-356-15 DMTF1 MDEINLILRIAELDV NSNTNSSVQHVQIRV

HR8011B-268-356-15 DMTF1 MVADENDINWDLLAE NSNTNSSVQHVQIRV

HR7581A-1-76-NHT DNAJC21 Q5F1R6 KCHYEALGVRRDASE RAWYDNHREALLKGG

HR8202A-15-83-Av6HT DPRX A6NFQ7 HSHRKRTMFTKKQLE AKLKKAKCKHIHQKQ

HR7601-1-176-TEV DR1 ASSSGNDDDLTIPRA NQAGSSQDEEDDDDI 175

HR7517A-25-174-NHT DUSP12 GQMLEVQPGLYFGGA WQLKLYQAMGYEVDT

HR4713B-251-345-TEV DVL1 O14640 TVTLNMERHHFLGIS ISLTVAKCWDPTPRS

HR4622-24-237-15 O75461 RRCRDPINVEGLLP HIRSTNGPIDVYLCE 215

HR4622-24-281-Av6HT RRCRDPINVEGLLPS EENPQQSEELLEVSN 258

HR4622B-128-247-15 GSDLSNFGAVPQQKK VYLCEVEQGQTSNKR

HR4622C-127-242-15 IGSDLSNFGAVPQQK NGPIDVYLCEVEQGQ

HR4622D-54-137-15 O75461 RKALKVKRPRFDVSL HIRWIGSDLSNFGAV

HR4622D-54-242-15 RKALKVKRPRFDVSL NGPIDVYLCEVEQGQ

HR4622D-58-175-15 O75461 KVKRPRFDVSLVYLT IKDCAQQLFELTDDK

HR8499A-141-251-Av6HT SRKQKSLGLLCQKFL YLQQKELDLIDYKFG

HR8342-1-508-Av6HT Q66K89 EGAMAVRVTAAHTAE GDCGKLYKTIAHVRG

HR8342A-522-586-15 Q66K89 MPKCGKRYKTKNAQQ EKPFKCYKCGRGFAE

HR8342A-527-581-15 MRYKTKNAQQVHFRT RHHTGEKPFKCYKCG

HR6332A-203-348-15 GKGNTIYLWEFLLAL SPGVKGGATTVLKPG 146

HR6332B-157-299-15

HR6332B-198-304-15

HR7067A-150-308-15

HR7067A-200-308-15

HR7867B-269-371-TEV

HR8186A-1-87-TEV

HR7396A-166-265-TEV ELF5

HR4449C-1-93-TEV

HR7174A-264-457-Av6HT EOMES O95936

HR4540F-1226-1281-14 EP300

HR7097B-179-423-AV6HT ESRRA P11474 GPLAVAGGPRKTAAP PMHKLFLEMLEAMMD 245

HR4739B-114-198-AV6HT FLI1 Q01543 MPPNMTTNERRVIVP TEVLLSHLSYLRESS

HR4429D-233-303-14 Q99684 MKGAGVKVESELLCT CGKTFGHAVSLEQHK 72

HR4429D-238-298-14 MKVESELLCTRLLLG FACEMCGKTFGHAVS

HR4429E-311-392-TEV Q99684 SFDCKICGKSFKRSS SQSSNLITHSRKHTG

HR7924A-436-590-TEV P10070 ETNCHWEDCTKEYDT TDPSSLRKHVKTVHG

HR7155A-189-350-NHT GLIS1 RVVAGRQACRWVDCC PSSLRKHVKAHSAKE 162

HR7416A-116-318-Av6HT GLIS2 DFQPLRYLDGVPSSF YTDPSSLRKHIKAHG

HR7416A-163-298-Av6HT GLIS2 PLPKQLVCRWAKCNQ TRTHYVDKPYYCKMP 136

HR7200A-261-553-TEV GLYR1 Q49A26 GSITPTDKKIGFLGL QSDNDMSAVYRAYIH 293

HR7418A-64-203-NHT GMEB2 Q9UKD1 AFTASSQLKEAVLVK LSSPTSAEYIPLTPA 140

HR7418A-87-203-AV6HT GMEB2 Q9UKD1 EAEIVYPITCGDSRA LSSPTSAEYIPLTPA 117

HR7418B-83-179-AV6HT GMEB2 Q9UKD1 GENLEAEIVYPITCG YQHDKVCSNTCRSTK 97

HR7205B-53-236-15 GTF2H2C Q6P1K8 MVRLGMMRHLYVWD ILDESHYKELLTHHL 185

HR7469A-9-95-15 HNF4G Q14541 MVLDPTYTTLEFETM HVYSCRFSRQCVVDK 88

HR4478B-263-312-14 HOXC10 Q9NYD6 MLTAKSGRKKRCPYT RLEISKTINLTDRQV

HR4478C-247-342-14 HOXC10 Q9NYD6 MNEAKEEIKAENTTG RENRIRELTSNFNFT

HR7847A-205-271-NHT HOXC12 P31275 APWYPINSRSRKKRK QVKIWFQNRRMKKKR

HR8257A-163-216-NHT HOXC4 P09017 YTRQQVLELEKEFHY IWFQNRRMKWKKDHR 54

HR7011A-156-219-TEV HOXC5 Q00444 KRSRTSYTRYQTLEL NRRMKWKKDSKMKSK 64

HR6394A-149-208-TEV HOXC8 P31273 RRSGRQTYSRYQTLE KIWFQNRRMKWKKEN

HR8256A-200-280-Av6HT HOXD1 Q9GZZ0 AAFSTFEWMKVKRNA LHLNDTQVKIWFQNR

HR8148A-274-327-Av6HT HOXD10 P28358 TKHQTLELEKEFLFN WFQNRRMKLKKMSRE

HR8017A-257-326-AV6HT HOXD11 P31277 SSSAVAPQRSRKKRC IWFQNRRMKEKKLNR

HR7220A-181-257-Av6HT HOXD3 P31249 GESCEDKSPPGPASK QNRRMKYKKDQKAKG 77

HR7832A-197-256-TEV HOXD8 P13378 RRRGRQTYSRFQTLE KIWFQNRRMKWKKEN

HR3173A-14 Q13568 NQSIPVAPTPPRRV PPQPYKIYEVCSNGP

HR3173F-8-114-14 APTPPRRVRLKPWLV FRLIYDGPRDMPPQP

HR7755A-198-455-AV6HT LEMEVPQAPIQPFYS RILQTQESWQPMQPT

HR5527A-15 MALAPERAAPRVLFG RRFVMLRDNSGDPAD

HR7337A-9-115-TEV Q02556 RLRQWLIEQIDSSMY LDISEPYKVYRIVPE

HR7302A-209-393-Av6HT Q00978 EFLLPPEPDYSLLLT LEQTPEQQAAILSLV

HR7304A-126-209-NHT IRX6 PYERTLGQYQYERYG TWFANARRRLKKENK

HR8291A-190-254-TEV ISL2 Q96A47 KTTRVRTVLNEKQLH QNKRCKDKKKSILMK 65

HR8400B-804-1099-AV6HT JARID2 Q92833 KGVLNDFHKCIYKGR LDELRDTELRQRRQL 296

HR8400B-809-1104-AV6HT JARID2 Q92833 DFHKCIYKGRSVSLT DTELRQRRQLFEAGL

HR8400C-900-1104-AV6HT JARID2 Q92833 GSILRHLGAVPGVTI DTELRQRRQLFEAGL

HR4484C-253-308-AV6HT JUN MIKAERKRMRNRIAA LASTANMLREQVAQL 57

HR4765B-273-324-TEV P17275 RKRLRNRLAATKCRK LSSTAGLLREQVAQL

HR2962A-5-79-Av6HT KAT5 Q92993 GEIIEGCRLPVLRRN LKKIQFPKKEAKTPT

HR7375B-685-750-Av6HT KDM5B Q9UGL1 LPDDERQCVKCKTTC YTLDDLYPMMNALKL

HR7375C-1487-1544-Av6HT KDM5B Q9UGL1 CPAVSCLQPEGDEVD YICVRCTVKDAPSRK

HR7375E-1123-1227-Av6HT KDM5B Q9UGL1 ESLSDLERALTESKE LRIWLCPHCRRSEKP

HR7375E-1132-1230-Av6HT KDM5B Q9UGL1 LTESKETASAMATLG WLCPHCRRSEKPPLE

HR7375E-1143-1230-AV6HT KDM5B Q9UGL1 ATLGEARLREMEALQ WLCPHCRRSEKPPLE

HR7188A-306-385-TEV KDM5D Q9BY66 HSSAQFIDSYICQVC EAFGFEQATQEYSLQ

HR7682A-10-80-Av6HT KIAA2018 Q68DE3 PTKKQHRKKNRETHN ITELKRQNDELLLNG 71

HR7553A-272-338-15 KLF1 MARKRQAAHTCAHPG DELTRHYRKHTGQRP 68

HR7553B-319-362-Av6HT KLF1 Q13351 RFARSDELTRHYRKH FSRSDHLALHMKRHL 44

HR4668C-173-283-TEV KLF6 Q99612 GKVRSGTSGKPGDKG FSRSDHLALHMKRHL 111

HR7181A-199-265-TEV MIER1 Q8N108 YKENEKVYENDDQLL KDNEQALYELVKCNF

HR8195A-1-143-Av6HT MLLT1 Q03111 DNQCTVQVRLELGHR TEFRYKLLRAGGVMV 142

HR7716A-121-220-NHT MLX

HR7887A-726-802-Av6HT MLXIP Q9HAP2

HR7223A-347-541-Av6HT

HR4485B-103-183-14 MSC

HR4485B-103-188-Av6HT MSC

HR4485B-103-194-14 MSC

HR4485B-135-194-14 MSC O60682

HR7186A-122-193-NHT MSGN1 A6NI15

HR4585B-167-224-TEV MSX1 P28360

HR4538-1-540-14 MTA1

HR6436A-58-155-14 MXI1 P50539 SSGSSNTSTANRSTH KWRLEQLQGPQEMER

HR6427A-38-145-14 NFYA P23511 EAQVASASGQQVQTL QIIIQQPQTAVTAGQ 108

HR7372A-245-470-TEV P51843 PVALKSPQWCEAAS VSMDDMMLEMLCTKI

HR7378B-76-385-Av6HT NR2E1 Q9Y466 KKCLEVNMNKDAVQH PITRLLSDMYKSSDI

HR7378C-18-109-15 NR2E1 Q9Y466 MVCGDRSSGKHYGVY RTSTIRKQVALYFRG

HR7378C-18-109-TEV NR2E1 Q9Y466 VCGDRSSGKHYGVYA RTSTIRKQVALYFRG

HR7378C-18-83-Av6HT NR2E1 Q9Y466 VCGDRSSGKHYGVYA QCRACRLKKCLEVNM

HR7378C-18-96-15 NR2E1 MVCGDRSSGKHYGVY NMNKDAVQHERGPRT

HR7378C-18-96-TEV NR2E1 VCGDRSSGKHYGVYA NMNKDAVQHERGPRT

HR7906B-101-410-Av6HT NR2E3 QACRLKKCLQAGMNQ GNTPMEKLLCDMFKN

HR7906B-114-410-15 NR2E3 Q9Y5X4 MNQDAVQNERQPRST GNTPMEKLLCDMFKN

HR7906B-114-410-TEV NR2E3 NQDAVQNERQPRSTA GNTPMEKLLCDMFKN

HR7906C-45-119-15 NR2E3 MLQCRVCGDSSSGKH LKKCLQAGMNQDAVQ

HR7906C-45-142-15 NR2E3 MLQCRVCGDSSSGKH AQVHLDSMESNTESR

HR7224B-396-626-TEV NR4A3 ICMMNALVRALTDST PPSIIDKLFLDTLPF

HR7224B-411-626-AV6HT NR4A3 PRDLDYSRYCPTDQA PPSIIDKLFLDTLPF

HR7224C-291-374-15 NR4A3 Q92570 MTCAVCGDNAACQHY EVVRTDSLKGRRGRL

HR7224C-291-374-TEV NR4A3 TCAVCGDNAACQHYG EVVRTDSLKGRRGRL

HR7224C-294-356-AV6HT NR4A3 Q92570 VCGDNAACQHYGVRT NRCQYCRFQKCLSVG

HR7993A-220-461-15 NR5A1 MGPNVPELILQLLQL PRNNLLIEMLQAKQT

HR7993A-220-461-TEV NR5A1 GPNVPELILQLLQLE PRNNLLIEMLQAKQT

HR7993B-10-111-Av6HT NR5A1 Q13285 DELCPVCGDKVSGYH PMYKRDRALKQQKKA 102

HR8211A-79-187-Av6HT NR5A2 O00482 MDEDLEELCPVCGDK KRDRALKQQKKALIR 110

HR8211A-79-187-TEV NR5A2 O00482 DEDLEELCPVCGDKV KRDRALKQQKKALIR 109

HR7049A-49-474-Av6HT NR6A1 DRAEQRTCLICGDRA LFKVVLHSCKTSVGK 426

HR7555A-320-466-TEV ONECUT3 O60422 EINTKEVAQRITAEL GLELNTVSNFFMNAR 147

HR7140A-124-182-TEV PCGF6 Q9BYE7 NLSELTPYILCSICK RCPKCNIWHQTQPL 59

HR6832A-33-207-Av6HT PHB2 Q99623 AYGVRESVFTVEGGH YTAAVEAKQVAQQEA 175

HR7108A-119-241-AV6HT PIKFYVE Q9Y2I7 GHDPRTAVQLRSLST NSIGEDLNALSDSAC 123

HR7109-1-441-15 PLAG1 Q6DJT9 MATVIPGDLSEVRDT LNGPPYNPLSVGSLG 441

HR7739A-238-353-TEV PLEK2 Q9NYT0 SLSTVELSGTVVKQG KAERAEWIEAIKKLT 116

HR6946A-353-407-Av6HT POU3F2 P20265 RKRKKRTSIEVSVKG LEKEVVRVWFCNRRQ

HR6946A-356-432-15 POU3F2 P20265 MKKRTSIEVSVKGAL TLPGAEDVYGGSRDT

HR6946A-377-427-15 POU3F2 MPKPSAQEITSLADS TPPGGTLPGAEDVYG

HR8200A-192-260-TEV POU3F4 P49335 DELEQFAKQFKQRRI CKLKPLLNKWLEEAD

HR7479A-183-261-NHT POU4F3 Q15319 DPRELEAFAERFKQR VLQAWLEEAEAAYRE 79

HR8237A-224-288-NHT POU5F1B Q06416 ETLMQARKRKRTSIE RVWFCNRRQKGKRSS

HR7133A-97-168-15 PPARA MALNIECRICGDKAS QYCRFHKCLSVGMSH

HR7133A-97-168-TEV PPARA ALNIECRICGDKASG QYCRFHKCLSVGMSH

HR7133A-97-174-AV6HT PPARA Q07869 ALNIECRICGDKASG KCLSVGMSHNAIRFG

HR7133A-97-187-15 PPARA Q07869 MALNIECRICGDKAS FGRMPRSEKAKLKAE

HR7133A-97-187-TEV PPARA Q07869 ALNIECRICGDKASG FGRMPRSEKAKLKAE

HR7133B-182-468-Av6HT PPARA Q07869 AKLKAEILTCEHDIE AALHPLLQEIYRDMY

HR7133C-192-468-15 PPARA MEHDIEDSETADLKS AALHPLLQEIYRDMY

HR7133C-192-468-TEV PPARA Q07869 EHDIEDSETADLKSL AALHPLLQEIYRDMY

HR8028A-67-146-Av6HT PPARD Q03181 CGSLNMECRVCGDKA KCLALGMSHNAIRFG

HR8028A-67-156-15 PPARD MCGSLNMECRVCGDK AIRFGRMPEAEKRKL

HR8028A-67-156-TEV PPARD Q03181 CGSLNMECRVCGDKA AIRFGRMPEAEKRKL

HR8028A-73-146-Av6HT PPARD Q03181 ECRVCGDKASGFHYG KCLALGMSHNAIRFG

HR8028A-73-156-15 PPARD MECRVCGDKASGFHY AIRFGRMPEAEKRKL 85

HR8028A-73-156-TEV PPARD Q03181 ECRVCGDKASGFHYG AIRFGRMPEAEKRKL 84

HR8028B-163-441-Av6HT PPARD Q03181 NEGSQYNPQVADLKA TSLHPLLQEIYKDMY

HR4464B-222-504-TEV PPARG P37231 LAEISSDIDQLNPES DMSLHPLLQEIYKDL

HR6970C-82-160-TEV RARB P10826 FVCQDKSSGYHYGVS MSKESVRNDRNKKKK

HR7540-171-714-15 RBAK Q9NYW8 MTYHGEKMCEFNQNG IHRRGNM VLDVENL 545

HR7631B-172-237-Av6HT RCOR1 Q9UKL0 KHNIEKSLADLPNFT IASLVKFYYSW KTR

HR4568B-112-233-TEV RUNX2 Q13950 ELVRTDSPNFLCSVL VTVDGPREPRRHRQK

HR4643B-135-200-TEV RXRA CAICGDRSSGKHYGV RCQYCRYQKCLAMGM

HR8407C-205-270-TEV RXRB P28702 CAICGDRSSGKHYGV RCQYCRYQKCLATGM 66

HR7653A-909-977-NHT SALL2 Q9Y467 SRKACEVCGQAFPSQ HHQVQPFAPHGPQNI 69

HR6875A-376-433-Av6HT SALL4 Q9UJQ4 MEAALYKHKCKYCSK FTTKGNLKVHFHRHP 59

HR4435B-174-250-14 SATB1 Q01826 MPKLEDLPPEQWSHT FGRWYKHFKKTKDMM

HR4435B-179-244-14 SATB1 MLPPEQWSHTTVRNA AAKCQEFGRWYKHFK

HR4435C-368-452-TEV SATB1 Q01826 NTEVSSEIYQWVRDE AERDRIYQDERERSL

HR4435D-56-250-15 SATB1 MPLKHSGHLMKTNLR FGRWYKHFKKTKDMM

HR4435E-56-175-15 SATB1 Q01826 MPLKHSGHLMKTNLR YHVVTLKIQLHSCPK

HR7571B-610-674-TEV SATB2 Q9UPW6 SCAKKPRSRTKISLE IKFFQNQRYHVKHHG

HR4810B-218-313-TEV SKI P12755 VRVYHECFGKCKGLL RLGRCLDDVKEKFDY 96

HR4503D-314-552-Av6HT

HR7103A-82-143-Av6HT SOX8 P57073 SLVPMPVRGGGGGAL NAELSKTLGKLWRLL 62

HR4679B-163-262-14 SPIB MAGTR KLRLYQFLL TYQFDSALLPAVRRA

HR7954A-111-207-Av6HT SPIC Q8N5J4 LRLFEYLHESLYNPE FSEAILQRLSPSYFL

HR7260B-583-850-Av6HT SRCAP Q6ZRS2 EITDIAAAAESLQP FLLRRVKVDVEKQMP

HR7260B-601-850-AV6HT SRCAP LATTQVKTPIPLLLR FLLRRVKVDVEKQMP

HR7260B-607-850-AV6HT Q6ZRS2 KTPIPLLLRGQLREY FLLRRVKVDVEKQMP

HR4448G-521-624-14 SREBF1 P36956 MVYHSPGRNVLGTES RALGRPLPTSHLDLA

HR4448G-526-619-14 SREBF1 P36956 MGRNVLGTESRDGPG LWLALRALGRPLPTS 95

HR4448G-530-624-14 SREBF1 P36956 MLGTESRDGPGWAQW RALGRPLPTSHLDLA

HR4448G-535-619-14 SREBF1 P36956 MRDGPGWAQWLLPPV LWLALRALGRPLPTS

HR4448H-319-400-TEV SREBF1 P36956 QSRGEKRTAHNAIEK SLRTAVHKSKSLKDL

HR6329A-1075-1134-14 SREBF2 Q12772 MPGQRERATAILLAC RSCNDCQQMIVKLGG 61

HR6924A-56-131-AV6HT SRY Q05066 VQDRVKRPMNAFIVW QAMHREKYPNYKYRP 76

HR4753B-177-253-14 TAL1 P17542 MEITDGPHTKVVRRI LAKLLNDQEEEGTQR 78

HR7868A-127-326-AV6HT TBX21 Q9UL17 LPAGLEVSGKLRVAL QLKIDNNPFAKGFRE

HR7369A-99-311-TEV VEDDPKVHLEAKELW LTLQSMRVFDERHKK

PARCVAAHCGNTTKS QRLRLVAGAVPTLHR

HR7799A-1-72-TEV THAP11 Q96EK4

HR8301A-1-87-TEV THAP2 Q9H0W7

HR8415A-1-149-Av6HT THAP6 Q8TBB0

HR6978A-1-82-15 THAP8 Q8NA92

HR6978A-16-87-15 THAP8 Q8NA92

HR7271A-1-88-NHT THAP9

HR7130A-202-461-Av6HT THRB P10828

HR7130B-104-206-15 THRB

HR7130B-104-206-TEV THRB P10828

HR7457A-14-77-NHT TIGD4

HR3466D-14 TP73 O15350 MSPAPVIPSNTDYPG KADEDHYREQQALNE 209

HR7321A-615-678-Av6HT TTF1 Q15361 NYKGRYSEGDTEKLK RNRGAWSKSETRKLI 64

HR7877B-270-376-15 ZBTB25 P24278 MPASILES DLGEVH SQLLEHMYTHKGKSY 108

HR7784A-370-422-15 ZFP91 Q96JP5 MRDYICEYCARAFKS TCRQKASLNWHMKKH

HR7907F-468-532-Av6HT Q9UKY1 RAKKTKEQLAELKVS NQRNSKSNQCLHLNN

HR8292A-524-605-TEV ZHX2 AYPDFAPQKFKEKTQ VLDSMGSGKKGQDVG

HR7728A-219-303-TEV ZIC1 Q15915 QPIKQELICKWIEPE LVNHIRVHTGEKPFP 85

HR8404A-122-262-NHT QPIKQELICKWLAAD SSDRKKHSHVHTSDK

HR7356A-417-487-NHT ZIK1 Q3SY52 SQSSILIQHRRIHTG SQCSSLIHHQKCHNT

HR8306A-277-331-Av6HT ZIM3 Q96PE6 KSYQCNECEKSFRQN IYKSDLVKHQRIHTG

HR8296A-7-131-AV6HT ZKSCAN2 Q63HK3 SQIDAPLEVEGCLIM VALVVHLEKETGRLR

HR7362A-49-138-NHT ZKSCAN4 Q969J2 PERSRQRFRGFRYPE VWLLEYLERQLDEP

HR7288A-62-140-NHT ZMAT2 Q96NC0 LGKTIVITKTTPQSE FEVNKKKMEEKQKDY

HR7144A-212-289-NHT ZNF10 P21506 SNECGQTFCQNIHLI SWRSNLTRHQLIHTG 78

HR8250A-364-426-AV6HT ZNF101 Q8IZC7 KPYECTRCGKAFGWC HERTHLAGRSQCFGR 63

HR7096A-560-615-15 ZNF107 Q9UII5 MVFNQSSNLTTQKII IYTGEKPHKCEECGK

HR8395A-1-68-Av6HT ZNF146 Q15072 SHLSQQRIYSGENPF SQKQYVIKHQNTHTG 67

HR7492A-160-244-NHT ZNF154 Q13106 CYICSECGKSFSKSY SNLIKHRRVHTGERP 85

HR7481A-394-466-NHT ZNF157 P51786 AFYVKARUEHQRMH YVKVRLIEHQRIHTG

HR7677A-461-550-AV6HT ZNF160 Q9HCG1 AFSMHSNLATHQVIH PYKCIECGKSFTQKS

HR7169A-37-134-NHT ZNF167 Q9P0L1 GQGSSLQKNYPPVCE ESGEEAVAWEDFQR

HR8144A-504-578-Av6HT ZNF17 GKSFRCRSTLDTHQR SQNSHLIRHQKVHTR

HR7382A-565-650-NHT ZNF175 Q9Y473 GKAFTSKSQFKEHQR THMGEKPYECLDCGK 86

HR8047A-6-143-Av6HT ZNF18 P17022 GQALGLLPSLAKAED WISIQVLGQDILSEK 138

HR7508A-221-306-NHT Q2M3W8 QGKSLTLPQTCNREK PYKCIECGKAFSHVS

HR8281A-147-213-Av6HT ZNF19 P17023 IQGKVPRIPCARKPF NGNSSLIRHQRIHTG

HR7141A-10-132-NHT ZNF193 SLGVQVPEAWEELLT ESGEEAVILLEDLER

HR7949A-310-387-Av6HT ZNF20 P17024 PYECKQCGKAFRCGS GKGFRCASQLQIHER

HR7401D-1-114-Av6HT ZNF295 Q9UU3 EGLLHYINPAHAISL AAVQELGYSLGISFL

HR7294A-14-105-NHT ZNF444 Q8N0Y2 LALDSPWHRFRRFHL AVALLEELWGPAASP 92

HR7670C-595-683-Av6HT ZNF510 Q9Y2H8 TGEKPFKCNECGKKF TLSLYQKIQGEGNPY

HR8051A-493-578-TEV ZNF512B Q96KM6 PGGPEEQWQRAIHER SAKPSDAEASEGGEQ

HR7203A-240-299-NHT ZNF516 Q92618 KPELSPGEFPCEVCG FKEPWFLKNHMKAHG

HR6938A-228-328-NHT ZNF518A Q6AHZ1 RHNEIHYKCGKCHHV ILKRYKIGASRKTFW

HR8275A-281-327-Av6HT ZNF519 Q8TB69 GHQKIHTGEKPYKCK IHTGEKPFKCKECGK

HR8035B-928-981-Av6HT ZNF521 Q96K83 GNYKCNVCSRTFFSE RFPSLLTLTEHKVTH

HR8035C-1253-1311-Av6HT ZNF521 Q96K83 GGTFKCPVCFTVFVQ FQTELQ HTMTQHSS

HR8035D-1187-1247-Av6HT ZNF521 Q96K83 SQSDEKKTYQCIKCQ TFDSPAKLQCHLIEH 61

HR8035E-750-805-Av6HT ZNF521 Q96K83 KVYRCTSCNWDFRNE SFGTEVELQCHITTH

HR8035F-632-686-Av6HT ZNF521 Q96K83 GEYICNQCGAKYTSL EFPNQESLLKHVTIH

HR7651A-311-395-NHT ZNF526 Q8TF50 QRSFSSANRLQAHGR AHTANPLHRCRCGKT 85

HR7761A-499-570-NHT ZNF528 Q3MIS6 GKVFSRSSNLVCHQK KAFRGCSGLTAHLAI 72

HR7135A-110-164-NHT ZNF576 Q9H609 PTFPCPDCGKTFGQA QDFAQEAGLHQHYI

HR7332A-85-197-NHT ZNF581 Q9P0T4 KCYSCPVCSRVFEYM MEQNTLQKHTRWKHP

HR8213A-490-546-Av6HT ZNF583 Q96ND8 KPYECNVCGKAFSYS RAHLAHHERIHTMES

HR7126A-700-769-NHT ZNF585B Q52M93 TKKSQLQVHQRIHTG FVQKSVFSVHQSSHA

HR8398-1-385-15 ZNF587 Q96SQ5 MAAAVPRRPTQQGTV QRVHTGERPYKCGEC

HR8398A-8-73-15 ZNF587 Q96SQ5 MRPTQQGTVTFEDVA GSKDEEAPCKQRISV

HR8398B-90-144-15 ZNF587 Q96SQ5 MKAHPCEMCGLILED DDTAYLHQHQKQHIG

HR7253A-8-134-TEV ZNF593 O00488 GAHRAHSLARQMKAK PTEVSTEVPEMDTST

HR8535A-211-338-Av6HT ZNF595 Q8IYB9 RSTSLSKHKRIHTGE SRSLNEHKNIHTGEK

HR6958A-339-424-NHT ZNF597 KPLQCPDCDMTFPCF LHLITHKRTHIKNTT

HR7536A-401-473-NHT ZNF605 Q86T29 AFFKKSELIRHQKIH TQKSSLISHQRTHTG

HR7972A-628-696-Av6HT ZNF607 Q96SK3 CASYLVRHESVHADG FRLRSILEVHQRIHI

HR7693A-602-688-NHT ZNF611 Q8N823 TFSRRSSLHCHRRLH AEKPYKCNECGKAFN

HR7312A-239-311-NHT ZNF614 Q8N883 KLSRSVLFTKHLKTN TMKRYLIAHQRTHSG

HR8536A-215-271-Av6HT ZNF619 Q8N2I2 PYTCKECGKTFRYNS SHLLQHQKLHGGQRP

HR6865A-260-345-NHT ZNF621 Q6ZSS3 EKLYKCKECWKAFGC YGSFVQHQKLHPVEK

HR7076A-233-357-NHT ZNF622 Q969S3 QDAEEEEAEEGPPLG FADFYDFRSSYPDHK

HR8159A-235-289-Av6HT ZNF625 Q96I27 KPYECKQCGKAFRSA GCASSVKIHERTHTG

HR7221A-166-234-Av6HT ZNF627 Q7L945 PYDCKECGETFISLV EKPYECKQCGKAFSC 69

HR7098A-597-651-NHT ZNF630 Q2M218 KTPECAESGMTFFWK CQHVYFTGHQNPYRK 55

HR7646-158-485-AV6HT ZNF639 Q9UID6 NSSESLQDQTDEEPP NERELISHLPVHETT

HR7646-24-485-Av6HT ZNF639 Q9UID6 ISRIADGFNGIFSDH NERELISHLPVHETT

HR7646-Av6HT ZNF639 Q9UID6 NEYPKKRKRKTLHPS ERELISHLPVHETT* 485

HR8058A-136-204-Av6HT ZNF680 Q8NEM1 KEGYNELNQCLRTTQ SFCMLSHLTQHIRIH

HR7717A-105-196-NHT ZNF705G A8MUZ8 TMENSLILEDPFECN TNCFHLRRHKMTHTG

HR6964C-227-294-Av6HT ZNF76 P36508 CPEELCSKAFKTSGD TGERPYTCPEPHCGR 68

HR7002B-74-145-15 ZNF83 MNFVDSLFTQKEKAN NKKSNLASHQRIHTG

HR8533A-9-176-AV6HT ZNF833P Q6ZTB9 PYKCKFCGKAFDNLH FSSFHSHEGVHTGEK

HR8234A-206-287-15 ZNF836 Q6ZNA1 MTQLEKTHIREKPY PYQCGVCGKIFRQNS

HR7704A-427-482-NHT ZNF837 Q96EG3 AFKGRSGLVQHQRAH LHSGEKPYICRDCGK

HR8489A-197-283-Av6HT ZNF841 Q6Z 19 RGKPYQCDVCGRIFR SSSLATHQTVHTGDK

HR7777A-476-525-NHT ZNF846 Q147U1 KPYACKECGKAFRYS CGKNFTQSSALAKHL

HR8493A-449-519-Av6HT ZNF852 B4DLD7 SYNSSLMVHQRTHTG SQRSTFNHHQRTHAG

HR6923A-519-589-NHT ZNF90 Q03938 KRSSVLSKHKIIHTG NLSSDLNTHKRIHIG 71

HR8425A-706-753-Av6HT ZNF99 A8MXY4 AFNNSSTLRKHEIIH IHTGKKPYKCEECGK 48

HR6880A-166-356-NHT ZRSR2 EKDRANCPFYSKTGA ANRDIYLSPDRTGSS 191

HR7806A-1-70-NHT ZSCAN10 Q96SZ4 GPRASLSRLRELCGH DGEEWLLLEGIHRE 69
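The vector identifiers in Table 2 appear to combine a target name, an optional pair of residue numbers, and a vector suffix (for example, HR7097B-179-423-AV6HT). The following sketch splits an identifier into those parts. The interpretation of the two numbers as the first and last residue of the construct is an assumption inferred from the table, and the names `ConstructId`, `parse_construct_id` and `span_length` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ConstructId(NamedTuple):
    target: str            # e.g. "HR7097B"
    first: Optional[int]   # assumed first residue number, if present
    last: Optional[int]    # assumed last residue number, if present
    vector: str            # vector suffix, e.g. "AV6HT", "TEV", "15"


# Target, optional "-first-last" range, then the vector suffix.
_ID_PATTERN = re.compile(
    r"(?P<target>HR\d+[A-Z]?)"
    r"(?:-(?P<first>\d+)-(?P<last>\d+))?"
    r"-(?P<vector>[A-Za-z0-9]+)"
)


def parse_construct_id(text: str) -> ConstructId:
    """Split a vector identifier into target, residue range and vector suffix."""
    match = _ID_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError("unrecognized construct identifier: %r" % text)
    first = int(match["first"]) if match["first"] else None
    last = int(match["last"]) if match["last"] else None
    return ConstructId(match["target"], first, last, match["vector"])


def span_length(construct: ConstructId) -> Optional[int]:
    """Number of residues spanned by the range, or None if no range is given."""
    if construct.first is None or construct.last is None:
        return None
    return construct.last - construct.first + 1
```

For HR7097B-179-423-AV6HT the span is 245 residues, which matches the construct length listed in the table. The listed length does not always equal this span (for example, HR4572B-103-181-14 is listed as 80), so the span should be treated only as an approximate check.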

References:

Acton, T. B., et al., 2011. Preparation of protein samples for NMR structure, function, and small-molecule screening studies. Methods Enzymol. 493, 21-60.

Agaton et al, Molecular & Cellular Proteomics 2:405-414, 2003.

Bindewald, E., et al., CyloFold: secondary structure prediction including pseudoknots. Nucleic Acids Res. 38, W368-72.

Brodskii, L. I., et al., 1995. [GeneBee-NET: An Internet based server for biopolymer structure analysis]. Biokhimiia. 60, 1221-30.

Crowe, J., et al., 1994. 6xHis-Ni-NTA chromatography as a superior technique in recombinant protein expression/purification. Methods Mol Biol. 31, 371-87.

Ding, Y., et al., 2004. Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res. 32, W135-41.

Do, C. B., et al., 2006. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics. 22, e90-8.

Gonzalez de Valdivia, E. I., Isaksson, L. A., 2004. A codon window in mRNA downstream of the initiation codon where NGG codons give strongly reduced gene expression in Escherichia coli. Nucleic Acids Res. 32, 5198-205.

Gruber, A. R., et al., 2008. The Vienna RNA websuite. Nucleic Acids Res. 36, W70-4.

Hamada, M., et al., 2009. Predictions of RNA secondary structure by combining homologous sequence information. Bioinformatics. 25, i330-8.

Jansson, M., et al., 1996. High-level production of uniformly 15N- and 13C-enriched fusion proteins in Escherichia coli. J. Biomol. NMR. 7, 131-141.

Kapust, R. B., et al., 2002. The P1' specificity of tobacco etch virus protease. Biochem Biophys Res Commun. 294, 949-55.

Kudla, G., et al., 2009. Coding-sequence determinants of gene expression in Escherichia coli. Science. 324, 255-8.

Lamla, T., Erdmann, V. A., 2004. The Nano-tag, a streptavidin-binding peptide for the purification and detection of recombinant proteins. Protein Expr Purif. 33, 39-47.

Lui et al., 2002. Loopy proteins appear conserved in evolution. J Mol Biol. 322, 53-64.

Markham, N. R., Zuker, M., 2008. UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol. 453, 3-31.

Mathews, D. H., et al., 2004. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci U S A. 101, 7287-92.

Netzer and Hartl, 1997. Recombination of protein domains facilitated by co-translational folding in eukaryotes. Nature. 358, 343-9.

Nomura, M., et al., 1984. Influence of messenger RNA secondary structure on translation efficiency. Nucleic Acids Symp Ser. 173-6.

Quan, J., et al., 2011. Parallel on-chip gene synthesis and application to optimization of protein expression. Nat Biotechnol. 29, 449-52.

Reeder, J., et al., 2007. pknotsRG: RNA pseudoknot folding including near-optimal structures and sliding windows. Nucleic Acids Res. 35, W320-4.

Rivas, E., Eddy, S. R., 1999. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol. 285, 2053-68.

Rocha, E. P., et al., 1999. Translation in Bacillus subtilis: roles and trends of initiation and termination, insights from a genome analysis. Nucleic Acids Res. 27, 3567-76.

Sharp, P. M., Li, W. H., 1987. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281-95.

Scholle, M. D., et al., 2004. In vivo biotinylated proteins as targets for phage-display selection experiments. Protein Expr Purif. 37, 243-52.

Schroeder, S. J., et al., 2011. Ensemble of secondary structures for encapsidated satellite tobacco mosaic virus RNA consistent with chemical probing and crystallography constraints. Biophys J. 101, 167-75.

Voss, B., et al., 2006. Complete probabilistic analysis of RNA shapes. BMC Biol. 4, 5.

Xayaphoummine, A., et al., 2005. Kinefold web server for RNA/DNA folding path and structure prediction including pseudoknots and knots. Nucleic Acids Res. 33, W605-10.

Xayaphoummine, A., et al., 2003. Prediction and statistics of pseudoknots in RNA structures using exactly clustered stochastic simulations. Proc Natl Acad Sci U S A. 100, 15310-5.

Zuker, M., Stiegler, P., 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9, 133-48.

The foregoing examples and description of the preferred embodiments should be taken as illustrating, rather than as limiting, the present invention as defined by the claims. As will be readily appreciated, numerous variations and combinations of the features set forth above can be utilized without departing from the present invention as set forth in the claims. Such variations are not regarded as a departure from the scope of the invention, and all such variations are intended to be included within the scope of the following claims. All references cited herein are incorporated herein in their entireties.