Title:
DE NOVO DESIGNED LUCIFERASE
Document Type and Number:
WIPO Patent Application WO/2024/097640
Kind Code:
A2
Abstract:
Proteins having luciferase activity, nucleic acids encoding such proteins, and methods for their use are provided, wherein the proteins have a secondary structure arrangement H1-L1-H2-L2-E1-L3-E2-L4-H3-L5-E3-L6-E4-L7-E5-L8-E6, where "H" is a helical domain, "L" is a loop domain, and "E" is a beta strand domain; and where (I) 1, 2, or all 3 of the following are true: (a) H3 is at least 9, 10, 11, 12, 13, or 14 amino acids in length and residue 8 is R; (b) E3 is at least 6, 7, 8, 9, or 10 amino acids in length, and residue 2 is Y; and/or (c) E5 is at least 8, 9, 10, 11, or 12 amino acids in length, and residue 5 is S; or (II) (a) H3 is at least 9, 10, 11, 12, 13, or 14 amino acids in length and residue 3 is T; and/or (b) E6 is at least 8 or 9 amino acids in length, and residue 8 is R.

Inventors:
BAKER DAVID (US)
YEH HSIEN-WEI (US)
Application Number:
PCT/US2023/078161
Publication Date:
May 10, 2024
Filing Date:
October 30, 2023
Assignee:
UNIV WASHINGTON (US)
International Classes:
C12N9/00; C12Q1/66
Attorney, Agent or Firm:
HARPER, David, S. (US)
Claims:
We claim:

1. A protein having luciferase activity, comprising the secondary structure arrangement H1-L1-H2-L2-E1-L3-E2-L4-H3-L5-E3-L6-E4-L7-E5-L8-E6, wherein “H” is a helical domain, “L” is a loop domain, and “E” is a beta strand domain; wherein 1, 2, or all 3 of the following are true:

(a) H3 is at least 9, 10, 11, 12, 13, or 14 amino acids in length and residue 8 is R;

(b) E3 is at least 6, 7, 8, 9, or 10 amino acids in length, and residue 2 is Y; and/or

(c) E5 is at least 8, 9, 10, 11, or 12 amino acids in length, and residue 5 is S.

2. The protein of claim 1, wherein: the H1 domain is at least 14, 15, 16, 17, 18 or 19 amino acids in length; the H2 domain is at least 5, 6, or 7 amino acids in length; the E1 domain is at least 3 or 4 amino acids in length; the E2 domain is at least 3 or 4 amino acids in length; the H3 domain is at least 9, 10, 11, 12, 13, or 14 amino acids in length; the E3 domain is at least 6, 7, 8, 9, or 10 amino acids in length; the E4 domain is at least 7, 8, 9, 10, or 11 amino acids in length; the E5 domain is at least 7, 8, 9, 10, or 11 amino acids in length; and the E6 domain is at least 7, 8, 9, 10, or 11 amino acids in length.

3. The protein of claim 1, wherein: the H1 domain is 19 amino acids in length; the H2 domain is 7 amino acids in length; the E1 domain is 4 amino acids in length; the E2 domain is 5 amino acids in length; the H3 domain is 14 amino acids in length; the E3 domain is 12 amino acids in length; the E4 domain is 12 amino acids in length; the E5 domain is 15 amino acids in length; and the E6 domain is 9 amino acids in length.

4. The protein of any one of claims 1-3, wherein 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or all 24, of the following are true: (a) the H1 domain residue 13 is I or another hydrophobic residue;

(b) the H1 domain residue 14 is I or another hydrophobic residue;

(c) the H1 domain residue 17 is I or another hydrophobic residue;

(d) the H2 domain residue 3 is F;

(e) the H2 domain residue 7 is F;

(f) the L2 domain is at least 4 amino acids in length, and residue 4 of L2 is V or another hydrophobic residue;

(g) the E1 domain residue 2 is F;

(h) the E1 domain residue 4 is N;

(i) the L3 domain residue 1 is H;

(j) the H3 domain residue 4 is L or another hydrophobic residue;

(k) the H3 domain residue 7 is Q;

(l) the H3 domain residue 11 is Y;

(m) the E3 domain residue 3 is H;

(n) the E3 domain residue 4 is V or another hydrophobic residue;

(o) the E4 domain residue 5 is I or another hydrophobic residue;

(p) the E4 domain residue 7 is G;

(q) the E3 domain residue 8 is Y;

(r) the E3 domain residue 9 is V or another hydrophobic residue;

(s) the E5 domain residue 6 is I or another hydrophobic residue;

(t) the E5 domain residue 7 is V or another hydrophobic residue;

(u) the E5 domain residue 9 is L or another hydrophobic residue;

(v) the E6 domain residue 3 is V or another hydrophobic residue;

(w) the E5 domain residue 6 is A or another hydrophobic residue; and/or

(x) the E5 domain residue 8 is V or another hydrophobic residue.

5. The protein of any one of claims 1-4, comprising an amino acid sequence at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to SEQ ID NO: 382 or 383.

6. A protein having luciferase activity, comprising the secondary structure arrangement H1-L1-H2-L2-E1-L3-E2-L4-H3-L5-E3-L6-E4-L7-E5-L8-E6, wherein “H” is a helical domain, “L” is a loop domain, and “E” is a beta strand domain; wherein one or both of the following are true: (a) H3 is at least 9, 10, 11, 12, 13, or 14 amino acids in length and residue 3 is T; and/or

(b) E6 is at least 8 or 9 amino acids in length, and residue 8 is R.

7. The protein of claim 6, wherein: the H1 domain is at least 15, 16, 17, 18, 19, or 20 amino acids in length; the H2 domain is at least 5, 6, or 7 amino acids in length; the E1 domain is at least 3 or 4 amino acids in length; the E2 domain is at least 3, 4, or 5 amino acids in length; the H3 domain is at least 9, 10, 11, 12, 13, or 14 amino acids in length; the E3 domain is at least 8, 9, 10, 11, or 12 amino acids in length; the E4 domain is at least 8, 9, 10, 11, or 12 amino acids in length; the E5 domain is at least 10, 11, 12, 13, 14 or 15 amino acids in length; and the E6 domain is at least 6, 7, 8, or 9 amino acids in length.

8. The protein of claim 6, wherein: the H1 domain is 20 amino acids in length; the H2 domain is 7 amino acids in length; the E1 domain is 4 amino acids in length; the E2 domain is 5 amino acids in length; the H3 domain is 14 amino acids in length; the E3 domain is 12 amino acids in length; the E4 domain is 12 amino acids in length; the E5 domain is 15 amino acids in length; and the E6 domain is 9 amino acids in length.

9. The protein of any one of claims 6-8, wherein 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, or all 27 of the following are true:

(a) the H1 domain residue 11 is V or another hydrophobic residue;

(b) the H1 domain residue 14 is A or another hydrophobic residue;

(c) the H1 domain residue 15 is A or another hydrophobic residue;

(d) the H1 domain residue 18 is A or another hydrophobic residue;

(e) the H1 domain residue 19 is L or another hydrophobic residue;

(f) the H2 domain residue 4 is L or another hydrophobic residue;

(g) the L3 domain residue 1 is H;

(h) the L3 domain is at least 5 amino acids in length, and L3 domain residue 5 is F;

(i) the E1 domain residue 2 is A or another hydrophobic residue;

(j) the E1 domain residue 3 is K;

(k) the E1 domain residue 4 is D;

(l) the E2 domain residue 3 is W;

(o) the H3 domain residue 6 is V or another hydrophobic residue;

(p) the H3 domain residue 7 is I or another hydrophobic residue;

(q) the H3 domain residue 10 is Y;

(r) the H3 domain residue 11 is Y;

(s) the E3 domain residue 2 is V or another hydrophobic residue;

(t) the E3 domain residue 4 is A or another hydrophobic residue;

(u) the E4 domain residue 6 is M;

(v) the E4 domain residue 8 is V or another hydrophobic residue;

(w) the E4 domain residue 10 is F;

(x) the E5 domain residue 9 is V or another hydrophobic residue;

(y) the E6 domain residue 3 is L or another hydrophobic residue;

(z) the E6 domain residue 5 is E; and/or

(aa) the E6 domain residue 6 is F.

10. The protein of any one of claims 6-9, comprising an amino acid sequence at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to SEQ ID NO: 384 or 385, wherein optional residues may be present or absent, and when absent are not considered in determining percent identity of the amino acid sequence relative to SEQ ID NO: 384 or 385.

11. A protein, comprising an amino acid sequence at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to any one of SEQ ID NO: 382-385.

12. The protein of any one of claims 1-10, wherein any substitutions relative to the reference sequence are conservative amino acid substitutions.

13. A self-complementing multipartite protein having luciferase activity, comprising at least a first polypeptide component and a second polypeptide component, wherein the at least first polypeptide component and the second polypeptide component are not covalently linked, wherein in total the at least first polypeptide component and the second polypeptide component comprise the secondary structure arrangement H1-L1-H2-L2-E1-L3-E2-L4-H3-L5-E3-L6-E4-L7-E5-L8-E6, wherein each domain is as defined in any one of claims 1-11; wherein (a) each H and E domain is fully present within one polypeptide component of the at least first polypeptide component and the second polypeptide component, and (b) none of the at least first polypeptide component and the second polypeptide component include all of the H and E domains.

14. The self-complementing multipartite protein of claim 13, wherein the split occurs at L4, L5, L6, L7, or L8.

15. A fusion protein comprising:

(a) the protein or polypeptide component of any preceding claims; and

(b) one or more additional functional domains.

16. A nucleic acid encoding the protein, polypeptide component, or fusion protein of any preceding claim.

17. The nucleic acid of claim 16, comprising a nucleotide sequence at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the nucleotide sequence of SEQ ID NO: 386 or 387, wherein optional residues may be present or absent, and when absent are not considered in determining percent identity of the nucleotide sequence relative to SEQ ID NO: 386 or 387.

18. An expression vector comprising the nucleic acid of claim 17 operatively linked to a suitable control element.

19. A host cell comprising the protein, polypeptide component, fusion protein, nucleic acid, and/or expression vector of any preceding claim.

20. A kit comprising:

(a) the protein, polypeptide component, fusion protein, nucleic acid, expression vector, and/or host cell of any preceding claim; and

(b) instructions for their use.

21. The kit of claim 20, further comprising 2-deoxycoelenterazine.

22. Use of the protein, polypeptide component, fusion protein, nucleic acid, expression vector, host cell, and/or kit of any preceding claim for any suitable purpose, including but not limited to luminescent reporting assays, diagnostic assays, cellular localization of targets of interest, cellular imaging, gene editing, live animal imaging, cancer labeling, CART-cells reporting, secreted assay, gene delivery, tissue engineering, etc.

Description:
De novo designed luciferase

Cross Reference

This application claims priority to U.S. Provisional Application Serial Number 63/381,924 filed November 1, 2022, incorporated by reference herein in its entirety.

Federal Funding Statement

This invention was made with government support under Grant No. K99EB031913, awarded by the National Institutes of Health. The government has certain rights in the invention.

Sequence Listing Statement

A computer readable form of the Sequence Listing is filed with this application by electronic submission and is incorporated into this application by reference in its entirety. The Sequence Listing is contained in the file created on October 17, 2023 having the file name “22-1899-WO.xml” and is 675,658 bytes in size.

Background

Bioluminescent light produced by the enzymatic oxidation of a luciferin substrate is widely used for bioassays and imaging in biomedical research. Because no excitation light source is needed, luminescent photons are produced in the dark which results in higher sensitivity than fluorescence imaging in live animal models and in biological samples where autofluorescence or phototoxicity is a concern. However, the development of luciferases as molecular probes has lagged behind that of well-developed fluorescent protein toolkits for a number of reasons: (i) very few native luciferases have been identified; (ii) many of those that have been identified require multiple disulfide bonds to stabilize the structure and are therefore prone to misfolding in mammalian cells; (iii) most native luciferases do not recognize synthetic luciferins with more desirable photophysical properties; and (iv) multiplexed imaging to follow multiple processes in parallel using mutually orthogonal luciferase-luciferin pairs has been limited by the low substrate specificity of native luciferases.

Summary

In one aspect, the disclosure provides proteins having luciferase activity, comprising the secondary structure arrangement H1-L1-H2-L2-E1-L3-E2-L4-H3-L5-E3-L6-E4-L7-E5-L8-E6, wherein “H” is a helical domain, “L” is a loop domain, and “E” is a beta strand domain; wherein 1, 2, or all 3 of the following are true:

(a) H3 is at least 9, 10, 11, 12, 13, or 14 amino acids in length and residue 8 is R;

(b) E3 is at least 6, 7, 8, 9, or 10 amino acids in length, and residue 2 is Y; and/or

(c) E5 is at least 8, 9, 10, 11, or 12 amino acids in length, and residue 5 is S.

In one embodiment: the H1 domain is at least 14, 15, 16, 17, 18 or 19 amino acids in length; the H2 domain is at least 5, 6, or 7 amino acids in length; the E1 domain is at least 3 or 4 amino acids in length; the E2 domain is at least 3 or 4 amino acids in length; the H3 domain is at least 9, 10, 11, 12, 13, or 14 amino acids in length; the E3 domain is at least 6, 7, 8, 9, or 10 amino acids in length; the E4 domain is at least 7, 8, 9, 10, or 11 amino acids in length; the E5 domain is at least 7, 8, 9, 10, or 11 amino acids in length; and the E6 domain is at least 7, 8, 9, 10, or 11 amino acids in length.
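The length-and-residue conditions (a)-(c) above can be checked mechanically once a sequence has been divided into its domains. The following Python snippet is a minimal sketch under stated assumptions: the domain segments are supplied by the caller (how domain boundaries are assigned is not specified here), the names `CONDITIONS` and `satisfied_conditions` are hypothetical, the smallest listed length is used as the threshold, and the example segments in the usage note are invented placeholders rather than any sequence from the Sequence Listing.

```python
from typing import Dict, List

# Each condition: (domain name, minimum length, 1-based residue position, required amino acid).
# The thresholds use the smallest alternative listed for each domain (an assumption of this sketch).
CONDITIONS = {
    "a": ("H3", 9, 8, "R"),
    "b": ("E3", 6, 2, "Y"),
    "c": ("E5", 8, 5, "S"),
}


def satisfied_conditions(domains: Dict[str, str]) -> List[str]:
    """Return the labels of conditions (a)-(c) met by the given domain segments."""
    met = []
    for label, (name, min_len, position, residue) in CONDITIONS.items():
        segment = domains.get(name, "")
        # Residue numbering is 1-based within the domain, so index with position - 1.
        if len(segment) >= min_len and segment[position - 1] == residue:
            met.append(label)
    return met
```

For example, with the placeholder segments `{"H3": "AAAAAAARAAAAAA", "E3": "VYAAAA", "E5": "AAAAAAAA"}`, conditions (a) and (b) are met and (c) is not, so "1, 2, or all 3" is satisfied.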

In various embodiments, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or all 24, of the following are true:

(a) the H1 domain residue 13 is I or another hydrophobic residue;

(b) the H1 domain residue 14 is I or another hydrophobic residue;

(c) the H1 domain residue 17 is I or another hydrophobic residue;

(d) the H2 domain residue 3 is F;

(e) the H2 domain residue 7 is F;

(f) the L2 domain is at least 4 amino acids in length, and residue 4 of L2 is V or another hydrophobic residue;

(g) the E1 domain residue 2 is F;

(h) the E1 domain residue 4 is N;

(i) the L3 domain residue 1 is H;

(j) the H3 domain residue 4 is L or another hydrophobic residue;

(k) the H3 domain residue 7 is Q;

(l) the H3 domain residue 11 is Y;

(m) the E3 domain residue 3 is H;

(n) the E3 domain residue 4 is V or another hydrophobic residue;

(o) the E4 domain residue 5 is I or another hydrophobic residue;

(p) the E4 domain residue 7 is G;

(q) the E3 domain residue 8 is Y;

(r) the E3 domain residue 9 is V or another hydrophobic residue;

(s) the E5 domain residue 6 is I or another hydrophobic residue;

(t) the E5 domain residue 7 is V or another hydrophobic residue;

(u) the E5 domain residue 9 is L or another hydrophobic residue;

(v) the E6 domain residue 3 is V or another hydrophobic residue;

(w) the E5 domain residue 6 is A or another hydrophobic residue; and/or

(x) the E5 domain residue 8 is V or another hydrophobic residue.

In another embodiment, the proteins comprise an amino acid sequence at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to SEQ ID NO: 382 or 383.

In another aspect, the disclosure provides proteins having luciferase activity, comprising the secondary structure arrangement H1-L1-H2-L2-E1-L3-E2-L4-H3-L5-E3-L6-E4-L7-E5-L8-E6, wherein “H” is a helical domain, “L” is a loop domain, and “E” is a beta strand domain; wherein one or both of the following are true:

(a) H3 is at least 9, 10, 11, 12, 13, or 14 amino acids in length and residue 3 is T; and/or

(b) E6 is at least 8 or 9 amino acids in length, and residue 8 is R.

In one embodiment, the H1 domain is at least 15, 16, 17, 18, 19, or 20 amino acids in length; the H2 domain is at least 5, 6, or 7 amino acids in length; the E1 domain is at least 3 or 4 amino acids in length; the E2 domain is at least 3, 4, or 5 amino acids in length; the H3 domain is at least 9, 10, 11, 12, 13, or 14 amino acids in length; the E3 domain is at least 8, 9, 10, 11, or 12 amino acids in length; the E4 domain is at least 8, 9, 10, 11, or 12 amino acids in length; the E5 domain is at least 10, 11, 12, 13, 14 or 15 amino acids in length; and the E6 domain is at least 6, 7, 8, or 9 amino acids in length.

In various embodiments, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, or all 27 of the following are true:

(a) the H1 domain residue 11 is V or another hydrophobic residue;

(b) the H1 domain residue 14 is A or another hydrophobic residue;

(c) the H1 domain residue 15 is A or another hydrophobic residue;

(d) the H1 domain residue 18 is A or another hydrophobic residue;

(e) the H1 domain residue 19 is L or another hydrophobic residue;

(f) the H2 domain residue 4 is L or another hydrophobic residue;

(g) the L3 domain residue 1 is H;

(h) the L3 domain is at least 5 amino acids in length, and L3 domain residue 5 is F;

(i) the E1 domain residue 2 is A or another hydrophobic residue;

(j) the E1 domain residue 3 is K;

(k) the E1 domain residue 4 is D;

(l) the E2 domain residue 3 is W;

(o) the H3 domain residue 6 is V or another hydrophobic residue;

(p) the H3 domain residue 7 is I or another hydrophobic residue;

(q) the H3 domain residue 10 is Y;

(r) the H3 domain residue 11 is Y;

(s) the E3 domain residue 2 is V or another hydrophobic residue;

(t) the E3 domain residue 4 is A or another hydrophobic residue;

(u) the E4 domain residue 6 is M;

(v) the E4 domain residue 8 is V or another hydrophobic residue;

(w) the E4 domain residue 10 is F;

(x) the E5 domain residue 9 is V or another hydrophobic residue;

(y) the E6 domain residue 3 is L or another hydrophobic residue;

(z) the E6 domain residue 5 is E; and/or

(aa) the E6 domain residue 6 is F.

In another embodiment, the proteins comprise an amino acid sequence at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to SEQ ID NO: 384 or 385.

In one aspect, the disclosure provides proteins comprising an amino acid sequence at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to any one of SEQ ID NO: 382-385.
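As an illustration of how a percent identity figure of this kind might be computed, the Python sketch below counts matching positions between a query and a reference that have already been aligned (gaps written as "-"). It is only one possible convention and an assumption of this sketch: the alignment method is not specified here, query insertions are simply ignored, and the `optional_positions` argument (a hypothetical name) implements the rule stated in the claims that optional residues, when absent, are not considered.

```python
from typing import FrozenSet


def percent_identity(query_aln: str, ref_aln: str,
                     optional_positions: FrozenSet[int] = frozenset()) -> float:
    """Percent identity of a pre-aligned query against a pre-aligned reference.

    optional_positions holds 1-based reference positions that are optional;
    when the query has a gap at such a position, it is left out of the count.
    """
    if len(query_aln) != len(ref_aln):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    counted = 0
    ref_position = 0
    for q, r in zip(query_aln, ref_aln):
        if r == "-":
            continue  # insertion in the query; ignored in this sketch
        ref_position += 1
        if q == "-" and ref_position in optional_positions:
            continue  # optional residue is absent, so it is not considered
        counted += 1
        if q == r:
            matches += 1
    return 100.0 * matches / counted if counted else 0.0
```

With the placeholder alignment `percent_identity("MKLAY", "MKLVY")`, four of five reference positions match, giving 80%.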

In another embodiment, the disclosure provides multipartite protein having luciferase activity, comprising at least a first polypeptide component and a second polypeptide component, wherein the at least first polypeptide component and the second polypeptide component are not covalently linked, wherein in total the at least first polypeptide component and the second polypeptide component comprise the secondary structure arrangement H1-L1-H2-L2-E1-L3-E2-L4-H3-L5-E3-L6-E4-L7-E5-L8-E6, wherein each domain is as defined herein; wherein (a) each H and E domain is fully present within one polypeptide component of the at least first polypeptide component and the second polypeptide component, and (b) none of the at least first polypeptide component and the second polypeptide component include all of the H and E domains.
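The two conditions on a multipartite protein can be illustrated for the two-component case with a short Python sketch. It assumes the split is made within a single loop (the claims name L4, L5, L6, L7, or L8 as split sites) and tracks only which H and E domains end up in each component; the function names are hypothetical and loop residues are not modelled.

```python
from typing import List, Tuple

ARRANGEMENT = "H1-L1-H2-L2-E1-L3-E2-L4-H3-L5-E3-L6-E4-L7-E5-L8-E6".split("-")


def split_at_loop(loop: str) -> Tuple[List[str], List[str]]:
    """Split the arrangement at a loop and return the H/E domains on each side."""
    if not loop.startswith("L") or loop not in ARRANGEMENT:
        raise ValueError(f"{loop!r} is not a loop domain in the arrangement")
    cut = ARRANGEMENT.index(loop)
    first = [d for d in ARRANGEMENT[:cut] if d[0] in "HE"]
    second = [d for d in ARRANGEMENT[cut + 1:] if d[0] in "HE"]
    return first, second


def is_valid_split(first: List[str], second: List[str]) -> bool:
    """Check (a) every H/E domain sits in exactly one component and
    (b) neither component contains all of the H/E domains."""
    structured = [d for d in ARRANGEMENT if d[0] in "HE"]
    covers_all_once = sorted(first + second) == sorted(structured)
    neither_complete = bool(first) and bool(second)
    return covers_all_once and neither_complete
```

Splitting at L4, for instance, places H1, H2, E1, and E2 in the first component and H3, E3, E4, E5, and E6 in the second.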

The disclosure also provides fusion protein comprising:

(a) the protein or polypeptide component of any embodiment of the disclosure; and

(b) one or more additional functional domains.

In other embodiments, the disclosure provides nucleic acid encoding the protein, polypeptide component, or fusion protein of any embodiment of the disclosure, expression vector comprising the nucleic acid operatively linked to a suitable control element, recombinant host cell comprising the protein, polypeptide component, fusion protein, nucleic acid, and/or expression vectors, kits comprising the protein, polypeptide component, fusion protein, nucleic acid, expression vector, and/or host cell of any embodiment of the disclosure, and instructions for their use; and methods for using the proteins, polypeptide components, fusion proteins, nucleic acids, expression vectors, host cells, and/or kits of the disclosure.

Description of the Figures

Figure 1. Generation of idealized scaffolds and computational design of de novo luciferases. a, Family-wide hallucination. Sequences encoding proteins with the desired topology are optimized by Monte Carlo sampling with a multicomponent loss function. Structurally conserved regions are evaluated based on consistency with input residue-residue distance and orientation distributions obtained from 85 experimental structures of NTF2-like proteins, while variable non-ideal regions are evaluated based on the confidence of predicted inter-residue geometries calculated as the KL-divergence between network predictions and the background distribution. The sequence-space MCMC sampling incorporates both sequence changes and insertions/deletions (see Methods) to guide the hallucinated sequence towards encoding structures with the desired folds. Hydrogen-bonding networks are incorporated into the designed structures to increase structural specificity (SEQ ID NO: 397). b-d) The design of luciferase active sites. b, Generation of DTZ conformers using AIMNet. c, Generation of a rotamer interaction field (RIF) to stabilize the anionic DTZ and form hydrophobic packing interactions around the DTZ conformers. d, Docking of the RIF into the hallucinated scaffolds, and optimization of substrate-scaffold interactions using position-specific score matrices (PSSM)-biased sequence design. e, Selection of the NTF2 topology. The RIF was docked into 4000 native small molecule binding proteins, excluding proteins that bind the luciferin substrate using more than 5 loop residues. Most of the top hits were from the NTF2-like protein superfamily. Using the family-wide hallucination scaffold generation protocol, we generated 1615 scaffolds and found that these yielded better predicted RIF binding energies than the native proteins. f, Scaffolds generated with family-wide hallucination sample more within the space of the native structures than previous blueprint-generated scaffolds and g, have stronger sequence-to-structure relationships than native or blueprint de novo NTF2 scaffolds.
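Two ingredients named in panel a, the KL-divergence between a predicted distribution and a background distribution and Monte Carlo acceptance of sequence changes, can be sketched in a few lines of Python. This is an illustrative sketch only: the actual loss terms, their weights, and the acceptance rule are not given in this legend, so the standard Metropolis criterion and the function names below are assumptions.

```python
import math
import random
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def metropolis_accept(old_loss: float, new_loss: float, temperature: float,
                      rng: random.Random = random.Random(0)) -> bool:
    """Standard Metropolis rule: always accept improvements, and accept
    worse moves with probability exp(-(new_loss - old_loss) / temperature)."""
    if new_loss <= old_loss:
        return True
    return rng.random() < math.exp(-(new_loss - old_loss) / temperature)
```

In a scheme of this kind, a sharply peaked prediction diverges strongly from a flat background, so a loss term could be taken as the negative KL-divergence (again an assumption); a proposed mutation, insertion, or deletion would then be kept or rejected with `metropolis_accept`.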

Figure 2. Biophysical characterization of LuxSit. a, Coomassie-stained SDS-PAGE of purified recombinant LuxSit from E. coli. b, Size-exclusion chromatography of purified LuxSit suggested monodispersed and monomeric properties. c, Far-ultraviolet CD spectra at 25 °C, 95 °C, and cooled back to 25 °C. Insert: CD melting curve of LuxSit at 220 nm. d, Luminescence emission spectra of DTZ in the presence and absence of LuxSit. e, Structural alignment of the design model and AlphaFold2 predicted model, which are in close agreement at both backbone (left) and sidechain level (right). f-i) Site saturation mutagenesis of substrate interacting residues. Zoomed-in views (left) of design and AlphaFold2 models at sidechain level illustrated the designed enzyme-substrate interactions of f, Tyr14-His98 core HBNets, g, Asp18-Arg65 dyad, h, π-stacking, and i, hydrophobic packing residues. Sequence profiles (right) are scaled by the activities of different sequence variants: (activity for the indicated amino acid) / (the sum of activities over all tested amino acids at the position). Substitutions with increased activity (Ala96 and Met110) are highlighted.
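The scaling used for the sequence profiles in panels f-i is a simple per-position normalization, sketched below in Python. The function name and the activity values in the usage note are hypothetical; only the formula (activity for one amino acid divided by the sum over all tested amino acids at that position) comes from the legend.

```python
from typing import Dict


def scale_profile(activities: Dict[str, float]) -> Dict[str, float]:
    """Scale each amino acid's activity by the summed activity at that position."""
    total = sum(activities.values())
    if total <= 0:
        raise ValueError("summed activity must be positive")
    return {aa: value / total for aa, value in activities.items()}
```

For example, made-up activities of 1.0 for Ala and 3.0 for Leu at one position scale to 0.25 and 0.75, so the scaled values at each position sum to 1.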

Figure 3. Characterization of luciferase activity in vitro and in human cells. a, Substrate concentration dependence of LuxSit, LuxSit-f, and LuxSit-i activity. Numbers indicate the signal-to-background ratio at V_max. b, Luminescence images acquired by a BioRad Imager (top) or an Apple iPhone 8 (bottom) of DTZ only (left tube), DTZ plus 100 nM purified LuxSit (middle tube), and DTZ plus 100 nM purified LuxSit-i (right tube), showing high efficiency of photon production. c, Fluorescence and luminescence imaging of live HEK293T cells transiently expressing LuxSit-i-mTagBFP2; LuxSit-i activity can be detected at single-cell resolution. Left: fluorescence channel representing mTagBFP2 signal. Right: total luminescence photons were collected during a course of 10 s exposure. Inserts: negative control, untransfected cells with DTZ. The luminescence images were acquired immediately after adding 25 μM DTZ without excitation light. Scale bar: 20 μm. 40X

Figure 4. High substrate specificity of designed luciferases allows multiplexed bioassay. a, Chemical structures of coelenterazine substrate analogs. b, Activity of LuxSit-i on selected luciferin substrates. Luminescence image (top) and signal quantification (bottom) of the indicated substrate in the presence of 100 nM LuxSit-i. LuxSit-i has high specificity for the design target substrate, DTZ. c, Heatmap visualization of the substrate specificity of LuxSit-i, Renilla luciferase (RLuc), Gaussia luciferase (GLuc), engineered NLuc from Oplophorus luciferase, and the de novo luciferase (HTZ3-G4) designed for h-CTZ. The heatmap shows luminescence for each enzyme on each substrate; values are normalized on a per-enzyme basis to the highest signal for that enzyme over all substrates. d, Luminescence emission spectra of LuxSit-i/DTZ and RLuc/PP-CTZ can be spectrally resolved by 528/20 and 390/35 filters (shown in dashed bars), and each enzyme recognizes only its cognate substrate. e, Schematic of the multiplex luciferase assay. HEK293T cells transiently transfected with CRE-RLuc, NFkB-LuxSit-i, and CMV-CyOFP plasmids were treated with either Forskolin or human tumor necrosis factor alpha (TNFα) to induce the expression of labeled luciferases. f-g) Luminescence signals from cells can be measured under either substrate-resolved or spectrally resolved methods by a plate reader. f, For the substrate-resolved method, luminescence intensity was recorded without a filter after adding either PP-CTZ or DTZ. g, For the spectrally resolved method, both PP-CTZ and DTZ were added, and the signals were acquired using 528/20 and 390/35 filters simultaneously. In f and g, the lower panel indicates the addition of Forskolin or TNFα. Luminescence signals were acquired from the lysate of 15,000 cells in CelLytic M reagent while CyOFP fluorescence signal was used to normalize cell numbers and transfection efficiencies. All data were normalized to the corresponding non-stimulated control. Data are presented as mean ± SD (n = 3).
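The per-enzyme normalization described for the heatmap in panel c can be sketched as follows. The data layout (a dictionary of enzyme name to substrate signals), the function name, and the numbers in the usage note are assumptions made for illustration; no measured values are reproduced here.

```python
from typing import Dict


def normalize_per_enzyme(signals: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Divide each enzyme's signals by that enzyme's highest signal over all substrates."""
    normalized = {}
    for enzyme, by_substrate in signals.items():
        peak = max(by_substrate.values())
        if peak <= 0:
            raise ValueError(f"no positive signal recorded for {enzyme}")
        normalized[enzyme] = {s: v / peak for s, v in by_substrate.items()}
    return normalized
```

With placeholder signals of 200 and 50 for one enzyme on two substrates, the normalized row becomes 1.0 and 0.25, so each enzyme's best substrate always maps to 1.0 regardless of absolute brightness.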

Figure 5. Proposed catalytic mechanism of coelenterazine-utilizing luciferases. Density-functional theory (DFT) calculation suggested that the formation of an anionic state is the essential electron source for the activation of triplet oxygen (³O₂). Supported by both theoretical 26,27 and experimental evidence 28,29, the next oxygenation process is likely through a single electron transfer (SET) mechanism in which the surrounding reaction field could highly influence the change of Gibbs free energy (ΔG_SET). Finally, the thermolysis of a dioxetane light emitter intermediate can produce photons via the mechanism of gradually reversible charge-transfer-induced luminescence (GRCTIL), which is generally exergonic. Since all the historical pieces of evidence are based on calculations in virtual solvents or chemiluminescence in ideal organic solvents, the detailed mechanism of a luciferase-catalyzed luminescence reaction has remained unclear. We proposed that the key step of the enzyme is to promote the formation of an anionic state and create a suitable environment to facilitate efficient SET. Hence, the goal of this study is to design an enzyme reaction field surrounding the substrate to stabilize the anionic substrate state and alter the local proton activity, solvent polarity, and hydrophobicity for the efficient activation of ³O₂.

Figure 6. Schematic representation of colony-based luciferase screening. Computationally designed DNA sequences were purchased in an oligo array, where the fragments were amplified by PCR, assembled, and ligated into a pBAD bacterial expression vector. The plasmid library was used to transform DH10B cells. Each colony grown on the LB agar plate represented one luciferase design. The plates were sprayed with DTZ solution and imaged to identify active colonies using a ChemiDoc imager. All active colonies were inoculated in 96-well plates, expressed, and purified to confirm individual luciferase activity. Selected plasmids can then be sequenced to identify active design models that provide insights into the design principle and enzyme functions or can be subjected to random mutagenesis for further evolution. Insert: three luciferases were identified from this screening. We refer to the most active and DTZ-specific luciferase as “LuxSit”.

Figure 7. Expression, purification, and structural characterization of LuxSit variants. a-c) The recombinant expression of a, LuxSit, b, LuxSit-i, and c, LuxSit-f in E. coli. Annotations for each lane are the following - 1: Pre-IPTG; 2: Post-IPTG; 3: Soluble lysate; 4: Flow-through; 5: Wash; 6: Elution; 7: Post-TEV cleavage; 8: Post-SEC. d-f) Size-exclusion chromatography of the purified d, LuxSit; e, LuxSit-i; and f, LuxSit-f monomer. g-i) Deconvoluted mass spectrum of g, LuxSit, h, LuxSit-i, and i, LuxSit-f. j-k) Far-ultraviolet circular dichroism (CD) spectra (Left panel) of j, LuxSit-i; and k, LuxSit-f at 25 °C, 95 °C and cooled back to 25 °C. CD melting curve at 220 nm (Right panel). l, Dimeric SEC peak was observed when LuxSit-i was concentrated to high concentration (~50 μM) in Tris pH 8.0 buffer. Both dimeric and monomeric SEC fractions showed the expected size on SDS PAGE and both peaks were catalytically active to emit luminescence in the presence of 25 μM DTZ.

Figure 8. Expression, purification, and activity measurement of selected de novo designed luciferases for h-CTZ. a, Coomassie-stained SDS PAGE of HTZ3-D2 and HTZ3-G4 purified from recombinant expression in E. coli. b, Zoomed-in views of HTZ3-D2 (left panel) and HTZ3-G4 (right panel) illustrated the sidechain preorganization of luciferase-h-CTZ interactions. c-d) Size-exclusion chromatography (left), deconvoluted mass spectrum (middle), and the normalized luciferase activities on selected compounds (right) of c, HTZ3-D2 and d, HTZ3-G4, which suggested high specificity for the design target substrate, h-CTZ. e, Substrate concentration dependence of LuxSit (w/ DTZ), HTZ3-D2 (w/ h-CTZ), and HTZ3-G4 (w/ h-CTZ) activity in PBS. All data points were fitted to the Michaelis-Menten equation. HTZ3-D2 and HTZ3-G4 showed K_M values of 7.9 and 19.5 μM with ~25% and ~58% I_max of LuxSit, respectively. Error bars represent ± SD (n = 3).
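For reference, the Michaelis-Menten relation mentioned in panel e has the form I = I_max·[S] / (K_M + [S]). The Python sketch below evaluates it; the function name is hypothetical, the fitting procedure itself is not shown, and only the K_M values quoted in the legend (7.9 and 19.5 μM) are taken from the source.

```python
def michaelis_menten(substrate_um: float, i_max: float, k_m_um: float) -> float:
    """Signal predicted by the Michaelis-Menten equation.

    substrate_um: substrate concentration in micromolar
    i_max: maximal signal (any unit; the result is in the same unit)
    k_m_um: Michaelis constant in micromolar
    """
    if substrate_um < 0 or k_m_um <= 0:
        raise ValueError("concentration must be non-negative and K_M positive")
    return i_max * substrate_um / (k_m_um + substrate_um)
```

By construction the signal is half of I_max when the substrate concentration equals K_M, so with I_max set to 1.0 the function returns 0.5 at 7.9 μM for a K_M of 7.9 μM and at 19.5 μM for a K_M of 19.5 μM.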

Figure 9. Screening of a randomized NNK library at 60, 96, and 110 positions and sequence alignment between LuxSit and its variants. We generated a fully randomized library at 60, 96, and 110 positions to exhaustively screen all possible combinations. After the colony-based screening, we identified many colonies with strong luciferase activities with DTZ. Each colony was expressed individually in each well of 96-well plates (1 mL culture) and purified accordingly (see Methods). a, Individual luminescence activity of each selected mutant was plotted and compared to the parent LuxSit. Luminescence activities were measured in the presence of 25 μM DTZ. Luminescence activity (RLU) was shown as the integrated signal over the first 15 min. Statistical analysis of the amino acid frequency versus the luciferase activity at residue b, 60, c, 96, and d, 110. Among all selected mutants, Arg60 is confirmed to be mutable as Arg60 may be structurally less well defined as it emanates from a loop and has no hydrogen-bonding partner. Ala96 prefers larger sidechain (Leu, Ile, Met, and Cys), and Met110 favors hydrophobic residues (Val, Ile, and Ala). A newly discovered variant (R60S/A96L/M110V) with more than 100-fold higher photon flux over LuxSit was assigned LuxSit-i for its high brightness. In the sequence alignment, mutations are highlighted. The conserved catalytic dyads of Asp18-Arg65 and Tyr14-His98 are shown.
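To give a sense of the size of such a library, the Python sketch below enumerates the NNK codon set (N = A, C, G, or T; K = G or T) and translates it with the standard genetic code. The variable and function names are hypothetical, and the calculation concerns theoretical diversity only, not the number of colonies actually screened.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def nnk_codons():
    """All codons matching NNK: any base, any base, then G or T."""
    return ["".join(c) for c in product("ACGT", "ACGT", "GT")]


codons = nnk_codons()
encoded = {CODON_TABLE[c] for c in codons if CODON_TABLE[c] != "*"}
stops = [c for c in codons if CODON_TABLE[c] == "*"]

codon_combinations = len(codons) ** 3        # three randomized positions
amino_acid_combinations = len(encoded) ** 3  # distinct protein variants
```

An NNK position has 32 possible codons that cover all 20 amino acids with a single stop codon (TAG), so three randomized positions give 32,768 codon combinations encoding 8,000 distinct amino acid combinations.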

Figure 10. Additional characterization of LuxSit variants. a, Normalized emission kinetics of 15,000 intact HeLa cells expressing LuxSit-i (red), 100 nM purified LuxSit-i, or 100 nM purified LuxSit-f in the presence of 50 μM DTZ. The more extended emission kinetics in HeLa cells is likely due to the diffusion rate of DTZ across cell membranes. b, Normalized luminescence decay curves of LuxSit-i in various pH buffers revealed a pH-dependent catalytic mechanism. c, Luminescent quantum yield was estimated from the integrated luminescence signal until completely converting 125 pmol substrates to photons in the presence of 50 nM corresponding luciferase (see Methods). All data points were plotted as the average of triplicate measurements.

Figure 11. Expression, localization, and luminescence activity of LuxSit-i in live HEK293T and HeLa cells. a-b) Fluorescence imaging of live a, HEK293T and b, HeLa cells expressing LuxSit-i-mTagBFP2, which is untargeted or localized to the nucleus (Histone2B), plasma membrane (KRasCAAX), or mitochondria (DAKAP) cellular compartments. Scale bar: 10 μm. c-d) Luminescence signals were measured with 15,000 intact c, HEK293T or d, HeLa cells in the presence of 25 μM DTZ in DPBS. Transfection efficiencies range from 60-70% for HEK293T cells and 5-10% for HeLa cells. e, Luminescence emission spectra acquired from LuxSit-i expressing HEK293T cells are consistent with the emission spectra of recombinant LuxSit-i purified from E. coli. f-g) Luminescence signals were measured with 15,000 f, intact LuxSit-i expressing HEK293T cells or g, cell lysate in the presence of 25 μM indicated substrate in DPBS. Luminescence intensities were normalized to DTZ signal, showing high DTZ specificity over other substrates in cell-based assays. Data were shown as total luminescence signal over the first 20 min and were done in technical triplicates. h, Normalized luminescence intensity profile of lines traversing across different cells (n=10) of main Fig. 3c luminescence image; lines represent untransfected cells. Error bars represent ± SEM.

Figure 12. Substrate specificity of LuxSit-i and spectrally resolved luciferase-luciferin pairs allow multiplexed bioassay. a, The orthogonality relationship between LuxSit-i-DTZ and RLuc-PP-CTZ (Prolume Purple, methoxy e-Coelenterazine) luminescent pairs. Indicated amounts of each luciferase were mixed at different ratios totaling 100%. b, After the addition of both 25 μM DTZ and PP-CTZ substrates, filtered light from 528/20 and 390/35 were measured simultaneously. Data are presented as mean ± SD (n = 3). Heatmap shows the luminescence signal for individual luciferase (100 nM) or 1:1 mixture in the presence of the cognate or non-cognate (DTZ or PP-CTZ or both) substrates. Response signals were acquired by a Neo2 plate reader with 528/20 and 390/35 filters simultaneously. c, Multiplex luciferase assay in live HEK293T after co-transfection of CRE-RLuc, NFκB-LuxSit-i, and CMV-CyOFP plasmids and stimulation by Forskolin (FSK) or human tumor necrosis factor alpha (TNFα). d,e) 15,000 intact cells were assayed (see Methods) by either d, substrate-resolved or e, spectrally resolved modes after adding DTZ, PP-CTZ, or both DTZ and PP-CTZ in DPBS without cell lysis. Area-scanning of CyOFP fluorescence signal was used to estimate cell numbers and transfection efficiency. The reported unit was RLU/a.u.; relative light units/fluorescence intensity measurements at Ex./Em.=480/580 nm. All data were normalized to the corresponding non-stimulated control. Data are presented as mean ± SD (n = 3).

Figure 13. Secondary structure is shown mapped onto exemplary proteins of the disclosure (top: SEQ ID NO: 382 and bottom: SEQ ID NO: 384).

Detailed Description

All references cited are herein incorporated by reference in their entirety.

As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise.

As used herein, the amino acid residues are abbreviated as follows: alanine (Ala; A), asparagine (Asn; N), aspartic acid (Asp; D), arginine (Arg; R), cysteine (Cys; C), glutamic acid (Glu; E), glutamine (Gln; Q), glycine (Gly; G), histidine (His; H), isoleucine (Ile; I), leucine (Leu; L), lysine (Lys; K), methionine (Met; M), phenylalanine (Phe; F), proline (Pro; P), serine (Ser; S), threonine (Thr; T), tryptophan (Trp; W), tyrosine (Tyr; Y), and valine (Val; V).

In all embodiments of polypeptides disclosed herein, any N-terminal methionine residues are optional (i.e.: the N-terminal methionine residue may be present or may be deleted, and if deleted the residue is not considered when determining percent identity).

All embodiments of any aspect of the disclosure can be used in combination, unless the context clearly dictates otherwise.

Unless the context clearly requires otherwise, throughout the description and the claims, the words ‘comprise’, ‘comprising’, and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in the sense of “including, but not limited to”. Words using the singular or plural number also include the plural and singular number, respectively. Additionally, the words “herein,” “above,” and “below” and words of similar import, when used in this application, shall refer to this application as a whole and not to any particular portions of the application.

In a first aspect, the disclosure provides proteins having luciferase activity, comprising the secondary structure arrangement H1-L1-H2-L2-E1-L3-E2-L4-H3-L5-E3-L6-E4-L7-E5-L8-E6, wherein “H” is a helical domain, “L” is a loop domain, and “E” is a beta strand domain; wherein:

(a) the H1 domain is at least 18 or 19 amino acids in length; residue 14 of the H1 domain is Y, D, or E, and residue 18 of the H1 domain is D or E;

(b) the E3 domain is at least 6, 7, 8, 9, or 10 amino acids in length and residue 2 of the E3 domain is R; and

(c) the E5 domain is at least 10, 11, 12, 13, or 14 amino acids in length and residue 9 of the E5 domain is H or N.

As disclosed in the examples that follow, the proteins of the disclosure have luciferase activity and share this recited secondary structure arrangement. The inventors have conducted extensive studies to assess key residues in the polypeptides for retaining luciferase activity and made a large number of modified versions of the polypeptides as detailed in SEQ ID NO:4 and the examples that follow. The required amino acids noted above are those involved in the catalytic dyads, as described below and in the examples.

Except as noted, the different domains may be any suitable length.

In one embodiment, residue 7 of the E5 domain is M. In another embodiment, the E6 domain is at least 9, 10, 11, 12, or 13 amino acids in length and residue 5 of the E6 domain is V. In a further embodiment, residue 1 of the L5 domain is S. In one embodiment, residue 7 of the E5 domain is M and residue 5 of the E6 domain is V. In a further embodiment, residue 7 of the E5 domain is M, residue 5 of the E6 domain is V, and residue 1 of the L5 domain is S. In another embodiment, the H2 domain is at least 5, 6, or 7 amino acids in length, the H3 domain is at least 9, 10, 11, 12, 13, or 14 amino acids in length, the E1 domain is at least 3 or 4 amino acids in length, the E2 domain is at least 3 or 4 amino acids in length, and the E4 domain is at least 8, 9, 10, 11, or 12 amino acids in length. In all these embodiments, one or more of the recited domains may independently include additional amino acid residues. In one embodiment, one or more of the domains may independently include an additional 1, 2, 3, 4, or 5 residues.

In one embodiment: the H1 domain is 19 amino acids in length; the H2 domain is 7 amino acids in length; the E1 domain is 4 amino acids in length; the E2 domain is 4 amino acids in length; the H3 domain is 14 amino acids in length; the E3 domain is 10 amino acids in length; the E4 domain is 12 amino acids in length; the E5 domain is 14 amino acids in length; and the E6 domain is 12 or 13 amino acids in length.
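As a rough illustration of how the recited arrangement and the minimum domain lengths could be checked against a per-residue secondary-structure assignment, the following Python sketch run-length encodes a string of H/L/E labels, names the resulting segments H1, L1, H2 and so on, and compares their lengths with a table of minimums. The label-string input, the function names, and the toy example (including the arbitrary loop length of 3) are assumptions made for illustration only and are not taken from the disclosure.

```python
from itertools import groupby

# Domain order of the recited arrangement.
ORDER = ["H1", "L1", "H2", "L2", "E1", "L3", "E2", "L4", "H3",
         "L5", "E3", "L6", "E4", "L7", "E5", "L8", "E6"]
ARRANGEMENT = "".join(name[0] for name in ORDER)

def segments(ss):
    """Run-length encode a per-residue string of 'H', 'L', 'E' labels."""
    return [(label, len(list(run))) for label, run in groupby(ss)]

def name_domains(ss):
    """Return {domain name: length} if ss follows ARRANGEMENT, else None."""
    segs = segments(ss)
    if "".join(label for label, _ in segs) != ARRANGEMENT:
        return None
    return {name: length for name, (_, length) in zip(ORDER, segs)}

def meets_minimums(named, minimums):
    """True if every listed domain is at least as long as its minimum."""
    return all(named.get(dom, 0) >= n for dom, n in minimums.items())

# Hypothetical assignment built from the domain lengths of the embodiment
# above; loops are given a placeholder length of 3.
lengths = {"H1": 19, "H2": 7, "E1": 4, "E2": 4, "H3": 14,
           "E3": 10, "E4": 12, "E5": 14, "E6": 12}
toy = "".join(name[0] * lengths.get(name, 3) for name in ORDER)

named = name_domains(toy)
print(named["H1"], named["E5"])
print(meets_minimums(named, {"H1": 18, "E3": 6, "E5": 10}))
```

A real assignment would come from a structure-based tool, and how its output labels are mapped onto H, L, and E is a separate choice that this sketch does not address.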

The loop domains may be of any length and may include insertions, relative to the sequences exemplified herein, of any residues or functional domains as deemed appropriate, including but not limited to metal binding domains, drug binding domains, GPCR receptors, protein switches, and small molecule binding domains. In another embodiment, the proteins comprise an amino acid sequence at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from the group consisting of SEQ ID NO: 1-3, wherein residues in parentheses are optional and may be present or may be deleted.

SEQ ID NO:1 is the Lux-Sit construct disclosed herein.

SEQ ID NO:2 is the Lux-Sit-i construct disclosed herein.

SEQ ID NO:3 is the Lux-Sit-f construct disclosed herein.

In each of the annotated sequences shown for SEQ ID NO: 1-3:

(a) Bold and underlined and increased font size positions are Dyad 1 (catalytic residues) Y14 (H1 domain residue 14) + H98 (E5 domain residue 9);

(b) Bold and increased font size positions are Dyad 2 (catalytic residues) D18 (H1 domain residue 18) + R65 (E3 domain residue 2);

(c) Increased font and not bolded positions are core packing (recognition residues) F13 (residue 13 of domain H1), I35 (residue 2 of domain E1), W38 (residue 1 of domain L3), F49 (residue 4 of domain H3), V81 (residue 6 of domain E4), L83 (residue 8 of domain E4), V94 (residue 5 of domain E5), A/L 97 (residue 8 of domain E5), W100 (residue 11 of domain E5), M/V110 (residue 5 of domain E6), V112 (residue 7 of domain E6); and

(d) Underlined and not bolded positions are regions (loop domains or immediately adjacent) for splitting the enzyme or inserting other functional domains.

In some embodiments of the proteins, 1, 2, 3, 4, or all 5 of the following is true:

(a) residue 13 of domain H1 is F;

(b) residue 1 of domain L3 is W;

(c) residue 5 of domain E5 is V or another hydrophobic residue;

(d) residue 8 of domain E5 is A or L or another hydrophobic residue; and/or

(e) residue 11 of domain E5 is W.

In other embodiments of the proteins, the H2 domain is 7 amino acids in length, the H3 domain is 14 amino acids in length, the E1 domain is 4 amino acids in length, the E2 domain is 4 amino acids in length, and/or the E4 domain is 12 amino acids in length. In further embodiments, 1, 2, 3, 4, 5, or all 6 of the following are true:

(a) residue 2 of domain E1 is I or another hydrophobic residue;

(b) residue 4 of domain H3 is F;

(c) residue 6 of domain E4 is V or another hydrophobic residue;

(d) residue 8 of domain E4 is L or another hydrophobic residue;

(e) residue 5 of domain E6 is M or V or another hydrophobic residue; and/or

(f) residue 7 of domain E6 is V or another hydrophobic residue.

In another embodiment, the protein comprises the amino acid sequence of SEQ ID NO:4.

SEQ ID NO:4

In another embodiment, the proteins comprise an amino acid sequence at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from the group consisting of SEQ ID NO:1-3 and 5-181, as shown in Table 1. SEQ ID NOS:5-181 in Table 1 are re-designed amino acid sequences based on LuxSit-i (SEQ ID NO:2), with their luciferase activities shown.

Table 1

In another aspect, the disclosure provides proteins having luciferase activity, comprising an amino acid sequence at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to SEQ ID NO: 1, wherein:

Residue 14 is Y, D, or E and residue 98 is H or N;

Residue 18 is D or E and residue 65 is R.

In one embodiment, the percent identity relative to the reference sequence is determined by sequence alignment with the Needleman-Wunsch algorithm, a common sequence alignment tool for those of skill in the art, which allows for insertions and deletions.
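To make the percent-identity determination concrete, here is a minimal Python sketch of a Needleman-Wunsch global alignment followed by an identity calculation. The scoring values (match +1, mismatch -1, linear gap -1) and the choice of alignment length as the denominator are assumptions chosen for illustration; the disclosure does not specify scoring parameters or a denominator, and practical tools typically use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))

def percent_identity(a, b):
    """Identical aligned positions divided by alignment length, in percent."""
    x, y = needleman_wunsch(a, b)
    matches = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * matches / len(x)

print(percent_identity("TIHL", "TIHL"))
print(percent_identity("TIHL", "TIL"))   # one deletion relative to the reference
```

Because the alignment permits gaps, a sequence with an insertion or deletion relative to the reference can still be scored position by position, which is the property the paragraph above relies on.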

In one embodiment, the protein comprises one or both of A96M and M110V substitutions relative to SEQ ID NO: 1. In another embodiment, the protein comprises an R60S substitution relative to SEQ ID NO: 1. In a further embodiment, the protein comprises R60S, A96M, and M110V substitutions relative to SEQ ID NO: 1. In another embodiment, any substitutions relative to SEQ ID NO: 1 at residues F13, I35, W38, F49, V81, L83, V94, A97, W100, M110, V112 are conservative amino acid substitutions. In one embodiment, the protein comprises an amino acid sequence at least 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from SEQ ID NO:1-3. In another embodiment, the protein comprises an amino acid sequence at least 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence selected from SEQ ID NO: 1-3 and 5-181.

In another embodiment, the proteins comprise the formula X1-Z1-X2-Z2-X3-Z3-X4-Z4-X5-Z5-X6-Z6-X7-Z7-X8-Z8-X9, wherein:

X1 has an amino acid sequence at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence of MSEEQIRQFLRRFYEALD (SEQ ID NO: 182), wherein residue 14 is Y, D, or E and residue 18 is D or E;

X2 has an amino acid sequence at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence of ADTAASLF (SEQ ID NO: 183);

X3 has an amino acid sequence at least 50%, 75%, or 100% identical to the amino acid sequence of TIHL (SEQ ID NO: 184);

X4 has an amino acid sequence at least 33%, 66%, or 100% identical to the amino acid sequence of VTF;

X5 has an amino acid sequence at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence of EEFREWFERLFST (SEQ ID NO: 185);

X6 has an amino acid sequence at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence of QREIKSLEVR (SEQ ID NO: 186), wherein residue 2 is R;

X7 has an amino acid sequence at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence of VEVHVQLHATH (SEQ ID NO: 187);

X8 has an amino acid sequence at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence of KHTVDATHHWHFR (SEQ ID NO: 188), wherein residue 8 is H or N;

X9 has an amino acid sequence at least 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the amino acid sequence of VTEMRVHINPTG (SEQ ID NO: 189); and

wherein Z1, Z2, Z3, Z4, Z5, Z6, Z7, and Z8 are independently present or absent, and when present may comprise any amino acid sequence.

In one embodiment, 1, 2, 3, 4, 5, 6, 7, or all 8 of the following are true:

Z1 comprises SGD;

Z2 comprises HPGV (SEQ ID NO: 190);

Z3 comprises WDG;

Z4 comprises TSR;

Z5 comprises RKDA (SEQ ID NO: 191);

Z6 comprises GDT;

Z7 comprises NGQ;

Z8 comprises GNR; and wherein 0, 1, 2, 3, 4, 5, 6, 7, or all 8 of Z1, Z2, Z3, Z4, Z5, Z6, Z7, and Z8 further comprise an additional polypeptide domain.
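The X/Z formula can be illustrated by interleaving the recited segments and looking up the dyad positions by full-length residue number. The Python sketch below assumes that every X domain is 100% identical to its recited sequence and that every Z domain consists of exactly the residues listed above (the text only requires that a Z domain "comprises" them and allows it to be absent). The assembled string is therefore an illustrative construct, not a statement of what SEQ ID NO: 1 contains, and the function names are hypothetical.

```python
# Recited X segments (whitespace removed) and the listed Z residues.
X = ["MSEEQIRQFLRRFYEALD", "ADTAASLF", "TIHL", "VTF", "EEFREWFERLFST",
     "QREIKSLEVR", "VEVHVQLHATH", "KHTVDATHHWHFR", "VTEMRVHINPTG"]
Z = ["SGD", "HPGV", "WDG", "TSR", "RKDA", "GDT", "NGQ", "GNR"]

def assemble(x_segments, z_segments):
    """Interleave segments as X1-Z1-X2-...-Z8-X9."""
    if len(x_segments) != len(z_segments) + 1:
        raise ValueError("expected one more X segment than Z segments")
    parts = []
    for x, z in zip(x_segments, z_segments):
        parts.extend([x, z])
    parts.append(x_segments[-1])
    return "".join(parts)

def residue(seq, position):
    """Residue at a 1-based position."""
    return seq[position - 1]

seq = assemble(X, Z)

# Allowed residues at the dyad positions named in the preceding aspect.
DYAD_RULES = {14: "YDE", 98: "HN", 18: "DE", 65: "R"}
for pos, allowed in DYAD_RULES.items():
    print(pos, residue(seq, pos), residue(seq, pos) in allowed)
```

Under these assumptions the within-segment constraints (residue 14 and 18 of X1, residue 2 of X6, residue 8 of X8) land on full-length positions 14, 18, 65, and 98.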

In another aspect, the inventors have designed other proteins possessing luciferase activity with the 2-deoxycoelenterazine (h-CTZ) luciferin substrate. The designs were highly soluble, monodisperse, and monomeric; the luciferase activities were of the same order of magnitude as LuxSit embodiments described above. The most active design for h-CTZ, HTZ3-G4, was also highly specific for its target substrate (Fig. 4c and Fig. 8d).

Thus, in another aspect, the disclosure provides proteins having luciferase activity, comprising the secondary structure arrangement H1-L1-H2-L2-E1-L3-E2-L4-H3-L5-E3-L6-E4-L7-E5-L8-E6, wherein “H” is a helical domain, “L” is a loop domain, and “E” is a beta strand domain; wherein 1, 2, or all 3 of the following are true:

(a) H3 is at least 9, 10, 11, 12, 13, or 14 amino acids in length and residue 8 is R;

(b) E3 is at least 6, 7, 8, 9, or 10 amino acids in length, and residue 2 is Y; and/or

(c) E5 is at least 8, 9, 10, 11, or 12 amino acids in length, and residue 5 is S.

In one embodiment: the H1 domain is at least 14, 15, 16, 17, 18 or 19 amino acids in length; the H2 domain is at least 5, 6, or 7 amino acids in length; the E1 domain is at least 3 or 4 amino acids in length; the E2 domain is at least 3 or 4 amino acids in length; the H3 domain is at least 9, 10, 11, 12, 13, or 14 amino acids in length; the E3 domain is at least 6, 7, 8, 9, or 10 amino acids in length; the E4 domain is at least 7, 8, 9, 10, or 11 amino acids in length; the E5 domain is at least 7, 8, 9, 10, or 11 amino acids in length; and the E6 domain is at least 7, 8, 9, 10, or 11 amino acids in length.

In another embodiment: the H1 domain is 19 amino acids in length; the H2 domain is 7 amino acids in length; the E1 domain is 4 amino acids in length; the E2 domain is 5 amino acids in length; the H3 domain is 14 amino acids in length; the E3 domain is 12 amino acids in length; the E4 domain is 12 amino acids in length; the E5 domain is 15 amino acids in length; and the E6 domain is 9 amino acids in length.

In various further embodiments, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or all 24, of the following are true (residue number in parentheses referring to residue relative to SEQ ID NO:382 or 383):

(a) the H1 domain residue 13 is I or another hydrophobic residue (residue 13);

(b) the H1 domain residue 14 is I or another hydrophobic residue (residue 14);

(c) the H1 domain residue 17 is I or another hydrophobic residue (residue 17);

(d) the H2 domain residue 3 is F (residue 25);

(e) the H2 domain residue 7 is F (residue 29);

(f) the L2 domain is at least 4 amino acids in length, and residue 4 of L2 is V or another hydrophobic residue (residue 33);

(g) the E1 domain residue 2 is F (residue 35);

(h) the E1 domain residue 4 is N (residue 37);

(i) the L3 domain residue 1 is H (residue 38);

(j) the H3 domain residue 4 is L or another hydrophobic residue (residue 49);

(k) the H3 domain residue 7 is Q (residue 52);

(l) the H3 domain residue 11 is Y (residue 56);

(m) the E3 domain residue 3 is H (residue 64);

(n) the E3 domain residue 4 is V or another hydrophobic residue (residue 65);

(o) the E4 domain residue 5 is I or another hydrophobic residue (residue 78);

(p) the E4 domain residue 7 is G (residue 80);

(q) the E4 domain residue 8 is Y (residue 81);

(r) the E4 domain residue 9 is V or another hydrophobic residue (residue 82);

(s) the E5 domain residue 6 is I or another hydrophobic residue (residue 92);

(t) the E5 domain residue 7 is V or another hydrophobic residue (residue 93);

(u) the E5 domain residue 9 is L or another hydrophobic residue (residue 95);

(v) the E6 domain residue 3 is V or another hydrophobic residue (residue 103);

(w) the E6 domain residue 6 is A or another hydrophobic residue (residue 106); and/or

(x) the E6 domain residue 8 is V or another hydrophobic residue (residue 108).
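Each item in the list above pairs a residue number within a domain with a residue number in the full sequence, so every pair implies a start position for its domain (full-sequence number minus domain number plus one). The following Python sketch, offered only as a bookkeeping aid, infers those start positions from a subset of the listed pairs and reports any domain whose pairs disagree. The function names are hypothetical, and the final "mislabeled" entry is a made-up example included to show what a flagged inconsistency looks like.

```python
from collections import defaultdict

# (domain, residue within domain, residue in full sequence),
# taken from a subset of the items listed above.
PAIRS = [
    ("H1", 13, 13), ("H1", 14, 14), ("H1", 17, 17),
    ("H2", 3, 25), ("H2", 7, 29), ("L2", 4, 33),
    ("E1", 2, 35), ("E1", 4, 37), ("L3", 1, 38),
    ("H3", 4, 49), ("H3", 7, 52), ("H3", 11, 56),
    ("E3", 3, 64), ("E3", 4, 65),
    ("E4", 5, 78), ("E4", 7, 80),
    ("E5", 6, 92), ("E5", 7, 93), ("E5", 9, 95),
    ("E6", 3, 103),
]

def infer_starts(pairs):
    """Map each domain to the set of start positions implied by its pairs."""
    starts = defaultdict(set)
    for domain, rel, absolute in pairs:
        starts[domain].add(absolute - rel + 1)
    return dict(starts)

def inconsistent_domains(pairs):
    """Domains whose pairs imply more than one start position."""
    return sorted(d for d, s in infer_starts(pairs).items() if len(s) > 1)

def start_of(domain, pairs):
    """Single implied start position of a domain (raises if ambiguous)."""
    (start,) = infer_starts(pairs)[domain]
    return start

print(inconsistent_domains(PAIRS))
print(start_of("H3", PAIRS), start_of("E4", PAIRS))

# A deliberately mislabeled, hypothetical entry is flagged:
print(inconsistent_domains(PAIRS + [("E3", 8, 81)]))
```

The same check can be applied to any of the residue lists in this description that give both numbering schemes.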

In another embodiment, the protein comprises an amino acid sequence at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to SEQ ID NO: 382 or 383:

Figure 13 shows the domain structure mapped onto the sequence of SEQ ID NO:382.

In another aspect, the disclosure provides proteins having luciferase activity, comprising the secondary structure arrangement H1-L1-H2-L2-E1-L3-E2-L4-H3-L5-E3-L6-E4-L7-E5-L8-E6, wherein “H” is a helical domain, “L” is a loop domain, and “E” is a beta strand domain; wherein one or both of the following are true:

(a) H3 is at least 9, 10, 11, 12, 13, or 14 amino acids in length and residue 3 is T; and/or

(b) E6 is at least 8 or 9 amino acids in length, and residue 8 is R.

In one embodiment: the H1 domain is at least 15, 16, 17, 18, 19, or 20 amino acids in length; the H2 domain is at least 5, 6, or 7 amino acids in length; the E1 domain is at least 3 or 4 amino acids in length; the E2 domain is at least 3, 4, or 5 amino acids in length; the H3 domain is at least 9, 10, 11, 12, 13, or 14 amino acids in length; the E3 domain is at least 8, 9, 10, 11, or 12 amino acids in length; the E4 domain is at least 8, 9, 10, 11, or 12 amino acids in length; the E5 domain is at least 10, 11, 12, 13, 14 or 15 amino acids in length; and the E6 domain is at least 6, 7, 8, or 9 amino acids in length.

In another embodiment: the H1 domain is 20 amino acids in length; the H2 domain is 7 amino acids in length; the E1 domain is 4 amino acids in length; the E2 domain is 5 amino acids in length; the H3 domain is 14 amino acids in length; the E3 domain is 12 amino acids in length; the E4 domain is 12 amino acids in length; the E5 domain is 15 amino acids in length; and the E6 domain is 9 amino acids in length.

In various further embodiments, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, or all 27 of the following are true:

(a) the H1 domain residue 11 is V or another hydrophobic residue (residue 11);

(b) the H1 domain residue 14 is A or another hydrophobic residue (residue 14);

(c) the H1 domain residue 15 is A or another hydrophobic residue (residue 15);

(d) the H1 domain residue 18 is A or another hydrophobic residue (residue 18);

(e) the H1 domain residue 19 is L or another hydrophobic residue (residue 19);

(f) the H2 domain residue 4 is L or another hydrophobic residue (residue 26);

(g) the L2 domain residue 1 is H (residue 30);

(h) the L2 domain is at least 5 amino acids in length, and L2 domain residue 5 is F (residue 34);

(i) the E1 domain residue 2 is A or another hydrophobic residue (residue 36);

(j) the E1 domain residue 3 is K (residue 37);

(k) the E1 domain residue 4 is D (residue 38);

(l) the E2 domain residue 3 is W (residue 43);

(o) the H3 domain residue 6 is V or another hydrophobic residue (residue 52);

(p) the H3 domain residue 7 is I or another hydrophobic residue (residue 53);

(q) the H3 domain residue 10 is Y (residue 56);

(r) the H3 domain residue 11 is Y (residue 57);

(s) the E3 domain residue 2 is V or another hydrophobic residue (residue 65);

(t) the E3 domain residue 4 is A or another hydrophobic residue (residue 67);

(u) the E4 domain residue 6 is M (residue 83);

(v) the E4 domain residue 8 is V or another hydrophobic residue (residue 85);

(w) the E4 domain residue 10 is F (residue 87);

(x) the E5 domain residue 9 is V or another hydrophobic residue (residue 100);

(y) the E6 domain residue 3 is L or another hydrophobic residue (residue 111);

(z) the E6 domain residue 5 is E (residue 113); and/or

(aa) the E6 domain residue 6 is F (residue 114).

In another embodiment, the protein comprises an amino acid sequence at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to SEQ ID NO: 384 or 385:

Figure 13 shows the domain structure mapped onto the sequence of SEQ ID NO:384. In exemplary embodiments, the proteins of these aspects may be encoded by a nucleic acid comprising the nucleotide sequence of SEQ ID NO:386 or 387.

>HTZ3-D2 (codon optimized for E. coli expression)

>HTZ3-G4 (codon optimized for E. coli expression)

In one embodiment of all of the proteins of the disclosure, amino acid substitutions relative to the reference protein are conservative amino acid substitutions. As used herein, “conservative amino acid substitution” means a given amino acid can be replaced by a residue having similar physicochemical characteristics, e.g., substituting one aliphatic residue for another (such as Ile, Val, Leu, or Ala for one another), or substitution of one polar residue for another (such as between Lys and Arg; Glu and Asp; or Gln and Asn). Other such conservative substitutions, e.g., substitutions of entire regions having similar hydrophobicity characteristics, are known. Proteins comprising conservative amino acid substitutions can be tested in any one of the assays described herein to confirm that a desired activity is retained. Amino acids can be grouped according to similarities in the properties of their side chains (in A. L. Lehninger, Biochemistry, second ed., pp. 73-75, Worth Publishers, New York (1975)): (1) non-polar: Ala (A), Val (V), Leu (L), Ile (I), Pro (P), Phe (F), Trp (W), Met (M); (2) uncharged polar: Gly (G), Ser (S), Thr (T), Cys (C), Tyr (Y), Asn (N), Gln (Q); (3) acidic: Asp (D), Glu (E); (4) basic: Lys (K), Arg (R), His (H). Alternatively, naturally occurring residues can be divided into groups based on common side-chain properties: (1) hydrophobic: Norleucine, Met, Ala, Val, Leu, Ile; (2) neutral hydrophilic: Cys, Ser, Thr, Asn, Gln; (3) acidic: Asp, Glu; (4) basic: His, Lys, Arg; (5) residues that influence chain orientation: Gly, Pro; (6) aromatic: Trp, Tyr, Phe. Non-conservative substitutions will entail exchanging a member of one of these classes for another class.

Particular conservative substitutions include, for example: Ala into Gly or into Ser; Arg into Lys; Asn into Gln or into His; Asp into Glu; Cys into Ser; Gln into Asn; Glu into Asp; Gly into Ala or into Pro; His into Asn or into Gln; Ile into Leu or into Val; Leu into Ile or into Val; Lys into Arg, into Gln or into Glu; Met into Leu, into Tyr or into Ile; Phe into Met, into Leu or into Tyr; Ser into Thr; Thr into Ser; Trp into Tyr; Tyr into Trp; and/or Phe into Val, into Ile or into Leu.
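The class-based definition above lends itself to a simple lookup. The Python sketch below encodes the second grouping (hydrophobic, neutral hydrophilic, acidic, basic, chain orientation, aromatic) using one-letter codes and treats a substitution as conservative when both residues fall in the same class. It is a simplified illustration: norleucine is omitted because it has no standard one-letter code, the explicit pair list of "particular conservative substitutions" is not encoded (some of those pairs cross these classes), and the function names and the "R60S"-style parsing are assumptions for the example.

```python
import re

# Second side-chain grouping from the text, as one-letter codes.
GROUPS = {
    "hydrophobic": "MAVLI",
    "neutral hydrophilic": "CSTNQ",
    "acidic": "DE",
    "basic": "HKR",
    "chain orientation": "GP",
    "aromatic": "WYF",
}
CLASS_OF = {aa: name for name, members in GROUPS.items() for aa in members}

def parse_substitution(label):
    """Split a label such as 'M110V' into (original, position, replacement)."""
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if m is None:
        raise ValueError(f"not a substitution label: {label!r}")
    return m.group(1), int(m.group(2)), m.group(3)

def is_conservative(original, replacement):
    """True if both residues belong to the same side-chain class."""
    return CLASS_OF[original] == CLASS_OF[replacement]

for label in ["M110V", "R60S"]:
    old, pos, new = parse_substitution(label)
    print(label, CLASS_OF[old], "->", CLASS_OF[new], is_conservative(old, new))
```

Whether a class-based or pair-based definition is applied, the text leaves confirmation of retained activity to the assays described herein.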

In another embodiment, the disclosure provides a self-complementing multipartite protein having luciferase activity, comprising at least a first polypeptide component and a second polypeptide component, wherein the at least first polypeptide component and the second polypeptide component are not covalently linked, wherein in total the at least first polypeptide component and the second polypeptide component comprise domains X1-Z1-X2-Z2-X3-Z3-X4-Z4-X5-Z5-X6-Z6-X7-Z7-X8-Z8-X9, wherein each domain is as defined above; and wherein (a) each X domain is fully present within one polypeptide component of the at least first polypeptide component and the second polypeptide component, and (b) none of the at least first polypeptide component and the second polypeptide component include each of X1, X2, X3, X4, X5, X6, X7, X8, and X9.

The split proteins comprise at least a first polypeptide component and a second polypeptide component in which X domains are preserved while split points are taken only in the Z domains. In other words, each X domain (X1, X2, X3, X4, X5, X6, X7, X8, and X9) is fully present within one polypeptide component of the at least first polypeptide component and the second polypeptide component, while the protein is split into separate components at a Z domain (of Z1, Z2, Z3, Z4, Z5, Z6, Z7, and Z8), wherein the Z domain that the split occurs at may be absent, or may be partially present in one or both of the first and second polypeptide components. By way of non-limiting example, in various embodiments of a split luciferase protein, the first polypeptide component and the second polypeptide component may comprise components as exemplified in Table 2.

Table 2

In various embodiments, the split may occur at Z4, Z5, Z6, or Z7.

In another embodiment, the disclosure provides a self-complementing multipartite protein having luciferase activity, comprising at least a first polypeptide component and a second polypeptide component, wherein the at least first polypeptide component and the second polypeptide component are not covalently linked, wherein in total the at least first polypeptide component and the second polypeptide component comprise the secondary structure arrangement H1-L1-H2-L2-E1-L3-E2-L4-H3-L5-E3-L6-E4-L7-E5-L8-E6, wherein each domain is as defined above; wherein (a) each H and E domain is fully present within one polypeptide component of the at least first polypeptide component and the second polypeptide component, and (b) none of the at least first polypeptide component and the second polypeptide component include all of the H and E domains.

In this embodiment, the split proteins comprise at least a first polypeptide component and a second polypeptide component in which H and E domains are preserved while split points are taken only in the L domains. In various embodiments, the split occurs at L4, L5, L6, L7, or L8.

The split proteins of these embodiments are only active when they are brought together, and thus are conditionally active.
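The splitting rule described above (structured domains kept whole, split points taken only in the connecting Z or L domains) can be sketched as an operation on an ordered list of domain names. The Python below is a simplified illustration with hypothetical function and parameter names; in particular, it models the split-point domain as either dropped or assigned whole to one component, whereas the text also allows it to be partially present in one or both components.

```python
def x_z_order(n_x=9):
    """Domain names in the order X1, Z1, X2, ..., Z8, X9."""
    names = []
    for i in range(1, n_x + 1):
        names.append(f"X{i}")
        if i < n_x:
            names.append(f"Z{i}")
    return names

def split_at(domains, connector, keep_in="neither"):
    """Split an ordered domain list at a connecting (Z or L) domain.

    keep_in selects where the connector itself goes:
    'first', 'second', or 'neither' (connector absent).
    """
    if not connector.startswith(("Z", "L")):
        raise ValueError("split points are taken only in Z or L domains")
    idx = domains.index(connector)
    first, second = domains[:idx], domains[idx + 1:]
    if keep_in == "first":
        first = first + [connector]
    elif keep_in == "second":
        second = [connector] + second
    elif keep_in != "neither":
        raise ValueError("keep_in must be 'first', 'second', or 'neither'")
    return first, second

first, second = split_at(x_z_order(), "Z5")
print("-".join(first))
print("-".join(second))
```

The same function applies to the H/L/E naming if the list H1, L1, H2, ... is passed in with an L domain as the connector.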

In another embodiment the disclosure provides fusion proteins comprising:

(a) the protein or polypeptide component of any embodiment or combination of embodiments herein; and

(b) one or more additional functional domains.

As used herein, a “functional domain” is any polypeptide that can be usefully fused to the luciferase protein or split protein component of the disclosure. By way of non-limiting examples, the one or more additional functional domains may comprise a diagnostic polypeptide, any protein that one might want to localize within a cell, tissue, or organism; etc.

In another aspect the disclosure provides nucleic acids encoding the protein, protein component, or fusion protein of any embodiment or combination of embodiments of the disclosure. The nucleic acid sequence may comprise single stranded or double stranded RNA (such as an mRNA) or DNA in genomic or cDNA form, or DNA-RNA hybrids, each of which may include chemically or biochemically modified, non-natural, or derivatized nucleotide bases. Such nucleic acid sequences may comprise additional sequences useful for promoting expression and/or purification of the encoded polypeptide, including but not limited to polyA sequences, modified Kozak sequences, and sequences encoding epitope tags, export signals, and secretory signals, nuclear localization signals, and plasma membrane localization signals. It will be apparent to those of skill in the art, based on the teachings herein, what nucleic acid sequences will encode the polypeptides of the disclosure. In various non-limiting embodiments, the nucleic acid may comprise a nucleotide sequence at least 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical to the nucleotide sequence of any one of SEQ ID NO:200-380 or 386-387, wherein residues in parentheses are optional and may be present or absent, and when absent are not considered in determining percent identity relative to the reference sequence.

In a further aspect, the disclosure provides expression vectors comprising the nucleic acid of any aspect of the disclosure operatively linked to a suitable control sequence. "Expression vector" includes vectors that operatively link a nucleic acid coding region or gene to any control sequences capable of effecting expression of the gene product. “Control sequences” operably linked to the nucleic acid sequences of the disclosure are nucleic acid sequences capable of effecting the expression of the nucleic acid molecules. The control sequences need not be contiguous with the nucleic acid sequences, so long as they function to direct the expression thereof. Thus, for example, intervening untranslated yet transcribed sequences can be present between a promoter sequence and the nucleic acid sequences and the promoter sequence can still be considered "operably linked" to the coding sequence. Other such control sequences include, but are not limited to, polyadenylation signals, termination signals, and ribosome binding sites. Such expression vectors can be of any type, including but not limited to plasmid and viral-based expression vectors. The control sequence used to drive expression of the disclosed nucleic acid sequences in a mammalian system may be constitutive (driven by any of a variety of promoters, including but not limited to, CMV, SV40, RSV, actin, EF) or inducible (driven by any of a number of inducible promoters including, but not limited to, tetracycline, ecdysone, steroid-responsive). The expression vector must be replicable in the host organisms either as an episome or by integration into host chromosomal DNA. In various embodiments, the expression vector may comprise a plasmid, viral-based vector, or any other suitable expression vector.

In another aspect, the disclosure provides host cells that comprise the nucleic acids, expression vectors (i.e., episomal or chromosomally integrated), non-naturally occurring polypeptides, fusion protein, or compositions disclosed herein, wherein the host cells can be either prokaryotic or eukaryotic. The cells can be transiently or stably engineered to incorporate the nucleic acids or expression vector of the disclosure, using techniques including but not limited to bacterial transformations, calcium phosphate co-precipitation, electroporation, or liposome mediated-, DEAE dextran mediated-, polycationic mediated-, or viral mediated transfection.

The disclosure also provides kits, comprising:

(a) the protein, polypeptide component, fusion protein, nucleic acid, expression vector, and/or host cell of any embodiment or combination of embodiments herein; and

(b) instructions for their use.

In one embodiment, the kits further comprise diphenylterazine (DTZ).

In another aspect, the disclosure provides methods for use of the protein, polypeptide component, fusion protein, nucleic acid, expression vector, host cell, and/or kit of any preceding claim for any suitable purpose, including but not limited to use in luminescent reporting assays, diagnostic assays, cellular localization of targets of interest, cellular imaging, gene editing, live animal imaging, cancer labeling, CAR-T cell reporting, secreted assays, gene delivery, tissue engineering, etc. Additional details can be found in the examples.

Examples

Abstract

De novo enzyme design has sought to introduce active sites and substrate binding pockets predicted to catalyze a reaction of interest into geometrically compatible native scaffolds 1,2 , but has been limited by a lack of suitable protein structures and the complexity of native protein sequence-structure relationships. Here we describe a deep-learning based “family-wide hallucination” approach that generates large numbers of idealized protein structures containing diverse pocket shapes and designed sequences that encode them. We use these scaffolds to design artificial luciferases that selectively catalyze the oxidative chemiluminescence of the synthetic luciferin substrates, diphenylterazine (DTZ) 3 and 2-deoxycoelenterazine (h-CTZ), through the placement of an arginine guanidinium group adjacent to an anion species that develops during the reaction in a high shape complementarity binding pocket. For both luciferin substrates, we obtain designed luciferases with high selectivity; the most active of these is a small (13.9 kDa) and thermostable (T_M > 95 °C) enzyme with a catalytic efficiency on DTZ (k_cat/K_M = 10⁶ M⁻¹ s⁻¹) comparable to native luciferases but with much higher substrate specificity. The design of highly active and specific biocatalysts from scratch with broad applications in biomedicine is an important milestone for computational enzyme design, and our approach should enable the design of a wide range of new and useful luciferases and other enzymes.

Bioluminescent light produced by the enzymatic oxidation of a luciferin substrate is widely used for bioassays and imaging in biomedical research. Because no excitation light source is needed, luminescent photons are produced in the dark, which results in higher sensitivity than fluorescence imaging in live animal models and in biological samples where autofluorescence or phototoxicity is a concern 4,5 . However, the development of luciferases as molecular probes has lagged behind that of well-developed fluorescent protein toolkits for a number of reasons: (i) very few native luciferases have been identified; (ii) many of those that have been identified require multiple disulfide bonds to stabilize the structure and are therefore prone to misfolding in mammalian cells; (iii) most native luciferases do not recognize synthetic luciferins with more desirable photophysical properties; and (iv) multiplexed imaging to follow multiple processes in parallel using mutually orthogonal luciferase-luciferin pairs has been limited by the low substrate specificity of native luciferases.

We sought to use de novo protein design to create new luciferases that are small, highly stable, well-expressed in cells, specific for one substrate, and need no cofactors to function. As we are not constrained to natural luciferase substrates, we chose a synthetic luciferin, diphenylterazine (DTZ), as the target substrate due to its good quantum yield, red-shifted emission 3 , favorable in vivo pharmacokinetics 14,15 , and lack of required cofactors for emission. Previous computational enzyme design studies have primarily repurposed native protein scaffolds in the PDB, but there are few native structures with binding pockets appropriate for DTZ, and the effects of sequence changes on native proteins can be unpredictable. To circumvent these limitations, we set out to generate large numbers of ideal protein scaffolds with pockets of the appropriate size and shape for DTZ, and with clear sequence-structure relationships to facilitate subsequent active site incorporation. To identify protein folds capable of hosting such pockets, we first docked DTZ into 4000 native small molecule binding proteins. We found that many NTF2 (nuclear transport factor 2)-like folds have binding pockets with appropriate shape complementarity and size for DTZ placement (Fig. 1e), and hence selected the NTF2-like superfamily as the target topology.

Family-wide hallucination

Native NTF2 structures have a range of pocket sizes and shapes but also contain nonideal features such as long loops which compromise stability. To create large numbers of ideal NTF2-like structures, we developed a deep-learning based “family-wide hallucination” approach that integrates unconstrained de novo design 19,20 and fixed backbone sequence design approaches 21 to enable the generation of an essentially unlimited number of proteins having a desired fold (Fig. 1a). The family-wide hallucination approach utilizes the de novo sequence and structure discovery capability of unconstrained protein hallucination 19,20 for loop and variable regions, and structure-guided sequence optimization for core regions. We employed the trRosetta™ structure prediction neural network 22 , which is effective in identifying experimentally successful de novo designed proteins and hallucinating new globular proteins of diverse topologies. Starting from sequences and predicted structures of 2,000 naturally occurring NTF2s, we used trRosetta™ to optimize the amino acid sequence of conserved core and variable loop regions. Protein core idealization was carried out with a topology-specific loss function over core residue pair geometries (see Methods), and variable loop optimization was carried out by optimizing sequence length and identity to maximize the confidence of the neural network in the predicted structure. To further encode structural specificity, we incorporated buried, long-range hydrogen-bonding networks. The resulting 1615 family-wide hallucinated NTF2 scaffolds provided more shape complementary binding pockets for DTZ than native small-molecule binding proteins (Fig. 1e). This approach samples protein backbones closer to native NTF2-like proteins (Fig. 1f) and with better scaffold quality metrics than a previous non deep-learning approach 23 (Fig. 1g).

De novo design of luciferases for DTZ

Standard computational enzyme design generally starts from an ideal active site or theozyme consisting of protein functional groups surrounding the reaction transition state that is then extrapolated into a set of existing scaffolds 1,2 . However, the detailed mechanism of native marine luciferases is not well defined, as only a handful of apo-structures and no holo-structures have been solved 24,25 (excluding calcium-regulated photoproteins). Both quantum chemistry calculations 26,27 and experimental data 28,29 suggest that the chemiluminescent reaction proceeds through an anionic species and that the polarity of the surroundings can substantially alter the free energy of the subsequent single electron transfer (SET) process with triplet molecular oxygen (³O₂). Guided by these data (Fig. 5), we sought to design a shape complementary catalytic site that stabilizes the anionic state of DTZ and lowers the SET energy barrier, assuming that the downstream dioxetane light emitter thermolysis steps are spontaneous. To stabilize the anionic species of DTZ, we focused on the placement of the positively charged guanidinium group of an arginine residue to interact with the anionic imidazopyrazinone core. To computationally design such active sites into large numbers of hallucinated NTF2 scaffolds, we first used AIMNet 30 to generate an ensemble of anionic DTZ conformers (Fig. 1b). Next, around each conformer, we used the RIFgen method 31,32 to enumerate Rotamer Interaction Fields (RIFs) on 3D grids consisting of millions of placements of amino acid sidechains making hydrogen bonding and nonpolar interactions with DTZ (Fig. 1c). Additionally, we included an arginine guanidinium group near the deprotonation site at the nitrogen of imidazopyrazinone (N1 atom) in the RIF. RIFdock was then used to dock each DTZ conformer and associated RIF in the central cavity of each scaffold to maximize protein-DTZ interactions.
An average of eight sidechain rotamers, including an arginine to stabilize the anionic imidazopyrazinone core, were positioned in each pocket (Fig. 1d). For the top 50,000 docks with the most favorable sidechain-DTZ interactions, we optimized the remainder of the sequence using RosettaDesign™ (Fig. 1d) for high-affinity binding to DTZ with a bias towards the naturally observed sequence variation to ensure foldability. During the design process, pre-defined hydrogen bond networks (HBNets) in the scaffolds were kept intact for structural specificity and stability, and interactions of these HBNet side chains with DTZ were explicitly required in the RIFdock step to ensure preorganization of residues essential for catalysis. In the first sequence design phase, the identities of all RIF and HBNet residues were kept fixed, and the surrounding residues were optimized to hold the sidechain-DTZ interactions in place and maintain structural specificity. In the second sequence design step, the RIF residue identities (except the arginine) were also allowed to vary, to identify apolar and aromatic packing interactions missed in the RIF due to binning effects. During sequence design, the scaffold backbone, sidechains, and DTZ substrate were allowed to relax in Cartesian space. Following sequence optimization, the designs were filtered based on ligand-binding energy, protein-ligand hydrogen bonds, shape complementarity, and contact molecular surface, and 7982 designs were selected and ordered as pooled oligos for experimental screening.

Screening and characterization of DTZ specific luciferases

Oligonucleotides encoding the two halves of each design were assembled into full-length genes and cloned into an E. coli expression vector (see Methods). A colony-based screening method was used to directly image active luciferase colonies from the library, and the activities of selected clones were confirmed using a 96-well plate expression (Fig. 6). Three active designs were identified; we refer to the most active of these as LuxSit (Latin: let light exist); LuxSit is the smallest known luciferase with 117 residues (13.9 kDa). Biochemical analysis, including SDS-PAGE and size exclusion chromatography (Fig. 2ab and Fig. 7), indicated that LuxSit is highly expressed, soluble, and monomeric from E. coli expression. Circular dichroism (CD) spectroscopy showed a strong far UV CD signature, suggesting an organized α-β structure. CD melting experiments showed that the protein is not fully unfolded at 95 °C, and the full structure is regained when the temperature is dropped (Fig. 2c). Incubation of LuxSit with DTZ resulted in luminescence with an emission peak at ~480 nm (Fig. 2d), consistent with the DTZ chemiluminescence spectrum. While we were not able to determine the crystal structure of LuxSit, the AlphaFold2 33 predicted structure is very close to the design model at the backbone level (RMSD = 1.3 Å) and over the side chains interacting with the substrate (Fig. 2e). The designed LuxSit active site contains Tyr14-His98 and Asp18-Arg65 dyads, with the imidazole nitrogen atoms of His98 making hydrogen bond interactions with Tyr14 and the O1 atom of DTZ (Fig. 2f). The center of the Arg65 guanidinium cation is 4.2 Å from the N1 atom of DTZ, and Asp18 forms a bidentate hydrogen bond to the guanidinium group and backbone N-H of Arg65 (Fig. 2g).

De novo design of luciferases for h-CTZ

We next sought to apply the knowledge gained from designing LuxSit to design 2-deoxycoelenterazine (h-CTZ) specific luciferases. Since the molecular shape of h-CTZ is different from that of DTZ, we created an additional set of NTF2 superfamily scaffolds (see Methods) with matching pocket shapes and high predicted model confidence (AlphaFold2 predicted local distance difference test, pLDDT > 92). We then installed catalytic sites in these scaffolds and designed the protein sidechain-h-CTZ interactions using RosettaDesign™ as described above for DTZ. To design the remainder of the sequence, we used ProteinMPNN 34 , which can result in better stability, solubility, and accuracy than RosettaDesign™. Following filtering based on the AlphaFold2 predicted pLDDT, Cα RMSD, the protein-h-CTZ contact molecular surface, and computed binding energies (see Methods), we selected and experimentally expressed 46 designs in E. coli and identified two (HTZ3-D2 and HTZ3-G4) with luciferase activity with the h-CTZ luciferin substrate. Both designs were highly soluble, monodisperse, and monomeric; the luciferase activities were of the same order of magnitude as LuxSit (Fig. 8). The success rate in de novo luciferase design increased from 3/7982 to 2/46 sequences in the second round, likely due to the better understanding of functional luciferase active site geometry and the robustness of the deep-learning design method.

Activity optimization

To better understand the contributions to catalysis of LuxSit, the most active of our luciferase designs, we constructed a site saturation mutagenesis (SSM) library in which every mutation was made at every pocket residue one at a time (see Methods). Fig. 2f-i illustrate the amino-acid preferences at key positions. Arg65 is highly conserved, and its dyad partner Asp18 can only be mutated to Glu (which reduces activity), suggesting the carboxylate-Arg65 hydrogen bond is important for luciferase activity. In the Tyr14-His98 dyad, Tyr14 can be substituted with Asp and Glu, while His98 can be replaced with Asn. As all active variants had hydrogen bond donors and acceptors at these positions, the dyad may help mediate the electron and proton transfer required for luminescence. Hydrophobic (Fig. 2h) and π-stacking (Fig. 2i) residues at the binding interface tolerate other aromatic or aliphatic substitutions and generally prefer the amino acid in the original design, consistent with model-based affinity predictions of mutational effects. The A96M and M110V mutants increase activity by 16-fold and 19-fold over LuxSit, respectively (Table 4). Optimization guided by these results yielded LuxSit-f (A96M/M110V) with strong initial flash emission and LuxSit-i (R60S/A96L/M110V) with more than 100-fold higher photon flux over LuxSit (Fig. 9). Overall, the active site saturation mutagenesis results support the design model, with the Tyr14-His98 and Asp18-Arg65 dyads playing key roles in catalysis and the substrate-binding pocket largely conserved.

The most active catalysts, LuxSit-f and LuxSit-i, were both expressed solubly in E. coli at high levels and are monomeric (some dimerization was observed at high protein concentration, Fig. 7l) and thermostable (Fig. 7j-k). Similar to native CTZ-utilizing luciferases, the apparent Michaelis constants K_M of both LuxSit-f and LuxSit-i are in the low μM range (Fig. 3a) and the luminescent signal decays over time due to fast catalytic turnover (Fig. 10a). LuxSit-i is a very efficient enzyme with a k_cat/K_M of 10⁶ M⁻¹ s⁻¹. The luminescence signal is readily visible to the naked eye (Fig. 3b), and the photon flux (photon s⁻¹) is 38% greater than that of the native Renilla reniformis luciferase (RLuc) (Table 3). The DTZ luminescent reaction catalyzed by LuxSit-i is pH-dependent (Fig. 10b), consistent with the proposed mechanism.

Cellular imaging and multiplexed bioassay

As luciferases are commonly used genetic tags and reporters for the study of cellular functions, we evaluated the expression and function of LuxSit-i in live mammalian cells. LuxSit-i-mTagBFP2-expressing HEK293T cells had DTZ-specific luminescence (Fig. 3c), which was maintained following targeting of LuxSit-i to the nucleus, membrane, and mitochondria (Fig. 11). Native and previously engineered luciferases are quite promiscuous with activity on many luciferin substrates (Fig. 4ac), possibly due to their large and open pockets (a luciferase with high specificity to one luciferin substrate has been difficult to control even with extensive directed evolution 35 ). In contrast, LuxSit-i exhibited exquisite specificity to its target luciferin with 50-fold selectivity for DTZ over bis-CTZ (which differ only in a benzylic carbon). The most active design for h-CTZ, HTZ3-G4, was also highly specific for its target substrate (Fig. 4c and Fig. 8d). Overall, the specificity of our designed luciferases (LuxSit-i and HTZ3-G4) is much greater than that of native luciferases 36,37 or previously engineered luciferases 38 .

The high substrate specificity of LuxSit-i might allow multiplexing of luminescent reporters through substrate-specific or spectrally resolved luminescent signals (Fig. 4d and Fig. 12ab). To explore this possibility, we tracked two independent signaling pathways (cAMP/PKA and NF-κB) by placing the expression of either RLuc or LuxSit-i downstream of the NF-κB or cAMP response element promoters, respectively (Fig. 4e). Imaging in the presence of the substrates for the two luciferases (PP-CTZ for RLuc and DTZ for LuxSit-i) one at a time (Fig. 4f) can clearly distinguish known activators of the two pathways. Because the luminescence of the two reactions occurs at different wavelengths, we were also able to simultaneously assess the activation of the two signaling pathways in the same sample (Fig. 4g), in either intact HEK293T cells or cell lysates (Fig. 12c-e), by providing both substrates together and monitoring luminescence at different wavelengths.

Conclusion

Computational enzyme design to date has been constrained by the number of available scaffolds, which limits the extent to which catalytic configurations and enzyme-substrate shape complementarity can be achieved 16-18 . The use of deep-learning to generate large numbers of de novo designed scaffolds here eliminates this restriction; moving forward, the more accurate RoseTTAfold™ 39 and AlphaFold2™ 33 should enable still more effective protein scaffold generation by leveraging family-wide hallucination abilities. The diversity of scaffold pocket shapes and sizes enabled the exploration of a range of catalytic geometries and the maximization of substrate-enzyme shape complementarity; to our knowledge, no native luciferases have folds similar to LuxSit and HTZ3-G4, and the two enzymes have high specificity for fully synthetic luciferin substrates that do not exist in nature. With the incorporation of 2-3 substitutions that provide a more complementary pocket to stabilize the transition state, LuxSit-i has higher activity than any previously de novo designed enzyme; the k_cat/K_M of 10⁶ M⁻¹ s⁻¹ is in the range of native luciferases. This is a notable advance for computational enzyme design, as tens of rounds of directed evolution were required to obtain catalytic proficiencies in this range for a designed retroaldolase, and the structure was remodeled considerably 40 ; in contrast, the predicted differences in ligand-sidechain interactions between LuxSit and LuxSit-i are very subtle. Achieving such high activities directly from the computer remains an outstanding goal for computational enzyme design.
The small size of LuxSit makes it well suited as a genetic tag for capacity-limited viral vectors, biosensor development, and fusions to proteins of interest. On the basic science side, the small size, simplicity, and high activity make LuxSit-i an excellent model system for computational and experimental studies aimed at improving understanding of the luciferase catalytic mechanism. Extension of the approach used here to create similarly specific new luciferases for synthetic luciferin substrates beyond DTZ and h-CTZ would considerably extend the multiplexing opportunities illustrated in Fig. 4 or with the microscopy phasor 41 , leading to widely useful multiplexed luminescent toolkits. More generally, our family-wide hallucination method opens up an almost unlimited number of new scaffold possibilities for substrate binding and catalytic residue placement, which is particularly important in cases where the reaction mechanism, and how to promote it, are not completely understood: many structural and catalytic hypotheses can be readily enumerated with different catalytic residue placements in shape and chemically complementary binding pockets. While luciferases are unique in catalyzing the emission of light, the chemical transformation of substrates into products is common to all enzymes, and the approach developed here should be readily extendable to a wide variety of chemical reactions.

Methods

1. Materials and general methods

Synthetic genes and oligonucleotides were purchased from Integrated DNA Technologies or GenScript. The synthetic gene was inserted between the NdeI and XhoI sites of a pET29b+ vector, containing an N-terminal hexahistidine tag followed by a TEV protease cleavage site and a C-terminal stop codon. Restriction endonucleases, Q5 PCR polymerase, and T4 ligase were purchased from NEB. Plasmid DNA, PCR products, or digested fragments were purified by Qiagen DNA purification kits. DNA sequences were analyzed by Genewiz. Coelenterazine (CTZ) was purchased from Gold Biotechnology. Diphenylterazine (DTZ), pyridyl diphenylterazine (8pyDTZ), and Furimazine (FRZ) were purchased from MedChemExpress. All other coelenterazine analogs (bis-CTZ: bisdeoxycoelenterazine; f-CTZ: f-Coelenterazine; e-CTZ: e-Coelenterazine-F; PP-CTZ: methoxy e-Coelenterazine; v-CTZ: v-Coelenterazine; h-CTZ: 2-deoxycoelenterazine) were ordered from NanoLight. All other chemicals were purchased from Sigma-Aldrich or Fisher Scientific and used without further purification. To identify the molecular mass of each protein, intact mass spectra were obtained via reverse-phase LC/MS on an Agilent 6230B TOF on an AdvanceBio RP-Desalting column and subsequently deconvoluted by Bioconfirm software using a total entropy algorithm. An ÄKTA pure M with UNICORN 6.3.2 Workstation control (GE Healthcare) coupled with a Superdex 75 Increase 10/300 GL column was used for size exclusion chromatography. DNA and protein concentrations were determined by a NanoDrop small-volume 8 channel UV/vis spectrometer. CD spectra and CD melting experiments were performed by the default setting on a J-1500 Circular Dichroism Spectropolarimeter (Jasco). All luminescence measurements were acquired by a Biotek Synergy Neo2 Multi-Mode Plate Reader.
To convert relative light units (RLU) to the number of photons, the Neo2 plate reader was calibrated by determining the chemiluminescence of luminol with known quantum yield in the presence of horseradish peroxidase and hydrogen peroxide in K₂CO₃ aqueous solution as previously described 45 . SDS-PAGE and luminescence images were captured by a Bio-Rad ChemiDoc™ XRS+. Images were analyzed using the Fiji image analysis software.

2. General procedures for protein production and purification

The Lemo21(DE3) strain was used for transformation with the pET29b+ plasmid encoding the gene of interest. Transformed cells were grown for 12 h in TB medium supplemented with kanamycin. Cells were inoculated at a 1:50 ratio in 100 mL fresh TB medium, grown at 37 °C for 4 h, and then induced by IPTG for an additional 18 h at 16 °C. Cells were harvested by centrifugation at 4,000g for 10 min and resuspended in 30 mL lysis buffer (20 mM Tris-HCl pH 8.0, 300 mM NaCl, 30 mM imidazole, and Pierce™ Protease Inhibitor Tablets). Cell resuspensions were lysed by sonication for 5 min (10 s per cycle). Lysates were clarified by centrifugation at 24,000g at 12 °C for 40 min and pre-equilibrated with 1 mL of Ni-NTA nickel agarose at 4 °C for 1 h. The resin was washed twice with 10 mL wash buffer and then eluted in 1 mL elution buffer (20 mM Tris-HCl pH 8.0, 300 mM NaCl, 300 mM imidazole). The eluted proteins were purified by size exclusion chromatography in PBS. Fractions were collected based on A280 trace, snap-frozen in liquid nitrogen, and stored at -80 °C.

3. Computational design of idealized scaffolds

Our generation of idealized NTF2-scaffolds can be divided into four parts: (3.1) generation of seed structures, (3.2) optimization of backbone geometries using trRosetta™-based hallucination, (3.3) generation of structure-conditioned sequence models to bias design, (3.4) design and filtering.

3.1 Generation of seed structures

We sought to increase the set of NTF2 structures by complementing experimentally resolved structures from the PDB with highly accurate models generated by trRosetta™ 22 . To achieve this, we first collected 85 NTF2-like protein structures from the PDB based on SCOPe annotation (d.17.4, SCOPe v2.05). Corresponding sequences were then used as queries to collect sequence homologs from UniProt™ by performing 8 iterations of hhblits at a 1e-20 e-value cutoff against the uniclust30_2018_08 database; default filtering cutoffs were relieved (-maxfilt 100000000 -neffmax 20 -nodiff -realign_max 10000000) to maximize the number of output hits. All the hits were redundancy reduced using cd-hit 46 with a sequence identity cutoff of 60%, yielding a set of 7,573 candidates for modeling.
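As an illustration of the homolog search settings just described, the following sketch assembles an hhblits argument list without running it. The option names follow the HH-suite `hhblits` command-line interface; the file names `query.fasta` and `query.a3m`, the database path, and the helper name `build_hhblits_command` are hypothetical placeholders rather than values from the disclosure.

```python
def build_hhblits_command(query_fasta, database, out_a3m):
    """Assemble an hhblits argument list with the relaxed filtering cutoffs."""
    return [
        "hhblits",
        "-i", query_fasta,        # query sequence (hypothetical file name)
        "-d", database,           # e.g. a local copy of uniclust30_2018_08
        "-oa3m", out_a3m,         # write the resulting alignment in A3M format
        "-n", "8",                # 8 search iterations
        "-e", "1e-20",            # e-value cutoff
        "-maxfilt", "100000000",  # relieve the prefilter hit limit
        "-neffmax", "20",         # relieve the alignment diversity cap
        "-nodiff",                # do not thin the output alignment by diversity
        "-realign_max", "10000000",
    ]


cmd = build_hhblits_command("query.fasta", "uniclust30_2018_08", "query.a3m")
print(" ".join(cmd))
```

The list form can be passed to `subprocess.run` on a machine where HH-suite and the database are installed; the sketch stops short of that so it has no external dependencies.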

To generate inputs for structure modeling with trRosetta™, we built multiple sequence alignments (MSAs) for each of the 7,573 selected sequences with hhblits using a more conservative e-value cutoff of 1e-50; the resulting MSAs were also complemented by hits from hmmsearch against uniref100 (release-2019_11) with a bit-score threshold of 115 (i.e., ~1 bit per position). After joining the above two sets of alignments and filtering them at 90% sequence identity and 75% coverage cutoffs, only sequences with more than 50 homologs in the corresponding MSAs were retained for modeling (2,005 sequences). The filtered MSAs, along with information on the top 25 putative structural homologs as identified by hhsearch against the PDB100 database of templates, were used as inputs to the template-aware version of trRosetta™ 47 to predict residue pair distances and orientations. Network predictions were then used to reconstruct full atom 3D structure models using a Rosetta™-based folding protocol described previously 22 .

3.2 Hallucination of idealized NTF2s

Seeking to idealize the native structure seeds, we reasoned that trRosetta™, a convolutional residual neural network, which predicts residue-residue orientations and distances from sequence, could serve as a key component in a protein idealizer. Previously, this network has been used to generate diverse proteins that resemble the “ideal” structures of de novo designed proteins by changing the protein sequence to optimize the contrast (KL-divergence) between the predicted geometry and that of randomly generated sequences 19 .

For our purpose, the desired fold-space is not diverse but instead focused on the NTF2-like topology. To guarantee generation of ideal structures within this fold-space, we implemented a new fold-specific loss function, which biased hallucinations based on observed geometries in native crystal structures. As many experimentally characterized NTF2s contain non-ideal regions, we began by creating a set of trimmed but ideal NTF2s by manually removing non-ideal structural elements such as kinked helices and long or rarely observed loops. For each seed structure, we then used a structure-based sequence alignment method (see 3.3) to find equivalent positions between the seed structure and the structures in this trimmed set. Residue pairs were considered to be in a conserved tertiary motif (TERM) if there were 5 or more equivalent positions in the trimmed set. Smoothed probability distributions based on the observed geometries in the trimmed set were then computed. For distances we used a Gaussian distribution with mean equal to the true distance, denoted by D, and standard deviation, denoted by σ, equal to 0.5 Å. The probability density function for distances d is given by:

f(d; D, σ) = 1/(σ√(2π)) exp[-(d - D)² / (2σ²)]

Using this density function one can construct a categorical distribution for binned distances by evaluating this function at the centers of the bins and then normalizing by the sum of all values in the different bins. Similarly, a von Mises distribution was used for omega angle smoothing, with probability density function given by f(ω; Ω, κ) = N(κ) exp[κ cos(ω - Ω)], where N(κ) is a normalizing constant, Ω is the crystal value, κ is the inverse variance chosen to be 100, and ω is the smoothed angle. For phi and theta angles a von Mises-Fisher blur is given by f(x; μ, κ) = N(κ) exp[κ μᵀx], where N(κ) is a normalizing constant, μ is a unit vector on a 3D sphere corresponding to the phi and theta angles from the crystal structure, x is a smoothed unit vector, and κ is the inverse variance chosen to be 100.
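The binning procedure described above can be sketched as follows: evaluate the density at each bin center, then normalize so the values sum to one. The normalizing constants cancel in that step, so they are omitted. The bin layouts (36 distance bins of 0.5 Å from 2 to 20 Å, 24 omega bins of 15 degrees) and the function names are assumptions made for illustration; the text gives only the bin counts, which presumably include an extra no-contact bin not modeled here.

```python
import math


def gaussian_bins(true_dist, centers, sigma=0.5):
    """Categorical distribution over distance bins from a Gaussian at true_dist."""
    dens = [math.exp(-0.5 * ((c - true_dist) / sigma) ** 2) for c in centers]
    total = sum(dens)
    return [d / total for d in dens]


def von_mises_bins(true_angle, centers, kappa=100.0):
    """Categorical distribution over angle bins (radians) from a von Mises density."""
    # Subtracting 1 inside the exponent only rescales every bin by exp(-kappa),
    # which cancels on normalization and avoids overflow for large kappa.
    dens = [math.exp(kappa * (math.cos(c - true_angle) - 1.0)) for c in centers]
    total = sum(dens)
    return [d / total for d in dens]


# Hypothetical bin centers (assumptions, not taken from the text).
dist_centers = [2.25 + 0.5 * k for k in range(36)]
omega_centers = [math.radians(-180.0 + 7.5 + 15.0 * k) for k in range(24)]

p_dist = gaussian_bins(6.3, dist_centers)
p_omega = von_mises_bins(math.radians(172.0), omega_centers)

print(max(range(36), key=lambda k: p_dist[k]))   # index of the most probable distance bin
print(max(range(24), key=lambda k: p_omega[k]))  # index of the most probable omega bin
```

Because the von Mises density depends on the cosine of the angular difference, probability mass near +180 degrees spills into the bins near -180 degrees, which is the periodic behavior a plain Gaussian on angles would miss.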

Next, we converted those probability distributions to energy landscapes (i.e., negative log likelihoods) and sought to minimize the expected energy. This soft restraint encouraged the network to seek out the consensus structure, while still allowing deviations where needed. Specifically, we formulated the fold-specific loss, L_fold, as this expected energy over the conserved residue pairs, where p is the network prediction and s is the smoothed probability distribution of the conserved residue pairs. For the second part of the loss function, L_hall, and similar to previous work 19 , we sought to maximize the Kullback-Leibler (KL) divergence between the predicted probability distribution and a background distribution for all i,j residue pairs not in a TERM, where b is the background distribution and N_x is the number of bins in each probability distribution (N_d = 37, N_ω,θ = 25, N_φ = 13). Briefly, b is calculated by a network of similar architecture to trRosetta™ trained on the same training data, except it is never given sequence information as an input. The final loss is given by:

L = L_fold + L_hall
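One plausible reading of the two loss terms can be sketched in a few lines. This is an illustration under stated assumptions, not the formulation used in the disclosure: the averaging over residue pairs, the small constant `EPS`, and the function names are all hypothetical. The fold term is the expected energy of a prediction p under the landscape -log(s); the hallucination term is the negative KL divergence between p and the background b, so that minimizing the sum maximizes the divergence.

```python
import math

EPS = 1e-9  # assumption: small constant guarding against log(0)


def expected_energy(p, s):
    """Expected energy of prediction p under the energy landscape -log(s)."""
    return -sum(pi * math.log(si + EPS) for pi, si in zip(p, s))


def kl_divergence(p, b):
    """KL(p || b) between predicted and background categorical distributions."""
    return sum(pi * math.log((pi + EPS) / (bi + EPS)) for pi, bi in zip(p, b) if pi > 0)


def total_loss(term_pairs, free_pairs):
    """term_pairs: (p, s) for pairs in a TERM; free_pairs: (p, b) for all other pairs.

    Averaging over pairs is an assumption made for this sketch.
    """
    l_fold = sum(expected_energy(p, s) for p, s in term_pairs) / max(len(term_pairs), 1)
    # Maximizing the KL divergence is expressed as minimizing its negative.
    l_hall = -sum(kl_divergence(p, b) for p, b in free_pairs) / max(len(free_pairs), 1)
    return l_fold + l_hall


s = [0.1, 0.8, 0.1]          # smoothed target for one conserved pair
b = [1 / 3, 1 / 3, 1 / 3]    # flat background for one unconstrained pair
p_good = [0.05, 0.9, 0.05]   # confident prediction that agrees with s
p_bad = [0.9, 0.05, 0.05]    # confident prediction that disagrees with s

print(total_loss([(p_good, s)], [(p_good, b)]))
print(total_loss([(p_bad, s)], [(p_bad, b)]))
```

In this toy case both predictions are equally sharp relative to the background, so the difference between the two totals comes entirely from the fold term.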

We used a Markov Chain Monte Carlo (MCMC) procedure to search for sequences that trRosetta™ predicted to fold into structures that minimize this loss function. We allowed four types of moves with different sampling probabilities: mutations (p=0.55), insertions (p=0.15), deletions (p=0.15), and moving segments (p=0.15). Mutations randomly changed one amino acid to another, with an equal transition probability for all 20 amino acids. Insertions inserted a new amino acid (all equally likely) into a random location subject to the KL-divergence loss. Deletions deleted a random residue from the same locations. Finally, we also allowed “segments” to move, cutting and pasting themselves from one part of the sequence to another, while maintaining the same overall segment order. Here, a “segment” is a continuous stretch of amino acids all subject to the fold-specific loss, often composed of a single strand or helix. Starting from a random sequence of an initial length (typically 120 amino acids), we used the standard Metropolis criterion to accept or reject moves:

A_i = min[1, exp(-(L_i - L_{i-1}) / T)]

where A_i is the chance of accepting the move at step i, L_i is the loss at the current step, L_{i-1} is the loss at the previous step, and T is the temperature. The temperature started at 0.2 and was reduced by half every 5k steps. Generally, it took 30k steps to converge.
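The search loop can be sketched as a small simulated-annealing routine. The move probabilities, starting temperature, halving schedule, and initial length come from the text; everything else is a simplification for illustration. In particular, `toy_loss` is a hypothetical stand-in for the network-derived loss, insertions and deletions are allowed anywhere rather than only in KL-loss regions, and the segment move cuts and pastes a fixed-length window instead of a defined strand or helix.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MOVES = ["mutate", "insert", "delete", "move_segment"]
MOVE_PROBS = [0.55, 0.15, 0.15, 0.15]


def propose(seq, rng, seg_len=5):
    """Apply one randomly chosen move and return the new sequence."""
    move = rng.choices(MOVES, weights=MOVE_PROBS)[0]
    s = list(seq)
    if move == "mutate":
        s[rng.randrange(len(s))] = rng.choice(AMINO_ACIDS)
    elif move == "insert":
        s.insert(rng.randrange(len(s) + 1), rng.choice(AMINO_ACIDS))
    elif move == "delete" and len(s) > 1:
        del s[rng.randrange(len(s))]
    elif move == "move_segment" and len(s) > seg_len:
        start = rng.randrange(len(s) - seg_len + 1)
        segment = s[start:start + seg_len]
        del s[start:start + seg_len]
        pos = rng.randrange(len(s) + 1)
        s[pos:pos] = segment
    return "".join(s)


def temperature(step, t0=0.2, halve_every=5000):
    """Temperature starts at t0 and is halved every halve_every steps."""
    return t0 * 0.5 ** (step // halve_every)


def accept_prob(loss_new, loss_old, temp):
    """Metropolis acceptance probability for a proposed move."""
    if loss_new <= loss_old:
        return 1.0
    return math.exp(-(loss_new - loss_old) / temp)


def anneal(loss_fn, length=120, steps=30000, seed=0):
    """Run the annealed Metropolis search from a random starting sequence."""
    rng = random.Random(seed)
    seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    loss = loss_fn(seq)
    for step in range(steps):
        candidate = propose(seq, rng)
        candidate_loss = loss_fn(candidate)
        if rng.random() < accept_prob(candidate_loss, loss, temperature(step)):
            seq, loss = candidate, candidate_loss
    return seq, loss


def toy_loss(seq):
    """Hypothetical stand-in loss: non-alanine count plus deviation from length 120."""
    return sum(c != "A" for c in seq) + abs(len(seq) - 120)


best_seq, best_loss = anneal(toy_loss, steps=2000)
print(len(best_seq), best_loss)
```

Swapping `toy_loss` for a function that scores a sequence with a structure-prediction network would recover the overall shape of the procedure described above, at a much higher cost per step.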

3.3 Structure-conditioned multiple sequence alignment

Given the complexity of the NTF2-like protein fold, we hypothesized that it was necessary to impose sequence design rules to disfavor alternative states (negative design). Towards this end, we computed a structure-conditioned multiple sequence alignment based on native NTF2-like proteins. Specifically, we used TMalign 48 to superimpose each of the 2005 predicted native structures (from 3.1) onto each hallucinated backbone (from 3.2). Next, to find structurally corresponding positions, we implemented a structure-based dynamic programming algorithm, similar to the Needleman-Wunsch algorithm 49 . However, instead of using amino acid similarity as the scoring metric, we used a tunable structure-based score function. After aligning the two structures, we scored the structural similarity of any two residues by empirically weighting several metrics: (1) the distance between Cα atoms, (2) the differences between backbone torsion angles (phi and psi), and (3) the angle (degrees) between the vectors pointing from Cα to Cβ in each residue. To calculate the unweighted score for each component, we normalized each by a maximum possible value (180 degrees for angles and 10 Å for distances) and included a “set point” that approximately delineated when we judged a metric to indicate two residues to be more similar than not. Metric values below the set point give positive scores, indicating that two residues are similar, and values above the set point give negative scores, indicating that they are dissimilar.

Score_unweighted = (set_point - value) / max_value

Each value was scaled by its normalized weight and summed to give an overall similarity score between any two amino acids. These similarity scores were used as the similarity metric in our dynamic programming algorithm, in place of the typical BLOSUM62 similarity metric. We used a gap penalty of 0.1 and an extension penalty of 0.0. Finally, after concatenating all the structure-conditioned aligned sequences, we used PSI-BLAST-exB 50,51 to compute sequence-redundancy-weighted log-odds scores for each amino acid at each position (position-specific scoring matrices, PSSMs).
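The scoring and alignment steps can be illustrated with a short Python sketch. The maximum values (10 Å, 180 degrees) and the 0.1 gap penalty come from the text; the set points and weights are hypothetical placeholders, since the tuned values are not given. As a simplification, the sketch charges a flat 0.1 per gap position rather than separating gap opening (0.1) from extension (0.0).

```python
# Maximum possible values as stated in the text.
MAX_VALUE = {"ca_dist": 10.0, "torsion": 180.0, "cb_angle": 180.0}
# Set points and weights are assumed for illustration only.
SET_POINT = {"ca_dist": 3.0, "torsion": 45.0, "cb_angle": 45.0}
WEIGHT = {"ca_dist": 0.5, "torsion": 0.25, "cb_angle": 0.25}


def unweighted_score(metric, value):
    """(set_point - value) / max_value: positive when value is below the set point."""
    return (SET_POINT[metric] - value) / MAX_VALUE[metric]


def similarity(metrics):
    """Weighted sum of the per-metric scores for one residue pair."""
    return sum(WEIGHT[m] * unweighted_score(m, v) for m, v in metrics.items())


def align(score, gap=0.1):
    """Needleman-Wunsch global alignment over a precomputed score matrix.

    score[i][j] is the similarity of residue i in structure A and residue j
    in structure B. Returns the optimal total score and the aligned (i, j) pairs.
    """
    n, m = len(score), len(score[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = -gap * i
    for j in range(1, m + 1):
        best[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + score[i - 1][j - 1],
                             best[i - 1][j] - gap,
                             best[i][j - 1] - gap)
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif best[i][j] == best[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


# Two residues in A, three in B; the middle residue of B has no good partner.
example = [[1.0, -1.0, -1.0],
           [-1.0, -1.0, 1.0]]
total, aligned = align(example)
print(total, aligned)
```

In practice each entry of the score matrix would come from `similarity` applied to the three metrics measured on the superimposed structures.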

3.4 Design

To design the resulting backbones, we sought, in addition to the sequence patterns captured in the PSSM (3.3), to further specify the backbone conformation and functionalize the pocket, by installing entire hydrogen bonding networks from native NTF2-like proteins. We compiled two sets of hydrogen bonding networks: a set for the cavity containing 85 networks and another set of networks connecting the C-terminal region of the first helix with the third beta-strand containing 25 networks. In 20 independent attempts for each backbone, we randomly grafted a network from each set, fixed the identities of hydrogen bonding residues, and designed the sequences for all other positions under PSSM constraints. The resulting models were filtered for various backbone quality metrics and for maintenance of hydrogen bonding networks in the absence of constraints, resulting in a total of 1615 idealized scaffolds.

4 RIFdock tuning files

The hierarchical search framework of RifDock is a powerful way to search through 6-dimensional rigid body orientations. While originally designed to work with physics-based force fields, the scoring machinery can easily be modified for other purposes. A system called “Tuning Files” was added that allows one to tune the energetics of RifDock by “requiring” specific interactions. Specified interactions can range from specific hydrogen bonds, to specific bidentates, and even to specific hydrophobic interactions. During the RifGen stage, each stored rotamer is compared against a list of definitions in the Tuning File. If the rotamer satisfies a definition, it is stored in the RIF with a “Requirement Number”. Later, during RifDock, these Requirement Numbers are available during scoring, and the presence or absence of certain rotameric interactions may be used to penalize or even completely discard dock solutions. In this work, the Tuning Files were used to require the specific hydrogen bond interactions between the arginine and the secondary amine in the pyrazine ring of the coelenterazine-like substrate.

5 Designing theozyme architectures into de novo NTF2 scaffolds

De novo design of luciferases can be divided into three main steps: scaffold construction, substrate placement with required interactions, and sequence design. With the idealized NTF2-like scaffolds in hand, we selected 5 diverse rotamers from AIMNet and used the Rotamer Interaction Field (RIF) docking method 31 to exhaustively search a large space of side chains interacting with the anionic form of DTZ. Chemically, deprotonation of the N1 hydrogen is the first step to forming an anionic species (Fig. 5). We first generated a RIF using RifGen 31 to guide placement in the protein scaffolds. We used a tuning file (see below) to require the placement of a positively charged arginine sidechain that stabilizes the negative charge forming on the N1 atom, where deprotonation initially occurs, and enumerated large numbers of possible sidechain interactions with the rest of DTZ.

HBOND_DEFINITION

N1 1 ARG

END_HBOND_DEFINITION

REQUIREMENT_DEFINITION

1 HBOND N1 1

END_REQUIREMENT_DEFINITION

RifDock was then used to hierarchically search for the best combination of RIF rotamers to place on the input backbone. Although the negative charge can move to another electronegative atom, O1, via resonance of the imidazopyrazinone core, it is unclear which anionic species is more critical for the luciferase-catalyzed luminescence emission. Thus, we let RifDock place the polar rotamers on the basis of hydrogen-bond geometry to O1, and apolar rotamers to DTZ, without specific requirements. In the next docking step, we supplied the -scaffold_res argument with a list of residue numbers for the scaffold backbone positions that were annotated as pocket residues, to allow a hierarchical search of RIF placement. We allowed RIF placements only at the pocket residues and left pre-defined hydrogen bond networks (HBNets) intact. After RifDock, we continued with Rosetta™ sequence design, in which the score function was reweighted for a higher buried_unsat_penalty 52 and the amino acid selection was biased by supplying a pre-generated PSSM file via the SeqprofConsensus task operation. This was intended to minimize buried unsatisfied residues and increase pre-organized architectures in the core, which are known to be beneficial for a catalytic pocket 53 . Two rounds of FastDesign calculation were included. In the first round, we restricted the RIF rotamers and core HBNets to repacking while allowing other residues to be redesigned based on the PSSM during the Monte Carlo simulated annealing procedure. After the surrounding residues were optimized to retain the RIF interactions, we allowed redesign of the RIF rotamers to find efficient aromatic and hydrophobic packing around DTZ, while the catalytic residues (the N1 requirement) were still limited to repacking only. The final set of designs was obtained after filtering by ligand-binding interface energy, shape complementarity, contact molecular surface, number of HbondsToResidue, and the presence of N1_hbond.

6. Structure prediction of LuxSit with AlphaFold2 and comparison to design model

To computationally assess the accuracy of the LuxSit design model, we performed single sequence structure prediction using AlphaFold2. All models were run with 12 recycles and generated models were relaxed using AMBER 54 . The model with the highest pLDDT was used for comparison to the Rosetta™ design model and structural superpositions were performed using the Theseus alignment tool to determine backbone RMSD between the design model and AlphaFold2 model 33 .

7. Computational design and characterization of de novo luciferases for h-CTZ

To customize a shape-complementary catalytic pocket that can accommodate, and catalyze chemiluminescence of, another structurally distinct luciferin substrate (2-deoxycoelenterazine, h-CTZ), we sought to use a more diverse set of scaffolds. We first used a deep-learning-based protein sequence design method, ProteinMPNN 34 , to redesign the whole sequences of the hallucinated NTF2 scaffolds described in 3.2 and of the de novo NTF2-like superfamily reported previously 23 . Next, the protein structures of all resulting ProteinMPNN sequences were predicted by AlphaFold2 33 . Keeping only predictions with a pLDDT score greater than 92 yielded 6234 scaffolds with diverse pocket geometries. With these scaffolds in hand, we selected three different h-CTZ conformers and used the RIFdock design strategy described above to search for sidechain rotamer placements in these scaffolds. Since we had learned from the LuxSit design that the N1-Arg and O1-His interactions are critical for catalyzing luminescence emission, both interactions were set as requirements in a tuning file (see an example below) to ensure that all RifDock outputs have N1-Arg and O1-His interactions.

HBOND_DEFINITION

N1 1 ARG

O1 1 HIS

END_HBOND_DEFINITION

REQUIREMENT_DEFINITION

1 HBOND N1 1

2 HBOND O1 1

END_REQUIREMENT_DEFINITION
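Both tuning files shown in this document follow the same pattern, so they can be generated programmatically. The Python sketch below writes that text layout using the underscore spelling of the keywords. `build_tuning_file` is a hypothetical helper, and the layout is inferred only from the two examples in this document, not from the RifDock documentation.

```python
def build_tuning_file(hbonds):
    """Return tuning-file text for a list of (atom_name, residue_type) hydrogen bonds.

    Each atom gets one line in the hydrogen-bond definition block and one
    numbered requirement, mirroring the examples in the text. The literal "1"
    fields are copied verbatim from those examples; their meaning is not
    specified here.
    """
    lines = ["HBOND_DEFINITION"]
    for atom, residue in hbonds:
        lines.append(f"{atom} 1 {residue}")
    lines.append("END_HBOND_DEFINITION")
    lines.append("REQUIREMENT_DEFINITION")
    for number, (atom, _residue) in enumerate(hbonds, start=1):
        lines.append(f"{number} HBOND {atom} 1")
    lines.append("END_REQUIREMENT_DEFINITION")
    return "\n".join(lines) + "\n"


# The single-requirement file used for DTZ and the two-requirement file for h-CTZ.
dtz_file = build_tuning_file([("N1", "ARG")])
hctz_file = build_tuning_file([("N1", "ARG"), ("O1", "HIS")])
print(hctz_file)
```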

At this stage, we generated ~215k RIFdock outputs, in which we subsequently fixed the N1-Arg and O1-His interactions (by applying atom constraints) and allowed redesign of all residues within 4 Å of the ligand. The resulting designs were prefiltered by contact molecular surface (>350), Rosetta™ ddG (<-50), and the presence of N1-Arg and O1-His interactions. The prefiltered sequences (~30k) were then optimized while all amino acid identities within 4 Å of the ligand were kept fixed. All ProteinMPNN sequences were predicted by AlphaFold2 to obtain predicted 3D protein models, which we evaluated by pLDDT score (>85), Cα RMSD (<1.2 Å) to the corresponding Rosetta™ model, and the numbers of hydrogen bonds to both the hypothetical catalytic Arg (>2) and His (>1) residues (preorganization). Finally, 46 sequences passed the filters, and we ordered them as eBlocks gene fragments for experimental characterization.
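The two filtering stages described above reduce to a pair of threshold checks, sketched here in Python. The thresholds are taken from the text; the `Design` record and its field names are hypothetical, and the example values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Design:
    name: str
    contact_molecular_surface: float
    ddg: float                      # Rosetta ddG
    has_n1_arg: bool
    has_o1_his: bool
    plddt: float = 0.0              # AlphaFold2 confidence
    ca_rmsd: float = float("inf")   # Cα RMSD to the Rosetta model, in Å
    hbonds_to_arg: int = 0
    hbonds_to_his: int = 0


def passes_prefilter(d):
    """Stage 1: contact molecular surface, ddG, and both required interactions."""
    return (d.contact_molecular_surface > 350 and d.ddg < -50
            and d.has_n1_arg and d.has_o1_his)


def passes_final_filter(d):
    """Stage 2: AlphaFold2 agreement and preorganization of the catalytic residues."""
    return (d.plddt > 85 and d.ca_rmsd < 1.2
            and d.hbonds_to_arg > 2 and d.hbonds_to_his > 1)


designs = [
    Design("d1", 400.0, -60.0, True, True, 90.0, 0.8, 3, 2),
    Design("d2", 300.0, -60.0, True, True, 95.0, 0.5, 4, 3),   # fails surface cutoff
    Design("d3", 420.0, -55.0, True, True, 88.0, 1.5, 3, 2),   # fails RMSD cutoff
]
selected = [d.name for d in designs if passes_prefilter(d) and passes_final_filter(d)]
print(selected)
```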

Each synthetic gene was inserted into a modified pET29b vector between two BsaI sites (Golden Gate assembly) and transformed into BL21 competent E. coli. The cells were inoculated in LB and grown in a 96-deep-well plate with 1 mM IPTG at 37 °C for 16 h. For the luciferase activity screening, the cells were harvested by centrifugation at 500 × g for 5 min and the pellets were lysed with the BugBuster reagent. Cell lysates were collected by centrifugation at 4000 × g for 20 min. The His-tagged proteins were captured from cell lysates by nickel magnetic beads, and bound proteins were eluted in elution buffer (20 mM Tris-HCl pH 8.0, 300 mM NaCl, 300 mM imidazole). The protein concentrations were determined by Bradford assay, and the activity of each luciferase was evaluated individually in the presence of 1 μM purified protein and 25 μM h-CTZ in PBS. Through this process, we identified two designs (HTZ3-D2 and HTZ3-G4) that showed luciferase activity and substrate selectivity for h-CTZ. We scaled up the protein expression by the general procedure described above, and the purified proteins were used for the characterization shown in Fig. 8. Serially diluted h-CTZ was mixed with 500 nM HTZ3-D2 or HTZ3-G4 in PBS, and the concentration-dependent luminescence was recorded for 30 min (0.1 s integration, with measurements taken every 1 min). All data points were plotted as the average light output over the first 10 min and fitted to the Michaelis-Menten equation.

8. Computational SSM experiment to estimate mutation binding free energy

The Rosetta™ cartesian ddG application 55,56 was used to computationally estimate the enzyme-substrate binding free energy. The LuxSit design model was first relaxed in Cartesian space with the substrate bound. For the 21 positions that were experimentally screened for single-mutation effects on luciferase activity, each residue was computationally mutated to the other amino acid types, and packing and Cartesian relaxation were performed to evaluate the final score in REU. This procedure was applied three times in parallel for both the substrate-bound and apo states. The average of the three calculations was used to calculate the relative binding free energy (ddG_bind) by subtracting the total score of the apo state from that of the complex state.
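The arithmetic of this estimate is simple enough to show directly. In the Python sketch below, the averaging over three replicates and the complex-minus-apo subtraction follow the text; taking the wild type as the reference for the "relative" value is an assumption, and all scores are made-up numbers in REU.

```python
from statistics import mean


def binding_score(complex_scores, apo_scores):
    """Mean total score of the complex minus mean total score of the apo state (REU)."""
    return mean(complex_scores) - mean(apo_scores)


def ddg_bind(mutant, wild_type):
    """Binding score of a mutant relative to the wild type (assumed reference).

    Each argument is a (complex_scores, apo_scores) pair of three replicates.
    """
    return binding_score(*mutant) - binding_score(*wild_type)


# Hypothetical replicate scores for illustration only.
wild_type = ([-310.0, -312.0, -311.0], [-300.0, -301.0, -299.0])
mutant = ([-305.0, -306.0, -304.0], [-300.0, -300.0, -300.0])
print(ddg_bind(mutant, wild_type))
```

A positive value in this convention means the mutant is predicted to bind the substrate less favorably than the reference.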

9. Construction and screening of designed luciferase libraries

The construction of assembled gene libraries was described previously in detail 57 . In brief, the amino acid sequences of all designed luciferases were first reverse-translated into E. coli codon-optimized DNA sequences. All DNA sequences were categorized into multiple sub-pools by gene length (~500 designs per sub-pool). Each gene was subsequently split into two fragments (fragment A and fragment B), and outer and inner primer sequences were added to the 5’ and 3’ ends (e.g., Outer_oligoA_5primer + design_half_A + Inner_oligoA_3primer and Inner_oligoB_5primer + design_half_B + Outer_oligoB_3primer). All oligos were ordered in one Twist 250 nt Oligo Pool. To construct the library of each sub-pool, polymerase chain reaction (PCR) with oligoA_5primer/oligoA_3primer or oligoB_5primer/oligoB_3primer oligonucleotide pairs was used to amplify the individual fragment A or fragment B from each sub-pool. The pool-specific sequences were removed with Uracil-Specific Excision Reagent (USER), followed by the NEB End Repair kit. Outer primers (oligoA_5primer and oligoB_3primer) were then used for fragment A and fragment B assembly and amplification. The assembled full-length fragment was digested with XhoI/HindIII and ligated into a predigested pBAD/His B vector. All ligation products were used to transform ElectroMAX™ DH10B cells, which were then plated on 150 mm x 15 mm LB agar plates supplemented with carbenicillin and L-arabinose. We sequenced 30 random colonies, and 11 of the sequences were in our designed library. The plates (~2000 colonies per plate) were incubated at 37 °C overnight to form bacterial colonies and left at 4 °C for another 24 h. To directly image luminescence activity from bacterial colonies, we sprayed a PBS solution containing 30 μM DTZ onto each agar plate, waited for 2 min, and acquired and processed the luminescence images with a Bio-Rad ChemiDoc XRS+. After screening 15 plates, active colonies were collected for sequencing, protein expression, and other downstream characterization; LuxSit was selected from three active designs that showed catalytic signal above background.

10. Construction and evaluation of LuxSit site saturation mutagenesis libraries

To create libraries of each single amino acid substitution at residues 13, 14, 17, 18, 35, 37, 38, 49, 52, 53, 56, 60, 65, 81, 83, 94, 96, 98, 100, 110, and 112, a mixture of forward oligos with degenerate codons (NDT, VHG, and TGG in a 1:1:0.1 ratio) and an overlapping reverse oligo were used to amplify the LuxSit plasmid. The resulting PCR products were circularized by the Gibson Assembly protocol and were subsequently used to transform ElectroMAX™ DH10B cells. The cells were plated on 150 mm x 15 mm LB agar plates supplemented with carbenicillin and L-arabinose, incubated at 37 °C overnight, and left at 4 °C for another 24 h. As described for the screening of luciferase libraries, colony-based screening by spraying DTZ solution was used to identify active colonies. Inactive colonies were also randomly picked. As a result, a total of 32 colonies were picked for each residue library. The 32 x 21 individual colonies were grown in 1 mL of TB supplemented with carbenicillin and L-arabinose in 96-well deep-well culture plates. The plates were shaken at 37 °C overnight (~16-18 h) on 96-well plate shakers at 1,100 rpm. Cells were pelleted by centrifugation at 4,000 × g for 15 min in a tabletop centrifuge. Media was discarded and the cell pellets were resuspended in 0.2 mL BugBuster HT Protein Extraction buffer. The plates were transferred back to 96-well plate shakers and incubated at 1,100 rpm for an additional 30 min. Cellular debris was pelleted again by centrifugation at 4,000 × g for 15 min, and soluble lysates were transferred to a new semi-deep 96-well plate and incubated with 10 μL of magnetic Ni-NTA beads for 30 min to allow binding. The magnetic extractor was used first to transfer the beads from the binding plates to wash plates with 200 μL IMAC wash buffer in each well, and then to transfer the beads to elution plates containing 30 μL IMAC elution buffer in each well. The concentrations of all proteins in each well were determined directly by the Bradford assay.
The elution solution in each well was used to make a 25 μL protein solution at the indicated concentration, which was mixed with 25 μL of 50 μM DTZ in PBS. The luminescence signals were acquired over the course of 15 min, while the actual point mutation was identified by sequencing. Thus, the mutation-to-activity relationship could be mapped. To evaluate whether these beneficial mutations are synergistic, we ordered individual mutants with combinatorial mutations at residues 14, 60, 96, 98, and 110 (see Table 4), and expressed and purified these LuxSit variants for kinetic, emission spectra, and luminescence intensity measurements. We identified four mutants that can produce 47- to 77-fold more photons than the parent LuxSit. We named one of these LuxSit-f (A96M/M110V) for its strong initial flash emission. Since the mutations at residues 96 and 110 are robust and mutations at residue 60 are versatile, we generated a fully randomized library at positions 60, 96, and 110 to exhaustively screen all possible combinations. After the colony-based screening, we identified many colonies with strong luciferase activity with DTZ (Fig. 9). Among all selected mutants, Arg60 is confirmed to be mutable, Ala96 prefers larger hydrophobic sidechains (Leu, Ile, Met, and Cys), and Met110 favors hydrophobic residues (Val, Ile, and Ala). A newly discovered mutant, R60S/A96L/M110V, with more than 100-fold higher photon flux than LuxSit, was named LuxSit-i for its high brightness.
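The choice of the NDT, VHG, and TGG codon mixture can be checked by expanding the degenerate codons with the standard genetic code. The Python sketch below does this and estimates the amino acid frequencies implied by the 1:1:0.1 mixing ratio, assuming that every base allowed at a degenerate position is incorporated with equal probability. The function names are illustrative.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# IUPAC nucleotide codes needed for NDT, VHG, and TGG.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "ACGT", "D": "AGT", "V": "ACG", "H": "ACT"}

# Oligo mixing ratio from the text.
MIX = {"NDT": 1.0, "VHG": 1.0, "TGG": 0.1}


def expand(degenerate_codon):
    """List every concrete codon encoded by a degenerate codon."""
    return ["".join(c) for c in product(*(IUPAC[b] for b in degenerate_codon))]


def amino_acid_frequencies(mix):
    """Expected amino acid frequencies, assuming equal base usage within each oligo."""
    total = sum(mix.values())
    freqs = {}
    for degenerate, weight in mix.items():
        codons = expand(degenerate)
        for codon in codons:
            aa = CODON_TABLE[codon]
            freqs[aa] = freqs.get(aa, 0.0) + weight / total / len(codons)
    return freqs


freqs = amino_acid_frequencies(MIX)
n_codons = sum(len(expand(d)) for d in MIX)
print(n_codons, "codons,", len(freqs), "amino acids, stop codon present:", "*" in freqs)
```

Together the three oligos encode 22 codons that cover all 20 amino acids without any stop codon, which is why a library of this kind can be screened with relatively few colonies per position.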

11. In vitro characterization of photoluminescence properties

For Michaelis-Menten kinetics measurements, 25 μL of serially diluted DTZ substrate in Tris pH 8.0 buffer was added to the wells of a white 96-well half-area microplate containing 25 μL of purified luciferase (final enzyme concentration: 100 nM; substrate concentration: 0.78 to 50 μM). Measurements were taken every 1 min (0.1 s integration and 10 s shaking between each interval) for a total of 20 min. Initial velocities were estimated as the average of the light intensities from the first three data points and fitted to the Michaelis-Menten equation. All relative light unit (RLU) per second values were converted to photon/s by the luminol-H2O2-HRP calibration method 45 . In the equation I_max = LQY x k_cat x [E], I_max is the maximal photon flux (photon s^-1), [E] is the total enzyme concentration, and V_max is the maximum photon flux per molecule (photon s^-1 molecule^-1) obtained from fitting the Michaelis-Menten equation. To determine the luminescent quantum yields, 25 μL of 5 μM individual substrate in PBS was injected into 25 μL of PBS containing 100 nM of the corresponding luciferase. DTZ was used for all LuxSit variants, while CTZ was used as the substrate of native RLuc. The luminescence signals were monitored until the reactions were complete (0.1 s integration, with measurements taken every 5 s for a total of 40 min). The sum of luminescence photon counts was normalized to the total photon counts of the RLuc/CTZ pair (LQY = 5.3 ± 0.1%) 58 to derive the relative luminescent quantum yields of the LuxSit variants (Fig. 10c). k_cat values for each individual enzyme were calculated using the equation k_cat = V_max / LQY.
To record emission spectra, 25 μL of 50 μM DTZ in PBS was injected into 25 μL of 200 nM pure luciferase, and the emission spectra were collected with 0.1 s integration and 2 nm increments from 300 to 700 nm. In vitro luminescence activity measurements of LuxSit-i-expressing HEK293T or HeLa cells were performed similarly, except that 15,000 intact cells or lysates were used in the assay instead of purified luciferases. To evaluate the substrate specificity, 25 μL of 50 μM substrate analogs in PBS was added to 25 μL of 200 nM of the indicated luciferase, and the signals were recorded over 20 min. Data are shown as the total luminescence signal over the first 10 min. We normalized the data by setting the highest-emitting substrate to 100%.
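The kinetic analysis above can be sketched with the Python standard library alone. The text does not state how the Michaelis-Menten equation was fitted, so the sketch swaps in a Hanes-Woolf linearization ([S]/v against [S]) solved by ordinary least squares in place of a nonlinear fit. The two-fold dilution series is inferred from the stated 0.78 to 50 μM range, and the velocities and quantum yield in the example are synthetic.

```python
def initial_velocity(trace):
    """Average of the first three readings of a luminescence time course."""
    return sum(trace[:3]) / 3


def fit_michaelis_menten(substrate, velocity):
    """Hanes-Woolf fit: [S]/v = [S]/Vmax + Km/Vmax, by ordinary least squares.

    Returns (vmax, km) in the units of `velocity` and `substrate`.
    """
    x = list(substrate)
    y = [s / v for s, v in zip(substrate, velocity)]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    vmax = 1.0 / slope
    km = intercept * vmax
    return vmax, km


def kcat_from_vmax(vmax_per_molecule, lqy):
    """k_cat = V_max / LQY, with V_max in photon s^-1 molecule^-1."""
    return vmax_per_molecule / lqy


# Assumed two-fold serial dilution covering roughly 0.78 to 50 (μM).
substrate = [50.0 / 2 ** k for k in range(7)]
# Synthetic, noise-free velocities generated with Vmax = 2.0 and Km = 5.0.
velocity = [2.0 * s / (5.0 + s) for s in substrate]

vmax, km = fit_michaelis_menten(substrate, velocity)
print(round(vmax, 6), round(km, 6))
print(kcat_from_vmax(vmax, lqy=0.05))
```

With real, noisy data a weighted or nonlinear fit is usually preferable, because the linearization distorts the error structure at low substrate concentrations.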

12. Circular dichroism (CD)

Purified protein samples were prepared at 15 μM in 10 mM phosphate buffer, pH 7.4. Spectra from 190 nm to 260 nm were recorded at 25 °C, 50 °C, 75 °C, 95 °C, and after cooling back to 25 °C. Thermal denaturation was monitored at 220 nm from 25 °C to 95 °C (1 °C per min increments). Tm values were not reported because the melting curves showed no obvious inflection points.

13. Mammalian cell culture and transfection

HEK293T and HeLa cell lines were maintained at 37 °C in a humidified 5% CO2 atmosphere and cultured in Dulbecco's Modified Eagle's Medium (DMEM, GIBCO) supplemented with 10% fetal bovine serum (FBS, Sigma). Cells were transfected with 500 μg of plasmid DNA using Turbofectin 8.0 (Origene). After 24 h at 37 °C in a CO2 incubator, the medium was removed, and cells were collected and resuspended in Dulbecco’s phosphate-buffered saline (DPBS).

14. Fluorescence Microscopy and image analysis

Cells were washed twice with HBSS and subsequently imaged in HBSS in the dark at 37 °C. Right before imaging, cells were incubated with 25 μM DTZ. Epifluorescence imaging was conducted on a Yokogawa CSU-X1 microscope equipped with a Hamamatsu ORCA-Fusion scientific CMOS camera and a Lumencor Celesta light engine. The objectives used were 10x, NA 0.45, WD 4.0 mm; 20x, NA 1.4, WD 0.13 mm; and 40x, NA 0.95, WD 0.17-0.25 mm with a correction collar for cover glass thickness (0.11 mm to 0.23 mm) (Plan Apochromat Lambda). Imaging for BFP utilized a 408 nm laser, a 432/36 nm dichroic, and a 440/40 nm emission filter (Semrock). Exposure times were 200 ms for BFP and 10 s for luminescence. All epifluorescence experiments were subsequently analyzed using NIS Elements software.

15. Multiplex dual-luciferase reporter assay for the cAMP/PKA and NF-κB pathways

HEK293T cells were grown in a tissue-culture-grade white 96-well plate and transfected with the indicated CRE-RLuc, NFκB-LuxSit-i, and CMV-CyOFP plasmids. At 24 h after transfection, the medium was replaced by 2 μM forskolin (FSK) or 300 ng/mL human tumor necrosis factor alpha (TNFα) in regular cell media. At 23 h after stimulation, the cells were resuspended in DPBS by pipette mixing. To make cell lysates, 25 μL of DPBS containing 30,000 intact cells was mixed with 25 μL of CelLytic M for 15 min. For the intact cell assay, 25 μL of DPBS containing 15,000 intact cells was mixed with 25 μL of PP-CTZ (2 μM) and/or DTZ (10 μM) in DPBS. For the cell lysate assay, 25 μL of cell lysate was added to 25 μL of PP-CTZ (2 μM) and/or DTZ (10 μM) to initiate the luminescence reactions. The signals were recorded every 1 min for a total of 10 min. The light signals were collected without filters in the substrate-resolved mode and with 528/20 and 390/35 filters in the spectrally resolved mode.
Area scanning the fluorescence intensity of CyOFP at 480 nm (excitation wavelength) and 580 nm (emission wavelength) was used to estimate the total cell numbers and transfection efficiency. The reported unit was the average of the first 10 min luminescence (RLU) over the relative fluorescence units (a.u.). To derive fold-of-activation, all values were normalized to the corresponding non-stimulated control.
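The normalization described for the reporter assay can be written out as a short calculation. In the Python sketch below, the function names and all numbers are illustrative; only the order of operations (average luminescence over the recorded window, divide by CyOFP fluorescence, then normalize to the non-stimulated control) follows the text.

```python
def normalized_signal(luminescence_trace, fluorescence):
    """Mean luminescence (RLU) over the recorded window divided by fluorescence (a.u.)."""
    return (sum(luminescence_trace) / len(luminescence_trace)) / fluorescence


def fold_of_activation(stimulated, control):
    """Normalized signal of a stimulated well relative to the non-stimulated control."""
    return stimulated / control


# Hypothetical 10-point traces (one reading per minute) and CyOFP readings.
control = normalized_signal([100.0] * 10, fluorescence=50.0)
stimulated = normalized_signal([900.0] * 10, fluorescence=75.0)
print(fold_of_activation(stimulated, control))
```

Dividing by the CyOFP fluorescence first keeps differences in cell number and transfection efficiency from being mistaken for pathway activation.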

Statistical analysis

No statistical methods were used to pre-determine the sample size. No sample was excluded from data analysis. Results were reproduced using different batches of pure proteins on different days. Unless otherwise indicated, data are shown as mean ± s.d., and error bars in figures represent s.d. of technical triplicate. Data were analyzed and plotted using GraphPad Prism 8, seaborn, and matplotlib.

Supplementary Information

Table 4. Enzymatic and photoluminescence properties of LuxSit and its mutants

Table 5. Exemplary amino acid and DNA sequences of de novo luciferases used in this work

The sequences (underlined and/or in parentheses) below contain a PolyHis-TEV or PolyHis tag for protein purification (which are optional and may be present or deleted, and when deleted are not considered in determining percent identity)

References

1. Jiang, L. et al. De novo computational design of retro-aldol enzymes. Science 319, 1387-1391 (2008).

2. Rothlisberger, D. et al. Kemp elimination catalysts by computational enzyme design. Nature 453, 190-195 (2008).

3. Yeh, H. W. et al. Red-shifted luciferase-luciferin pairs for enhanced bioluminescence imaging. Nat. Methods 14, 971-974 (2017).

4. Love, A. C. & Prescher, J. A. Seeing (and Using) the Light: Recent Developments in Bioluminescence Technology. Cell Chemical Biology 27, 904-920 (2020).

5. Syed, A. J. & Anderson, J. C. Applications of bioluminescence in biotechnology and beyond. Chem. Soc. Rev. 50, 5668-5705 (2021).

6. Yeh, H.-W. & Ai, H.-W. Development and Applications of Bioluminescent and Chemiluminescent Reporters and Biosensors. Annu. Rev. Anal. Chem. 12, 129-150 (2019).

7. Zambito, G., Chawda, C. & Mezzanotte, L. Emerging tools for bioluminescence imaging. Curr. Opin. Chem. Biol. 63, 86-94 (2021).

8. Markova, S. V., Larionova, M. D. & Vysotski, E. S. Shining Light on the Secreted Luciferases of Marine Copepods: Current Knowledge and Applications. Photochem. Photobiol. 95, 705-721 (2019).

9. Wu, N. et al. Solution structure of Gaussia Luciferase with five disulfide bonds and identification of a putative coelenterazine binding cavity by heteronuclear NMR. Sci. Rep. 10, (2020).

10. Jiang, T. Y., Du, L. P. & Li, M. Y. Lighting up bioluminescence with coelenterazine: strategies and applications. Photochem. Photobiol. Sci. 15, 466-480 (2016).

11. Shakhmin, A. et al. Coelenterazine analogues emit red-shifted bioluminescence with NanoLuc. Org. Biomol. Chem. 15, 8559-8567 (2017).

12. Michelini, E. et al. Spectral-resolved gene technology for multiplexed bioluminescence and high-content screening. Anal. Chem. 80, 260-267 (2008).

13. Rathbun, C. M. et al. Parallel screening for rapid identification of orthogonal bioluminescent tools. ACS Cent. Sci. 3, 1254-1261 (2017).

14. Yeh, H.-W., Wu, T., Chen, M. & Ai, H.-W. Identification of Factors Complicating Bioluminescence Imaging. Biochemistry 58, 1689-1697 (2019).

15. Su, Y. C. et al. Novel NanoLuc substrates enable bright two-population bioluminescence imaging in animals. Nat. Methods 17, 852-860 (2020).

16. Lombardi, A., Pirro, F., Maglio, O., Chino, M. & DeGrado, W. F. De Novo design of four-helix bundle metalloproteins: One scaffold, diverse reactivities. Acc. Chem. Res. 52, 1148-1159 (2019).

17. Chino, M. et al. Artificial diiron enzymes with a De Novo designed four-helix bundle structure. Eur. J. Inorg. Chem. 2015, 3352-3352 (2015).

18. Basler, S. et al. Efficient Lewis acid catalysis of an abiological reaction in a de novo protein scaffold. Nat. Chem. 13, 231-235 (2021).

19. Anishchenko, I. et al. De novo protein design by deep network hallucination. Nature (2021) doi:10.1038/s41586-021-04184-w.

20. Wang, J. et al. Scaffolding protein functional sites using deep learning. Science 377, 387-394 (2022).

21. Norn, C. et al. Protein sequence design by conformational landscape optimization. Proc. Natl. Acad. Sci. U. S. A. 118, (2021).

22. Yang, J. Y. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl. Acad. Sci. U. S. A. 117, 1496-1503 (2020).

23. Basanta, B. et al. An enumerative algorithm for de novo design of proteins with diverse pocket structures. Proc. Natl. Acad. Sci. U. S. A. 117, 22135-22145 (2020).

24. Loening, A. M., Fenn, T. D. & Gambhir, S. S. Crystal structures of the luciferase and green fluorescent protein from Renilla reniformis. J. Mol. Biol. 374, 1017-1028 (2007).

25. Tomabechi, Y. et al. Crystal structure of nanoKAZ: The mutated 19 kDa component of Oplophorus luciferase catalyzing the bioluminescent reaction with coelenterazine. Biochem. Biophys. Res. Commun. 470, 88-93 (2016).

26. Ding, B. W. & Liu, Y. J. Bioluminescence of Firefly Squid via Mechanism of Single Electron-Transfer Oxygenation and Charge-Transfer-Induced Luminescence. J. Am. Chem. Soc. 139, 1106-1119 (2017).

27. Isobe, H., Yamanaka, S., Kuramitsu, S. & Yamaguchi, K. Regulation mechanism of spin-orbit coupling in charge-transfer-induced luminescence of imidazopyrazinone derivatives. J. Am. Chem. Soc. 130, 132-149 (2008).

28. Kondo, H. et al. Substituent effects on the kinetics for the chemiluminescence reaction of 6-arylimidazo[1,2-a]pyrazin-3(7H)-ones (Cypridina luciferin analogues): support for the single electron transfer (SET)-oxygenation mechanism with triplet molecular oxygen. Tetrahedron Lett. 46, 7701-7704 (2005).

29. Branchini, B. R. et al. Experimental Support for a Single Electron-Transfer Oxidation Mechanism in Firefly Bioluminescence. J. Am. Chem. Soc. 137, 7592-7595 (2015).

30. Zubatyuk, R., Smith, J. S., Leszczynski, J. & Isayev, O. Accurate and transferable multitask prediction of chemical properties with an atoms-in-molecules neural network. Science Advances 5, (2019).

31. Dou, J. Y. et al. De novo design of a fluorescence-activating beta-barrel. Nature 561, 485-491 (2018).

32. Cao, L. et al. Design of protein-binding proteins from the target structure alone. Nature 605, 551-560 (2022).

33. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583-+ (2021).

34. Dauparas, J. et al. Robust deep learning based protein sequence design using ProteinMPNN. bioRxiv (2022) doi:10.1101/2022.06.03.494563.

35. Yeh, H.-W. et al. ATP-Independent Bioluminescent Reporter Variants To Improve in Vivo Imaging. ACS Chem. Biol. 14, 959-965 (2019).

36. Bhaumik, S. & Gambhir, S. S. Optical imaging of Renilla luciferase reporter gene expression in living mice. Proc. Natl. Acad. Sci. U. S. A. 99, 377-382 (2002).

37. Szent-Gyorgyi, C., Ballou, B. T., Dagnal, E. & Bryan, B. Cloning and characterization of new bioluminescent proteins. in Biomedical Imaging: Reporters, Dyes, and Instrumentation (eds. Bornhop, D. J., Contag, C. H. & Sevick-Muraca, E. M.) (SPIE, 1999). doi:10.1117/12.351015.

38. Hall, M. P. et al. Engineered luciferase reporter from a deep sea shrimp utilizing a novel imidazopyrazinone substrate. ACS Chem. Biol. 7, 1848-1857 (2012).

39. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871-+ (2021).

40. Giger, L. et al. Evolution of a designed retro-aldolase leads to complete active site remodeling. Nat. Chem. Biol. 9, 494-498 (2013).

41. Yao, Z. et al. Multiplexed bioluminescence microscopy via phasor analysis. Nat. Methods 19, 893-898 (2022).

42. Loening, A. M., Dragulescu-Andrasi, A. & Gambhir, S. S. A red-shifted Renilla luciferase for transient reporter-gene expression. Nat. Methods 7, 5-6 (2010).

43. Dijkema, F. M. et al. Flash properties of Gaussia luciferase are the result of covalent inhibition after a limited number of cycles. Protein Sci. 30, 638-649 (2021).

44. Schenkmayerova, A. et al. Engineering the protein dynamics of an ancestral luciferase. Nat. Commun. 12, (2021).

45. Ando, Y. et al. Development of a quantitative bio/chemiluminescence spectrometer determining quantum yields: Re-examination of the aqueous luminol chemiluminescence standard. Photochem. Photobiol. 83, 1205-1210 (2007).

46. Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658-1659 (2006).

47. Farrell, D. P. et al. Deep learning enables the atomic structure determination of the Fanconi Anemia core complex from cryoEM. IUCrJ 7, 881-892 (2020).

48. Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302-2309 (2005).

49. Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453 (1970).

50. Oda, T., Lim, K. & Tomii, K. Simple adjustment of the sequence weight algorithm remarkably enhances PSI-BLAST performance. BMC Bioinformatics 18, 288 (2017).

51. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402 (1997).

52. Coventry, B. & Baker, D. Protein sequence optimization with a pairwise decomposable penalty for buried unsatisfied hydrogen bonds. PLoS Comput. Biol. 17, (2021).

53. Smith, A. J. T. et al. Structural Reorganization and Preorganization in Enzyme Active Sites: Comparisons of Experimental and Theoretically Ideal Active Site Geometries in the Multistep Serine Esterase Reaction Cycle. J. Am. Chem. Soc. 130, 15361-15373 (2008).

54. Salomon-Ferrer, R., Case, D. A. & Walker, R. C. An overview of the Amber biomolecular simulation package. Wiley Interdiscip. Rev. Comput. Mol. Sci. 3, 198-210 (2013).

55. Kellogg, E. H., Leaver-Fay, A. & Baker, D. Role of conformational sampling in computing mutation-induced changes in protein structure and stability. Proteins 79, 830-838 (2011).

56. Park, H. et al. Simultaneous optimization of biomolecular energy functions on features from small molecules and macromolecules. J. Chem. Theory Comput. 12, 6201-6212 (2016).

57. Klein, J. C. et al. Multiplex pairwise assembly of array-derived DNA oligonucleotides. Nucleic Acids Res. 44, (2016).

58. Loening, A. M., Wu, A. M. & Gambhir, S. S. Red-shifted Renilla reniformis luciferase variants for imaging in living subjects. Nat. Methods 4, 641-643 (2007).

59. Liang, J., Feng, X., Hait, D. & Head-Gordon, M. Revisiting the performance of time-dependent density functional theory for electronic excitations: Assessment of 43 popular and recently developed functionals from rungs one to four. J. Chem. Theory Comput. 18, 3460-3473 (2022).

60. Chai, J.-D. & Head-Gordon, M. Long-range corrected hybrid density functionals with damped atom-atom dispersion corrections. Phys. Chem. Chem. Phys. 10, 6615-6620 (2008).

61. Ditchfield, R., Hehre, W. J. & Pople, J. A. Self-consistent molecular-orbital methods. IX. An extended Gaussian-type basis for molecular-orbital studies of organic molecules. J. Chem. Phys. 54, 724-728 (1971).

62. Grimme, S. Exploration of chemical compound, conformer, and reaction space with meta-dynamics simulations based on tight-binding quantum chemical calculations. J. Chem. Theory Comput. 15, 2847-2862 (2019).

63. Pracht, P., Bohle, F. & Grimme, S. Automated exploration of the low-energy chemical space with fast quantum chemical methods. Phys. Chem. Chem. Phys. 22, 7169-7192 (2020).

64. Luchini, G., Alegre-Requena, J. V., Funes-Ardoiz, I. & Paton, R. S. GoodVibes: automated thermochemistry for heterogeneous computational chemistry data. F1000Res. 9, 291 (2020).

65. Li, Y.-P., Gomes, J., Mallikarjun Sharada, S., Bell, A. T. & Head-Gordon, M. Improved force-field parameters for QM/MM simulations of the energies of adsorption for molecules in zeolites and a free rotor correction to the rigid rotor harmonic oscillator model for adsorption enthalpies. J. Phys. Chem. C Nanomater. Interfaces 119, 1840-1850 (2015).

66. Götz, A. W. et al. Routine microsecond molecular dynamics simulations with AMBER on GPUs. 1. Generalized Born. J. Chem. Theory Comput. 8, 1542-1555 (2012).

67. Becke, A. D. Density-functional thermochemistry. III. The role of exact exchange. J. Chem. Phys. 98, 5648-5652 (1993).

68. Grimme, S., Antony, J., Ehrlich, S. & Krieg, H. A consistent and accurate ab initio parametrization of density functional dispersion correction (DFT-D) for the 94 elements H-Pu. J. Chem. Phys. 132, 154104 (2010).

69. Grimme, S., Ehrlich, S. & Goerigk, L. Effect of the damping function in dispersion corrected density functional theory. J. Comput. Chem. 32, 1456-1465 (2011).

70. Meiler, J. & Baker, D. ROSETTALIGAND: Protein-small molecule docking with full side-chain flexibility. Proteins 65, 538-548 (2006).

71. Davis, I. W. & Baker, D. RosettaLigand docking with full ligand and receptor flexibility. J. Mol. Biol. 385, 381-392 (2009).

72. Davis, I. W., Raha, K., Head, M. S. & Baker, D. Blind docking of pharmaceutically relevant compounds using RosettaLigand. Protein Sci. 18, 1998-2002 (2009).

73. Wang, J., Wolf, R. M., Caldwell, J. W., Kollman, P. A. & Case, D. A. Development and testing of a general amber force field. J. Comput. Chem. 25, 1157-1174 (2004).

74. Bayly, C. I., Cieplak, P., Cornell, W. & Kollman, P. A. A well-behaved electrostatic potential based method using charge restraints for deriving atomic charges: the RESP model. J. Phys. Chem. 97, 10269-10280 (1993).

75. Besler, B. H., Merz, K. M. & Kollman, P. A. Atomic charges derived from semiempirical methods. J. Comput. Chem. 11, 431-439 (1990).

76. Singh, U. C. & Kollman, P. A. An approach to computing electrostatic charges for molecules. J. Comput. Chem. 5, 129-145 (1984).

77. Jorgensen, W. L., Chandrasekhar, J., Madura, J. D., Impey, R. W. & Klein, M. L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926-935 (1983).

78. Maier, J. A. et al. ff14SB: Improving the accuracy of protein side chain and backbone parameters from ff99SB. J. Chem. Theory Comput. 11, 3696-3713 (2015).

79. Darden, T., York, D. & Pedersen, L. Particle mesh Ewald: An N·log(N) method for Ewald sums in large systems. J. Chem. Phys. 98, 10089-10092 (1993).

80. Roe, D. R. & Cheatham, T. E., III. PTRAJ and CPPTRAJ: Software for processing and analysis of molecular dynamics trajectory data. J. Chem. Theory Comput. 9, 3084-3095 (2013).