Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
FOLDED AND PROTEASE-RESISTANT POLYPEPTIDES
Document Type and Number:
WIPO Patent Application WO/2018/201020
Kind Code:
A1
Abstract:
Non-naturally occurring polypeptides are disclosed that include (a) 3-3 secondary structure elements, wherein each secondary structure element is either an α-helix (H domain) of between 10-20 amino acid residues in length or a β-strand (E domain) of between 3-10 amino acid residues in length; and (b) 2-4 linkers of between 2 to 6 amino acid residues in length connecting adjacent secondary structure elements; wherein the polypeptide is between 25-50 amino acid residues in length; and wherein the polypeptide includes no cysteine residues.

Inventors:
ROCKLIN GABRIEL (US)
BAKER DAVID (US)
Application Number:
PCT/US2018/029904
Publication Date:
November 01, 2018
Filing Date:
April 27, 2018
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
UNIV WASHINGTON (US)
International Classes:
C07K1/00; C07K14/00
Foreign References:
US20140235467A12014-08-21
Other References:
BHARDWAJ ET AL.: "Accurate de novo design of hyperstable constrained peptides", NATURE, vol. 538, no. 7625, 20 October 2016 (2016-10-20), pages 329 - 335, XP055528081
SUN ET AL.: "Protein engineering by highly parallel screening of computationally designed variants", SCI. ADV., vol. 2, no. 7, 20 July 2016 (2016-07-20), pages e1600692, XP055528084
Attorney, Agent or Firm:
HARPER, David, S. (US)
Download PDF:
Claims:
We claim

1. A non-naturally occurring polypeptide comprising

(a) 3-5 secondary structure elements, wherein each secondary structure element is either an a-helix (H domain) of between 10-20 amino acid residues in length or a β-strand (E domain) of between 3-10 amino acid residues in length; and

(b) 2-4 linker s of between 2 to 6 amino acid residues in length connecting adjacent secondary structure elements;

wherein the polypeptide is between 25-50 amino acid residues in length; and wherein the polypeptide includes no cysteine residues.

2. The polypeptide of claim 1 , wherein each H domain is independently between 10-15 amino acids in length.

3. The polypeptide of claim 1 or 2, wherein each E domain is independently between 3-7 amino acids in length.

4. The polypeptide of any one of claims 1-3, wherein the polypeptide is between 30-50, 35-50, 35-45, 40-45, or 40-43 amino acid residues in length. 5. The polypeptide of any one of claims claim 1-4, wherein the polypeptide comprises a secondary structure element arrangement selected from the group consisting of

HHH, EHEE, HEEH, and EEHEE.

6. The polypeptide of any one of claims 1-5, wherein the polypeptide comprises an amino acid sequence having at least 30% identity along its length to the amino acid sequence selected from the group consisting of SEQ ID NOS: 1-4000, or a mirror image thereof.

7. The polypeptide of any one of claims 1-5, wherein the polypeptide comprises an amino acid sequence having at least 50% identity along its length to the amino acid sequence selected from the group consisting of SEQ ID NOS: 1-4000, or a mirror image thereof.

8. The polypeptide of any one of claims 1 -5, wherein the polypeptide comprises an amino acid sequence having at least 80% identity along its length to the amino acid sequence selected from the group consisting of SEQ ID NOS: 1 -4000, or a mirror image thereof.

9. The polypeptide of any one of claims 1 -5, wherein the polypeptide comprises an amino acid sequence having at least 90% identity along its length to the amino acid sequence selected from the group consisting of SEQ ID NOS: 1-4000, or a mirror image thereof.

10. The polypeptide of any one of claims 1-5, wherein the polypeptide comprises an amino acid sequence having at least 95% identity along its length to the amino acid sequence selected from the group consisting of SEQ ID NOS: 1-4000, or a mirror image thereof.

1 1. The polypeptide of any one of claims 1-3, wherein the polypeptide comprises an amino acid sequence selected from the group consisting of SEQ ID NOS: 1 -4000, or a mirror image thereof. 12. An isolated nucleic acid encoding the polypeptide any one of claims 1-11.

13. A recombinant expression vector comprising the isolated nucleic acid of claim 12 operatively linked to a promoter. 14. A recombinant host cell comprising the recombinant expression vector of claim 13.

15. A method, comprising:

(a) using a computing device to construct a library of proteins, wherein the computing device designs a sequence to stabilize the backbone of the protein, and wherein the proteins comprise less than about 50 amino acids;

(b) synthesizing the proteins using next-generation gene synthesis;

(c) expressing the proteins in yeast so that every cell displays many copies of one protein sequence on its surface; and (d) screening the library of proteins for susceptibility to digestion by protease.

16. The method of claim 15, wherein the synthesizing step comprises oligo library synthesis technology, capable of parallel synthesis of 104-105 arbitrarily specified DNA sequences long enough to encode the proteins.

17. The method of claim 15 or 16, wherein in the screening step, cells are incubated with varying concentrations of protease, those displaying resistant proteins are isolated by fluorescence-activated cell sorting (FACS), and the frequencies of each protein at each protease concentration are determined by deep sequencing.

18. The method any one of claims 15-17, the method further comprising assigning each protein a stability score, wherein the stability score comprises: the difference between the measured EC50 and the predicted EC50 in the unfolded state of the protein, according to a sequence-based model parameterized using EC50 measurements of scrambled sequences.

19. The method of claim 18, wherein a stability score of 1 corresponds to a 10- fold higher EC50 than the predicted EC50 in the unfolded state.

20. The method of any one of claims 15-19, wherein the library comprises 1,000 to 30,000 proteins.

Description:
Folded and protease-resistant polypeptides Cross Reference

This application claims priority to U.S. Provisional Patent Application Serial No. 62/491518 filed April 28, 2017, incorporated by reference herein in its entirety.

Background

Proteins fold into unique native structures stabilized by thousands of weak interactions that collectively overcome the entropic cost of folding. Though these forces are "encoded" in the thousands of known protein structures, "decoding" them is challenging due to the complexity of natural proteins that have evolved for function, not stability.

The key challenge to achieving a quantitative understanding of the sequence determinants of protein folding is to accurately and efficiently model the balance between the many forces contributing to the free energy of folding. Mutagenesis has been the primary experimental tool for interrogating the balance of forces contributing to stability, but the results are context-dependent and do not provide a global view of the contributions to stability. Molecular dynamics simulations on minimal proteins do not reveal which interactions specify and stabilize the native structure, and cannot in general determine whether a given sequence will fold into a stable structure.

Summary of the Invention

In one aspect, non-naturally occurring polypeptides are provided, comprising (a) 3-5 secondary structure elements, wherein each secondary structure element is either an a-helix (H domain) of between 10-20 amino acid residues in length or a β-strand (E domain) of between 3-10 amino acid residues in length; and

(b) 2-4 linker s of between 2 to 6 amino acid residues in length connecting adjacent secondary structure elements;

wherein the polypeptide is between 25-50 amino acid residues in length; and wherein the polypeptide includes no cysteine residues.

In one embodiment, each H domain is independently between 10-15 amino acids in length. In another embodiment, each E domain is independently between 3-7 amino acids in length. In a further embodiment, the polypeptide is between 30-50, 35-50, 35-45, 40-45, or 40-43 amino acid residues in length. In another embodiment, the polypeptide comprises a secondary structure element arrangement selected from the group consisting of HHH, EHEE, HEEH, and EEHEE. In various embodiments, the polypeptide comprises an amino acid sequence having at least at least 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical along its length to the amino acid sequence of any one of SEQ ID NOS: 1-4000, or a mirror image thereof. Also provided are isolated nucleic acids encoding the polypeptide of any embodiment herein, recombinant expression vectors comprising the isolated nucleic acids linked to a promoter, and recombinant host cells comprising the recombinant expression vectors disclosed herein.

Also provided herein are methods comprising:

(a) using a computing device to construct a library of proteins, wherein the computing device designs a sequence to stabilize the backbone of the protein, and wherein the proteins comprise less than about 50 amino acids;

(b) synthesizing the proteins using next-generation gene synthesis;

(c) expressing the proteins in yeast so that every cell displays many copies of one protein sequence on its surface; and

(d) screening the library of proteins for susceptibility to digestion by protease. In one embodiment, the synthesizing step comprises oligo library synthesis technology, capable of parallel synthesis of 10 4 - 10 s arbitrarily specified DNA sequences long enough to encode the proteins. In another embodiment, in the screening step, cells are incubated with varying concentrations of protease, those displaying resistant proteins are isolated by fluorescence-activated cell sorting (FACS), and the frequencies of each protein at each protease concentration are determined by deep sequencing. In a further embodiment, the method further comprising assigning each protein a stability score, wherein the stability score comprises: the difference between the measured EC 50 and the predicted EC 50 in the unfolded state of the protein, according to a sequence-based model parameterized using EC 50 measurements of scrambled sequences. In one embodiment, a stability score of 1 corresponds to a 10-fold higher EC 50 than the predicted EC 50 in the unfolded state. In another embodiment, the library comprises 1 ,000 to 30,000 proteins. Description of the Figures

Fig. 1. Yeast display enables massively parallel measurement of protein stability.

(A) Each yeast cell displays many copies of one test protein fused to Aga2. The c-terminal c- Myc tag is labeled with a fluorescent antibody. Protease cleavage of the test protein (or other cleavage) leads to loss of the tag and loss of fluorescence. (B) Libraries of 10 4 unique sequences are sorted by flow cytometry. Most cells show high protein expression (measured by fluorescence) before proteolysis (blue). Only some cells retain fluorescence after proteolysis; those above a threshold (shaded green region) are collected for deep sequencing analysis. (C) Sequential sorting at increasing protease concentrations separates proteins by stability. Each sequence in a library of 19,726 proteins is shown as a grey line tracking the change in population fraction (enrichment) of that sequence, normalized to each sequence's population in the starting (pre-selection) library. Enrichment traces for seven proteins at different stability levels are highlighted in color. (D) EC 50 S for the seven highlighted proteins in (C) are plotted on top of the overall density of the 46,187 highest-confidence EC 50 measurements from design rounds 1-4. (E) Same data as at left, showing that stability scores (EC 50 values corrected for intrinsic proteolysis rates) correlate better than raw EC 50 S between the proteases. (F-I) Stability scores measured in high-throughput correlate with individual folding stability measurements for mutants of four small proteins. The wild-type sequence in each set is highlighted as a red circle. Confidence intervals for all EC 50 measurements are provided in supplementary materials. (F) Pinl AGu d f data at 40 °C from (28) by thermal denaturation (G) hYAP65 Tm data from (5, 10) (H) Villin HP35 AGunr data at 25 °C from (7, 11) by urea denaturation (I) BBL AGunr data at 10 °C from (8) by thermal denaturation.

Fig. 2. Biophysical characterization of designed minimal proteins. (A) Design models and NMR solution ensembles for designed minimal proteins. (B) Far-ultraviolet circular dichroism (CD) spectra at 25 °C (black), 95 °C (red), and 25 °C following melting (blue). (C) Thermal melting curves measured by CD at 220 nm. Melting temperatures determined using the derivative of the curve. (D) Chemical denaturation in GuHCl measured by CD at 220 nm and 25 °C. Unfolding free energies determined by fitting to a two-state model (red solid line).

Detailed Description of the Invention

All references cited are herein incorporated by reference in their entirety. As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. "And" as used herein is interchangeably used with "or" unless expressly stated otherwise.

As used herein, the amino acid residues are abbreviated as follows: alanine (Ala; A), asparagine (Asn; N), aspartic acid (Asp; D), arginine (Arg; R), cysteine (Cys; C), glutamic acid (Glu; E), glutamine (Gin; Q), glycine (Gly; G), histidine (His; H), isoleucine (He; I), leucine (Leu; L), lysine (Lys; K), methionine (Met; M), phenylalanine (Phe; F), proline (Pro; P), serine (Ser; S), threonine (Thr; T), tryptophan (Tip; W), tyrosine (Tyr; Y), and valine (Val; V).

All embodiments of any aspect of the invention can be used in combination, unless the context clearly dictates otherwise.

In one aspect, the invention provides non-naturally occurring polypeptides comprising or consisting of:

(a) 3-5 secondary structure elements, wherein each secondary structure element is either an a-helix (H domain) of between 10-20 amino acid residues in length or a β-strand (E domain) of between 3-10 amino acid residues in length; and

(b) 2-4 linkers of between 2 to 6 amino acid residues in length connecting adjacent secondary structure elements;

wherein the polypeptide is between 25-50 amino acid residues in length; and wherein the polypeptide includes no cysteine residues.

As demonstrated in the examples, the inventors have developed computational methods for de novo design of non-naturally occurring folded protease-resistant peptides that do not include cysteine residues and thus do not rely on disulfide bonds for stability, and the use of these methods to design a large number of exemplary 25-50 residue constrained peptides. The stable polypeptides disclosed herein provide robust starting scaffolds for generating peptides that bind targets of interest using computational interface design or experimental selection methods. Solvent-exposed hydrophobic residues can be introduced without impairing folding or solubility, suggesting high mutational tolerance. Hence it should be possible to reengineer the peptide surfaces, incorporating target-binding residues to construct binders, agonists, or inhibitors.

As used herein, a β-sheet secondary structure element comprises β strands connected laterally by backbone hydrogen bonds. As used herein, an a-helix secondary structure element is a right-handed or left-handed (when D amino acids are involved) helix in which backbone amine groups donate a hydrogen bond to backbone carbonyl groups of amino acids 3- 4 residues before it along the primary amino acid sequence of the polypeptide. In various embodiments, the polypeptide comprises or consists of 3-5, 3-4, 4-5, 3, 4, or 5 secondary structure elements. In various non-limiting embodiments, the secondary structure element arrangement of the polypeptide may be selected from the group consisting of HHH, EHE, EEH, HEE, HEEE, EEHE, EHEE, EEEH, HEEH, and EEHEE, wherein H is a helix and E is a beta strand. In specific embodiments, structure element arrangement of the polypeptide may be selected from the group consisting of HHH, EHEE, HEEH, and EEHEE.

In various embodiments, each E domain is independently between 3-10, 3-9, 3-8, 3-7,

4- 10, 4-9, 4-8, 4-7, 4-6, 4-5, 5-10, 5-9, 5-8, 5-7, 5-6, 6-10, 6-9, 6-8, 6-7, 7-10, 7-9, 7-8, 8-10, 8-9, 9-10, 3, 4, 5, 6, 7, 8, 9, or 10 amino acid residues in length. In a specific embodiment, each E domain is independently 3-7 amino acids in length. In one embodiment, each E domain in the polypeptide is the same length; in another embodiment, not all E domains in the polypeptide are the same length.

In other embodiments, each H domain is independently between 10-20, 10-19, 10-18, 10-17, 10-16, 10-15, 10-14, 10-13, 10-12, 10-11, 11-20, 11-19, 11-18, 11-17, 11-16, 11-15, 11-14, 11-13, 11-12, 12-20, 12-19, 12-18, 12-17, 12-16, 12-15, 12-14, 12-13, 13-20, 13-19, 13-18, 13-17, 13-16, 13-15, 13-14, 14-20, 14-19, 14-18, 14-17, 14-16, 14-15, 15-20, 15-19, 15-18, 15-17, 15-16, 16-20, 16-19, 16-18, 16-17, 17-20, 17-19, 17-18, 18-20, 18-19, 19-20, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 amino acid residues in length. In a specific embodiment, each H domain is independently between 10-15 amino acid residues in length. In one embodiment, each H domain in the polypeptide is the same length; in another embodiment, not all H domains in the polypeptide are the same length.

The linkers serve to appropriately space adjacent secondary structure elements. The polypeptides may comprise 2-4, 2-3, 3-4, 2, 3, or 4 linkers. In further embodiments, each linker is independently 2-6, 2-5, 2-4, 2-3, 3-6, 3-5, 3-4, 4-6, 4-5, 5-6, 2, 3, 4, 5, or 6 amino acids in length. In one embodiment, each linker in the polypeptide is the same length; in another embodiment, not all linkers in the polypeptide are the same length. The linkers may be of any suitable amino acid sequence, and each linker in a given polypeptide may be the same or different.

In various further embodiments, the polypeptide is 25-50, 30-50, 35-50, 40-50, 45-50,

25-45, 30-45, 35-45, 40-45, 25-40, 30-40, 35-40, 25-35, 30-35, 25-30, or 40-43 amino acid residues in length.

As used throughout the present application, the term "polypeptide" is used in its broadest sense to refer to a sequence of subunit amino acids. The polypeptides of the invention may comprise glycine, L-amino acids, D-amino acids (which are resistant to L- amino acid-specific proteases in vivo), or a combination of glycine and D- and L-amino acids. As disclosed herein, L-amino acids and glycine are shown in upper case letters, and D- amino acids are shown in lower case letters.

In another embodiment, the polypeptide is at least 30% identical along its entire length to the amino acid sequence of any one of SEQ ID NOS: 1-4000, or a mirror image thereof (i.e.: L amino acids substituted with D amino acids). In various further embodiments, the polypeptide is at least 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100% identical along its length to the amino acid sequence of any one of SEQ ID NOS: 1 -4000, or a mirror image thereof.

The polypeptides described herein may be chemically synthesized or recombinantly expressed (when the polypeptide is genetically encodable). The polypeptides may be linked to other compounds to promote an increased half-life in vivo, such as by PEGylation, HESylation, PASylation, glycosylation, or may be produced as an Fc-fusion or in

deimmunized variants. Such linkage can be covalent or non-covalent.

The polypeptides of the invention may include additional residues at the N-terminus, C-terminus, or both that are not present in the polypeptides of the invention; these additional residues are not included in detennining the percent identity of the polypeptides of the invention relative to the reference polypeptide.

As shown in the examples that follow, the specific primary amino acid sequence is not a critical determinant of maintaining the structure of the constrained peptide.

Thus, the polypeptides disclosed herein may be substituted with conservative or non- conservative substitutions. In one embodiment, changes from the reference polypeptide may be conservative amino acid substitutions. As used herein, "conservative amino acid substitution" means an amino acid substitution that does not alter or substantially alter polypeptide function or other characteristics. In one such embodiment, L amino acids are substituted with other L-amino acids, D amino acids are substituted with other L amino acids, and glycine may be substituted with L or D amino acids, preferably with D amino acids.

In other embodiments, a given amino acid can be replaced by a residue having similar physiochemical characteristics, e.g., substituting one aliphatic residue for another (such as

He, Val, Leu, or Ala for one another), or substitution of one polar residue for another (such as between Lys and Arg; Glu and Asp; or Gin and Asn). Other such conservative substitutions, e.g., substitutions of entire regions having similar hydrophobicity characteristics, are known. Polypeptides comprising conservative amino acid substitutions can be tested in any one of the assays described herein to confirm that a desired activity, e.g. antigen-binding activity and specificity of a native or reference polypeptide is retained. Amino acids can be grouped according to similarities in the properties of their side chains (in A. L. Lehninger, in

Biochemistry, second ed., pp. 73-75, Worth Publishers, New York (1975)): (1) non-polar: Ala (A), Val (V), Leu (L), He (I), Pro (P), Phe (F), Trp (W), Met (M); (2) uncharged polar: Gly (G), Ser (S), Thr (T), Cys (C), Tyr (Y), Asn (N), Gin (Q); (3) acidic: Asp (D), Glu (E); (4) basic: Lys (K), Arg (R), His (H). Alternatively, naturally occurring residues can be divided into groups based on common side-chain properties: (1) hydrophobic: Norleucine, Met, Ala, Val, Leu, He; (2) neutral hydrophilic: Cys, Ser, Thr, Asn, Gin; (3) acidic: Asp, Glu; (4) basic: His, Lys, Arg; (5) residues that influence chain orientation: Gly, Pro; (6) aromatic: Trp, Tyr, Phe. Non-conservative substitutions will entail exchanging a member of one of these classes for another class. Particular conservative substitutions include, for example; Ala into Gly or into Ser; Arg into Lys; Asn into Gin or into H is; Asp into Glu; Cys into Ser; Gin into Asn; Glu into Asp; Gly into Ala or into Pro; His into Asn or into Gin; He into Leu or into Val; Leu into He or into Val; Lys into Arg, into Gin or into Glu; Met into Leu, into Tyr or into He; Phe into Met, into Leu or into Tyr; Ser into Thr; Thr into Ser; Trp into Tyr; Tyr into Trp; and/or Phe into Val, into He or into Leu. In one embodiment, polar residues

(DEHKNQRST) are more mutable than the other residues.

As noted above, the polypeptides of the invention may include additional residues at the N-terminus, C-terminus, or both. Such residues may be any residues suitable for an intended use, including but not limited to detection tags (i.e.: fluorescent proteins, antibody epitope tags, etc.), adaptors, ligands suitable for purposes of purification (His tags, etc.), and peptide domains that add functionality to the polypeptides.

In a further aspect, the present invention provides isolated nucleic acids encoding a polypeptide of the present invention that can be genetically encoded. The isolated nucleic acid sequence may comprise RNA or DNA. As used herein, "isolated nucleic acids" are those that have been removed from their normal surrounding nucleic acid sequences in the genome or in cDNA sequences. Such isolated nucleic acid sequences may comprise additional sequences useful for promoting expression and/or purification of the encoded protein, including but not limited to polyA sequences, modified Kozak sequences, and sequences encoding epitope tags, export signals, and secretory signals, nuclear localization signals, and plasma membrane localization signals. It will be apparent to those of skill in the art, based on the teachings herein, what nucleic acid sequences will encode the polypeptides of the invention. In another aspect the present invention provides recombinant expression vectors comprising the isolated nucleic acid of any aspect of the invention operatively linked to a suitable control sequence. "Recombinant expression vector" includes vectors that operatively link a nucleic acid coding region or gene to any control sequences capable of effecting expression of the gene product. "Control sequences" operably linked to the nucleic acid sequences of the invention are nucleic acid sequences capable of effecting the expression of the nucleic acid molecules. The control sequences need not be contiguous with the nucleic acid sequences, so long as they function to direct the expression thereof. Thus, for example, intervening untranslated yet transcribed sequences can be present between a promoter sequence and the nucleic acid sequences and the promoter sequence can still be considered "operably linked" to the coding sequence. Other such control sequences include, but are not limited to, polyadenylation signals, termination signals, and ribosome binding sites. Such expression vectors include but not limited to, plasmid and viral-based expression vectors. The control sequence used to drive expression of the disclosed nucleic acid sequences in a mammalian system may be constitutive (driven by any of a variety of promoters, including but not limited to, CMV, SV40, RSV, actin, EF) or inducible (driven by any of a number of inducible promoters including, but not limited to, tetracycline, ecdysone, steroid-responsive). The expression vector must be replicable in the host organisms either as an episome or by integration into host chromosomal DNA. In various embodiments, the expression vector may comprise a plasmid, viral-based vector, or any other suitable expression vector.

In a further aspect, the present invention provides host cells that comprise the recombinant expression vectors disclosed herein, wherein the host cells can be either prokaryotic or eukaryotic. The cells can be transiently or stably engineered to incorporate the expression vector of the invention, using standard techniques in the art, including but not limited to standard bacterial transformations, calcium phosphate co-precipitation, electroporation, or liposome mediated-, DEAE dextran mediated-, polycationic mediated-, or viral mediated transfection. (See, for example, Molecular Cloning: A Laboratory Manual (Sambrook, et al., 1989, Cold Spring Harbor Laboratory Press; Culture of Animal Cells: A Manual of Basic Technique, 2 nd Ed. (R.I. Freshney. 1987. Liss, Inc. New York, NY). A method of producing a polypeptide according to the invention is an additional part of the invention. The method comprises the steps of (a) culturing a host according to this aspect of the invention under conditions conducive to the expression of the polypeptide, and (b) optionally, recovering the expressed polypeptide. The expressed polypeptide can be recovered from the cell free extract, but preferably they are recovered from the culture medium.

Also provided are methods, comprising:

(a) using a computing device to construct a library of proteins, wherein the computing device designs a sequence to stabilize the backbone of the protein, and wherein the proteins comprise less than about 50 amino acids;

(b) synthesizing the proteins using next-generation gene synthesis;

(c) expressing the proteins in yeast so that every cell displays many copies of one protein sequence on its surface; and

(d) screening the library of proteins for susceptibility to digestion by protease.

Here we combine computational protein design, next-generation gene synthesis, and a high-throughput protease susceptibility assay to measure folding and stability for over 15,000 de novo designed miniproteins, 1,500 natural proteins, 10,000 point-mutants, and 30,000 negative control sequences, identifying over 2,500 new stable designed proteins in four basic folds. This scale— three orders of magnitude greater than that of previous studies of design or folding— enabled us to systematically examine how sequence determines folding and stability in uncharted protein space. Iteration between design and experiment increased the design success rate from 6% to 47%, produced stable proteins unlike those found in nature for topologies where design was initially unsuccessful, and revealed subtle contributions to stability as designs became increasingly optimized. Our approach achieves the long-standing goal of a tight feedback cycle between computation and experiment, and promises to transform computational protein design into a data-driven science.

In one embodiment, the synthesizing step comprises oligo library synthesis technology capable of parallel synthesis of 10 4 -10 5 arbitrarily specified DNA sequences long enough to encode the proteins. In another embodiment, in the screening step, cells are incubated with varying concentrations of protease, those displaying resistant proteins are isolated by fluorescence-activated cell sorting (FACS), and the frequencies of each protein at each protease concentration are determined by deep sequencing. In a further embodiment, the method further comprising assigning each protein a stability score, wherein the stability score comprises: the difference between the measured EC 50 and the predicted EC 50 in the unfolded state of the protein, according to a sequence-based model parameterized using EC 50 measurements of scrambled sequences. In another embodiment, a stability score of 1 corresponds to a 10-fold higher EC 50 than the predicted EC 50 in the unfolded state. In a further embodiment, the library comprises 1,000 to 30,000 proteins. Examples

Here we present a new synthetic approach to examine the determinants of protein folding by exploring the space of potential minimal proteins using de novo computational protein design on a three order of magnitude larger scale. To enable this new scale, both DNA synthesis and protein stability measurements are parallelized. To encode our designs, we employ oligo library synthesis technology (22), originally developed for transcriptional profiling and large gene assembly applications, and now capable of parallel synthesis of 10 4 - 10 s arbitrarily specified DNA sequences long enough to encode short proteins. To assay designs for stability, we expressed these libraries in yeast so that every cell displays many copies of one protein sequence on its surface, genetically fused to an expression tag that can be fluorescently labeled (23) (Fig. 1 A). Cells were then incubated with varying

concentrations of protease, those displaying resistant proteins were isolated by FACS (Fig. IB), and the frequencies of each protein at each protease concentration were determined by deep sequencing (Fig. 1C). We then inferred protease EC 50 values for each sequence from these data by modeling the complete selection procedure (Fig. ID, details given in Methods). Finally, each design was assigned a "stability score" (Fig. IE): the difference between the measured EC 50 and the predicted EC 50 in the unfolded state, according to a sequence-based model parameterized using EC 50 measurements of scrambled sequences. A stability score of 1 corresponds to a 10-fold higher EC 50 than the predicted EC 50 in the unfolded state. The amino acid sequences of peptides developed herein are provided as SEQ ID NOs: 1-4000 in the accompanying sequence listing.

Massively parallel measurement of folding stability

Proteolysis assays have been previously used to measure stability for individual sequences (24) and to select for stable sequences in a proteome (25) or combinatorial library (26, 27), but this approach has not been applied to date to quantify stability for all sequences in a library. To evaluate the ability of the assay to measure stability on a large scale, we obtained a synthetic DNA library encoding four small proteins (pinl WW-domain (28), hYAP65 WW-domain (5, 10), villin HP35 (7, 11), and BBL (8)) and 116 mutants of these proteins that have been previously characterized thermodynamically using experiments on purified material. The library also contained 19,610 unrelated sequences, and all sequences were assayed for stability simultaneously as described. Although the stability score is not a directly analog of a thermodynamic parameter, stability scores measured with trypsin and separately measured with chymotrypsin were each well-correlated with folding free energies (or melting temperatures) for all four sets of mutants, with r 2 values ranging from 0.63 to 0.85 (Fig. 1 F-I). Most mutants in this dataset were predicted to have similar unfolded state EC 50 values to their parent sequences, so the relative stability scores of the mutants are very similar to their relative EC 50 values. In the case of villin assayed with chymotrypsin, the unfolded state model improved the correlation between protease resistance and folding free energy from r 2 = 0.46 (using raw EC 50 values) to the reported = 0.77 by correcting for the effect mutations such as K70M and F51L have on intrinsic chymotrypsin cleavage rates. The mutual agreement between trypsin results, chymotrypsin results, and experiments on purified protein indicate that the assay provides a robust measure of folding stability for small proteins.

Massively parallel testing of designed miniproteins

We selected four protein topologies (ααα, βαββ, αββα, and ββαββ) as initial design targets. These topologies have increasing complexity: the ααα topology features only two linkers and exclusively local secondary structure (helices); the ββαββ fold requires four linkers and features a mixed parallel/antiparallel β-sheet bridging the N- and C-termini. Of these topologies, only ααα proteins have been found in nature within the target size range of 40-43 residues; no proteins have been previously designed in any of the four topologies at this size (excluding designed acta and βαββ proteins stabilized by multiple disulfide linkages (29)). For each topology, we first designed between 5,000 and 40,000 de novo proteins using a blueprint-based approach. Each design has its own unique three-dimensional main chain conformation and its own unique sequence predicted to be near-optimal for that

conformation. We then selected 1,000 designs per topology for experimental testing by ranking the designs by a weighted sum of their computed energies and additional filtering terms [see Methods: Protein design}. The median sequence identity between any pair of tested designs of the same topology ranged from 15-35%, and designs were typically no more than 40-65% identical to any other design. For each design, we also included two control sequences in our library: one made by scrambling the order of amino acids in that design (preserving the overall amino acid composition), and a second made by scrambling the order while preserving both the composition and the hydrophobic or polar character at each position. This library comprised 12,459 unique sequences in total: 4,153 designed proteins and 8,306 control sequences. The designed proteins are named using their secondary structure topology (using H for a-helix and E for β-strand), their design round, and a design number. We assayed the sequence library for stability using both chymotrypsin and trypsin. To stringently identify stable designs, we ranked sequences by the lower of their trypsin stability score or their chymotrypsin stability score, referred to simply as their stability score from here on. The fully scrambled sequences and patterned scrambled sequences had similar stability score distributions; most of these controls had stability scores below 0.5, and only one had a score greater than 1.0. In contrast, 206 designed sequences had stability scores above 1.0. Most of these (195/206) were ααα designs (both left-hand and right-handed bundles); the remaining 11 were βαββ. The clustering of the 206 most stable designs around the ααα topology, and the high stability of designed sequences compared with chemically identical control sequences, strongly suggests these stable designs fold into their designed structures. To examine this further, we selected six stable designs (four ααα and two βαββ) for E. coli expression, purification, and further characterization by size-exclusion

chromatography (SEC) and circular dichroism spectroscopy (CD). All six designs eluted from SEC as expected for a 5-7 kDa monomer, and the CD spectra were consistent with the designed secondary structure. Five of the six designs had clear, cooperative melting transitions, re-folded reversibly and were highly stable for minimal proteins: all had melting temperatures above 70 °C, and the βαββ design EHEE_rdl_0284

NMR; each structure closely matched the design model (Fig. 2A,. In sum, both high- throughput control experiments and low-throughput characterization of individual proteins indicate that the protease resistant designs fold as designed.

Global determinants of stability

This unprecedentedly large set of stable and unstable minimal proteins with varying physical properties enabled us to quantitatively examine which protein features correlated with folding. We computed over 60 structural and sequence-based metrics and examined which metrics differed between the 195 most stable ααα designs (stability score > 1.0, considered to be design successes) and the 664 remaining ααα designs (considered to be failures) using the K-S 2 -sample test. Significant differences indicate that a particular metric captures an important contribution to protein stability, and that this contribution was poorly optimized among the tested designs.

The dominant difference between stable and unstable ααα designs was the total amount of buried nonpolar surface area (NPSA) from hydrophobic amino acids. Stable designs buried more NPSA than did unstable designs (p < 5e-38), and none of the 95 designs below 32 A 2 /residue were stable. Above this threshold, the success rate (successful designs / tested designs) steadily increased as buried NPSA increased. Stable designs also had better agreement between their sequences and their local structures as assessed by quantifying the geometric similarity (in A of RMSD) between 9-residue long fragments of the designs and 9- residue long fragments of natural proteins similar in local sequence to the designed fragment (see Methods: Fragment analysis). Fragments of stable designs were more geometrically similar to fragments of natural proteins of similar local sequence, while fragments of unstable designs were more geometrically distant from the fragments of natural proteins matching their local sequence (p < 2e-26). Other metrics were only weakly correlated with success despite substantial variability among designs, including different measures of amino acid packing density, and the total Rosetta energy itself. Although local sequence-structure agreement and especially buried NPSA are well known to be important for protein stability, it is very challenging to determine the precise strength of these contributions at a global level in the complex balance of all the energies influencing protein structure. Our results directly demonstrate how specific imbalances led to selection of hundreds of unstable designs, and our data and approach provide a completely new route to refining this balance in biophysical modeling.

Iterative, data-driven protein design

We sought to use these findings to increase the success rate of protein design, both by changing the design procedure to increase buried NPSA and by re-weighting the metrics used to select designs for testing (see Methods: Protein design). Using the improved design and ranking procedure, we built a second generation of 4,150 designs, along with two control sequences per design: a pattern-preserving scrambled sequence as before (now also preserving Gly and Pro positions), and a second control identical to the designed sequence, but with the most buried side chain (according to the design model) replaced with aspartate. As in Round 1, almost no scrambled sequences had stability scores above 1 (our cutoff defining success) despite the increased hydrophobicity of the scrambled sequences. However, a much greater fraction of second-generation designs proved stable: success for acta designs improved from 23% to 69%, βαββ designs improved from 1% to 11% successful, and we also obtained 7 stable αββα designs and one stable ββαββ design. These increases demonstrate how iterative, high-throughput protein design can make concrete improvements in design and modeling. Nearly all stable designs were destabilized via the single buried Asp substitution: the median drop in stability score for these designs was 1.1, and only 33 buried Asp controls had stability scores > 1.0 compared with 271 designs. This significant destabilization from a single designed substitution provides further large-scale evidence that the stable designs fold into their designed structures. We purified and characterized seven second-generation proteins by SEC and CD, all of which (including three αββα designs and one ββαββ design) were monomelic, displayed their designed secondary structure in CD, were folded cooperatively, and re-folded reversibly after thermal denaturation. Although the αββα and ββαββ designs were only weakly stable, the second-generation βαββ design EHEE_rd2_0005

our knowledge, the most thermostable minimal protein ever found (lacking disulfides or metal coordination): its CD spectrum is essentially unchanged at 95 °C, and its Cm is above 5 M GuHCl.

The amount of buried NPSA was the strongest observed determinant of folding stability for second-generation βαββ designs, and continued to show correlation with stability for second-generation ααα designs. The success rate for ααα designs improved in Round 2 at all levels of buried NPSA, indicating that improving design properties unrelated to buried NPSA (mainly local sequence-structure compatibility) contributed to the increase in success rate along with the increase in NPSA. To improve the stability of the other two topologies, we built a third generation of designs with even greater buried NPSA, at the cost of increased exposure of hydrophobic surface that could hurt solubility. To increase buried NPSA in the ββαββ topology, we expanded the architecture from 41 to 43 residues. This led to a large increase in the ββαββ success rate (~0% to 13%) and 236 newly discovered stable ββαββ desi ns. We urified four third- eneration desi ns and found the α desi n

SEQ ID NO: 199) to be very stable (Fig. 2). We solved the solution structure of this design by NMR, revealing that it folds into its designed structure, which is not found in nature at this size range (Fig. 2). Buried NPSA remained the dominant determinant of stability within the tested ββαββ designs. We also observed that a newly improved Rosetta energy function (optimized independently from this work (/9) provided significant discrimination between stable and unstable designs, both for the ββαββ topology and for other topologies. Having accumulated nearly 1,000 examples of stable designs from rounds 1-3, we asked whether more systematic application of this data could be used to select better designs. We designed 2,000-6,000 new proteins per topology (using the improved energy function), and then selected 1,000 designs each for experimental testing by ranking the designs using linear regression, logistic regression, and gradient boosting regression models trained on the structural features and experimental stabilities of the 10,000 designs from Rounds 1-3. Many designs selected for testing were predicted to have a low likelihood of folding, but were included to increase the sequence diversity of tested designs and because better designs could not be found (see Methods: Protein design). Despite this, an even larger fraction of these designs proved stable than before: most notably, the success rate for βαββ designs increased from 17% to 39%, and the success rate for ββαββ designs increased from 13% to 58%.

Although the success rate for designing the αββα topology remained low, five purified fourth- generation designs in this topology possessed the highest stability yet observed for the fold by CD. We solved the structure of one of these HEEH rd4 0097

by NMR and found that it adopts its designed structure in solution (Fig. 2A). Of the models used to rank designs, the logistic regression was most successful, and was very accurate: when designs are binned according to their predicted success probability, the number of success in each bin is very close to that predicted beforehand by the logistic regression. The accuracy of the regression model demonstrates that large-scale analysis of stable and unstable designed proteins can be used to build predictive models of protein stability. The overall increase in success across the four rounds— from 200 stable designs in Round 1 (nearly all in a single topology) to over 1,800 stable designs in Round 4 spread across all four topologies— also demonstrates the power of our massively parallel approach to drive systematic improvement in protein design.

Sequence determinants of stability

We next examined determinants of stability at the individual residue level by constructing a library containing every possible point mutant of 14 designs, as well as every point mutant in three paradigm proteins from decades of folding research: villin HP35, pinl WW-domain, and hYAP65 WW-domain L30K mutant. This library of 12,834 point mutants is comparable in size to the 12,561 single mutants found in the entire ProTherm™ database (34) and is unbiased toward specific mutations. We assayed this library for stability using trypsin and chymotrypsin, and determined an overall stability effect for each mutation by using the independent results from each protease to maximize the dynamic range of the assay (see Methods: Mutational stability effects). The mutational effects were qualitatively consistent with the designed structures for 13 of 14 designs. As expected, the positions on the designs that were most sensitive to mutation were the core hydrophobic residues, including many alanine residues, which indicates the designed cores are tightly packed. Mutations to surface residues had much smaller effects, highlighting the potential of these proteins as stable scaffolds whose surfaces can be engineered for diverse applications.

To examine the mutability of protein surfaces in greater detail and to probe more subtle contributions to stability, we divided 260 surface positions from 12 of the designs into categories based on secondary structure, and calculated the average stability effect of each amino acid for each category using the ~5,000 stability measurements at these positions

{Methods: Mutational stability effects). We observed specific, though weak, preferences for helices, helix N-caps, the first and last turns of helices, middle strands and edge strands, and linker residues). Amino acids that were favorable for capping helices (Asp, Ser, Thr, and Asn) were unfavorable within helices; these amino acids (except Asn) were as destabilizing as glycine when inside helices. Hydrophobic side chains were stabilizing even when located on the solvent-facing side of a β-sheet, and this effect was stronger at middle strand positions compared with edge strand positions. Most notably, we observed stabilization from charged amino acids on the first and last turns of a-helices when these charges counteract the C-to-N negative-to-positive helical dipole; charges that enhanced the dipole were destabilizing. We isolated this effect by comparing the average stability of each amino acid on first and last helical turns with the average stability of each amino acid at all helical sites (polar sites only in both cases,). This effect remained significant in our data even when we restricted the analysis to positions that were Arg or Lys in the original designs to control for any bias in the designed structures favoring original, designed residues compared with mutant residues, although no significant effect was seen for mutations at Glu positions. We had not examined agreement with this dipolar preference during the four rounds of design, and after this observation, we found that the net favorable charge on first and last helical turns (stabilizing charges minus destabilizing charges summed over all helices) discriminated between stable and unstable fourth-generation ααα designs better than any other metric we examined, explaining in part why the success rate had not reached 100%.

In the three naturally occurring proteins, mutations at conserved positions were generally destabilizing, although each natural protein possessed several highly conserved positions that we experimentally determined to be unimportant or deleterious to stability. In villin HP35, these were W64, K70, L75, and F76 (villin HP35 consists of residues 42-76), which are required for villin to bind F-actin . In pinl, the highly conserved S16 is deleterious for stability, but directly contacts the phosphate on phosphopeptide ligands of pinl (55), highlighting a stability-function trade-off in pinl discoverable without directly assaying function, the conserved residues H32, T37, and W39 are relatively unimportant for stability, but these residues form the peptide recognition pocket in YAP-family WW-domains. These examples illustrate how our approach enables high-throughput identification of functional residues, even without a functional assay or a protein structure, via comparison between stability data and residue conservation.

Stability measurement of all known small protein domains

How stable are these designed proteins compared with naturally occurring proteins?

To examine this, we synthesized DNA encoding (1) all 472 sequences in the protein databank (PDB) between 20 and 50 residues in length and containing only the 19 non-Cys amino acids, and (2) one representative for all 706 domains meeting these criteria in the Pfam protein family database. We included this DNA (and DNA for all stable designs from rounds 1-3) in the library containing our fourth-generation designs to facilitate a head-to-head comparison. Many of the PDB structures are multimeric, and the most resistant overall sequence

(measured by stability score) was a C-terminal coiled-coil domain from a TRP channel (3SRO, stability score 1.93). Of the 100 unique, monomelic sequences with PDB structures, the most protease-resistant was a peripheral subunit binding domain (ααα topology) from the thermophile Bacillus stearothermophilus (2PDD, stability score 1.48,), which has been heavily studied as an ultrafast-folding protein. A total of 774 designed proteins had higher stability scores than this most protease-resistant natural monomeric protein. The number of stable proteins discovered is orders of magnitude greater than that of natural proteins in the PDB (monomeric or not) in this size range.

Conclusion

We have shown that proteins can be computationally designed and assayed for folding thousands at a time, and that high-throughput design experiments can provide quantitative insights into the determinants of protein stability. Large libraries can be designed in a relatively unbiased manner (as in our first generation) to maximize the protein property space examined, or properties can be tuned to increase the design success rate at the cost of diversity. The power of our iterative learning approach to progressively hone in on more subtle contributions to stability is highlighted by the progression of our ααα design sets from early rounds in which design failures were caused by insufficient buried nonpolar surface area to the last round where helix-sidechain electrostatics had the greater effect. The large numbers of folded and not folded designs will also provide stringent tests of molecular dynamics simulation approaches which have successfully reproduced structures and some thermodynamic measurements of natural proteins, but have not yet been challenged with plausible but unstable protein structures like our design failures.

The four solution structures, saturation mutagenesis data on 13 of 14 designs, and over thirty thousand negative control experiments indicate that the large majority of our stable sequences are structured as designed. These thousands of designed proteins, stable without disulfides or metal coordination, should have numerous applications in

biotechnology and synthetic biology. Many are more stable than any comparably-sized monomelic proteins found in the PDB, making them ideal scaffolds for engineering inhibitors of intracellular protein-protein interactions. Their small size may also help promote membrane translocation and endosomal escape. As DNA synthesis technology continues to improve, high-throughput protein design will become possible for larger proteins as well, revealing determinants of protein stability in more complex structures and leading to a new era of iterative, data-driven de novo protein design and modeling.

Methods

Protein design

All protein design in this work was performed in three stages: (1) backbone construction, (2) sequence design, and (3) selection of designs for testing. Backbone construction (the de novo creation of a compact, three-dimensional backbone with a pre- specified secondary structure) was performed using a blueprint-based approach described previously (34, 54). Briefly, blueprint files were built by hand for each topology in order to define (a) the secondary structure at each residue position for that topology, and (b) the strand pairing and register of any β-sheets. These blueprint files were then used to select short three- dimensional fragments from protein crystal structures matching the proposed secondary structure in the blueprint (200 fragments for every 3- and 9-residues-length stretch of the blueprint). Finally, these fragments were assembled into a full protein backbone using Monte Carlo sampling with a coarse-grained energy function (including constraints on the hydrogen- bonding pairs of residues as specified for the β-sheets in the blueprint) until the overall backbone matched the specified secondary structure and topology, satisfied compactness criteria, and avoided steric clashes. The same four HHH blueprints (each 43 residues in length) were used for all four design rounds. One 40-residue-length EHEE blueprint was used for all four design rounds. A total of 42, 12, 27, and 7 HEEH blueprints (each 43 residues in length) were used in design rounds 1-4 respectively. Blueprints for each round were selected based on the stabilities of designs from the prior round; new blueprints were also introduced in design rounds 2 and 3. A total of 2, 1, 4, and 7 EEHEE blueprints were used in design rounds 1-4 respectively. New blueprints were introduced in design round 3 that increased the protein length from 41 residues to 42 or 43 residues in order to increase the size of the potential hydrophobic core and increase the helix length (blueprints for design rounds 1 -4 were 41, 41 , 41/42/43, and 43 residues long respectively).

Each backbone structure produced above was used as the input to the Rosetta™ sequence design protocol FastDesign™, also described previously (33). This protocol alternates between (a) a fixed-backbone Monte Carlo search in sequence and rotamer space, and (b) a fixed-sequence backbone relaxation step. This protocol begins with a softened repulsive potential and restores this potential to full strength across several cycles of design and relaxation. These design steps employ the Rosetta full-atom energy function. Design rounds and 1 and 2 employed the Talaris™2013 version of the energy function (40); design round 3 employed the beta_july 15 version of the energy function, and design round 4 employed the beta_novl5 version of the energy function (19, 55). The allowed amino acids at each position were restricted using the LayerDesign™ protocol (34); these restrictions are imposed separately from the design energy function for more efficient sampling and to account for design criteria not reflected in the energy function, such as solubility. In this protocol, positions on the designed structure are classified into "core", "boundary", and "surface" layers according to their degree of burial, and polar amino acids are excluded from positions in the core layer while nonpolar amino acids are excluded from positions in the surface layer. Layer classification was performed using the "sidechain neighbors" protocol, which counts the number of neighboring residues in the region around the side chain of a given residue. Layer classification is performed on each structure individually and can change during the design process as the structure changes. The definitions of each layer (e.g. the level of burial required for a residue to be classified as core or boundary) were adjusted from design round to design round in order to increase the number of positions where hydrophobic amino acids were permitted. The specific layer definitions used at each stage are given in the included design scripts. For finer control over hydrophobic and polar positioning for the Round 4 designs, we manually specified the allowed amino acids at each position in the designed topologies using resfiles. Starting from design round 3, we included an amino acid composition-based energy term in the design energy function that penalized sequences possessing too few nonpolar amino acids. Finally, to limit the number of designs analyzed at the final selection stage, designs were filtered following sequence design using several basic criteria (primarily compactness and overall score).

After designing 2,000-40,000 finished designs per topology, we then analyzed and ranked these across diverse structural metrics inside and outside of Rosetta™ in order to select the final set of designs for experimental testing. In design rounds 1-3, this ranking was performed by selecting metrics of interest and assigning weights to each metric (again by hand) in order to produce a single composite design score, which was the sum of each metric multiplied by its weight. The selected metrics and their weights were adjusted until the top- ranking designs appeared optimal to the designer. Only a small number of metrics were used in design round 1 in order to ensure broad sampling of protein properties; additional metrics were added in further design cycles as new causes of failure were identified. Different weights were used for each different topology. All scoring metrics used are defined in the section Methods: Definition of scoring metrics.

The metrics used for ranking designs in Round 1 evaluated each design's overall energy (total_score), β-sheet quality (hbond_lr_bb_per_res), packing (cavity_volume, degree, holes, AlaCount, pack), hydrophobic burial (buried_np, one core each, two_core_each, percent_core_SCN), agreement between sequence and local structure

(mismatch_probability), solubility (exposed_hydrophobics), and hydrogen bond satisfaction (unsat_hbond). Based on the metrics that correlated with design success in Round 1 (as described in the text), we adjusted these weights to select new designs for Round 2, and also added additional metrics related to nonpolar burial (buried_minus_exposed,

buried_over_exposed, contact_all) and the geometric similarity between 9-residue-long fragments of the designs and fragments of natural proteins of similar local sequence

(avg_best_frag, worst6frags, worstfrag). We again adjusted these weights to select designs for Round 3, and added additional measures of local sequence-structure compatibility (abego_res_profile, p_aajpp), fragment quality (avg_all_frags), packing (fa_atr_per_res, ss_sc), nonpolar burial (n_hphob_clusters, largest_hphob_cluster, hphob_sc_contacts), the spacing between nonpolar amino acids along the primary sequence (contig_not_hp_avg, contig_not_hp_max), and solubility (fxn_exposed_is_np). In design round 3, we also introduced restrictions on the sequence similarity of the selected βαββ designs (this topology featured the lowest amount of sequence variation between designs): rather than selecting the highest-ranking designs for testing, we selected designs in rank order while passing over designs that were more than 60% identical to a higher-ranking design already selected for testing.

For design round 4, we employed an automated design ranking scheme. All designs from all rounds were scored with ~50 structural metrics, and the structural metrics and experimental stability scores of the Round 1-3 designs were used to fit topology-specific linear regression, logistic regression, and gradient boosting regressions to predict

experimental outcome (stability score) as a function of the design structural metrics. Models were fit using the scikit.learn package (56). Logistic regressions employed Ll-regularization with C=0.1. Gradient boosting regressions employed 250 estimators with a tree depth of 5, the minimum samples per split set to 5, a learning rate of 0.01 , and a least-squares loss function. All potential Round 4 designs were then ranked according to their predicted stabilities (linear and gradient boosting regression) or success probabilities (logistic regression), and designs were selected for testing in order of the lowest rank given to each design by any of the three regression models. In selecting designs for testing, we again passed over designs that were too similar to designs already selected for testing. We used a threshold of 70% identity for ααα and βαββ designs, and a threshold of 75% ββαββ designs (this was unnecessary for αββα designs). The median identity between designs sharing a topology remained 20-50%.

DNA synthesis

All sequences were reverse translated and codon optimized using DNAworks2.0 (57).

Sequences were optimized using E. coli codon frequencies despite being used for expression in yeast. Oligo libraries encoding designs and control sequences for design rounds 1 and 2 (12,472 sequences per round) were purchased from CustomArray™, Inc. The oligo library for design round 3 (12,524 sequences) was purchased from Twist Bioscience. Oligo libraries for the point mutant library (13,564 sequences) and design round 4 (18,527 sequences, including the natural protein sequences) were ordered from Agilent Technologies in 27,000 feature format and selectively amplified out of the 27,000-sequence pool in the initial qPCR step. In order to amplify all sequences in a library as evenly as possible, we padded all sequences with extra residues until the amplified region of every oligo had the same length. All libraries were 43 residues in length (not including 18bp adapter sequences on both ends), except design round 4 where all sequences were 50 residues in length. The saturation mutagenesis library also included 46-residue length oligos for the hYAP65 sequences. In the 43-residue libraries, the shorter EHEE and EEHEE sequences were padded with GSS or GS at the N-terminus. In design library 4, EHEE and EEHEE designed sequenced were again padded at the N-terminus with GSS or GS to reach 43 residues, and then all sequences were padded with N, G, and S residues at the C-terminus so that all sequences would be a uniform 50 residues in length.

DNA preparation and sequencing

Oligo libraries were amplified for yeast transformation in two qPCR steps. First, a 10 ng (CustomArray™ libraries) or 2.5 ng (Twist and Agilent libraries) quantity of synthetic DNA was amplified in a 25 μΐ, reaction using Kapa HiFi™ Polymerase (Kapa Biosystems) for 10-20 cycles by qPCR. The number of cycles was chosen based on a test qPCR run in order to terminate the reaction at 50% maximum yield and avoid overamplification. Second, this reaction product was gel extracted to isolate the expected length product, and re- amplified by qPCR as before to generate sufficient DNA for high-efficiency yeast transformation. Each second qPCR reaction used 1/25* of the gel extraction product as template. This second PCR product was PCR purified and concentrated for transformation of EBY100 yeast using the protocol of (58) (3 ug of insert and 1.5 μg of cut vector per transformation ). Yeast display employed a modified version of the pETcon™ vector (59) (known as "pETcon 3"), altered to remove a long single-nucleotide stretch near the cloning region. The amplified libraries included 40bp segments on either end to enable homologous recombination with the pETcon vector. Gel extraction and PCR purification were performed using QIAquick™ kits (Qiagen Inc).

DNA libraries for deep sequencing were prepared as above, except the first step started from yeast plasmid prepared from 5x 10 7 to 1 * 10 8 cells by Zymoprep™ (Zymo Research). Cells were frozen at -80°C before and after the zymolase digestion step to promote efficient lysis. One-half the plasmid yield from the Zymoprep™ was used as the template for the first PCR amplification. Illumina adapters and 6-bp pool-specific barcodes were added in the second qPCR step. Unlike libraries prepared for transformation, DNA prepared for deep sequencing was gel extracted following the second amplification step. All libraries before and after selections were sequenced using Illumina NextSeq™ sequencing. Yeast display proteolysis

S. cerevisiae (strain EBY100) cultures were grown and induced as in (26). Following induction, cell density (O.D.600) was measured by NanoDrop™, and an amount of cells corresponding to 1 mL at O.D. 1 (12-15M cells) was added to each microcentrifuge tube for proteolysis. Cells were washed and resuspended in 250 uL buffer (20 mM NaPi 150 mM NaCl pH 7.4 (PBS) for trypsin reactions, or 20 mM Tris 100 mM NaCl pH 8.0 (TBS) for chymotrypsin reactions). Proteolysis was initiated by adding 250 uL of room temperature protease in buffer (PBS or TBS) followed by vortexing and incubating the reaction at room temperature (proteolysis reactions took place at cell O.D. 2). After 5 minutes, the reaction was quenched by adding 1 mL of chilled buffer containing 1% BSA (referred to as PBSF or TBSF), and cells were immediately washed 4x in chilled PBSF or TBSF. Cells were then labeled with anti-c-Myc-FITC for 10 minutes, washed twice with chilled PBSF, and then sorted using a Sony SH800 flow cytometer using "Ultra Purity" settings. Events were initially gated by forward scattering area and back scattering area to collect the main yeast population, and then by forward scattering width and forward scattering height to separate individual and dividing cells (which were used for analysis) from cell clumps (which were discarded).

Following these gates, cells were gated by fluorescence intensity in one-dimension (Fig. IB), with the threshold separating displaying (fluorescent) from non-displaying (non-fluorescent) cells set at ~2,200 fluorescence units (Fig. IB). Small adjustments were made to this gate based on daily conditions to maximize the separation between the major displaying and non- displaying populations. For each sort, we recorded the fraction of cells passing the fluorescence threshold before proteolysis (using cells from the same starting yeast population, but untreated with protease) and after proteolysis, and also recorded the total number of cells collected for each condition.

Design libraries 1-4 were assayed at six protease concentrations over three sequential selection rounds. Trypsin assays used 0.07 uM, 0.21 μΜ, 0.64 μΜ, 1.93 μΜ, 5.78 uM, and 17.33 uM protease; chymotrypsin assays used 0.08 μΜ, 0.25 uM, 0.74 uM, 2.22 μΜ, 6.67 μΜ, and 20.00 uM protease. Selections using the lowest two concentrations of each protease (0.07 μΜ and 0.21 μΜ trypsin and 0.08 μΜ and 0.25 uM chymotrypsin) were performed starting from the narve yeast library. The middle two selections (0.64 μΜ and 1.93 μΜ trypsin and 0.74 μΜ and 2.22 μΜ chymotrypsin) were performed starting from the post- selection 0.21 μΜ trypsin or 0.25 μΜ chymotrypsin cultures after 12-24 hours of growth and 12-24 hours of fresh induction. The highest concentration selections were performed starting from the post-selection 1.93 μΜ trypsin or 2.22 μΜ chymotrypsin cultures again following growth and re-induction.

The saturation mutagenesis library was assayed at six (trypsin) or eight

(chymotrypsin) protease concentrations over four sequential selection rounds. Trypsin assays used 0.41 μΜ, 0.81 μΜ, 1.63 uM, 3.25 μΜ, 6.50 μΜ, and 13.00 μΜ protease; chymotrypsin assays used 0.21 μΜ, 0.42 uM, 0.84 μΜ, 1.69 uM, 3.38 μΜ, 6.75 μΜ, 13.50 μΜ, and 27.00 μΜ protease. As before, selections 1 and 2 were performed starting from the naive library, selections 3 and 4 were performed starting from the selection 2 culture following growth and re-induction, selections 5 and 6 were performed starting from the selection 4 culture following growth and re-induction, and selections 7 and 8 (only done for chymotrypsin) were performed starting from the selection 6 culture following growth and re-induction. For trypsin, selection 6 was performed starting from the selection 5 culture following growth and re-induction.

Protease reagents

Trypsin-EDTA (0.25%) solution was purchased from Life Technologies and stored at stock concentration (2.5 mg/mL) at -20°C. a-Chymotrypsin from bovine pancreas was purchased from Sigma- Aldrich as lyophilized powder and stored at 1 mg/mL in TBS +100 mM CaCl 2 at -20°C. Each reaction used a freshly thawed aliquot of protease. The trypsin stock activity was measured to be 5,410 ± 312 BAEE units (ΔΑ253 x 1,000 / 1 minute) per mg in PBS buffer, pH 7.4, with 0.23 mM BAEE (Sigma-Aldrich). Using the Pierce Fluorescent Protease Assay Kit with the fluorescence protocol (ThermoFisher Scientific), 1 mg of the chymotrypsin stock was measured to have equivalent activity to 3.74 ± 0.31 mg of the trypsin stock in pH 7.4 PBS buffer at 25°C.

Processing of raw deep sequencing data

Each library in a sequencing run was identified via a unique 6 bp barcode. Following sequencing, reads were paired using the PEAR™ program (60). Reads were considered counts for a particular ordered sequence if the read (1) contained the complete Ndel cut site sequence immediately upstream from the ordered sequence, (2) contained the complete Xhol cut site sequence immediately downstream from the ordered sequence, and (3) matched the ordered sequence at the amino acid level (for sequences in designed libraries 1-4) or at the nucleotide level (sequences in the saturation mutagenesis library). A higher stringency was used for the saturation mutagenesis library due to the overall similarity of the sequences in the library.

EC 50 estimation from sequencing counts

To determine protease resistance from our raw sequencing data we built a probabilistic model of the cleavage and selection procedure and used this model to calculate maximum a posteriori estimates of the protease EC 50 of each member of the pool. To build the model, we assumed that proteolysis (i.e. any cleavage that results in detachment of the epitope tag) follows pseudo-first order kinetics, with a rate constant specific to each sequence. The fraction of surviving, tagged surface proteins for a given sequence after proteolysis is therefore:

where k p is a sequence-specific rate constant, [E] is the concentration of protease and / is the reaction time.

In the assay, each cell has a labeling intensity proportional to the number of displayed proteins on its surface. Within the expressing population of cells, we assumed that the number of displayed proteins per cell is log-normally distributed, resulting a distribution of labeling intensities with sequence-independent expression location and scale parameters// and σ. The fraction of cells collected at the labeling threshold L ce ii > L s is then given by the cumulative distribution function:

Following proteolysis, labeling intensity is given by Lpo St = L ce ii * f sprot and cells are collected when Lpo Sl > L s . With a fixed selection level L s (defined as e c * in terms of log- intensity rather than absolute intensity) across selection rounds, the fraction of cells collected after proteolysis in each round is given by:

For model fitting it is advantageous to describe protease stability in terms of a sequence-dependent variable EC so and a sequence-independent variable K sel . The EC$o for each sequence is defined as the protease concentration at which half of all cells displaying that sequence pass selection. K se , is a constant term representing expression and collection conditions. Setting allows us to define k p t in terms of the sequence-specific EC50 ' .

Substituting (7) into (6) and grouping the sequence-independent terms/., σ, and c s into Ksei yields the form of the model used for fitting:

We modeled each selection experiment as a set of discrete selection events producing both (A) a difference in the observed library population distribution after selection, and (B) a global selection rate during the sorting experiment. For each round of selection with enzyme concentration an observed input population distribution P in is updated by a sequence-

dependent proteolysis rate to produce an unobserved distribution of labeled cells Pcieave ' .

for all sequences /, where the denominator of (11) normalizes Pcieave so that

1.

In each selection, total cells are examined, of which the cells passing the

labeling threshold are collected. Th collected cells are randomly selected from

during sorting to produce the observed post-selection distribution P sel - To account for carryover during sorting, contamination during library preparation, and sequencing errors that can cause a sequence / to appear in the selected population even when set a lower bound on Frac sel equal to 10 -4 .

Each library was analyzed by multiple rounds of selection, where the resulting population of a round may be used as the source population for a subsequent selection round at a higher protease concentration (see Methods: Yeast display proteolysis for details). For each round, the observed distribution of matching reads in the pre-selection library is used as P in , which is normalized to sum to 1. The (non-normalized) post-selection distribution P se/ (which sums to the total number of cells collected n sel ) is computed by multiplying n set by the observed normalized distribution of matching reads in the post-selection library. These definitions of Pj„ and P sel assume that there are no sequence-dependent effects in

amplification efficiency or sequencing efficiency. If a sequence did not appear in the input distribution P in , any observations of that sequence in the selected population P sel were ignored.

Because the model only considers sequences that match those in the designed library (i.e. only library sequences are included in P in and P sel ), the number of cells collected for n sel is smaller than the total number of cells collected as reported by the flow cytometer. To account for this, we crudely approximated n sel as the number of cells collected by the flow cytometer multiplied by the fraction of sequencing reads that matched sequences in the designed library (i.e. if 200,000 cells were collected in a sort and 75% of sequencing reads for that sort matched library sequences, we assumed in the model that 150,000 cells containing library sequences had been collected). In rare cases where the total number of matching sequencing reads for a given library was smaller than the estimated n sel (thus the statistical error was limited by sequencing rather than sorting), the number of matching sequencing reads was used as n sel . Finally, to calculate the total number of cells observed (containing library sequences) n^ay, we assumed that the overall collection rate (i.e. one collected cell for every twenty observed cells) could be used as a proxy measure of the collection rate for library sequences (as would be true if library sequences dominated the overall cell population, or if the collection rates for library and non-library cells were approximately equal). We calculated the overall collection rate directly from the flow cytometry data as the fraction of initially displaying cells that remained displaying following proteolysis, and then calculated n assay by dividing n sel by the overall collection rate.

The complete model log-likelihood is the sum of the data-log likelihoods of P sel and n sel and prior likelihoods over the fit parameter ECso, taking P in and n assay as given. K sel was initially treated as a fit parameter as well, but for consistency between all libraries, we fixed Ksei at 0.8 for all analysis in this work. The log-likelihood of the observed population P sel was modeled as a multinomial distribution of n sel independent selections from

The log-likelihood of the observed global selection rate was modeled as a binomial distribution of selection events, where the selection probability is the weighted mean of sequence-dependent proteolysis rates:

for all sequences /. Uniform priors covering the range of experimentally relevant values were used for the model parameters. The MAP estimate of the model parameters is found by optimizing the expression:

The 95% credible intervals for all EC 50 S were also estimated from the likelihood expression given in (14). All model components were implemented in Python via PyMC3 (61) and Theano™ (62).

For the analysis of design success rates and design features correlating with success, we excluded sequences whose model-estimated EC 50 credible intervals were large. To include as much data as possible, we used a permissive threshold: designs were included in the analysis if their chymotrypsin and trypsin stability score 95% credible intervals (directly taken from the EC 50 credible intervals) were smaller than 0.95 stability score units (A factor of 9 in [protease]; the equivalent of two rounds of sorting). These thresholds excluded from analysis 14%, 30%, 0.7%, and 1% of sequences from design rounds 1-4 respectively.

Credible intervals were much narrower in the later rounds due to improved DNA libraries and better representation of each design in sorting. Despite the permissive thresholds, the median 95% credible interval width for stability scores for sequences included in the analysis was 0.14 stability score units, and 95% of the credible intervals were smaller than 0.48 stability score units.

Unfolded state model

We trained a model for the expected protease EC 50 of an unfolded protein using stability data on scrambled sequences. We used both fully scrambled sequences and hydrophobic-polar pattern-preserving scrambled sequences as training data (~18,000 sequences in total). Only sequences with EC 50 value 95% credible intervals smaller than a factor of 3 in [protease] were used for model fitting (protease concentrations increased by this amount at each selection step).

To define the model, we separated the cutting rate into a fixed rate for the constant regions of the fusion construct and an individual rate of cutting at each site i in the inserted sequence. This was done by rearranging equation (7):

where kf is the pseudo-first order rate constant for the constant regions of the fusion construct in Menzyme -1 s -1 , k t is the cleavage rate after amino acid i for all n residues in the inserted sequence (same units), and t is time. If we assume that (1) the inserted sequence cannot affect the cleavage rate of the constant sequence, and (2) that the inserted sequence is completely uncleaved (all k,- = 0), then the EC 50 reaches a maximum that is independent of the inserted sequence:

By dividing the numerator and denominator on the right-hand-side of (15) by kft, we can re- write (15) as:

We modeled k f I k/as a function of the 9-residue-long local sequence surrounding sequence position /. In other words, the cut rate at site * in the model depends on the amino acid identities at sites i - 4 through / + 4, referred to as sites P5 to P4' in protease nomenclature. The effects of the sequence at these positions are implemented through a position-specific scoring matrix (PSSM) with coefficients for all 19 amino acids (excluding cysteine) at positions P5-P4' around a potential cut site. The model for Λ, / k/as a function of the local sequence is given below:

where aa sUe is the amino acid identity at site. The parameters of the full model are ECsomax, k max , c 0t and the 19 χ 9 = 171 elements of the PSSM. Distributing the c 0 term into the PSSM coefficients would not affect the model (c 0 adds no additional model freedom), but including the term aided model fitting. Positive PSSM coefficients lead to a smaller denominator in eqn. (18), an increased cutting rate ki, and a lower predicted ECso by eqn. (17). The model parameters (referred to collectively as Θ) were trained by minimizing the logarithmic error between the model predicted EC 50 s and the observed EC 50 s over the training set of scrambled sequences. We used a combination of squared-error and absolute error in the objective function to provide slightly more tolerance for large outliers than squared-error alone.

We trained the model starting from a uniform PSSM by iterating between fitting only the PI component of the PSSM, all other positions of the PSSM, and the and

co terms. We validated the model using three-fold cross validation (three separate models built by excluding a different one-third of the data at a time, followed by predicting each EC 50 using the model that did not encounter that sequence during training). The cross- validated root-mean-squared errors (RMSE) for trypsin and chymotrypsin are 2% (trypsin) and 5% (chymotrypsin) higher than the RMSEs of the models trained using all data without cross-validation, indicating minimal overfitting. The predictions made by the cross-validated models are very similar to the predictions made by the models trained on the complete dataset.

All stability scores reported herein were calculated using the final version of the unfolded state model, trained using the scrambled sequence data from all four design rounds. Obviously, data on the full set of scrambled sequences was not available at earlier stages of the work. The data analysis after each design round employed earlier versions of the unfolded state model that were trained using the scrambled sequence data that had been collected up to that point. However, the final model predictions used for the manuscript are very similar to the model predictions made when only the data from Round 1 are used for training.

Because the unfolded state model is trained on EC 50 S of scrambled sequences and not on designed sequences, a systematic bias may be introduced that would cause scrambled sequences to receive lower stability scores than designed sequences (the stability score is the deviation of each sequence's measured EC 50 from the unfolded state model's predicted EC 50 ; if the model were overfit, the sequences used in training would have incorrectly low deviations). However, the cross-validation results indicate that only minimal overfitting is present in the model parameters. To further quantify possible bias in the model parameters, we examined the distribution of predicted unfolded state EC 50 values for scrambled sequences (which were used in training) and designed sequences (which were not). We would expect these distributions to be the same because the designed and scrambled sequences are very similar at the sequence level. However, on average, the predicted unfolded state EC 50 values for the scrambled sequences are higher than the predicted EC ;o values for the designed sequences, which biases the scrambled sequences to appear less stable, although this effect is small. This bias likely results from overfitting of EC 50 values for partially folded scrambled sequences. Overall, the small bias (0.15-0.16 units of stability score on average) between designed sequences and scrambled sequences does not change the conclusion that hundreds of the designs at each stage are many times more stable than the scrambled sequences, often by 0.5-1.0 stability score units or more.

Protein expression and purification

Two different expression vectors were used to purify the designs chosen for biophysical analysis. Most designs were expressed as isolated domains with an additional 21- residue N-terminal sequence containing a His-tag and thrombin cleavage site to aid purification. Genes encoding these designs were obtained from GenScript™ in the pET-28b+ expression vector. The remaining designs were expressed as fusions with the yeast SUMO domain Smt3 using the custom vector pCDB24. Genes encoding these designs were obtained as gBlocks™ from IDT and inserted into the pCDB24 vector via Gibson assembly (63).

All designs were expressed in E. coli BL21* (DE3) cells (Invitrogen). Starter cultures were grown overnight at 37°C in Luria-Bertani (LB) medium overnight with added antibiotic (50 μg/ml carbenicillin for SUMO expression or 30 μg/ml kanamycin for pET-28b+ expression). These overnight cultures were used to inoculate 500 mL of Studier autoinduction media (64) supplemented with antibiotic, and grown overnight. Cells were harvested by centrifugation at 4°C, resuspended in 25 mL lysis buffer (20 mM imidazole in PBS containing DNAse and protease inhibitors), and lysed by sonication or by microfluidizer. PBS buffer contained 20mM NaP0 4 , 150mM NaCl, pH 7.4. After removal of insoluble material, the lysates were loaded onto nickel affinity gravity columns to purify the designed proteins by immobilized metal-affinity chromatography (IMAC). Designs expressed as fusions to the SUMO domain were then cleaved from the domain using the Yeast SUMO protease Ulpl and dialyzed overnight in PBS at 4°C to remove excess imidazole before a second IMAC step was used to remove the SUMO tag following cleavage.

For expression of 13 C- l5 N-labeled protein for NMR analysis, the plasmids were transformed into the Lemo21 E. coli expression strain (NEB) and plated on M9/glucose plates containing kanamycin to 50 ug/mL and chloramphenicol to 34 ug/mL, grown at 37°C overnight. For the starter culture, a single colony was inoculated into a 250mL baffled flask containing 50mL of Luria-Bertani medium, with kanamycin to 50 ug/mL, chloramphenicol to 34 ug/mL, and grown for approximately 18 hours at 37°C, shaking at 225rpm. 10 mL of the starter culture was then transferred to a 2L baffled flask containing 0.5L of Terrific Broth (Difco), with 25mM Na 2 HP0 4 , 25mM KH 2 P0 4 , 50mM NH 4 CI, 5 mM Na 2 S0 4> kanamycin to 50 ug/mL, and chloramphenicol to 34 ug/mL. This expression culture was grown at 37°C to an OD 600 of approximately 1.0, then removed from the flask and spun at 4000rpm for 15 minutes to pellet the cells. The Terrific Broth was removed, and the cells were washed briefly with 30 mL of PBS. The cells were then transferred to a new 2L baffled flask containing 0.5 L of labeled media (25mM Na 2 HP0 4 , 25mM KH 2 P0 4 , 50mM l5 NrLiCl, 5 mM Na 2 S0 4 , 0.2% (w/v) 13 C glucose), kanamycin to 50 ug/mL and chloramphenicol to 34 ug/mL. The cells were allowed to recover at 37°C for 30 minutes, then IPTG (Carbosynth) was added to ImM and the temperature was reduced to 20°C. The cells were harvested the following day and purified by IMAC. The labeled NRjCl and glucose were obtained from Cambridge Isotopes. Size-exclusion chromatography

Following IMAC, designs (labeled and unlabeled) were further purified by size- exclusion chromatography on AKTAxpress™ (GE Healthcare) using a Superdex™ 75 10/300 GL column (GE Healthcare) in PBS buffer. The monomeric fraction of each run (typically eluting at the 15 mL mark) was collected and immediately analyzed by CD or flash frozen in liquid N 2 for later analysis.

Circular dichroism

Far-ultraviolet CD measurements were carried out with an AVIV spectrometer, model 420. Wavelength scans were measured from 260 to 195 nm at 25 and 95°C. Temperature melts monitored dichroism signal at 220 nm in steps of 2°C/minute with 30s of equilibration time. Wavelength scans and temperature melts were performed using 0.35 mg/ml protein in PBS buffer (20mM NaP0 4 , 150mM NaCl, pH 7.4) with a 1 mm path-length cuvette.

Chemical denaturation experiments with guanidinium hydrochloride (GuHCl) were performed using an automatic titrator with a protein concentration of 0.035 mg/ml and a 1 cm path-length cuvette with stir bar. The GuHCl concentration was determined by refractive index in PBS buffer. The denaturation process monitored dichroism signal at 220 nm in steps of 0.2 M GdmCl with 1 minute mixing time for each step, at 25°C. Protein concentrations were determined by absorbance at 280 nm measured using a NanoDrop™ spectrophotometer (Thermo Scientific) using predicted extinction coefficients (65). Protein concentrations for designs lacking aromatic amino acids were measured by Qubit protein assay (ThermoFisher Scientific).

Melting temperatures were determined by first smoothing the data with a Savitsky- Golay filter of order 3, then approximating the smoothed data with a cubic spline to compute derivatives. The reported Tm is the inflection point of the melting curve. Chemical denaturation curves were fitted by nonlinear regression to a two-state unfolding model with six-parameters: the folding free energy, m-value, and linear pre- and post-transition baselines with individual slope and intercepts (66).

NM R structure determination

NMR data acquisition was carried out at 25°C (HHH_rdl_0142, EHEE_rdl_0284, and EEHEE_rd3_1049) or 15°C (HEEH_rd4_0097) on Bruker spectrometers operating at 600 or 800 MHz, and equipped with cryogenic probes. All 3D spectra were acquired with non-uniform sampling schemes in the indirect dimensions and were reconstructed by multidimensional decomposition software MDDNMR (67) or (68), interfaced with NMRPipe™ (69). Conventional backbone and NOESY spectra were acquired as described previously (70), and the automated program ABACUS™ (77) was used to aide in the assignment of backbone and sidechain resonances. Initial automated NOE assignments and structure calculations were performed using the noeassign module in CYANA™ 3.0 (72). The best 20 of 100 CYANA™ structures from the final cycle were refined with CNSSOLVE (73) by performing a short restrained molecular dynamics simulation in explicit solvent (74). The NMR structures of the constructs are comprised of the final 20 refined structures.

Fragment analysis

To evaluate agreement between sequence and structure for a given designed protein, we used Rosetta's standard fragment generation protocol (/ 7) to select 200 fragments (9- residue length segments) from natural protein crystal structures for each 9-residue-long segment of the designed protein. The fragments were chosen so that their sequence and secondary structure were as similar as possible to the sequence and predicted secondary structure of the designed protein segment (predicted using PSIPRED™ (75)). If these fragments are highly geometrically similar to the designed segment (measured by RMSD), this indicates that the designed sequence preferentially adopts the designed fold even at the local level, because each local sequence segment is commonly found in its designed local structure when found in solved protein structures.

Mutational stability effects Instead of using the minimum of the trypsin and chymotrypsin stability scores as an overall stability score for sequences in the point mutant library, we took advantage of the hundreds of mutants available for each protein to calibrate the trypsin and chymotrypsin stability scales in relation to each other for each set of mutants (i.e. mutants of the same wild- type protein). For example, mutations in EHEE_rdl_0882 that cause a chymotrypsin stability score change of 1.0 typically cause a trypsin stability score change of 1.2 (i.e. the slope of the best-fit line is 1.2; the r 2 for the two datasets is 0.77). However, mutations to

EEHEE_rd3_0037 that cause a chymotrypsin stability score change of 1.0 cause a much larger trypsin stability score change of 2.6 (r 2 = 0.71). Because each set of mutants had a characteristic slope, we used these slopes to combine the trypsin and chymotrypsin measurements and compute a consensus stability score for each mutant. These consensus stability scores were assigned in four steps. First, we identified the subset of mutants for each wild-type protein with high-confidence EC 50 values (i.e. those that were precisely measured and were within the dynamic range of protease concentrations tested). Second, these high- confidence measurements were then used to determine the slope and intercept of the best-fit line between the trypsin and chymotrypsin stability scores for each protein by orthogonal distance regression. Third, we mapped each mutant of a given protein onto the best-fit line for that protein at one of three positions: the nearest point on the fit line, the point on the fit line at an identical x-coordinate (chymotrypsin stability score), or the point on the fit line at an identical y-coordinate (trypsin stability score) Finally, after mapping all points onto the fit line, we used the x-coordinate of the mapped point (the location of that point on the chymotrypsin axis) as the overall consensus stability score for each mutant.

In examining the best-fit lines between the trypsin and chymotrypsin measurements for each mutant set, we observed that mutants whose predicted unfolded state chymotrypsin §§ 50 values changed significantly from the predicted unfolded state wild-type EC 50 value were often outliers in the fit. These outliers suggested that the chymotrypsin unfolded state model was oversensitive to the effects of single amino acid changes, distorting the fits. To improve the estimation of consensus stability scores, we restricted the deviation between each mutant's predicted unfolded state EC 50 and the wild-type predicted unfolded state EC 50 to a factor of 2.64 in [chymotrypsin] (2 1 4 ). This only affected 1.2% of mutants, and would not have affected any results in Fig. 1 (no mutants in Fig. 1 deviate by this amount from the wild- type predicted unfolded state EC 50 value.)

The average stability effects of each amino acid were calculated using the consensus stability scores described above. To compute the average stability effects using the data, we used the average stability score of the A, E, H, I, M, T, and V mutants at each position as the "baseline" stability at each position (i.e. the average stability score of these mutants was used as the zero-point for a new position-specific stability scale). These amino acids were chosen because they included the different types of amino acid physical properties (polar and hydrophobic, large and small) and because these amino acids generally have minimal impact on a sequence's unfolded state predicted EC 50 with either trypsin or chymotrypsin. We then computed the stability of each amino acid at each position relative to this baseline. Finally, we averaged these re-zeroed stability scores across all the different protein sites in a given category (i.e. polar helical positions, edge strand positions, etc.) to determine the average stability effect of each amino acid for that category. The re-zeroing procedure for each site did not affect the relative average stabilities of the different amino acids shown in the figure, but it did lower the associated standard error by removing irrelevant variation in overall protein stability from the measurement. The average stability effects were adjusted a final time to set the mean value of all 20 amino acids to zero. Helical positions were considered to be polar if their wild-type (designed) amino acid was D, E, H, K, N, Q, R, S, T, or Y. The first and last helical turns were defined as the first and last three residues of each helix. Natural protein compilation

Our 1 ,178 natural proteins were compiled by querying the PDB on February 4 th , 2016 for all structures containing only protein (no lipid, carbohydrate, or nucleic acid), with 1 chain per asymmetric unit, chain length 20-50aa, and no modified residues. We then manually filtered this list to remove all sequences containing Cys residues. Pfam sequences were collected from the seed database of Pfam 28.0, taking the first representative of all families lacking a Cys residue.

Conservation analysis in naturally occurring proteins

Homologous sequences for villin HP35, pinl WW-domain, and hYAP65 WW- domain were identified using HHblits (76) with an e- value cutoff of le-10 and 4 iterations. HHfilter was used to remove sequences that were more than 90% identical or that covered less than 90% of the query sequence (77). Bits of conservation were calculated using WebLogo 3.3 (78).

The secondary structural arrangements of peptides designed are provided below in Table 1; the amino acid sequences are provided in the accompanying sequence listing. References:

1. K. A. Dill, Dominant forces in protein folding. Biochemistry. 29, 7133-7155 (1990).

2. A. D. Robertson, K. P. Murphy, Protein Structure and the Energetics of Protein Stability.

Chem. Rev. 97, 1251-1268 (1997).

3. C. Nick Pace, J. Martin Scholtz, G. R. Grimsley, Forces stabilizing proteins. FEBSLett.

588, 2177-2184 (2014).

4. H. Gelman, M. Gruebele, Fast protein folding kinetics. Q. Rev. Biophys. 47, 95-142 (2014).

5. X. Jiang, J. Kowalski, J. W. Kelly, Increasing protein stability using a rational approach combining sequence homology and structural alignment: Stabilizing the WW domain.

Protein Sci. 10, 1454-1465 (2001).

6. M. Jager et al., Structure-function-folding relationship in a WW domain. Proceedings of the National Academy of Sciences. 103, 10648-10653 (2006).

7. S. Xiao, Y. Bi, B. Shan, D. P. Raleigh, Analysis of core packing in a cooperatively

folded miniature protein: the ultrafast folding villin headpiece helical subdomain.

Biochemistry. 48, 4607-4616 (2009).

8. H. Neuweiler et al., The folding mechanism of BBL: Plasticity of transition-state

structure observed within an ultrafast folding protein family. J. Mol. Biol. 390, 1060- 1073 (2009).

9. C. N. Pace et al., Contribution of hydrophobic interactions to protein stability. J. Mol.

Biol. 408, 514-528 (201 1).

10. C. L. Araya et al., A fundamental protein property, thermodynamic stability, revealed solely from large-scale measurements of protein function. Proc. Natl. Acad. Sci. U. S. A. 109, 16858-16863 (2012).

U. S. Xiao et al., Rational modification of protein stability by targeting surface sites leads to complicated results. Proc. Natl. Acad. Sci. U. S. A. 110, 1 1337-11342 (2013).

12. C. N. Pace et al., Contribution of hydrogen bonds to protein stability. Protein Sci. 23, 652-661 (2014).

13. K. Lindorff-Larsen, S. Piana, R. O. Dror, D. E. Shaw, How Fast-Folding Proteins Fold.

Science. 334, 517-520 (201 1 ).

14. S. Piana, K. Lindorff-Larsen, D. E. Shaw, Protein folding kinetics and thermodynamics from atomistic simulation. Proc. Natl. Acad. Sci. U. S. A. 109, 17845-17850 (2012).

15. H. Nguyen, J. Maier, H. Huang, V. Perrone, C. Simmerling, Folding simulations for proteins with diverse topologies are accessible in days with a physics-based force field and implicit solvent. J. Am. Chem. Soc. 136, 13959-13962 (2014).

16. P.-S. Huang, S. E. Boyken, D. Baker, The coming of age of de novo protein design.

Nature. 537, 320-327 (2016). 17. C. A. Rohl, C. E. M. Strauss, K. M. S. Misura, D. Baker, Protein structure prediction using Rosetta. Methods Enzymol. 383, 66-93 (2004).

18. T. J. Magliery, Protein stability: computation, sequence statistics, and new experimental methods. Curr. Opin. Struct. Biol. 33, 161-168 (2015).

19. H. Park et al, Simultaneous Optimization of Biomolecular Energy Functions on Features from Small Molecules and Macromolecules. J. Chem. Theory Comput. 12, 6201-6212 (2016).

20. B. I. Dahiyat, S. L. Mayo, De novo protein design: fully automated sequence selection.

Science. 278, 82-87 (1997).

21. H. Liang et al., De Novo Design of a βαβ Motif. Angew. Chem. Int. Ed. 48, 3301-3303 (2009).

22. S. Kosuri, G. M. Church, Large-scale de novo DNA synthesis: technologies and

applications. Nat. Methods. 11, 499-507 (2014).

23. E. T. Boder, K. D. Wittrup, Yeast surface display for screening combinatorial

polypeptide libraries. Nat. Biotechnol. 15, 553-557 (1997).

24. C. Park, S. Marqusee, Pulse proteolysis: a simple method for quantitative determination of protein stability and ligand binding. Nat. Methods. 2, 207-212 (2005).

25. C. Park, S. Zhou, J. Gilmore, S. Marqusee, Energetics-based protein profiling on a

proteomic scale: identification of proteins resistant to proteolysis. J. Mol. Biol. 368, 1426-1437 (2007).

26. V. Sieber, A. Pluckthun, F. X. Schmid, Selecting proteins with improved stability by a phage-based method. Nat. Biotechnol. 16, 955-960 (1998).

27. M. D. Finucane, M. Tuna, J. H. Lees, D. N. Woolfson, Core-directed protein design. I.

An experimental method for selecting stable proteins from combinatorial libraries.

Biochemistry. 38, 11604-11612 (1999).

28. M. JSger, M. Dendle, J. W. Kelly, Sequence determinants of thermodynamic stability in a WW domain-an all-beta-sheet protein. Protein Sci. 18, 1806-1813 (2009).

29. G. Bhardwaj et al., Accurate de novo design of hyperstable constrained peptides. Nature.

538, 329-335 (2016).

30. N. Koga et al, Principles for designing ideal protein structures. Nature. 491, 222-227 (2012).

31. S. Kamtekar, J. Schiffer, H. Xiong, J. Babik, M. Hecht, Protein design by binary

patterning of polar and nonpolar amino acids. Science. 262, 1680-1685 (1993).

32. A. R. Davidson, R. T. Sauer, Folded proteins occur frequently in libraries of random amino acid sequences. Proc. Natl. Acad. Sci. U. S. A. 91, 2146-2150 (1994).

33. M. H. Hecht, A. Das, A. Go, L. H. Bradley, Y. Wei, De novo proteins from designed combinatorial libraries. Protein Sci. 13, 1711-1723 (2004). 34. M. D. S. Kumar et ah, ProTherm and ProNIT: thermodynamic databases for proteins and protein-nucleic acid interactions. Nucleic Acids Res. 34, D204-6 (2006).

35. E. G. Baker et ah, Local and macroscopic electrostatic interactions in single a-helices.

Nat. Chem. Biol. 11, 221-228 (2015).

36. D. S. Doering, P. Matsudaira, Cysteine scanning mutagenesis at 40 of 76 positions in villin headpiece maps the F-actin binding site and structural features of the domain. Biochemistry. 35, 12677-12685 (1996).

37. J. Meng et ah, High-resolution crystal structures of villin headpiece and mutants with reduced F-actin binding activity. Biochemistry. 44, 11963-11973 (2005).

38. M. A. Verdecia, M. E. Bowman, K. P. Lu, T. Hunter, J. P. Noel, Structural basis for phosphoserine-proline recognition by group IV WW domains. Nat. Struct. Biol. 7, 639- 643 (2000).

39. B. K. Shoichet, W. A. Baase, R. Kuroki, B. W. Matthews, A relationship between

protein stability and protein function. Proceedings of the National Academy of Sciences. 92, 452-456 (1995).

40. P. A. Chong, H. Lin, J. L. Wrana, J. D. Forman-Kay, An expanded WW domain

recognition motif revealed by the interaction between Smad7 and the E3 ubiquitin ligase SmurG. J. Biol. Chem. 281, 17069-17075 (2006).

41. E. Aragon et ah, Structural basis for the versatile interactions of Smad7 with regulator WW domains in TGF-β Pathways. Structure. 20, 1726-1736 (2012).

42. S. Piana, J. L. Klepeis, D. E. Shaw, Assessing the accuracy of physical models used in protein-folding simulations: quantitative evidence from long molecular dynamics simulations. Curr. Opin. Struct. Biol. 24, 98-105 (2014).

43. J. S. Appelbaum et ah, Arginine topology controls escape of minimally cationic proteins from early endosomes to the cytoplasm. Chem. Biol. 19, 819-830 (2012).

44. J. R. LaRochelle, G. B. Cobb, A. Steinauer, E. Rhoades, A. Schepartz, Fluorescence correlation spectroscopy reveals highly efficient cytosolic delivery of certain penta-arg proteins and stapled peptides. J. Am. Chem. Soc. 137, 2536-2541 (2015).

45. P.-S. Huang et ah, RosettaRemodel: a generalized framework for flexible backbone protein design. PLoS One. 6, e24109 (201 1).

46. A. Leaver-Fay et ah , in Methods in Enzymology (2013), pp. 109-143.

47. R. F. Alford et ah, The Rosetta all-atom energy function for macromolecular modeling and design (2017), , doi:10.1101/106054.

48. F. Pedregosa et ah, Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825-2830 (2011).

49. D. M. Hoover, J. Lubkowski, DNA Works: an automated method for designing

oligonucleotides for PCR-based gene synthesis. Nucleic Acids Res. 30, e43 (2002). 50. L. Benatuil, J. M. Perez, J. Belk, C.-M. Hsieh, An improved yeast transformation method for the generation of very large human antibody libraries. Protein Eng. Des. Sel. 23, 155-159 (2010).

51. G. Chao et al. , Isolating and engineering human antibodies using yeast surface display.

Nat. Protoc. 1, 755-768 (2006).

52. J. Zhang, K. Robert, T. Flouri, A. Stamatakis, PEAR: a fast and accurate Ulumina

Paired-End reAd mergeR. Bioinformatics. 30, 614-620 (2014).

53. A. Patil, D. Huard, C. J. Fonnesbeck, PyMC: Bayesian Stochastic Modelling in Python.

J. Stat. Sofiw. 35, 1-81 (2010).

54. The Theano Development Team et al., Theano: A Python framework for fast

computation of mathematical expressions. arXiv [cs.SC] (2016). (available at http://arxiv.org/abs/1605.02688).

55. D. G. Gibson et al, Enzymatic assembly of DNA molecules up to several hundred

kilobases. Nat. Methods. 6, 343-345 (2009).

56. F. W. Studier, Protein production by auto-induction in high density shaking cultures.

Protein Expr. Purif. 41, 207-234 (2005).

57. C. N. Pace, F. Vajdos, L. Fee, G. Grimsley, T. Gray, How to measure and predict the molar absorption coefficient of a protein. Protein Set. 4, 2411-2423 (1995).

58. M. M. Santoro, D. W. Bolen, Unfolding free energy changes determined by the linear extrapolation method. 1. Unfolding of phenylmethanesulfonyl alpha-chymotrypsin using different denaturants. Biochemistry. 27, 8063-8068 (1988).

59. V. Y. Orekhov, I. Ibraghimov, M. Billeter, Optimizing resolution in multidimensional NMR by three-way decomposition. J. Biomol. NMR. 27, 165-173 (2003).

60. K. Kazimierczuk, V. Y. Orekhov, Accelerated NMR spectroscopy by using compressed sensing. Angew. Chem. Int. Ed Engl. 50, 5556-5559 (2011).

61. F. Delaglio et al., NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J. Biomol. NMR. 6, 277-293 (1995).

62. A. Lemak et al., A novel strategy for NMR resonance assignment and protein structure determination. J. Biomol. NMR. 49, 27-38 (201 1).

63. A. Lemak, C. A. Steren, C. H. Arrowsmith, M. Llinas, Sequence specific resonance assignment via Multicanonical Monte Carlo search using an ABACUS approach. J. Biomol. NMR. 41, 29-41 (2008).

64. P. Guntert, Automated NMR structure calculation with CYANA. Methods Mol. Biol.

278, 353-378 (2004).

65. A. T. Brunger et al, Crystallography & NMR system: A new software suite for

macromolecular structure determination. Acta Ctystallogr. D Biol. Crystallogr. 54, 905— 921 (1998). 66. J. P. Linge, M. A. Williams, C. A. E. M. Spronk, Alexandre M J, M. Nilges, Refinement of protein structures in explicit solvent. Proteins: Struct. Fund. Bioinf. 50, 496-506

(2003) .

67. D. T. Jones, Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195-202 ( 1999).

68. M. Remmert, A. Biegert, A. Hauser, J. Soding, HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods. 9, 173-175 (2011).

69. V. Alva, S.-Z. Nam, J. Soding, A. N. Lupas, The MPI bioinformatics Toolkit as an

integrative platform for advanced protein sequence and structure analysis. Nucleic Acids Res. 44, W410-W415 (2016).

70. G. E. Crooks, WebLogo: A Sequence Logo Generator. Genome Res. 14, 1188-1190

(2004) .

71. Y. Pan et a I., Quantitative proteomics reveals the kinetics of trypsin-catalyzed protein digestion. Anal. Bioanal. Chem. 406, 6247-6256 (2014).

72. V. Schellenberger, K. Braune, H. J. Hofmann, H. D. Jakubke, The specificity of

chymotrypsin. A statistical analysis of hydrolysis data. Eur. J. Biochem. 199, 623-636 (1991).

73. V. Schellenberger, C. W. Turck, L. Hedstrom, W. J. Rutter, Mapping the S' subsites of serine proteases using acyl transfer to mixtures of peptide nucleophiles. Biochemistry. 32, 4349-4353 (1993).

74. V. Schellenberger, C. W. Turck, W. J. Rutter, Role of the S' Subsites in Serine Protease Catalysis. Active-Site Mapping of Rat Chymotrypsin, Rat Trypsin, .alpha.-Lytic Protease, and Cercarial Protease from Schistosoma mansoni. Biochemistry. 33, 4251- 4257 (1994).

75. R. A. Laskowski, M. W. MacArthur, D. S. Moss, J. M. Thornton, PROCHECK: a

program to check the stereochemical quality of protein structures. J. Appl. Crystallogr. 26, 283-291 (1993).

76. A. Bhattacharya, R. Tejero, G. T. Montelione, Evaluating protein structures determined by structural genomics consortia. Proteins. 66, 778-795 (2007).

77. W. Kabsch, C. Sander, Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 22, 2577-2637 (1983).

78. Y.-R. Lin et ai, Control over overall shape and size in de novo designed proteins. Proc.

Natl. Acad. Sci. U. S. A. 112, E5478-85 (2015).

79. O. D. Monera, T. J. Sereda, N. E. Zhou, C. M. Kay, R. S. Hodges, Relationship of

sidechain hydrophobicity and a-helical propensity on the stability of the single-stranded amphipathic a-helix. J. Pept. Sci. 1, 319-329 (1995).

80. F. Zheng, J. Zhang, G. Grigoryan, Tertiary structural propensities reveal fundamental sequence/structure relationships. Structure. 23, 961-971 (2015).