Title:
SYSTEMS AND METHODS FOR AUTOMATED BIOLOGIC DEVELOPMENT DETERMINATIONS
Document Type and Number:
WIPO Patent Application WO/2019/070517
Kind Code:
A9
Abstract:
Presented herein are systems and methods that provide for a biologic development determination technology that facilitates determining effective and improved procedures for developing biologics. In particular, in certain embodiments, the biologic development determination technology leverages patterns in structural features of a target biologic to provide biologic development recommendations that provide guidance to a user for refining bioprocesses and formulations, and for identifying important structure-function relationships for the target biologic. These recommendations provide users with information and guidance that reduces the degree to which bioprocesses, formulations, and functional characteristics of a target biologic must be refined via a costly and time-consuming trial-and-error process. In this manner, the systems and methods described herein facilitate development of new biologics and biosimilars, providing for faster development of manufacturing techniques and process scale-up, thereby reducing the time needed to bring a new biologic or biosimilar to market.

Inventors:
YARED WAEL I (US)
POSS KIRTLAND G (US)
CHADWICK JENNIFER A (US)
Application Number:
PCT/US2018/053323
Publication Date:
June 20, 2019
Filing Date:
September 28, 2018
Assignee:
BIOANALYTIX INC (US)
International Classes:
C07K14/535; C07K14/56; C07K14/61; C07K16/00; C12M1/00; C12M1/34
Attorney, Agent or Firm:
ADATO, Ronen et al. (US)
Claims:
What is claimed is:

1. A method for refining a bioprocess for manufacture of a target biologic based on automated analysis of generalizable structural attributes (GSAs) of the target biologic, the method comprising:

(a) receiving, by a processor of a computing device, an input query comprising one or more GSAs of the target biologic;

(b) accessing, by the processor, an attribute store comprising a plurality of biologic records and GSAs for a set of known biologics, each biologic record corresponding to a bioprocess having been implemented in production of an associated known biologic from the set of known biologics, wherein:

(i) each biologic record comprises a set of bioprocess parameter values each of which corresponds to a value of a corresponding bioprocess parameter used in the production of the associated known biologic; and

(ii) each biologic record is linked to one or more GSAs of the associated known biologic; and

(c) determining, by the processor, responsive to the input query, one or more bioprocess recommendations using a machine learning module that identifies patterns relating the GSAs for the set of known biologics with the biologic records of the attribute store.
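Steps (a) to (c) of claim 1 can be pictured with a minimal sketch. The claims do not specify a particular algorithm, so a plain nearest-neighbour lookup stands in here for the machine learning module; the class name, function names, GSA keys, and parameter names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class BiologicRecord:
    """One attribute-store entry (hypothetical layout, not from the claims)."""
    name: str
    gsas: dict[str, float]               # GSAs linked to the record
    bioprocess_params: dict[str, float]  # parameter values used in production


def gsa_distance(a, b):
    """Euclidean distance over the GSAs that both biologics share."""
    keys = set(a) & set(b)
    if not keys:
        return float("inf")
    return sqrt(sum((a[k] - b[k]) ** 2 for k in keys))


def recommend(query_gsas, store, k=2):
    """Average each bioprocess parameter over the k records nearest to the query.

    Nearest-neighbour averaging is an assumed stand-in for the claimed
    machine learning module, used only to make the data flow concrete.
    """
    nearest = sorted(store, key=lambda r: gsa_distance(query_gsas, r.gsas))[:k]
    collected = {}
    for record in nearest:
        for param, value in record.bioprocess_params.items():
            collected.setdefault(param, []).append(value)
    return {param: sum(vals) / len(vals) for param, vals in collected.items()}
```

Called with a query such as `{"cys_fraction": 0.025}` and a list of `BiologicRecord` objects, `recommend` returns one suggested value per bioprocess parameter found in the nearest records.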

2. The method of claim 1, wherein each of one or more of the bioprocess recommendations comprises an identification of a corresponding relevant bioprocess parameter determined by the machine learning module as associated with (e.g., influencing; e.g., correlated with) a corresponding set of one or more GSAs of the target biologic [e.g., wherein at least one (e.g., one or more; e.g. up to all) of the bioprocess recommendations comprises a flag associated with the corresponding relevant bioprocess parameter, the flag corresponding to an indication of whether a value of the relevant bioprocess parameter should be further analyzed (e.g., wherein the flag is a binary value indicating whether the value of the relevant bioprocess parameter does or does not need to be checked; e.g., wherein the flag is one of a set of values)].

3. The method of claim 2, wherein the relevant bioprocess parameter is a parameter selected from the group consisting of:

(A) a host cell parameter [e.g., a cell type; e.g., a cell age; e.g., one or more host cell proteins (HCPs); e.g., a seeding density];

(B) a cell culture media parameter [e.g., presence and/or concentration of a particular cell culture media component (e.g., glucose, an amino acid, and other nutrients; e.g., a buffer; e.g., a hormone and/or growth factor; e.g., an antibiotic; e.g., a drug; e.g., a biosynthetic pathway modulator)];

(C) a bioreactor condition parameter [e.g., a reactor type and/or mode of operation; e.g., a volume; e.g., a pH; e.g., a temperature; e.g., pO2; e.g., a flow rate; e.g., a feeding schedule; e.g., a waste buildup monitoring indicator; e.g., a cleaning process];

(D) a purification parameter (e.g., a parameter identifying use of centrifugation, filtration, and/or diafiltration and associated settings; e.g., an initial purification type and associated settings; e.g., a polishing chromatography parameter; e.g., a viral clearance parameter); and

(E) a conjugation parameter (e.g., a drug parameter; e.g., a linker parameter; e.g., a polymer; e.g., an identification of a particular solvent; e.g., a drug-antibody ratio (DAR)).

4. The method of any one of the preceding claims, wherein each of one or more of the bioprocess recommendations comprises a bioprocess design result comprising identifications of one or more relevant bioprocess parameters and, for each relevant bioprocess parameter, a bioprocess parameter value recommendation corresponding to a recommended value or change to a value of the relevant bioprocess parameter.

5. The method of claim 4, further comprising:

(d) adjusting a bioprocess protocol for manufacture of the target biologic using the bioprocess design result.

6. The method of claim 5, further comprising:

(e) producing the target biologic using the adjusted bioprocess protocol.

7. The method of any one of the preceding claims, wherein each of one or more of the bioprocess recommendations comprises an identification of the set of one or more GSAs of the target biologic associated with the corresponding relevant bioprocess parameter.

8. The method of claim 7, wherein the one or more bioprocess recommendations comprises an indication of criticality of one or more of the GSAs of the set (e.g., the indication of criticality corresponding to a binary variable having a first value if a structural feature is critical and a second value if it is not).

9. The method of claim 7, wherein the associated set of one or more GSAs comprises one or more quality attributes, optionally critical quality attributes (CQAs) of the target biologic.

10. The method of any one of claims 7 to 9, wherein the one or more bioprocess recommendations comprises a representation of a correlation between the associated set of one or more GSAs of the target biologic and the relevant bioprocess parameter.

11. The method of any one of the preceding claims, wherein the one or more bioprocess recommendations comprises a recommended analytical study to be carried out on the target biologic.

12. The method of claim 11, comprising performing the recommended analytical study.

13. The method of any one of the preceding claims, comprising causing, by the processor, display of a graphical representation of the one or more bioprocess recommendations.

14. The method of any one of the preceding claims, wherein the one or more GSAs of the target biologic comprise(s) one or more values or sets of values, each of which identifies and/or quantifies a particular pattern of structural features associated with (e.g., correlated with) one or more specific properties of a particular biologic molecule, wherein the one or more specific properties comprise one or more members selected from the group consisting of:

(A) stability and/or a propensity for chemical degradation;

(B) a likelihood and/or propensity to form aggregates;

(C) a likelihood and/or propensity for alterations following release (e.g., in storage, or in vivo when administered to patients and exposed to blood-borne enzymes, physiological temperatures, and the like);

(D) an in vivo circulatory property (e.g., a circulatory half-life; e.g., a biodistribution);

(E) efficacy (e.g., a binding affinity for a specific drug-target; e.g., a specificity for a specific drug-target; e.g., a duration of drug-target engagement); and

(F) an immunogenic potential.

15. The method of any one of the preceding claims, wherein the one or more GSAs of the target biologic comprise(s) one or more members selected from the group consisting of:

(A) an identification and/or quantification of one or more higher order structures (HOS) motifs (e.g., particular secondary structures; e.g., particular tertiary structures; e.g., particular quaternary structures) of the target biologic (e.g., identification of the presence of particular HOS structures within the target biologic; e.g., a quantification of a relative fraction of the target biologic having a particular HOS); and

(B) an identification and/or quantification of a post-translational modification (PTM) of the target biologic (e.g., intra- and inter-chain disulfide bonds; e.g., mismatched disulfides, free cysteine and/or trisulfide; e.g., glycosylation patterns; e.g., deamidation; e.g., oxidation; e.g., chain cleavage; e.g., phosphorylation; e.g., methylation).

16. The method of any one of the preceding claims, wherein the one or more GSAs of the target biologic comprise(s) one or more glycosylation pattern GSAs [e.g., values or sets of values that represent an identification and/or quantification of a particular glycosylation pattern measured for the target biologic].

17. The method of claim 16, wherein the one or more glycosylation pattern GSAs comprise(s) one or more members selected from the group consisting of:

(A) an identification of presence (or absence) of a particular glycan at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the target biologic [e.g., an identification of presence of mannose rich Man5 at a particular site; e.g., an identification of absence of α(1,6) fucose; e.g., an identification of presence of low-abundance Man8; e.g., an identification of presence of α(1,3)];

(B) an identification and/or quantification of a particular type of glycan at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the target biologic (e.g., a quantification of levels of hybrid glycans at one or more particular sites; e.g., a quantification of levels of complex glycans at one or more particular sites);

(C) a quantification of relative levels of two or more particular glycans or types of glycans at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) (e.g., a ratio of hybrid to complex glycans at one or more particular sites);

(D) an expected level of glycan occupancy; and

(E) an identification and/or quantification of glycan content at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked) within the target biologic (e.g., galactose content, afucosylation, sialic acid content, mannose content, etc.).
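Item (C) of claim 17, a ratio of hybrid to complex glycans at particular sites, can be sketched as follows. The input layout (site label, glycan type, relative abundance) and the function name are hypothetical; the claims do not prescribe a data format.

```python
from collections import defaultdict


def hybrid_to_complex_ratio(measurements):
    """Per-site ratio of hybrid to complex glycan levels.

    `measurements` is an iterable of (site, glycan_type, relative_abundance)
    tuples; this layout is an assumption made for the example.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for site, glycan_type, abundance in measurements:
        totals[site][glycan_type] += abundance

    ratios = {}
    for site, by_type in totals.items():
        complex_level = by_type.get("complex", 0.0)
        hybrid_level = by_type.get("hybrid", 0.0)
        # The ratio is undefined when no complex glycans were measured at the site.
        ratios[site] = hybrid_level / complex_level if complex_level else None
    return ratios
```

Sites with no measured complex glycans are reported as `None` rather than as an infinite ratio, so that downstream code has to handle the missing value explicitly.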

18. The method of any one of the preceding claims, wherein the one or more GSAs of the target biologic comprise(s) one or more members selected from the group consisting of:

(A) a sequence motif;

(B) a molecule type of the target biologic;

(C) a quantification of one or more specific amino acids within the target biologic [e.g., a total number of one or more specific amino acids (e.g., cysteines) within the target biologic; e.g., a fraction of one or more specific amino acids within the target biologic];

(D) a proportion of amino acids within the target biologic having a particular classification based on one or more specific properties; and

(E) an identification and/or quantification of patterns of amino acid motifs associated with propensity towards one or more specific types of amino acid modifications [e.g., positions and/or number of potential sites of oxidation; e.g., positions and/or number of potential sites of deamidation; e.g., positions and/or number of potential sites of post-translational modifications (e.g., N-linked glycosylation; e.g., disulfide bridges; e.g., disulfide knots; e.g., modification of cysteine to formylglycine)].
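Items (C) and (E) of claim 18 describe GSAs that can be derived directly from an amino acid sequence. The sketch below is one possible reading, not the applicants' method. It assumes the commonly used N-X-S/T sequon (X not proline) for potential N-linked glycosylation sites, the Asn-Gly motif as a deamidation-prone site, and methionine as a potential oxidation site. These motif choices and the function name are assumptions, not taken from the claims.

```python
import re


def sequence_gsas(sequence):
    """Derive a few sequence-level GSAs from a one-letter amino acid string.

    Positions are reported 1-based. The motif definitions are illustrative
    assumptions (see lead-in), not definitions from the patent text.
    """
    seq = sequence.upper()
    length = len(seq)
    cysteines = seq.count("C")
    return {
        "length": length,
        "cysteine_count": cysteines,
        "cysteine_fraction": cysteines / length if length else 0.0,
        # Lookahead patterns so that overlapping motifs are all reported.
        "n_glyc_sequon_positions": [m.start() + 1 for m in re.finditer(r"(?=N[^P][ST])", seq)],
        "deamidation_positions": [m.start() + 1 for m in re.finditer(r"(?=NG)", seq)],
        "oxidation_positions": [i + 1 for i, aa in enumerate(seq) if aa == "M"],
    }
```

The returned dictionary could serve as part of an input query of GSAs; a fuller implementation would also validate that the sequence contains only recognised residue codes.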

19. The method of any one of the preceding claims, wherein the one or more GSAs of the target biologic comprise(s) a proportion of amino acids within the target biologic having a particular classification based on one or more specific properties, wherein the one or more specific properties comprise(s) one or more members selected from the group consisting of:

(i) hydrophobicity [e.g., having at least a given level of hydrophobicity (e.g., as measured on a predefined scale); having less than or equal to a given level of hydrophobicity; e.g., having a level of hydrophobicity within a particular range; classified as hydrophobic];

(ii) hydrophilicity [e.g., having at least a given level of hydrophilicity (e.g., as measured on a predefined scale); having less than or equal to a given level of hydrophilicity; e.g., having a level of hydrophilicity within a particular range; classified as hydrophilic];

(iii) charge (e.g., having a charge greater than or equal to a specific charge; e.g., having charge less than or equal to a specific charge; e.g., having a positive charge; e.g., having a negative charge; e.g., neutral);

(iv) acidity (e.g., classified as acidic; e.g., classified as basic; e.g., classified as neutral); and

(v) aromaticity (e.g., classified as aromatic).
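The proportions recited in claim 19 can be computed once each residue has been assigned to a class. The sketch below uses simple, illustrative residue groupings. The claims leave the classification scheme open (for example, "as measured on a predefined scale"), so the sets below and the function name are assumptions.

```python
# Illustrative residue classes only; a real analysis would choose a
# published scale or classification and document it.
RESIDUE_CLASSES = {
    "hydrophobic": set("AVILMFW"),
    "aromatic": set("FWY"),
    "positive": set("KR"),
    "negative": set("DE"),
}


def residue_class_proportions(sequence):
    """Fraction of residues in `sequence` that fall into each class."""
    seq = sequence.upper()
    if not seq:
        return {name: 0.0 for name in RESIDUE_CLASSES}
    return {
        name: sum(aa in members for aa in seq) / len(seq)
        for name, members in RESIDUE_CLASSES.items()
    }
```

A residue may belong to more than one class (phenylalanine is counted as both hydrophobic and aromatic here), so the proportions need not sum to one.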

20. The method of any one of the preceding claims, wherein the one or more GSAs of the target biologic comprise(s) one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) the target biologic and (ii) a reference biologic.

21. The method of any one of the preceding claims, wherein the one or more GSAs of the target biologic comprise(s) one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) a first lot of the target biologic and (ii) a second lot of the target biologic.
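Claims 20 and 21 recite GSAs that quantify a difference between two sets of structural characteristics: target versus reference biologic, or one lot versus another. One simple way to express this is a per-attribute difference with a tolerance; the function name, dictionary keys, and the use of a single absolute tolerance are assumptions made for illustration.

```python
def gsa_differences(first, second, tolerance=0.0):
    """Compare two GSA dictionaries (e.g., two lots, or target vs. reference).

    Returns (differences, flagged): per-attribute `second - first` for the
    attributes present in both, and the names whose absolute difference
    exceeds `tolerance`. Attributes missing from either side are ignored.
    """
    shared = sorted(set(first) & set(second))
    differences = {name: second[name] - first[name] for name in shared}
    flagged = [name for name in shared if abs(differences[name]) > tolerance]
    return differences, flagged
```

The `differences` mapping could itself be used as a set of GSA values in an input query, while `flagged` gives a quick view of which structural characteristics diverge.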

22. The method of any one of the preceding claims, wherein the one or more GSAs of the target biologic comprise(s) one or more values derived from a CQA map of the target biologic.

23. The method of any one of the preceding claims, wherein the one or more GSAs of the target biologic comprise(s) one or more values derived from an in vivo comparability profile of the target biologic.

24. The method of any one of the preceding claims, the method comprising:

receiving, by the processor, a user input comprising data corresponding to one or more measured structural features of the target biologic;

determining, by the processor, using the data corresponding to the one or more measured structural features of the target biologic, the one or more GSAs of the target biologic; and

generating, by the processor, the determined one or more target biologic GSAs for use as the input query of step (a).

25. The method of claim 24, comprising performing one or more structural characterization studies on a sample comprising the target biologic to generate the data corresponding to the one or more measured structural features of the target biologic.

26. The method of any one of the preceding claims, wherein at least one of (e.g., one or more of; e.g., each of) the bioprocess parameter values of the biologic records in the attribute store corresponds to a value of a corresponding bioprocess parameter selected from the group consisting of:

(A) a host cell parameter [e.g., a cell type; e.g., a cell age; e.g., one or more host cell proteins (HCPs); e.g., a seeding density];

(B) a cell culture media parameter [e.g., presence and/or concentration of a particular cell culture media component (e.g., glucose, an amino acid, and other nutrients; e.g., a buffer; e.g., a hormone and/or growth factor; e.g., an antibiotic; e.g., a drug; e.g., a biosynthetic pathway modulator)];

(C) a bioreactor condition parameter (e.g., a reactor type and/or mode of operation; e.g., a volume; e.g., a pH; e.g., a temperature; e.g., pO2; e.g., a flow rate; e.g., a feeding schedule; e.g., a waste buildup monitoring indicator; e.g., a cleaning process);

(D) a purification parameter (e.g., a parameter identifying use of centrifugation, filtration, and/or diafiltration and associated settings; e.g., an initial purification type and associated settings; e.g., a polishing chromatography parameter; e.g., a viral clearance parameter); and

(E) a conjugation parameter (e.g., a drug parameter; e.g., a linker parameter; e.g., a polymer; e.g., an identification of a particular solvent; e.g., drug-antibody ratio (DAR)).

27. The method of any one of the preceding claims, wherein the one or more GSAs of the associated known biologic comprise(s) one or more values or sets of values, each of which identifies and/or quantifies a particular pattern of structural features associated with (e.g., correlated with) one or more specific properties of a particular biologic molecule, wherein the one or more specific properties comprise one or more members selected from the group consisting of:

(A) stability and/or a propensity for chemical degradation;

(B) a likelihood and/or propensity to form aggregates;

(C) a likelihood and/or propensity for alterations following release (e.g., in storage, or in vivo when administered to patients and exposed to blood-borne enzymes, physiological temperatures, and the like);

(D) an in vivo circulatory property (e.g., a circulatory half-life; e.g., a biodistribution);

(E) efficacy (e.g., a binding affinity for a specific drug-target; e.g., a specificity for a specific drug-target; e.g., a duration of drug-target engagement); and

(F) an immunogenic potential.

28. The method of any one of the preceding claims, wherein the one or more GSAs of the associated known biologic comprise(s) one or more members selected from the group consisting of:

(A) an identification and/or quantification of one or more higher order structures (HOS) motifs (e.g., particular secondary structures; e.g., particular tertiary structures; e.g., particular quaternary structures) of the associated known biologic (e.g., identification of the presence of particular HOS structures within the associated known biologic; e.g., a quantification of a relative fraction of the associated known biologic having a particular HOS); and

(B) an identification and/or quantification of a post-translational modification (PTM) of the associated known biologic (e.g., intra- and inter-chain disulfide bonds; e.g., mismatched disulfides, free cysteine and/or trisulfide; e.g., glycosylation patterns; e.g., deamidation; e.g., oxidation; e.g., chain cleavage; e.g., phosphorylation; e.g., methylation).

29. The method of any one of the preceding claims, wherein the one or more GSAs of the associated known biologic comprise(s) one or more glycosylation pattern GSAs [e.g., values or sets of values that represent an identification and/or quantification of a particular glycosylation pattern measured for the associated known biologic].

30. The method of claim 29, wherein the one or more glycosylation pattern GSAs comprise(s) one or more members selected from the group consisting of:

(A) an identification of presence (or absence) of a particular glycan at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the associated known biologic [e.g., an identification of presence of mannose rich Man5 at a particular site; e.g., an identification of absence of α(1,6) fucose; e.g., an identification of presence of low-abundance Man8; e.g., an identification of presence of α(1,3)];

(B) an identification and/or quantification of a particular type of glycan at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the associated known biologic (e.g., a quantification of levels of hybrid glycans at one or more particular sites; e.g., a quantification of levels of complex glycans at one or more particular sites);

(C) a quantification of relative levels of two or more particular glycans or types of glycans at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) (e.g., a ratio of hybrid to complex glycans at one or more particular sites);

(D) an expected level of glycan occupancy; and

(E) an identification and/or quantification of glycan content at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked) within the associated known biologic (e.g., galactose content, afucosylation, sialic acid content, mannose content, etc.).

31. The method of any one of the preceding claims, wherein the one or more GSAs of the associated known biologic comprise(s) one or more members selected from the group consisting of:

(A) a sequence motif;

(B) a molecule type of the associated known biologic;

(C) a quantification of one or more specific amino acids within the associated known biologic [e.g., a total number of one or more specific amino acids (e.g., cysteines) within the associated known biologic; e.g., a fraction of one or more specific amino acids within the associated known biologic];

(D) a proportion of amino acids within the associated known biologic having a particular classification based on one or more specific properties; and

(E) an identification and/or quantification of patterns of amino acid motifs associated with propensity towards one or more specific types of amino acid modifications [e.g., positions and/or number of potential sites of oxidation; e.g., positions and/or number of potential sites of deamidation; e.g., positions and/or number of potential sites of post-translational modifications (e.g., N-linked glycosylation; e.g., disulfide bridges; e.g., disulfide knots; e.g., modification of cysteine to formylglycine)].

32. The method of any one of the preceding claims, wherein the one or more GSAs of the associated known biologic comprises a proportion of amino acids within the associated known biologic having a particular classification based on one or more specific properties, wherein the one or more specific properties comprise(s) one or more members selected from the group consisting of:

(i) hydrophobicity [e.g., having at least a given level of hydrophobicity (e.g., as measured on a predefined scale); having less than or equal to a given level of hydrophobicity; e.g., having a level of hydrophobicity within a particular range; classified as hydrophobic];

(ii) hydrophilicity [e.g., having at least a given level of hydrophilicity (e.g., as measured on a predefined scale); having less than or equal to a given level of hydrophilicity; e.g., having a level of hydrophilicity within a particular range; classified as hydrophilic];

(iii) charge (e.g., having a charge greater than or equal to a specific charge; e.g., having charge less than or equal to a specific charge; e.g., having a positive charge; e.g., having a negative charge; e.g., neutral);

(iv) acidity (e.g., classified as acidic; e.g., classified as basic; e.g., classified as neutral); and

(v) aromaticity (e.g., classified as aromatic).

33. The method of any one of the preceding claims, wherein the one or more GSAs of the associated known biologic comprise one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) the associated known biologic and (ii) a reference biologic.

34. The method of any one of the preceding claims, wherein the one or more GSAs of the associated known biologic comprise one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) a first lot of the associated known biologic and (ii) a second lot of the associated known biologic.

35. The method of any one of the preceding claims, wherein the one or more GSAs of the associated known biologic comprise one or more values derived from a CQA map of the associated known biologic.

36. The method of any one of the preceding claims, wherein the one or more GSAs of the associated known biologic comprise one or more values derived from an in vivo comparability profile of the associated known biologic.

37. The method of any one of the preceding claims, wherein at least a portion of the plurality of biologic records in the attribute store are generated from published documents via automated processing using text mining with or without natural language processing.

38. The method of any one of the preceding claims, wherein at least a portion of the plurality of biologic records are generated from published documents via automated processing in combination with a user interaction.

39. The method of any one of the preceding claims, wherein at least a portion of the biologic records are generated from direct measurement of biologic structural features.

40. The method of any one of the preceding claims, wherein the machine learning module of step (c) implements a supervised machine learning technique.

41. The method of any one of the preceding claims, wherein the machine learning module of step (c) implements a reinforcement machine learning technique.

42. The method of any one of the preceding claims, wherein the machine learning module of step (c) implements an unsupervised machine learning technique.

43. The method of claim 42, wherein the machine learning module of step (c) implements the unsupervised machine learning technique as a precursor to a supervised machine learning technique.
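Claim 43 describes an unsupervised technique used as a precursor to a supervised one. A minimal sketch of that pattern, under stated assumptions: k-means clustering groups known biologics by their GSA vectors (unsupervised), and a per-cluster mean of a recorded bioprocess parameter then serves as a very simple supervised predictor. The claims do not name these algorithms; they, along with all function names, are illustrative choices.

```python
def _sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic start (first k points as centroids)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: _sq_dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def fit_cluster_means(points, targets, k):
    """Unsupervised clustering followed by a supervised per-cluster mean."""
    centroids, labels = kmeans(points, k)
    means = {}
    for c in range(k):
        values = [t for t, lab in zip(targets, labels) if lab == c]
        if values:
            means[c] = sum(values) / len(values)
    return centroids, means


def predict(query, centroids, means):
    """Assign the query to its nearest centroid and return that cluster's mean."""
    nearest = min(range(len(centroids)), key=lambda c: _sq_dist(query, centroids[c]))
    return means.get(nearest)
```

Here `points` would be GSA vectors for known biologics and `targets` the recorded value of one bioprocess parameter (for example, a culture pH); in practice the per-cluster mean would be replaced by whatever supervised model suits the data.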

44. A method for refining formulation of a target biologic based on automated analysis of generalizable structural attributes (GSAs) of the target biologic, the method comprising:

(a) receiving, by a processor of a computing device, an input query comprising one or more GSAs of the target biologic;

(b) accessing, by the processor, an attribute store comprising a plurality of biologic records and GSAs for a set of known biologics, each biologic record corresponding to a formulation of an associated known biologic from the set of known biologics, wherein:

(i) each biologic record comprises a set of formulation parameter values each of which corresponds to a value of a corresponding formulation parameter used in the formulation of the associated known biologic, and

(ii) each biologic record is linked to one or more GSAs of the associated known biologic; and

(c) determining, by the processor, responsive to the input query, one or more formulation recommendations using a machine learning module that identifies patterns relating the GSAs for the set of known biologics with the biologic records of the attribute store.

45. The method of claim 44, wherein each of one or more of the formulation recommendations comprises an identification of a corresponding relevant formulation parameter determined by the machine learning module as associated with (e.g., correlated with; e.g., influencing) a corresponding set of one or more GSAs of the target biologic [e.g., wherein at least one (e.g., one or more; e.g. up to all) of the formulation recommendations comprises a flag associated with the corresponding relevant formulation parameter, the flag corresponding to an indication of whether a value of the relevant formulation parameter should be further analyzed (e.g., wherein the flag is a binary value indicating whether the value of the relevant formulation parameter does or does not need to be checked; e.g., wherein the flag is one of a set of values)].

46. The method of claim 45, wherein the corresponding relevant formulation parameter is a parameter selected from the group consisting of:

(A) a buffer parameter [e.g., an identification of a particular type of buffer (e.g., Histidine, Acetate, Citrate, Phosphate, Tris, Bicarbonate, and the like) and, optionally, its concentration];

(B) a surfactant parameter [e.g., an identification of a particular surfactant (e.g., Polysorbate 80, Polysorbate 20, Poloxamer, and the like) and, optionally, its concentration];

(C) an excipient parameter [e.g., an identification of a particular excipient (e.g., Sucrose, Arginine, Glycine, Methionine, NaCl, Serum albumin, Zinc, Protamine, an antioxidant, a chelator (e.g., EGTA), Poly(D,L-lactide-co-glycolide), Cyclodextrin, a bacteriostat, and the like), and, optionally, its concentration];

(D) a form/concentration parameter [e.g., an identification of a particular storage form (e.g., liquid, lyophilized, spray dried, sustained release, and the like), and, optionally, a concentration of the target biologic];

(E) a storage pH;

(F) a storage temperature;

(G) a container parameter {e.g., an identification of a container type [e.g., a textual string listing container properties such as material (e.g., including any coatings or surface treatments such as siliconization), format (e.g., vial, syringe, and the like), and size]}; and

(H) a closure system parameter {e.g., an identification of a particular type of closure system [e.g., a textual string listing closure system types and properties (e.g., material) (e.g., “natural latex stopper”; e.g., “silicone rubber stopper”; e.g., “syringe needle”; e.g., “autoinjector”; e.g., “amber ampoule”)]}.

47. The method of any one of claims 44 to 46, wherein each of one or more of the formulation recommendations comprises a formulation design result comprising identifications of one or more relevant formulation parameters and, for each relevant formulation parameter, a formulation parameter value recommendation corresponding to a recommended value or change to a value of the relevant formulation parameter.

48. The method of claim 47, further comprising:

(d) adjusting a formulation of the target biologic using the formulation design result.

49. The method of claim 48, further comprising:

(e) producing the adjusted formulation of the target biologic.

50. The method of any one of claims 44 to 49, wherein each of one or more of the formulation recommendations comprises an identification of the set of one or more GSAs of the target biologic associated with the corresponding relevant formulation parameter.

51. The method of claim 50, wherein the one or more formulation recommendations comprises an indication of criticality of one or more of the GSAs of the set (e.g., the indication of criticality corresponding to a binary variable having a first value if a structural feature is critical and a second value if it is not).

52. The method of claim 50, wherein the associated set of one or more GSAs of the target biologic comprises one or more quality attributes, optionally, critical quality attributes (CQAs) of the target biologic.

53. The method of any one of claims 50 to 52, wherein the one or more formulation recommendations comprises a representation of a correlation between the associated set of one or more GSAs of the target biologic and the relevant formulation parameter.

54. The method of any one of claims 44 to 53, wherein the one or more formulation recommendations comprises a recommended analytical study to be carried out on the target biologic.

55. The method of claim 54, comprising performing the recommended analytical study.

56. The method of any one of claims 44 to 55, comprising causing, by the processor, display of a graphical representation of the one or more formulation recommendations.

57. The method of any one of claims 44 to 56, wherein the one or more GSAs of the target biologic comprise(s) one or more values or sets of values, each of which identifies and/or quantifies a particular pattern of structural features associated with (e.g., correlated with) one or more specific properties of a particular biologic molecule, wherein the one or more specific properties comprise one or more members selected from the group consisting of:

(A) stability and/or a propensity for chemical degradation;

(B) a likelihood and/or propensity to form aggregates;

(C) a likelihood and/or propensity for alterations following release (e.g., in storage, or in vivo when administered to patients and exposed to blood-borne enzymes, physiological temperatures, and the like);

(D) an in vivo circulatory property (e.g., a circulatory half-life; e.g., a biodistribution);

(E) efficacy (e.g., a binding affinity for a specific drug-target; e.g., a specificity for a specific drug-target; e.g., a duration of drug-target engagement); and

(F) an immunogenic potential.

58. The method of any one of claims 44 to 57, wherein the one or more GSAs of the target biologic comprise(s) one or more members selected from the group consisting of:

(A) an identification and/or quantification of one or more higher order structures (HOS) motifs (e.g., particular secondary structures; e.g., particular tertiary structures; e.g., particular quaternary structures) of the target biologic (e.g., identification of the presence of particular HOS structures within the target biologic; e.g., a quantification of a relative fraction of the target biologic having a particular HOS); and

(B) an identification and/or quantification of a post-translational modification (PTM) of the target biologic (e.g., intra- and inter-chain disulfide bonds; e.g., mismatched disulfides, free cysteine and/or trisulfide; e.g., glycosylation patterns; e.g., deamidation; e.g., oxidation; e.g., chain cleavage; e.g., phosphorylation; e.g., methylation).

59. The method of any one of claims 44 to 58, wherein the one or more GSAs of the target biologic comprise(s) one or more glycosylation pattern GSAs [e.g., values or sets of values that represent an identification and/or quantification of a particular glycosylation pattern measured for the target biologic].

60. The method of claim 59, wherein the one or more glycosylation pattern GSAs comprise(s) one or more members selected from the group consisting of:

(A) an identification of presence (or absence) of a particular glycan at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the target biologic [e.g., an identification of presence of mannose rich Man5 at a particular site; e.g., an identification of absence of α(1,6) fucose; e.g., an identification of presence of low-abundance Man8; e.g., an identification of presence of α(1,3)];

(B) an identification and/or quantification of a particular type of glycan at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the target biologic (e.g., a quantification of levels of hybrid glycans at one or more particular sites; e.g., a quantification of levels of complex glycans at one or more particular sites);

(C) a quantification of relative levels of two or more particular glycans or types of glycans at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) (e.g., a ratio of hybrid to complex glycans at one or more particular sites);

(D) an expected level of glycan occupancy; and

(E) an identification and/or quantification of glycan content at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked) within the target biologic (e.g., galactose content, afucosylation, sialic acid content, mannose content, etc.).

61. The method of any one of claims 44 to 60, wherein the one or more GSAs of the target biologic comprise(s) one or more members selected from the group consisting of:

(A) a sequence motif;

(B) a molecule type of the target biologic;

(C) a quantification of one or more specific amino acids within the target biologic [e.g., a total number of one or more specific amino acids (e.g., cysteines) within the target biologic; e.g., a fraction of one or more specific amino acids within the target biologic];

(D) a proportion of amino acids within the target biologic having a particular classification based on one or more specific properties; and

(E) an identification and/or quantification of patterns of amino acid motifs associated with propensity towards one or more specific types of amino acid modifications [e.g., positions and/or number of potential sites of oxidation; e.g., positions and/or number of potential sites of deamidation; e.g., positions and/or number of potential sites of post-translational modifications (e.g., N-linked glycosylation; e.g., disulfide bridges; e.g., disulfide knots; e.g., modification of cysteine to formylglycine)].
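
By way of illustration only, and not as part of the claims, sequence-derived GSAs of the kind listed in items (C) and (E) could be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `sequence_gsas`, the returned keys, and the specific motif choices (the N-X-S/T sequon with X not proline for potential N-linked glycosylation, Asn-Gly and Asn-Ser for potential deamidation, methionine for potential oxidation) are illustrative choices made here and are not taken from the application.

```python
import re


def sequence_gsas(seq: str) -> dict:
    """Derive a few simple sequence-based attributes from a one-letter sequence.

    All motif definitions below are illustrative assumptions.
    Positions are reported 1-based.
    """
    seq = seq.upper()
    n = len(seq)
    # Lookahead patterns so that overlapping motifs are all reported.
    sequons = [m.start() + 1 for m in re.finditer(r"(?=N[^P][ST])", seq)]
    deamidation = [m.start() + 1 for m in re.finditer(r"(?=N[GS])", seq)]
    oxidation = [i + 1 for i, aa in enumerate(seq) if aa == "M"]
    cysteines = seq.count("C")
    return {
        "cysteine_count": cysteines,
        "cysteine_fraction": cysteines / n if n else 0.0,
        "n_glyc_sequon_positions": sequons,
        "deamidation_site_positions": deamidation,
        "oxidation_site_positions": oxidation,
    }
```

The returned dictionary is one possible flat representation of such attributes; positions and counts are kept separately so that either could be supplied as part of an input query.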

62. The method of any one of claims 44 to 61, wherein the one or more GSAs of the target biologic comprise(s) a proportion of amino acids within the target biologic having a particular classification based on one or more specific properties, wherein the one or more specific properties comprise(s) one or more members selected from the group consisting of:

(i) hydrophobicity [e.g., having at least a given level of hydrophobicity (e.g., as measured on a predefined scale); having less than or equal to a given level of hydrophobicity; e.g., having a level of hydrophobicity within a particular range; classified as hydrophobic];

(ii) hydrophilicity [e.g., having at least a given level of hydrophilicity (e.g., as measured on a predefined scale); having less than or equal to a given level of hydrophilicity; e.g., having a level of hydrophilicity within a particular range; classified as hydrophilic];

(iii) charge (e.g., having a charge greater than or equal to a specific charge; e.g., having charge less than or equal to a specific charge; e.g., having a positive charge; e.g., having a negative charge; e.g., neutral);

(iv) acidity (e.g., classified as acidic; e.g., classified as basic; e.g., classified as neutral); and

(v) aromaticity (e.g., classified as aromatic).
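
By way of illustration only, and not as part of the claims, a proportion of amino acids having a particular classification could be computed as sketched below. The sketch assumes the Kyte-Doolittle hydropathy values as the "predefined scale", a hypothetical threshold of 1.8 for "at least a given level of hydrophobicity", Lys/Arg as positively charged, Asp/Glu as negatively charged, and Phe/Trp/Tyr as aromatic. None of these choices, nor the function name `class_proportions`, comes from the application.

```python
# Kyte-Doolittle hydropathy index (assumed here as the predefined scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
POSITIVE = set("KR")    # assumption: histidine is not counted as charged
NEGATIVE = set("DE")
AROMATIC = set("FWY")   # assumption: histidine is not counted as aromatic


def class_proportions(seq: str, hydrophobicity_threshold: float = 1.8) -> dict:
    """Return the fraction of residues in each (assumed) class."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return {"hydrophobic": 0.0, "positive": 0.0, "negative": 0.0, "aromatic": 0.0}
    # Residues missing from the scale (e.g., X) are never counted as hydrophobic.
    hydrophobic = sum(
        1 for aa in seq
        if KYTE_DOOLITTLE.get(aa, float("-inf")) >= hydrophobicity_threshold
    )
    return {
        "hydrophobic": hydrophobic / n,
        "positive": sum(aa in POSITIVE for aa in seq) / n,
        "negative": sum(aa in NEGATIVE for aa in seq) / n,
        "aromatic": sum(aa in AROMATIC for aa in seq) / n,
    }
```

Changing `hydrophobicity_threshold` corresponds to choosing a different "given level" on the scale; a range-based classification would need a second bound.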

63. The method of any one of claims 44 to 62, wherein the one or more GSAs of the target biologic comprise(s) one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) the target biologic and (ii) a reference biologic.

64. The method of any one of claims 44 to 63, wherein the one or more GSAs of the target biologic comprise(s) one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) a first lot of the target biologic and (ii) a second lot of the target biologic.

65. The method of any one of claims 44 to 64, wherein the one or more GSAs of the target biologic comprise(s) one or more values derived from a CQA map of the target biologic.

66. The method of any one of claims 44 to 65, wherein the one or more GSAs of the target biologic comprise(s) one or more values derived from an in vivo comparability profile of the target biologic.

67. The method of any one of claims 44 to 66, the method comprising:

receiving, by the processor, a user input comprising data corresponding to one or more measured structural features of the target biologic;

determining, by the processor, using the data corresponding to the one or more measured structural features of the target biologic, the one or more GSAs of the target biologic; and

generating, by the processor, the determined one or more target biologic GSAs for use as the input query of step (a).

68. The method of claim 67, comprising performing one or more structural characterization studies on a sample comprising the target biologic to generate the data corresponding to the one or more measured structural features of the target biologic.

69. The method of any one of claims 44 to 68, wherein at least one of (e.g., one or more of; e.g., each of) the formulation parameter values of the biologic records in the attribute store corresponds to a value of a corresponding formulation parameter selected from the group consisting of:

(A) a buffer parameter [e.g., an identification of a particular type of buffer (e.g., Histidine, Acetate, Citrate, Phosphate, Tris, Bicarbonate, and the like) and, optionally, its concentration];

(B) a surfactant parameter [e.g., an identification of a particular surfactant (e.g., Polysorbate 80, Polysorbate 20, Poloxamer, and the like) and, optionally, its concentration];

(C) an excipient parameter [e.g., an identification of a particular excipient (e.g., Sucrose, Arginine, Glycine, Methionine, NaCl, Serum albumin, Zinc, Protamine, an antioxidant, a chelator (e.g., EGTA), Poly(D,L-lactide-co-glycolide), Cyclodextrin, a bacteriostat, and the like), and, optionally, its concentration];

(D) a form/concentration parameter [e.g., an identification of a particular storage form (e.g., liquid, lyophilized, spray dried, sustained release, and the like), and, optionally, a concentration of the associated known biologic];

(E) a storage pH;

(F) a storage temperature;

(G) a container parameter {e.g., an identification of a container type [e.g., a textual string listing container properties such as material (e.g., including any coatings or surface treatments such as siliconization), format (e.g., vial, syringe, and the like), and size]}; and

(H) a closure system parameter {e.g., an identification of a particular type of closure system [e.g., a textual string listing closure system types and properties (e.g., material) (e.g., “natural latex stopper”; e.g., “silicone rubber stopper”; e.g., “syringe needle”; e.g., “autoinjector”; e.g., “amber ampoule”)]}.

70. The method of any one of claims 44 to 69, wherein the one or more GSAs of the associated known biologic comprise(s) one or more values or sets of values, each of which identifies and/or quantifies a particular pattern of structural features associated with (e.g., correlated with) one or more specific properties of a particular biologic molecule, wherein the one or more specific properties comprise one or more members selected from the group consisting of:

(A) stability and/or a propensity for chemical degradation;

(B) a likelihood and/or propensity to form aggregates;

(C) a likelihood and/or propensity for alterations following release (e.g., in storage, or in vivo when administered to patients and exposed to blood-borne enzymes, physiological temperatures, and the like);

(D) an in vivo circulatory property (e.g., a circulatory half-life; e.g., a biodistribution);

(E) efficacy (e.g., a binding affinity for a specific drug-target; e.g., a specificity for a specific drug-target; e.g., a duration of drug-target engagement); and

(F) an immunogenic potential.

71. The method of any one of claims 44 to 70, wherein the one or more GSAs of the associated known biologic comprise(s) one or more members selected from the group consisting of:

(A) an identification and/or quantification of one or more higher order structures (HOS) motifs (e.g., particular secondary structures; e.g., particular tertiary structures; e.g., particular quaternary structures) of the associated known biologic (e.g., identification of the presence of particular HOS structures within the associated known biologic; e.g., a quantification of a relative fraction of the associated known biologic having a particular HOS); and

(B) an identification and/or quantification of a post-translational modification (PTM) of the associated known biologic (e.g., intra- and inter-chain disulfide bonds; e.g., mismatched disulfides, free cysteine and/or trisulfide; e.g., glycosylation patterns; e.g., deamidation; e.g., oxidation; e.g., chain cleavage; e.g., phosphorylation; e.g., methylation).

72. The method of any one of claims 44 to 71, wherein the one or more GSAs of the associated known biologic comprise(s) one or more glycosylation pattern GSAs [e.g., values or sets of values that represent an identification and/or quantification of a particular glycosylation pattern measured for the associated known biologic].

73. The method of claim 72, wherein the one or more glycosylation pattern GSAs comprise(s) one or more members selected from the group consisting of:

(A) an identification of presence (or absence) of a particular glycan at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the associated known biologic [e.g., an identification of presence of mannose rich Man5 at a particular site; e.g., an identification of absence of α(1,6) fucose; e.g., an identification of presence of low-abundance Man8; e.g., an identification of presence of α(1,3)];

(B) an identification and/or quantification of a particular type of glycan at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the associated known biologic (e.g., a quantification of levels of hybrid glycans at one or more particular sites; e.g., a quantification of levels of complex glycans at one or more particular sites);

(C) a quantification of relative levels of two or more particular glycans or types of glycans at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) (e.g., a ratio of hybrid to complex glycans at one or more particular sites);

(D) an expected level of glycan occupancy; and

(E) an identification and/or quantification of glycan content at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked) within the associated known biologic (e.g., galactose content, afucosylation, sialic acid content, mannose content, etc.).

74. The method of any one of claims 44 to 73, wherein the one or more GSAs of the associated known biologic comprise(s) one or more members selected from the group consisting of:

(A) a sequence motif;

(B) a molecule type of the associated known biologic;

(C) a quantification of one or more specific amino acids within the associated known biologic [e.g., a total number of one or more specific amino acids (e.g., cysteines) within the associated known biologic; e.g., a fraction of one or more specific amino acids within the associated known biologic];

(D) a proportion of amino acids within the associated known biologic having a particular classification based on one or more specific properties; and

(E) an identification and/or quantification of patterns of amino acid motifs associated with propensity towards one or more specific types of amino acid modifications [e.g., positions and/or number of potential sites of oxidation; e.g., positions and/or number of potential sites of deamidation; e.g., positions and/or number of potential sites of post-translational modifications (e.g., N-linked glycosylation; e.g., disulfide bridges; e.g., disulfide knots; e.g., modification of cysteine to formylglycine)].

75. The method of any one of claims 44 to 74, wherein the one or more GSAs of the associated known biologic comprises a proportion of amino acids within the associated known biologic having a particular classification based on one or more specific properties, wherein the one or more specific properties comprise(s) one or more members selected from the group consisting of:

(i) hydrophobicity [e.g., having at least a given level of hydrophobicity (e.g., as measured on a predefined scale); having less than or equal to a given level of hydrophobicity; e.g., having a level of hydrophobicity within a particular range; classified as hydrophobic];

(ii) hydrophilicity [e.g., having at least a given level of hydrophilicity (e.g., as measured on a predefined scale); having less than or equal to a given level of hydrophilicity; e.g., having a level of hydrophilicity within a particular range; classified as hydrophilic];

(iii) charge (e.g., having a charge greater than or equal to a specific charge; e.g., having charge less than or equal to a specific charge; e.g., having a positive charge; e.g., having a negative charge; e.g., neutral);

(iv) acidity (e.g., classified as acidic; e.g., classified as basic; e.g., classified as neutral); and

(v) aromaticity (e.g., classified as aromatic).

76. The method of any one of claims 44 to 75, wherein the one or more GSAs of the associated known biologic comprise one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) the associated known biologic and (ii) a reference biologic.

77. The method of any one of claims 44 to 76, wherein the one or more GSAs of the associated known biologic comprise one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) a first lot of the associated known biologic and (ii) a second lot of the associated known biologic.

78. The method of any one of claims 44 to 77, wherein the one or more GSAs of the associated known biologic comprise one or more values derived from a CQA map of the associated known biologic.

79. The method of any one of claims 44 to 78, wherein the one or more GSAs of the associated known biologic comprise one or more values derived from an in vivo comparability profile of the associated known biologic.

80. The method of any one of claims 44 to 79, wherein at least a portion of the plurality of biologic records in the attribute store are generated from published documents via automated processing using text mining with or without natural language processing.

81. The method of any one of claims 44 to 80, wherein at least a portion of the plurality of biologic records are generated from published documents via automated processing in combination with a user interaction.

82. The method of any one of claims 44 to 81, wherein at least a portion of the biologic records are generated from direct measurement of biologic structural features.

83. The method of any one of claims 44 to 82, wherein the machine learning module of step (c) implements a supervised machine learning technique.

84. The method of any one of claims 44 to 83, wherein the machine learning module of step (c) implements a reinforcement machine learning technique.

85. The method of any one of claims 44 to 84, wherein the machine learning module of step (c) implements an unsupervised machine learning technique.

86. The method of claim 85, wherein the machine learning module of step (c) implements the unsupervised machine learning technique as a precursor to a supervised machine learning technique.

87. A method for refining a structure-function profile of a target biologic based on automated analysis of generalizable structural attributes (GSAs) of the target biologic, the method comprising:

(a) receiving, by a processor of a computing device, an input query comprising one or more GSAs of the target biologic;

(b) accessing, by the processor, an attribute store comprising a plurality of biologic records and GSAs for a set of known biologics, each biologic record representing a set of previously determined structural or functional characteristics of an associated known biologic, wherein:

(i) each biologic record comprises a known set of structure-function parameter values, each of which represents a previously determined specific structural or functional characteristic of the associated known biologic, and

(ii) each biologic record is linked to one or more GSAs of the associated known biologic; and

(c) determining, by the processor, responsive to the input query, one or more structure-function recommendations using a machine learning module that identifies patterns relating the GSAs for the set of known biologics with the biologic records of the attribute store.
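
By way of illustration only, and not as part of the claims, steps (a) to (c) could be sketched as follows. The claim does not specify a particular learning technique; the sketch substitutes a simple k-nearest-neighbour average over numeric GSA values as a stand-in for the machine learning module. The names `BiologicRecord`, `gsa_distance`, `predict_parameters`, and `ATTRIBUTE_STORE`, the GSA and parameter keys, and all numeric values are hypothetical placeholders and are not data from the application.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class BiologicRecord:
    """One record of the attribute store (hypothetical layout)."""
    name: str
    gsas: dict[str, float]        # GSAs linked to the known biologic
    parameters: dict[str, float]  # known structure-function parameter values


# Toy attribute store; every value here is made up for the example.
ATTRIBUTE_STORE = [
    BiologicRecord("known-A", {"cys_fraction": 0.02, "hydrophobic_fraction": 0.40},
                   {"half_life_days": 20.0}),
    BiologicRecord("known-B", {"cys_fraction": 0.03, "hydrophobic_fraction": 0.45},
                   {"half_life_days": 18.0}),
    BiologicRecord("known-C", {"cys_fraction": 0.10, "hydrophobic_fraction": 0.20},
                   {"half_life_days": 1.0}),
]


def gsa_distance(query: dict[str, float], gsas: dict[str, float]) -> float:
    """Euclidean distance over the query's GSA keys (missing values count as 0)."""
    return sqrt(sum((v - gsas.get(k, 0.0)) ** 2 for k, v in query.items()))


def predict_parameters(query: dict[str, float],
                       store: list[BiologicRecord],
                       k: int = 2) -> dict[str, float]:
    """Average each parameter over the k records whose GSAs are closest to the query."""
    nearest = sorted(store, key=lambda r: gsa_distance(query, r.gsas))[:k]
    totals: dict[str, float] = {}
    counts: dict[str, int] = {}
    for record in nearest:
        for pname, value in record.parameters.items():
            totals[pname] = totals.get(pname, 0.0) + value
            counts[pname] = counts.get(pname, 0) + 1
    return {p: totals[p] / counts[p] for p in totals}
```

A query such as `predict_parameters({"cys_fraction": 0.025, "hydrophobic_fraction": 0.42}, ATTRIBUTE_STORE)` returns a predicted value for each parameter found among the nearest records. In practice the GSA values would need scaling to comparable ranges before a distance is meaningful.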

88. The method of claim 87, wherein each of one or more of the structure-function recommendations comprises an identification of a corresponding relevant structure-function parameter determined by the machine learning module as associated with (e.g., correlated with; e.g., influenced by) a corresponding set of one or more GSAs of the target biologic [e.g., wherein at least one (e.g., one or more; e.g., up to all) of the structure-function recommendations comprises a flag associated with the corresponding relevant structure-function parameter, the flag corresponding to an indication of whether the relevant structure-function parameter should be further analyzed (e.g., wherein the flag is a binary value indicating whether the value of the relevant structure-function parameter does or does not need to be checked; e.g., wherein the flag is one of a set of values)].

89. The method of claim 88, wherein the corresponding relevant structure-function parameter is a parameter selected from the group consisting of:

(A) an intrinsic stability parameter [e.g., a deamidation rate; e.g., an aggregation propensity; e.g., a chain cleavage propensity];

(B) a pharmacokinetic (PK) profile parameter (e.g., an in vivo circulatory half-life; e.g., a biodistribution);

(C) a mechanism of action parameter [e.g., a textual string identifying a particular mechanism of action (e.g., antibody-dependent cell-mediated cytotoxicity (ADCC); e.g., complement-dependent cytotoxicity (CDC))];

(D) an efficacy parameter (e.g., a target binding time; e.g., an occupancy time; e.g., a dosing requirement); and

(E) an immunogenicity parameter (e.g., a binding anti-drug antibody (ADA); e.g., a PK altering ADA; e.g., a neutralizing ADA; e.g., a hypersensitivity ADA; e.g., a cross-reactive neutralizing ADA; e.g., complement activation; e.g., non-antibody cytokine releasing syndrome).

90. The method of any one of claims 87 to 89, wherein each of one or more of the structure-function recommendations comprises a structure-function profile result comprising identifications of relevant structure-function parameters and, for each relevant structure-function parameter, a structure-function parameter value prediction corresponding to a predicted value of the structure-function parameter (e.g., the predicted value corresponding to a predicted specific structural or functional characteristic).

91. The method of claim 90, further comprising:

(d) adjusting a design of the target biologic using the structure-function profile result.

92. The method of any one of claims 87 to 91, wherein each of one or more of the structure-function recommendations comprises an identification of the set of one or more GSAs of the target biologic associated with the corresponding relevant structure-function parameter.

93. The method of claim 92, wherein the one or more structure-function recommendations comprises an indication of criticality of one or more of the GSAs of the set (e.g., the indication of criticality corresponding to a binary variable having a first value if a structural feature is critical and a second value if it is not).

94. The method of claim 93, wherein the associated set of one or more GSAs of the target biologic comprises one or more quality attributes, optionally critical quality attributes (CQAs) of the target biologic.

95. The method of any one of claims 92 to 94, wherein the one or more structure-function recommendations comprises a representation of a correlation between the associated set of one or more GSAs of the target biologic and the relevant structure-function parameter.

96. The method of any one of claims 87 to 95, wherein the one or more structure-function recommendations comprises a recommended analytical study to be carried out on the target biologic.

97. The method of claim 96, comprising performing the recommended analytical study.

98. The method of any one of claims 87 to 97, wherein one or more of the structure-function recommendations comprises a set of recommended clinical trial specifications.

99. The method of any one of claims 87 to 98, wherein one or more of the structure-function recommendations comprises a safety profile recommendation [e.g., a warning flag identifying a potential adverse event; e.g., a patient monitoring recommendation (e.g., a recommendation to test for pre-existing IgE antibodies)].

100. The method of any one of claims 87 to 99, comprising determining one or more bioprocess recommendations (e.g., each bioprocess recommendation associated with one or more relevant structure-function parameters; e.g., the one or more bioprocess recommendations corresponding to any of the bioprocess recommendations of claims 2 to 3 and claims 7 to 12).

101. The method of claim 100, wherein each of one or more of the bioprocess recommendations comprises a bioprocess design result comprising identifications of one or more relevant bioprocess parameters and, for each relevant bioprocess parameter, a bioprocess parameter value recommendation corresponding to a recommended value or change to a value of the relevant bioprocess parameter.

102. The method of claim 101, further comprising:

(d) adjusting a bioprocess protocol for manufacture of the target biologic using the bioprocess design result.

103. The method of claim 102, further comprising:

(e) producing the target biologic using the adjusted bioprocess protocol.

104. The method of any one of claims 87 to 103, comprising determining one or more formulation recommendations (e.g., each formulation recommendation associated with one or more relevant structure-function parameters; e.g., the one or more formulation recommendations corresponding to any of the formulation recommendations of claims 45 to 46 and claims 50 to 55).

105. The method of claim 104, wherein each of one or more of the formulation recommendations comprises a formulation design result comprising identifications of one or more relevant formulation parameters and, for each relevant formulation parameter, a formulation parameter value recommendation corresponding to a recommended value or change to a value of the relevant formulation parameter.

106. The method of claim 105, further comprising:

(d) adjusting a formulation of the target biologic using the formulation design result.

107. The method of claim 106, further comprising:

(e) producing the adjusted formulation of the target biologic.

108. The method of any one of claims 87 to 107, comprising causing, by the processor, display of a graphical representation of the one or more structure-function recommendations.

109. The method of any one of claims 87 to 108, wherein the one or more GSAs of the target biologic comprise(s) one or more values or sets of values, each of which identifies and/or quantifies a particular pattern of structural features associated with (e.g., correlated with) one or more specific properties of a particular biologic molecule, wherein the one or more specific properties comprise one or more members selected from the group consisting of:

(A) stability and/or a propensity for chemical degradation;

(B) a likelihood and/or propensity to form aggregates;

(C) a likelihood and/or propensity for alterations following release (e.g., in storage, or in vivo when administered to patients and exposed to blood-borne enzymes, physiological temperatures, and the like);

(D) an in vivo circulatory property (e.g., a circulatory half-life; e.g., a biodistribution);

(E) efficacy (e.g., a binding affinity for a specific drug-target; e.g., a specificity for a specific drug-target; e.g., a duration of drug-target engagement); and

(F) an immunogenic potential.

110. The method of any one of claims 87 to 109, wherein the one or more GSAs of the target biologic comprise(s) one or more members selected from the group consisting of:

(A) an identification and/or quantification of one or more higher order structures (HOS) motifs (e.g., particular secondary structures; e.g., particular tertiary structures; e.g., particular quaternary structures) of the target biologic (e.g., identification of the presence of particular HOS structures within the target biologic; e.g., a quantification of a relative fraction of the target biologic having a particular HOS); and

(B) an identification and/or quantification of a post-translational modification (PTM) of the target biologic (e.g., intra- and inter-chain disulfide bonds; e.g., mismatched disulfides, free cysteine and/or trisulfide; e.g., glycosylation patterns; e.g., deamidation; e.g., oxidation; e.g., chain cleavage; e.g., phosphorylation; e.g., methylation).

111. The method of any one of claims 87 to 110, wherein the one or more GSAs of the target biologic comprise(s) one or more glycosylation pattern GSAs [e.g., values or sets of values that represent an identification and/or quantification of a particular glycosylation pattern measured for the target biologic].

112. The method of claim 111, wherein the one or more glycosylation pattern GSAs comprise(s) one or more members selected from the group consisting of:

(A) an identification of presence (or absence) of a particular glycan at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the target biologic [e.g., an identification of presence of mannose rich Man5 at a particular site; e.g., an identification of absence of α(1,6) fucose; e.g., an identification of presence of low-abundance Man8; e.g., an identification of presence of α(1,3)];

(B) an identification and/or quantification of a particular type of glycan at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the target biologic (e.g., a quantification of levels of hybrid glycans at one or more particular sites; e.g., a quantification of levels of complex glycans at one or more particular sites);

(C) a quantification of relative levels of two or more particular glycans or types of glycans at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) (e.g., a ratio of hybrid to complex glycans at one or more particular sites);

(D) an expected level of glycan occupancy; and

(E) an identification and/or quantification of glycan content at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked) within the target biologic (e.g., galactose content, afucosylation, sialic acid content, mannose content, etc.).

113. The method of any one of claims 87 to 112, wherein the one or more GSAs of the target biologic comprise(s) one or more members selected from the group consisting of:

(A) a sequence motif;

(B) a molecule type of the target biologic;

(C) a quantification of one or more specific amino acids within the target biologic [e.g., a total number of one or more specific amino acids (e.g., cysteines) within the target biologic; e.g., a fraction of one or more specific amino acids within the target biologic];

(D) a proportion of amino acids within the target biologic having a particular classification based on one or more specific properties; and

(E) an identification and/or quantification of patterns of amino acid motifs associated with propensity towards one or more specific types of amino acid modifications [e.g., positions and/or number of potential sites of oxidation; e.g., positions and/or number of potential sites of deamidation; e.g., positions and/or number of potential sites of post-translational modifications (e.g., N-linked glycosylation; e.g., disulfide bridges; e.g., disulfide knots; e.g., modification of cysteine to formylglycine)].

114. The method of any one of claims 87 to 113, wherein the one or more GSAs of the target biologic comprise(s) a proportion of amino acids within the target biologic having a particular classification based on one or more specific properties, wherein the one or more specific properties comprise(s) one or more members selected from the group consisting of:

(i) hydrophobicity [e.g., having at least a given level of hydrophobicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophobicity; e.g., having a level of hydrophobicity within a particular range; e.g., classified as hydrophobic];

(ii) hydrophilicity [e.g., having at least a given level of hydrophilicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophilicity; e.g., having a level of hydrophilicity within a particular range; e.g., classified as hydrophilic];

(iii) charge (e.g., having a charge greater than or equal to a specific charge; e.g., having charge less than or equal to a specific charge; e.g., having a positive charge; e.g., having a negative charge; e.g., neutral);

(iv) acidity (e.g., classified as acidic; e.g., classified as basic; e.g., classified as neutral); and

(v) aromaticity (e.g., classified as aromatic).

115. The method of any one of claims 87 to 114, wherein the one or more GSAs of the target biologic comprise(s) one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) the target biologic and (ii) a reference biologic.

116. The method of any one of claims 87 to 115, wherein the one or more GSAs of the target biologic comprise(s) one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) a first lot of the target biologic and (ii) a second lot of the target biologic.

117. The method of any one of claims 87 to 116, wherein the one or more GSAs of the target biologic comprise(s) one or more values derived from a CQA map of the target biologic.

118. The method of any one of claims 87 to 117, wherein the one or more GSAs of the target biologic comprise(s) one or more values derived from an in vivo comparability profile of the target biologic.

119. The method of any one of claims 87 to 118, the method comprising:

receiving, by the processor, a user input comprising data corresponding to one or more measured structural features of the target biologic;

determining, by the processor, using the data corresponding to the one or more measured structural features of the target biologic, the one or more GSAs of the target biologic; and

generating, by the processor, the determined one or more target biologic GSAs for use as the input query of step (a).

120. The method of claim 119, comprising performing one or more structural characterization studies on a sample comprising the target biologic to generate the data corresponding to the one or more measured structural features of the target biologic.

121. The method of any one of claims 87 to 120, wherein at least one of the structure-function parameter values of the biologic records in the attribute store corresponds to a value of a corresponding bioprocess parameter selected from the group consisting of:

(A) an intrinsic stability parameter [e.g., a deamidation rate; e.g., an aggregation propensity; e.g., a chain cleavage propensity];

(B) a PK profile parameter (e.g., an in vivo circulatory half-life; e.g., a biodistribution);

(C) a mechanism of action parameter [e.g., a textual string identifying a particular mechanism of action (e.g., antibody-dependent cell-mediated cytotoxicity (ADCC); e.g., complement-dependent cytotoxicity (CDC))];

(D) an efficacy parameter (e.g., a target binding time; e.g., an occupancy time; e.g., a dosing requirement); and

(E) an immunogenicity parameter (e.g., a binding anti-drug antibody (ADA); e.g., a PK altering ADA; e.g., a neutralizing ADA; e.g., a hypersensitivity ADA; e.g., a cross-reactive neutralizing ADA; e.g., complement activation; e.g., non-antibody cytokine releasing syndrome).

122. The method of any one of claims 87 to 121, wherein the one or more GSAs of the associated known biologic comprise(s) one or more values or sets of values, each of which identifies and/or quantifies a particular pattern of structural features associated with (e.g., correlated with) one or more specific properties of a particular biologic molecule, wherein the one or more specific properties comprise one or more members selected from the group consisting of:

(A) stability and/or a propensity for chemical degradation;

(B) a likelihood and/or propensity to form aggregates;

(C) a likelihood and/or propensity for alterations following release (e.g., in storage, or in vivo when administered to patients and exposed to blood-borne enzymes, physiological temperatures, and the like);

(D) an in vivo circulatory property (e.g., a circulatory half-life; e.g., a biodistribution);

(E) efficacy (e.g., a binding affinity for a specific drug-target; e.g., a specificity for a specific drug-target; e.g., a duration of drug-target engagement); and

(F) an immunogenic potential.

123. The method of any one of claims 87 to 122, wherein the one or more GSAs of the associated known biologic comprise(s) one or more members selected from the group consisting of:

(A) an identification and/or quantification of one or more higher order structures (HOS) motifs (e.g., particular secondary structures; e.g., particular tertiary structures; e.g., particular quaternary structures) of the associated known biologic (e.g., identification of the presence of particular HOS structures within the associated known biologic; e.g., a quantification of a relative fraction of the associated known biologic having a particular HOS); and

(B) an identification and/or quantification of a post-translational modification (PTM) of the associated known biologic (e.g., intra- and inter-chain disulfide bonds; e.g., mismatched disulfides, free cysteine and/or trisulfide; e.g., glycosylation patterns; e.g., deamidation; e.g., oxidation; e.g., chain cleavage; e.g., phosphorylation; e.g., methylation).

124. The method of any one of claims 87 to 123, wherein the one or more GSAs of the associated known biologic comprise(s) one or more glycosylation pattern GSAs [e.g., values or sets of values that represent an identification and/or quantification of a particular glycosylation pattern measured for the associated known biologic].

125. The method of claim 124, wherein the one or more glycosylation pattern GSAs comprise(s) one or more members selected from the group consisting of:

(A) an identification of presence (or absence) of a particular glycan at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the associated known biologic [e.g., an identification of presence of mannose rich Man5 at a particular site; e.g., an identification of absence of α(1,6) fucose; e.g., an identification of presence of low-abundance Man8; e.g., an identification of presence of α(1,3)];

(B) an identification and/or quantification of a particular type of glycan at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the associated known biologic (e.g., a quantification of levels of hybrid glycans at one or more particular sites; e.g., a quantification of levels of complex glycans at one or more particular sites);

(C) a quantification of relative levels of two or more particular glycans or types of glycans at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) (e.g., a ratio of hybrid to complex glycans at one or more particular sites);

(D) an expected level of glycan occupancy; and

(E) an identification and/or quantification of glycan content at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked) within the associated known biologic (e.g., galactose content, afucosylation, sialic acid content, mannose content, etc.).

126. The method of any one of claims 87 to 125, wherein the one or more GSAs of the associated known biologic comprise(s) one or more members selected from the group consisting of:

(A) a sequence motif;

(B) a molecule type of the associated known biologic;

(C) a quantification of one or more specific amino acids within the associated known biologic [e.g., a total number of one or more specific amino acids (e.g., cysteines) within the associated known biologic; e.g., a fraction of one or more specific amino acids within the associated known biologic];

(D) a proportion of amino acids within the associated known biologic having a particular classification based on one or more specific properties; and

(E) an identification and/or quantification of patterns of amino acid motifs associated with propensity towards one or more specific types of amino acid modifications [e.g., positions and/or number of potential sites of oxidation; e.g., positions and/or number of potential sites of deamidation; e.g., positions and/or number of potential sites of post-translational modifications (e.g., N-linked glycosylation; e.g., disulfide bridges; e.g., disulfide knots; e.g., modification of cysteine to formylglycine)].

127. The method of any one of claims 87 to 126, wherein the one or more GSAs of the associated known biologic comprises a proportion of amino acids within the associated known biologic having a particular classification based on one or more specific properties, wherein the one or more specific properties comprise(s) one or more members selected from the group consisting of:

(i) hydrophobicity [e.g., having at least a given level of hydrophobicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophobicity; e.g., having a level of hydrophobicity within a particular range; e.g., classified as hydrophobic];

(ii) hydrophilicity [e.g., having at least a given level of hydrophilicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophilicity; e.g., having a level of hydrophilicity within a particular range; e.g., classified as hydrophilic];

(iii) charge (e.g., having a charge greater than or equal to a specific charge; e.g., having charge less than or equal to a specific charge; e.g., having a positive charge; e.g., having a negative charge; e.g., neutral);

(iv) acidity (e.g., classified as acidic; e.g., classified as basic; e.g., classified as neutral); and

(v) aromaticity (e.g., classified as aromatic).

128. The method of any one of claims 87 to 127, wherein the one or more GSAs of the associated known biologic comprise one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) the associated known biologic and (ii) a reference biologic.

129. The method of any one of claims 87 to 128, wherein the one or more GSAs of the associated known biologic comprise one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) a first lot of the associated known biologic and (ii) a second lot of the associated known biologic.

130. The method of any one of claims 87 to 129, wherein the one or more GSAs of the associated known biologic comprise one or more values derived from a CQA map of the associated known biologic.

131. The method of any one of claims 87 to 130, wherein the one or more GSAs of the associated known biologic comprise one or more values derived from an in vivo comparability profile of the associated known biologic.

132. The method of any one of claims 87 to 131, wherein at least a portion of the plurality of biologic records in the attribute store are generated from published documents via automated processing using text mining with or without natural language processing.

133. The method of any one of claims 87 to 132, wherein at least a portion of the plurality of biologic records are generated from published documents via automated processing in combination with a user interaction.

134. The method of any one of claims 87 to 133, wherein at least a portion of the biologic records are generated from direct measurement of biologic structural features.

135. The method of any one of claims 87 to 134, wherein the machine learning module of step (c) implements a supervised machine learning technique.

136. The method of any one of claims 87 to 135, wherein the machine learning module of step (c) implements a reinforcement machine learning technique.

137. The method of any one of claims 87 to 136, wherein the machine learning module of step (c) implements an unsupervised machine learning technique.

138. The method of claim 137, wherein the machine learning module of step (c) implements the unsupervised machine learning technique as a precursor to a supervised machine learning technique.

139. A method for automated analysis of generalizable structural attributes (GSAs) of a target biologic for refinement of at least one of a bioprocess for producing the target biologic, a formulation of the target biologic, and a structure-function profile of the target biologic, the method comprising:

(a) receiving, by a processor of a computing device, an input query comprising one or more GSAs of the target biologic;

(b) accessing, by the processor, an attribute store comprising a plurality of biologic records and GSAs for a set of known biologics, wherein:

(i) each biologic record comprises one or more of (A), (B), and (C):

(A) a set of bioprocess parameter values each of which corresponds to a value of a corresponding bioprocess parameter used in production of the associated known biologic;

(B) a set of formulation parameter values each of which corresponds to a value of a corresponding formulation parameter used in a formulation of the associated known biologic; and

(C) a set of structure-function parameter values, each of which represents a previously determined specific structural or functional characteristic of the associated known biologic;

(ii) each biologic record is linked to one or more GSAs of the associated known biologic; and

(c) determining, by the processor, responsive to the input query any one or more of (i), (ii), and (iii):

(i) one or more bioprocess recommendations;

(ii) one or more formulation recommendations; and

(iii) one or more structure-function recommendations,

wherein step (c) comprises using a machine learning module that identifies patterns relating the GSAs for the set of known biologics with the biologic records of the attribute store.

140. A system for automated analysis of generalizable structural attributes (GSAs) of a target biologic for refinement of at least one of a bioprocess for producing the target biologic, a formulation of the target biologic, and a structure-function profile of the target biologic, the system comprising:

a processor; and

a memory having instructions stored thereon, wherein the instructions, when executed by the processor, cause the processor to perform any of the methods of claims 1 to 139.

Description:
SYSTEMS AND METHODS FOR AUTOMATED BIOLOGIC DEVELOPMENT DETERMINATIONS

CROSS REFERENCE TO RELATED APPLICATION

This application claims priority to and benefit of U.S. Application No. 62/567,716, filed October 3, 2017, the contents of which are hereby incorporated by reference in their entirety.

FIELD OF THE INVENTION

This invention relates generally to systems and methods for facilitating development of biologic drugs. More specifically, in certain embodiments, the techniques described herein facilitate refinement of bioprocesses, formulations, and structure-functional characteristics of biologic drugs via automated analysis of structural features of biologics.

BACKGROUND OF THE INVENTION

Biologics are highly complex molecules whose detailed structural properties are critical to their ability to perform their desired function, as well as their stability over time in storage. Biologics can be expressed and folded incorrectly based on a range of variations in the biosynthesis or manufacturing process, and can be degraded or chemically changed by proteases, heat, acidic or other environmental conditions to produce modified species, including fragments and truncated molecules. Some biologics tend to form aggregates, which may be inactive and sometimes immunogenic. Biologics can be glycosylated at differing N-linked or O-linked sites, by different amounts, and/or with different sugars (e.g., they may vary by galactose content, afucosylation, sialic acid content, mannose content, etc.), and may include molecular species aglycosylated at critical or uncritical locations. Glycosylation of biologics has a significant impact on their stability, function and immunogenicity. For example, the presence of a core heptasaccharide at the Asn-297 position of a monoclonal antibody is important for activation of Fcγ receptors and the C1 component of complement. Proper disulfide bonding within a molecule and/or between molecules typically is critical for efficacy, and wrongly paired or unpaired disulfide bonds can lead to inoperative misfolded contaminants or to aggregates. The product also may be contaminated by one or more host cell proteins (e.g., proteases), DNAs, drugs (e.g., methotrexate) or other residuals from upstream expression, or with leached components from downstream processes such as purification. Biologics also may be deamidated, oxidized, methylated or otherwise modified. The molecule may be altered after its release, in storage, during handling/administration or in vivo when exposed to blood-borne enzymes, physiological temperatures and the like.

Production methods profoundly affect structure in various ways. A master cell bank comprising replicable, recombinant clones that reliably express copious quantities of active biologic is only a beginning. Upstream variables in culture of such cells, such as culture duration, pH, amount of dissolved oxygen, concentrations and identities of media components, temperature, initial cell density, pCO2, mixing and gassing strategy, and feeding strategy, each may affect not only the quantitative protein yield, but also the structure of the product. Furthermore, contaminants such as host cell proteins, metabolites and the like are inevitably introduced into extracellular broths, as are possible infective agents such as viruses. Similarly, the downstream purification process may introduce variants or contaminants that may alter protein structure. The fine structure of the product can be affected by such aspects as the selection of separation technologies such as affinity columns, anionic or cationic exchange columns or ultrafiltration apparatus. Also, contaminants may be introduced or product degraded or derivatized by the addition of preservatives, diluents, or vehicle, by the decision as to when a chromatography resin or filter is replaced, and by the temperature or pH of the product during purification, compounding and storage.

Formulation conditions may also influence the structure of a biologic. In particular, biologics are typically combined with various additives, referred to as excipients, and stored under specific conditions when used as drugs. Individual formulations - specific combinations of a particular biologic, excipients, and storage conditions - are developed to reduce degradation and preserve the structural integrity of the biologic. Accordingly, structural degradation and alterations may be influenced by the presence of specific excipients, such as particular buffers or surfactants. Particular effective excipient combinations may be appropriate for one biologic, but not for another. Similarly, storage conditions, such as pH, temperature, and even container types and coatings may influence the structure of a biologic.

Accordingly, development of biologics is a highly complex process that involves extensive testing and refinement of functional behavior, bioprocesses, and formulations. This complex process is not merely required once, when a new biologic is first developed, but generally must be performed often, and under a variety of common circumstances, such as for validation of a biosimilar, validation of a new lot, or when a new formulation is developed. Currently, testing and refinement of functional behavior, bioprocesses, and formulations is performed by highly experienced experts who perform extensive structural characterization studies on biologics and analyze the structural data to guide adjustments to, e.g., bioprocesses, formulations, and biologic design via a lengthy, iterative, trial-and-error process.

Accordingly, there exists a need for improved systems and methods for determining adjustments to bioprocesses for producing biologic drugs and to formulations of biologic drugs. Additionally, there exists a need for improved systems and methods for identifying relationships between functional behavior and structural characteristics of biologics.

SUMMARY OF THE INVENTION

Presented herein are systems and methods that provide for a biologic development determination technology that facilitates determining effective and improved procedures for developing biologics. In particular, in certain embodiments, the biologic development determination technology leverages patterns in structural features of a target biologic to provide biologic development recommendations that provide guidance to a user for refining bioprocesses and formulations, and identifying important structure-function relationships for the target biologic. These recommendations provide users with information and guidance that reduces the degree to which bioprocesses, formulations, and functional characteristics of a target biologic must be refined via a costly and time-consuming trial-and-error process. In this manner, the systems and methods described herein facilitate development of new biologics and biosimilars, providing for faster development of manufacturing techniques and process scale-up, thereby reducing the time needed to bring a new biologic or biosimilar to market.

For example, in certain embodiments, the systems and methods described herein facilitate refining a bioprocess for manufacture of a target biologic by determining, in an automated fashion, bioprocess recommendations. The determined bioprocess recommendations may comprise identifications of relevant bioprocess parameters that are determined, via the approaches described herein, to be associated with particular structural features of the target biologic. The identified relevant bioprocess parameters can serve as guidance for a user regarding which bioprocess parameters of a bioprocess protocol should be adjusted or experimented with further. Bioprocess recommendations also may include identifications of the particular structural features associated with various relevant bioprocess parameters, as well as representations of correlations between structural features and the relevant bioprocess parameters with which they are associated, thereby providing further guidance regarding adjustments that can be made to a bioprocess protocol. Bioprocess recommendations corresponding to recommended analytical studies to be carried out on the target biologic may also be generated. In certain embodiments, the bioprocess recommendations may even include bioprocess design results that include recommendations of particular values, or changes to values, of the various relevant bioprocess parameters that can be used directly for adjusting a bioprocess protocol.

Determining such bioprocess recommendations is non-trivial. The relationships between structural features of a particular biologic and various parameters of a bioprocess used for its production are complex, and may depend on an interrelation between several bioprocess parameters, as well as underlying properties of the particular biologic itself, such as its molecule type.

In certain embodiments, the approaches described herein determine bioprocess recommendations for a given target biologic by mapping a set of generalizable structural attributes (GSAs) of the target biologic to records of bioprocesses that have been previously used to produce known biologics (e.g., which may be different from the target biologic) and are stored in a database. The records of previously used bioprocesses are stored as components within data structures referred to herein as biologic records. The biologic records store parameters of bioprocesses, as well as parameters of relevant formulations and structure-function profiles of known biologics. This database and the mapping between GSAs and the biologic records it stores are collectively referred to herein as an attribute store. The attribute store codifies domain knowledge and previous experience in (i) producing biologics via various bioprocesses, (ii) formulating biologics, and (iii) determining structure-function profiles of biologics that reflect relationships between functional characteristics of biologics and their structure. Notably, in certain embodiments, the manner in which data are stored in the attribute store goes beyond merely storing records of bioprocesses that were previously implemented to produce various biologics. In particular, the attribute store includes sets of GSAs that are determined for the known biologics produced by the various bioprocesses represented as biologic records in the attribute store. For a given known biologic, GSAs are determined via a preprocessing step and linked to the records of the various bioprocesses used to produce it. As will be described in the following, GSAs are values or sets of values representing various sets or patterns of induced (e.g., derived) structural characteristics and physicochemical properties of a biologic and allow patterns of similarities between various different biologics to be identified.
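By way of illustration only, the linkage between biologic records and GSAs described above might be sketched as follows. This is a minimal sketch, not the implementation of this disclosure: the class names (`BiologicRecord`, `AttributeStore`), the field names, the example biologic `mAb-A`, and all example values are hypothetical, and the sketch assumes that GSAs and parameters can be stored as simple name-to-value mappings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class BiologicRecord:
    """One record for a known biologic (all field names are hypothetical)."""
    biologic: str
    bioprocess_params: Dict[str, float] = field(default_factory=dict)
    formulation_params: Dict[str, float] = field(default_factory=dict)
    structure_function_params: Dict[str, float] = field(default_factory=dict)


@dataclass
class AttributeStore:
    """Biologic records plus the GSAs each known biologic is linked to."""
    gsas: Dict[str, Dict[str, float]] = field(default_factory=dict)
    records: List[BiologicRecord] = field(default_factory=list)

    def add(self, record: BiologicRecord, gsas: Dict[str, float]) -> None:
        # Store the record and link it to the GSAs of its known biologic.
        self.records.append(record)
        self.gsas.setdefault(record.biologic, {}).update(gsas)

    def linked(self) -> List[Tuple[Dict[str, float], BiologicRecord]]:
        # Return (GSAs, record) pairs, the form a learning step could consume.
        return [(self.gsas[r.biologic], r) for r in self.records]


# Illustrative, made-up values only.
store = AttributeStore()
store.add(
    BiologicRecord("mAb-A", bioprocess_params={"pH": 7.0, "temperature_C": 37.0}),
    gsas={"cysteine_fraction": 0.02},
)
pairs = store.linked()
```

Keeping the GSAs keyed by known biologic, rather than copied into every record, reflects the description that several records (for example, several bioprocesses) may be linked to the same set of GSAs.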

In certain embodiments, by virtue of the manner in which biologic records are linked with GSAs of known biologics in the attribute store, the approaches described herein go beyond merely searching a database to identify and return preexisting bioprocess recommendations. In particular, the systems and methods described herein may determine, for a given target biologic, a set of target GSAs that are used as features in machine learning algorithms that utilize the attribute store to identify relevant bioprocess parameters that are associated with structural features of the target biologic. In this manner, the systems and methods allow for bioprocess recommendations, including bioprocess recommendations comprising bioprocess design results, to be determined (e.g., in an improved, faster, and more accurate manner) for target biologics that have not been produced before, and/or for parameters related to elements of a bioprocess that have not previously been used to produce the target biologic.
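As a purely illustrative sketch of using target GSAs as features, the following example substitutes a simple k-nearest-neighbors average for the machine learning module; this disclosure does not specify that technique, and it is chosen here only because it is short. The function names, GSA names, and all numeric values are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Gsas = Dict[str, float]
Params = Dict[str, float]


def gsa_distance(a: Gsas, b: Gsas, keys: List[str]) -> float:
    """Euclidean distance between two GSA vectors over the given keys."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))


def recommend_parameter(target: Gsas,
                        known: List[Tuple[Gsas, Params]],
                        param: str,
                        k: int = 2) -> float:
    """Average a bioprocess parameter over the k known biologics
    whose GSAs are closest to the target GSAs (k-NN stand-in)."""
    keys = sorted(target)
    ranked = sorted(known, key=lambda item: gsa_distance(target, item[0], keys))
    nearest = ranked[:k]
    return sum(params[param] for _, params in nearest) / len(nearest)


# Made-up (GSAs, bioprocess parameters) pairs for three known biologics.
known = [
    ({"hydrophobic_fraction": 0.40, "cysteine_fraction": 0.02}, {"pH": 7.0}),
    ({"hydrophobic_fraction": 0.42, "cysteine_fraction": 0.03}, {"pH": 7.2}),
    ({"hydrophobic_fraction": 0.10, "cysteine_fraction": 0.10}, {"pH": 6.0}),
]
target = {"hydrophobic_fraction": 0.41, "cysteine_fraction": 0.02}
suggested_ph = recommend_parameter(target, known, "pH", k=2)
```

Because the suggestion is derived from structurally similar known biologics rather than from a stored record of the target itself, a value can be produced even for a target biologic that has never been manufactured, which is the property the paragraph above emphasizes.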

In certain embodiments, the systems and methods described herein provide for automated determination of formulation recommendations. The biologic development determination tool described herein may also predict and/or refine structure-function recommendations that identify potential functional characteristics of a target biologic based on the presence of related structural features. As described herein, formulation recommendations and structure-function recommendations may be determined in a similar fashion to that described above with respect to bioprocess recommendations. In particular, the attribute store may also comprise biologic records corresponding to, and storing parameters of, previously used formulations of known biologics and, similarly, biologic records that correspond to and store parameters of structure-functional profiles previously determined for various known biologics. As described above with respect to determining bioprocess recommendations, by virtue of the manner in which they link formulation parameters and/or structure-function parameters with GSAs of associated known biologics, the biologic records of the attribute store can be analyzed and used to determine formulation recommendations and structure-function recommendations.

Accordingly, in this manner, the biologic development determination tool described herein facilitates biologic development.

I Bioprocess recommendations

In one aspect, the invention is directed to a method for refining a bioprocess for manufacture of a target biologic based on automated analysis of generalizable structural attributes (GSAs) of the target biologic, the method comprising: (a) receiving, by a processor of a computing device, an input query comprising one or more GSAs of the target biologic; (b) accessing, by the processor, an attribute store comprising a plurality of biologic records and GSAs for a set of known biologics, each biologic record corresponding to a bioprocess having been implemented in production of an associated known biologic from the set of known biologics, wherein: (i) each biologic record comprises a set of bioprocess parameter values each of which corresponds to a value of a corresponding bioprocess parameter used in the production of the associated known biologic; and (ii) each biologic record is linked to one or more GSAs of the associated known biologic; and (c) determining, by the processor, responsive to the input query, one or more bioprocess recommendations using a machine learning module that identifies patterns relating the GSAs for the set of known biologics with the biologic records of the attribute store.

A. Output of the tool, and how it is used

Detail on bioprocess recommendations

In certain embodiments, each of one or more of the bioprocess recommendations comprises an identification of a corresponding relevant bioprocess parameter determined by the machine learning module as associated with (e.g., influencing; e.g., correlated with) a corresponding set of one or more GSAs of the target biologic [e.g., wherein at least one (e.g., one or more; e.g. up to all) of the bioprocess recommendations comprises a flag associated with the corresponding relevant bioprocess parameter, the flag corresponding to an indication of whether a value of the relevant bioprocess parameter should be further analyzed (e.g., wherein the flag is a binary value indicating whether the value of the relevant bioprocess parameter does or does not need to be checked; e.g., wherein the flag is one of a set of values)].
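For illustration, a bioprocess recommendation carrying the binary flag described above could be represented as in the following minimal sketch. The class name, field names, and example entries are hypothetical and serve only to show how a user-facing tool might list the parameters flagged for further analysis.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BioprocessRecommendation:
    """A relevant bioprocess parameter, the GSAs it is associated with,
    and a binary flag indicating whether its value should be checked."""
    parameter: str
    associated_gsas: Tuple[str, ...]
    needs_check: bool


def parameters_to_check(recs: List[BioprocessRecommendation]) -> List[str]:
    """Return the names of parameters flagged for further analysis."""
    return [r.parameter for r in recs if r.needs_check]


# Made-up example recommendations.
recs = [
    BioprocessRecommendation("pH", ("galactose_content",), needs_check=True),
    BioprocessRecommendation("seeding_density", ("aggregation_propensity",), needs_check=False),
]
flagged = parameters_to_check(recs)
```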

In certain embodiments, the relevant bioprocess parameter is a parameter selected from the group consisting of: (A) a host cell parameter [e.g., a cell type; e.g., a cell age; e.g., one or more host cell proteins (HCPs); e.g., a seeding density]; (B) a cell culture media parameter [e.g., presence and/or concentration of a particular cell culture media component (e.g., glucose, an amino acid, and other nutrients; e.g., a buffer; e.g., a hormone and/or growth factor; e.g., an antibiotic; e.g., a drug; e.g., a biosynthetic pathway modulator)]; (C) a bioreactor condition parameter [e.g., a reactor type and/or mode of operation; e.g., a volume; e.g., a pH; e.g., a temperature; e.g., pO2; e.g., a flow rate; e.g., a feeding schedule; e.g., a waste buildup monitoring indicator; e.g., a cleaning process]; (D) a purification parameter (e.g., a parameter identifying use of centrifugation, filtration, and/or diafiltration and associated settings; e.g., an initial purification type and associated settings; e.g., a polishing chromatography parameter; e.g., a viral clearance parameter); and (E) a conjugation parameter (e.g., a drug parameter; e.g., a linker parameter; e.g., a polymer; e.g., an identification of a particular solvent; e.g., a drug-antibody ratio (DAR)).

Recommended changes to bioprocess parameter values (bioprocess design results)

In certain embodiments, each of one or more of the bioprocess recommendations comprises a bioprocess design result comprising identifications of one or more relevant bioprocess parameters and, for each relevant bioprocess parameter, a bioprocess parameter value recommendation corresponding to a recommended value or change to a value of the relevant bioprocess parameter.

In certain embodiments, the method further comprises: adjusting a bioprocess protocol for manufacture of the target biologic using the bioprocess design result.

In certain embodiments, the method further comprises: producing the target biologic using the adjusted bioprocess protocol.

Identification of Correlations

In certain embodiments, each of one or more of the bioprocess recommendations comprises an identification of the set of one or more GSAs of the target biologic associated with the corresponding relevant bioprocess parameter.

In certain embodiments, the one or more bioprocess recommendations comprises an indication of criticality of one or more of the GSAs of the set (e.g., the indication of criticality corresponding to a binary variable having a first value if a structural feature is critical and a second value if it is not).

In certain embodiments, the associated set of one or more GSAs comprises one or more quality attributes, optionally critical quality attributes (CQAs) of the target biologic. In certain embodiments, the one or more bioprocess recommendations comprises a representation of a correlation between the associated set of one or more GSAs of the target biologic and the relevant bioprocess parameter.

Recommended Analytical Studies

In certain embodiments, the one or more bioprocess recommendations comprises a recommended analytical study to be carried out on the target biologic.

In certain embodiments, the method comprises performing the recommended analytical study.

Display of bioprocess recommendations

In certain embodiments, the method comprises causing, by the processor, display of a graphical representation of the one or more bioprocess recommendations (e.g., rendering the graphical representation).

B. Different GSAs and types of input

Target biologic GSA detail

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values or sets of values, each of which identifies and/or quantifies a particular pattern of structural features associated with (e.g., correlated with) one or more specific properties of a particular biologic molecule, wherein the one or more specific properties comprise one or more members selected from the group consisting of: (A) stability and/or a propensity for chemical degradation; (B) a likelihood and/or propensity to form aggregates; (C) a likelihood and/or propensity for alterations following release (e.g., in storage, or in vivo when administered to patients and exposed to blood-borne enzymes, physiological temperatures, and the like); (D) an in vivo circulatory property (e.g., a circulatory half-life; e.g., a biodistribution); (E) efficacy (e.g., a binding affinity for a specific drug-target; e.g., a specificity for a specific drug-target; e.g., a duration of drug-target engagement); and (F) an immunogenic potential.

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more members selected from the group consisting of: (A) an identification and/or quantification of one or more higher order structures (HOS) motifs (e.g., particular secondary structures; e.g., particular tertiary structures; e.g., particular quaternary structures) of the target biologic (e.g., identification of the presence of particular HOS structures within the target biologic; e.g., a quantification of a relative fraction of the target biologic having a particular HOS); and (B) an identification and/or quantification of a post-translational modification (PTM) of the target biologic (e.g., intra- and inter-chain disulfide bonds; e.g., mismatched disulfides, free cysteine and/or trisulfide; e.g., glycosylation patterns; e.g., deamidation; e.g., oxidation; e.g., chain cleavage; e.g., phosphorylation; e.g., methylation).

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more glycosylation pattern GSAs [e.g., values or sets of values that represent an identification and/or quantification of a particular glycosylation pattern measured for the target biologic].

In certain embodiments, the one or more glycosylation pattern GSAs comprise(s) one or more members selected from the group consisting of: (A) an identification of presence (or absence) of a particular glycan at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the target biologic [e.g., an identification of presence of mannose rich Man5 at a particular site; e.g., an identification of absence of α(1,6) fucose; e.g., an identification of presence of low-abundance Man8; e.g., an identification of presence of α(1,3)]; (B) an identification and/or quantification of a particular type of glycan at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the target biologic (e.g., a quantification of levels of hybrid glycans at one or more particular sites; e.g., a quantification of levels of complex glycans at one or more particular sites); (C) a quantification of relative levels of two or more particular glycans or types of glycans at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) (e.g., a ratio of hybrid to complex glycans at one or more particular sites); (D) an expected level of glycan occupancy; and (E) an identification and/or quantification of glycan content at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked) within the target biologic (e.g., galactose content, afucosylation, sialic acid content, mannose content, etc.).
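Items (A) through (C) above can be sketched as simple computations over a per-site glycan profile. The following Python sketch is illustrative only: the data layout, the function names, the site label, the glycan names, the glycan-type lookup, and all abundance values are assumptions introduced for the example and are not measured data or part of the specification.

```python
from typing import Dict, Optional

# Hypothetical glycan profile: site -> {glycan name: relative abundance}.
profile: Dict[str, Dict[str, float]] = {
    "N297": {"G0F": 0.50, "G1F": 0.25, "Man5": 0.05, "Hybrid-1": 0.20},
}

# Assumed lookup giving the glycan type of each glycan name used above.
GLYCAN_TYPE = {
    "G0F": "complex",
    "G1F": "complex",
    "Man5": "high-mannose",
    "Hybrid-1": "hybrid",
}


def glycan_present(profile, site: str, glycan: str, threshold: float = 0.0) -> bool:
    """Item (A): presence (or absence) of a particular glycan at a particular site."""
    return profile.get(site, {}).get(glycan, 0.0) > threshold


def type_level(profile, site: str, glycan_type: str) -> float:
    """Item (B): summed relative abundance of one glycan type at a site."""
    return sum(level for name, level in profile.get(site, {}).items()
               if GLYCAN_TYPE.get(name) == glycan_type)


def type_ratio(profile, site: str, numerator: str, denominator: str) -> Optional[float]:
    """Item (C): relative level of two glycan types; None if the denominator is absent."""
    denom = type_level(profile, site, denominator)
    return type_level(profile, site, numerator) / denom if denom else None


has_man5 = glycan_present(profile, "N297", "Man5")
hybrid_to_complex = type_ratio(profile, "N297", "hybrid", "complex")
```

Each returned value (a boolean, a level, or a ratio) could serve as one entry in a set of glycosylation pattern GSA values.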

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more members selected from the group consisting of: (A) a sequence motif; (B) a molecule type of the target biologic; (C) a quantification of one or more specific amino acids within the target biologic [e.g., a total number of one or more specific amino acids (e.g., cysteines) within the target biologic; e.g., a fraction of one or more specific amino acids within the target biologic]; (D) a proportion of amino acids within the target biologic having a particular classification based on one or more specific properties; and (E) an identification and/or quantification of patterns of amino acid motifs associated with propensity towards one or more specific types of amino acid modifications [e.g., positions and/or number of potential sites of oxidation; e.g., positions and/or number of potential sites of deamidation; e.g., positions and/or number of potential sites of post-translational modifications (e.g., N-linked glycosylation; e.g., disulfide bridges; e.g., disulfide knots; e.g., modification of cysteine to formylglycine)].
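A minimal Python sketch of how items (C) and (E) above might be derived from a primary sequence follows. The function name and the example sequence are hypothetical. The N-linked glycosylation pattern uses the commonly cited N-X-S/T sequon (X not proline); treating every methionine as a potential oxidation site and every Asn-Gly pair as a potential deamidation site is a deliberate simplification assumed here for illustration.

```python
import re


def sequence_gsas(seq: str) -> dict:
    """Compute a few sequence-derived GSAs from a one-letter amino acid string.

    Positions are reported 1-indexed.
    """
    seq = seq.upper()
    n = len(seq)
    return {
        # (C) total number and fraction of a specific amino acid (cysteine)
        "cysteine_count": seq.count("C"),
        "cysteine_fraction": seq.count("C") / n if n else 0.0,
        # (E) potential N-linked glycosylation sites: N-X-S/T sequon, X != P.
        # The lookahead keeps matches from consuming residues, so overlapping
        # sequons are all reported.
        "n_glyc_sites": [m.start() + 1 for m in re.finditer(r"N(?=[^P][ST])", seq)],
        # (E) potential oxidation sites, approximated as methionine positions
        "oxidation_sites": [i + 1 for i, aa in enumerate(seq) if aa == "M"],
        # (E) potential deamidation sites, approximated as Asn-Gly motifs
        "deamidation_sites": [m.start() + 1 for m in re.finditer(r"N(?=G)", seq)],
    }


gsas = sequence_gsas("MCNGSANPTCNKTM")  # made-up sequence for illustration
```

For this made-up sequence, the asparagine at position 7 is not reported as a glycosylation site because it is followed by proline.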

In certain embodiments, the one or more GSAs of the target biologic comprise(s) a proportion of amino acids within the target biologic having a particular classification based on one or more specific properties, wherein the one or more specific properties comprise(s) one or more members selected from the group consisting of: (i) hydrophobicity [e.g., having at least a given level of hydrophobicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophobicity; e.g., having a level of hydrophobicity within a particular range; e.g., classified as hydrophobic]; (ii) hydrophilicity [e.g., having at least a given level of hydrophilicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophilicity; e.g., having a level of hydrophilicity within a particular range; e.g., classified as hydrophilic]; (iii) charge (e.g., having a charge greater than or equal to a specific charge; e.g., having charge less than or equal to a specific charge; e.g., having a positive charge; e.g., having a negative charge; e.g., neutral); (iv) acidity (e.g., classified as acidic; e.g., classified as basic; e.g., classified as neutral); and (v) aromaticity (e.g., classified as aromatic).
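The proportion-based GSAs described above reduce to counting residues that fall in a given class. The Python sketch below assumes one possible assignment of residues to classes; the class memberships, the dictionary name `RESIDUE_CLASSES`, the function name, and the example sequence are illustrative assumptions, and a predefined hydrophobicity scale with a cutoff could be substituted for the fixed sets.

```python
# Assumed residue classes (one-letter codes); other groupings or scales could be used.
RESIDUE_CLASSES = {
    "hydrophobic": set("AVILMFW"),
    "positive": set("KR"),
    "negative": set("DE"),
    "aromatic": set("FWY"),
}


def class_proportions(seq: str) -> dict:
    """Return, for each class, the fraction of residues in seq that belong to it."""
    seq = seq.upper()
    if not seq:
        return {name: 0.0 for name in RESIDUE_CLASSES}
    return {
        name: sum(aa in members for aa in seq) / len(seq)
        for name, members in RESIDUE_CLASSES.items()
    }


props = class_proportions("AKDEFWGSLY")  # made-up ten-residue sequence
```

A residue can belong to more than one class (for example, phenylalanine is counted as both hydrophobic and aromatic here), so the proportions need not sum to one.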

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) the target biologic and (ii) a reference biologic.

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) a first lot of the target biologic and (ii) a second lot of the target biologic.
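A difference or similarity value of the kind described in the two preceding paragraphs (target versus reference, or lot versus lot) could be computed in many ways. The following Python sketch shows one assumed choice: a per-feature signed difference and a similarity score equal to one minus half the summed absolute difference, which equals 1.0 for identical profiles whose values each sum to one. The function names, glycan names, and abundance values are hypothetical.

```python
from typing import Dict


def profile_difference(first: Dict[str, float], second: Dict[str, float]) -> Dict[str, float]:
    """Per-feature signed difference (second minus first) over the union of features."""
    return {k: second.get(k, 0.0) - first.get(k, 0.0)
            for k in sorted(set(first) | set(second))}


def profile_similarity(first: Dict[str, float], second: Dict[str, float]) -> float:
    """Assumed similarity metric: 1 - 0.5 * sum of absolute per-feature differences."""
    diffs = profile_difference(first, second)
    return 1.0 - 0.5 * sum(abs(d) for d in diffs.values())


# Made-up relative glycan abundances for two lots of the same biologic.
lot_a = {"G0F": 0.50, "G1F": 0.25, "Man5": 0.25}
lot_b = {"G0F": 0.50, "G1F": 0.50}

diff = profile_difference(lot_a, lot_b)
sim = profile_similarity(lot_a, lot_b)
```

The same two functions apply unchanged when the second profile comes from a reference biologic rather than a second lot.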

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values derived from a CQA map of the target biologic.

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values derived from an in vivo comparability profile of the target biologic.

In certain embodiments, the method comprises: receiving, by the processor, a user input comprising data corresponding to one or more measured structural features of the target biologic; determining, by the processor, using the data corresponding to the one or more measured structural features of the target biologic, the one or more GSAs of the target biologic; and generating, by the processor, the determined one or more target biologic GSAs for use as the input query of step (a).

In certain embodiments, the method comprises performing one or more structural characterization studies on a sample comprising the target biologic to generate the data corresponding to the one or more measured structural features of the target biologic.

C. Biologic Records Comprising Bioprocess Parameters

Bioprocess parameters

In certain embodiments, at least one of (e.g., one or more of; e.g., each of) the bioprocess parameter values of the biologic records in the attribute store corresponds to a value of a corresponding bioprocess parameter selected from the group consisting of: (A) a host cell parameter [e.g., a cell type; e.g., a cell age; e.g., one or more host cell proteins (HCPs); e.g., a seeding density]; (B) a cell culture media parameter [e.g., presence and/or concentration of a particular cell culture media component (e.g., glucose, an amino acid, and other nutrients; e.g., a buffer; e.g., a hormone and/or growth factor; e.g., an antibiotic; e.g., a drug; e.g., a biosynthetic pathway modulator)]; (C) a bioreactor condition parameter (e.g., a reactor type and/or mode of operation; e.g., a volume; e.g., a pH; e.g., a temperature; e.g., pO2; e.g., a flow rate; e.g., a feeding schedule; e.g., a waste buildup monitoring indicator; e.g., a cleaning process); (D) a purification parameter (e.g., a parameter identifying use of centrifugation, filtration, and/or diafiltration and associated settings; e.g., an initial purification type and associated settings; e.g., a polishing chromatography parameter; e.g., a viral clearance parameter); and (E) a conjugation parameter (e.g., a drug parameter; e.g., a linker parameter; e.g., a polymer; e.g., an identification of a particular solvent; e.g., a drug-antibody ratio (DAR)).

GSAs of the associated known biologic

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) one or more values or sets of values, each of which identifies and/or quantifies a particular pattern of structural features associated with (e.g., correlated with) one or more specific properties of a particular biologic molecule, wherein the one or more specific properties comprise one or more members selected from the group consisting of: (A) stability and/or a propensity for chemical degradation; (B) a likelihood and/or propensity to form aggregates; (C) a likelihood and/or propensity for alterations following release (e.g., in storage, or in vivo when administered to patients and exposed to blood-borne enzymes, physiological temperatures, and the like); (D) an in vivo circulatory property (e.g., a circulatory half-life; e.g., a biodistribution); (E) efficacy (e.g., a binding affinity for a specific drug-target; e.g., a specificity for a specific drug-target; e.g., a duration of drug-target engagement); and (F) an immunogenic potential.

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) one or more members selected from the group consisting of: (A) an identification and/or quantification of one or more higher order structures (HOS) motifs (e.g., particular secondary structures; e.g., particular tertiary structures; e.g., particular quaternary structures) of the associated known biologic (e.g., identification of the presence of particular HOS structures within the associated known biologic; e.g., a quantification of a relative fraction of the associated known biologic having a particular HOS); and (B) an identification and/or quantification of a post-translational modification (PTM) of the associated known biologic (e.g., intra- and inter-chain disulfide bonds; e.g., mismatched disulfides, free cysteine and/or trisulfide; e.g., glycosylation patterns; e.g., deamidation; e.g., oxidation; e.g., chain cleavage; e.g., phosphorylation; e.g., methylation).

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) one or more glycosylation pattern GSAs [e.g., values or sets of values that represent an identification and/or quantification of a particular glycosylation pattern measured for the associated known biologic].

In certain embodiments, the one or more glycosylation pattern GSAs comprise(s) one or more members selected from the group consisting of: (A) an identification of presence (or absence) of a particular glycan at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the associated known biologic [e.g., an identification of presence of mannose rich Man5 at a particular site; e.g., an identification of absence of α(1,6) fucose; e.g., an identification of presence of low-abundance Man8; e.g., an identification of presence of α(1,3)]; (B) an identification and/or quantification of a particular type of glycan at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the associated known biologic (e.g., a quantification of levels of hybrid glycans at one or more particular sites; e.g., a quantification of levels of complex glycans at one or more particular sites); (C) a quantification of relative levels of two or more particular glycans or types of glycans at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) (e.g., a ratio of hybrid to complex glycans at one or more particular sites); (D) an expected level of glycan occupancy; and (E) an identification and/or quantification of glycan content at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked) within the associated known biologic (e.g., galactose content, afucosylation, sialic acid content, mannose content, etc.).

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) one or more members selected from the group consisting of: (A) a sequence motif; (B) a molecule type of the associated known biologic; (C) a quantification of one or more specific amino acids within the associated known biologic [e.g., a total number of one or more specific amino acids (e.g., cysteines) within the associated known biologic; e.g., a fraction of one or more specific amino acids within the associated known biologic]; (D) a proportion of amino acids within the associated known biologic having a particular classification based on one or more specific properties; and (E) an identification and/or quantification of patterns of amino acid motifs associated with propensity towards one or more specific types of amino acid modifications [e.g., positions and/or number of potential sites of oxidation; e.g., positions and/or number of potential sites of deamidation; e.g., positions and/or number of potential sites of post-translational modifications (e.g., N-linked glycosylation; e.g., disulfide bridges; e.g., disulfide knots; e.g., modification of cysteine to formylglycine)].

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) a proportion of amino acids within the associated known biologic having a particular classification based on one or more specific properties, wherein the one or more specific properties comprise(s) one or more members selected from the group consisting of: (i) hydrophobicity [e.g., having at least a given level of hydrophobicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophobicity; e.g., having a level of hydrophobicity within a particular range; e.g., classified as hydrophobic]; (ii) hydrophilicity [e.g., having at least a given level of hydrophilicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophilicity; e.g., having a level of hydrophilicity within a particular range; e.g., classified as hydrophilic]; (iii) charge (e.g., having a charge greater than or equal to a specific charge; e.g., having charge less than or equal to a specific charge; e.g., having a positive charge; e.g., having a negative charge; e.g., neutral); (iv) acidity (e.g., classified as acidic; e.g., classified as basic; e.g., classified as neutral); and (v) aromaticity (e.g., classified as aromatic).

In certain embodiments, the one or more GSAs of the associated known biologic comprise one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) the associated known biologic and (ii) a reference biologic.

In certain embodiments, the one or more GSAs of the associated known biologic comprise one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) a first lot of the associated known biologic and (ii) a second lot of the associated known biologic.

In certain embodiments, the one or more GSAs of the associated known biologic comprise one or more values derived from a CQA map of the associated known biologic.

In certain embodiments, the one or more GSAs of the associated known biologic comprise one or more values derived from an in vivo comparability profile of the associated known biologic.

Generating biologic records

In certain embodiments, at least a portion of the plurality of biologic records in the attribute store are generated from published documents via automated processing using text mining with or without natural language processing.

In certain embodiments, at least a portion of the plurality of biologic records are generated from published documents via automated processing in combination with a user interaction.

In certain embodiments, at least a portion of the biologic records are generated from direct measurement of biologic structural features.

D. Machine Learning Module Detail

In certain embodiments, the machine learning module of step (c) implements a supervised machine learning technique.

In certain embodiments, the machine learning module of step (c) implements a reinforcement machine learning technique.

In certain embodiments, the machine learning module of step (c) implements an unsupervised machine learning technique.

In certain embodiments, the machine learning module of step (c) implements the unsupervised machine learning technique as a precursor to a supervised machine learning technique.
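To make the idea of an unsupervised technique serving as a precursor to a supervised technique concrete, the following Python sketch clusters GSA vectors with plain k-means and then assigns each cluster the most common label among its labeled members. This cluster-then-label scheme is a simplified stand-in chosen for illustration and is not the machine learning module of the specification; the function names, the two-dimensional GSA vectors, and the reactor-mode labels are all hypothetical.

```python
from collections import Counter
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest_index(p: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to p (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))


def kmeans(points: Sequence[Point], centroids: List[Point], iterations: int = 10) -> List[Point]:
    """Unsupervised step: plain k-means starting from the given centroids."""
    for _ in range(iterations):
        groups: List[List[Point]] = [[] for _ in centroids]
        for p in points:
            groups[nearest_index(p, centroids)].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def label_clusters(points, labels, centroids) -> List[str]:
    """Supervised step: give each cluster the most common label among its members."""
    votes = [Counter() for _ in centroids]
    for p, label in zip(points, labels):
        votes[nearest_index(p, centroids)][label] += 1
    return [v.most_common(1)[0][0] if v else "unknown" for v in votes]


# Made-up GSA vectors for known biologics and the reactor mode each one used.
points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
labels = ["fed-batch", "fed-batch", "perfusion", "perfusion"]

centroids = kmeans(points, [points[0], points[2]])
cluster_labels = label_clusters(points, labels, centroids)
prediction = cluster_labels[nearest_index((0.7, 0.95), centroids)]
```

The initial centroids are fixed here so the sketch is deterministic; a practical implementation would typically use a library clustering routine with multiple random restarts.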

II. Formulations recommendations

In another aspect, the invention is directed to a method for refining formulation of a target biologic based on automated analysis of generalizable structural attributes (GSAs) of the target biologic, the method comprising: (a) receiving, by a processor of a computing device, an input query comprising one or more GSAs of the target biologic; (b) accessing, by the processor, an attribute store comprising a plurality of biologic records and GSAs for a set of known biologics, each biologic record corresponding to a formulation of an associated known biologic from the set of known biologics, wherein: (i) each biologic record comprises a set of formulation parameter values each of which corresponds to a value of a corresponding formulation parameter used in the formulation of the associated known biologic, and (ii) each biologic record is linked to one or more GSAs of the associated known biologic; and (c) determining, by the processor, responsive to the input query, one or more formulation recommendations using a machine learning module that identifies patterns relating the GSAs for the set of known biologics with the biologic records of the attribute store.

A. Output of the tool and how it is used

Detail on formulation recommendations

In certain embodiments, each of one or more of the formulation recommendations comprises an identification of a corresponding relevant formulation parameter determined by the machine learning module as associated with (e.g., correlated with; e.g., influencing) a corresponding set of one or more GSAs of the target biologic [e.g., wherein at least one (e.g., one or more; e.g. up to all) of the formulation recommendations comprises a flag associated with the corresponding relevant formulation parameter, the flag corresponding to an indication of whether a value of the relevant formulation parameter should be further analyzed (e.g., wherein the flag is a binary value indicating whether the value of the relevant formulation parameter does or does not need to be checked; e.g., wherein the flag is one of a set of values)].

In certain embodiments, the corresponding relevant formulation parameter is a parameter selected from the group consisting of: (A) a buffer parameter [e.g., an identification of a particular type of buffer (e.g., Histidine, Acetate, Citrate, Phosphate, Tris, Bicarbonate, and the like) and, optionally, its concentration]; (B) a surfactant parameter [e.g., an identification of a particular surfactant (e.g., Polysorbate 80, Polysorbate 20, Poloxamer, and the like) and, optionally, its concentration]; (C) an excipient parameter [e.g., an identification of a particular excipient (e.g., Sucrose, Arginine, Glycine, Methionine, NaCl, Serum albumin, Zinc, Protamine, an antioxidant, a chelator (e.g., EGTA), Poly(D,L-lactide-co-glycolide), Cyclodextrin, a bacteriostat, and the like), and, optionally, its concentration]; (D) a form/concentration parameter [e.g., an identification of a particular storage form (e.g., liquid, lyophilized, spray dried, sustained release, and the like), and, optionally, a concentration of the target biologic]; (E) a storage pH; (F) a storage temperature; (G) a container parameter {e.g., an identification of a container type [e.g., a textual string listing container properties such as material (e.g., including any coatings or surface treatments such as siliconization), format (e.g., vial, syringe, and the like), and size]}; and (H) a closure system parameter {e.g., an identification of a particular type of closure system [e.g., a textual string listing closure system types and properties (e.g., material) (e.g., “natural latex stopper”; e.g., “silicone rubber stopper”; e.g., “syringe needle”; e.g., “autoinjector”; e.g., “amber ampoule”)]}.

Recommended changes to formulation parameter values (formulation design results)

In certain embodiments, each of one or more of the formulation recommendations comprises a formulation design result comprising identifications of one or more relevant formulation parameters and, for each relevant formulation parameter, a formulation parameter value recommendation corresponding to a recommended value or change to a value of the relevant formulation parameter.

In certain embodiments, the method further comprises adjusting a formulation of the target biologic using the formulation design result.
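A formulation design result and its use to adjust a formulation can be pictured with the short Python sketch below. It assumes that each recommendation carries either an absolute recommended value or a signed change to a numeric value; the class name, field names, parameter keys, and all values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

Value = Union[str, float]


@dataclass
class ParameterRecommendation:
    parameter: str                              # relevant formulation parameter
    recommended_value: Optional[Value] = None   # absolute recommended value, or
    recommended_change: Optional[float] = None  # signed change to a numeric value


def apply_design_result(formulation: Dict[str, Value],
                        design_result: List[ParameterRecommendation]) -> Dict[str, Value]:
    """Return a copy of the formulation adjusted according to the design result."""
    adjusted = dict(formulation)
    for rec in design_result:
        if rec.recommended_value is not None:
            adjusted[rec.parameter] = rec.recommended_value
        elif rec.recommended_change is not None:
            # Assumes the current value of this parameter is numeric.
            adjusted[rec.parameter] = adjusted[rec.parameter] + rec.recommended_change
    return adjusted


# Made-up current formulation and design result.
current = {"buffer": "Phosphate", "storage_pH": 6.5, "storage_temperature_C": 4.0}
design_result = [
    ParameterRecommendation("buffer", recommended_value="Histidine"),
    ParameterRecommendation("storage_pH", recommended_change=-0.5),
]
adjusted = apply_design_result(current, design_result)
```

Returning a copy rather than modifying the input keeps the original formulation available for comparison with the adjusted one.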

In certain embodiments, the method further comprises producing the adjusted formulation of the target biologic.

Identification of Correlations

In certain embodiments, each of one or more of the formulation recommendations comprises an identification of the set of one or more GSAs of the target biologic associated with the corresponding relevant formulation parameter.

In certain embodiments, the one or more formulation recommendations comprises an indication of criticality of one or more of the GSAs of the set (e.g., the indication of criticality corresponding to a binary variable having a first value if a structural feature is critical and a second value if it is not).

In certain embodiments, the associated set of one or more GSAs of the target biologic comprises one or more quality attributes, optionally, critical quality attributes (CQAs) of the target biologic.

In certain embodiments, the one or more formulation recommendations comprises a representation of a correlation between the associated set of one or more GSAs of the target biologic and the relevant formulation parameter.

Recommended Analytical Studies

In certain embodiments, the one or more formulation recommendations comprises a recommended analytical study to be carried out on the target biologic.

In certain embodiments, the method comprises performing the recommended analytical study.

Display of formulation recommendations

In certain embodiments, the method comprises causing, by the processor, display of a graphical representation of the one or more formulation recommendations.

B. Different GSAs and types of input

Target biologic GSA detail

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values or sets of values, each of which identifies and/or quantifies a particular pattern of structural features associated with (e.g., correlated with) one or more specific properties of a particular biologic molecule, wherein the one or more specific properties comprise one or more members selected from the group consisting of: (A) stability and/or a propensity for chemical degradation; (B) a likelihood and/or propensity to form aggregates; (C) a likelihood and/or propensity for alterations following release (e.g., in storage, or in vivo when administered to patients and exposed to blood-borne enzymes, physiological temperatures, and the like); (D) an in vivo circulatory property (e.g., a circulatory half-life; e.g., a biodistribution); (E) efficacy (e.g., a binding affinity for a specific drug-target; e.g., a specificity for a specific drug-target; e.g., a duration of drug-target engagement); and (F) an immunogenic potential.

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more members selected from the group consisting of: (A) an identification and/or quantification of one or more higher order structures (HOS) motifs (e.g., particular secondary structures; e.g., particular tertiary structures; e.g., particular quaternary structures) of the target biologic (e.g., identification of the presence of particular HOS structures within the target biologic; e.g., a quantification of a relative fraction of the target biologic having a particular HOS); and (B) an identification and/or quantification of a post-translational modification (PTM) of the target biologic (e.g., intra- and inter-chain disulfide bonds; e.g., mismatched disulfides, free cysteine and/or trisulfide; e.g., glycosylation patterns; e.g., deamidation; e.g., oxidation; e.g., chain cleavage; e.g., phosphorylation; e.g., methylation).

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more glycosylation pattern GSAs [e.g., values or sets of values that represent an identification and/or quantification of a particular glycosylation pattern measured for the target biologic].

In certain embodiments, the one or more glycosylation pattern GSAs comprise(s) one or more members selected from the group consisting of: (A) an identification of presence (or absence) of a particular glycan at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the target biologic [e.g., an identification of presence of mannose rich Man5 at a particular site; e.g., an identification of absence of α(1,6) fucose; e.g., an identification of presence of low-abundance Man8; e.g., an identification of presence of α(1,3)]; (B) an identification and/or quantification of a particular type of glycan at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the target biologic (e.g., a quantification of levels of hybrid glycans at one or more particular sites; e.g., a quantification of levels of complex glycans at one or more particular sites); (C) a quantification of relative levels of two or more particular glycans or types of glycans at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) (e.g., a ratio of hybrid to complex glycans at one or more particular sites); (D) an expected level of glycan occupancy; and (E) an identification and/or quantification of glycan content at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked) within the target biologic (e.g., galactose content, afucosylation, sialic acid content, mannose content, etc.).

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more members selected from the group consisting of: (A) a sequence motif; (B) a molecule type of the target biologic; (C) a quantification of one or more specific amino acids within the target biologic [e.g., a total number of one or more specific amino acids (e.g., cysteines) within the target biologic; e.g., a fraction of one or more specific amino acids within the target biologic]; (D) a proportion of amino acids within the target biologic having a particular classification based on one or more specific properties; and (E) an identification and/or quantification of patterns of amino acid motifs associated with propensity towards one or more specific types of amino acid modifications [e.g., positions and/or number of potential sites of oxidation; e.g., positions and/or number of potential sites of deamidation; e.g., positions and/or number of potential sites of post-translational modifications (e.g., N-linked glycosylation; e.g., disulfide bridges; e.g., disulfide knots; e.g., modification of cysteine to formylglycine)].

In certain embodiments, the one or more GSAs of the target biologic comprise(s) a proportion of amino acids within the target biologic having a particular classification based on one or more specific properties, wherein the one or more specific properties comprise(s) one or more members selected from the group consisting of: (i) hydrophobicity [e.g., having at least a given level of hydrophobicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophobicity; e.g., having a level of hydrophobicity within a particular range; e.g., classified as hydrophobic]; (ii) hydrophilicity [e.g., having at least a given level of hydrophilicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophilicity; e.g., having a level of hydrophilicity within a particular range; e.g., classified as hydrophilic]; (iii) charge (e.g., having a charge greater than or equal to a specific charge; e.g., having charge less than or equal to a specific charge; e.g., having a positive charge; e.g., having a negative charge; e.g., neutral); (iv) acidity (e.g., classified as acidic; e.g., classified as basic; e.g., classified as neutral); and (v) aromaticity (e.g., classified as aromatic).

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) the target biologic and (ii) a reference biologic.

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) a first lot of the target biologic and (ii) a second lot of the target biologic.

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values derived from a CQA map of the target biologic.

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values derived from an in vivo comparability profile of the target biologic.

In certain embodiments, the method comprises: receiving, by the processor, a user input comprising data corresponding to one or more measured structural features of the target biologic; determining, by the processor, using the data corresponding to the one or more measured structural features of the target biologic, the one or more GSAs of the target biologic; and generating, by the processor, the determined one or more target biologic GSAs for use as the input query of step (a).

In certain embodiments, the method comprises performing one or more structural characterization studies on a sample comprising the target biologic to generate the data corresponding to the one or more measured structural features of the target biologic.

C. Biologic Records Comprising Formulation Parameters

Formulation parameters

In certain embodiments, at least one of (e.g., one or more of; e.g., each of) the formulation parameter values of the biologic records in the attribute store corresponds to a value of a corresponding formulation parameter selected from the group consisting of: (A) a buffer parameter [e.g., an identification of a particular type of buffer (e.g., Histidine, Acetate, Citrate, Phosphate, Tris, Bicarbonate, and the like) and, optionally, its concentration]; (B) a surfactant parameter [e.g., an identification of a particular surfactant (e.g., Polysorbate 80, Polysorbate 20, Poloxamer, and the like) and, optionally, its concentration]; (C) an excipient parameter [e.g., an identification of a particular excipient (e.g., Sucrose, Arginine, Glycine, Methionine, NaCl, Serum albumin, Zinc, Protamine, an antioxidant, a chelator (e.g., EGTA), Poly(D,L-lactide-co-glycolide), Cyclodextrin, a bacteriostat, and the like), and, optionally, its concentration]; (D) a form/concentration parameter [e.g., an identification of a particular storage form (e.g., liquid, lyophilized, spray dried, sustained release, and the like), and, optionally, a concentration of the associated known biologic]; (E) a storage pH; (F) a storage temperature; (G) a container parameter {e.g., an identification of a container type [e.g., a textual string listing container properties such as material (e.g., including any coatings or surface treatments such as siliconization), format (e.g., vial, syringe, and the like), and size]}; and (H) a closure system parameter {e.g., an identification of a particular type of closure system [e.g., a textual string listing closure system types and properties (e.g., material) (e.g., “natural latex stopper”; e.g., “silicone rubber stopper”; e.g., “syringe needle”; e.g., “autoinjector”; e.g., “amber ampoule”)]}.
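One way to picture a biologic record carrying formulation parameters (A) through (H) and its linked GSAs is as a simple typed record. The sketch below is only an assumed layout: the class name `FormulationRecord`, its field names, and the units are hypothetical and are not taken from the source, and the values in the usage example are placeholders rather than data for any real product.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class FormulationRecord:
    """Hypothetical biologic record holding formulation parameter values."""
    biologic_name: str
    buffer: Optional[str] = None            # (A) buffer type
    buffer_conc_mM: Optional[float] = None  # (A) optional concentration
    surfactant: Optional[str] = None        # (B) surfactant type
    excipients: list = field(default_factory=list)  # (C) excipient names
    storage_form: Optional[str] = None      # (D) e.g. "liquid", "lyophilized"
    storage_ph: Optional[float] = None      # (E)
    storage_temp_c: Optional[float] = None  # (F)
    container: Optional[str] = None         # (G) free-text description
    closure_system: Optional[str] = None    # (H) free-text description
    gsas: dict = field(default_factory=dict)  # GSAs linked to this record

    def known_parameters(self) -> dict:
        """Return only the formulation parameters that have a value."""
        skip = {"biologic_name", "gsas"}
        return {
            key: value
            for key, value in asdict(self).items()
            if key not in skip and value not in (None, [], {})
        }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    record = FormulationRecord(
        biologic_name="example-biologic",
        buffer="Histidine",
        excipients=["Sucrose"],
        storage_ph=6.0,
    )
    print(record.known_parameters())
```

Keeping unset parameters as `None` lets records mined from incomplete published documents coexist with fully characterized ones.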

GSAs of the associated known biologic

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) one or more values or sets of values, each of which identifies and/or quantifies a particular pattern of structural features associated with (e.g., correlated with) one or more specific properties of a particular biologic molecule, wherein the one or more specific properties comprise one or more members selected from the group consisting of: (A) stability and/or a propensity for chemical degradation; (B) a likelihood and/or propensity to form aggregates; (C) a likelihood and/or propensity for alterations following release (e.g., in storage, or in vivo when administered to patients and exposed to blood-borne enzymes, physiological temperatures, and the like); (D) an in vivo circulatory property (e.g., a circulatory half-life; e.g., a biodistribution); (E) efficacy (e.g., a binding affinity for a specific drug-target; e.g., a specificity for a specific drug-target; e.g., a duration of drug-target engagement); and (F) an immunogenic potential.

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) one or more members selected from the group consisting of: (A) an identification and/or quantification of one or more higher order structures (HOS) motifs (e.g., particular secondary structures; e.g., particular tertiary structures; e.g., particular quaternary structures) of the associated known biologic (e.g., identification of the presence of particular HOS structures within the associated known biologic; e.g., a quantification of a relative fraction of the associated known biologic having a particular HOS); and (B) an identification and/or quantification of a post-translational modification (PTM) of the associated known biologic (e.g., intra- and inter-chain disulfide bonds; e.g., mismatched disulfides, free cysteine and/or trisulfide; e.g., glycosylation patterns; e.g., deamidation; e.g., oxidation; e.g., chain cleavage; e.g., phosphorylation; e.g., methylation).

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) one or more glycosylation pattern GSAs [e.g., values or sets of values that represent an identification and/or quantification of a particular glycosylation pattern measured for the associated known biologic].

In certain embodiments, the one or more glycosylation pattern GSAs comprise(s) one or more members selected from the group consisting of: (A) an identification of presence (or absence) of a particular glycan at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the associated known biologic [e.g., an identification of presence of mannose rich Man5 at a particular site; e.g., an identification of absence of α(1,6) fucose; e.g., an identification of presence of low-abundance Man8; e.g., an identification of presence of α(1,3)]; (B) an identification and/or quantification of a particular type of glycan at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the associated known biologic (e.g., a quantification of levels of hybrid glycans at one or more particular sites; e.g., a quantification of levels of complex glycans at one or more particular sites); (C) a quantification of relative levels of two or more particular glycans or types of glycans at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) (e.g., a ratio of hybrid to complex glycans at one or more particular sites); (D) an expected level of glycan occupancy; and (E) an identification and/or quantification of glycan content at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the associated known biologic (e.g., galactose content, afucosylation, sialic acid content, mannose content, etc.).
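To make items (A) and (C) concrete, the sketch below derives a presence flag for one glycan and a hybrid-to-complex ratio per site from measured relative abundances. The input layout (a mapping from site label to a mapping of glycan or glycan-type name to relative level), the function name `glycosylation_gsas`, and the site label used in the example are all assumptions for illustration.

```python
def glycosylation_gsas(site_glycans: dict, glycan_of_interest: str = "Man5") -> dict:
    """Derive simple per-site glycosylation pattern GSAs.

    site_glycans maps a site label to {glycan or glycan type: relative level}.
    This layout is hypothetical; the source does not define a data format.
    """
    gsas = {}
    for site, levels in site_glycans.items():
        hybrid = levels.get("hybrid", 0.0)
        complex_level = levels.get("complex", 0.0)
        gsas[site] = {
            # (A) presence or absence of a particular glycan at the site
            f"has_{glycan_of_interest}": levels.get(glycan_of_interest, 0.0) > 0.0,
            # (C) relative levels of two glycan types; undefined if no complex glycans
            "hybrid_to_complex": hybrid / complex_level if complex_level > 0 else None,
        }
    return gsas


if __name__ == "__main__":
    # Placeholder abundances for a hypothetical site label.
    measured = {"site_A": {"Man5": 0.05, "hybrid": 0.2, "complex": 0.8}}
    print(glycosylation_gsas(measured))
```

Returning `None` when no complex glycans are detected avoids a division by zero and keeps "ratio undefined" distinct from a ratio of zero.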

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) one or more members selected from the group consisting of: (A) a sequence motif; (B) a molecule type of the associated known biologic; (C) a quantification of one or more specific amino acids within the associated known biologic [e.g., a total number of one or more specific amino acids (e.g., cysteines) within the associated known biologic; e.g., a fraction of one or more specific amino acids within the associated known biologic]; (D) a proportion of amino acids within the associated known biologic having a particular classification based on one or more specific properties; and (E) an identification and/or quantification of patterns of amino acid motifs associated with propensity towards one or more specific types of amino acid modifications [e.g., positions and/or number of potential sites of oxidation; e.g., positions and/or number of potential sites of deamidation; e.g., positions and/or number of potential sites of post-translational modifications (e.g., N-linked glycosylation; e.g., disulfide bridges; e.g., disulfide knots; e.g., modification of cysteine to formylglycine)].
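Item (E) can be sketched as a motif scan over the primary sequence. The patterns below are commonly cited heuristics used here purely as assumptions (the N-X-S/T sequon with X not proline for N-linked glycosylation, an Asn-Gly pair for deamidation, and methionine for oxidation); the source does not name specific motifs, and `MOTIFS` and `potential_modification_sites` are hypothetical names.

```python
import re

# Heuristic motifs (assumptions for illustration, not taken from the source).
# Lookaheads are used so that overlapping occurrences are all reported.
MOTIFS = {
    "n_glycosylation": re.compile(r"(?=N[^P][ST])"),  # N-X-S/T, X != P
    "deamidation": re.compile(r"(?=NG)"),             # Asn followed by Gly
    "oxidation": re.compile(r"(?=M)"),                # methionine residues
}


def potential_modification_sites(sequence: str) -> dict:
    """Return 1-based positions of potential modification sites per motif."""
    seq = sequence.upper()
    return {
        name: [match.start() + 1 for match in pattern.finditer(seq)]
        for name, pattern in MOTIFS.items()
    }


if __name__ == "__main__":
    # Made-up nine-residue sequence used only to exercise the scan.
    sites = potential_modification_sites("MANGTNPSM")
    print(sites)
    # Site counts can serve as the "number of potential sites" GSA.
    print({name: len(positions) for name, positions in sites.items()})
```

Both the positions and their counts are returned, matching the "positions and/or number" wording of the embodiment.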

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) a proportion of amino acids within the associated known biologic having a particular classification based on one or more specific properties, wherein the one or more specific properties comprise(s) one or more members selected from the group consisting of: (i) hydrophobicity [e.g., having at least a given level of hydrophobicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophobicity; e.g., having a level of hydrophobicity within a particular range; e.g., classified as hydrophobic]; (ii) hydrophilicity [e.g., having at least a given level of hydrophilicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophilicity; e.g., having a level of hydrophilicity within a particular range; e.g., classified as hydrophilic]; (iii) charge (e.g., having a charge greater than or equal to a specific charge; e.g., having a charge less than or equal to a specific charge; e.g., having a positive charge; e.g., having a negative charge; e.g., neutral); (iv) acidity (e.g., classified as acidic; e.g., classified as basic; e.g., classified as neutral); and (v) aromaticity (e.g., classified as aromatic).

In certain embodiments, the one or more GSAs of the associated known biologic comprise one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) the associated known biologic and (ii) a reference biologic.

In certain embodiments, the one or more GSAs of the associated known biologic comprise one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) a first lot of the associated known biologic and (ii) a second lot of the associated known biologic.

In certain embodiments, the one or more GSAs of the associated known biologic comprise one or more values derived from a CQA map of the associated known biologic.

In certain embodiments, the one or more GSAs of the associated known biologic comprise one or more values derived from an in vivo comparability profile of the associated known biologic.

Generating biologic records

In certain embodiments, at least a portion of the plurality of biologic records in the attribute store are generated from published documents via automated processing using text mining with or without natural language processing. In certain embodiments, at least a portion of the plurality of biologic records are generated from published documents via automated processing in combination with a user interaction.

In certain embodiments, at least a portion of the biologic records are generated from direct measurement of biologic structural features.

D. Machine Learning Module Detail

In certain embodiments, the machine learning module of step (c) implements a supervised machine learning technique.

In certain embodiments, the machine learning module of step (c) implements a reinforcement machine learning technique.

In certain embodiments, the machine learning module of step (c) implements an unsupervised machine learning technique.

In certain embodiments, the machine learning module of step (c) implements the unsupervised machine learning technique as a precursor to a supervised machine learning technique.
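The idea of an unsupervised technique serving as a precursor to a supervised one can be sketched as follows: GSA vectors of known biologics are first clustered without labels, and a very simple supervised estimate (the per-cluster mean of a known parameter value) is then fitted within each cluster. This is a toy stand-in under stated assumptions, not the module described in the source: the k-means clustering, the per-cluster mean, the deterministic initialization, and all function names are choices made for this example.

```python
import math


def _dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: _dist(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Unsupervised step: plain k-means, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[_nearest(p, centroids)].append(p)
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


def fit(gsa_vectors, parameter_values, k=2):
    """Supervised step: mean known parameter value within each cluster."""
    centroids = kmeans(gsa_vectors, k)
    sums, counts = [0.0] * k, [0] * k
    for vector, value in zip(gsa_vectors, parameter_values):
        idx = _nearest(vector, centroids)
        sums[idx] += value
        counts[idx] += 1
    means = [s / c if c else None for s, c in zip(sums, counts)]
    return centroids, means


def recommend(query_gsas, centroids, means):
    """Recommend the parameter value of the cluster nearest the query."""
    return means[_nearest(query_gsas, centroids)]


if __name__ == "__main__":
    # Placeholder GSA vectors and storage-pH values, for illustration only.
    vectors = [[0.1, 0.1], [0.9, 0.9], [0.2, 0.1], [0.8, 1.0]]
    storage_ph = [5.0, 7.0, 5.5, 6.5]
    centroids, means = fit(vectors, storage_ph)
    print(recommend([0.15, 0.2], centroids, means))
```

In practice the per-cluster mean would likely be replaced by a trained classifier or regressor, with the cluster assignment supplied as an additional feature or used to select a cluster-specific model.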

III. Structure-function recommendations

In another aspect, the invention is directed to a method for refining a structure-function profile of a target biologic based on automated analysis of generalizable structural attributes (GSAs) of the target biologic, the method comprising: (a) receiving, by a processor of a computing device, an input query comprising one or more GSAs of the target biologic; (b) accessing, by the processor, an attribute store comprising a plurality of biologic records and GSAs for a set of known biologics, each biologic record representing a set of previously determined structural or functional characteristics of an associated known biologic, wherein: (i) each biologic record comprises a known set of structure-function parameter values, each of which represents a previously determined specific structural or functional characteristic of the associated known biologic, and (ii) each biologic record is linked to one or more GSAs of the associated known biologic; and (c) determining, by the processor, responsive to the input query, one or more structure-function recommendations using a machine learning module that identifies patterns relating the GSAs for the set of known biologics with the biologic records of the attribute store.

A. Output of the tool, and how it is used

Detail on structure-function recommendations

In certain embodiments, each of one or more of the structure-function recommendations comprises an identification of a corresponding relevant structure-function parameter determined by the machine learning module as associated with (e.g., correlated with; e.g., influenced by) a corresponding set of one or more GSAs of the target biologic [e.g., wherein at least one (e.g., one or more; e.g., up to all) of the structure-function recommendations comprises a flag associated with the corresponding relevant structure-function parameter, the flag corresponding to an indication of whether the relevant structure-function parameter should be further analyzed (e.g., wherein the flag is a binary value indicating whether the value of the relevant structure-function parameter does or does not need to be checked; e.g., wherein the flag is one of a set of values)].
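A structure-function recommendation with a binary flag might be represented as in the sketch below. Deriving the flag from an association score compared against a threshold is an assumption made for this example, as are the names `StructureFunctionRecommendation` and `build_recommendations`, the threshold value, and the placeholder scores; the source states only that the flag indicates whether the parameter should be further analyzed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructureFunctionRecommendation:
    """Hypothetical container for one structure-function recommendation."""
    parameter: str                 # relevant structure-function parameter
    associated_gsas: tuple         # GSAs the parameter was associated with
    needs_further_analysis: bool   # binary flag described in the text


def build_recommendations(associations: dict, threshold: float = 0.5) -> list:
    """Turn {parameter: (gsa_names, score)} into flagged recommendations.

    The score is assumed to be an association strength in [-1, 1] produced
    upstream; the threshold rule is an illustrative choice only.
    """
    return [
        StructureFunctionRecommendation(
            parameter=parameter,
            associated_gsas=tuple(gsa_names),
            needs_further_analysis=abs(score) >= threshold,
        )
        for parameter, (gsa_names, score) in associations.items()
    ]


if __name__ == "__main__":
    # Placeholder association scores, not measured values.
    recs = build_recommendations({
        "deamidation rate": (["deamidation_site_count"], 0.8),
        "circulatory half-life": (["fraction_sialylated"], -0.2),
    })
    for rec in recs:
        print(rec)
```

A multi-valued flag (e.g., low, medium, high priority) could be substituted for the boolean where a set of values is preferred.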

In certain embodiments, the corresponding relevant structure-function parameter is a parameter selected from the group consisting of: (A) an intrinsic stability parameter [e.g., a deamidation rate; e.g., an aggregation propensity; e.g., a chain cleavage propensity]; (B) a pharmacokinetic (PK) profile parameter (e.g., an in vivo circulatory half-life; e.g., a biodistribution); (C) a mechanism of action parameter [e.g., a textual string identifying a particular mechanism of action (e.g., antibody-dependent cell-mediated cytotoxicity (ADCC); e.g., complement-dependent cytotoxicity (CDC))]; (D) an efficacy parameter (e.g., a target binding time; e.g., an occupancy time; e.g., a dosing requirement); and (E) an immunogenicity parameter (e.g., a binding anti-drug antibody (ADA); e.g., a PK altering ADA; e.g., a neutralizing ADA; e.g., a hypersensitivity ADA; e.g., a cross-reactive neutralizing ADA; e.g., complement activation; e.g., non-antibody cytokine-releasing syndrome).

In certain embodiments, each of one or more of the structure-function recommendations comprises a structure-function profile result comprising identifications of relevant structure-function parameters and, for each relevant structure-function parameter, a structure-function parameter value prediction corresponding to a predicted value of the structure-function parameter (e.g., the predicted value corresponding to a predicted specific structural or functional characteristic).

In certain embodiments, the method further comprises adjusting a design of the target biologic using the structure-function profile result.

Identification of Correlations

In certain embodiments, each of one or more of the structure-function recommendations comprises an identification of the set of one or more GSAs of the target biologic associated with the corresponding relevant structure-function parameter.

In certain embodiments, the one or more structure-function recommendations comprises an indication of criticality of one or more of the GSAs of the set (e.g., the indication of criticality corresponding to a binary variable having a first value if a structural feature is critical and a second value if it is not).

In certain embodiments, the associated set of one or more GSAs of the target biologic comprises one or more quality attributes, optionally critical quality attributes (CQAs) of the target biologic.

In certain embodiments, the one or more structure-function recommendations comprises a representation of a correlation between the associated set of one or more GSAs of the target biologic and the relevant structure-function parameter.

Recommended Analytical Studies

In certain embodiments, the one or more structure-function recommendations comprises a recommended analytical study to be carried out on the target biologic.

In certain embodiments, the method comprises performing the recommended analytical study.

Setting Clinical Trial Specifications and Patient Monitoring

In certain embodiments, one or more of the structure-function recommendations comprises a set of recommended clinical trial specifications.

In certain embodiments, one or more of the structure-function recommendations comprises a safety profile recommendation [e.g., a warning flag identifying a potential adverse event; e.g., a patient monitoring recommendation (e.g., a recommendation to test for pre-existing IgE antibodies)].

Identifying Relevant Bioprocess Parameters and Recommending Changes to Bioprocess Settings

In certain embodiments, the method comprises determining one or more bioprocess recommendations (e.g., each bioprocess recommendation associated with one or more relevant structure-function parameters; e.g., the one or more bioprocess recommendations corresponding to any of the bioprocess recommendations of claims 2 to 3 and claims 7 to 12).

In certain embodiments, each of one or more of the bioprocess recommendations comprises a bioprocess design result comprising identifications of one or more relevant bioprocess parameters and, for each relevant bioprocess parameter, a bioprocess parameter value recommendation corresponding to a recommended value or change to a value of the relevant bioprocess parameter.

In certain embodiments, the method further comprises adjusting a bioprocess protocol for manufacture of the target biologic using the bioprocess design result.

In certain embodiments, the method further comprises producing the target biologic using the adjusted bioprocess protocol.

Identifying Relevant Formulation Parameters and Recommending Formulation Changes

In certain embodiments, the method comprises determining one or more formulation recommendations (e.g., each formulation recommendation associated with one or more relevant structure-function parameters; e.g., the one or more formulation recommendations corresponding to any of the formulation recommendations of claims 45 to 46 and claims 50 to 55).

In certain embodiments, each of one or more of the formulation recommendations comprises a formulation design result comprising identifications of one or more relevant formulation parameters and, for each relevant formulation parameter, a formulation parameter value recommendation corresponding to a recommended value or change to a value of the relevant formulation parameter.

In certain embodiments, the method further comprises adjusting a formulation of the target biologic using the formulation design result.

In certain embodiments, the method further comprises producing the adjusted formulation of the target biologic.

Display of Structure-Function Recommendations

In certain embodiments, the method comprises causing, by the processor, display of a graphical representation of the one or more structure-function recommendations.

B. Different GSAs and types of input

Target biologic GSA detail

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values or sets of values, each of which identifies and/or quantifies a particular pattern of structural features associated with (e.g., correlated with) one or more specific properties of a particular biologic molecule, wherein the one or more specific properties comprise one or more members selected from the group consisting of: (A) stability and/or a propensity for chemical degradation; (B) a likelihood and/or propensity to form aggregates; (C) a likelihood and/or propensity for alterations following release (e.g., in storage, or in vivo when administered to patients and exposed to blood-borne enzymes, physiological temperatures, and the like); (D) an in vivo circulatory property (e.g., a circulatory half-life; e.g., a biodistribution); (E) efficacy (e.g., a binding affinity for a specific drug-target; e.g., a specificity for a specific drug-target; e.g., a duration of drug-target engagement); and (F) an immunogenic potential.

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more members selected from the group consisting of: (A) an identification and/or quantification of one or more higher order structures (HOS) motifs (e.g., particular secondary structures; e.g., particular tertiary structures; e.g., particular quaternary structures) of the target biologic (e.g., identification of the presence of particular HOS structures within the target biologic; e.g., a quantification of a relative fraction of the target biologic having a particular HOS); and (B) an identification and/or quantification of a post-translational modification (PTM) of the target biologic (e.g., intra- and inter-chain disulfide bonds; e.g., mismatched disulfides, free cysteine and/or trisulfide; e.g., glycosylation patterns; e.g., deamidation; e.g., oxidation; e.g., chain cleavage; e.g., phosphorylation; e.g., methylation).

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more glycosylation pattern GSAs [e.g., values or sets of values that represent an identification and/or quantification of a particular glycosylation pattern measured for the target biologic].

In certain embodiments, the one or more glycosylation pattern GSAs comprise(s) one or more members selected from the group consisting of: (A) an identification of presence (or absence) of a particular glycan at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the target biologic [e.g., an identification of presence of mannose rich Man5 at a particular site; e.g., an identification of absence of α(1,6) fucose; e.g., an identification of presence of low-abundance Man8; e.g., an identification of presence of α(1,3)]; (B) an identification and/or quantification of a particular type of glycan at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the target biologic (e.g., a quantification of levels of hybrid glycans at one or more particular sites; e.g., a quantification of levels of complex glycans at one or more particular sites); (C) a quantification of relative levels of two or more particular glycans or types of glycans at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) (e.g., a ratio of hybrid to complex glycans at one or more particular sites); (D) an expected level of glycan occupancy; and (E) an identification and/or quantification of glycan content at one or more particular sites (e.g., specific amino acid positions within the target biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the target biologic (e.g., galactose content, afucosylation, sialic acid content, mannose content, etc.).

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more members selected from the group consisting of: (A) a sequence motif; (B) a molecule type of the target biologic; (C) a quantification of one or more specific amino acids within the target biologic [e.g., a total number of one or more specific amino acids (e.g., cysteines) within the target biologic; e.g., a fraction of one or more specific amino acids within the target biologic]; (D) a proportion of amino acids within the target biologic having a particular classification based on one or more specific properties; and (E) an identification and/or quantification of patterns of amino acid motifs associated with propensity towards one or more specific types of amino acid modifications [e.g., positions and/or number of potential sites of oxidation; e.g., positions and/or number of potential sites of deamidation; e.g., positions and/or number of potential sites of post-translational modifications (e.g., N-linked glycosylation; e.g., disulfide bridges; e.g., disulfide knots; e.g., modification of cysteine to formylglycine)].

In certain embodiments, the one or more GSAs of the target biologic comprise(s) a proportion of amino acids within the target biologic having a particular classification based on one or more specific properties, wherein the one or more specific properties comprise(s) one or more members selected from the group consisting of: (i) hydrophobicity [e.g., having at least a given level of hydrophobicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophobicity; e.g., having a level of hydrophobicity within a particular range; e.g., classified as hydrophobic]; (ii) hydrophilicity [e.g., having at least a given level of hydrophilicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophilicity; e.g., having a level of hydrophilicity within a particular range; e.g., classified as hydrophilic]; (iii) charge (e.g., having a charge greater than or equal to a specific charge; e.g., having a charge less than or equal to a specific charge; e.g., having a positive charge; e.g., having a negative charge; e.g., neutral); (iv) acidity (e.g., classified as acidic; e.g., classified as basic; e.g., classified as neutral); and (v) aromaticity (e.g., classified as aromatic).

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) the target biologic and (ii) a reference biologic.

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) a first lot of the target biologic and (ii) a second lot of the target biologic.
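Lot-to-lot comparison values of this kind could be computed as simple per-feature differences between two sets of measurements. In the sketch below, the function name `lot_difference_gsas`, the feature names, the tolerance, and the measurement values are all hypothetical; the source does not specify how a difference or similarity is to be quantified.

```python
def lot_difference_gsas(lot_a: dict, lot_b: dict, tolerance: float = 0.05) -> dict:
    """Compare structural measurements shared by two lots.

    Each input maps a structural feature name to a measured value. Only
    features present in both lots are compared. The tolerance used to call
    two values "similar" is an illustrative assumption.
    """
    shared = sorted(set(lot_a) & set(lot_b))
    return {
        name: {
            "difference": lot_b[name] - lot_a[name],
            "similar": abs(lot_b[name] - lot_a[name]) <= tolerance,
        }
        for name in shared
    }


if __name__ == "__main__":
    # Placeholder measurements for two hypothetical lots.
    first_lot = {"afucosylation": 0.10, "man5": 0.04}
    second_lot = {"afucosylation": 0.20, "man5": 0.05, "sialic_acid": 0.30}
    print(lot_difference_gsas(first_lot, second_lot))
```

The same function applies unchanged to a target-versus-reference comparison by passing the reference biologic's measurements as the first argument.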

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values derived from a CQA map of the target biologic.

In certain embodiments, the one or more GSAs of the target biologic comprise(s) one or more values derived from an in vivo comparability profile of the target biologic.

In certain embodiments, the method comprises: receiving, by the processor, a user input comprising data corresponding to one or more measured structural features of the target biologic; determining, by the processor, using the data corresponding to the one or more measured structural features of the target biologic, the one or more GSAs of the target biologic; and generating, by the processor, the determined one or more target biologic GSAs for use as the input query of step (a).

In certain embodiments, the method comprises performing one or more structural characterization studies on a sample comprising the target biologic to generate the data corresponding to the one or more measured structural features of the target biologic.

C. Biologic Records Comprising Structure-Function Parameters

Structure-function parameters

In certain embodiments, at least one of the structure-function parameter values of the biologic records in the attribute store corresponds to a value of a corresponding structure-function parameter selected from the group consisting of: (A) an intrinsic stability parameter [e.g., a deamidation rate; e.g., an aggregation propensity; e.g., a chain cleavage propensity]; (B) a PK profile parameter (e.g., an in vivo circulatory half-life; e.g., a biodistribution); (C) a mechanism of action parameter [e.g., a textual string identifying a particular mechanism of action (e.g., antibody-dependent cell-mediated cytotoxicity (ADCC); e.g., complement-dependent cytotoxicity (CDC))]; (D) an efficacy parameter (e.g., a target binding time; e.g., an occupancy time; e.g., a dosing requirement); and (E) an immunogenicity parameter (e.g., a binding anti-drug antibody (ADA); e.g., a PK altering ADA; e.g., a neutralizing ADA; e.g., a hypersensitivity ADA; e.g., a cross-reactive neutralizing ADA; e.g., complement activation; e.g., non-antibody cytokine-releasing syndrome).

GSAs of the associated known biologic

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) one or more values or sets of values, each of which identifies and/or quantifies a particular pattern of structural features associated with (e.g., correlated with) one or more specific properties of a particular biologic molecule, wherein the one or more specific properties comprise one or more members selected from the group consisting of: (A) stability and/or a propensity for chemical degradation; (B) a likelihood and/or propensity to form aggregates; (C) a likelihood and/or propensity for alterations following release (e.g., in storage, or in vivo when administered to patients and exposed to blood-borne enzymes, physiological temperatures, and the like); (D) an in vivo circulatory property (e.g., a circulatory half-life; e.g., a biodistribution); (E) efficacy (e.g., a binding affinity for a specific drug-target; e.g., a specificity for a specific drug-target; e.g., a duration of drug-target engagement); and (F) an immunogenic potential.

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) one or more members selected from the group consisting of: (A) an identification and/or quantification of one or more higher order structures (HOS) motifs (e.g., particular secondary structures; e.g., particular tertiary structures; e.g., particular quaternary structures) of the associated known biologic (e.g., identification of the presence of particular HOS structures within the associated known biologic; e.g., a quantification of a relative fraction of the associated known biologic having a particular HOS); and (B) an identification and/or quantification of a post-translational modification (PTM) of the associated known biologic (e.g., intra- and inter-chain disulfide bonds; e.g., mismatched disulfides, free cysteine and/or trisulfide; e.g., glycosylation patterns; e.g., deamidation; e.g., oxidation; e.g., chain cleavage; e.g., phosphorylation; e.g., methylation).

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) one or more glycosylation pattern GSAs [e.g., values or sets of values that represent an identification and/or quantification of a particular glycosylation pattern measured for the associated known biologic].

In certain embodiments, the one or more glycosylation pattern GSAs comprise(s) one or more members selected from the group consisting of: (A) an identification of presence (or absence) of a particular glycan at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the associated known biologic [e.g., an identification of presence of mannose rich Man5 at a particular site; e.g., an identification of absence of α(1,6) fucose; e.g., an identification of presence of low-abundance Man8; e.g., an identification of presence of α(1,3)]; (B) an identification and/or quantification of a particular type of glycan at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the associated known biologic (e.g., a quantification of levels of hybrid glycans at one or more particular sites; e.g., a quantification of levels of complex glycans at one or more particular sites); (C) a quantification of relative levels of two or more particular glycans or types of glycans at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) (e.g., a ratio of hybrid to complex glycans at one or more particular sites); (D) an expected level of glycan occupancy; and (E) an identification and/or quantification of glycan content at one or more particular sites (e.g., specific amino acid positions within the associated known biologic) or types of sites (e.g., N-linked sites; e.g., O-linked sites) within the associated known biologic (e.g., galactose content, afucosylation, sialic acid content, mannose content, etc.).

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) one or more members selected from the group consisting of: (A) a sequence motif; (B) a molecule type of the associated known biologic; (C) a quantification of one or more specific amino acids within the associated known biologic [e.g., a total number of one or more specific amino acids (e.g., cysteines) within the associated known biologic; e.g., a fraction of one or more specific amino acids within the associated known biologic]; (D) a proportion of amino acids within the associated known biologic having a particular classification based on one or more specific properties; and (E) an identification and/or quantification of patterns of amino acid motifs associated with propensity towards one or more specific types of amino acid modifications [e.g., positions and/or number of potential sites of oxidation; e.g., positions and/or number of potential sites of deamidation; e.g., positions and/or number of potential sites of post-translational modifications (e.g., N-linked glycosylation; e.g., disulfide bridges; e.g., disulfide knots; e.g., modification of cysteine to formylglycine)].

In certain embodiments, the one or more GSAs of the associated known biologic comprise(s) a proportion of amino acids within the associated known biologic having a particular classification based on one or more specific properties, wherein the one or more specific properties comprise(s) one or more members selected from the group consisting of: (i) hydrophobicity [e.g., having at least a given level of hydrophobicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophobicity; e.g., having a level of hydrophobicity within a particular range; e.g., classified as hydrophobic]; (ii) hydrophilicity [e.g., having at least a given level of hydrophilicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophilicity; e.g., having a level of hydrophilicity within a particular range; e.g., classified as hydrophilic]; (iii) charge (e.g., having a charge greater than or equal to a specific charge; e.g., having a charge less than or equal to a specific charge; e.g., having a positive charge; e.g., having a negative charge; e.g., neutral); (iv) acidity (e.g., classified as acidic; e.g., classified as basic; e.g., classified as neutral); and (v) aromaticity (e.g., classified as aromatic).

In certain embodiments, the one or more GSAs of the associated known biologic comprise one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) the associated known biologic and (ii) a reference biologic.

In certain embodiments, the one or more GSAs of the associated known biologic comprise one or more values that identify and/or quantify a difference or similarity in a structural characteristic between (i) a first lot of the associated known biologic and (ii) a second lot of the associated known biologic.

In certain embodiments, the one or more GSAs of the associated known biologic comprise one or more values derived from a CQA map of the associated known biologic.

In certain embodiments, the one or more GSAs of the associated known biologic comprise one or more values derived from an in vivo comparability profile of the associated known biologic.

Generating biologic records

In certain embodiments, at least a portion of the plurality of biologic records in the attribute store are generated from published documents via automated processing using text mining with or without natural language processing.

In certain embodiments, at least a portion of the plurality of biologic records are generated from published documents via automated processing in combination with a user interaction.

In certain embodiments, at least a portion of the biologic records are generated from direct measurement of biologic structural features.

I). Machine Learning Module Detail

In certain embodiments, the machine learning module of step (c) implements a supervised machine learning technique.

In certain embodiments, the machine learning module of step (c) implements a reinforcement machine learning technique.

In certain embodiments, the machine learning module of step (c) implements an unsupervised machine learning technique.

In certain embodiments, the machine learning module of step (c) implements the unsupervised machine learning technique as a precursor to a supervised machine learning technique.

IV. Combinations of bioprocess/formulation/structure-function recommendations determined based on any combination of bioprocess/formulation/structure-function parameters in the biologic records of the attribute store.

In another aspect, the invention is directed to a method for automated analysis of generalizable structural attributes (GSAs) of a target biologic for refinement of at least one of a bioprocess for producing the target biologic, a formulation of the target biologic, and a structure-function profile of the target biologic, the method comprising: (a) receiving, by a processor of a computing device, an input query comprising one or more GSAs of the target biologic; (b) accessing, by the processor, an attribute store comprising a plurality of biologic records and GSAs for a set of known biologics, wherein: (i) each biologic record comprises one or more of (A), (B), and (C): (A) a set of bioprocess parameter values each of which corresponds to a value of a corresponding bioprocess parameter used in production of the associated known biologic; (B) a set of formulation parameter values each of which corresponds to a value of a corresponding formulation parameter used in a formulation of the associated known biologic; and (C) a set of structure-function parameter values, each of which represents a previously determined specific structural or functional characteristic of the associated known biologic; (ii) each biologic record is linked to one or more GSAs of the associated known biologic; and (c) determining, by the processor, responsive to the input query any one or more of (i), (ii), and (iii): (i) one or more bioprocess recommendations; (ii) one or more formulation recommendations; and (iii) one or more structure-function recommendations, wherein step (c) comprises using a machine learning module that identifies patterns relating the GSAs for the set of known biologics with the biologic records of the attribute store.

In certain embodiments, the method comprises one or more features described above with respect to aspects I, II, and III of the invention (e.g., in certain embodiments, features of the claims depending from any of the other independent claims may apply to this independent claim as well).

V. System Claim

In another aspect, the invention is directed to a system for automated analysis of generalizable structural attributes (GSAs) of a target biologic for refinement of at least one of a bioprocess for producing the target biologic, a formulation of the target biologic, and a structure-function profile of the target biologic, the system comprising: a processor; and a memory having instructions stored thereon, wherein the instructions, when executed by the processor, cause the processor to perform any of the methods described above (e.g., under headers I to IV).

Embodiments described with respect to one aspect of the invention may be applied to another aspect of the invention (e.g., features of embodiments described with respect to one independent claim are contemplated to be applicable to other embodiments of other independent claims).

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, aspects, features, and advantages of the present disclosure will become more apparent and better understood by referring to the following description taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a block flow diagram showing a process for populating an attribute store, according to an illustrative embodiment.

FIG. 2A is a block flow diagram of a process for determining biologic development recommendations, according to an illustrative embodiment.

FIG. 2B is a block diagram showing a process for determining biologic development recommendations, according to an illustrative embodiment.

FIG. 3 is a block diagram showing a hierarchical organization of various structural attributes of biologics, according to an illustrative embodiment.

FIG. 4A is a block diagram showing bioprocess parameters, according to an illustrative embodiment.

FIG. 4B is a block diagram showing formulation parameters, according to an illustrative embodiment.

FIG. 4C is a block diagram showing structure-function parameters, according to an illustrative embodiment.

FIG. 5A is a block diagram showing data structures to represent records for biologic molecules, according to an illustrative embodiment.

FIG. 5B is a block diagram showing data structures to represent biologic records, according to an illustrative embodiment.

FIG. 6A is a connection diagram illustrating an example for representing data associations between entities, according to an illustrative embodiment.

FIG. 6B is a connection diagram illustrating a search and generalization algorithm, according to an illustrative embodiment.

FIG. 6C is a connection diagram showing a typical user workflow, according to an illustrative embodiment.

FIG. 7 is a block diagram illustrating a mapping of select GSAs to correlated bioprocess parameters, according to an illustrative embodiment.

FIG. 8A is a schematic illustrating antibody structure and glycosylation (from Higel et al, Eur J Pharm Biopharm 100:94-100, 2016).

FIG. 8B is a schematic illustrating common glycoforms found on monoclonal antibodies (from Putnam et al., Trends Biotech 28(10):509-16, 2010).

FIG. 8C is another schematic illustrating common glycoforms found on monoclonal antibodies (from Aich et al., J Pharm Sci 105(3):1221-1232, 2016).

FIG. 8D is a schematic illustrating glycan composition and associated functional relationships (from Jones, BioPharm International 30(6):20-25, 2017).

FIG. 8E is a schematic illustrating major types of N-glycan structures typically observed on antibodies or other proteins (from Palmigiano et al., Comp Anal Chem 2017, in press, available online at https://doi.org/10.1016/bs.coac.2017.06.009).

FIG. 8F is a schematic illustrating antibody N-glycan interaction with a receptor (from Subedi & Barb, Structure 23(9):1573-83, 2015).

FIG. 8G is a schematic illustrating a UDP-GlcNAc synthesis pathway (left panel from Swamy et al., Nat Immunol 17:712-720, 2016; right panel from Li et al., PLoS One 7(8):e42769, 2012, doi:10.1371/journal.pone.0042769).

FIG. 8H is a set of graphs showing various structural characterization data sets that can be used as input data to the biologic development determination tool described herein, according to an illustrative embodiment. The graphs are from Fan et al., Biotechnol Bioeng 112(10):2172-84, 2015.

FIG. 9 is a block diagram of an exemplary cloud computing environment, used in certain embodiments.

FIG. 10 is a block diagram of an example computing device and an example mobile computing device, used in certain embodiments.

FIG. 11 is a block diagram showing a layered and modular software architecture that can be used to implement the techniques described in this disclosure, according to an illustrative embodiment.

FIG. 12A is a schematic illustrating pH-dependent deamidation pathways and possible subsequent transformations.

FIG. 12B is a schematic illustrating deamidation mechanisms occurring at higher pH values.

FIG. 12C is a set of graphs showing local structure and deamidation differences in two monoclonal antibodies (mAbs) detected by HDX LC-MS (from Phillips et al., Anal Chem 89(4):2361-8, 2017).

FIG. 12D is a schematic illustrating local structure changes resulting from deamidation in a beta-turn region of a protein (structure maps obtained from RCSB Protein Data Bank PDB ID 1FS3 (Wild Type) and 1DY5 (Deamidated Derivative)).

FIG. 13A is a set of graphs showing results of intravital microscopy measurements of a biodistribution of a biologic (from Arlauckas et al., Science Translational Medicine 9(389):eaal3604, 2017, doi:10.1126/scitranslmed.aal3604).

FIG. 13B is a set of graphs showing results of analysis of glycans digested from the nivolumab anti-PD1 mAb, showing the distribution of glycan structures based on size exclusion (from Arlauckas et al., Science Translational Medicine 9(389):eaal3604, 2017, doi:10.1126/scitranslmed.aal3604).

FIG. 13C is a graph showing data that demonstrates how disrupting Fc binding affects macrophage uptake of anti-PD1 and improves treatment efficacy (from Arlauckas et al., Science Translational Medicine 9(389):eaal3604, 2017, doi:10.1126/scitranslmed.aal3604).

FIG. 14 is a schematic showing relationships between structural features of the target biologic, GSAs of the target biologic, and determined biologic development recommendations, in particular, formulation recommendations, according to an illustrative embodiment.

FIG. 15 is a schematic showing relationships between structural features of a target biologic, GSAs of the target biologic, and associated structure-function parameters that correspond to predicted functional characteristics of the target biologic based on analysis of its GSAs, according to an illustrative embodiment.

The features and advantages of the present disclosure will become more apparent from the detailed description set forth below when taken in conjunction with the drawings, in which like reference characters identify corresponding elements throughout. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements.

DEFINITIONS

In this application, the use of "or" means "and/or" unless stated otherwise. As used in this application, the term "comprise" and variations of the term, such as "comprising" and "comprises," are not intended to exclude other additives, components, integers or steps. As used in this application, the terms "about" and "approximately" are used as equivalents. Any numerals used in this application with or without about/approximately are meant to cover any normal fluctuations appreciated by one of ordinary skill in the relevant art. In certain embodiments, the term "approximately" or "about" refers to a range of values that fall within 25%, 20%, 19%, 18%, 17%, 16%, 15%, 14%, 13%, 12%, 11%, 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, 1%, or less in either direction (greater than or less than) of the stated reference value unless otherwise stated or otherwise evident from the context (except where such number would exceed 100% of a possible value).

Biologic: As used herein, the term “biologic” refers to a composition that is produced by recombinant DNA technologies or peptide synthesis, or purified from natural sources, and that has a desired biological activity. The biologic can be, for example, a protein, peptide, glycoprotein, polysaccharide, a mixture of proteins or peptides, a mixture of glycoproteins, a mixture of polysaccharides, a mixture of one or more of a protein, peptide, glycoprotein or polysaccharide, or a derivatized form of any of the foregoing entities. The molecular weight of biologics can vary widely, from about 1000 Da for small peptides such as peptide hormones to 1000 kDa or more for complex polysaccharides, mucins, and other heavily glycosylated proteins. The biologic subject of the process of this invention can have a molecular weight of 1 kDa to 1000 kDa, more typically 20 kDa to 200 kDa, and often 30 kDa to 150 kDa. By way of example, desmopressin, oxytocin, angiotensin and bradykinin each have a molecular weight of about 1 kDa, calcitonin is 3.5 kDa, insulin is 5.8 kDa, Kineret is 17.3 kDa, erythropoietin is about 30 kDa, Ontak is 58 kDa, Orencia is 92 kDa, and antibodies are approximately 150 kDa (Rituxan 145 kDa, Erbitux 152 kDa). Hyaluronic acids and salts have an average molecular weight often greater than 1000 kDa.

In certain embodiments, a biologic is a drug used for treatment of diseases and/or medical conditions. Examples of biologic drugs include, for example, native or engineered antibodies or antigen binding fragments thereof, and antibody-drug conjugates, which comprise an antibody or antigen binding fragments thereof conjugated directly or indirectly (e.g., via a linker) to a drug of interest, such as a cytotoxic drug or toxin.

In certain embodiments, a biologic is a diagnostic, used to diagnose diseases and/or medical conditions. For example, allergen patch tests utilize biologics (e.g., biologics manufactured from natural substances) that are known to cause contact dermatitis.

Diagnostic biologics may also include medical imaging agents, such as proteins that are labelled with agents that provide a detectable signal that facilitates imaging such as fluorescent markers, dyes, radionuclides, and the like.

Reference biologic: As used herein, the terms “reference biologic” and “reference biologic drug” refer to a biologic that is representative of the biologic drug under development or that has been approved for marketing, and provides a reference standard for the biologic drug with, for example, the appropriate, pre-determined composition, purity and/or biological activity.

Generalizable Structural Attributes, GSAs: As used herein, the term “generalizable structural attributes (GSAs)” refers to sets or patterns of induced structural characteristics from biologic molecules associated via heuristic and/or domain knowledge with bioprocess, formulation, design, or functional parameters. In particular, a GSA of a particular biologic is a value or set of values representing a set or pattern of structural features that (i) are derived from and generalize measurements of the particular biologic and (ii) are associated (e.g., correlated) with one or more specific properties of the particular biologic. These specific properties may include, without limitation: (i) stability and/or propensity for chemical degradation; (ii) a likelihood and/or propensity to form aggregates; (iii) a likelihood and/or propensity for alterations following release (e.g., in storage, or in vivo when administered to patients and exposed to blood-borne enzymes, physiological temperatures, and the like); (iv) in vivo circulatory properties (e.g., circulatory half-life, e.g., biodistribution); (v) efficacy (e.g., a binding affinity for a specific drug-target; e.g., a specificity for a specific drug-target; e.g., a duration of drug-target engagement); and (vi) an immunogenic potential.

Examples of GSAs induced from structural features of a biologic include, without limitation, any of the following: (i) glycosylation patterns at different N-linked or O-linked sites, in different amounts, and/or with different sugars (e.g., they may vary by galactose content, afucosylation, sialic acid content, mannose content, etc.); (ii) deamidation patterns; (iii) oxidation patterns; (iv) methylation patterns; (v) disulfide bonding patterns within a molecule and/or between molecules; (vi) patterns of higher-order structure (e.g., secondary structure, tertiary structure and/or quaternary structure); (vii) patterns of antibody-drug ratios; and (viii) other patterns resulting from other chemical or post-translational modifications of a biologic. For example, a GSA corresponding to the glycosylation profile of a given biologic may be a value representing the relative level of mannose-rich glycans, or a value representing the ratio of hybrid to complex glycans (an H:C ratio). GSAs induced from structural features of a biologic may also include a quantification of (e.g., a number of; e.g., a fraction of) one or more specific amino acids [e.g., arginine (also referred to as Arg, or R); e.g., lysine (also referred to as Lys, or K); e.g., cysteine (also referred to as Cys, or C)] within the biologic. GSAs may also include an identification and/or quantification of patterns of amino acid motifs associated with propensity towards certain types of modification [e.g., a position and/or number of potential (e.g., predicted) or known sites of oxidation; e.g., a position and/or number of potential (e.g., predicted) or known sites of deamidation; e.g., a position and/or number of potential (e.g., predicted) or known sites of various post-translational modifications (e.g., N-linked glycosylation; e.g., disulfide bridges; e.g., disulfide knots; e.g., modification of cysteine to formylglycine)].
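By way of a minimal, non-limiting sketch, motif-based GSAs of the kind listed above could be induced from an amino acid sequence as follows. The function name `motif_site_gsas` and the output keys are hypothetical. The N-linked glycosylation pattern is the commonly used N-X-S/T sequon (X other than proline); treating every Asn-Gly pair as a potential deamidation site and every methionine as a potential oxidation site is a simplifying assumption made only for illustration.

```python
import re


def motif_site_gsas(sequence: str) -> dict:
    """Locate candidate modification sites in a one-letter amino acid sequence.

    Positions are 1-indexed. The site rules are illustrative assumptions.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")

    # N-linked glycosylation sequon: Asn, any residue except Pro, then Ser or Thr.
    # A lookahead is used so that overlapping sequons are all reported.
    n_glyc_sites = [m.start() + 1 for m in re.finditer(r"(?=N[^P][ST])", seq)]

    # Assumed deamidation hotspot: Asn immediately followed by Gly.
    deamidation_sites = [m.start() + 1 for m in re.finditer(r"(?=NG)", seq)]

    # Assumed oxidation candidates: every Met residue.
    oxidation_sites = [i + 1 for i, aa in enumerate(seq) if aa == "M"]

    cysteine_count = seq.count("C")
    return {
        "n_glycosylation_sites": n_glyc_sites,
        "deamidation_sites": deamidation_sites,
        "oxidation_sites": oxidation_sites,
        "cysteine_count": cysteine_count,
        "cysteine_fraction": cysteine_count / len(seq),
    }


if __name__ == "__main__":
    # Made-up sequence used only to exercise the function.
    print(motif_site_gsas("MKNGSACNPTCNWTM"))
```

Each returned entry corresponds to one candidate GSA value or set of values (a site count, a set of positions, or a fraction) that could be linked to biologic records.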

GSAs representing structural features of a given biologic also include proportions of amino acids of various properties (e.g., hydrophobicity, hydrophilicity, charge, acidity, aromaticity, and the like). For example, GSAs representing proportions of amino acids of various properties within a given biologic include a proportion of amino acids having a particular classification based on hydrophobicity [e.g., having at least a given level of hydrophobicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophobicity; e.g., having a level of hydrophobicity within a particular range; e.g., classified as hydrophobic], a proportion of amino acids having a particular classification based on hydrophilicity [e.g., having at least a given level of hydrophilicity (e.g., as measured on a predefined scale); e.g., having less than or equal to a given level of hydrophilicity; e.g., having a level of hydrophilicity within a particular range; e.g., classified as hydrophilic], a proportion of amino acids having a particular classification based on charge (e.g., having a charge greater than or equal to a specific charge; e.g., having a charge less than or equal to a specific charge; e.g., having a positive charge; e.g., having a negative charge; e.g., neutral), a proportion of amino acids having a particular classification based on acidity (e.g., classified as acidic; e.g., classified as basic; e.g., classified as neutral), and a proportion of amino acids having a particular classification based on aromaticity (e.g., classified as aromatic).
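As a minimal illustrative sketch, proportion-type GSAs such as those described above could be computed from a sequence as shown below. The choice of the Kyte-Doolittle hydropathy scale as the "predefined scale", the threshold value, the treatment of only Lys and Arg as positively charged, and the function name `property_proportion_gsas` are all assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values, used here as one example of a predefined scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
POSITIVE = set("KR")    # assumption: His treated as neutral
NEGATIVE = set("DE")
AROMATIC = set("FWY")


def property_proportion_gsas(sequence: str, min_hydropathy: float = 1.5) -> dict:
    """Return fractions of residues falling into simple property classes."""
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    total = len(residues)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")

    def fraction(predicate) -> float:
        return sum(1 for aa in residues if predicate(aa)) / total

    return {
        # "having at least a given level of hydrophobicity" on the chosen scale
        "hydrophobic_fraction": fraction(lambda aa: KYTE_DOOLITTLE[aa] >= min_hydropathy),
        "positive_fraction": fraction(lambda aa: aa in POSITIVE),
        "negative_fraction": fraction(lambda aa: aa in NEGATIVE),
        "neutral_fraction": fraction(lambda aa: aa not in POSITIVE and aa not in NEGATIVE),
        "aromatic_fraction": fraction(lambda aa: aa in AROMATIC),
    }


if __name__ == "__main__":
    # Made-up ten-residue sequence used only to exercise the function.
    print(property_proportion_gsas("AKDEFWILGG"))
```

A different scale, threshold, or range test could be substituted without changing the structure of the calculation.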

In certain embodiments, the GSAs of the target biologic are derived from measurements obtained from one or more samples of the target biologic and, accordingly, generalize structural features of the target biologic itself. In certain embodiments, the GSAs of the target biologic are differential GSAs. Differential GSAs may identify and/or quantify a difference or similarity in a structural characteristic between (i) the target biologic and (ii) a reference biologic, or between different lots or formulations of the target biologic.
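One simple way to picture a differential GSA is as a per-attribute difference between two sets of numeric GSA values. The sketch below assumes GSAs are held as name-to-value mappings; the function name `differential_gsas`, the attribute names, and the numeric values are hypothetical and are not measurements from any biologic.

```python
from typing import Dict


def differential_gsas(target: Dict[str, float], reference: Dict[str, float]) -> Dict[str, float]:
    """Return target-minus-reference differences for GSAs present in both inputs."""
    shared_names = sorted(set(target) & set(reference))
    return {name: target[name] - reference[name] for name in shared_names}


if __name__ == "__main__":
    # Illustrative values only.
    target_lot = {"man5_level": 0.12, "hc_ratio": 0.50, "sialic_acid_content": 0.30}
    reference_lot = {"man5_level": 0.08, "hc_ratio": 0.50}
    print(differential_gsas(target_lot, reference_lot))
```

The same comparison applies whether the second input describes a reference biologic or a different lot or formulation of the target biologic; attributes measured for only one of the two are omitted.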

Critical Quality Attribute, CQA: As used herein, the term “Critical Quality Attribute (CQA)” for a biologic drug refers to a set of physical, chemical, biological or microbiological properties or characteristics that should be within an appropriate limit, range or distribution to ensure the desired product quality for the biologic drug. CQAs are generally associated with the drug substance, excipients, intermediates (in-process materials), and drug product, including all impurities such as host cell proteins.

Link: As used herein, the terms “link” and “linked”, as in a first data structure or data element is linked to a second data structure or data element, refer to a computer representation of an association between two data structures or data elements that is stored electronically (e.g., in computer memory). The association may be a correspondence between the two data structures or data elements. In certain embodiments, the association is a statistical correlation between the two data structures or data elements. An association corresponding to a statistical correlation may be represented by a stored value or set of values that identify a direction of the correlation and/or a strength of the correlation. For example, a value corresponding to a sign of a correlation coefficient may be used to represent a direction of the correlation. For example, a correlation coefficient value may be used to represent both a direction and strength of a correlation. Other manners of representation may be used.

DETAILED DESCRIPTION

It is contemplated that systems, architectures, devices, methods, and processes of the claimed invention encompass variations and adaptations developed using information from the embodiments described herein. Adaptation and/or modification of the systems, architectures, devices, methods, and processes described herein may be performed, as contemplated by this description.

Throughout the description, where articles, devices, systems, and architectures are described as having, including, or comprising specific components, or where processes and methods are described as having, including, or comprising specific steps, it is contemplated that, additionally, there are articles, devices, systems, and architectures of the present invention that consist essentially of, or consist of, the recited components, and that there are processes and methods according to the present invention that consist essentially of, or consist of, the recited processing steps.

It should be understood that the order of steps or order for performing certain actions is immaterial so long as the invention remains operable. Moreover, two or more steps or actions may be conducted simultaneously.

The mention herein of any publication, for example, in the Background section, is not an admission that the publication serves as prior art with respect to any of the claims presented herein. The Background section is presented for purposes of clarity and is not meant as a description of prior art with respect to any claim.

Documents are incorporated herein by reference as noted. Where there is any discrepancy in the meaning of a particular term, the meaning provided in the Definition section above is controlling. Headers are provided for the convenience of the reader - the presence and/or placement of a header is not intended to limit the scope of the subject matter described herein.

A. Biologic Development Recommendations

In certain embodiments, the systems and methods described herein facilitate the development of biologics by automatically analyzing structural features of a target biologic and providing, based on the analysis of the target biologic’s structural features, one or more biologic development recommendations. Biologic development recommendations refer to any one of: (i) a bioprocess recommendation, (ii) a formulation recommendation, and (iii) a structure-function recommendation. As described in the following, bioprocess recommendations, formulation recommendations, and structure-function recommendations are computer representations of specific sets of data (e.g., data structures) that provide recommendations and guidance to a user for refining a bioprocess, a formulation, or structure-function profile of a target biologic.

A. i Bioprocess Recommendations

A bioprocess is a manufacturing process for producing biologics from cell cultures. In certain embodiments, a bioprocess comprises a step corresponding to cell line development, clonal selection and cell banking. A second step in a bioprocess corresponds to upstream processing in a bioreactor. A specific upstream processing step uses a specific combination of various buffers, cell culture nutrients, and other additives, along with particular bioreactor culture and harvesting conditions. A third step in a bioprocess corresponds to downstream processing, wherein the biologic is recovered from the host cell used to produce it and purified. In certain embodiments the downstream processing step comprises a primary recovery step wherein host cells are removed and the biologic product is stabilized and concentrated. The primary recovery step may be performed using centrifugation and/or filtration techniques. Downstream processing may also include an initial purification sub-step to remove major product and process impurities. For example, an initial purification step may be performed using Protein-A affinity chromatography - relevant for initial purification of monoclonal antibodies. A polishing sub-step may be included to remove remaining trace impurities such as host cell proteins (HCPs), media additives, DNA, endotoxins, and the like. The polishing sub-step may comprise any number of chromatographic techniques such as ion exchange chromatography (IEC), hydrophobic interaction chromatography (HIC), size exclusion chromatography (SEC), affinity chromatography, continuous countercurrent tangential chromatography (CCTC), and other chromatography techniques.

Accordingly, a bioprocess for manufacturing a particular biologic comprises a variety of steps that may be accomplished using a variety of techniques, using different components and conditions, such as various different types of host cells, cell culture media additives, bioreactor conditions, and the like. Optimizing the various techniques, components, and conditions used in a particular bioprocess to ensure that important structural features, e.g., influencing functionality and/or stability, of the produced biologic are maintained while maximizing yields is a challenging process.

In certain embodiments, in order to facilitate refinement of a bioprocess, the biologic development determination technology may output one or more bioprocess recommendations. The bioprocess recommendations provide guidance for refining a bioprocess used to produce the target biologic.

In certain embodiments, each of the one or more bioprocess recommendations comprises an identification of a relevant bioprocess parameter. The relevant bioprocess parameters are identified based on automated analysis of GSAs of the target biologic and mappings between GSAs and biologic records of the attribute store, as described in the following. In particular, the relevant bioprocess parameters identified in the bioprocess recommendations correspond to bioprocess parameters that are determined, through automated analysis of the GSAs of the target biologic and the biologic records and GSAs of the attribute store (e.g., by a machine learning module), as associated with one or more GSAs of the target biologic.

FIG. 4A shows examples of bioprocess parameters corresponding to parameters that are set and varied to determine a bioprocess protocol for manufacturing a particular biologic. Any of the bioprocess parameters shown in FIG. 4A may be determined as a relevant bioprocess parameter, and identified via a bioprocess recommendation. For example, particular culture media components, such as glutamine, may be identified as associated with GSAs related to a glycan profile of a target biologic (e.g., presence of a Man5 glycan; e.g., glycan occupancy). Waste products, such as ammonium ion, also may influence GSAs, in particular those associated with a glycan profile of the target biologic. Accordingly, one or more bioprocess recommendations may identify a particular waste product (e.g., via a waste buildup monitoring indicator), or another bioprocess parameter that influences buildup of waste products, such as a perfusion flow rate of a bioreactor.

Bioprocess recommendations may be output as textual strings that identify relevant bioprocess parameters, thereby identifying for a user those bioprocess parameters whose values may be influencing structural features of the target biologic. Bioprocess recommendations may also comprise, along with the identification of the relevant bioprocess parameter, a flag that indicates to the user whether the relevant bioprocess parameter’s value should be further analyzed for its impact on the structure of the target biologic. The flag may be a binary value (e.g., a 0 or 1, a textual label, a Boolean true or false, and the like) that indicates whether or not the relevant bioprocess parameter should be checked, or the flag may be a value selected from a set of values indicating a determined level of impact of the bioprocess parameter (e.g., a value on a particular scale). The value identifying the level of impact may be a function of a determined correlation between the bioprocess parameter and one or more structural features of the target biologic.

In addition to an identification of a relevant bioprocess parameter, a bioprocess recommendation output by the biologic development determination tool may also comprise an identification of a set of GSAs associated with the relevant bioprocess parameter. For example, a bioprocess recommendation may comprise both an identification of glutamine level as a relevant bioprocess parameter and also an identification of the GSAs of the target biologic that it is determined to be associated with, such as presence of Man5 and glycan occupancy at a particular site.

A bioprocess recommendation that identifies GSAs of the target biologic in this manner may also include an indication of their criticality, for example by a binary variable whose value indicates whether or not a particular GSA is critical and influences a functional characteristic of the target biologic. The bioprocess parameter recommendation may, optionally, identify various GSAs of the target biologic as critical quality attributes (CQAs).

A bioprocess recommendation that identifies a set of GSAs of the target biologic that are associated with a relevant bioprocess parameter may also include a representation of a correlation between the set of GSAs and the relevant bioprocess parameter. For example, a bioprocess recommendation may identify a relevant bioprocess parameter (e.g., glutamine), a set of associated GSAs of the target biologic (e.g., Man5 presence and glycan occupancy at a particular site), and a representation of a correlation between the relevant bioprocess parameter and the set of associated GSAs (e.g., an identification that Man5 presence is correlated with low glutamine levels and/or that glutamine level correlates with amount of glycan occupancy). In this manner, the bioprocess recommendations can provide various levels of detailed information to a user that guide them in identifying which bioprocess parameters are relevant, and should be considered in refinement of a bioprocess, as well as their significance in terms of association with particular GSAs of the target biologic.
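As a minimal sketch of how such a recommendation might be held in software, the following Python fragment bundles a relevant parameter, a check flag, the associated GSAs, and a textual correlation note. The class name, field names, the 0-to-1 impact scale, and the 0.5 threshold are all illustrative assumptions and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BioprocessRecommendation:
    """Hypothetical container for one bioprocess recommendation."""
    parameter: str                          # relevant bioprocess parameter
    check: bool                             # binary flag: should the user examine it?
    impact_level: Optional[float] = None    # optional graded flag (assumed 0..1 scale)
    associated_gsas: List[str] = field(default_factory=list)
    correlation_note: str = ""

    def as_text(self) -> str:
        """Render the recommendation as a single textual string."""
        parts = [f"Parameter: {self.parameter}",
                 f"check={'yes' if self.check else 'no'}"]
        if self.associated_gsas:
            parts.append("GSAs: " + ", ".join(self.associated_gsas))
        if self.correlation_note:
            parts.append(self.correlation_note)
        return "; ".join(parts)


def flag_from_correlation(correlation: float, threshold: float = 0.5) -> bool:
    """Assumed rule: flag the parameter when |correlation| meets the threshold."""
    return abs(correlation) >= threshold


# Illustrative values only; -0.7 is not a measured correlation.
rec = BioprocessRecommendation(
    parameter="glutamine",
    check=flag_from_correlation(-0.7),
    impact_level=0.7,
    associated_gsas=["Man5 presence", "glycan occupancy"],
    correlation_note="Man5 presence correlates with low glutamine",
)
print(rec.as_text())
```

The same shape could plausibly serve for formulation and structure-function recommendations, since the text describes them as carrying the same kinds of fields.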

The bioprocess recommendations may also comprise recommendations for specific values or changes to values of particular relevant bioprocess parameters. In particular, in certain embodiments, the bioprocess recommendations comprise a bioprocess design result. The bioprocess design result comprises identifications of one or more relevant bioprocess parameters and, for each relevant bioprocess parameter, a bioprocess parameter value recommendation corresponding to a recommended value or change to a value of the relevant bioprocess parameter. The bioprocess design result may comprise a complete set of bioprocess parameter value recommendations that specify settings of all bioprocess parameters to use in a bioprocess protocol. A user may then directly use such bioprocess design results to adjust a bioprocess protocol for producing the target biologic. For example, a bioprocess design result may identify perfusion rate as a relevant bioprocess parameter, and recommend a specific perfusion rate setting or a change (e.g., increase) to a perfusion rate setting. A user may adjust their bioprocess protocol to use the recommended perfusion rate setting, or to increase the perfusion rate setting, and produce the target biologic using the adjusted bioprocess protocol.

The bioprocess recommendations may also comprise a recommended analytical study to be carried out on the target biologic. In this manner, the bioprocess recommendation provides the user with guidance on how to perform analysis to further clarify potential relationships between bioprocess parameters and structural features of the target biologic.

Accordingly, by providing a user with various bioprocess recommendations, the biologic development tool described herein provides a user with a wealth of information and useful recommendations that can be used to refine a bioprocess for manufacturing a target biologic.

A. ii Formulation Recommendations

Formulation refers to the process of identification, development, and optimization of stabilizing conditions for a biologic molecule, resulting in a specific formulation comprising a specific combination of the biologic and various stabilizing conditions and additional components. Such additional components may include particular buffers, as well as additives that enhance physicochemical stability. These additives, referred to as excipients, include buffers, acids, bases, sugars, polyols, surfactants, detergents, amino acids, chelators, antioxidants, polymers or other compounds generally regarded as safe (GRAS). In certain embodiments, a formulation comprises a particular storage condition, such as a form in which the biologic is stored (e.g., liquid; e.g., lyophilized; e.g., spray dried; e.g., sustained release), a concentration of the biologic, a pH, a temperature, and even particular types of containers, container materials, and closure systems.

Accordingly, developing a particular formulation that appropriately stabilizes a particular biologic requires optimizing a variety of parameters defining the particular formulation. As with development of a bioprocess, determining such optimized combinations of a wide variety of parameters, whose influence on the stability of a particular biologic may depend in a complex fashion on each other as well as on the particular biologic, is a complex process.

In certain embodiments, in order to facilitate refinement of a formulation of a target biologic, the biologic development determination technology may output one or more formulation recommendations. The formulation recommendations provide guidance for refining a formulation of the target biologic. In certain embodiments, each of the one or more formulation recommendations comprises an identification of a relevant formulation parameter. As with relevant bioprocess parameters, the relevant formulation parameters are identified based on automated analysis of GSAs of the target biologic and mappings between GSAs and biologic records of the attribute store, as described in the following. In particular, the relevant formulation parameters identified in the formulation recommendations correspond to formulation parameters that are determined, through automated analysis of the GSAs of the target biologic and the biologic records and GSAs of the attribute store (e.g., by a machine learning module), as associated with one or more GSAs of the target biologic.

FIG. 4B shows examples of formulation parameters corresponding to parameters that define a particular formulation of a biologic. Any of the formulation parameters shown in FIG. 4B may be determined as a relevant formulation parameter, and identified via a formulation recommendation. For example, particular surfactants, buffers, and other excipients may be identified as associated with, or as providing protection against, deamidation of a target biologic. Formulation parameters such as particular surfactants, buffers, and other excipients may thus be identified as relevant formulation parameters.

As with bioprocess recommendations, formulation recommendations may be output as textual strings that identify relevant formulation parameters, thereby identifying for a user those formulation parameters whose values (e.g., type, concentration) may be influencing GSAs of the target biologic. Like bioprocess recommendations, formulation recommendations may also comprise, along with the identification of the relevant formulation parameter, a flag that indicates to the user whether the relevant formulation parameter’s value should be further analyzed for its impact on the structure of the target biologic. In a similar fashion to that described above with respect to bioprocess recommendations, in addition to an identification of a relevant formulation parameter, a formulation recommendation output by the biologic development determination tool may also comprise an identification of a set of GSAs associated with the relevant formulation parameter. Formulation recommendations that identify structural features of the target biologic associated with relevant formulation parameters may also include indications of criticality of the GSAs, identify particular GSAs as CQAs, and include representations of correlations between GSAs and relevant formulation parameters.

The formulation recommendations may also comprise formulation design results that include recommendations for specific values or changes to values of particular relevant formulation parameters. A particular formulation design result may comprise a complete set of formulation parameter value recommendations that specify settings of all formulation parameters to use in a formulation of the target biologic. A user may then directly use such formulation design results to adjust a formulation of the target biologic. For example, a formulation design result may identify surfactant concentration as a relevant formulation parameter, and recommend a specific surfactant concentration or a change (e.g., decrease) to the surfactant concentration. A user may adjust their formulation to use the recommended surfactant concentration, or to reduce the surfactant concentration, and produce the adjusted formulation of the target biologic.

As described above with respect to the bioprocess recommendations, formulation recommendations may also comprise a recommended analytical study to be carried out on the target biologic.

Accordingly, by providing a user with various formulation recommendations, the biologic development tool described herein provides a user with a wealth of information and useful recommendations that can be used to refine a formulation of a target biologic.

A. iii Structure-Function Recommendations

In certain embodiments, the biologic development determination technology facilitates determination and/or refinement of a structure-function profile of a target biologic. A structure-function profile identifies primary, secondary, tertiary and/or higher order structure(s) and modifications of the biologic drug design affecting (i) its intrinsic stability, including resistance to degradation such as deamidation or oxidation, resistance to denaturation, and propensity to form aggregates; (ii) its mechanism of action, such as antibody-dependent cell-mediated cytotoxicity or complement-dependent cytotoxicity; (iii) its pharmacokinetic profile, such as its in vivo circulatory half-life, its biodistribution and/or its clearance mechanism; (iv) its efficacy, such as target binding and occupancy time, and dose requirements; and/or (v) its immunogenicity profile, such as binding or eliciting anti-drug antibody or hypersensitivity anti-drug antibody. Elements of a structure-function profile may comprise structural elements alone, or functional elements alone, or structure-function elements in combination. Accordingly, a structure-function profile of a particular biologic provides valuable information that may allow a user to, for example, refine a design of a biologic, identify potentially problematic functional behavior, and/or confirm desired functional behavior.

Traditionally, generating and/or refining a structure-function profile requires extensive experimentation and measurements not only of structural characteristics of a particular biologic, but also of its functional behavior.

In certain embodiments, the systems and methods described herein provide for automated generation of one or more structure-function recommendations that facilitate refinement of a structure-function profile of a target biologic. In particular, similar to the bioprocess recommendations and formulation recommendations described above, the structure-function recommendations may comprise identifications of relevant structure-function parameters, which correspond to functional characteristics of the target biologic that are associated with its structure. Structure-function recommendations may also identify possible structural alterations of a target biologic that can be effected without influencing function.

As with relevant bioprocess parameters and relevant formulation parameters, relevant structure-function parameters are identified based on automated analysis of GSAs of the target biologic and mappings between GSAs of known biologics and biologic records of the attribute store. Relevant structure-function parameters are determined as associated with one or more GSAs of the target biologic.

FIG. 4C shows examples of structure-function parameters that may be identified as relevant structure-function parameters. As shown in FIG. 4C, structure-function parameters may identify functional characteristics of a target biologic such as an intrinsic stability, a pharmacokinetic (PK) profile, a mechanism of action, an efficacy, and immunogenicity. An identification of a relevant structure-function parameter may be indicated in isolation, e.g., as a textual string, or in conjunction with a flag that identifies whether it should be checked. For example, based on analysis of GSAs of a target biologic, the immunogenicity of the target biologic could be determined to be a relevant structure-function parameter and output along with a warning flag indicating that the user should check immunogenicity. This is relevant, for example, if a target biologic’s structure includes features that, based on analysis of data from known biologics, are predicted as potentially causing immunogenic responses, such as anaphylactic shock, in patients administered the target biologic.

In addition to identification of relevant structure-function parameters, a structure-function recommendation output by the biologic development determination tool may also comprise an identification of a set of GSAs of the target biologic that are associated with the relevant structure-function parameter. Structure-function parameter recommendations that identify target biologic GSAs that are associated with relevant structure-function parameters may also include indications of criticality of the GSAs, identify particular GSAs as CQAs, and include representations of correlations between target biologic GSAs and relevant structure-function parameters.

A user may thus leverage the structure-function parameter recommendations to refine a structure-function profile of a target biologic, for example to improve stability or efficacy. In particular, in certain embodiments, the structure-function recommendations may comprise a structure-function profile result that represents a predicted structure-function profile of the target biologic based on automated analysis of its GSAs. The structure-function profile result may comprise identifications of relevant structure-function parameters and, for each relevant structure-function parameter, a predicted value of that structure-function parameter that represents a predicted functional characteristic of the target biologic.

As described above with respect to the bioprocess recommendations and formulation recommendations, structure-function recommendations may also comprise a recommended analytical study to be carried out on the target biologic.

As certain structure-function parameters relate to the functional behavior of a target biologic when administered to a patient, structure-function recommendations may also comprise recommended settings for clinical trial specifications and/or safety profile recommendations such as recommended tests (e.g., testing for particular IgE levels) to perform on patients before administering the target biologic.

In certain embodiments, bioprocess recommendations and/or formulation recommendations may be determined in combination with structure-function recommendations. In particular, bioprocess recommendations and/or formulation recommendations determined in combination with structure-function recommendations may correspond to (e.g., include identification of) relevant bioprocess parameters and/or relevant formulation parameters that are associated with the same GSAs of the target biologic with which a determined relevant structure-function parameter of a structure-function recommendation is associated. For example, a particular relevant structure-function parameter may be determined to be associated with a particular target biologic GSA or set of target biologic GSAs. The biologic development determination tool may then determine relevant bioprocess parameters that are also associated with the same particular target biologic GSA, and use these determined relevant bioprocess parameters as the basis for one or more bioprocess recommendations. A user is then, in this manner, provided with guidance on how a bioprocess for producing the target biologic could be adjusted to refine the structure-function profile of the target biologic. Formulation recommendations may be determined in combination with structure-function recommendations in an analogous fashion, thereby providing a user with guidance on how a formulation of the target biologic could be adjusted to refine the structure-function profile of the target biologic.
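The linking step described above amounts to finding parameters that share GSAs. A minimal Python sketch follows, assuming that associations are available as simple mappings from parameter names to sets of GSA names. The mappings, the function name, and the example entries are hypothetical.

```python
from typing import Dict, Set

# Hypothetical association maps: parameter name -> GSAs it is associated with.
structure_function_assoc: Dict[str, Set[str]] = {
    "immunogenicity": {"Man5 presence"},
    "intrinsic stability": {"deamidation site"},
}
bioprocess_assoc: Dict[str, Set[str]] = {
    "glutamine": {"Man5 presence", "glycan occupancy"},
    "perfusion flow rate": {"glycan occupancy"},
}


def linked_parameters(sf_parameter: str,
                      sf_assoc: Dict[str, Set[str]],
                      other_assoc: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return parameters that share at least one GSA with the given
    structure-function parameter, together with the shared GSAs."""
    target_gsas = sf_assoc.get(sf_parameter, set())
    return {
        name: gsas & target_gsas
        for name, gsas in other_assoc.items()
        if gsas & target_gsas
    }


shared = linked_parameters("immunogenicity",
                           structure_function_assoc, bioprocess_assoc)
print(shared)
```

Passing a formulation association map as the third argument would give the analogous formulation case.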

B. Determination of Generalizable Structural Attributes

In certain embodiments, the biologic development determination technology described herein determines a set of GSAs for the target biologic. The set of target biologic GSAs may be determined from a received user input comprising data corresponding to one or more measured structural features of the target biologic. The data corresponding to measured structural features of the target biologic may include a variety of structural characterizations of the target biologic, obtained via a variety of structural characterization studies. FIG. 3 shows an organization of several relevant types of structural characterizations of a target biologic that may be measured and from which GSAs may be induced (e.g., derived). Approaches for performing structural characterizations of biologics and obtaining measurements of structural features such as those shown in FIG. 4 are described in detail in U.S. Provisional Patent Application Number 62/506,443, filed May 15, 2017, PCT Application Number PCT/US2014/059150, filed October 3, 2014, and PCT Application Number PCT/US2016/053434, filed September 23, 2016, the contents of each of which are hereby incorporated by reference in their entirety.

In certain embodiments, the GSAs of the target biologic are derived from measurements obtained from one or more samples of the target biologic and, accordingly, generalize structural features of the target biologic itself. In certain embodiments, the GSAs of the target biologic are differential GSAs. Differential GSAs may identify and/or quantify a difference or similarity in a structural characteristic between (i) the target biologic and (ii) a reference biologic, or between different lots or formulations of the target biologic.
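One simple way to picture a differential GSA is as a per-attribute difference between a target and a reference. The Python sketch below assumes GSAs can be reduced to named numeric values, which the disclosure does not require. The attribute names and numbers are invented for illustration.

```python
from typing import Dict


def differential_gsas(target: Dict[str, float],
                      reference: Dict[str, float]) -> Dict[str, float]:
    """Difference (target minus reference) for attributes measured in both."""
    common = target.keys() & reference.keys()
    return {name: target[name] - reference[name] for name in common}


# Invented example values (percentages), not measured data.
target_lot = {"Man5 abundance (%)": 4.0, "glycan occupancy (%)": 92.0,
              "deamidation (%)": 3.0}
reference_lot = {"Man5 abundance (%)": 1.5, "glycan occupancy (%)": 97.0}

diff = differential_gsas(target_lot, reference_lot)
print(diff)
```

Attributes measured for only one of the two samples are left out of the result rather than treated as zero.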

Accordingly, in certain embodiments, the set of target biologic GSAs includes GSAs that characterize generalized structural patterns that are associated via heuristic and/or domain knowledge with bioprocess, formulation, and/or structure-function parameters, such as those described above in Section A. Accordingly, for example, for a given target biologic, the tool utilizes a determined set of target biologic GSAs to identify relevant bioprocess parameters that may serve as a basis for generating bioprocess recommendations, as described above. In particular, as described in the following, a set of target biologic GSAs is compared with GSAs associated with biologic records and stored within the attribute store. Similarly, the tool may utilize a determined set of target biologic GSAs to identify relevant formulation parameters and/or structure-function parameters based on a comparison between the target biologic GSAs and GSAs of known biologics associated with biologic records of the attribute store.

In certain embodiments, GSAs of a target biologic are determined via a preprocessing step, which may be implemented by a preprocessing module, as shown in FIG. 1 and FIG. 2B. The preprocessing module may be the same preprocessing module that is used to determine GSAs of known biologics, for inclusion in the attribute store as described below.

C. User Input

A user may input data corresponding to one or more measured structural features of the target biologic, used to determine target biologic GSAs, in a variety of fashions. For example, a user may input data corresponding to one or more measured structural features of the target biologic as a text file or other file format.

In certain embodiments, the tool receives, as input, additional information about the target biologic, such as an identifier of a molecule type. Molecule types that may be identified include, but are not limited to, a recombinant protein, a fusion protein, a monoclonal antibody, and an antibody-drug conjugate. A user may input an identifier of a molecule type through a variety of user interactions. For example, molecule type may be identified via a textual label that refers to one or more entries in a molecule type dictionary stored in memory. For example, the label “Fc-fusion” may be used to identify an Fc-fusion protein molecule type. A user may thus provide text input identifying the molecule type at a command line prompt. User input of a molecule type may also be provided via a GUI. For example, a user may select one or more molecule types from a drop down list, or other types of graphical control elements (e.g., radio boxes, check boxes, and the like).
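A molecule type dictionary of the kind mentioned above could be sketched in Python as a plain mapping from textual labels to molecule types. Only the “Fc-fusion” label appears in the text. The other labels, the case-insensitive matching, and the function name are assumptions made for illustration.

```python
from typing import Dict

# Hypothetical molecule type dictionary: textual label -> molecule type.
# Only the "Fc-fusion" label is taken from the text; the rest are assumed.
MOLECULE_TYPES: Dict[str, str] = {
    "Fc-fusion": "Fc-fusion protein",
    "mAb": "monoclonal antibody",
    "ADC": "antibody-drug conjugate",
    "recombinant": "recombinant protein",
}


def resolve_molecule_type(label: str) -> str:
    """Look up a user-supplied label, ignoring case and surrounding spaces."""
    key = label.strip().lower()
    for known_label, molecule_type in MOLECULE_TYPES.items():
        if known_label.lower() == key:
            return molecule_type
    raise KeyError(f"Unknown molecule type label: {label!r}")


print(resolve_molecule_type("Fc-fusion"))
```

The same lookup could back either a command line prompt or the entries of a drop down list in a GUI.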

In certain embodiments, the identification of the molecule type of the target biologic may be used as input to the attribute preprocessing module, as shown in FIG. 2B, and used for determination of the one or more target biologic GSAs. In certain embodiments, the identification of the molecule type of the target biologic may be used as a GSA itself, in order to identify, e.g., relevant bioprocess parameters, relevant formulation parameters, and/or relevant structure-function parameters.

D. Attribute Store

D. i Biologic Records

In certain embodiments, the attribute store is a database that stores a plurality of biologic records. Biologic records are data structures used to store data corresponding to parameters of (i) bioprocesses that are used to produce known biologics, (ii) formulations of known biologics, and (iii) structure-function profiles of known biologics. Accordingly, each biologic record comprises various combinations of bioprocess, formulation, and structure-function parameter values and is associated with a particular known biologic. As used herein, the term “known biologic” refers to a particular biologic that has previously been characterized via a structural characterization study and for which GSAs have been determined and stored in the attribute store, as described below. The known biologics can include biologic molecules that are entirely different from the target biologic of interest, or may also be the same molecule as the target biologic, but produced by a different bioprocess, or included in a different formulation than that of the target biologic of interest or used for treatment of a different indication.

In certain embodiments, a biologic record includes data representing a set of bioprocess parameter values, each of which corresponds to a value of a corresponding bioprocess parameter used in the production of the associated known biologic. The bioprocess parameters to which the bioprocess parameter values correspond may be the same bioprocess parameters as described above in Section A with regard to bioprocess recommendations and FIG. 4A.

In certain embodiments, a biologic record includes data representing a set of formulation parameter values, each of which corresponds to a value of a corresponding formulation parameter used in the formulation of the associated known biologic. The formulation parameters to which the formulation parameter values correspond may be the same formulation parameters as described above in Section A with regard to formulation recommendations and FIG. 4B.

In certain embodiments, a biologic record includes data representing a set of structure-function parameter values, each of which corresponds to a particular structure-function parameter that represents a previously determined specific structural or functional characteristic of the associated known biologic. The structure-function parameters to which the structure-function parameter values correspond may be the same structure-function parameters as described above in Section A with regard to structure-function recommendations and FIG. 4C.

For example, a biologic record may be represented by a data structure such as the one shown below. The biologic record, “BR1”, includes a series of field/value pairs that identify values of particular bioprocess parameters, formulation parameters, and structure-function parameters.

BR1 = {
    "host cell": "CHO";
    "purification": {"initial": "Protein-A";
                     "polishing": {"SEC"; "HIC"}};
    "Waste product ammonium": "high";
    "Amino acid feed glutamine": "low";
    "Glucose concentration g/L": 0.5;
    "Bioreactor": {"type": "perfusion"; "flow rate_WD": 0.85};
    "Formulation pH": 5.5;
    "Buffers": {{"sodium citrate": 10mM}; {"histidine": 10mM}; {"KCl": 120mM}};
    "Surfactants": {"polysorbate 20": 0.1%};
    "DS Container": {"borosilicate vial": {3 mL; 20°C}};
    "DP Container": {"siliconized borosilicate syringe": {3 mL; 4°C}};
    "Deamidation half-life": 78.4hr;
    "In vivo half-life": 21d;
    "Target occupancy time": 2hr;
    "MoA": "ADCC";
    "Binding ADA": none
}

As shown in the “BR1” biologic record, a particular biologic record stored in the attribute store need not be a complete record. That is, the biologic record need not include bioprocess parameter values for every bioprocess parameter used in a particular bioprocess, and similarly, need not include formulation parameter values for every formulation parameter used in a particular formulation, or structure-function parameter values for every structure-function parameter of a particular known biologic’s structure-function profile. In certain embodiments, the biologic record may be a partial record, for example comprising parameter values for a subset of bioprocess parameters, formulation parameters, and/or structure-function parameters for which data exists, or could be extracted.
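To illustrate how a partial record might be handled programmatically, the following Python sketch transcribes a subset of the BR1 fields into a dictionary and reports which expected parameters are absent. The dictionary layout, the choice to keep unit-bearing values as strings, and the helper name are assumptions. The disclosure does not specify a storage format.

```python
from typing import Any, Dict, List

# A subset of the BR1 fields transcribed into a Python dict (assumed layout).
# Values with units are kept as strings; "none" is mapped to None.
BR1_SUBSET: Dict[str, Any] = {
    "host cell": "CHO",
    "purification": {"initial": "Protein-A", "polishing": ["SEC", "HIC"]},
    "Waste product ammonium": "high",
    "Amino acid feed glutamine": "low",
    "Formulation pH": 5.5,
    "Surfactants": {"polysorbate 20": "0.1%"},
    "In vivo half-life": "21d",
    "MoA": "ADCC",
    "Binding ADA": None,
}


def missing_parameters(record: Dict[str, Any], expected: List[str]) -> List[str]:
    """List expected parameters that the (possibly partial) record lacks.

    A key that is present with value None counts as present, so a recorded
    "none" result is not confused with missing data.
    """
    return [name for name in expected if name not in record]


gaps = missing_parameters(
    BR1_SUBSET, ["host cell", "Glucose concentration g/L", "Binding ADA"]
)
print(gaps)
```

Distinguishing a recorded "none" from an absent field matters here because the text treats missing parameters as normal rather than as errors.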

FIG. 5A illustrates an embodiment of data structures to represent records for biologic molecules. As shown in the figure, representation of biologic molecules can be organized into hierarchical classes, with important inheritable attributes shared among members of the same class and its derivative subclasses. Derivative subclasses can, in turn, introduce new attributes that are specific to that subclass only, and which are not shared by other subclasses (sibling nodes) or the general class (parent node). The hierarchical class structure provides a basis for guiding search and generalization across multiple correlations. The attributes can include, among others: sets of physicochemical properties, special amino acid motifs, protein domains including domains identified by function (such as target-binding domain, activity domain, substrate or co-factor binding domain, pharmacokinetic-influencing domain), expression systems, aspects of higher-order structure, as well as accession numbers to external databanks (such as UniprotKB). The provision of such a representation scheme allows for the gradual extension and enrichment of the data store over time, as new data becomes available.
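The class hierarchy described for FIG. 5A maps naturally onto inheritance. The Python sketch below is only one possible reading. The class names, constructor signatures, and the `isotype` attribute are assumptions, while expression system, protein domains, and accession numbers are attribute types named in the text.

```python
from typing import List, Optional


class BiologicMolecule:
    """Parent class holding attributes shared by all biologic molecules."""

    def __init__(self, name: str, expression_system: str,
                 accession: Optional[str] = None) -> None:
        self.name = name
        self.expression_system = expression_system
        self.accession = accession  # accession number in an external databank


class MonoclonalAntibody(BiologicMolecule):
    """Subclass adding an attribute specific to antibodies (assumed: isotype)."""

    def __init__(self, name: str, expression_system: str, isotype: str,
                 accession: Optional[str] = None) -> None:
        super().__init__(name, expression_system, accession)
        self.isotype = isotype


class FusionProtein(BiologicMolecule):
    """Sibling subclass with its own attribute: a list of protein domains."""

    def __init__(self, name: str, expression_system: str, domains: List[str],
                 accession: Optional[str] = None) -> None:
        super().__init__(name, expression_system, accession)
        self.domains = list(domains)


mab = MonoclonalAntibody("example-mAb", "CHO", isotype="IgG1")
fusion = FusionProtein("example-fusion", "CHO",
                       domains=["target-binding domain", "Fc domain"])
print(mab.expression_system, fusion.domains)
```

Both objects inherit `expression_system` from the parent class, while `isotype` and `domains` stay local to their own subclasses, mirroring the sibling-node behavior described above.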

FIG. 5B illustrates an embodiment of data structures to represent other biologic records more generally. As shown in the figure, similarly to the representation of biologic molecules, other records, such as GSAs and development and process impacts, can be organized into hierarchical classes. An important data structure is the class of correlations, as it brings together all entities identified as correlates, which could range over all the entities in the data store. In an embodiment of the present invention, the correlation data structure includes a link to a specificity data structure, which delineates the subspace of the data store to which a given correlation is applicable (for example, the entire class of monoclonal antibodies, or the Fc domain of fusion proteins, or the subset of biologic molecules expressed in CHO cells, or the subset of clinical impacts defined by anti-drug antibody reactions). As with the representation of biologic molecules, such a flexible and extensible representation scheme for general biologic records enables the gradual enrichment of the data store with new data and insights.
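A correlation that carries a link to its specificity could be sketched as follows in Python. The representation of a specificity as a tuple of required field/value pairs is an assumption chosen for brevity, as are all class and field names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Specificity:
    """Subspace of the data store in which a correlation applies,
    expressed (by assumption) as required field/value pairs."""
    constraints: Tuple[Tuple[str, str], ...]

    def applies_to(self, record: Dict[str, str]) -> bool:
        """True when the record satisfies every constraint."""
        return all(record.get(key) == value for key, value in self.constraints)


@dataclass(frozen=True)
class Correlation:
    """Entities identified as correlates, linked to a specificity."""
    correlates: Tuple[str, ...]
    specificity: Specificity


mab_only = Specificity(constraints=(("molecule_class", "monoclonal antibody"),))
corr = Correlation(correlates=("glutamine", "Man5 presence"),
                   specificity=mab_only)

record_a = {"molecule_class": "monoclonal antibody", "host cell": "CHO"}
record_b = {"molecule_class": "fusion protein", "host cell": "CHO"}
applicable = [corr.specificity.applies_to(r) for r in (record_a, record_b)]
print(applicable)
```

A specificity restricted to CHO-expressed molecules would simply use a different constraint pair, such as `("host cell", "CHO")`.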

Accordingly, the attribute store may use biologic records to store information about any one, or combinations of the following: (i) bioprocesses used to produce known biologics, (ii) formulations of known biologics, and (iii) structure-function profiles of known biologics.

D. ii Linking Biologic Records with GSAs of Associated Known Biologics

In certain embodiments, in order to allow patterns in various bioprocess parameters, formulation parameters, and structure-function parameters to be identified and used to determine various bioprocess recommendations, formulation recommendations, and structure-function recommendations, each biologic record is associated (e.g., linked, as in a stored association in computer code/memory) with one or more (e.g., a plurality of) GSAs of the known biologic with which it is associated.

An example process 100 for building an attribute store 120, including the determination of GSAs, is shown in FIG. 1. As shown in the figure, a preprocessing step 110 (“Attribute pre-processing”) is used to determine, for a given known biologic, a set of GSAs 122. As described above with respect to the target biologic GSAs, the set of GSAs of the known biologic may be determined from structural characterization data for the known biologic.

In certain embodiments, for a given biologic record, the one or more GSAs associated with the biologic record are determined and stored when the biologic record is created. Once the GSAs associated with a particular biologic record are determined, they may be stored within the biologic record, or elsewhere, and linked with the biologic record.

In certain embodiments, once the GSAs associated with a particular biologic record are determined, it is no longer necessary for any record of an identification of a specific known biologic associated with the biologic record to be maintained. Accordingly, in certain embodiments, a given biologic record in the attribute store is linked to one or more GSAs of an associated known biologic, but other identifying information of the associated known biologic (e.g., a nominal primary structure of the associated known biologic; e.g., any measured data from a study performed on the associated known biologic) is not stored. In this manner, in certain embodiments, linking a given biologic record with GSAs of an associated known biologic provides for safeguards relevant to data security and confidentiality considerations.
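The idea of keeping GSAs while discarding identifying information can be sketched in Python as a filtering step applied before a record is stored. The list of identifying field names, the `gsas` key, and the example values are hypothetical.

```python
from typing import Any, Dict, Iterable

# Assumed names of identifying fields that are not persisted.
IDENTIFYING_FIELDS = ("name", "primary_structure", "raw_measurements")


def to_stored_record(raw: Dict[str, Any], gsas: Iterable[str]) -> Dict[str, Any]:
    """Keep parameter values, attach the GSAs, and drop identifying fields."""
    stored = {key: value for key, value in raw.items()
              if key not in IDENTIFYING_FIELDS}
    stored["gsas"] = sorted(gsas)  # sorted for a stable, comparable order
    return stored


# Placeholder input; the name and sequence fields hold dummy strings.
raw_record = {
    "name": "hypothetical-mAb-001",
    "primary_structure": "<sequence omitted>",
    "host cell": "CHO",
    "Formulation pH": 5.5,
}
stored = to_stored_record(raw_record, {"Man5 presence", "glycan occupancy"})
print(stored)
```

The input dictionary is left unmodified, so the caller decides separately whether the raw data is retained anywhere.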

In certain embodiments, a biologic record stores, or is linked to, structural characterization data for the associated known biologic. In this manner, one or more GSAs for the biologic record can be determined as needed. Accordingly, in certain embodiments, if new or different sets of GSAs are relevant for a given application (e.g., addressing a given user input query), they can be determined using the stored or linked structural characterization data for the associated known biologic.

FIG. 6A is a connection diagram illustrating an example for representing data associations between entities such as biologic molecules, GSAs, development process attributes such as bioproduction and formulation, and product attributes such as clinical/handling (also referred to as “Impacts”). As shown in FIG. 6A, the biologic development determination tool described herein represents such entities across different planes or dimensions. Accordingly, entities that are quite different along one dimension (for example, biologic molecules from different classes) may share some attributes along a different dimension (for example, a bioproduction process parameter or a glycosylation structural attribute), which enables the formation of a string of associations. This representation supports the definition of heuristic measures of similarity between data objects contained within any of these dimensions, and the utilization of such similarity scores to guide search and generalization algorithms. For example, biologic molecules from different classes that happen to share GSAs with high degrees of similarity may consequently inherit a corresponding range of impacts (as illustrated in edges “1” of the figure). Likewise, biologic molecules belonging to the same subclass and sharing a high degree of similarity with one another may inherit part of each other’s attributes, thus filling gaps in characterization (as illustrated in edge “2” of the figure). Impacts or GSAs that are shared across different biologic subclasses may be generalized to their parent class (as illustrated in edge “3” of the figure).

D. iii Populating the Attribute Store

The biologic records in the attribute store may be obtained from a variety of sources. For example, biologic records may be created from publicly available sources (e.g., published literature using article databases such as PubMed (https://www.ncbi.nlm.nih.gov/pubmed/); e.g., public data repositories such as the Proteomics Identifications [PRIDE] Archive database (https://www.ebi.ac.uk/pride/archive/) or the UniProtKB/Swiss-Prot database of annotated functional information on proteins (http://www.uniprot.org/) or drug effect databases (http://open.fda.gov/drug/event/ or http://sideeffects.)) or from in-house studies that characterize biologics produced by different bioprocesses.

In certain embodiments, a given study includes data on any combination of (i) bioprocess parameters (labeled 124 in FIG. 1), (ii) formulation parameters (labeled 126 in FIG. 1), and (iii) structure-function parameters (labeled 128 in FIG. 1) associated with various structural features of a known biologic. Accordingly, the combination of (i) bioprocess parameters, (ii) formulation parameters, and (iii) structure-function parameters included in the study may be extracted and stored as a biologic record.

In certain embodiments, a given study of a particular biologic may include one or more bioprocesses, each comprising multiple bioprocess parameters. Accordingly, a given study can be used to generate one or more biologic records, each corresponding to a respective bioprocess of the given study. Similarly, studies comprising multiple formulations, or multiple structure-function profiles of a particular biologic can be used to create multiple corresponding biologic records.
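The one-record-per-bioprocess relationship described above can be sketched as follows. This is only an illustrative sketch: the class names `Study` and `BiologicRecord`, the function `records_from_study`, the field names, and all numeric values are assumptions made for the example and do not appear in the specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BiologicRecord:
    # One record per bioprocess described in a study (hypothetical layout).
    biologic_id: str
    bioprocess_params: Dict[str, float]
    # GSAs are linked later, once structural characterization data is processed.
    gsas: Dict[str, float] = field(default_factory=dict)


@dataclass
class Study:
    biologic_id: str
    bioprocesses: List[Dict[str, float]]


def records_from_study(study: Study) -> List[BiologicRecord]:
    """Create one biologic record for each bioprocess in the study."""
    return [
        BiologicRecord(study.biologic_id, dict(params))
        for params in study.bioprocesses
    ]


# Invented study with two bioprocesses for the same biologic.
study = Study(
    biologic_id="mAb-A",
    bioprocesses=[
        {"glutamine_mM": 2.0, "temperature_C": 37.0},
        {"glutamine_mM": 6.0, "temperature_C": 34.0},
    ],
)
records = records_from_study(study)
```

The same pattern would apply to studies with multiple formulations or structure-function profiles, with the parameter dictionary swapped accordingly.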

Data stored in biologic records can be obtained from a source that describes a study in a variety of ways. In certain embodiments, the source is a published document, and the biologic record is created from the source manually, by an expert or a technician who reads and interprets the study and enters the data to be stored in the biologic record. In certain embodiments, biologic records may be created from published documents via automated processing using text mining with or without additional natural language processing. In certain embodiments, a hybrid combination of interaction with a user and automated processing is used to create biologic records from published documents. In certain embodiments, biologic records generated from in-house studies are created in an automated fashion via dedicated software as part of a laboratory information management system.

In certain embodiments, in order to determine GSAs of a particular known biologic associated with a particular biologic record, structural characterization data for the particular known biologic is obtained from the source used to create the biologic record. For example, if the source is a published document, structural characterization data for the particular known biologic may be extracted as described above with respect to the manner in which biologic records may be created from a published document. For biologic records created from in-house studies, structural characterization data may have been obtained as part of the in-house study, and can be obtained in an automated fashion, e.g., as part of a laboratory management system. Once structural characterization data for the known biologic is obtained, a set of GSAs for the known biologic may be determined, and linked with the associated biologic record as described above.

E. Determining Biologic Development Recommendations

In certain embodiments, the systems and methods described herein utilize (e.g., via a machine learning approach) the attribute store to determine one or more biologic development recommendations (e.g., bioprocess recommendations; e.g., formulation recommendations; e.g., structure-function recommendations) in response to an input query.

In particular, the approaches described herein use the GSAs associated with biologic records of the attribute store and determined GSAs of a target biologic to identify related parameters corresponding to any of (i) bioprocess parameters, (ii) formulation parameters, and (iii) structure-function parameters.

FIG. 2A is a block flow diagram illustrating an example process for determining biologic development recommendations based on automated analysis of one or more GSAs of a target biologic. As shown in FIG. 2A, the biologic development determination tool described herein receives an input query comprising the one or more target biologic GSAs (202). As described above, the one or more target biologic GSAs are derived from structural features (e.g., measured structural features) of the target biologic. The structural features of the target biologic may be provided via a user input, and preprocessed to determine the one or more target biologic GSAs that are used as the input query. Such preprocessing may be accomplished by an attribute preprocessing module, as shown in FIG. 2B, which illustrates an example interaction between various components (e.g., modules, databases, and data elements) used for determining biologic development recommendations. As shown in FIG. 2B, a received user input comprises structural characterization data for a target biologic, along with an identification of a molecule type of the target biologic. The attribute preprocessing module determines, using the user input structural characterization data, a plurality of target GSAs of the target biologic.

A machine learning module takes the determined GSAs of the target biologic as input and accesses the attribute store comprising biologic records and known biologic GSAs (204). The machine learning module compares the GSAs of the target biologic with GSAs of known biologics for which biologic records are stored in the attribute store. Based on the comparison between the GSAs of the target biologic and the GSAs of known biologics, the machine learning module identifies relevant bioprocess, formulation, and/or structure-function parameters that are associated with (e.g., correlated with) one or more GSAs of the target biologic. Relevant bioprocess, formulation, and structure-function parameters can be used to determine biologic development recommendations (206), such as any of the bioprocess (206a), formulation (206b), and structure-function (206c) recommendations described above.

In an illustrative embodiment for a search and generalization algorithm of the present invention, a user inputs a query by specifying data points on one or several planes (for example, specification of a target biologic subclass and specification of a GSA). In a first step (labeled 1 in FIG. 6B), the tool computes a similarity score for each data point on the plane around the user’s input and creates a radius of similarity within each of the input planes. In a second step (labeled 2 in FIG. 6B), the tool searches and returns all retrospective associations between all pairs of points within the similarity radii in the user input planes and their correlates on different planes. In a third step (labeled 3 in FIG. 6B), each retrospective association is evaluated for its specificity and generalized using established machine learning generalization techniques and/or ad hoc heuristics. In a final step (labeled 4 in FIG. 6B), inference-chaining of relevant associations is performed to return a scored and prioritized array of predictive associations, which form the basis of the biologic development determinations returned by the tool.
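The first two steps of this algorithm (forming a similarity radius, then collecting retrospective associations) might be sketched as below. The specification does not give a similarity measure or a storage layout, so the lookup tables, the threshold, the entity names, and the function names `similarity_radius` and `retrospective_associations` are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical similarity scores between the query entity and stored entities
# on the same plane (1.0 = identical). Values are invented for illustration.
SIMILARITY: Dict[str, Dict[str, float]] = {
    "target_X": {"EPO": 0.8, "G-CSF": 0.6, "IgG1": 0.1},
}

# Hypothetical retrospective associations: (entity, correlate on another plane).
ASSOCIATIONS: List[Tuple[str, str]] = [
    ("EPO", "PS-80"),
    ("EPO", "HSA"),
    ("IgG1", "histidine buffer"),
]


def similarity_radius(query: str, threshold: float) -> Set[str]:
    """Step 1: keep entities whose similarity to the query meets the threshold."""
    scores = SIMILARITY.get(query, {})
    return {name for name, score in scores.items() if score >= threshold}


def retrospective_associations(neighbors: Set[str]) -> List[Tuple[str, str]]:
    """Step 2: return stored associations whose source lies inside the radius."""
    return [(src, dst) for src, dst in ASSOCIATIONS if src in neighbors]


neighbors = similarity_radius("target_X", threshold=0.5)
found = retrospective_associations(neighbors)
```

With the invented scores above, only entities scoring at least 0.5 fall inside the radius, so associations attached to the dissimilar entity are not returned.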

FIG. 6C provides an illustrative example of a typical user workflow. In this particular example, the user indicates interest in a subclass of biologic molecules known as cytokine hormones. The user also indicates interest in a specific GSA, namely Aggregation. In a first step (labeled 1 in FIG. 6C), the tool computes a similarity score for each data point on the plane around the user’s input and creates a radius of similarity within each of the input planes. In this particular example, the tool identifies erythropoietin (EPO) as a biologic entity sharing some similarity with the user’s target biologic X. The tool also identifies data entities in its attribute store related to the Aggregation GSA. In a second step (labeled 2 in FIG. 6C), the tool searches and returns all retrospective associations between all pairs of points within the similarity radii in the user input planes and their correlates on different planes. To illustrate, associations are found between EPO, the formulation ingredients polysorbate-80 (PS-80) and human serum albumin (HSA), and the use of a rubber stopper in the container. Associations are also found between the Aggregation GSA and the anti-drug antibody clinical impact (ADA). In a third step (labeled 3 in FIG. 6C), each retrospective association is evaluated for its specificity and generalized using established machine learning generalization techniques and/or ad hoc heuristics. To illustrate, an association is found between PS-80 and the rubber stopper. An association is also found between the rubber stopper and the presence of organic leachates, as well as an association between organic leachates and ADA immunogenic reactions. In a final step (labeled 4 in FIG. 6C), inference-chaining of relevant associations is performed to return a scored and prioritized array of predictive associations, such as:

Cytokine hormone X, being similar to EPO, is likely to be formulated with PS-80; a container with a rubber stopper will likely contribute, in the presence of PS-80, to the presence of organic leachates, which are associated with an ADA immunogenic reaction. The resulting development determinations include, in this illustrative example: (1) Adjusting the concentration of PS-80 in the formulation, and (2) switching to a Teflon-coated stopper as the preferred closure system.
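One way to picture the inference-chaining step is as a search for scored paths through a graph of associations. The sketch below is an assumption-laden illustration, not the tool's actual method: the edge weights are invented confidence scores, the scoring rule (product of edge weights) is a simple choice made for the example, and the function name `chain` is hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical association graph; edge weights are invented confidence scores.
GRAPH: Dict[str, Dict[str, float]] = {
    "target_X": {"EPO": 0.8},
    "EPO": {"PS-80": 0.9, "HSA": 0.7, "rubber stopper": 0.6},
    "PS-80": {"rubber stopper": 0.5},
    "rubber stopper": {"organic leachates": 0.7},
    "organic leachates": {"ADA": 0.6},
}


def chain(start: str, goal: str) -> List[Tuple[float, List[str]]]:
    """Enumerate acyclic paths from start to goal, scored by the product of edge weights."""
    results: List[Tuple[float, List[str]]] = []

    def walk(node: str, path: List[str], score: float) -> None:
        if node == goal:
            results.append((score, path))
            return
        for nxt, weight in GRAPH.get(node, {}).items():
            if nxt not in path:  # avoid revisiting nodes
                walk(nxt, path + [nxt], score * weight)

    walk(start, [start], 1.0)
    # Highest-scoring chains first.
    return sorted(results, key=lambda item: item[0], reverse=True)


ranked = chain("target_X", "ADA")
```

Each returned chain pairs a score with the sequence of associated entities, giving a prioritized list from which determinations such as the two above could be drawn.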

For example, in certain embodiments, the machine learning module identifies patterns within the GSAs linked to the biologic records of the attribute store, and uses the identified patterns to identify relevant bioprocess parameters. Relevant bioprocess parameters may be identified based on their correspondence to bioprocess parameter values that the biologic records linked to the GSAs comprise. In this manner, the machine learning module may generate a mapping between bioprocess parameters and particular GSAs of the known biologics associated with biologic records of the attribute store. For example, FIG. 7 shows a mapping between select GSAs and two particular bioprocess parameters. By virtue of determined mappings between known biologic GSAs and various bioprocess parameters, comparing GSAs of a target biologic to the known biologic GSAs allows for relevant bioprocess parameters to be identified, and correlations between relevant bioprocess parameters and GSAs of the target biologic to be determined. This same approach can be used, in a similar fashion, to identify relevant formulation parameters and relevant structure-function parameters that are correlated with one or more GSAs of the target biologic.
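As a simple illustration of a mapping between a GSA and a bioprocess parameter, a correlation coefficient can be computed across biologic records. The sketch below uses a Pearson coefficient as one possible measure; the specification does not prescribe it. The data values, the variable names, and the 0.8 cut-off are invented for the example.

```python
from math import sqrt
from typing import List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented values: glutamine concentration (a bioprocess parameter) and
# percent Man5 glycan (a GSA) across five hypothetical biologic records.
glutamine_mM = [1.0, 2.0, 4.0, 6.0, 8.0]
man5_percent = [18.0, 14.0, 9.0, 5.0, 3.0]

r = pearson(glutamine_mM, man5_percent)
is_relevant = abs(r) >= 0.8  # assumed cut-off for a "relevant" parameter
```

A strongly negative coefficient here would flag glutamine concentration as a relevant bioprocess parameter for a target biologic whose GSAs include an elevated Man5 fraction.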

Once any one of a relevant bioprocess parameter, a relevant formulation parameter, and a relevant structure-function parameter is identified as associated (e.g., correlated) with a particular set of one or more GSAs of the target biologic, it can be used to determine a biologic development recommendation. For example, based on determined associations (e.g., correlations) between relevant bioprocess parameters and GSAs of the target biologic, a variety of bioprocess recommendations, as described in Section A above, can be determined. Similarly, in certain embodiments, based on determined associations (e.g., correlations) between relevant formulation parameters and GSAs of the target biologic, a variety of formulation recommendations, as described in Section A above, can be determined. In certain embodiments, based on determined associations (e.g., correlations) between relevant structure-function parameters and GSAs of the target biologic, a variety of structure-function recommendations, as described in Section A above, can be determined.

Notably, in certain embodiments, bioprocess recommendations and formulation recommendations need not be determined based on relevant bioprocess parameters and relevant formulation parameters, respectively. Instead, in certain embodiments, bioprocess recommendations may be determined based on identified associations between structure-function parameters and GSAs of the target biologic. Similarly, in certain embodiments, formulation recommendations may also be determined based on identified associations between structure-function parameters and GSAs of the target biologic. In general, a determination of an association between one or more GSAs of the target biologic and any one of a bioprocess parameter, a formulation parameter, and a structure-function parameter may serve as the basis for any of the biologic development recommendations described herein (e.g., bioprocess recommendations; e.g., formulation recommendations; e.g., structure-function recommendations).

In certain embodiments, comparing target biologic GSAs with known biologic GSAs to determine biologic development recommendations comprises performing classification and/or cluster analysis. For example, cluster analysis is used to categorize biologic records in the attribute store as belonging to one or more particular groups based on the set of known biologic GSAs that they are linked to.

An example approach for performing cluster analysis to identify related data structures linked to quantities representing structural features of biologics is described in U.S. Provisional Patent Application Number 62/506,443, filed May 15, 2017, the contents of which are hereby incorporated by reference in their entirety. The data structures in U.S. Provisional Patent Application Number 62/506,443 are referred to as analytical method records, and the quantities representing structural features of biologics are referred to as GBAs. A similar approach can be utilized for identifying related biologic records, instead of analytical method records, by using the GSAs to which they are linked (as opposed to the GBAs of U.S. Provisional Patent Application Number 62/506,443). In certain embodiments, various GSA vectors can be created from GSAs of known biologics and used to map biologic records to various points in a multi-dimensional space. Clusters of related biologic records (e.g., in close proximity to each other in multi-dimensional space) can be identified. The GSAs of the target biologic can then be used to identify the target biologic as associated with a particular cluster, based on the position in multi-dimensional space identified by a vector created from its GSAs. In certain embodiments, a GSA vector is determined as a weighted sum of a subset of a given set of GSAs (e.g., a set of target biologic GSAs; e.g., a set of known biologic GSAs). Different GSA vectors are determined using different weightings and/or different subsets of GSAs. For a given set of GSAs, determined values of each of the GSA vectors identify a point in a multi-dimensional space (e.g., the number of dimensions corresponding to the number of GSA vectors). Accordingly, each set of known biologic GSAs maps to a point in a multi-dimensional space.
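The weighted-sum construction of GSA vectors can be illustrated with a short sketch. The GSA names, the weights, and the function name `to_point` are assumptions chosen for the example; the specification does not fix any of them.

```python
from typing import Dict, List

# Each GSA vector is a weighted sum over a subset of GSAs. Weights are invented.
GSA_VECTORS: List[Dict[str, float]] = [
    {"man5_percent": 0.5, "galactosylation_percent": 0.5},
    {"aggregation_percent": 1.0, "man5_percent": -0.25},
]


def to_point(gsas: Dict[str, float], vectors: List[Dict[str, float]]) -> List[float]:
    """Map a set of GSAs to a point with one coordinate per GSA vector."""
    return [
        sum(weight * gsas.get(name, 0.0) for name, weight in vector.items())
        for vector in vectors
    ]


# Invented GSA values for one known biologic.
known_gsas = {
    "man5_percent": 8.0,
    "galactosylation_percent": 40.0,
    "aggregation_percent": 3.0,
}
point = to_point(known_gsas, GSA_VECTORS)
```

Two GSA vectors give a point in two-dimensional space; adding a third weighting would add a third coordinate, consistent with the number of dimensions corresponding to the number of GSA vectors.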

In certain embodiments, in order to identify relevant biologic records for a given target biologic, values of the two or more GSA vectors are determined for the set of GSAs of the target biologic. The values of the two or more GSA vectors for the target biologic thus map the GSAs of the target biologic to a point in two or higher dimensional space. In certain embodiments, the target biologic can be identified as belonging to a particular biologic record cluster based on whether its GSAs map to a point within the region in space to which the biologic record cluster corresponds. For example, the GSAs of two different target biologics map to two different points in two-dimensional space. A first target biologic is associated with a first biologic record cluster, and a second target biologic is associated with a second biologic record cluster. In this manner, various combinations of biologic records relevant to a given target biologic may be identified by cluster analysis. Multiple different sets of GSA vectors can be used, either in combination (e.g., such that N GSA vectors define an N-dimensional space) or in multiple rounds (e.g., using a first set of N GSA vectors in a first round and a second set of M GSA vectors in a second round). Different sets of GSA vectors can be used to define different multi-dimensional spaces and to map various known and target biologics to different points in these spaces based on the specific combinations and weightings of various GSAs from which the different GSA vectors are computed. In certain embodiments, use of multiple different sets (e.g., in multiple rounds) of GSA vectors is valuable if a biologic record cluster cannot be identified for a given target biologic using a first set of GSA vectors. For example, in certain embodiments, a particular set of GSA vectors maps GSAs of a target biologic to a point that does not fall within any of the regions corresponding to identified biologic record clusters. Accordingly, another set of GSA vectors may be used to associate the target biologic with a particular biologic record cluster.
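The multi-round fallback described above might look like the following sketch. It assumes, purely for illustration, that each cluster region is a sphere given by a centroid and a radius; the specification does not define the shape of a cluster region. The function names and all coordinates are hypothetical.

```python
from math import dist
from typing import Dict, List, Optional, Sequence, Tuple

Cluster = Tuple[Sequence[float], float]  # (centroid, radius), an assumed region shape


def assign_cluster(point: Sequence[float], clusters: Dict[str, Cluster]) -> Optional[str]:
    """Return the nearest cluster whose region contains the point, else None."""
    best_name, best_d = None, float("inf")
    for name, (centroid, radius) in clusters.items():
        d = dist(point, centroid)
        if d <= radius and d < best_d:
            best_name, best_d = name, d
    return best_name


def assign_in_rounds(
    points: List[Sequence[float]], cluster_sets: List[Dict[str, Cluster]]
) -> Optional[Tuple[int, str]]:
    """Try each set of GSA vectors in turn; stop at the first round that yields a cluster."""
    for round_index, (point, clusters) in enumerate(zip(points, cluster_sets)):
        name = assign_cluster(point, clusters)
        if name is not None:
            return round_index, name
    return None


# Round 0 uses a two-dimensional space; round 1 uses a different, three-dimensional one.
cluster_sets = [
    {"cluster_A": ((0.0, 0.0), 1.0), "cluster_B": ((5.0, 5.0), 1.0)},
    {"cluster_C": ((1.0, 1.0, 1.0), 2.0)},
]
# The target's coordinates in each space (invented values).
target_points = [(2.5, 2.5), (1.5, 1.0, 0.5)]
result = assign_in_rounds(target_points, cluster_sets)
```

In this invented case the target falls outside both regions in the first space, so the second set of GSA vectors is what associates it with a cluster.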

In certain embodiments, a non-linear transform is used in combination with computation of GSA vectors to separate groups of biologic records of known biologics into different clusters. Applying a non-linear transform in this manner allows attributes that were previously non-separable to be separated via a linear function (e.g., a line in two-dimensional space).
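A small worked example may help show why a non-linear transform can make groups separable. The points, the choice of transform (squared distance from the origin), and the threshold are all assumptions for illustration; the specification does not name a particular transform.

```python
from typing import List, Tuple

# Invented two-dimensional points: group "inner" lies near the origin and
# group "outer" surrounds it, so no straight line in (x, y) separates them.
inner: List[Tuple[float, float]] = [(0.5, 0.0), (0.0, -0.5), (-0.4, 0.3)]
outer: List[Tuple[float, float]] = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]


def transform(point: Tuple[float, float]) -> float:
    """Non-linear feature: squared distance from the origin."""
    x, y = point
    return x * x + y * y


THRESHOLD = 1.0  # a linear boundary (a single cut) in the transformed space

inner_side = [transform(p) < THRESHOLD for p in inner]
outer_side = [transform(p) < THRESHOLD for p in outer]
```

After the transform, every inner point falls below the threshold and every outer point above it, so a single linear cut separates groups that could not be split by a line in the original coordinates.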

In certain embodiments, various other machine learning approaches may be utilized, in combination with or in place of the cluster analysis approach described above. For example, in addition to cluster analysis, a variety of other unsupervised machine learning techniques, such as k-means clustering and self-organizing maps, may be used. Unsupervised machine learning techniques are useful where training data such as biologic records of known biologics does not include output information that maps parameters of interest (e.g., bioprocess parameters; e.g., formulation parameters; e.g., structure-function parameters) to particular GSAs (represented by a record in the attribute store) for a given known biologic.

In this manner, unsupervised learning is viewed as the task of finding patterns and structure in the data - a way of creating a higher-level representation of the data.

In certain embodiments, performance indices included within and/or linked to biologic records are used as a measure of performance of the particular bioprocessing, formulation, or structure-function profile that a given biologic record represents. This allows reinforcement learning machine learning techniques to be used.

In certain embodiments, multiple machine learning techniques are used in combination. For example, in certain embodiments, an unsupervised machine learning technique (e.g., cluster analysis) is used as a precursor to a supervised machine learning technique.

Accordingly, by virtue of the specific manner in which the attribute store codifies knowledge regarding previously used bioprocesses, formulations, and structure-function profiles via biologic records that are linked with GSAs of known biologics, the systems and methods described herein allow a variety of machine learning techniques to be utilized to determine biologic development recommendations (e.g., bioprocess recommendations; e.g., formulation recommendations; e.g., structure-function recommendations). As described above, such biologic development recommendations (bioprocess recommendations, formulation recommendations, and structure-function recommendations) provide guidance to a user for refining a bioprocess, a formulation, or a structure-function profile of a target biologic.

F. Constructive Examples

F.i Example 1: Bioprocess: Correlation between structural characterization of glycan composition and key bioprocess conditions such as nutrients in cell culture media

Example 1 is an example of how the biologic development determination tool described herein may be used to determine various bioprocess recommendations corresponding to representations of correlations between GSAs of a target biologic and relevant bioprocess parameters. As described in the following, correlations are identified between GSAs of a target biologic corresponding to glycan composition patterns and relevant bioprocess parameters such as nutrients in cell culture media.

The importance of correlating how bioprocessing parameters influence glycosylation can be seen in assessments of physicochemical stability and bioactivity. IgG antibodies typically are glycosylated at N297 in the CH2 domain of their Fc region as shown, for example, in FIG. 8A. The pool of glycovariants found in antibodies is heterogeneous (see, e.g., FIG. 8B and FIG. 8C) and individual glycoforms differ in their biological activity (Jefferis, Nat Biotechnol 24(10):1230-1, 2006; see, e.g., FIG. 8D). Removal of this glycan abrogates binding to FcγRs and related effector functions (Pound et al., Mol Immunol 30:469-78, 1993; Sarmay et al., Mol Immunol 29:633-9, 1992; Tao & Morrison, J Immunol 143:2595-601, 1989; Jefferis et al., Immunol Rev 163:59-76, 1998). Glycans on monoclonal antibodies (mAbs) are less diverse than those on other glycoproteins due at least in part to structural constraints that limit the size of the added appendage in this position. For example, FIG. 8E shows a comparison of the major types of N-glycan structures typically observed on antibodies, versus those observed on other proteins. Chemical moieties on the glycan interact directly with the protein, and as such, the composition of the glycan influences stability and bioactivity of the mAb. Individual Fc glycans are known to direct several effector functions and have been shown to impart functional specificity through direct binding with a receptor, as shown in the high-resolution structure of glycosylated Fc and receptor FcγRIIIa (Subedi & Barb, Structure 23(9):1573-83, 2015; FIG. 8F) or through the antibody by favoring a preferred conformation, such as for increased binding to complement proteins or Fc receptor. Of significant importance in the development of therapeutic monoclonal antibodies, glycosylation at N297 in Fc leads to a structurally integrated glycoprotein with improved resistance to denaturation and chemical degradation due to specific structure-stabilizing binding interactions. Using high-resolution x-ray crystallography to study a set of specific glycosylated forms of a mAb, Krapp et al. demonstrated that unique glycan modifications lead to distinct conformations of Fc (Krapp et al., JMB 325(5):979-89, 2003). These unique glycan modifications and the distinct conformations to which they lead have functional implications because the binding site region for multiple partner proteins is altered.

Crystallization of the deglycosylated form was not possible due to a high degree of conformational flexibility (Krapp et al., JMB 325(5):979-89, 2003; Higel et al., Eur J Pharm Biopharm 100:94-100, 2016). High-resolution solution NMR was used to show that fucosylation alone does not alter mAb conformation but terminal galactose addition has a substantial effect on structure (Houde et al., Proteomics 9(8):1716-28, 2010) and, separately, that terminal galactose residues in the 1,3 and 1,6 arms in G2F glycan are exposed and each has different flexibility and solvent accessibility (Barb & Prestegard, Nat Chem Biol 7(3):147-53, 2011). Complete removal of the glycan lowers the thermal melting temperature of the antibody and correlates with increased aggregation (Zheng et al., mAbs 3(6):568-76, 2011). The increase in aggregation has been tied to a distinct mechanism wherein steric repulsion from attaching more than two saccharide units protects against assembly (Subedi & Barb, Structure 23(9):1573-83, 2015).

Glycosylation is also effective at protecting proteins from proteolysis, which affects pharmacokinetics of biologics, and has been reported to be due to steric hindrance of the glycan limiting enzyme approach to the protein. One study showed that modifying the end-terminal glycan structure to be N-acetylglucosamine, galactose and sialic acid led to increasing proteolytic stability to papain digestion at the hinge region of antibodies (Raju & Scallon, Biochem Biophys Res Commun 341(3):797-803, 2006). This result suggests that the differences in conformation of the antibody resulting from altered contact with the different glycan moieties are responsible for differences in proteolytic stability. In a separate study, HDX MS analysis of IgG1 and IgG2 molecules showed greater conformational flexibility of the CH2 domain and more proteolysis when the glycan contained high-mannose or hybrid structure compared to complex glycans lacking sialylation. Sialylation led to similar results as high-mannose, weakening the interaction between the glycan moiety and protein compared to the asialylated form, which suggests this modification may lead to steric repulsion with the CH2 domain (Fang et al., Biochemistry 55(6):860-8, 2016). In most mAbs only a small amount of sialylation is observed (Zhang et al., Drug Dis Today 21(5):740-65, 2016; Padlan, Mol Immunol 31(3):169-217, 1994). The encoded C-terminal lysine in mAb heavy chain often is removed during bioprocessing. A separate study showed fucosylation is highly correlated with retention of C-terminal lysine in mAbs (Yang et al., Anal Biochem 448:82-91, 2014).

Chemical degradation also can be mitigated by glycosylation. For example, erythropoietin is naturally glycosylated, and this modification protects the protein from oxidation at a tyrosine residue that destroys activity.

Protein glycosylation is a complex process modulated by numerous factors both intrinsic and extrinsic to a cell and, as exemplified below, the cumulative data demonstrate that prediction of types and proportions of protein glycovariants produced by a cellular system is highly complex and lacks fully understood mechanistic models. Nonetheless, it has been established that culture conditions impact protein production and the distribution of post-translational modifications (PTMs) to the protein, including glycosylation, in a systematic manner through pathways that regulate cellular processes in response to the environment. Although cells are comprised of extremely complex and interconnected networks, the influences exerted by individual and combinations of components/conditions on protein production and modification pathways are being studied and reported. While there are many diverse cell types, a few have gained common usage for the purpose of recombinant protein production (e.g., CHO cells for monoclonal antibodies) and as such have been investigated with respect to how culture conditions and media components affect protein biosynthesis (i.e., bioprocessing). Understanding how to modulate the PTM profile of recombinant protein products offers enormous potential for controlling pharmaceutical properties and therapeutic attributes of biologic drugs. Presently, only a few such correlations have been identified using data from intentionally designed studies, but a much larger pool of data generated for other purposes is available from which mappings of the relationships between bioprocessing parameters and PTMs may be derived. Most data are dispersed widely amongst many and diverse sources, and the ability to extract information and develop correlations from partial, unstructured and sometimes conflicting data requires sophisticated data analysis approaches, including machine learning, to interpret non-mechanistic information accurately in the correct context to yield constructive knowledge about bioprocessing.

Glucose, obtained from the extracellular environment, is the primary compound utilized by cells to generate usable energy and to drive diverse biosynthetic reactions, cellular signaling and post-translational modification of proteins. Most glucose is converted into pyruvate by glycolysis but a small amount of leakage of fructose-6-phosphate into the hexosamine biosynthetic pathway facilitates UDP-N-acetylglucosamine (UDP-GlcNAc) production (FIG. 8G) to support protein glycosylation by O-GlcNAc transferase (OGT) or by a series of five N-acetylglucosaminyltransferases (Mgat1, Mgat2, Mgat4a/b/c and Mgat5). Because glycosylation influences protein activity, stability and/or phosphorylation status, it provides a central link between nutrient uptake and cellular decision making.

Amino acids are the basic building blocks of protein biosynthesis but also are utilized in numerous metabolic pathways to support and regulate many cellular processes. Some amino acids can drive cellular changes via regulation of gene expression and/or are used to generate key components involved in biosynthetic reactions. Dependence on uptake from the environment (e.g., cell culture media) versus intracellular production varies among amino acids, such that some are synthesized in a tightly regulated manner and others are largely obtained from the environment. Monoclonal antibody (mAb) production is dependent upon availability of the most frequently utilized amino acids in the sequence such that a reduction in concentration of an amino acid can limit the expression of mAb.

De novo UDP-GlcNAc synthesis requires both glucose and glutamine (Gln) (see, e.g., FIG. 8G). Example studies report dynamic relationships between glucose and glutamine concentrations and glycosylation of proteins. Fan et al. demonstrated an integrated relationship of glucose and amino acid concentration with metabolism of these nutrients, nucleotide sugar metabolism, cell growth, IgG production and N-glycosylation in fed-batch CHO cell culture (Fan et al., Biotech Bioeng 112(10):2172-84, 2015). Insufficient levels of Gln in media can result in limited biosynthesis of UDP-GlcNAc by the cell and can result in the appearance of Man5 glycan at N297 in Fc of IgG1 (Fan et al., Biotech Bioeng 112(10):2172-84, 2015). Specific proportions of glycan variants depend on at least several factors, including levels of transporters, enzymes and other sugars (e.g., GlcNAc), which can also vary with cell line and growth phase. Competition between phosphorylation and O-GlcNAc modification of Ser/Thr in key proteins, such as c-Myc, controls cellular transformations and is ultimately regulated by levels of glucose and Gln through their modulation of expression of metabolic enzymes and nutrient transporters (Swamy et al., Nat Immunol 17:712-720, 2016; FIG. 8H).

Build-up of waste products resulting from metabolism can regulate and/or interfere with cellular processes as well. For example, high levels of ammonia (NH4+) in culture can result in higher Man5 glycan on mAb due to decreases in production of the key enzyme α-1,3-mannosyl-glycoprotein 2-β-N-acetylglucosaminyltransferase (GnTI) and diminished activity of the UDP-GlcNAc transporter (Fan et al., Biotech Bioeng 112(10):2172-84, 2015). In addition, glycoconjugate turnover and recycling can make a significant contribution to the intracellular glycan pool and profile of modified protein glycovariants, and it was shown that elevating extracellular GlcNAc levels led to increased N-glycan branching, particularly for tri- and tetra-antennary N-glycans, in hepatocytes and rat liver (Ryczko et al., Sci Rep 6:23043, 2016). The phenomenon is not universal among cells, however, as demonstrated by Shikhman et al., where elevated levels of exogenous GlcNAc did not increase its uptake into chondrocytes but did accelerate glucose transport (Shikhman et al., Osteoarthritis and Cartilage 17(8):1022-8, 2009).

Other sugars affect glycosylation in analogous ways in the cell as well, and competition amongst pathways can alter incorporation of other saccharides. For example, galactosylation of mAb Fc glycan also depends on extracellular glucose and Gln concentrations and is limited by UDP-Gal biosynthesis.

As this complex set of inter-relationships is represented and analyzed, correlations are established and decisions can be made about production parameters based on a biologic’s PTM profile. Based on current knowledge, for example, if mass spectrometry analysis of a batch of mAb A from CHO cells shows a high percentage of Man5 at N297, the culture is likely to have had a sub-optimal concentration of glutamine in the media or a build-up of ammonium ion. If a new correlation emerged from data mining that showed, for example, that ammonium ion build-up down-regulates Mgat2, then the ratio of hybrid to complex glycans on mAb A also would be altered, and an increased hybrid:complex glycan ratio would indicate that ammonium ion build-up is causing elevated Man5. In this case, the PTM profile would indicate that the culture media should be replaced, rather than that glutamine should be added. If, however, the hybrid:complex glycan ratio is not altered and a quantitative relationship between glutamine concentration and relative percentage of Man5 is known, then the amount of glutamine to be added to eliminate Man5 can be calculated from the PTM profile. Adjusting culture conditions would be expected to improve thermal stability of mAb A because N297-Man5 is less resistant to aggregation than other glycoforms. In addition, using this information to optimize the glycan profile would be expected to improve in vivo pharmacokinetics (e.g., circulation half-life, proteolytic stability) of the biologic because Man5 is cleared more rapidly than other glycoforms, reducing effective residence time of the therapeutic.
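
The reasoning in the preceding paragraph can be sketched as a small decision rule. The following Python sketch is illustrative only: the function name, the thresholds, and the assumed linear relationship between added glutamine and Man5 percentage are hypothetical placeholders and are not values taken from this disclosure.

```python
def recommend_culture_action(man5_pct, hc_ratio, baseline_hc_ratio,
                             man5_threshold_pct=5.0, hc_tolerance=0.10,
                             man5_pct_per_mM_gln=None):
    """Suggest a culture adjustment from a PTM profile (hypothetical rule).

    man5_pct: relative percentage of Man5 at N297.
    hc_ratio, baseline_hc_ratio: observed and reference hybrid:complex ratios.
    man5_pct_per_mM_gln: assumed linear drop in Man5 (%) per mM glutamine added.
    """
    if man5_pct <= man5_threshold_pct:
        return {"action": "none"}
    # An elevated H:C ratio points to ammonium build-up rather than low glutamine.
    if hc_ratio > baseline_hc_ratio * (1.0 + hc_tolerance):
        return {"action": "replace_media"}
    # H:C unchanged: attribute Man5 to low glutamine; size the dose if a slope is known.
    result = {"action": "add_glutamine"}
    if man5_pct_per_mM_gln:
        result["glutamine_mM"] = man5_pct / man5_pct_per_mM_gln
    return result
```

In practice the thresholds and the quantitative glutamine relationship would come from the mined correlations rather than from fixed constants.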

Example Bioprocess Workflow

A typical workflow illustrating the use of the biologic development determination technology described herein encompasses the following steps:

A user inputs (i) an identification of a molecule type of the target biologic, and (ii) data corresponding to measured structural features of the target biologic obtained from a structural characterization study of the target biologic. The identification of the molecule type could identify the target biologic as a therapeutic monoclonal IgG1 antibody produced in CHO cells using a perfusion bioreactor. The structural characterization data could originate from measurements performed on samples of two lots of the target biologic. The two lots correspond to two separate production runs, one (Run 1) at an existing facility and another (Run 2) at a new facility. Example structural characterization data could include mass-spectrometry-based glycan profiles collected from each run showing N-glycosylation at ~96% occupancy at the expected site Asn297 with the following results:

The biologic development determination tool described herein uses the structural characterization data received via the user input to determine several Generalizable Structural Attributes (GSAs) of the target biologic. Determining GSAs can be accomplished through preprocessing applied via the attribute pre-processing module. Examples of relevant GSAs that can be determined from the measurements described above (the glycan profiles of the two lots of the target biologic) include the following differential comparison GSAs, which identify and/or quantify differences and similarities in GSAs of the target biologic as produced by Run 2 compared to Run 1:

• A GSA corresponding to the presence of mannose-rich Man5 glycan

• A GSA corresponding to relatively low levels of complex glycans

• A GSA corresponding to relatively high levels of hybrid glycans

• A GSA corresponding to expected levels of glycan occupancy

• A GSA corresponding to changes in relative amounts of hybrid to complex glycan ratios (H:C)
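
The differential comparison GSAs listed above could be derived from two glycan profiles along the following lines. This is a minimal sketch under stated assumptions: the dictionary keys, the 20% relative-change cutoff, and the numeric values are placeholders chosen for illustration and are not measurements from this disclosure.

```python
def glycan_gsas(run1, run2, rel_change=0.20):
    """Derive differential-comparison GSAs for Run 2 relative to Run 1.

    run1, run2: dicts of percentage abundances under the assumed keys
    'man5', 'hybrid', 'complex' and 'occupancy'.
    rel_change: hypothetical relative change treated as meaningful.
    """
    def direction(new, old):
        if new > old * (1 + rel_change):
            return "higher"
        if new < old * (1 - rel_change):
            return "lower"
        return "comparable"

    # Hybrid:complex (H:C) ratio for each run.
    hc1 = run1["hybrid"] / run1["complex"]
    hc2 = run2["hybrid"] / run2["complex"]
    return {
        "man5": direction(run2["man5"], run1["man5"]),
        "complex": direction(run2["complex"], run1["complex"]),
        "hybrid": direction(run2["hybrid"], run1["hybrid"]),
        "occupancy": direction(run2["occupancy"], run1["occupancy"]),
        "hc_ratio": direction(hc2, hc1),
    }

# Placeholder numbers for illustration only; not data from this disclosure.
run1 = {"man5": 2.0, "hybrid": 5.0, "complex": 90.0, "occupancy": 96.0}
run2 = {"man5": 15.0, "hybrid": 20.0, "complex": 60.0, "occupancy": 96.0}
gsas = glycan_gsas(run1, run2)
```
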

Once the GSAs of the target biologic are determined, they can be used as input to a machine learning module that accesses the attribute store and identifies relationships between the GSAs of the target biologic and relevant bioprocess parameters. The identified relationships between the GSAs of the target biologic and relevant bioprocess parameters are used to determine bioprocess recommendations, which are output by the tool. As illustrated below, the bioprocess recommendations may include representations of identified correlations between relevant bioprocess parameters and GSAs of the target biologic. The bioprocess recommendations may also include bioprocess design results comprising recommended values of relevant bioprocess parameters, determined based on the identified correlations. Based on the GSAs of the target biologic listed above, the following bioprocess recommendations are determined through analysis of the attribute store:

• Data analysis of glycosylation patterns for IgG1 mAbs produced in CHO cells reveals that high Man5 correlates with high levels of ammonium ion in media at harvest and separately with low levels of glutamine in media.

• Analysis also shows that high levels of ammonium ion correlate with lower levels of complex glycan (due to down-regulation of Mgat2 transferase enzyme) and also higher amounts of hybrid glycan.

• Data analysis further reveals that glutamine level correlates with the amount of glycan occupancy but not a change in relative amounts of hybrid to complex glycans.

Together the correlations derived from these data analyses indicate that a build-up of ammonium ion from metabolic processes is likely altering the glycan profile and that glutamine levels are not low. The finding suggests that increasing the perfusion flow rate at the new facility would likely yield the more desirable glycan profile (less Man5 and more complex glycans), comparable to the biologic produced at the existing facility. The identification of perfusion flow rate as a relevant bioprocess parameter, and the recommendation to increase the flow rate could itself be output as a bioprocess recommendation (optionally, including a value of a recommended flow rate), for example as part of a bioprocess design result.

F.ii Example 2: Formulation: Correlations between formulation and deamidation

Example 2 is an example showing how the biologic development determination tool described herein may be used to determine formulation recommendations corresponding to representations of correlations between GSAs of a target biologic and formulation parameters. As described in the following, correlations are identified between (i) GSAs of a target biologic corresponding to deamidation propensity and chemical degradation and (ii) formulation parameters such as presence of particular surfactants and storage conditions corresponding to particular types of containers.

Physicochemical stability of biologics during manufacture, storage and/or administration of product can decrease due to solution conditions, handling and/or environmental factors. Individual formulations are developed to reduce degradation and preserve integrity of the active ingredient. Understanding of the relationship between formulation conditions and stability, even for a well-established class of molecules like mAbs, is insufficient for closed-form prediction at present, and therefore formulation determination for a biologic product is achieved through an extensive screening process. Biologic formulation development typically involves utilization of forced/accelerated degradation approaches and monitoring of samples using a set of analytical characterization tools to evaluate physical (e.g., unfolding, dimerization, oligomerization, aggregation, adhesion) and chemical (e.g., oxidation, deamidation, glycation, isomerization, hydrolysis, disulfide scrambling) degradation of the biologic in the presence of numerous individual chemical components, solution conditions and/or potential formulations. A diverse set of additives, referred to as excipients, may be examined for this purpose, including salts, buffers, acids, bases, sugars, polyols, surfactants, detergents, amino acids, chelators, antioxidants, polymers or other generally recognized as safe (GRAS) compounds. Within each category, numerous individual compounds and/or combinations of compounds may be selected, but it is not currently possible to predict which formulation conditions will stabilize an individual protein. For example, addition of the amino acid arginine may improve physical stability and minimize aggregation of a mAb at one pH but may lead to destabilization at another. Although arginine may provide physical stabilization to one mAb, it may have little to no impact on a closely related mAb in the same conditions. Moreover, in conditions that are otherwise the same with related molecules (e.g., immunoglobulin), addition of one specific amino acid (e.g., Arg or Lys) to the formulation may improve stability, while addition of another amino acid, even with very similar chemical structure (e.g., ornithine), may not provide stabilization. As such, predicting formulation conditions has not been achieved even for highly similar, related molecules.

Within the complex landscape of formulation development for a biologic, some specific aspects of formulation pertaining to molecular attributes are known to correlate with specific types of chemical stability, such as the effect of pH on deamidation rate (Patel & Borchardt, Pharm Res 7(7):703-11, 1990). Deamidation occurs primarily at asparagine (Asn) residues and to a limited extent at glutamine (Gln). Deamidation proceeds slowest at around pH 5 and becomes faster with increasingly acidic and also relatively more basic conditions. Deamidation at low pH proceeds via a different mechanism than at higher pH, and products of the two reactions also may differ (FIG. 12A). At low pH, direct hydrolysis of the amide side chain moiety leads to formation of Asp (L-Asp). Subsequently, isomerization and racemization may occur to generate Iso-Asp, D-Asp, D-cyclic imide and D-Iso-Asp. At higher pH, deamidation proceeds via reaction of the amide side chain moiety within Asn with the adjacent C-terminal backbone nitrogen to form an intramolecular cyclic imide (Asu) intermediate, which then may be hydrolyzed to open the ring, generating Iso-Asp and Asp, typically in approximately a 3 to 1 ratio, respectively (FIG. 12B). Both reactions require water, and as such, lyophilization often has been used to provide additional protection for products prone to deamidation (Carpenter et al., Pharm Res 14(8):969-75, 1997).

Deamidation rates may also depend on the neighboring amino acid sequence, as well as various formulation parameters (e.g., whether the biologic drug is stored as a liquid or solid). For example, for proteins in solution form, deamidation rate may be influenced by the residue located C-terminal to Asn (P1) (Patel & Borchardt, Pharm Res 7(7):703-11, 1990), whereas the N-terminal N-1 and N-2 positions have a significant impact on deamidation in the solid state (Li et al., AAPS J 8(1):E166-73, 2006). In the P1 position, residues that permit greater flexibility are correlated with faster deamidation rates. For example, Asn-Gly (NG) sequences typically deamidate faster than when Asn is followed by any other residue. It is well established in unstructured/flexible peptides in solution that deamidation rate is dictated by pH and P1 composition.
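
As a simple illustration of the sequence-level attribute discussed above, the following Python sketch lists each Asn in a sequence together with the residue immediately C-terminal to it and flags Asn-Gly (NG) motifs. The function name and output layout are hypothetical, and the sketch deliberately ignores pH, physical state, and higher-order structure, all of which also influence deamidation rate.

```python
import re

def find_asn_sites(sequence, fast_p1="G"):
    """List Asn positions (1-based) with the residue C-terminal to each Asn.

    fast_p1: residues in that position treated as deamidation-prone; 'G'
    reflects the NG motif. Adding other residues would be an assumption.
    """
    sites = []
    # The lookahead captures the next residue without consuming it,
    # so consecutive Asn residues (e.g. 'NN') are each reported.
    for m in re.finditer(r"N(?=([A-Z]))", sequence.upper()):
        p1 = m.group(1)
        sites.append({"position": m.start() + 1, "p1": p1,
                      "flagged": p1 in fast_p1})
    return sites
```

A C-terminal Asn has no following residue and is therefore not reported by this sketch.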

Most proteins of therapeutic interest have higher-order structure, and in the context of the three-dimensional fold deamidation may be modulated by local structure, involving a complex set of parameters that impact rate (Robinson, PNAS 99(8):5283-3, 2002). Asn residues embedded in the same sequence can have substantially different degradation rates, for example depending on the local structure (e.g., secondary structure) near the Asn residue. For example, in similar sequences, studies have shown that deamidation may proceed at a slower rate for an Asn residue within a well-ordered alpha-helix compared with one within an unstructured loop (Kosky et al., Prot Sci 8(11):2519-23, 1999). It is thought that the helical structure confines the orientation of the Asn and P1 residues, greatly reducing the probability of reaction. In particular, it is thought that deamidation rate is decreased because the backbone peptide bond is held in a stabilized conformation by hydrogen bonding, preventing the groups from adopting the correct orientation for chemical attack to proceed. When deamidation does occur, local helical structure is broken because the requisite hydrogen bonds cannot form due to altered sterics.

Deamidation in beta-sheet structures is more complex (Xie & Schowen, J Pharm Sci 88(1):8-13, 1999). When embedded in the interior of a sheet, stabilization against degradation may be observed, but when Asn is located in an edge strand, protection may not be conferred and the residue may even be more vulnerable to deamidation than in an unstructured loop. Although rare, some particular structures are associated with more rapid deamidation at a site than would occur in the unstructured equivalent. For example, Asn106-Asp in human hypoxanthine guanine phosphoribosyltransferase chain A is held in a position that promotes formation of Asu, which thereby accelerates the deamidation reaction. In horse heart cytochrome c and in triosephosphate isomerase, a structural change resulting from a preceding deamidation event increases deamidation at Asn54-Lys and Asn15-Gly, respectively (Robinson & Robinson, PNAS 98(8):4367-72, 2001; Flatmark & Sletten, J Biol Chem 243:1623-9, 1968; Robinson et al., Int J Pept Protein Res 6:31-5, 1974).

Unexpectedly, Interleukin-2 deamidates extremely rapidly at Asn88-Ile. Deamidation proceeds in two steps and completion of the reaction depends on tertiary and local structure. For example, the structure of glutaminase from Methanocaldococcus jannaschii facilitates rapid formation of the cyclic imide intermediate at Asn109 but it adopts a highly stable conformation around the intermediate that prevents hydrolysis and greatly increases the structural stability of the protein to both high temperature and chemical denaturation (Kumar et al., Nat Commun 7:12798, 2016).

One might expect that it would be reasonable to predict deamidation of a conserved Asn residue in a protein variant based on the known behavior of a closely related molecule. While this may be possible in some cases, Phillips et al. compared deamidation rates of analogous Asn residues in NG sequences located in the CDRH2 of two mAbs and found that mAb1 underwent rapid deamidation at this site whereas no deamidation was detected in mAb2, indicating that despite having the same overall structure, a local conformational difference between the two homologues dramatically alters deamidation rate at the equivalent position (Phillips et al., Anal Chem 89(4):2361-8, 2017). This in-depth characterization study utilized HDX-MS. The Phillips study also reported the hydrogen-exchange rate at the two amides adjacent to the deamidation site. These amides often act as the nucleophiles in the reaction that results in deamidation at that particular site, such that the hydrogen exchange rates of these adjacent amides can be correlated with the rate of deamidation at that site (see, e.g., FIG. 12C).

One reason why prediction of formulation conditions with respect to stability is challenging is that proteins have different susceptibilities to unfolding with respect to pH and as such the pH of the solution may alter the structure of the protein. Several examples of pH-dependent conformational changes are reported in the literature. In β2M, residues in a beta-strand undergo a conformational change between pH 4 and 5 such that the backbone of a beta-strand packs differently at the different pH values. An anti-parallel structure is present at low pH and at slightly higher pH rearranges to form a bulge, and it was shown that this bulge protects the protein from forming amyloidogenic aggregates (Park & Saven, Prot Sci 15(1):200-7, 2006). Such pH-dependent differences in physical structure therefore may alter deamidation rate and/or physical stability unexpectedly. One study showed that deamidation may be altered at multiple positions in a mAb as a function of not only pH but also buffer type and temperature (Pace et al., J Pharm Sci 102(6):1712-23, 2013). In addition, the presence of impurities, including spontaneous formation of degraded protein species in an otherwise stable formulation, can unexpectedly alter stability of the system. Amylin peptide is physically stable in the absence of impurities but when it undergoes deamidation, accumulating to less than 5%, aggregation proceeds unexpectedly rapidly (Nilsson et al., Prot Sci 11(2):342-9, 2002). The crystal structures of ribonuclease A (RNase A) and its deamidated variant containing Iso-Asp at Asn67 show how the protein’s conformation is altered considerably by insertion of an extra methylene group into the backbone (FIG. 12D), converting a rigid region held together by hydrogen bonding into a more flexible structure, yet the protein remained stable at high concentration and amenable to crystallization (Capasso et al., J Mol Biol 257(3):492-6, 1996). RNase A when subjected to lyophilization from acidic pH (or other denaturing conditions) can undergo domain swapping at the N- and C-termini, leading to self-association. Interestingly, deamidation of RNase A at Asn67 and other positions inhibits this structured association and hence oligomerization (Fagagnini et al., Biochim Biophys Acta Prot Proteomics 1865(1):76-87, 2017).

It is evident from the diverse set of proteins, formulation conditions and outcomes reported that sophisticated data analysis approaches, including construction of predictive models via machine learning, will be necessary to extract understanding from the vast, dispersed available data to advance toward prediction of stability and formulation conditions. Even with a relatively large data set available in the literature for this chemical degradation, challenges remain in predicting accurately deamidation in structured proteins because it depends on many chemical and structural attributes of the system, and consequently, this area continues to be a subject of ongoing investigation (Lorenzo et al., PLoS One 10(12):e0145186, 2015, doi:10.1371/journal.pone.0145186; Jia & Sun, PLoS One 12(7):e0181347, 2017, https://doi.org/10.1371/journal.pone.0181347). Sequence based prediction results in all similar Asn sites (e.g., NG sites) being equivalent and inclusion of structure-based information in an algorithm improves accuracy of prediction, permitting identification of some non-deamidating NG sites, but not all (Jia & Sun, PLoS One 12(7):e0181347, 2017, https://doi.org/10.1371/journal.pone.0181347). Current algorithms focus solely on parameters intrinsic to the protein and do not include extrinsic influences (such as formulation conditions, for example). Assessment of these predictions, however, is based on comparison to measured values, and in fact, the measured deamidation values for different proteins were obtained in diverse pH and solution conditions. As such, using data analysis to determine how extrinsic factors influence deamidation and incorporating this information into prediction algorithms leads to more accurate in silico determination of susceptibility to degradation. The ability to predict accurately and reliably formulation conditions that stabilize successfully a given protein would provide enormous cost, time and resource savings in the development of therapeutic protein formulations by reducing the need for screening and/or narrowing the range of conditions to be analyzed and/or informing protein engineering approaches.

Example Formulation Workflow

A typical workflow illustrating the use of the biologic development determination technology described herein encompasses the following steps:

A user inputs (i) a molecule type of the target biologic, and (ii) data corresponding to measured structural features of the target biologic obtained from a structural characterization of the target biologic. The molecule type could be a therapeutic enzyme, for example L-asparaginase produced in CHO cells, and the drug product (DP) is to be administered from a pre-filled syringe as a liquid formulation. L-asparaginase is a homo-tetramer. The structural characterization data could originate from measurements performed on two samples of the target biologic generated from the same batch formulated at pH 5.5 in 10 mM sodium citrate, 10 mM histidine, 120 mM KCl. Form 1 is a Drug Substance (DS) stored in a 3-mL borosilicate glass vial at 20 °C, and Form 2 is a Drug Product (DP), which is the DS corresponding to Form 1 plus 0.1% polysorbate 20, stored at 4 °C in a siliconized 3-mL borosilicate syringe with a stainless steel needle attached. Mass-spectrometry-based PTM profiles collected from each sample show the following results:

Preprocessing of the structural characterization data input by the user is used to obtain target biologic Generalizable Structural Attributes (GSAs) for Form 2 compared to Form 1 that can be correlated with various formulation parameters including the following:

• A GSA of oligomer dissociation from the quaternary protein structure that can be correlated with increasing surfactant concentration

• A GSA of hydrophobically associated oligomer dissociation from the quaternary protein structure that can be correlated with surfactant concentration

• A GSA of oligomer dissociation from the protein quaternary structure that can be correlated with surfactant use in siliconized vials

• A GSA corresponding to increased chemical degradation evidenced by deamidation that can be correlated with the presence of polysorbate/surfactant

• A GSA corresponding to increased chemical degradation evidenced by deamidation that can be correlated with the use of siliconized containers

• A GSA corresponding to greater chemical degradation evidenced by deamidation that can be correlated with the presence of polysorbate/surfactant in siliconized containers compared to either surfactant or siliconized containers individually

• Dual simultaneous GSAs of oligomer dissociation from the quaternary protein structure and of higher relative levels of deamidation

Once the GSAs of the target biologic are determined, they can be used as input to a machine learning module that accesses the attribute store and identifies relationships (e.g., correlation) between the GSAs of the target biologic and relevant formulation parameters. These identified relationships may be then used to determine formulation recommendations, which may themselves correspond to representations of the determined correlations, or also include additional actionable recommendations, such as formulation design results. Based on the GSAs of the target biologic, the following formulation recommendations are determined through analysis of the attribute store:

• Data analysis of formulation conditions reveals that oligomeric proteins are more prone to dissociation in the presence of surfactant, siliconized vials, or both.

• Analysis also shows that the chemical basis for oligomer association affects the extent of dissociation resulting from surfactant inclusion, with a strong correlation to greater dissociation with hydrophobic interactions.

• Data analysis shows that use of surfactant or siliconized containers correlates with increased chemical degradation level and that combined use correlates with further increased chemical degradation level.

Together the correlations derived from these data analyses indicate that dissociation of the L-asparaginase tetramer likely is encouraged by the presence of surfactant and/or is enhanced by storage in siliconized containers. The findings suggest that deamidation at N281 and N41 is likely accelerated by a change in quaternary structure, leading to greater solvent exposure at these previously buried residues. The findings suggest that the surfactant concentration should be reduced or eliminated and use of a non-siliconized syringe should be examined. The identification of surfactant concentration as a relevant formulation parameter, and the recommendation to reduce surfactant concentration could itself be output as a formulation recommendation (optionally, including a recommended surfactant concentration value) and included in a formulation design result. Similarly, the identification of siliconization of containers as a relevant formulation parameter and the recommendation to use non-siliconized containers may also be output as formulation recommendations and included in formulation design results. FIG. 14 summarizes the workflow described in this example, illustrating the relation between structural features of the target biologic, GSAs of the target biologic, and determined biologic development recommendations, in particular, formulation recommendations.
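
The comparison underlying the last correlation above (combined use of surfactant and siliconized containers versus either alone) can be sketched as a grouped average over attribute-store records. This is a minimal sketch assuming a hypothetical record layout; the key names are not defined in this disclosure, and a real analysis would also account for sample size and variance.

```python
from statistics import mean

def degradation_by_condition(records):
    """Mean deamidation (%) grouped by (surfactant, siliconized) flags.

    records: iterable of dicts with the assumed keys 'surfactant' (bool),
    'siliconized' (bool) and 'deamidation_pct' (float).
    """
    groups = {}
    for r in records:
        key = (r["surfactant"], r["siliconized"])
        groups.setdefault(key, []).append(r["deamidation_pct"])
    return {k: mean(v) for k, v in groups.items()}

def combined_exceeds_individual(means):
    """True if the combined condition shows more degradation than either alone."""
    both = means[(True, True)]
    return both > means[(True, False)] and both > means[(False, True)]
```
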

F.iii Example 3: Structure-Function Profile: Glycosylation effects on mAb-based anti-PD1 immunotherapeutic efficacy

Example 3 is an example showing determination of a structure-function recommendation that includes identified relationships between GSAs of the target biologic corresponding to glycosylation patterns and structure-function parameters corresponding to immunotherapeutic efficiency.

Monoclonal antibody (mAb)-based therapeutics are gaining importance across a number of disease areas. One of the more prominent recent developments has been in cancer immunotherapy, for example the development of antibodies against immune checkpoints such as programmed death-1 (PD-1) (Topalian et al., Cancer Cell 27(4):450-61, 2015).

While sometimes demonstrating impressive results in the treatment of some cancers, these therapeutics do not always work. To be effective, these antibodies must both bind and remain bound to their intended targets. In some cases, physical interaction with the tumor microenvironment transfers the drug from its intended target to other cells, thus inactivating the drug and contributing to a loss of efficacy (Arlauckas et al., Science Translational Medicine 9(389):eaal3604, 2017, doi:10.1126/scitranslmed.aal3604).

The mechanisms contributing to efficacy and resistance are not completely understood, but evidence is emerging in the scientific literature linking structural characteristics of the biologic molecules, such as glycosylation profile, to efficacy or lack thereof (Jefferis, Journal of Immunology Research, 2016:5358272, 2016, http://dx.doi.org/10.1155/2016/5358272). As an illustration, Arlauckas et al. tracked the in vivo fate and activity of an anti-PD1 mAb in real time in mice using time-lapse intravital microscopy at a subcellular resolution. The study confirmed effective binding of the antibody to its intended target, in this case PD-1+ tumor-infiltrating CD8+ T cells at early time points after treatment administration. The left graph of FIG. 13A demonstrates binding between the antibody and the PD-1+ tumor-infiltrating CD8+ T cells. However, as shown in FIG. 13A, this binding is short-lived as the antibody is then captured within minutes from the T cell surface by PD-1− macrophage cells (see FIG. 13A, right graph). The study elucidated the transfer mechanism of anti-PD-1 mAb from T cells to macrophages by tracing it back to the Fc-domain glycosylation of the antibody and its binding to Fc-gamma receptors (FcγR) expressed by the macrophage cells. A released glycan analysis of the mAb revealed a predominant G0F glycoform that lacks terminal galactose residues and is fucosylated on the penultimate N-acetylglucosamine (FIG. 13B). Confirming the hypothesis that Fc/FcγR binding interactions for many IgG subclasses are affected by Fc glycosylation, the study authors demonstrated they could substantially prolong the occupancy time of the anti-PD-1 mAb on CD8+ T cells in the tumor bed by therapeutically inhibiting anti-PD-1/FcγR interactions with FcγRIIb/III blocking antibodies. Blocking FcγR interactions effectively eliminated the fraction of non-responders observed, with complete tumor rejection ensuing from the combination treatment (FIG. 13C).

Other studies also link structural characteristic patterns in the Fc-region of an antibody in general, and glycosylation in particular, to Fc-effector activity such as antibody-dependent cellular cytotoxicity (ADCC) and to the C1 component of complement. Fc glycosylation is essential for binding and activation, and activity can be modulated by glycoengineering (Jefferis, Arch Biochem Biophys, 526(2):159-66, 2012). Absence of core fucose has been linked with elevated ADCC activity (Chung & Alter, AIDS 28(17):2523-50, 2014). Similarly, decreased galactosylation (G0 and G0F) has been found to inversely correlate with antibody-dependent cellular phagocytosis, whereas N-glycolylneuraminic acid structures predict stronger antibody-dependent cellular phagocytosis (Chung & Alter, AIDS 28(17):2523-50, 2014). Galactosylation arm linkages have also been found to predict ADCC activity.

The impact of sialic acid on the pharmacokinetic profile of biotherapeutics has also been widely reported, with two known major pathways for selective glycoprotein clearance (Higel et al., Eur J Pharm Biopharm 100:94-100, 2016). First, glycoproteins with accessible terminal galactosylation are bound and cleared by the asialoglycoprotein receptor expressed in the liver. The second clearance pathway is executed by the mannose receptor expressed on immune cells, which binds selectively to mannose and N-acetylglucosamine residues of N-glycans. High mannose glycosylated mAbs generally show a strong decrease in half-life compared to complex or hybrid glycosylated mAbs.
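
The two clearance pathways described above amount to a lookup from exposed terminal residues to the receptor expected to bind them. A minimal Python sketch follows; the function name and the residue vocabulary used as keys are assumptions made for illustration.

```python
def clearance_pathways(terminal_residues):
    """Map exposed terminal glycan residues to the clearance receptors named above.

    terminal_residues: iterable of residue names (assumed vocabulary:
    'galactose', 'mannose', 'N-acetylglucosamine'); others are ignored.
    """
    receptor_for = {
        "galactose": "asialoglycoprotein receptor (liver)",
        "mannose": "mannose receptor (immune cells)",
        "N-acetylglucosamine": "mannose receptor (immune cells)",
    }
    # A set removes duplicates when several residues share a receptor.
    return sorted({receptor_for[r] for r in terminal_residues if r in receptor_for})
```
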

In addition to the glycosylation profile’s impact on effector function potential (e.g., the mechanism of action) and on clearance, it can also affect immunogenicity. Some of the cell lines used in biotherapeutic production (for example CHO, NS0, and Sp2/0) may add sugars that are not expressed on human glycoproteins and may therefore be immunogenic.

For example, CHO cell lines may add N-acetylneuraminic acid residues in α(2-3) linkage rather than the α(2-6) linkage present in human IgG-Fc (Jefferis, Journal of Immunology Research, 2016:5358272, 2016, http://dx.doi.org/10.1155/2016/5358272). Similarly, CHO, NS0 and Sp2/0 cells may add an N-glycolylneuraminic acid in α(2-3) linkage that also may be immunogenic in humans.

In summary, the glycosylation profile of a biotherapeutic molecule is often identified as a critical quality attribute (CQA). The preceding examples highlight the significance and benefits of correlating structural attributes (such as glycosylation profile) to functional/clinical behavior such as treatment efficacy and resistance. As these examples illustrate, such correlations typically map multiple structural characteristics onto several functional attributes in simultaneous fashion. These correlations reflect the existence of complex underlying patterns, the precise mechanistic nature of which is not usually fully known. Patterns lacking fully defined domain-specific models that can be solved in closed form (in the case of this constructive example, the biochemistry of glycoproteins), but for which experimental mapping data exists, are amenable to learning from data analysis (Abu-Mostafa et al., Learning from Data. Pasadena, CA: AML, 2012). The premise of learning from data is the use of sets of observations to uncover underlying processes. The learning algorithm searches for a hypothesis that classifies the data set well.
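
As one minimal illustration of a learning algorithm that searches for a hypothesis classifying a data set well, the following Python sketch implements the classic perceptron learning rule. The choice of the perceptron is an assumption made for illustration; this disclosure does not tie the machine learning module to any particular algorithm. The feature vectors (e.g., fractions of particular glycoforms) and the +1/-1 labels would be supplied by the caller and are hypothetical here.

```python
def train_perceptron(samples, labels, max_epochs=100):
    """Perceptron learning rule: find weights w with sign(w . x) == label.

    samples: list of feature tuples; labels: +1 or -1.
    Returns weights with the bias term first. Converges only if the data
    are linearly separable; otherwise stops after max_epochs.
    """
    w = [0.0] * (len(samples[0]) + 1)
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(samples, labels):
            xb = (1.0,) + tuple(x)              # prepend constant bias input
            score = sum(wi * xi for wi, xi in zip(w, xb))
            if y * score <= 0:                  # misclassified or on boundary
                w = [wi + y * xi for wi, xi in zip(w, xb)]
                updated = True
        if not updated:                         # a full pass with no mistakes
            break
    return w

def predict(w, x):
    """Classify x as +1 or -1 using weights from train_perceptron."""
    xb = (1.0,) + tuple(x)
    return 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else -1
```

For data that are not linearly separable, a different hypothesis set or learning algorithm would be needed.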

Insights deriving from the use of such correlations include predictive modeling of pharmacokinetic clearance, target engagement, potential to form aggregates, immunogenicity profile, and stability among others. Qualitative and quantitative functional differences between structural characteristics such as glycoforms can thus be anticipated and/or mined from known structural attribute stores. Applications of such insights introduce opportunities to improve upon existing biotherapeutics. Applications also introduce opportunities to improve upon current treatment options, for example by combining anti-PD-1 therapeutics with therapies that inhibit FcγR binding in vivo or therapies that target tumor macrophages.

Example Structure-Function Workflow

A typical workflow illustrating the use of the biologic development determination technology described herein encompasses the following steps:

A user inputs (i) a molecule type of the target biologic, and (ii) data corresponding to measured structural features of the target biologic obtained from a structural characterization of the target biologic. The molecule type could be a monoclonal antibody, for example. The structural characterization data could comprise results of a released glycan analysis of this target biologic using ultra performance liquid chromatography-hydrophilic interaction liquid chromatography (UPLC-HILIC) with fluorescence and on-line mass spectrometry detection.

Experimental results from a released glycan study include identification of N-linked glycan compositions and sequences as well as their quantitative distribution.

The structural characterization data input by the user is preprocessed to obtain generalizable structural attributes (GSAs) of the target biologic, which include the following:

• A GSA corresponding to the absence of α(1,6) fucose

• A GSA corresponding to the presence of low-abundance Man8

• A GSA corresponding to the presence of α(1,3)-linked terminal galactose
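The preprocessing step that yields the three GSAs listed above can be sketched as follows. This is an illustrative Python sketch only: the `GlycanPeak` record, the `derive_gsas` function, the GSA label strings, the 5% low-abundance cutoff, and the example profile values are all hypothetical and are not specified in this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlycanPeak:
    """One released-glycan species with its relative abundance (hypothetical record)."""
    name: str
    core_fucose: bool      # carries alpha(1,6) core fucose
    alpha_gal: bool        # carries alpha(1,3)-linked terminal galactose
    abundance_pct: float   # relative abundance in percent


def derive_gsas(peaks, low_abundance_pct=5.0):
    """Reduce a released-glycan profile to a set of GSA labels.

    The cutoff for "low abundance" is an assumed parameter.
    """
    present = [p for p in peaks if p.abundance_pct > 0]
    gsas = set()
    if present and not any(p.core_fucose for p in present):
        gsas.add("absence_of_alpha_1_6_fucose")
    man8_pct = sum(p.abundance_pct for p in present if p.name == "Man8")
    if 0 < man8_pct <= low_abundance_pct:
        gsas.add("low_abundance_Man8")
    if any(p.alpha_gal for p in present):
        gsas.add("alpha_1_3_terminal_galactose")
    return gsas


# Made-up profile for illustration; not measured data.
profile = [
    GlycanPeak("G0", core_fucose=False, alpha_gal=False, abundance_pct=70.0),
    GlycanPeak("Man8", core_fucose=False, alpha_gal=False, abundance_pct=2.5),
    GlycanPeak("G2+aGal", core_fucose=False, alpha_gal=True, abundance_pct=27.5),
]
print(sorted(derive_gsas(profile)))
```

Representing GSAs as a set of labels keeps them independent of the analytical method that produced the underlying measurements, which is one plausible way to make them comparable across molecules.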

Based on the GSAs of the target biologic, as described above, the following structure-function recommendations are then obtained through mining of the attribute store. Notably, the GSAs of the target biologic that are identified in the structure-function recommendations listed below correspond to CQAs, and, as such, the structure-function recommendations described below could include an identification of the associated GSAs as CQAs.

• Removal of core fucose residue confirms elevated levels of antibody-dependent cellular cytotoxicity, as expected for the drug’s mechanism of action (Zhang and Rudd 2016). No further determination is required.

• High-mannose glycans can sometimes lead to a decrease of in vivo circulatory half-life. If antibody-dependent cellular cytotoxicity is part of the mode of action of the target biologic, then mannosylation may be a CQA affecting efficacy and dosing requirements (Reusch and Tejada 2015). A likely structure-function recommendation could include setting specifications for upcoming clinical studies (Goetze and Flynn 2010).

• The presence of glycan motifs such as α(1,3)-linked terminal galactose raises a warning flag, as it has posed a problem for the safety profile of previous drugs, for example cetuximab (Chung and Platts-Mills 2008). Severe anaphylactic reactions have been reported to occur rapidly after infusion, and most of these have been associated with IgE antibodies against galactose-α-1,3-galactose. The Fab portion of the cetuximab heavy chain is glycosylated at N88 with a range of sugars, including galactose-α-1,3-galactose and a sialic acid, N-glycolylneuraminic acid (NGNA). As cetuximab is produced in the mouse cell line Sp2/0, which expresses the gene for α-1,3-galactosyltransferase, bioprocess recommendations that include the following may also be determined: monitoring and optimizing upstream bioprocess conditions; design of purification and polishing downstream processing steps to selectively capture and remove glycans, for example with lectin-based purification. A structure-function recommendation may also include accurate glycosylation-profiling of the drug substance (e.g., a recommended analytical study). Another structure-function recommendation may also include a safety profile recommendation for monitoring patients for levels of pre-existing IgE antibodies against galactose-α-1,3-galactose that may be present through natural exposure to this oligosaccharide.
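The mapping from GSAs to recommendations described in the bullets above can be pictured as a lookup against an attribute store. The Python sketch below assumes a simple in-memory dictionary as the store; the `Recommendation` record, the `ATTRIBUTE_STORE` contents, the GSA label strings, and the `recommend` function are hypothetical names chosen for illustration, with recommendation text paraphrased from this example. An actual attribute store could be a database mined with statistical or machine learning tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    kind: str          # "structure-function", "bioprocess", "analytical", or "safety"
    text: str
    is_cqa: bool = False  # whether the associated GSA is flagged as a CQA


# Hypothetical in-memory attribute store keyed by GSA label.
ATTRIBUTE_STORE = {
    "absence_of_alpha_1_6_fucose": [
        Recommendation("structure-function",
                       "Elevated ADCC expected; no further determination required.",
                       is_cqa=True),
    ],
    "low_abundance_Man8": [
        Recommendation("structure-function",
                       "Possible decrease of in vivo half-life; set specifications "
                       "for upcoming clinical studies.",
                       is_cqa=True),
    ],
    "alpha_1_3_terminal_galactose": [
        Recommendation("bioprocess",
                       "Monitor and optimize upstream conditions; consider "
                       "lectin-based purification downstream."),
        Recommendation("analytical",
                       "Perform accurate glycosylation profiling of the drug substance."),
        Recommendation("safety",
                       "Monitor patients for pre-existing IgE against "
                       "galactose-alpha-1,3-galactose.",
                       is_cqa=True),
    ],
}


def recommend(gsas, store=ATTRIBUTE_STORE):
    """Collect recommendations for each GSA; unknown GSAs contribute nothing."""
    results = []
    for gsa in sorted(gsas):
        results.extend(store.get(gsa, []))
    return results


for rec in recommend({"alpha_1_3_terminal_galactose", "low_abundance_Man8"}):
    print(f"[{rec.kind}] {rec.text}")
```

Tagging each recommendation with a kind makes it straightforward to present structure-function and bioprocess recommendations separately, as in the workflow summary.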

FIG. 15 summarizes the workflow described in this example, illustrating the relation between structural features of the target biologic, GSAs of the target biologic, and associated structure-function parameters that correspond to predicted functional characteristics of the target biologic based on analysis of its GSAs. Notably, as described in this example and illustrated in the figure, the biologic development recommendations (gold boxes) include not only structure-function recommendations, but also bioprocess recommendations.

G. Computer System and Network Environment

FIG. 11 shows an example of a layered and modular software architecture that can be used to implement the techniques described in this disclosure. A foundational layer labeled “Predictive Model Platform” in the figure implements the general elements of an application layer (including user interface principles, data management back-end, basic query, search and rendering engines); the general elements of a data analysis layer (including statistical tools, machine learning tools, data mining tools); and the general elements of a persistence layer (including interfaces and/or APIs to internal and external databases and portals). One or several application-specific modules labeled “Predictive Module 1” to “Predictive Module N” can be plugged into or interfaced with the foundational layer according to a range of techniques. Alternatively, such application-specific modules can also be included as part of a unified code base, and revealed individually by a license-management scheme. The modules implement application-specific query and search terms and ontologies; application-specific data visualization techniques such as network diagrams, chord diagrams, collapsible hierarchical trees, Sankey diagrams, heatmaps, line plots, bar charts, box and violin plots or others; and access to application-specific data via databases, databanks and portals. Two illustrative application modules are shown in the figure, labeled “Stability and Formulation” and “Safety and Immunogenicity”.
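One way to picture application-specific modules that plug into a foundational layer and are revealed individually by a license-management scheme is a small registry, sketched below in Python. The class name `PredictiveModelPlatform`, its methods, the handler signature, and the licensing flag are assumptions made for illustration; the disclosure does not prescribe a particular plug-in mechanism.

```python
from typing import Callable, Dict, List

# A module handler takes a query dictionary and returns result strings (assumed shape).
Handler = Callable[[dict], List[str]]


class PredictiveModelPlatform:
    """Toy foundational layer that hosts pluggable predictive modules."""

    def __init__(self) -> None:
        self._modules: Dict[str, Handler] = {}
        self._licensed: set = set()

    def register(self, name: str, handler: Handler, licensed: bool = True) -> None:
        """Plug a module into the platform; unlicensed modules stay hidden."""
        self._modules[name] = handler
        if licensed:
            self._licensed.add(name)

    def available(self) -> List[str]:
        """Names of the modules revealed to the user."""
        return sorted(self._licensed)

    def query(self, name: str, query: dict) -> List[str]:
        """Dispatch a query to a registered and licensed module."""
        if name not in self._modules:
            raise KeyError(f"unknown module: {name}")
        if name not in self._licensed:
            raise PermissionError(f"module not licensed: {name}")
        return self._modules[name](query)


platform = PredictiveModelPlatform()
platform.register(
    "Stability and Formulation",
    lambda q: [f"stability results for {q['molecule_type']}"],
)
platform.register(
    "Safety and Immunogenicity",
    lambda q: [f"safety results for {q['molecule_type']}"],
    licensed=False,  # present in the code base but not revealed
)
print(platform.available())
```

Keeping every module in one registry while gating visibility by license corresponds to the unified code base alternative; loading handlers from separate packages would correspond to the plug-in alternative.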

As shown in FIG. 9, an implementation of a network environment 900 for use in providing systems and methods for a biologic development determination tool as described herein is shown and described. In brief overview, referring now to FIG. 9, a block diagram of an exemplary cloud computing environment 900 is shown and described. The cloud computing environment 900 may include one or more resource providers 902a, 902b, 902c (collectively, 902). Each resource provider 902 may include computing resources. In some implementations, computing resources may include any hardware and/or software used to process data. For example, computing resources may include hardware and/or software capable of executing algorithms, computer programs, and/or computer applications. In some implementations, exemplary computing resources may include application servers and/or databases with storage and retrieval capabilities. Each resource provider 902 may be connected to any other resource provider 902 in the cloud computing environment 900. In some implementations, the resource providers 902 may be connected over a computer network 908. Each resource provider 902 may be connected to one or more computing devices 904a, 904b, 904c (collectively, 904), over the computer network 908.

The cloud computing environment 900 may include a resource manager 906. The resource manager 906 may be connected to the resource providers 902 and the computing devices 904 over the computer network 908. In some implementations, the resource manager 906 may facilitate the provision of computing resources by one or more resource providers 902 to one or more computing devices 904. The resource manager 906 may receive a request for a computing resource from a particular computing device 904. The resource manager 906 may identify one or more resource providers 902 capable of providing the computing resource requested by the computing device 904. The resource manager 906 may select a resource provider 902 to provide the computing resource. The resource manager 906 may facilitate a connection between the resource provider 902 and a particular computing device 904. In some implementations, the resource manager 906 may establish a connection between a particular resource provider 902 and a particular computing device 904. In some implementations, the resource manager 906 may redirect a particular computing device 904 to a particular resource provider 902 with the requested computing resource.
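The request handling described for the resource manager 906 (receive a request, identify capable resource providers, select one, and facilitate a connection) can be sketched as follows. The class and method names, the string identifiers, and in particular the least-loaded selection policy are assumptions for illustration; the disclosure does not specify how a provider is selected.

```python
from dataclasses import dataclass


@dataclass
class ResourceProvider:
    name: str
    resources: set              # kinds of computing resources offered
    active_connections: int = 0


class ResourceManager:
    """Toy resource manager that matches device requests to providers."""

    def __init__(self, providers):
        self.providers = list(providers)

    def capable(self, resource):
        """Identify providers able to supply the requested resource."""
        return [p for p in self.providers if resource in p.resources]

    def connect(self, device_id, resource):
        """Select a provider and return (device_id, provider name), or None."""
        candidates = self.capable(resource)
        if not candidates:
            return None
        # Assumed policy: pick the provider with the fewest active connections.
        chosen = min(candidates, key=lambda p: p.active_connections)
        chosen.active_connections += 1
        return (device_id, chosen.name)


manager = ResourceManager([
    ResourceProvider("902a", {"database"}),
    ResourceProvider("902b", {"database", "application_server"}),
    ResourceProvider("902c", {"application_server"}),
])
print(manager.connect("904a", "database"))
print(manager.connect("904b", "database"))
```

Returning the pairing rather than opening a socket keeps the sketch neutral between the two variants described above, in which the manager either establishes the connection itself or redirects the device to the provider.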

FIG. 10 shows an example of a computing device 1000 and a mobile computing device 1050 that can be used to implement the techniques described in this disclosure. The computing device 1000 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The mobile computing device 1050 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart-phones, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be examples only, and are not meant to be limiting. The computing device 1000 includes a processor 1002, a memory 1004, a storage device 1006, a high-speed interface 1008 connecting to the memory 1004 and multiple high-speed expansion ports 1010, and a low-speed interface 1012 connecting to a low-speed expansion port 1014 and the storage device 1006. Each of the processor 1002, the memory 1004, the storage device 1006, the high-speed interface 1008, the high-speed expansion ports 1010, and the low-speed interface 1012, are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 1002 can process instructions for execution within the computing device 1000, including instructions stored in the memory 1004 or on the storage device 1006 to display graphical information for a GUI on an external input/output device, such as a display 1016 coupled to the high-speed interface 1008. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory.

Also, multiple computing devices may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system). Thus, as the term is used herein, where a plurality of functions are described as being performed by “a processor”, this encompasses embodiments wherein the plurality of functions are performed by any number of processors (one or more) of any number of computing devices (one or more). Furthermore, where a function is described as being performed by “a processor”, this encompasses embodiments wherein the function is performed by any number of processors (one or more) of any number of computing devices (one or more) (e.g., in a distributed computing system).

The memory 1004 stores information within the computing device 1000. In some implementations, the memory 1004 is a volatile memory unit or units. In some implementations, the memory 1004 is a non-volatile memory unit or units. The memory 1004 may also be another form of computer-readable medium, such as a magnetic or optical disk.

The storage device 1006 is capable of providing mass storage for the computing device 1000. In some implementations, the storage device 1006 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations.

Instructions can be stored in an information carrier. The instructions, when executed by one or more processing devices (for example, processor 1002), perform one or more methods, such as those described above. The instructions can also be stored by one or more storage devices such as computer- or machine-readable mediums (for example, the memory 1004, the storage device 1006, or memory on the processor 1002).

The high-speed interface 1008 manages bandwidth-intensive operations for the computing device 1000, while the low-speed interface 1012 manages lower bandwidth-intensive operations. Such allocation of functions is an example only. In some implementations, the high-speed interface 1008 is coupled to the memory 1004, the display 1016 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 1010, which may accept various expansion cards (not shown). In the implementation, the low-speed interface 1012 is coupled to the storage device 1006 and the low-speed expansion port 1014. The low-speed expansion port 1014, which may include various communication ports (e.g., USB, Bluetooth®, Ethernet, wireless Ethernet) may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.

The computing device 1000 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 1020, or multiple times in a group of such servers. In addition, it may be implemented in a personal computer such as a laptop computer 1022. It may also be implemented as part of a rack server system 1024. Alternatively, components from the computing device 1000 may be combined with other components in a mobile device (not shown), such as a mobile computing device 1050. Each of such devices may contain one or more of the computing device 1000 and the mobile computing device 1050, and an entire system may be made up of multiple computing devices communicating with each other.

The mobile computing device 1050 includes a processor 1052, a memory 1064, an input/output device such as a display 1054, a communication interface 1066, and a transceiver 1068, among other components. The mobile computing device 1050 may also be provided with a storage device, such as a micro-drive or other device, to provide additional storage. Each of the processor 1052, the memory 1064, the display 1054, the communication interface 1066, and the transceiver 1068, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.

The processor 1052 can execute instructions within the mobile computing device 1050, including instructions stored in the memory 1064. The processor 1052 may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor 1052 may provide, for example, for coordination of the other components of the mobile computing device 1050, such as control of user interfaces, applications run by the mobile computing device 1050, and wireless communication by the mobile computing device 1050.

The processor 1052 may communicate with a user through a control interface 1058 and a display interface 1056 coupled to the display 1054. The display 1054 may be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 1056 may comprise appropriate circuitry for driving the display 1054 to present graphical and other information to a user. The control interface 1058 may receive commands from a user and convert them for submission to the processor 1052. In addition, an external interface 1062 may provide communication with the processor 1052, so as to enable near area communication of the mobile computing device 1050 with other devices. The external interface 1062 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.

The memory 1064 stores information within the mobile computing device 1050. The memory 1064 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. An expansion memory 1074 may also be provided and connected to the mobile computing device 1050 through an expansion interface 1072, which may include, for example, a SIMM (Single In Line Memory Module) card interface. The expansion memory 1074 may provide extra storage space for the mobile computing device 1050, or may also store applications or other information for the mobile computing device 1050. Specifically, the expansion memory 1074 may include instructions to carry out or supplement the processes described above, and may include secure information also. Thus, for example, the expansion memory 1074 may be provided as a security module for the mobile computing device 1050, and may be programmed with instructions that permit secure use of the mobile computing device 1050.

In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner. The memory may include, for example, flash memory and/or NVRAM memory (non-volatile random access memory), as discussed below. In some implementations, instructions are stored in an information carrier. The instructions, when executed by one or more processing devices (for example, processor 1052), perform one or more methods, such as those described above. The instructions can also be stored by one or more storage devices, such as one or more computer- or machine-readable mediums (for example, the memory 1064, the expansion memory 1074, or memory on the processor 1052). In some implementations, the instructions can be received in a propagated signal, for example, over the transceiver 1068 or the external interface 1062.

The mobile computing device 1050 may communicate wirelessly through the communication interface 1066, which may include digital signal processing circuitry where necessary. The communication interface 1066 may provide for communications under various modes or protocols, such as GSM voice calls (Global System for Mobile communications), SMS (Short Message Service), EMS (Enhanced Messaging Service), or MMS messaging (Multimedia Messaging Service), CDMA (code division multiple access), TDMA (time division multiple access), PDC (Personal Digital Cellular), WCDMA (Wideband Code Division Multiple Access), CDMA2000, or GPRS (General Packet Radio Service), among others. Such communication may occur, for example, through the transceiver 1068 using a radio-frequency. In addition, short-range communication may occur, such as using a Bluetooth®, Wi-Fi™, or other such transceiver (not shown). In addition, a GPS (Global Positioning System) receiver module 1070 may provide additional navigation- and location-related wireless data to the mobile computing device 1050, which may be used as appropriate by applications running on the mobile computing device 1050.

The mobile computing device 1050 may also communicate audibly using an audio codec 1060, which may receive spoken information from a user and convert it to usable digital information. The audio codec 1060 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of the mobile computing device 1050. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on the mobile computing device 1050.

The mobile computing device 1050 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 1080. It may also be implemented as part of a smart-phone 1082, personal digital assistant, or other similar mobile device.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms machine-readable medium and computer-readable medium refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

In some implementations, the modules (e.g. attribute preprocessing module, machine learning module) described herein can be separated, combined or incorporated into single or combined modules. The modules depicted in the figures are not intended to limit the systems described herein to the software architectures shown therein.

Elements of different implementations described herein may be combined to form other implementations not specifically set forth above. Elements may be left out of the processes, computer programs, databases, etc. described herein without adversely affecting their operation. In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Various separate elements may be combined into one or more individual elements to perform the functions described herein. In view of the structure, functions and apparatus of the systems and methods described here, in some implementations.

Throughout the description, where apparatus and systems are described as having, including, or comprising specific components, or where processes and methods are described as having, including, or comprising specific steps, it is contemplated that, additionally, there are apparatus, and systems of the present invention that consist essentially of, or consist of, the recited components, and that there are processes and methods according to the present invention that consist essentially of, or consist of, the recited processing steps.

It should be understood that the order of steps or order for performing certain actions is immaterial so long as the invention remains operable. Moreover, two or more steps or actions may be conducted simultaneously.

While the invention has been particularly shown and described with reference to specific preferred embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.