


Title:
GLYCOSYLATION ENGINEERING
Document Type and Number:
WIPO Patent Application WO/2023/205642
Kind Code:
A2
Abstract:
Disclosed herein are methods and systems for engineering glycosylation. The methods and systems may use structure or sequence information of biomolecules to predict glycosylation features. The methods and systems may employ one or more trained algorithms described herein.

Inventors:
LEWIS NATHAN (US)
KELLMAN BENJAMIN (US)
SANDOVAL DANIEL (US)
NACHMANSON DANIELA (US)
CHIANG WAN-TIEN (US)
Application Number:
PCT/US2023/065895
Publication Date:
October 26, 2023
Filing Date:
April 18, 2023
Assignee:
UNIV CALIFORNIA (US)
International Classes:
G16B20/00; C07K9/00
Attorney, Agent or Firm:
WARREN, William, L. et al. (US)
Claims:
CLAIMS

WHAT IS CLAIMED IS:

1. A method of modifying a reference glycopeptide to alter a glycan substructure of a glycosite of the reference glycopeptide to produce a modified glycopeptide, the method comprising: calculating whether there is a positive or negative IMR association between one or more amino acid substitutions of a protein feature proximal to the glycosite and the glycan substructure, and generating the modified glycopeptide having the one or more amino acid substitutions if a magnitude of the IMR association is at least a threshold value.

2. The method of claim 1, wherein the threshold value is about 50%, 60%, 70%, 80%, 90%, or higher.

3. The method of claim 1, wherein the IMR is a generalized estimating equation (GEE) IMR.

4. The method of claim 1, wherein the IMR is a Fisher’s exact test IMR.

5. The method of claim 3 or claim 4, wherein the IMR is significant if it has a false discovery rate (FDR) correction less than about 0.1.

6. The method of any one of claims 3-5, wherein the IMR is significant if it has a p-value less than about 0.05.

7. The method of any one of claims 1-6, wherein the IMR comprises a logarithm of an odds ratio (logOR) with a magnitude greater than about 1.

8. The method of any one of claims 1-6, wherein the IMR comprises a logOR with a magnitude greater than about 0.5.

9. The method of any one of claims 1-6, wherein the IMR comprises a logOR with a magnitude greater than about 0.1.

10. The method of claim 1 or claim 2, wherein the IMR association is determined using a matrix describing the expected glycoimpact of the one or more amino acid substitutions.

11. The method of any one of claims 1-3, wherein the IMR association is determined at least based on the identity of one or more amino acids.

12. The method of any one of claims 1-3, wherein the IMR association is determined at least based on the proximity of the one or more amino acids to the glycosite.

13. The method of claim 12, wherein the proximity is the distance from the glycosite as measured in angstroms.

14. The method of claim 13, wherein the proximity is less than or equal to about 6 angstroms to about 25 angstroms.

15. The method of claim 12, wherein the proximity is the number of amino acids between each of the one or more amino acids and the glycosite.

16. The method of claim 15, wherein the distance is about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 amino acids.

17. A method of modifying a reference glycopeptide to alter a glycan substructure of a glycosite of the reference glycopeptide to produce a modified glycopeptide, the method comprising: substituting one or more amino acids of a protein feature proximal to the glycosite to generate the modified glycopeptide.

18. The method of any one of claims 1-17, wherein the protein feature proximal to the glycosite comprises a structural feature.

19. The method of claim 18, wherein the structural feature is less than or equal to about 6 angstroms to about 25 angstroms from the glycosite.

20. The method of claim 18 or claim 19, wherein the structural feature is a secondary structure comprising a beta strand, alpha helix, extended strand, beta-bridge, turn, or bend, or a combination of two or more thereof.

21. The method of any one of claims 1-17, wherein the protein feature proximal to the glycosite comprises an amino acid within about 6 amino acids of the glycosite in the N- or C- terminal direction.

22. The method of any one of claims 1-21, wherein the glycan substructure is selected from Table 2 or Table 3.

23. The method of any one of claims 1-22, employing a computational approach.

24. The method of any one of claims 1-23, wherein the structure of the reference glycopeptide is or has been determined using X-ray crystallography, homology modeling, and/or de novo prediction based on primary amino acid sequence.

25. The method of any one of claims 1-24, further comprising administering a therapeutically effective amount of the modified glycopeptide to a subject in need thereof based at least in part on the altered glycan substructure of the modified glycopeptide.

26. A modified glycopeptide having a first glycan substructure that is different from a reference glycan substructure of a glycosite of a reference glycoprotein, wherein the modified glycopeptide has one or more amino acid substitutions of a protein feature proximal to the glycosite as compared to the reference glycoprotein.

27. The modified glycopeptide of claim 26, wherein the protein feature proximal to the glycosite comprises a structural feature.

28. The modified glycopeptide of claim 27, wherein the structural feature is less than or equal to about 6 angstroms to about 15 angstroms from the glycosite.

29. The modified glycopeptide of claim 27 or claim 28, wherein the structural feature is a secondary structure comprising a beta strand, alpha helix, extended strand, beta-bridge, turn, or bend, or a combination of two or more thereof.

30. The modified glycopeptide of claim 26, wherein the protein feature proximal to the glycosite comprises an amino acid within about 6 amino acids of the glycosite in the N- or C-terminal direction.

31. The modified glycopeptide of any one of claims 26-30, wherein the glycan substructure is selected from Table 2 or Table 3.

32. The modified glycopeptide of any one of claims 26-31, wherein the protein feature is selected from Table 2 or Table 3.

33. A method comprising administering a therapeutically effective amount of the modified glycopeptide of any one of claims 26-32 to a subject in need thereof based at least in part on the first glycan substructure of the modified glycopeptide.

34. A modified glycopeptide having an increase, decrease, or change in a glycan structure at a glycosite of the modified glycopeptide as compared to a reference glycopeptide, as determined based on the associations of Table 2 and/or Table 3 (e.g., wherein the modified glycoprotein has a Phe within 5 amino acids upstream of the glycosite, and the reference glycopeptide does not have a Phe within 5 amino acids upstream of the glycosite of the reference glycopeptide).

35. A method comprising administering a therapeutically effective amount of the modified glycopeptide of claim 34 to a subject in need thereof based at least in part on the increase, decrease, or change any of the glycan features selected from Table 1.

36. A method for determining the effect of a variation of a reference sequence on glycosylation of a first glycosite in the reference sequence, wherein the reference sequence comprises the first glycosite and a second glycosite, the method comprising:

(a) providing a plurality of sequences comprising (1) the reference sequence and (2) a plurality of variant sequences each having a different glycosylation feature at the second glycosite as compared to the reference sequence; and

(b) for each of the plurality of variant sequences: applying a trained algorithm to calculate the predicted presence of a glycosylation feature at the first glycosite based at least on the identity of the glycosylation feature at the second glycosite; thereby determining the effect of the variation of the reference sequence on glycosylation of the first glycosite.

37. A method for determining the effect of a variation of the structure of a reference sequence on glycosylation of a glycosite in the reference sequence, the method comprising:

(a) providing a plurality of sequences comprising (1) the reference sequence and (2) a plurality of variant sequences having one or more amino acid substitution as compared to the reference sequence; and

(b) for each of the plurality of variant sequences: applying a trained algorithm to calculate the predicted presence of a glycosylation feature at the glycosite of each variant sequence based at least on the structure of the variant sequence; thereby determining the effect of the variation of the reference sequence structure on glycosylation of the glycosite.

38. The method of claim 37, wherein the structure is secondary structure, tertiary structure, or quaternary structure, or a combination of two or more thereof.

39. The method or system of any previous claim, wherein the sequence is a viral sequence.

40. A method for determining the likelihood that one or more glycans from a plurality of candidate glycans will be found at a glycosite of a viral sequence, the method comprising:

(a) providing the viral sequence and the plurality of candidate glycans, observed glycans, desired glycans, undesired glycans;

(b) for each of the plurality of candidate, observed, desired, or undesired glycans at each glycosite: applying a trained algorithm to calculate a predicted presence for each glycan at the glycosite of the sequence; and

(c) computer processing the predicted presence for each of the plurality of candidate, observed, desired, or undesired glycans to determine the likelihood that the one or more glycans will be found at the glycosite of the sequence.

41. A method of determining a likelihood of a disease or disorder associated with a glycoprotein in an individual, the method comprising: calculating a first IMR association between a glycosite of the glycoprotein and a glycosylation feature; calculating a second IMR association between the glycosylation feature and a glycosite of a modified glycoprotein, wherein the modified glycoprotein comprises one or more amino acid substitutions relative to the glycoprotein; and determining said likelihood based on a difference between said first IMR and said second IMR.

42. A method for determining an IMR association between a glycosylation feature and one or more candidate glycoconjugates, the method comprising: (a) applying a trained algorithm to one or more candidate glycans to calculate a predicted presence of the glycosylation feature at a glycosite of at least a subset of the one or more candidate glycoconjugates; and

(b) estimating a likelihood of the glycosylation feature at the glycosite of the at least a subset of the one or more candidate glycoconjugates.

43. The method of claim 42, further comprising: synthesizing a glycoconjugate if the likelihood is above a threshold.

44. The method of claim 42, further comprising predicting a pathogenicity of a mutation based on the likelihood calculated in (b).

45. The method of claim 42, comprising administering to an individual a gene therapy vector based on said likelihood calculated in (b).

46. The method of claim 42, wherein the at least a subset of the one or more glycoconjugates comprises a protein, peptide, polynucleotide, lipid, sugar, small molecule, or part thereof.

47. The method of claim 42 wherein the at least a subset of the one or more glycoconjugates comprises a surface protein of a cell.

48. A method for determining the importance of a glycosite, comprising

(a) providing, in computer memory, one or more datasets comprising coevolution or conservation data associated with the glycosite;

(b) identifying one or more features of the glycosite; and

(c) calculating, with at least one computer processor, an importance of the glycosite based at least in part on the one or more datasets and the one or more features.

Description:
GLYCOSYLATION ENGINEERING

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to U.S. Provisional Patent Application No. 63/332,385 filed on April 19, 2022, the entire contents of which are incorporated herein by reference.

STATEMENT AS TO FEDERALLY SPONSORED RESEARCH

[0002] This invention was made with government support under Contract number GM119850 awarded by the National Institutes of Health. The government has certain rights in the invention.

BACKGROUND OF THE INVENTION

[0003] Glycosylation is the reaction in which a carbohydrate (or 'glycan'), i.e. a glycosyl donor, is attached to a hydroxyl or other functional group of another molecule (a glycosyl acceptor) in order to form a glycoconjugate. Such glycoconjugates serve various chemical and biological functions. Unlike biosynthesis of other biological molecules, glycan synthesis and glycosylation have resisted characterization as templated processes which has led to poor understanding of the structural and sequence factors that impact them.

SUMMARY OF THE INVENTION

[0004] Recognized herein is a need for systems and methods that lead to a predictive understanding of glycosylation and glycan biosynthesis. Using curated datasets of glycosylation features associated with protein structures and sequences, associations between these biomolecular features and glycosylation features may be determined. In some embodiments, such associations provide insights into the principles underlying glycan biosynthesis as well as contribute to methods and systems for engineering novel glycosylated biomolecules.

[0005] Another aspect of the present disclosure provides a non-transitory computer readable medium comprising machine executable code that, upon execution by one or more computer processors, implements any of the methods above or elsewhere herein.

[0006] Another aspect of the present disclosure provides a system comprising one or more computer processors and computer memory coupled thereto. The computer memory comprises machine executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.

[0007] Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. The present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

INCORPORATION BY REFERENCE

[0008] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings of which:

[0010] FIG. 1 illustrates discovered mechanisms of glycan biosynthesis considered in comparison to previously known templated biosynthesis of DNA, RNA, and protein.

[0011] FIG. 2A shows feature weights from a Factor Analysis with Mixed Data (FAMD) of annotated sites. FIG. 2B shows a two-dimensional projection of the FAMD using a Uniform Manifold Approximation and Projection (UMAP). Each point is a site on a glycoprotein; each color indicates the source of that protein. FIG. 2C shows an FDR distribution of p-values from a multivariate Gaussian trained on the UMAP 2D projection.

[0012] FIG. 3A illustrates a specification of the event space to each observed glycosylation event at a glycosylation site. FIG. 3B shows a volcano plot of the log odds ratio and False Discovery Rate (FDR) adjusted p-values from a Fisher exact test between glycan and protein structure occurrence. Proximal AAs for the top 20 most significant relations are shown. FIG. 3C shows a Kullback-Leibler divergence when either glycan structures, G, or protein structures, P, are specified as present, 1 , or absent, 0. FIG. 3D shows probabilities of glycan structures when protein structures were known or “fixed.” FIG. 3E shows protein structure probabilities conditioned on fixed glycan structures. FIG. 3F illustrates non-independence, absolute difference between conditional and marginal probabilities (Pr(A|B)-Pr(A)), stratified by glycan motif size for protein-glycan relations when glycan structures are fixed and when protein structures are fixed.
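As an illustration of the statistics named for FIGs. 3B and 3F, the following is a minimal sketch, not the disclosed implementation. It assumes a 2x2 table whose rows are "protein feature present/absent" and whose columns are "glycan motif present/absent"; the function names, the 0.5 pseudocount, and the choice of the Benjamini-Hochberg procedure for the FDR adjustment are assumptions introduced here.

```python
import math

def log_odds_ratio(a, b, c, d, pseudocount=0.5):
    """Log odds ratio of the table [[a, b], [c, d]].
    The pseudocount (an assumption) avoids division by zero."""
    return math.log(((a + pseudocount) * (d + pseudocount)) /
                    ((b + pseudocount) * (c + pseudocount)))

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value: total probability of all tables
    with the same margins that are no more likely than the observed one."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def benjamini_hochberg(pvalues):
    """FDR-adjusted p-values (Benjamini-Hochberg), returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def non_independence(a, b, c, d):
    """|Pr(glycan | protein feature) - Pr(glycan)| for the same table."""
    conditional = a / (a + b)
    marginal = (a + c) / (a + b + c + d)
    return abs(conditional - marginal)
```

Each protein-feature/glycan-motif pair would contribute one table; the p-values from all pairs would then be adjusted together before plotting log odds ratio against adjusted p-value.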

[0013] FIGs. 4A-4C show distributions of high-confidence/high-certainty (OR-derived probability is effectively 1 or 0; within 0.001) amino acid-glycan IMRs. Each plot is split indicating when the protein structure is fixed present (T, left) or absent (F, right). FIG. 4A shows distribution of high-confidence (grey) and other (black) aa-glycan IMRs out of 60,000 IMRs. FIG. 4B shows a number of unique glycan substructures involved in high-confidence aa-glycan IMRs. FIG. 4C shows high-certainty aa-glycan IMRs by amino-acid (x-axis) proximity type and probability: close to zero (top) or close to one (bottom).

[0014] FIG. 5A shows the number of significant (FDR<0.1, |log(OR)|>0.1) IMRs relating to structurally proximal AAs (N+6Å), sequence proximal AAs C-terminal (N+5), N-terminal (N-5), or either direction (N+/-5), predicted secondary structure from sequence (SSpro8) and structure (DSSP): alpha-helix (ss.H), extended strand (ss.E), beta-bridge (ss.B), turn (ss.T), bend (ss.S), other (ss.C). FIG. 5B shows the odds ratios (x-axis) and FDR (y-axis) for IMRs relating glycan motifs to structurally (DSSP) estimated turns (T). FIGs. 5C and 5D show IMRs (FDR<0.1, |log(OR)|>0.1) relating structurally proximal amino acids to motifs stratified by the number of sialic acids (FIG. 5C) and 4-sulfated GalNAc (FIG. 5D). FIG. 5E shows the Spearman correlation between the monosaccharide count of glycan substructures from protein-structure features; protein structure features with an average absolute correlation > 0.2 were retained. FIGs. 5F and 5G show IMRs (FDR<0.1, |log(OR)|>0.1) compared across two sequence-proximal (FIG. 5F) and structure-proximal (FIG. 5G) amino acids, phenylalanine (F) and tryptophan (W). The direct comparison of proximal-amino acid effects visualizes the expected change in glycosylation associated with that substitution; this expected change is the basic concept underlying “glycoimpact” (the expected impact on glycosylation of a protein-structure change and/or amino-acid substitution). FIG. 5H illustrates a network depicting the predicted magnitude of glycoimpact (edge color) of structure-proximal (within 5 Å) substitutions for low impact (Blosum64) substitutions. Predicted glycoimpact is calculated as Euclidean distance between the significant log(OR) for all glycomotifs associated with a protein structure. FIG. 5H shows glycoimpact predicted from BLAMO 0.5:0.1 (|log(OR)| > 0.5, FDR<0.1). The present disclosure uses glycoimpact from BLAMO 0.5:0.1 unless otherwise specified.
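The glycoimpact calculation described for FIG. 5H (Euclidean distance between significant log(OR) values) can be sketched as follows. This is a hedged illustration only: the data layout (a mapping from glycan motif to a `(log_or, fdr)` pair per amino acid), the function names, the motif labels, and the choice to treat non-significant or missing IMRs as a log(OR) of 0 are all assumptions made here, not details taken from the disclosure.

```python
import math

def significant_log_ors(imrs, min_abs_log_or=0.5, max_fdr=0.1):
    """Keep only IMRs passing the thresholds (defaults mirror BLAMO 0.5:0.1)."""
    return {motif: log_or
            for motif, (log_or, fdr) in imrs.items()
            if abs(log_or) > min_abs_log_or and fdr < max_fdr}

def glycoimpact(imrs_a, imrs_b, min_abs_log_or=0.5, max_fdr=0.1):
    """Euclidean distance between the significant log(OR) profiles of two
    amino acids. Motifs without a significant IMR count as 0 (assumption)."""
    sig_a = significant_log_ors(imrs_a, min_abs_log_or, max_fdr)
    sig_b = significant_log_ors(imrs_b, min_abs_log_or, max_fdr)
    motifs = set(sig_a) | set(sig_b)
    return math.sqrt(sum((sig_a.get(m, 0.0) - sig_b.get(m, 0.0)) ** 2
                         for m in motifs))

# Hypothetical IMR profiles for two proximal amino acids: motif -> (log(OR), FDR).
phe = {"motif_1": (1.5, 0.01), "motif_2": (-0.8, 0.05), "motif_3": (0.3, 0.01)}
trp = {"motif_1": (-1.5, 0.02), "motif_2": (-0.8, 0.20)}

phe_to_trp = glycoimpact(phe, trp)
```

Changing `min_abs_log_or` to 0.1 or 1 would give the analogues of the other thresholds mentioned in this disclosure (BLAMO 0.1:0.1 and BLAMO 1:0.1).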

[0015] FIG. 6 shows for additional thresholds, raw scores, and sequence-proximal substitution predictions. BLAMO 0.1:0.1, BLAMO 0.5:0.1, and BLAMO 1:0.1 from left to right.

[0016] FIG. 7A shows a comparison of the error between the PAM and BLOSUM substitution matrices and the glycoimpact for corresponding substitutions. Linear regressions are split into two relevant ranges: null glycoimpact (<2.5) and impactful (>2.5). Glycoimpact scores from BLAMO 0.5:0.1 were used, i.e., those computed from strong IMRs (|log(OR)| > 0.5, FDR<0.1). Error (y-axis) was calculated as the root mean square error (RMSE) between PAM and BLOSUM scores. Multiple versions of the PAM (PAM30-250) and BLOSUM (BLOSUM45-100) were examined; all pairings are shown in FIG. 8. FIG. 7B illustrates null and impactful glycoimpact (BLAMO 0.5:0.1) stratified by pathogenicity in ClinVar of mutations within 20 Å of an N-glycosylation site. FIG. 7C shows a hierarchical biclustered heatmap (average-linkage with Euclidean distance) of Spearman correlations between glycoimpact (BLAMO 0.5:0.1) and error between various pathogenicity predictions. Prediction-type and protein structure indicate the training data used to build various pathogenicity prediction tools. FIG. 7D shows the minimum distance from all residues within human PrP to the N197 or N181 glycosylation sites. Residues are stratified by all sites (All) and causative mutations of prion disease including Creutzfeldt-Jakob disease (CJD) and Gerstmann-Straussler disease (GSD). Significance was determined by a one-sided Wilcoxon test. For site-proximity to each glycosylation site, see FIG. 9. FIG. 7E illustrates the number of high-ranking (p<L/5 to p<L/3) evolutionary coupling events between glycosites (GN), asparagines (N), or any residue (AA) with all other residues in each of 2,005 alignments. FIG. 7F shows evolutionary coupling (EC) probability between serine or threonine with a glycosite (GN), any asparagine (N), or any amino acid (AA). Serines and threonines considered appear two residues C-terminal to the GN, N or X (N+2). The upper right subpanel of FIG. 7G shows the proportion of high-ranking ECs (Rank<L) at N+/-i relative to a glycosylation site (GN, highlighted by pointer), asparagine (N) or any amino acid (X). The significance of EC-enrichment with glycosites was measured with a hypergeometric enrichment comparing the proportion of high-ranking ECs with GN vs those with either N or X at the same relative position (black lines). The central panel of FIG. 7G shows an aggregation of all residue-glycosite enrichments at N+/-10. The proportion of high-ranking ECs for each amino acid (rows) at the column-specified relative position was compared with GN, N, and X. An opaque square indicates that for that residue (row) in that position (column), high-EC proportion is higher with GN than N. An opaque triangle indicates high-EC proportion is higher with GN than any amino acid (AA). A transparent triangle or square indicates GN was not significantly more coupled. Significance was assessed at multiple EC-rank thresholds between L/3 and 3L and p-values were pooled using Fisher’s method; FDR was used to correct for multiple testing. FIG. 7H illustrates a hierarchical clustering of coupling-masked (Rank<4L) amino acids surrounding (+/-6aa) a glycosite. Each of 5 clusters was summarized as a motif. Height is the log of cumulative reciprocal EC-Rank with a pseudo-count of 0.25. The asparagine at the center was fixed at a height of 2 for context. Residues are colored by chemical properties; this analysis is repeated with 25 clusters in FIG. 14. FIG. 7I shows Glycosite Alignments corresponding to GlyConnect-documented tetra-antennary structures (Hex:7 HexNAc:6) with no sialic acid or fucose. Abbreviated alignments are shown above the brown middle-most “conservation” track. Below the conservation track, the first (top line) and second (bottom line) most popular amino acids are displayed for each position N+/-30. The full alignment can be found in FIG. 15. Consensus amino acids consistent with other analyses are highlighted in bold and marked with a “+” indicating glycosite-coupling (FIG. 7G) or indicating IMR-associated (FIG. 5A). FIG. 7I discloses SEQ ID NOS 1-6, respectively, in order of appearance.
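Two of the tests named for FIG. 7G, the hypergeometric enrichment and the pooling of p-values by Fisher's method, can be sketched with the standard library alone. The mapping of counts to arguments (all residue pairs at a relative position as the population, high-ranking ECs as successes, glycosite pairs as draws) is an assumption made for this sketch, as are the function names.

```python
import math

def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` items without replacement from
    `population` items of which `successes` are high-ranking ECs."""
    upper = min(draws, successes)
    total = math.comb(population, draws)
    return sum(math.comb(successes, x) * math.comb(population - successes, draws - x)
               for x in range(observed, upper + 1)) / total

def fisher_method(pvalues):
    """Pool independent p-values: X = -2 * sum(ln p) follows a chi-square
    distribution with 2k degrees of freedom, whose survival function has
    the closed form exp(-X/2) * sum_{j<k} (X/2)^j / j!."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2
    tail = math.exp(-half_stat) * sum(half_stat ** j / math.factorial(j)
                                      for j in range(k))
    return min(1.0, tail)
```

Under these assumptions, one enrichment p-value would be computed per EC-rank threshold, the per-threshold values pooled with `fisher_method`, and the pooled values then corrected for multiple testing.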

[0017] FIG. 8 shows the root mean square error between PAM and BLOSUM substitution scores correlated with predicted glycoimpact.

[0018] FIG. 9 shows the minimum distance between amino acids within human PrP for disease and library mutants (FIG. 9A and 9B) and low expression mutants (FIG. 9C and 9D).

[0019] FIG. 10 shows the correlation between glycoimpact and pathogenicity score pairwise errors when pathogenicity scores are shuffled within score.

[0020] FIG. 11A shows GEE-calculated odds ratio (x-axis) and FDR-adjusted p-values defining IMRs from PGES-DB denoting relations between sequence (triangle & square) or structural protein features (circle & plus) and motifs containing >3 mannose (hybrid & high-mannose). FIG. 11B illustrates the range of observed complex glycan to high-mannose/hybrid glycans (N203/N3) at each site on the HIV envelope gp160 (BG505 SOSIP.664, PDB:4TVP). FIG. 11C shows distributions of low/high-complexity (N203/N3) stratified by proximal protein structure features. FIG. 11D shows GEE-learned IMRs relating to sequence-proximal (upstream/N-terminal) effects of Ile and Phe. FIG. 11E shows IgG allotypes, Phe299 (circle) and I299 (triangle), segregated by Principal Component Analysis of relative abundance across BALB/c and C57BL/6 mice. FIG. 11F shows Galactose (Gal) and Sialylation (Neu5Ac) relative abundance distributions for IgG1 Phe299 and Ile299 allotypes across BALB/c and C57BL/6 mice. FIG. 11G shows the mean proportion of mass-spectroscopy-observed peptide fragments with mass offsets corresponding to Complex, Oligomannose/Hybrid, or unoccupied glycosylation sites in the SARS-CoV-2 spike S1 across the original 2019 strain, and the Delta and Gamma variants.

[0021] FIGs. 12A-12F illustrate Protein-Glycan structure relations in HIV ENV gp160. Relationships between glycosylation feature (specifically the high-mannose/hybrid to complex ratio shown in FIGs. 11B-C; y-axis) and specific protein structure features are shown. Structural elements illustrated include the number of glycosite 3D-proximal Phe (e.g., within 6 Å of the glycosite) (struct_aa.F; FIG. 12A), number of glycosite 3D-proximal Leu (e.g., within 6 Å of the glycosite) (struct_aa.L; FIG. 12B), number of glycosite 3D-proximal Gly (e.g., within 6 Å of the glycosite) (struct_aa.G; FIG. 12C), number of downstream Cys (e.g., up to 6 amino acids) downstream of the glycosite (seq_aaDown.C; FIG. 12D), number of downstream Asn (e.g., up to 6 amino acids) downstream of the glycosite (seq_aaDown.N; FIG. 12E), and number of downstream Arg (e.g., within 6 Å) downstream of the glycosite (struct_aa.R; FIG. 12F). “Mann_v_Complex” indicates the Low/High-complexity ratio (N203/N3) shown in FIGs. 11B and 11C.
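Sequence-proximal features such as seq_aaDown.C are described above as counts of a given residue within a window of the glycosite. A minimal sketch of such a count follows; the function name, the zero-based site index, and the convention that "up" means N-terminal and "down" means C-terminal are assumptions for illustration. The example sequence is the uromodulin sequon quoted elsewhere in this disclosure (SEQ ID NO: 62).

```python
def count_proximal(sequence, site_index, residue, window=6, direction="down"):
    """Count `residue` within `window` positions of the glycosite.

    direction: "up" (N-terminal side), "down" (C-terminal side), or "all".
    The glycosite residue itself is never counted.
    """
    upstream = sequence[max(0, site_index - window):site_index]
    downstream = sequence[site_index + 1:site_index + 1 + window]
    if direction == "up":
        region = upstream
    elif direction == "down":
        region = downstream
    elif direction == "all":
        region = upstream + downstream
    else:
        raise ValueError("direction must be 'up', 'down', or 'all'")
    return region.count(residue)

sequon = "GTVLTRNETHATYS"        # SEQ ID NO: 62 as quoted in this disclosure
glycosite = sequon.index("N")    # zero-based index of the Asn in this sequon

thr_downstream = count_proximal(sequon, glycosite, "T", direction="down")
thr_upstream = count_proximal(sequon, glycosite, "T", direction="up")
```

Structure-proximal counts (for example struct_aa.F) would need three-dimensional coordinates and a distance cutoff instead of a sequence window, which this sketch does not attempt.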

[0022] FIG. 13 shows IgG3 N-glycosylation measurements in mice.

[0023] FIG. 14 shows EC-masked extended sequon clustering with motif logos describing high-ranking glycosite-coupled ECs at each position. Figure discloses SEQ ID NOS 1, 7-8, 2, 2-3, 9-19, 4, 20, 20, 19, 21-23, 16, 15, 17-18, 24, 23, 25-36, 34, 34, 34, 31, 24, 37, 15, 15, 37-46, 29, 47-49, 15, 37, 9, 3, 50-53, 52-60, 60-61, 9, 3, 55, 54, and 4-6, respectively, in order of appearance.

[0024] FIG. 15 illustrates glycosite alignment for tetra-antennary structures with no sialic acid or fucose from Glycositealign at GlyConnect.

[0025] FIGs. 16A-16B show Fisher Odds Ratios (OR) estimated IMRs indicating the preference for afucosylation (square) or core-fucosylation (circle) across multiple antibody allotypes. IMRs above y=x (dotted line) are correlated with the mutant allotype and/or anticorrelated with the wild-type. Each substitution is written relative to P0DOX5 with an N297 glycosite. Plots show IMRs relating to (FIG. 16A) structure-specific IMRs, within 5 Å, and (FIG. 16B) sequence-specific IMRs, within 5 amino acids up or down stream.

[0026] FIG. 17A illustrates model architectures to analyze glycan sequences. FIG. 17B illustrates the full model architecture of a trained algorithm as described herein.

[0027] FIG. 18A shows the dependence of glycan feature prediction performance on occurrence. Using a trained algorithm as described herein, the averaged prediction performance of GlyCompare features against their counts in a dataset was plotted. FIG. 18B illustrates GlyCompare feature accuracy distribution. A histogram of the prediction performance for all observed GlyCompare features is shown. FIG. 18C shows a t-SNE visualizing the glycan representation learned by InSaNNe for all GlyCompare features. Each feature was shaded by its averaged prediction performance to identify structurally related clusters of glycan features that are more difficult to predict for InSaNNe. Ovals highlight clusters of difficult-to-predict GlyCompare features. FIG. 18D shows a visualization of the prediction performance depending on the sequon. For all sequons in the dataset, prediction performance was averaged over all glycans and sequons were colored by prediction performance to observe whether any sequon clusters would be difficult to predict for InSaNNe. Clusters were labeled as to whether they contained sequons with N-linked or O-linked glycans. FIG. 18E illustrates a comparison of experimentally observed and predicted glycans at a glycosylation site of human uromodulin. The sequon GTVLTRNETHATYS (SEQ ID NO: 62) was used to predict permissible glycans using the trained InSaNNe model. The top 100 predicted glycans were analyzed as to their characteristics and examples are shown. FIG. 18F illustrates predicting glycans for an HIV-1 Env sequon. Using the sequon PVQINCTRPN (SEQ ID NO: 63) as input for the trained algorithm, a t-SNE of the glycan representations learned by the trained algorithm was colored by the predicted probability of occurring on that sequon to identify structurally related glycans that were predicted to be present. FIG. 18G shows a comparison of predicted glycans at sequons of the HIV-1 Env protein. For three sequons, a t-SNE of the glycan representations learned by the trained algorithm was colored by the predicted probability by the trained algorithm and indicated structural glycan features common to predicted glycans. Figure 18G discloses SEQ ID NOS 65-67, respectively, in order of appearance.

[0028] FIG. 19 illustrates effects of amino acid substitutions on predicted glycosylation ranges. For all N-linked sequons in the dataset, all amino acids were systematically substituted with every other amino acid and the modified sequons were used as input for a trained algorithm as described herein, obtaining a predicted glycosylation range. The average change was then calculated and compared to the glycosylation range of the wild-type sequons, which is depicted with a 95% confidence interval. Lines for changes to high-mannose (“g”), fucosylated (“r”), and sialylated (“p”) glycans are shown, with a horizontal line at zero.
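The systematic substitution scan described for FIG. 19 can be sketched as below. The trained algorithm itself is not reproduced here: `toy_predictor` is a hypothetical stand-in that simply returns the fraction of phenylalanine residues, and every function name is an assumption. The example input is the HIV-1 Env sequon quoted elsewhere in this disclosure (SEQ ID NO: 63).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def single_substitution_variants(sequon):
    """Every sequence that differs from `sequon` at exactly one position."""
    variants = []
    for position, original in enumerate(sequon):
        for replacement in AMINO_ACIDS:
            if replacement != original:
                variants.append(sequon[:position] + replacement + sequon[position + 1:])
    return variants

def mean_predicted_change(sequon, predict):
    """Average change in the predicted value over all single substitutions,
    relative to the unmodified (wild-type) sequon."""
    baseline = predict(sequon)
    variants = single_substitution_variants(sequon)
    return sum(predict(v) - baseline for v in variants) / len(variants)

def toy_predictor(sequon):
    """Hypothetical placeholder for a trained model; NOT the disclosed algorithm."""
    return sequon.count("F") / len(sequon)

wild_type = "PVQINCTRPN"  # SEQ ID NO: 63 as quoted in this disclosure
variants = single_substitution_variants(wild_type)
average_change = mean_predicted_change(wild_type, toy_predictor)
```

In practice `predict` would wrap the trained model and return whichever glycosylation summary is of interest (for example a high-mannose, fucosylated, or sialylated score); that interface is assumed here.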

[0029] FIGs. 20A-20B show distributions of predicted-presence for the L and F variants at N-2 stratified by mannose per glycan (FIG. 20A) and sialic acids per glycan (FIG. 20B). FIG. 20C illustrates simpler boxplots which describe predicted glycosylation by mannose per glycan and sialic acid per glycan for three oligomannose sites in the SARS-CoV-2 spike glycoprotein. FIG. 20D shows glycan predicted-presence fold-changes at site N717 (by galactose and sialic acid) between the wild-type and B.1.1.7 spike protein. Predicted-presence fold-change (y-axis) is stratified by the basal predicted-presence for each glycan in the wild-type (x-axis). Predicted-presence fold-change from wild-type by galactose, mannose, GlcNAc, and sialic acid is provided for N717 and N616 in B.1.1.7 and D614G variants, respectively. ns: p>0.05, *: p<0.05, **: p<0.01, **: p<0.001, ***: p<1e-3, ****: p<1e-4.

[0030] FIG. 21 shows predicted-presence by monosaccharide for all sites in the SARS-CoV-2 spike. Glycosylation by mannose per glycan and sialic acid per glycan. Bottom bar indicates the dominant glycan type at each site, hybrid (122, 603), oligomannose (61, 234, 717, 801), complex (remainder) characterized at that site in the wild-type spike.

[0031] FIGs. 22A-22D show the predicted change in presence for glycans at N616 in D614G. Predicted-presence fold-change (y-axis) is stratified by the basal predicted-presence for each glycan in the wild-type (x-axis). Predicted-presence fold-change from wild-type by galactose (FIG. 22A), mannose (FIG. 22B), GlcNAc (FIG. 22C), and sialic acid (FIG. 22D) is provided for N616 in D614G.

[0032] FIG. 23A depicts a heatmap showing the log-scale abundance of various glycan species observed in wild-type (wt) and mutant Fc on human IgG3. FIG. 23B shows the background-adjusted InSaNNe predicted-presence compared with the empirical abundance in wild type (circle), the R301A mutant (“+”), and the Y296A mutant (“x”). FIG. 23C compares the empirical and predicted log fold changes in glycan abundance for mutants relative to wild type, for all glycans. FIGs. 23D-23E mirror FIGs. 23B-23C except that glycans with a predicted absolute log fold-change less than 1 were removed.

[0033] FIG. 24 depicts a table showing random forest model performance in a 2x6-fold cross-validation. Cross-validation folds were split on protein identity. Training data were annotated from SWISS-MODEL curated homology models, PDB empirical models, or ab initio I-TASSER homology models, using structural resolutions between 4-10 Å. Performance was measured using average AUROC, sensitivity, and specificity across cross-validation, with the standard error across 12 folds in parentheses. For each model type, the top two scores are bold. Rows with two or more top scores are noted in the final column.

[0034] FIG. 25 depicts a table showing ablation of major protein structure feature types. Seven ablations of major feature-types are shown, including: struct (all structure-derived annotation), Depth & Accessibility (depth of residue, relative/absolute surface area), SS (secondary structure), aaUp/Down/All (sequence-proximal amino acids upstream/downstream/either), and structAA (structure-proximal amino acids). Because some feature-types are associated, the ablation of some feature-types such as aaUp also required the removal of other feature-types indicated by the “x” left of the center line. Ablation significance is indicated by FDR-corrected Fisher’s Method pooled p-values (2-sample t-test, n=12 for each sample) comparing the performance distribution of ablation trained models to models trained on all data; performance distributions were collected over 2x6-fold cross-validation.
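For reference, Fisher's Method pools k independent p-values through the statistic X = -2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below shows only that pooling step in plain Python; the function name `fisher_pooled_p` is an assumption, which p-values were pooled for FIG. 25 is not specified here, and the FDR correction is not shown.

```python
import math
from typing import Sequence


def fisher_pooled_p(p_values: Sequence[float]) -> float:
    """Pool independent p-values with Fisher's method.

    Uses the closed-form chi-square survival function for even degrees of
    freedom (2k): sf(x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0

    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


pooled = fisher_pooled_p([0.05, 0.05])
print(pooled)
```

With a single input the pooled value equals that input, which is a convenient sanity check on the formula.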

[0035] FIG. 26A shows the validation loss as a function of training epoch of a trained algorithm as described herein. FIG. 26B shows the validation accuracy as a function of training epoch of a trained algorithm as described herein.

[0036] FIG. 27 shows a computer control system that is programmed or otherwise configured to implement methods provided herein.

[0037] FIG. 28A depicts a plot showing false positive rate estimation of a trained algorithm as described herein. For classification thresholds between 0 and 1, the true and false positive rate of InSaNNe predictions was assessed on the independent test set and compared to a random classifier baseline. FIG. 28B shows validation of predictions by the trained algorithm with existing structures on GlyConnect. The influence of classification threshold on the hit rate (recall/sensitivity) was investigated by evaluating whether recorded structures on GlyConnect would have been predicted by InSaNNe. FIG. 28C illustrates observed compositions in GlyConnect via predicted structures from InSaNNe. For each composition, the predicted probability of structures with that composition was probed to evaluate whether the composition was predicted via matching structures.
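The threshold sweep described for FIG. 28A can be illustrated with a short calculation of true and false positive rates at each classification threshold. This is a sketch under assumptions: the function name `roc_points`, the toy scores and labels, and the convention that a score at or above the threshold counts as a positive call are all illustrative and are not taken from the InSaNNe evaluation.

```python
from typing import List, Sequence, Tuple


def roc_points(
    scores: Sequence[float],
    labels: Sequence[int],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, true_positive_rate, false_positive_rate) triples.

    `labels` holds 1 for an observed glycan and 0 otherwise.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        points.append((t, tp / positives, fp / negatives))
    return points


# Toy data for illustration only.
points = roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0], [0.0, 0.5, 1.0])
print(points)
```

A random classifier baseline corresponds to the diagonal on which the true positive rate equals the false positive rate.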

[0038] FIG. 29 shows a dependence of sequon prediction performance on occurrence. Using a trained model as described herein (InSaNNe), the prediction performance of sequons, averaged over all their observed glycans, was determined and plotted against the number of sequon-glycan pairs in the dataset.

[0039] FIG. 30 shows the redundancies in sequon sequence. All sequon-glycan pairs in the dataset were used to obtain an averaged recall value for the trained InSaNNe model (WT), compared to an averaged recall value when removing the entire sequon sequence (all). Single or multiple amino acid positions were then iteratively removed from all sequons and model recall assessed.

DETAILED DESCRIPTION OF THE INVENTION

[0040] Unless defined otherwise, all terms of art, notations and other technical and scientific terms or terminology used herein are intended to have the same meaning as is commonly understood by one of ordinary skill in the art to which the claimed subject matter pertains. In some embodiments, terms with commonly understood meanings are defined herein for clarity and/or for ready reference, and the inclusion of such definitions herein should not necessarily be construed to represent a substantial difference over what is generally understood in the art.

[0041] Throughout this application, various embodiments may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the disclosure. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.

[0042] As used in the specification and claims, the singular forms “a”, “an” and “the” include plural references unless the context clearly dictates otherwise. For example, the term “a sample” includes a plurality of samples, including mixtures thereof.

[0043] The terms “determining,” “measuring,” “evaluating,” “assessing,” “assaying,” and “analyzing” are often used interchangeably herein to refer to forms of measurement. The terms include determining if an element is present or not (for example, detection). These terms can include quantitative, qualitative or quantitative and qualitative determinations. Assessing can be relative or absolute. “Detecting the presence of” can include determining the amount of something present in addition to determining whether it is present or absent depending on the context.

[0044] The terms “subject,” “individual,” or “patient” are often used interchangeably herein. A “subject” can be a biological entity containing expressed genetic materials. The biological entity can be a plant, animal, or microorganism, including, for example, bacteria, viruses, fungi, and protozoa. The subject can be tissues, cells and their progeny of a biological entity obtained in vivo or cultured in vitro. The subject can be a mammal. The mammal can be a human. The subject may be diagnosed or suspected of being at high risk for a disease. In some embodiments, the subject is not necessarily diagnosed or suspected of being at high risk for the disease.

[0045] The term “in vivo” is used to describe an event that takes place in a subject’s body.

[0046] The term “ex vivo” is used to describe an event that takes place outside of a subject’s body. An ex vivo assay is not performed on a subject. Rather, it is performed upon a sample separate from a subject. An example of an ex vivo assay performed on a sample is an “in vitro” assay.

[0047] The term “in vitro” is used to describe an event that takes place in a container for holding laboratory reagent such that it is separated from the biological source from which the material is obtained. In vitro assays can encompass cell-based assays in which living or dead cells are employed. In vitro assays can also encompass a cell-free assay in which no intact cells are employed.

[0048] As used herein, the term “about” a number refers to that number plus or minus 10% of that number. The term “about” a range refers to that range minus 10% of its lowest value and plus 10% of its greatest value.
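The definition of “about” in the preceding paragraph reduces to simple arithmetic, sketched here for illustration. The function names `about_number` and `about_range` are hypothetical, and the use of the absolute value for negative inputs is an assumption that the definition does not address.

```python
from typing import Tuple


def about_number(value: float, tolerance: float = 0.10) -> Tuple[float, float]:
    """Interval covered by "about" a number: the number plus or minus 10%."""
    delta = abs(value) * tolerance
    return value - delta, value + delta


def about_range(low: float, high: float, tolerance: float = 0.10) -> Tuple[float, float]:
    """Interval covered by "about" a range.

    The lower bound is reduced by 10% of the lowest value and the upper
    bound is increased by 10% of the greatest value.
    """
    return low - abs(low) * tolerance, high + abs(high) * tolerance


# "about 50" spans 45 to 55; "about 4 to 10" spans 3.6 to 11.
print(about_number(50))
print(about_range(4, 10))
```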

[0049] As used herein, the terms “treatment” or “treating” are used in reference to a pharmaceutical or other intervention regimen for obtaining beneficial or desired results in the recipient. Beneficial or desired results include but are not limited to a therapeutic benefit and/or a prophylactic benefit. A therapeutic benefit may refer to eradication or amelioration of symptoms or of an underlying disorder being treated. Also, a therapeutic benefit can be achieved with the eradication or amelioration of one or more of the physiological symptoms associated with the underlying disorder such that an improvement is observed in the subject, notwithstanding that the subject may still be afflicted with the underlying disorder. A prophylactic effect includes delaying, preventing, or eliminating the appearance of a disease or condition, delaying or eliminating the onset of symptoms of a disease or condition, slowing, halting, or reversing the progression of a disease or condition, or any combination thereof. For prophylactic benefit, a subject at risk of developing a particular disease, or a subject reporting one or more of the physiological symptoms of a disease, may undergo treatment, even though a diagnosis of this disease may not have been made.

[0050] “Position” refers to a particular amino acid by its index relative to a contextual zero position, e.g., the first amino acid at the N-terminal end of the molecule.

[0051] “Region” refers to a portion of an amino acid or nucleic acid, wherein said portion is smaller than the entire amino acid or nucleic acid.

[0052] The terms “identical” or percent “identity” in the context of two or more nucleic acid or polypeptide sequences refer to two or more sequences or subsequences that are the same or have a specified percentage of amino acid residues or nucleotides that are the same, when compared and aligned for maximum correspondence, e.g., as measured using one of the sequence comparison algorithms available to persons of skill or by visual inspection. Exemplary algorithms that are suitable for determining percent sequence identity and sequence similarity are the BLAST programs, which are described in, e.g., Altschul et al. (1990) “Basic local alignment search tool” J. Mol. Biol. 215:403-410, Gish et al. (1993) “Identification of protein coding regions by database similarity search” Nature Genet. 3:266-272, Madden et al. (1996) “Applications of network BLAST server” Meth. Enzymol. 266:131-141, Altschul et al. (1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs” Nucleic Acids Res. 25:3389-3402, and Zhang et al. (1997) “PowerBLAST: A new network BLAST application for interactive or automated sequence analysis and annotation” Genome Res. 7:649-656, which are each incorporated by reference. Many other optimal alignment algorithms are also known in the art and are optionally utilized to determine percent sequence identity.

Sequences and sequons

[0053] Methods and systems as described herein may ingest as inputs or operate on one or more sequences of biological molecules. Sequences may comprise amino acid or nucleic acid sequences. Sequences as described herein may be distinguished according to one or more properties or features. The feature may comprise sequence length; amino acid identity (e.g., presence or absence of a specific amino acid or content of a specific amino acid); amino acid position (e.g., relative or absolute position of an amino acid or amino acids); amino acid insertions (e.g., relative to a reference sequence); amino acid deletions (e.g., relative to a reference sequence); amino acid substitutions (e.g., relative to a reference sequence); observed or predicted structure of the sequence, including observed or predicted secondary structure (e.g., 3-turn helix, 4-turn (alpha) helix, 5-turn (pi) helix, beta strand, bend, (random) coil), observed or predicted tertiary structure, and observed or predicted quaternary structure; observed or predicted glycosylation features associated with the sequence; or any combination thereof.

[0054] In some embodiments, an amino acid “sequence” (sometimes referred to herein as simply a “sequence”) of a peptide refers to the order and identity of amino acids in the peptide (wherein peptide is inclusive of proteins). In some embodiments, an amino acid sequence comprises a glycosite, referring to an amino acid of the sequence that is, has the potential to be, or is predicted to be, attached via a hydroxyl or other functional group of that amino acid to a glycan (e.g., a glycan comprising one or more glycosylation features) to form a glycoconjugate. In some embodiments, a sequence comprising a glycosite comprises amino acids that are structurally proximal (“structurally proximal amino acids” or “structurally proximal AAs”) to the glycosite. For example, in some embodiments amino acids that are structurally proximal include those amino acids that are within 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, or more Angstroms of the glycosite when the monomeric peptide sequence is arranged in three-dimensional space to have secondary, tertiary, and/or quaternary structure. In some embodiments, a sequence comprising a glycosite comprises sequence proximal amino acids, where sequence proximal amino acids are amino acids that are within 20 amino acids upstream or downstream of the glycosite. For instance, for the non-limiting example amino acid sequence AA1-AA2-AA3-AA4-AA5-AA6-AA7-AA8-AA9-AA10-AA11-AA12-AA13-AA14-AA15-AA16-AA17-AA18-AA19-AA20-AA21-AA22-AA23-AA24-AA25(N)-AA26-AA27-AA28-AA29-AA30-AA31-AA32-AA33-AA34-AA35-AA36-AA37-AA38-AA39-AA40-AA41-AA42-AA43-AA44-AA45-AA46-AA47-AA48-AA49-AA50, where each “AA” is an amino acid, AA25 is a glycosite (N), and amino acids AA4-AA24 and AA26-AA46 are “sequence proximal”. Accordingly, use of the term “sequence” is inclusive of an amino acid sequence comprising a glycosite with one or more amino acids that are structurally proximal and sequence proximal to the glycosite.
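As an illustration of structural proximity, residues within a chosen distance of the glycosite can be selected from three-dimensional coordinates. The sketch below assumes one representative coordinate per residue (for example, an alpha carbon) and uses hypothetical names and toy coordinates; the choice of atoms used for the distance measurement is an assumption.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def structurally_proximal(
    glycosite_index: int,
    coords: Dict[int, Coord],
    cutoff_angstrom: float,
) -> List[int]:
    """Return residue indices whose coordinate lies within the cutoff of the glycosite.

    `coords` maps a residue index to one representative (x, y, z) position
    in Angstroms. The glycosite itself is excluded from the result.
    """
    site = coords[glycosite_index]
    return sorted(
        index
        for index, xyz in coords.items()
        if index != glycosite_index and math.dist(site, xyz) <= cutoff_angstrom
    )


# Toy coordinates for illustration only; residue 0 is the glycosite.
toy_coords = {0: (0.0, 0.0, 0.0), 1: (3.0, 0.0, 0.0), 2: (0.0, 8.0, 0.0), 3: (6.0, 8.0, 0.0)}
print(structurally_proximal(0, toy_coords, cutoff_angstrom=5.0))
```

Raising the cutoff enlarges the neighborhood, which mirrors the enumerated distance thresholds in the paragraph above.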

[0055] In some embodiments, a sequence comprises a sequon, which comprises a glycosite. In some cases, the sequon comprises a glycosite and one or more structurally proximal amino acids. In some cases, the sequon comprises a glycosite and one or more sequence proximal amino acids. In some cases, the sequon comprises a glycosite, one or more structurally proximal amino acids, and one or more sequence proximal amino acids. While sequon has been referred to as N-type NX[S/T], in various embodiments it is used herein to include the sequence and structure context surrounding any glycosite.

[0056] In some embodiments, a sequence or sequon comprises one or more observed or predicted glycosylation sites and any number of flanking residues (e.g., amino acids). In some embodiments, a sequence or sequon comprises a glycosylation site and about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, or more residues flanking the glycosylation site (glycosite). The flanking residues may be in either or both directions of the glycosylation site (or glycosite) (e.g., upstream, downstream, N-terminal, C-terminal). The flanking residues may comprise any monomer. In some embodiments, an N-type glycosite in a sequence or sequon comprises an NX[S/T] motif where N is an asparagine residue which may be or may potentially be glycosylated, X is any amino acid except proline, S is serine, and T is threonine. In some embodiments, a sequence or sequon comprises an extended aromatic sequon (EAS). In some embodiments, a sequence or sequon comprises an O-type glycosite, and may comprise a glycosylated serine or threonine and flanking amino acids. In some embodiments, a sequence may have associated with it one or more glycosites as described elsewhere herein. In some embodiments, a sequence or sequon may comprise a glycosite, amino acids flanking the glycosite (sequence proximal), and physically proximal amino acid residues within about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, or more Angstroms from the glycosylated amino acid (or glycosite).
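The NX[S/T] motif and its flanking residues can be located with a regular expression, as in the following sketch. The function name `find_n_sequons`, the default flank width, and the example sequence are hypothetical. The pattern assumes the input contains only one-letter amino acid codes, since `[^P]` matches any character other than P.

```python
import re
from typing import List, Tuple

# Lookahead so that overlapping motifs (e.g., "NNST") are all reported.
N_SEQUON = re.compile(r"(?=N[^P][ST])")


def find_n_sequons(sequence: str, flank: int = 3) -> List[Tuple[int, str]]:
    """Return (1-based asparagine position, motif with flanking residues) pairs.

    The returned window holds up to `flank` residues upstream of the
    asparagine, the three-residue motif, and up to `flank` residues downstream.
    """
    sequence = sequence.upper()
    hits = []
    for match in N_SEQUON.finditer(sequence):
        site = match.start()
        start = max(0, site - flank)
        stop = min(len(sequence), site + 3 + flank)
        hits.append((site + 1, sequence[start:stop]))
    return hits


# Hypothetical sequence: N3 (NGT), N11 (NNS), and N12 (NST) match; N7 (NPS) does not.
hits = find_n_sequons("MKNGTANPSLNNSTQ", flank=2)
print(hits)
```

The window is truncated at the ends of the sequence, so motifs near a terminus yield shorter windows.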

Glycosylation Features

[0057] Glycoconjugates (e.g., a glycopeptide) as described herein may comprise one or more glycosylation features or glycans decorating a glycosite of an amino acid sequence. A glycosylation feature may comprise one or more monosaccharides linked glycosidically. A glycosylation feature may be present or otherwise associated with the glycosite. The association may comprise one or more covalent (e.g., glycosidic) bonds or the association may be non-covalent. A glycosylation feature may comprise any number of monosaccharides or derivatives. A glycosylation feature may comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, or more monosaccharides or derivatives thereof.

[0058] Glycosylation features as described herein may comprise any monosaccharide or derivative thereof. Monosaccharides may comprise D-glucose (Glc), D-galactose (Gal), N-acetylglucosamine (GlcNAc), N-acetylgalactosamine (GalNAc), D-mannose (Man), N-acetylneuraminic acid (Neu5Ac), N-glycolylneuraminic acid (Neu5Gc), neuraminic acid (Neu), 2-keto-3-deoxynononic acid or 3-deoxy-D-glycero-D-galacto-nonulosonic acid (KDN), 3-deoxy-D-manno-2-octulopyranosylonic acid (Kdo), D-galacturonic acid (GalA), L-iduronic acid (IdoA), L-rhamnose (Rha), L-fucose (Fuc), D-xylose (Xyl), D-ribose (Rib), L-arabinofuranose (Araf), D-glucuronic acid (GlcA), D-allose (All), D-apiose (Api), D-fructofuranose (Fruf), ascarylose (Asc), and ribitol (Rbo). Derivatives of monosaccharides may comprise sugar alcohols, amino sugars, uronic acids, ulosonic acids, aldonic acids, aldaric acids, sulfosugars, or any combination or modification thereof. A sugar modification may comprise one or more of acetylation, propylation, formylation, phosphorylation, or sulfonation or addition of one or more of deacetylated N-acetyl (N), phosphoethanolamine (Pe), inositol (In), methyl (Me), N-acetyl (NAc), O-acetyl (Ac), phosphate (P), phosphocholine (Pc), pyruvate (Pyr), sulfate (S), sulfide (Sh), aminoethylphosphonate (Ep), deoxy (d), carboxylic acid (-oic), amine (-amine), amide (-amide), and ketone (-one). Such modifications may be present at any position on the sugar, as designated by standard sugar naming/notation. In some cases, a glycosidic addition of a monosaccharide to another monosaccharide is considered a polymerizing modification that gives rise to a glycan. In some embodiments, at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more modifications are present on the monosaccharide. In some embodiments, no more than 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, or fewer modifications are present on the monosaccharide. Monosaccharides may comprise any number of carbon atoms. Monosaccharides may comprise any stereoisomer, epimer, enantiomer, or anomer. In some embodiments, monosaccharides comprise 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or more carbon atoms.

[0059] In some embodiments, a glycosylation feature comprises glyceraldehyde, threose, erythrose, lyxose, xylose (Xyl), arabinose, ribose, talose, galactose (Gal), idose, gulose, mannose (Man), glucose (Glc), altrose, allose, sedoheptulose, mannoheptulose, N-acetyl-galactosamine (Glc2NAc), glucuronic acid (GlcA), 3-O-sulfogalactose (Gal3S), N-acetylneuraminic acid (Neu5Ac), 2-keto-3-deoxynononic acid (Kdn), and any combination thereof.

[0060] A glycosylation feature may comprise one monosaccharide. A glycosylation feature may comprise a plurality of monosaccharides. In such cases, the monosaccharides may be connected in any configuration through any suitable glycosidic bond(s). Glycosidic bonds between monosaccharides in a polysaccharide glycosylation feature may be alpha or beta and connect any two carbon atoms between adjacent monosaccharide residues through an oxygen atom. In some embodiments, the glycosylation feature or glycan is an N-linked, O-linked, C-linked, or S-linked glycan. In some embodiments, more than one glycosylation feature is present on a single biomolecule. The more than one glycosylation features may all be linked in the same manner (e.g., N-linked, O-linked, C-linked, S-linked), or they may be independently N-linked, O-linked, C-linked, or S-linked. Glycosylation features may be branched, linear, or both. Glycosylation features may be biantennary, triantennary, tetra-antennary, or any combination thereof. In some embodiments, the glycosylation feature comprises a polysaccharide epitope. In some embodiments, the glycosylation feature comprises high-mannose. In some embodiments, the glycosylation feature comprises sialylation. In some embodiments, the glycosylation feature comprises fucosylation. In some embodiments, the glycosylation feature comprises hybrid, complex, core or distally fucosylated, terminally sialylated, terminally galactosylated, terminally GlcNAc-ylated, GlcNAc-bisected, or poly-sialylated, or a combination thereof.

[0061] A glycosylation feature may be described in relative terms. A glycosylation feature may be described as increased or decreased with respect to the amount of a given monosaccharide in the glycosylation feature relative to a reference glycosylation feature. For example, a glycosylation feature may be described as increased in sialylation or fucosylation if the glycosylation feature comprises more sialic acid or fucose residues, respectively, than a reference glycan. Alternatively or additionally, a glycosylation feature may be described as increased or decreased with respect to the configuration (e.g., branched, linear, biantennary, tri-antennary, tetra-antennary, penta-antennary) of the glycosylation feature relative to a reference glycosylation feature. For example, a glycosylation feature may be described as increased in branching if the glycosylation feature comprises more branches than a reference glycosylation feature. In some embodiments, a glycosylation feature may be described as increased or decreased in one or more of high-mannose, sialylation, fucosylation, hybrid, complexity, core or distal fucosylation, terminal sialylation, terminal galactosylation, terminal GlcNAc-ylation, GlcNAc-bisection, or poly-sialylation, or a glycosylation feature listed in Table 1.
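The relative description by monosaccharide amount can be illustrated by comparing residue counts between a glycosylation feature and a reference. In this sketch, the representation of a glycan as a mapping from monosaccharide abbreviation to count, the function name `relative_change`, and the example compositions are assumptions made for illustration.

```python
from typing import Dict, Mapping


def relative_change(
    feature: Mapping[str, int],
    reference: Mapping[str, int],
) -> Dict[str, str]:
    """Label each monosaccharide as increased, decreased, or unchanged.

    A monosaccharide missing from either composition is counted as zero.
    """
    result = {}
    for mono in sorted(set(feature) | set(reference)):
        diff = feature.get(mono, 0) - reference.get(mono, 0)
        if diff > 0:
            result[mono] = "increased"
        elif diff < 0:
            result[mono] = "decreased"
        else:
            result[mono] = "unchanged"
    return result


# Hypothetical compositions: the feature carries one more sialic acid (Neu5Ac).
change = relative_change(
    {"Neu5Ac": 2, "Fuc": 1, "Gal": 2},
    {"Neu5Ac": 1, "Fuc": 1, "Man": 3},
)
print(change)
```

Here the feature would be described as increased in sialylation and unchanged in fucosylation relative to the reference.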

[0062] In some embodiments, a glycosylation feature is one of those listed in Table 1.

Table 1. Table of representative glycosylation features. Glycosylation features are described using IUPAC-extended format (doi.org/10.1351/pac199668101919). Table 1 can be found at https://doi.org/10.5281/zenodo.6459738

Glycosites

[0063] Methods and systems as described herein may analyze one or more glycosites. Glycosites may comprise any site on a molecule (e.g., protein, lipid, nucleic acid) that can be glycosylated, whether or not the site is glycosylated. Generally, such sites comprise one or more atoms (e.g., nitrogen, oxygen, sulfur, carbon), optionally in one or more moieties (e.g., amino, amido, phenol, hydroxyl, guanidino, alcohol, thiol, indole), that are capable of forming a glycosidic bond with a sugar (e.g., glycosylation feature, such as a monosaccharide, oligosaccharide, polysaccharide, or derivative) molecule or part thereof. In some embodiments, a glycosite may comprise an amino acid comprising a side chain comprising an oxygen atom. In some embodiments, a glycosite may comprise an amino acid comprising a side chain comprising a sulfur atom. In some embodiments, a glycosite may comprise an amino acid comprising a side chain comprising a nitrogen atom. The glycosite may comprise arginine, asparagine, serine, threonine, tyrosine, cysteine, homocysteine, ornithine, or lysine. In some embodiments, a glycosite may comprise a nucleic acid or portion (e.g., nucleotide) thereof. In some embodiments, a glycosite may comprise a lipid or portion thereof.

[0064] Glycosites may be further distinguished by the sequence or sequon comprising the glycosite. In some embodiments, sequence or sequon may comprise other atoms, moieties, residues, monomers (e.g., amino acids, monosaccharides), or glycosites or glycan features in proximity to an atom forming or capable of forming a glycosidic bond. For example, a glycosite may be designated based on the sequential or spatial proximity of a particular amino acid or nucleoside to an atom that may be or may potentially be glycosylated. Proximity may be described in relative terms (e.g., C-terminal or N-terminal to a reference amino acid), absolute terms (e.g., within three sites of a reference amino acid, within 6 Å of an amino acid), or a combination thereof (e.g., within three sites C-terminal to a reference amino acid). In some embodiments, a sequence or sequon may comprise one or more amino acids within a certain sequence position. The one or more amino acids may be within 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more positions of the atom forming or capable of forming the glycosidic bond. The one or more amino acids may be within no more than 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 positions of the atom forming or capable of forming the glycosidic bond. The one or more amino acids may be C-terminal with respect to the glycosite. The one or more C-terminal amino acids may be within 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more positions of the atom forming or capable of forming the glycosidic bond. The one or more amino acids may be N-terminal with respect to the glycosite. The one or more N-terminal amino acids may be within 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more positions of the atom forming or capable of forming the glycosidic bond. The one or more amino acids may comprise an atom within about 1, about 1.5, about 1.6, about 1.7, about 1.8, about 1.9, about 2.0, about 2.1, about 2.2, about 2.3, about 2.4, about 2.5, about 2.6, about 2.7, about 2.8, about 2.9, about 3.0, about 3.1, about 3.2, about 3.3, about 3.4, about 3.5, about 3.6, about 3.7, about 3.8, about 3.9, about 4.0, about 4.1, about 4.2, about 4.3, about 4.4, about 4.5, about 4.6, about 4.7, about 4.8, about 4.9, about 5.0, about 5.1, about 5.2, about 5.3, about 5.4, about 5.5, about 5.6, about 5.7, about 5.8, about 5.9, about 6.0, about 6.5, about 7.0, about 7.5, about 8.0, about 8.5, about 9.0, about 9.5, about 10.0, about 10.5, about 11.0, about 11.5, about 12.0, about 13.0, about 14.0, about 15.0, about 16.0, about 17.0, about 18.0, about 19.0, about 20.0, about 21.0, about 22.0, about 23.0, about 24.0, about 25.0, about 26.0, about 27.0, about 28.0, about 29.0, about 30.0 Å or more of the atom forming or capable of forming a glycosidic bond. The one or more amino acids may comprise an atom within about 30.0, about 29.0, about 28.0, about 27.0, about 26.0, about 25.0, about 24.0, about 23.0, about 22.0, about 21.0, about 20.0, about 19.0, about 18.0, about 17.0, about 16.0, about 15.0, about 14.0, about 13.0, about 12.0, about 11.5, about 11.0, about 10.5, about 10.0, about 9.5, about 9.0, about 8.5, about 8.0, about 7.5, about 7.0, about 6.5, about 6.0, about 5.9, about 5.8, about 5.7, about 5.6, about 5.5, about 5.4, about 5.3, about 5.2, about 5.1, about 5.0, about 4.9, about 4.8, about 4.7, about 4.6, about 4.5, about 4.4, about 4.3, about 4.2, about 4.1, about 4.0, about 3.9, about 3.8, about 3.7, about 3.6, about 3.5, about 3.4, about 3.3, about 3.2, about 3.1, about 3.0, about 2.9, about 2.8, about 2.7, about 2.6, about 2.5, about 2.4, about 2.3, about 2.2, about 2.1, about 2.0, about 1.9, about 1.8, about 1.7, about 1.6, about 1.5, about 1 Å or less of the atom forming or capable of forming a glycosidic bond.

[0065] A glycosite may be distinguished based on the sequence or sequon containing the glycosite. In some embodiments, a sequence or sequon may comprise a glycosite occurrence within or near a particular protein structural element. In some embodiments, the glycosite may be in a particular secondary structural element. In some embodiments, the secondary structural element comprises one or more of alpha-helix (H), 3₁₀-helix (G), pi-helix (I), extended strand (E), beta-bridge (B), turn (T), bend (S), or random coil (C). In some embodiments, the glycosite may be within one or more sites of the secondary structural element. In some embodiments, the glycosite may be within 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more positions of the secondary structural element. In some embodiments, the glycosite may be within about 1, about 1.5, about 1.6, about 1.7, about 1.8, about 1.9, about 2.0, about 2.1, about 2.2, about 2.3, about 2.4, about 2.5, about 2.6, about 2.7, about 2.8, about 2.9, about 3.0, about 3.1, about 3.2, about 3.3, about 3.4, about 3.5, about 3.6, about 3.7, about 3.8, about 3.9, about 4.0, about 4.1, about 4.2, about 4.3, about 4.4, about 4.5, about 4.6, about 4.7, about 4.8, about 4.9, about 5.0, about 5.1, about 5.2, about 5.3, about 5.4, about 5.5, about 5.6, about 5.7, about 5.8, about 5.9, about 6.0, about 6.5, about 7.0, about 7.5, about 8.0, about 8.5, about 9.0, about 9.5, about 10.0, about 10.5, about 11.0, about 11.5, about 12.0, about 13.0, about 14.0, about 15.0, about 16.0, about 17.0, about 18.0, about 19.0, about 20.0, about 21.0, about 22.0, about 23.0, about 24.0, about 25.0, about 26.0, about 27.0, about 28.0, about 29.0, about 30.0 Å or more of the secondary structural element. In some embodiments, the glycosite may be within no more than about 30.0, about 29.0, about 28.0, about 27.0, about 26.0, about 25.0, about 24.0, about 23.0, about 22.0, about 21.0, about 20.0, about 19.0, about 18.0, about 17.0, about 16.0, about 15.0, about 14.0, about 13.0, about 12.0, about 11.5, about 11.0, about 10.5, about 10.0, about 9.5, about 9.0, about 8.5, about 8.0, about 7.5, about 7.0, about 6.5, about 6.0, about 5.9, about 5.8, about 5.7, about 5.6, about 5.5, about 5.4, about 5.3, about 5.2, about 5.1, about 5.0, about 4.9, about 4.8, about 4.7, about 4.6, about 4.5, about 4.4, about 4.3, about 4.2, about 4.1, about 4.0, about 3.9, about 3.8, about 3.7, about 3.6, about 3.5, about 3.4, about 3.3, about 3.2, about 3.1, about 3.0, about 2.9, about 2.8, about 2.7, about 2.6, about 2.5, about 2.4, about 2.3, about 2.2, about 2.1, about 2.0, about 1.9, about 1.8, about 1.7, about 1.6, about 1.5, about 1 Å or less of the secondary structural element. In some embodiments, the glycosite may be in a particular tertiary structural element. In some embodiments, the glycosite may be within one or more sites of the tertiary structural element. In some embodiments, the glycosite may be within 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more positions of the tertiary structural element. In some embodiments, the glycosite may be within about 1, about 1.5, about 1.6, about 1.7, about 1.8, about 1.9, about 2.0, about 2.1, about 2.2, about 2.3, about 2.4, about 2.5, about 2.6, about 2.7, about 2.8, about 2.9, about 3.0, about 3.1, about 3.2, about 3.3, about 3.4, about 3.5, about 3.6, about 3.7, about 3.8, about 3.9, about 4.0, about 4.1, about 4.2, about 4.3, about 4.4, about 4.5, about 4.6, about 4.7, about 4.8, about 4.9, about 5.0, about 5.1, about 5.2, about 5.3, about 5.4, about 5.5, about 5.6, about 5.7, about 5.8, about 5.9, about 6.0, about 6.5, about 7.0, about 7.5, about 8.0, about 8.5, about 9.0, about 9.5, about 10.0, about 10.5, about 11.0, about 11.5, about 12.0, about 13.0, about 14.0, about 15.0, about 16.0, about 17.0, about 18.0, about 19.0, about 20.0, about 21.0, about 22.0, about 23.0, about 24.0, about 25.0, about 26.0, about 27.0, about 28.0, about 29.0, about 30.0 Å or more of the tertiary structural element. In some embodiments, the glycosite may be within no more than about 30.0, about 29.0, about 28.0, about 27.0, about 26.0, about 25.0, about 24.0, about 23.0, about 22.0, about 21.0, about 20.0, about 19.0, about 18.0, about 17.0, about 16.0, about 15.0, about 14.0, about 13.0, about 12.0, about 11.5, about 11.0, about 10.5, about 10.0, about 9.5, about 9.0, about 8.5, about 8.0, about 7.5, about 7.0, about 6.5, about 6.0, about 5.9, about 5.8, about 5.7, about 5.6, about 5.5, about 5.4, about 5.3, about 5.2, about 5.1, about 5.0, about 4.9, about 4.8, about 4.7, about 4.6, about 4.5, about 4.4, about 4.3, about 4.2, about 4.1, about 4.0, about 3.9, about 3.8, about 3.7, about 3.6, about 3.5, about 3.4, about 3.3, about 3.2, about 3.1, about 3.0, about 2.9, about 2.8, about 2.7, about 2.6, about 2.5, about 2.4, about 2.3, about 2.2, about 2.1, about 2.0, about 1.9, about 1.8, about 1.7, about 1.6, about 1.5, about 1 Å or less of the tertiary structural element. In some embodiments, the glycosite may be in a particular quaternary structural element. In some embodiments, the glycosite may be within one or more sites of the quaternary structural element. In some embodiments, the glycosite may be within 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more positions of the quaternary structural element. In some embodiments, the glycosite may be within about 1, about 1.5, about 1.6, about 1.7, about 1.8, about 1.9, about 2.0, about 2.1, about 2.2, about 2.3, about 2.4, about 2.5, about 2.6, about 2.7, about 2.8, about 2.9, about 3.0, about 3.1, about 3.2, about 3.3, about 3.4, about 3.5, about 3.6, about 3.7, about 3.8, about 3.9, about 4.0, about 4.1, about 4.2, about 4.3, about 4.4, about 4.5, about 4.6, about 4.7, about 4.8, about 4.9, about 5.0, about 5.1, about 5.2, about 5.3, about 5.4, about 5.5, about 5.6, about 5.7, about 5.8, about 5.9, about 6.0, about 6.5, about 7.0, about 7.5, about 8.0, about 8.5, about 9.0, about 9.5, about 10.0, about 10.5, about 11.0, about 11.5, about 12.0, about 13.0, about 14.0, about 15.0, about 16.0, about 17.0, about 18.0, about 19.0, about 20.0, about 21.0, about 22.0, about 23.0, about 24.0, about 25.0, about 26.0, about 27.0, about 28.0, about 29.0, about 30.0 Å or more of the quaternary structural element. In some embodiments, the glycosite may be within no more than about 30.0, about 29.0, about 28.0, about 27.0, about 26.0, about 25.0, about 24.0, about 23.0, about 22.0, about 21.0, about 20.0, about 19.0, about 18.0, about 17.0, about 16.0, about 15.0, about 14.0, about 13.0, about 12.0, about 11.5, about 11.0, about 10.5, about 10.0, about 9.5, about 9.0, about 8.5, about 8.0, about 7.5, about 7.0, about 6.5, about 6.0, about 5.9, about 5.8, about 5.7, about 5.6, about 5.5, about 5.4, about 5.3, about 5.2, about 5.1, about 5.0, about 4.9, about 4.8, about 4.7, about 4.6, about 4.5, about 4.4, about 4.3, about 4.2, about 4.1, about 4.0, about 3.9, about 3.8, about 3.7, about 3.6, about 3.5, about 3.4, about 3.3, about 3.2, about 3.1, about 3.0, about 2.9, about 2.8, about 2.7, about 2.6, about 2.5, about 2.4, about 2.3, about 2.2, about 2.1, about 2.0, about 1.9, about 1.8, about 1.7, about 1.6, about 1.5, about 1 Å or less of the quaternary structural element.

[0066] In some embodiments, a sequence or sequon may comprise a glycosite occurrence near another glycosite. In some embodiments, the glycosite may be within 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more positions (e.g., amino acid residues) of the other glycosite. In some embodiments, the glycosite may be within about 1, about 1.5, about 1.6, about 1.7, about 1.8, about 1.9, about 2.0, about 2.1, about 2.2, about 2.3, about 2.4, about 2.5, about 2.6, about 2.7, about 2.8, about 2.9, about 3.0, about 3.1, about 3.2, about 3.3, about 3.4, about 3.5, about 3.6, about 3.7, about 3.8, about 3.9, about 4.0, about 4.1, about 4.2, about 4.3, about 4.4, about 4.5, about 4.6, about 4.7, about 4.8, about 4.9, about 5.0, about 5.1, about 5.2, about 5.3, about 5.4, about 5.5, about 5.6, about 5.7, about 5.8, about 5.9, about 6.0, about 6.5, about 7.0, about 7.5, about 8.0, about 8.5, about 9.0, about 9.5, about 10.0, about 10.5, about 11.0, about 11.5, about 12.0, about 13.0, about 14.0, about 15.0, about 16.0, about 17.0, about 18.0, about 19.0, about 20.0, about 21.0, about 22.0, about 23.0, about 24.0, about 25.0, about 26.0, about 27.0, about 28.0, about 29.0, about 30.0 Å or more of the other glycosite. In some embodiments, the glycosite may be within no more than about 30.0, about 29.0, about 28.0, about 27.0, about 26.0, about 25.0, about 24.0, about 23.0, about 22.0, about 21.0, about 20.0, about 19.0, about 18.0, about 17.0, about 16.0, about 15.0, about 14.0, about 13.0, about 12.0, about 11.5, about 11.0, about 10.5, about 10.0, about 9.5, about 9.0, about 8.5, about 8.0, about 7.5, about 7.0, about 6.5, about 6.0, about 5.9, about 5.8, about 5.7, about 5.6, about 5.5, about 5.4, about 5.3, about 5.2, about 5.1, about 5.0, about 4.9, about 4.8, about 4.7, about 4.6, about 4.5, about 4.4, about 4.3, about 4.2, about 4.1, about 4.0, about 3.9, about 3.8, about 3.7, about 3.6, about 3.5, about 3.4, about 3.3, about 3.2, about 3.1, about 3.0, about 2.9, about 2.8, about 2.7, about 2.6, about 2.5, about 2.4, about 2.3, about 2.2, about 2.1, about 2.0, about 1.9, about 1.8, about 1.7, about 1.6, about 1.5, about 1 Å or less of the other glycosite.

[0067] In some embodiments, a sequence or sequon may comprise predicted or observed solvent accessible surface area (SASA), total solvent accessibility (ASA), relative solvent accessibility (RSA), or any combination thereof. In some embodiments, a glycosite may be distinguished on the basis of its hydrophobicity. The hydrophobicity of a residue or residues may be estimated or determined by, for example, a Kyte-Doolittle scale. In some embodiments, a glycosite may be distinguished on the basis of the depth of a residue center or atom (e.g., alpha carbon, such as of an amino acid). In some embodiments, a glycosite may be distinguished on the basis of one or more dihedral angles. In some embodiments, the dihedral angles comprise phi and psi angles of an amino acid residue. In some embodiments, the dihedral angles comprise one or more torsion angles (e.g., alpha, beta, gamma, delta, epsilon, zeta, nu) of a nucleic acid residue.
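
As a non-limiting illustration, the hydrophobicity of a glycosite may be sketched as the mean Kyte-Doolittle hydropathy over residues flanking the site. The function name `glycosite_hydrophobicity`, the 0-based `site_index`, and the choice of a 7-residue window centered on the glycosite and truncated at the sequence ends are assumptions made for this example only.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def glycosite_hydrophobicity(sequence, site_index, window=7):
    """Mean Kyte-Doolittle hydropathy in a window around a glycosite.

    Assumption (not from the source): the window is centered on the
    0-based `site_index` and is truncated at the sequence ends.
    """
    half = window // 2
    start = max(0, site_index - half)
    end = min(len(sequence), site_index + half + 1)
    values = [KYTE_DOOLITTLE[aa] for aa in sequence[start:end]]
    return sum(values) / len(values)
```

A more positive value indicates a more hydrophobic environment around the glycosite on this scale.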

Intermolecular relationships

[0068] Methods and systems as described herein may produce or use parameters characterizing the association between glycosylation features and biomolecular features (e.g., amino acids proximal to the glycosite by sequence and/or three-dimensional space, such as found in a sequence or sequon comprising a glycosite). Such parameters may characterize the strength of an association between a glycosylation feature or aspect of a glycosylation feature and a biomolecule or aspect of a biomolecule. These parameters may be referred to herein as “intermolecular relations” or “IMRs.”

[0069] In some embodiments, the parameter characterizes a relationship (e.g., an association) of a glycosylation feature or part thereof as described herein. In some embodiments, the relationship is an association with a biomolecular feature or a glycosite as described herein. The glycosite may be characterized by one or more of glycosylation feature, biomolecular sequence (e.g., amino acid sequence), biomolecule composition (e.g., amino acid identity), amino acid substitution, amino acid position, observed or predicted structure (e.g., secondary, tertiary, quaternary structure), presence or absence of another glycosite, proximity to another glycosite, glycosylation composition, glycosylation configuration (e.g., linear, branched), hydrophobicity, solvent accessibility, and any combination thereof.

[0070] The IMRs may be determined from one or more datasets comprising sets of biomolecular features and glycosylation features. The datasets may comprise data from one or more databases of experimental structural biology data (e.g., X-ray crystallographic data, cryogenic electron microscopy data, nuclear magnetic resonance data), experimental glycan measurements or biochemical data (e.g., mass spectrometry data, chromatographic data), or computer simulation or modeling data (e.g., molecular dynamics simulations, de novo or ab initio prediction, homology modeling, fragment assembly, secondary structure prediction algorithms, trained algorithms as described herein). By observing the coincidence of certain biomolecular features and corresponding glycosylation features, IMRs may be determined. Alternatively or additionally, the IMRs may be determined by a trained algorithm as described herein. The trained algorithm may be able to offer a robust noise-mitigated observation or prediction of IMRs.

[0071] Any suitable measure of association may be used to determine the association between a biomolecular feature and a glycosylation feature. In some embodiments, the measure of association is determined by a generalized estimating equation (GEE), a Fisher exact test, a chi-squared test, or a hypergeometric test. In some embodiments, the determination of the association may comprise a dimensionality reduction process. The dimensionality reduction process may comprise principal component analysis (PCA), kernel PCA, linear discriminant analysis (LDA), independent component analysis (ICA), nonnegative matrix factorization (NMF), uniform manifold approximation and projection (UMAP), t-distributed stochastic neighbor embedding (tSNE), or neural network embedding. In some embodiments, the IMRs may be expressed as an odds ratio. In some embodiments, the IMRs may be expressed as a logarithm of an odds ratio. In some cases, the odds ratio is determined by GEE. In some cases, the odds ratio is determined by Fisher’s exact test.
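
As a non-limiting illustration, a log odds ratio and a two-sided Fisher exact p-value may be computed from a 2x2 table of co-occurrence counts between a biomolecular feature and a glycosylation feature. The function names, the table layout, and the 0.5 pseudocount (a Haldane-Anscombe style correction that avoids division by zero) are assumptions of this sketch and are not taken from the disclosure.

```python
import math


def log_odds_ratio(a, b, c, d, pseudocount=0.5):
    """Natural-log odds ratio of the 2x2 table [[a, b], [c, d]].

    a: feature present, glycan substructure present
    b: feature present, glycan substructure absent
    c: feature absent,  glycan substructure present
    d: feature absent,  glycan substructure absent
    The pseudocount is an assumption added to keep the ratio finite.
    """
    a, b, c, d = (x + pseudocount for x in (a, b, c, d))
    return math.log((a * d) / (b * c))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = math.comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum all tables that are no more probable than the observed table.
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)
```

A positive log odds ratio corresponds to a feature and a glycan substructure that tend to occur together, and a negative value to a pair that tend not to occur together.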

[0072] IMRs as described herein may be either positive or negative. In some embodiments, an IMR may be positive. In such cases, the IMR may indicate that a glycosylation feature and a sequence, sequon, glycosite, or part thereof, tend to occur together (e.g., the sequence, sequon, or glycosite, or part thereof, is associated with the presence of the glycosylation feature). In some embodiments, an IMR may be negative. In such cases, the IMR may indicate that a glycosylation feature and a sequence, sequon, or glycosite, or part thereof, tend not to occur together (e.g., the sequence, sequon, or glycosite, or part thereof, is associated with the absence of the glycosylation feature). In some embodiments, an IMR may be zero. In such cases, the IMR may indicate the glycosylation feature and sequence, sequon, or glycosite, or part thereof, are independent (e.g., are not associated).

[0073] IMRs may have associated with them one or more measures of uncertainty. The measures of uncertainty may comprise confidence intervals, estimated errors, estimated standard errors of the mean, p-values, or false discovery rates (FDRs) or corrections thereto.

[0074] In some embodiments, the IMR may be described as “significant” if it has a measure of uncertainty that is above or below a cutoff or threshold value. In some embodiments, an IMR may be described as “significant” if it is associated with a p-value less than a cutoff value. In some embodiments, the cutoff value is less than about 1.0, about 0.9, about 0.8, about 0.7, about 0.6, about 0.5, about 0.4, about 0.3, about 0.2, about 0.1, about 0.09, about 0.08, about 0.07, about 0.06, about 0.05, about 0.04, about 0.03, about 0.02, about 0.01, about 0.005, about 0.001, about 0.0005, about 0.0001, or less. In some embodiments, an IMR may be described as “significant” if it is associated with an FDR correction less than a cutoff value. In some embodiments, the cutoff value is less than about 1.0, about 0.9, about 0.8, about 0.7, about 0.6, about 0.5, about 0.4, about 0.3, about 0.2, about 0.1, about 0.09, about 0.08, about 0.07, about 0.06, about 0.05, about 0.04, about 0.03, about 0.02, about 0.01, about 0.005, about 0.001, about 0.0005, about 0.0001, or less.
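
As a non-limiting illustration, an FDR correction may be applied to a set of IMR p-values before comparison with a cutoff. The disclosure does not name a particular correction; the Benjamini-Hochberg procedure is assumed here, and the function names and the default cutoff of 0.1 are chosen for this example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_imrs(p_values, fdr_cutoff=0.1):
    """Indices of IMRs whose adjusted p-value is below the cutoff."""
    adjusted = benjamini_hochberg(p_values)
    return [i for i, q in enumerate(adjusted) if q < fdr_cutoff]
```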

[0075] In some embodiments, the IMR may be described as “significant” if its magnitude is above a cutoff or threshold value. In some embodiments, the threshold is at least about 0.0001, about 0.001, about 0.01, about 0.1, about 0.2, about 0.25, about 0.3, about 0.35, about 0.4, about 0.45, about 0.5, about 0.55, about 0.6, about 0.65, about 0.7, about 0.75, about 0.8, about 0.85, about 0.9, about 0.95, about 1.0, about 2.0, about 3.0, about 4.0, about 5.0, about 10.0, about 100.0, or more. In some embodiments, the IMR may be described as “significant” if its magnitude is at least about 1.0. In some embodiments, the IMR may be described as “significant” if its magnitude is at least about 0.1.

[0076] In some embodiments, the IMR may be described as “moderate” if its magnitude is above a lower cutoff or threshold value but is less than some upper cutoff or threshold value. In some embodiments, the lower threshold is at least about 0.0001, about 0.001, about 0.01, about 0.1, about 0.2, about 0.25, about 0.3, about 0.35, about 0.4, about 0.45, about 0.5, about 0.55, about 0.6, about 0.65, about 0.7, about 0.75, about 0.8, about 0.85, about 0.9, about 0.95, about 1.0, about 2.0, about 3.0, about 4.0, about 5.0, about 10.0, about 100.0, or more. In some embodiments, the upper threshold is no more than about 100.0, about 10.0, about 5.0, about 4.0, about 3.0, about 2.0, about 1.0, about 0.95, about 0.9, about 0.85, about 0.8, about 0.75, about 0.7, about 0.65, about 0.6, about 0.55, about 0.5, about 0.45, about 0.4, about 0.35, about 0.3, about 0.25, about 0.2, about 0.15, about 0.1, about 0.01, about 0.001, about 0.0001, or less. The IMR may be described as “moderate” if its magnitude is at least about 0.5 but less than about 1.0.

[0077] In some embodiments, the IMR may be described as “weak” if its magnitude is below a cutoff or threshold value. In some embodiments, the threshold is no more than about 100.0, about 10.0, about 5.0, about 4.0, about 3.0, about 2.0, about 1.0, about 0.95, about 0.9, about 0.85, about 0.8, about 0.75, about 0.7, about 0.65, about 0.6, about 0.55, about 0.5, about 0.45, about 0.4, about 0.35, about 0.3, about 0.25, about 0.2, about 0.15, about 0.1, about 0.01, about 0.001, about 0.0001, or less. The IMR may be described as “weak” if its magnitude is less than about 0.1.
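
As a non-limiting illustration, the direction and strength labels described above may be assigned from a log odds ratio. The function name `classify_imr` and the label "unlabeled" are assumptions of this sketch; the default thresholds follow the example values of about 1.0, about 0.5, and about 0.1, which leave magnitudes between about 0.1 and about 0.5 without a named category.

```python
def classify_imr(log_or, strong=1.0, moderate=0.5, weak=0.1):
    """Label an IMR by its sign and by the magnitude of its logOR."""
    if log_or > 0:
        direction = "positive"      # tend to occur together
    elif log_or < 0:
        direction = "negative"      # tend not to occur together
    else:
        direction = "independent"   # no association
    magnitude = abs(log_or)
    if magnitude >= strong:
        strength = "significant"
    elif magnitude >= moderate:
        strength = "moderate"
    elif magnitude < weak:
        strength = "weak"
    else:
        # Between `weak` and `moderate`: no label is given in the text.
        strength = "unlabeled"
    return direction, strength
```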

[0078] Methods and systems as described herein may take as inputs or process one or more IMRs. IMRs may be used in methods of determining the effect of sequence modification on glycosylation, methods of modifying glycopeptides, methods of modifying other biomolecules, and any combination thereof. Methods and systems which take as input or process one or more IMRs may comprise an operation on the one or more IMRs. In some embodiments, the operation may comprise calculating a norm of two IMRs. In some embodiments, the norm is an Lp-norm. In some embodiments, the norm comprises a Euclidean norm (also referred to herein as a “Euclidean distance” or “L2-norm”). In some embodiments, IMRs may be expressed or encoded in one or more substitution matrices (i.e., BLAMO). Additional example substitution matrices may comprise a point accepted mutation (PAM) matrix, a block substitution matrix (BLOSUM), or any combination or variant thereof.

[0079] In some embodiments, an IMR comprises an IMR as listed in Table 2. In some embodiments, an IMR comprises an IMR as listed in Table 3. The first column of Table 2 and Table 3, “Protein_Structure,” specifies a glycan-associated protein structure. The prefix for the protein structure variables is as follows: "<glycan_type>_<seq/struc>_<measurement>_". The “glycan_type” will be shown as “ASN” (indicating an N-glycan associated protein structure) or “SER.THR” (indicating an O-glycan associated protein structure). The "seq_" and "struc_" prefixes indicate whether the measure was derived from sequence or structure, respectively. The "measurement" term indicates the type of protein structure descriptor measurement. The “measurement” can refer to proximal amino acids as “_aa_” (occurrences of a specific amino acid within n Angstroms of a glycosite), “_aaUp_” (occurrences of a specific amino acid within n positions of a glycosite upstream), “_aaDown_” (occurrences of a specific amino acid within n positions of a glycosite downstream), or “_aaAll_” (occurrences of a specific amino acid within n positions of a glycosite upstream or downstream). When the measurement refers to amino acids, the suffix will be “_<aa>” where aa is a one-letter expression of the proximal amino acid. The “measurement” can refer to secondary structure as “_SS_” (secondary structure predicted either from sequence or 3D structure). When the “measurement” refers to secondary structure, secondary structure can be predicted from sequence using the sspro or sspro8 tools in the SCRATCH protein prediction software, indicated by the suffix “_sspro<ss>” or “_sspro8<ss>” where ss is a secondary structure output by sspro8 (H: alpha-helix, G: 3-10 helix, I: pi-helix (extremely rare), E: extended strand, B: beta-bridge, T: turn, S: bend, C: the rest) or sspro (H: helix, E: strand, C: the rest). Secondary structure can also be predicted from 3D structure using DSSP software, indicated by the suffix “_dssp<ss>” where ss is a secondary structure output by dssp (H: Alpha helix, B: Beta bridge, E: Strand, G: Helix-3, I: Helix-5, T: Turn, S: Bend). The “measurement” can refer to hydrophobicity (Kyte-Doolittle hydrophobicity based on 7 glycosite flanking amino acids), indicated by the “_hydrophobicity_kd” suffix. The “measurement” can refer to alpha-carbon or whole-residue depth as “_CA_depth_” or “_RES_depth_” respectively. When the “measurement” refers to depth, depth can be predicted from 3D structure using MSMS software, indicated by the “_msms” suffix. The “measurement” can refer to relative and absolute solvent accessibility and solvent-accessible surface area as “_RSA_” or “_ASA_” respectively. When the “measurement” refers to absolute solvent accessibility, absolute solvent-accessible surface area (ASA) can be predicted from 3D protein structure using the FreeSASA software, indicated by the “_[asa]_freesasa_het” suffix where asa specifies either all, backbone, polar, apolar, or residue accessibility; ASA can also be calculated from 3D protein structure using the DSSP software, indicated by the “_dssp” suffix. When the “measurement” refers to relative solvent accessibility, relative accessibility (RSA) can be predicted from 3D protein structure using the FreeSASA software, indicated by the “_[rsa]_freesasa_het” suffix where rsa specifies either all, backbone, polar, apolar, or residue accessibility; RSA can also be calculated from 3D protein structure using the DSSP software, indicated by the “_dssp” suffix. RSA can also be predicted from protein sequence using ACCPro or ACCPro8 tools from the SCRATCH software package, indicated by the “_accpro[20]” suffix. The “measurement” can refer to Phi and Psi bond angles of the glycosite residue as “_PHI_” or “_PSI_” respectively. When the “measurement” refers to bond angles, bond angles can be predicted from 3D protein structure using the DSSP software, indicated by the “_dssp” suffix. LogOR ranges in the second column of Table 2 and Table 3 are denoted by the first two digits of the largest magnitude number bounding the range. Specifically, -Inf denotes [-Inf,-4.75], -4.7 denotes (-4.75,-4.26], -4.2 denotes (-4.26,-3.72], -3.7 denotes (-3.72,-3.09], -3.0 denotes (-3.09,-2.33], -2.3 denotes (-2.33,-1.28], -1.2 denotes (-1.28,0], 1.2 denotes (0,1.28], 2.3 denotes (1.28,2.33], 3.0 denotes (2.33,3.09], 3.7 denotes (3.09,3.72], 4.2 denotes (3.72,4.26], 4.7 denotes (4.26,4.75], and Inf denotes (4.75, Inf].
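
As a non-limiting illustration, the logOR range labels listed above may be looked up from a numeric logOR with a binary search over the interval bounds. The names `EDGES`, `LABELS`, and `log_or_bin` are hypothetical; the interval bounds and labels are those listed in the text, with each interval closed on its upper bound.

```python
from bisect import bisect_left

# Upper bounds of the right-closed intervals listed in the text.
EDGES = [-4.75, -4.26, -3.72, -3.09, -2.33, -1.28, 0.0,
         1.28, 2.33, 3.09, 3.72, 4.26, 4.75]

# One label per interval; there is one more label than there are edges.
LABELS = ["-Inf", "-4.7", "-4.2", "-3.7", "-3.0", "-2.3", "-1.2",
          "1.2", "2.3", "3.0", "3.7", "4.2", "4.7", "Inf"]


def log_or_bin(log_or):
    """Return the table label of the interval containing `log_or`."""
    # bisect_left places a value equal to an edge in the interval
    # that ends at that edge, matching the "(lower, upper]" notation.
    return LABELS[bisect_left(EDGES, log_or)]
```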

Databases

[0080] Methods and systems described herein may comprise one or more databases. The methods and systems described herein may comprise at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, or more databases. The databases may comprise genomic, proteomic, glycomic, biological (e.g., protein sequence databases, protein structure databases, protein model databases, nucleic acid sequence databases, nucleic acid structure databases, nucleic acid model databases), biomedical, or scientific databases. The databases may comprise publicly available databases. Alternatively or additionally, the databases may comprise proprietary databases. The databases may comprise commercially available databases. The databases may include, but are not limited to, UniCarbKB, GlyConnect, and the Protein Data Bank (PDB).

[0081] Databases as described herein may comprise one or more sequences. The sequences may comprise reference or variant sequences. Databases as described herein may comprise one or more glycosylation features. Databases as described herein may comprise one or more measures of association between sequences or parts thereof (e.g., glycosites, amino acids structurally and/or sequence proximal to glycosites) and glycosylation features. The one or more measures of association may comprise intermolecular relations (IMRs) as described elsewhere herein.

[0082] The methods disclosed herein may comprise analyzing information contained in one or more databases. The methods disclosed herein may comprise analyzing information contained in at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30, or more databases. Analyzing the information in the one or more databases may comprise one or more algorithms (e.g., trained algorithms), computers, processors, memory locations, devices, or a combination thereof.

Trained Algorithms

[0083] Methods and systems as described herein may employ one or more trained algorithms. The trained algorithm(s) may process or operate on one or more datasets comprising information about biomolecules (e.g., biomolecular features), glycans and glycosylation features, or any combination thereof. In some embodiments, the glycan embeddings (e.g., GlyCompare, SweetNet) that go into the algorithms that generate the IMRs are derived from empirical observations. In some embodiments, the datasets comprise structural or sequence information about biomolecules. In some embodiments, the datasets comprise one or more datasets of glycosylation features. The one or more datasets may be observed empirically, derived from computational studies, be derived from or contained in one or more databases, or any combination thereof.

[0084] The trained algorithm may comprise an unsupervised machine learning algorithm. The trained algorithm may comprise a supervised machine learning algorithm. The trained algorithm may comprise a semi-supervised machine learning algorithm. The trained algorithm may comprise a classification and regression tree (CART) algorithm. The supervised machine learning algorithm may comprise, for example, a Random Forest, a support vector machine (SVM), a neural network, or a deep learning algorithm. The trained algorithm may comprise a self-supervised machine learning algorithm.

[0085] In some embodiments, a machine learning algorithm (or software module) of a platform as described herein utilizes one or more neural networks. In some embodiments, a neural network is a type of computational system that can learn the relationships between an input dataset and a target dataset. A neural network may be a software representation of a human neural system (e.g., cognitive system), intended to capture “learning” and “generalization” abilities as used by a human. In some embodiments, the machine learning algorithm (or software module) comprises a neural network comprising a convolutional neural network (CNN). Nonlimiting examples of structural components of embodiments of the machine learning software described herein include: CNNs, recurrent neural networks (RNNs), dilated CNNs, fully-connected neural networks, deep generative models, RNNs using long short-term memory (LSTM) units, and Boltzmann machines.

[0086] In some embodiments, a neural network comprises a series of layers of nodes termed “neurons.” In some embodiments, a neural network comprises an input layer, to which data is presented; one or more internal, and/or “hidden”, layers; and an output layer. A neuron may be connected to neurons in other layers via connections that have weights, which are parameters that control the strength of the connection. The number of neurons in each layer may be related to the complexity of the problem to be solved. The minimum number of neurons required in a layer may be determined by the problem complexity, and the maximum number may be limited by the ability of the neural network to generalize. The input neurons may receive data being presented and then transmit that data to the first hidden layer through the connections’ weights, which are modified during training. The first hidden layer may process the data and transmit its result to the next layer through a second set of weighted connections. Each subsequent layer may “pool” the results from the previous layers into more complex relationships. In addition, whereas conventional software programs require writing specific instructions to perform a function, neural networks are programmed by training them with a known sample set and allowing them to modify themselves during (and after) training so as to provide a desired output such as an output value. After training, when a neural network is presented with new input data, it is configured to generalize what was “learned” during training and apply it to the new, previously unseen input data in order to generate an output associated with that input.

[0087] In some embodiments, the neural network comprises artificial neural networks (ANNs). ANNs may be machine learning algorithms that may be trained to map an input dataset to an output dataset, where the ANN comprises an interconnected group of nodes organized into multiple layers of nodes. For example, the ANN architecture may comprise at least an input layer, one or more hidden layers, and an output layer. The ANN may comprise any total number of layers, and any number of hidden layers, where the hidden layers function as trainable feature extractors that allow mapping of a set of input data to an output value or set of output values. As used herein, a deep learning algorithm (such as a deep neural network (DNN)) is an ANN comprising a plurality of hidden layers, e.g., two or more hidden layers. Each layer of the neural network may comprise a number of nodes (or “neurons”). A node receives input that comes either directly from the input data or the output of nodes in previous layers, and performs a specific operation, e.g., a summation operation. A connection from an input to a node is associated with a weight (or weighting factor). The node may sum up the products of all pairs of inputs and their associated weights. The weighted sum may be offset with a bias. The output of a node or neuron may be gated using a threshold or activation function. The activation function may be a linear or non-linear function. The activation function may be, for example, a rectified linear unit (ReLU) activation function, a Leaky ReLU activation function, or other function such as a saturating hyperbolic tangent, identity, binary step, logistic, arctan, softsign, parametric rectified linear unit, exponential linear unit, softplus, bent identity, softexponential, sinusoid, sinc, Gaussian, or sigmoid function, or any combination thereof.
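
As a non-limiting illustration, the operation of a single node (a weighted sum of inputs, offset by a bias and gated by an activation function) may be sketched as follows. The function names are chosen for this example only.

```python
import math


def relu(x):
    """Rectified linear unit activation."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-x))


def node_output(inputs, weights, bias, activation=relu):
    """Sum the input-weight products, add the bias, apply the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)
```
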
[0088] The weighting factors, bias values, and threshold values, or other computational parameters of the neural network, may be “taught” or “learned” in a training phase using one or more sets of training data. For example, the parameters may be trained using the input data from a training dataset and a gradient descent or backward propagation method so that the output value(s) that the ANN computes are consistent with the examples included in the training dataset.
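
As a non-limiting illustration, learning a weight and a bias by gradient descent may be sketched for a single linear node trained on paired input and output examples. The function name, the mean squared error loss, the learning rate, and the number of epochs are assumptions of this sketch.

```python
def train_linear_node(samples, learning_rate=0.1, epochs=200):
    """Fit y ~ w * x + b by batch gradient descent on mean squared error.

    `samples` is a list of (x, y) training pairs.
    """
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b
```

In a multi-layer network, the same update is applied to every weight and bias, with the gradients obtained by backward propagation through the layers.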

[0089] The number of nodes used in the input layer of the ANN or DNN may be at least about 10, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1,000, 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000, 100,000, or greater. In some instances, the number of nodes used in the input layer may be at most about 100,000, 90,000, 80,000, 70,000, 60,000, 50,000, 40,000, 30,000, 20,000, 10,000, 9,000, 8,000, 7,000, 6,000, 5,000, 4,000, 3,000, 2,000, 1,000, 900, 800, 700, 600, 500, 400, 300, 200, 100, 50, 10, or less. In some instances, the total number of layers used in the ANN or DNN (including input and output layers) may be at least about 3, 4, 5, 10, 15, 20, or greater. In some instances, the total number of layers may be at most about 20, 15, 10, 5, 4, 3, or less.

[0090] In some instances, the total number of learnable or trainable parameters, e.g., weighting factors, biases, or threshold values, used in the ANN or DNN may be at least about 10, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1,000, 2,000, 3,000, 4,000, 5,000, 6,000, 7,000, 8,000, 9,000, 10,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000, 100,000, or greater. In some instances, the number of learnable parameters may be at most about 100,000, 90,000, 80,000, 70,000, 60,000, 50,000, 40,000, 30,000, 20,000, 10,000, 9,000, 8,000, 7,000, 6,000, 5,000, 4,000, 3,000, 2,000, 1,000, 900, 800, 700, 600, 500, 400, 300, 200, 100, 50, 10, or less.

[0091] In some embodiments of a machine learning software module as described herein, a machine learning software module comprises a neural network such as a deep CNN. In some embodiments in which a CNN is used, the network is constructed with any number of convolutional layers, dilated layers, or fully-connected layers. In some embodiments, the number of convolutional layers is between 1-10 and the dilated layers between 0-10. The total number of convolutional layers (including input and output layers) may be at least about 1, 2, 3, 4, 5, 10, 15, 20, or greater, and the total number of dilated layers may be at least about 1, 2, 3, 4, 5, 10, 15, 20, or greater. The total number of convolutional layers may be at most about 20, 15, 10, 5, 4, 3, or less, and the total number of dilated layers may be at most about 20, 15, 10, 5, 4, 3, or less. In some embodiments, the number of convolutional layers is between 1-10 and the fully-connected layers between 0-10. The total number of convolutional layers (including input and output layers) may be at least about 1, 2, 3, 4, 5, 10, 15, 20, or greater, and the total number of fully-connected layers may be at least about 1, 2, 3, 4, 5, 10, 15, 20, or greater. The total number of convolutional layers may be at most about 20, 15, 10, 5, 4, 3, 2, 1, or less, and the total number of fully-connected layers may be at most about 20, 15, 10, 5, 4, 3, 2, 1, or less.

[0092] In some embodiments, the input data for training of the ANN may comprise a variety of input values depending on whether the machine learning algorithm is used for processing sequence or structural data. In general, the ANN or deep learning algorithm may be trained using one or more training datasets comprising the same or different sets of input and paired output data.

[0093] In some embodiments, a machine learning software module comprises a neural network comprising a CNN, RNN, dilated CNN, fully-connected neural networks, deep generative models and deep restricted Boltzmann machines.

[0094] In some embodiments, a machine learning algorithm comprises CNNs. The CNNs may be deep, feedforward ANNs. The CNN may be applicable to analyzing visual imagery. The CNN may comprise an input layer, an output layer, and multiple hidden layers. The hidden layers of a CNN may comprise convolutional layers, pooling layers, fully-connected layers, and normalization layers. The layers may be organized in 3 dimensions: width, height, and depth.

[0095] The convolutional layers may apply a convolution operation to the input and pass results of the convolution operation to the next layer. For processing images, the convolution operation may reduce the number of free parameters, allowing the network to be deeper with fewer parameters. In neural networks, each neuron may receive input from some number of locations in the previous layer. In a convolutional layer, neurons may receive input from only a restricted subarea of the previous layer. The convolutional layer's parameters may comprise a set of learnable filters (or kernels). The learnable filters may have a small receptive field and extend through the full depth of the input volume. During the forward pass, each filter may be convolved across the width and height of the input volume, compute the dot product between the entries of the filter and the input, and produce a two-dimensional activation map of that filter. As a result, the network may learn filters that activate when it detects some specific type of feature at some spatial position in the input.
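
As a non-limiting illustration, the forward pass of one filter over a single-channel two-dimensional input may be sketched as follows. The function name, the absence of padding, and the stride of one are assumptions of this sketch; as is common in CNN implementations, the filter is applied without flipping (a cross-correlation).

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` and return the activation map.

    Assumptions: no padding ("valid" mode), stride 1, and a single
    input channel, so the filter depth is 1.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    activation_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Dot product between the filter and the patch under it.
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        activation_map.append(row)
    return activation_map
```

Each output entry depends only on a small patch of the input, which is the restricted receptive field described above.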

[0096] In some embodiments, the pooling layers comprise global pooling layers. The global pooling layers may combine the outputs of neuron clusters at one layer into a single neuron in the next layer. For example, max pooling layers may use the maximum value from each of a cluster of neurons in the prior layer; and average pooling layers may use the average value from each of a cluster of neurons at the prior layer.
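
As a non-limiting illustration, max pooling and average pooling over clusters of neurons may be sketched as follows. The function name and the use of non-overlapping square clusters are assumptions of this sketch.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Pool non-overlapping size x size clusters of a 2D feature map.

    `mode` is "max" for max pooling; any other value gives average pooling.
    """
    if mode == "max":
        combine = max
    else:
        def combine(values):
            return sum(values) / len(values)
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            cluster = [feature_map[r + i][c + j]
                       for i in range(size) for j in range(size)]
            row.append(combine(cluster))
        pooled.append(row)
    return pooled
```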

[0097] In some embodiments, the fully-connected layers connect every neuron in one layer to every neuron in another layer. In neural networks, each neuron may receive input from some number of locations in the previous layer. In a fully-connected layer, each neuron may receive input from every element of the previous layer.

[0098] In some embodiments, the normalization layer is a batch normalization layer. The batch normalization layer may improve the performance and stability of neural networks. The batch normalization layer may provide any layer in a neural network with inputs that are zero mean/unit variance. The advantages of using batch normalization layer may include faster trained networks, higher learning rates, easier to initialize weights, more activation functions viable, and simpler process of creating deep networks.
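For illustration, the zero mean/unit variance rescaling mentioned above can be sketched for a single feature across one batch. The function name `batch_normalize`, the stabilizing constant `eps`, and the example batch are assumptions; the learnable scale and shift parameters that batch normalization layers typically include are omitted from this sketch.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Rescale one feature across a batch to roughly zero mean and unit variance."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    # eps guards against division by zero when the batch has no spread.
    return [(x - mean) / math.sqrt(variance + eps) for x in batch]


# Hypothetical activations of one neuron over a batch of four samples.
normalized = batch_normalize([2.0, 4.0, 6.0, 8.0])
print(normalized)
```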

[0099] In some embodiments, a machine learning software module comprises a recurrent neural network software module. A recurrent neural network software module may be configured to receive sequential data as an input, such as consecutive data inputs, and the recurrent neural network software module updates an internal state at every time step. A recurrent neural network can use internal state (memory) to process sequences of inputs. The recurrent neural network may be applicable to tasks such as handwriting recognition or speech recognition. The recurrent neural network may also be applicable to next word prediction, music composition, image captioning, time series anomaly detection, machine translation, scene labeling, and stock market prediction. A recurrent neural network may comprise fully recurrent neural network, independently recurrent neural network, Elman networks, Jordan networks, Echo state, neural history compressor, long short-term memory, gated recurrent unit, multiple timescales model, neural Turing machines, differentiable neural computer, and neural network pushdown automata.

[0100] In some embodiments, a machine learning software module comprises a supervised or unsupervised learning method such as, for example, support vector machines (“SVMs”), random forests, clustering algorithm (or software module), gradient boosting, logistic regression, and/or decision trees. The supervised learning algorithms may be algorithms that rely on the use of a set of labeled, paired training data examples to infer the relationship between an input data and output data. The unsupervised learning algorithms may be algorithms used to draw inferences from training datasets to the output data. The unsupervised learning algorithm may comprise cluster analysis, which may be used for exploratory data analysis to find hidden patterns or groupings in process data. One example of an unsupervised learning method may comprise principal component analysis. The principal component analysis may comprise reducing the dimensionality of one or more variables. The dimensionality of a given variable may be at least 1, 5, 10, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1,000, 1,100, 1,200, 1,300, 1,400, 1,500, 1,600, 1,700, 1,800, or greater. The dimensionality of a given variable may be at most 1,800, 1,700, 1,600, 1,500, 1,400, 1,300, 1,200, 1,100, 1,000, 900, 800, 700, 600, 500, 400, 300, 200, 100, 50, 10, or less.

[0101] In some embodiments, the machine learning algorithm may comprise reinforcement learning algorithms. The reinforcement learning algorithm may be used for optimizing Markov decision processes (i.e., mathematical models used for studying a wide range of optimization problems where future behavior cannot be accurately predicted from past behavior alone, but rather also depends on random chance or probability). One example of reinforcement learning may be Q-learning. Reinforcement learning algorithms may differ from supervised learning algorithms in that correct training data input/output pairs are never presented, nor are sub-optimal actions explicitly corrected. The reinforcement learning algorithms may be implemented with a focus on real-time performance through finding a balance between exploration of possible outcomes (e.g., correct compound identification) based on updated input data and exploitation of past training.
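As a non-limiting illustration, one update step of tabular Q-learning may be sketched as follows. The function name `q_update`, the learning rate `alpha`, the discount factor `gamma`, and the two-state table are assumptions made for this example and do not appear in the disclosure.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q-value."""
    # Value of the best action available from the next state.
    best_next = max(q_table[next_state].values())
    # Temporal-difference target: immediate reward plus discounted future value.
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])
    return q_table[state][action]


# Hypothetical two-state, two-action table.
q_table = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 1.0},
}
new_value = q_update(q_table, "s0", "right", reward=0.0, next_state="s1")
print(new_value)
```

No labeled input/output pair is supplied in this sketch; the value estimate moves toward a target built from the observed reward and the current estimate for the next state.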

[0102] In some embodiments, training data resides in a cloud-based database that is accessible from local and/or remote computer systems on which the machine learning-based sensor signal processing algorithms are running. The cloud-based database and associated software may be used for archiving electronic data, sharing electronic data, and analyzing electronic data. In some embodiments, training data generated locally may be uploaded to a cloud-based database, from which it may be accessed and used to train other machine learning-based detection systems at the same site or a different site.

[0103] The trained algorithm may accept a plurality of input variables and produce one or more output variables based on the plurality of input variables. The input variables may comprise one or more datasets indicative of a glycosylation feature. For example, the input variables may comprise glycoprotein sequences, data indicative of glycoprotein structure, data indicative of one or more glycosylation features, or any combination thereof.

[0104] The trained algorithm may be trained with a plurality of independent training samples. Each of the independent training samples may comprise a glycosylation feature and a sequence (e.g., a sequence or sequon comprising a glycosite, and optionally one or more structurally proximal and/or sequence proximal amino acids). The trained algorithm may be trained with at least about 5, at least about 10, at least about 15, at least about 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 100, at least about 150, at least about 200, at least about 250, at least about 300, at least about 350, at least about 400, at least about 450, at least about 500, at least about 1,000, at least about 1,500, at least about 2,000, at least about 2,500, at least about 3,000, at least about 3,500, at least about 4,000, at least about 4,500, at least about 5,000, at least about 5,500, at least about 6,000, at least about 6,500, at least about 7,000, at least about 7,500, at least about 8,000, at least about 8,500, at least about 9,000, at least about 9,500, at least about 10,000, or more independent training samples.

[0105] The trained algorithm may associate a sequence (e.g., a sequence or sequon comprising a glycosite, and optionally one or more structurally proximal and/or sequence proximal amino acids) or glycosite with a glycosylation feature at an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more. The accuracy of associating the sequence or glycosite and the glycosylation feature by the trained algorithm may be calculated as the percentage of independent test samples (e.g., sequences or glycosites known to be associated with the glycosylation feature or known not to be associated with the glycosylation feature) that are correctly associated or not associated.

[0106] The trained algorithm may associate the sequence (e.g., a sequence or sequon comprising a glycosite, and optionally one or more structurally proximal and/or sequence proximal amino acids) or glycosite with a glycosylation feature with a positive predictive value (PPV) of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more. The PPV of associating the sequence or glycosite and the glycosylation feature using the trained algorithm may be calculated as the percentage of sequences or glycosites classified as being associated with the glycosylation feature that truly are associated with the glycosylation feature.

[0107] The trained algorithm may associate the sequence (e.g., a sequence or sequon comprising a glycosite, and optionally one or more structurally proximal and/or sequence proximal amino acids) or glycosite with a glycosylation feature with a negative predictive value (NPV) of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more. The NPV of associating the sequence or glycosite with a glycosylation feature using the trained algorithm may be calculated as the percentage of sequences or glycosites identified or classified as not being associated with the glycosylation feature that truly are not associated with the glycosylation feature.

[0108] The trained algorithm may associate the sequence (e.g., a sequence or sequon comprising a glycosite, and optionally one or more structurally proximal and/or sequence proximal amino acids) or glycosite with a glycosylation feature with a sensitivity of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, at least about 99.1%, at least about 99.2%, at least about 99.3%, at least about 99.4%, at least about 99.5%, at least about 99.6%, at least about 99.7%, at least about 99.8%, at least about 99.9%, at least about 99.99%, at least about 99.999%, or more. The sensitivity of associating the sequence or glycosite with the glycosylation feature using the trained algorithm may be calculated as the percentage of independent test samples associated with the glycosylation feature (e.g., sequences or glycosites known to be associated with the glycosylation feature) that are correctly identified or classified as being associated with the glycosylation feature.

[0109] The trained algorithm may be configured to associate the sequence (e.g., a sequence or sequon comprising a glycosite, and optionally one or more structurally proximal and/or sequence proximal amino acids) or glycosite with the glycosylation feature with a specificity of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, at least about 99.1%, at least about 99.2%, at least about 99.3%, at least about 99.4%, at least about 99.5%, at least about 99.6%, at least about 99.7%, at least about 99.8%, at least about 99.9%, at least about 99.99%, at least about 99.999%, or more. The specificity of associating the glycosylation feature using the trained algorithm may be calculated as the percentage of independent test samples associated with an absence of the glycosylation feature (e.g., sequences or glycosites known to not be associated with the glycosylation feature) that are correctly identified or classified as not associated with the glycosylation feature.
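By way of example only, the accuracy, PPV, NPV, sensitivity, and specificity defined in the preceding paragraphs can be computed from a set of independent test samples as sketched below. The function names, the binary label encoding (1 for associated with the glycosylation feature, 0 for not associated), and the example labels are assumptions for illustration. Values are returned as fractions rather than percentages.

```python
def _ratio(numerator, denominator):
    # Return None rather than dividing by zero when a class is empty.
    return numerator / denominator if denominator else None


def classification_metrics(y_true, y_pred):
    """Compute the five metrics from known labels and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


# Hypothetical test set of ten glycosites.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```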

[0110] The trained algorithm may be configured to associate the sequence (e.g., a sequence or sequon comprising a glycosite, and optionally one or more structurally proximal and/or sequence proximal amino acids) or glycosite with the glycosylation feature with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.81, at least about 0.82, at least about 0.83, at least about 0.84, at least about 0.85, at least about 0.86, at least about 0.87, at least about 0.88, at least about 0.89, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more. The AUC may be calculated as an integral of the Receiver Operating Characteristic (ROC) curve (e.g., the area under the ROC curve) associated with the trained algorithm in classifying datasets derived from a sequence or glycosite as being associated or not associated with the glycosylation feature.
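As a minimal sketch, the area under the ROC curve may be obtained without explicitly tracing the curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. The function name `roc_auc` and the example labels and scores are assumptions for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Each positive/negative pair contributes 1 if ranked correctly, 0.5 if tied.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for six glycosites (1 = feature present).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch; a sort-based computation would be preferable for large test sets.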

[0111] The trained algorithm may be adjusted or tuned to improve one or more of the performance, accuracy, PPV, NPV, sensitivity, specificity, or AUC of associating the glycosylation feature. The trained algorithm may be adjusted or tuned by adjusting parameters of the trained algorithm (e.g., a set of cutoff values used to associate a glycosylation feature as described elsewhere herein, or weights of a neural network). The trained algorithm may be adjusted or tuned continuously during the training process or after the training process has completed.

[0112] After the trained algorithm is initially trained, a subset of the inputs may be identified as most influential or most important to be included for making high-quality predictions. For example, a subset of the data may be identified as most influential or most important to be included for making high-quality associations of sequences (e.g., a sequence or sequon comprising a glycosite, and optionally one or more structurally proximal and/or sequence proximal amino acids) or glycosites and glycosylation features. The data or a subset thereof may be ranked based on classification metrics indicative of each parameter’s influence or importance toward making high-quality associations of glycosylation features with sequences or glycosites. Such metrics may be used to reduce, in some embodiments significantly, the number of input variables (e.g., predictor variables) that may be used to train the trained algorithm to a desired performance level (e.g., based on a desired minimum accuracy, PPV, NPV, sensitivity, specificity, AUC, or a combination thereof). 
For example, if training the trained algorithm with a plurality comprising several dozen or hundreds of input variables in the trained algorithm results in an accuracy of classification of more than 99%, then training the trained algorithm instead with only a selected subset of no more than about 5, no more than about 10, no more than about 15, no more than about 20, no more than about 25, no more than about 30, no more than about 35, no more than about 40, no more than about 45, no more than about 50, or no more than about 100 such most influential or most important input variables among the plurality can yield decreased but still acceptable accuracy of classification (e.g., at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%). The subset may be selected by rank-ordering the entire plurality of input variables and selecting a predetermined number (e.g., no more than about 5, no more than about 10, no more than about 15, no more than about 20, no more than about 25, no more than about 30, no more than about 35, no more than about 40, no more than about 45, no more than about 50, or no more than about 100) of input variables with the best association metrics.
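For illustration only, rank-ordering the input variables and keeping a predetermined number with the best metrics may be sketched as follows. The function name `select_top_variables`, the variable names, and the importance scores are hypothetical and are not taken from the disclosure.

```python
def select_top_variables(importance, k):
    """Rank input variables by importance (highest first) and keep the top k names."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores for five candidate input variables.
importance = {
    "residue_plus_1_hydrophobicity": 0.31,
    "residue_minus_2_identity": 0.12,
    "secondary_structure": 0.27,
    "solvent_accessibility": 0.05,
    "residue_plus_3_identity": 0.25,
}
selected = select_top_variables(importance, 3)
print(selected)
```

The trained algorithm would then be retrained on the selected subset and its accuracy compared against the desired performance level.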

[0113] Systems and methods as described herein may use more than one trained algorithm to determine an output (e.g., association of a sequence or glycosite and glycosylation feature). Systems and methods may comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more trained algorithms. A trained algorithm of the plurality of trained algorithms may be trained on a particular type of data (e.g., sequence data, structural data). Alternatively, a trained algorithm may be trained on more than one type of data. The inputs of one trained algorithm may comprise the outputs of one or more other trained algorithms. Additionally, a trained algorithm may receive as its input the output of one or more trained algorithms.

Methods and systems for determining whether a glycan will be found at a glycosite

[0114] Methods and systems as described herein may determine the likelihood that one or more glycans from a plurality of candidate glycans will be found at a glycosite of a sequence (or sequon). Methods and systems may comprise providing the sequences and the plurality of candidate glycans. The methods and systems may further comprise applying a trained algorithm such as those described herein to calculate a predicted presence for one or more of the plurality of glycans at the glycosite of the sequence. The calculation may be determined based on one or more amino acids in the sequence. The methods and systems may further comprise processing (e.g., with a computer) the one or more predicted presences of the one or more glycans of the plurality of glycans, thereby determining the likelihood that the one or more glycans will be found at the glycosite of the sequence.

[0115] The predicted presence of the glycosylation feature at a glycosite in a sequence may be based on glycosylation feature structure, glycosylation feature composition, glycosylation feature length, glycosylation feature branching, sequence length, sequence composition, position of a monomer in the sequence, substitution of one or more monomers in the sequence, insertion of one or more monomers in the sequence, deletion of one or more monomers in the sequence, observed or predicted sequence secondary structure, observed or predicted sequence tertiary structure, observed or predicted sequence quaternary structure, or any combination thereof. The predicted presence may be based on a feature of the reference sequence. The predicted presence may be determined by the position of one or more amino acids in the sequence.

[0116] The likelihood of the presence or absence of the one or more glycosylation features may be determined by applying a trained algorithm as described herein to the plurality of sequences. The trained algorithm may determine a likelihood of the one or more glycosylation features being associated with a glycosite in the reference sequences. The trained algorithm may additionally determine a likelihood of the one or more glycosylation features being associated with the corresponding glycosites in the one or more variant sequences. The effect on a reference sequence may be described categorically (e.g., positive, negative, or neutral; increase or decrease), or may be described numerically (e.g., as a difference or ratio of likelihoods). In an example embodiment, the likelihood can be used to indicate predicted presence of the one or more glycosylation features. Alternatively or additionally, the likelihood of the presence or absence of the one or more glycosylation features may be determined by looking up a measure of association (e.g., logarithm of an odds ratio) between the variant sequence and the glycosylation feature and between the reference sequence and the glycosylation feature, such as in the IMR methods provided herein.

[0117] The glycosylation feature may comprise one or more monosaccharides. The glycosylation feature may comprise mannose, sialic acid, fucose, or a combination thereof. The glycosylation feature may comprise a polysaccharide epitope. The glycosylation feature may comprise a glycosylation feature listed in Table 1. In some embodiments, the glycosylation feature is an increase or decrease in a high-mannose in one of the variant sequences as compared to the reference sequence. In some embodiments, the glycosylation feature is an increase or decrease in a sialylation in one of the variant sequences as compared to the reference sequence. In some embodiments, the glycosylation feature is an increase or decrease in a glycosylation feature listed in Table 1.

[0118] In some embodiments, the likelihood may be expressed as a probability. In some embodiments, the likelihood may be expressed as a pseudo-probability. In some embodiments, the likelihood may be expressed as a ratio or product of one or more probabilities or pseudo-probabilities. In some embodiments, the likelihood may be expressed as a sum or difference of one or more probabilities. In some embodiments, the likelihood may be expressed as an odds ratio. In some embodiments, the likelihood may be expressed as the logarithm of an odds ratio.

Methods and systems for determining the effect of sequence modification on glycosylation

[0119] Methods and systems as described herein may determine the effect of a sequence modification on glycosylation. The effect of a sequence modification may be determined by determining the likelihood (e.g., predicted presence) of one or more glycosylation features being present or absent at corresponding sites in a plurality of sequences comprising a reference sequence and one or more variant sequences. Variant sequences may differ from the reference sequence in one or more of length, monomer (e.g., amino acid, nucleotide) identity, predicted or observed secondary structure, predicted or observed tertiary structure, glycosite composition, or glycosite position. Based on the likelihood of the presence or absence of the glycosylation features in the reference sequence as compared to the variant sequences, a determination of the effect may be made.

[0120] For example, the method may comprise providing a plurality of sequences comprising a reference sequence and one or more variant sequences. The one or more variant sequences may differ from the reference sequence in one or more of length, monomer (e.g., amino acid, nucleotide) identity, predicted or observed secondary structure, predicted or observed tertiary structure, glycosite composition, or glycosite position. The variant sequences may have one or more amino acid substitutions compared to the reference sequence. For each of the plurality of sequences, a trained algorithm as described herein may calculate the likelihood (e.g., predicted presence) of a glycosylation feature at the glycosite of the reference and each of the variant sequences, thereby determining the effect of the variation of the reference sequence on glycosylation of the glycosite.

[0121] The predicted presence of the glycosylation feature at a glycosite in a sequence may be based on glycosylation feature structure, glycosylation feature composition, glycosylation feature length, glycosylation feature branching, sequence length, sequence composition, position of a monomer in the sequence, substitution of one or more monomers in the sequence, insertion of one or more monomers in the sequence, deletion of one or more monomers in the sequence, observed or predicted sequence secondary structure, observed or predicted sequence tertiary structure, observed or predicted sequence quaternary structure, or any combination thereof. The predicted presence may be based on a feature of the reference sequence. Alternatively or additionally, the predicted presence may be based on a feature of one or more of the variant sequences. In some embodiments, the predicted presence is based on the identity of one or more amino acid sequences in the variant sequence(s) compared to the reference sequence. In some embodiments, the predicted presence is based on the identity of one or more amino acids in the variant sequence(s) compared to the reference sequence. In some embodiments, the predicted presence is based on the position of one or more amino acids in the variant sequence(s) compared to the reference sequence. The position may be a distance from the glycosite in sequence or three-dimensional space.

[0122] The variant sequences may differ from the reference sequence in amino acid sequence. The difference in amino acid sequence may comprise a difference in one or more amino acid identities or positions. In some embodiments, a variant sequence of the plurality of variant sequences differs from the reference sequence in the identity of an amino acid 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more positions from the glycosite. In some embodiments, the variant sequence differs from the reference sequence in the identity of an amino acid 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, or fewer positions from the glycosite.

[0123] The variant sequence may differ from the reference sequence in length of amino acid sequence. The variant sequence may have one or more insertions or deletions with respect to the reference sequence. The variant sequence may comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more insertions or deletions. The insertions or deletions may all be contiguous, or they may comprise one or more subsets across the sequence of the variant sequence. The insertion or deletion may be proximal to a glycosite of the modified glycopeptide. In some embodiments, the insertion or deletion is within 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more sites of the glycosite. In some embodiments, the insertion or deletion is within no more than 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, or fewer sites of the glycosite. In some embodiments, the insertion or deletion is in a site distal to the glycosite. The likelihood of the presence or absence of the one or more glycosylation features may be determined by applying a trained algorithm as described herein to the plurality of sequences. The trained algorithm may determine a likelihood of the one or more glycosylation features being associated with a glycosite in the reference sequences. The trained algorithm may additionally determine a likelihood of the one or more glycosylation features being associated with the corresponding glycosites in the one or more variant sequences. The effect on a reference sequence may be described categorically (e.g., positive, negative, or neutral; increase or decrease), or may be described numerically (e.g., as a difference or ratio of likelihoods). 
Alternatively, the likelihood of the presence or absence of the one or more glycosylation features may be determined by looking up a measure of association (e.g., logarithm of an odds ratio) between the variant sequence and the glycosylation feature and between the reference sequence and the glycosylation feature.

[0124] The glycosylation feature may comprise one or more monosaccharides. The glycosylation feature may comprise mannose, sialic acid, fucose, high-mannose, hybrid, complexity, core or distally fucosylation, terminal sialylation, terminal galactosylation, terminal GlcNAc-ylation, GlcNAc-bisection, poly-sialylation or a combination thereof. The glycosylation feature may comprise a polysaccharide epitope. In some embodiments, the glycosylation feature may comprise one or more glycosylation features in Table 1. In some embodiments, the glycosylation feature is an increase or decrease in a high-mannose in one of the variant sequences as compared to the reference sequence. In some embodiments, the glycosylation feature is an increase or decrease in a sialylation in one of the variant sequences as compared to the reference sequence. In some embodiments, the glycosylation feature is an increase or decrease in a fucose in one of the variant sequences as compared to the reference sequence. In some embodiments, the glycosylation feature is an increase or decrease in mannose, sialic acid, fucose, high-mannose, hybrid, complexity, core or distally fucosylation, terminal sialylation, terminal galactosylation, terminal GlcNAc-ylation, GlcNAc-bisection, poly-sialylation or a combination thereof in one of the variant sequences as compared to the reference sequence. In some embodiments, the glycosylation feature is an increase or decrease in a glycosylation feature listed in Table 1.

[0125] In some embodiments, the likelihood may be expressed as a probability. In some embodiments, the likelihood may be expressed as a pseudo-probability. In some embodiments, the likelihood may be expressed as a ratio or product of one or more probabilities or pseudo-probabilities. In some embodiments, the likelihood may be expressed as a sum or difference of one or more probabilities. In some embodiments, the likelihood may be expressed as an odds ratio. In some embodiments, the likelihood may be expressed as the logarithm of an odds ratio.
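As a minimal sketch, the relationship between a probability, an odds ratio, and the logarithm of an odds ratio may be illustrated as follows. The function names, the use of the natural logarithm, and the example probabilities are assumptions; the disclosure does not specify a logarithm base.

```python
import math


def odds(probability):
    """Convert a probability strictly between 0 and 1 into odds."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return probability / (1.0 - probability)


def log_odds_ratio(p_variant, p_reference):
    """Natural log of the odds ratio of a feature in the variant versus the reference."""
    return math.log(odds(p_variant) / odds(p_reference))


# Hypothetical probabilities of a glycosylation feature at the glycosite.
p_variant, p_reference = 0.8, 0.5
odds_ratio = odds(p_variant) / odds(p_reference)
log_or = log_odds_ratio(p_variant, p_reference)
print(round(odds_ratio, 3), round(log_or, 3))
```

A positive logarithm of the odds ratio indicates that the feature is more likely in the variant than in the reference, a negative value indicates that it is less likely, and zero indicates no difference.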

[0126] In some embodiments, the method may determine that one or more sequence modifications result in an increase of the glycosylation feature. A sequence variation may be said to result in an increase of the glycosylation feature when the trained algorithm determines that the likelihood (e.g., probability, pseudo-probability) of the glycosylation feature being associated with the variant sequence is greater than the likelihood of the glycosylation feature being associated with the reference sequence. In such cases, the variant likelihood may be arbitrarily greater than the reference likelihood or the variant likelihood may have to be greater than the reference likelihood by some cutoff value (e.g., about 1.00001, about 1.0001, about 1.001, about 1.01, about 1.1, about 1.5, about 2, about 5, about 10, or more). Alternatively or additionally, the determination of an increase of the glycosylation feature may be made by taking the ratio of the likelihoods (i.e., the ratio is the variant likelihood divided by the reference likelihood). In such cases, the effect of the sequence variation may be to result in an increase in the glycosylation feature if the ratio of the variant likelihood to that of the reference likelihood ((variant likelihood)/(reference likelihood)) is greater than 1. The ratio may be arbitrarily greater than one or may be greater than one by some cutoff (e.g., greater than about 1.00001, about 1.0001, about 1.001, about 1.01, about 1.1, about 1.5, about 2, about 5, about 10, or more).

[0127] In some embodiments, the method may determine that one or more sequence modifications result in a decrease of the glycosylation feature. A sequence variation may be said to result in a decrease of the glycosylation feature when the trained algorithm determines that the likelihood (e.g., probability, pseudo-probability) of the glycosylation feature being associated with the variant sequence is less than the likelihood of the glycosylation feature being associated with the reference sequence. In such cases, the variant likelihood may be arbitrarily less than the reference likelihood or the variant likelihood may have to be less than the reference likelihood by some cutoff value (e.g., 10, 1, 0.5, 0.1, 0.01, 0.001, 0.0001, 0.00001, or less). Alternatively or additionally, the determination of a decrease of the glycosylation feature may be made by taking the ratio of the likelihoods (i.e., the variant likelihood divided by the reference likelihood). In such cases, the sequence variation may be determined to result in a decrease in the glycosylation feature if the ratio of the variant likelihood to the reference likelihood is less than 1. The ratio may be arbitrarily less than one or may be less than one by some cutoff (e.g., less than 0.9999, 0.999, 0.99, 0.9, 0.5, 0.1, 0.01, 0.001, 0.0001, 0.00001, or less).
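By way of illustration only, the ratio-based determination of an increase or a decrease described above might be sketched as follows. The function name, the default cutoff, and the use of the reciprocal of the cutoff as the threshold for a decrease are assumptions made for this sketch and are not specified by the disclosure.

```python
def classify_effect(variant_likelihood, reference_likelihood, cutoff=1.0):
    """Classify a sequence variation by the ratio of variant to reference likelihood.

    `cutoff` is a hypothetical ratio threshold (>= 1). A cutoff of 1.0
    corresponds to "arbitrarily greater (or less) than" the reference.
    Using 1 / cutoff as the decrease threshold is an assumption of this sketch.
    """
    if reference_likelihood <= 0:
        raise ValueError("reference likelihood must be positive")
    if cutoff < 1:
        raise ValueError("cutoff must be at least 1")
    ratio = variant_likelihood / reference_likelihood
    if ratio > cutoff:
        return "increase"
    if ratio < 1.0 / cutoff:
        return "decrease"
    return "no change"
```

For example, with a cutoff of 1.5, a variant likelihood of 0.6 against a reference likelihood of 0.3 gives a ratio of 2 and would be classified as an increase.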

[0128] Alternatively or additionally, the effect of a sequence modification may be determined by determining an intermolecular relation (IMR) of the glycosylation feature and the variant sequence as compared to an IMR of the glycosylation feature and the reference sequence. Based on the magnitude of the IMRs, individually or in combination, a determination of the effect may be made. For example, the method may comprise an operation of calculating the Euclidean distance between two IMRs of one glycosylation feature, one for the variant sequence and one for the reference sequence. In some embodiments, the method comprises an operation of calculating the Euclidean distance between two IMR vectors of multiple glycosylation features, one for the variant sequence and one for the reference sequence. In some embodiments, the method comprises an operation of calculating the Euclidean distance of the identity-line-orthogonal component between two points representing four IMRs, where one point represents two IMRs indicating the likelihood of one desired glycosylation feature in the variant sequence (y-coordinate) and in the reference sequence (x-coordinate), and another point represents two IMRs indicating the likelihood of one undesired glycosylation feature, mutually exclusive to the desired glycosylation feature, in the variant sequence (y-coordinate) and in the reference sequence (x-coordinate).
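As a non-limiting sketch, the distance calculations described above might be carried out as shown below. The function names are hypothetical. The reading of the "identity-line-orthogonal component" as the signed distance of a point (reference IMR, variant IMR) from the line y = x is an assumption of this sketch, as is taking the absolute difference of those components for the desired and undesired features.

```python
import math

def euclidean_distance(imrs_a, imrs_b):
    """Euclidean distance between two equal-length IMR vectors."""
    if len(imrs_a) != len(imrs_b):
        raise ValueError("IMR vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(imrs_a, imrs_b)))

def identity_orthogonal_component(reference_imr, variant_imr):
    """Signed distance of the point (reference, variant) from the line y = x.

    Positive values mean the variant IMR exceeds the reference IMR.
    """
    return (variant_imr - reference_imr) / math.sqrt(2)

def orthogonal_separation(desired, undesired):
    """Absolute difference of the orthogonal components of two points.

    Each argument is a (reference_imr, variant_imr) pair for one feature.
    """
    return abs(identity_orthogonal_component(*desired)
               - identity_orthogonal_component(*undesired))
```

Under this reading, a large separation arises when the variant raises the IMR of the desired feature while lowering the IMR of the undesired feature relative to the reference.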

[0129] The IMR of a glycosylation feature and one or more sequences (e.g., variant, reference) may be determined by methods described herein. In some embodiments, the IMR may be determined by a trained algorithm. In some embodiments, the IMR may be determined by a set of generalized estimating equations. In some embodiments, the IMR may be an odds ratio. In some embodiments, the odds ratio may be determined by Fisher’s exact test. In some embodiments, the odds ratio may be determined by logistic regression.
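For illustration, an odds ratio and a Fisher's exact test p-value for a 2x2 contingency table might be computed as in the following sketch, which uses only the Python standard library. The table layout (rows: sequence feature present or absent; columns: glycosylation feature present or absent), the function names, and the 0.5 pseudocount used to avoid division by zero are assumptions of this sketch and are not taken from the disclosure.

```python
import math

def log_odds_ratio(a, b, c, d, pseudocount=0.5):
    """Natural log of the odds ratio for the table [[a, b], [c, d]].

    A pseudocount (Haldane-Anscombe style) is added to every cell so that
    tables containing zeros still give a finite value.
    """
    a, b, c, d = (v + pseudocount for v in (a, b, c, d))
    return math.log((a * d) / (b * c))

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, total)
```

A positive log odds ratio indicates a positive association between the sequence feature and the glycosylation feature, and a negative value indicates a negative association.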

[0130] Further provided herein are methods for determining the effect of a variation of a reference sequence on glycosylation of a glycosite in the reference sequence, the method comprising:

[0131] (a) providing a plurality of sequences comprising (1) the reference sequence and optionally the associated three-dimensional structure, and (2) a plurality of variant sequences, and optionally the associated three-dimensional structures, having one or more amino acid substitutions as compared to the reference sequence; and (b) for each of the plurality of variant sequences, and optionally associated three-dimensional structures: applying a trained algorithm to calculate the predicted presence of a glycosylation feature at the glycosite of each variant sequence and optional associated three-dimensional structure based at least on the amino acid sequence and the optional associated three-dimensional structure of the variant sequence; thereby determining the effect of the variation of the reference sequence on glycosylation of the glycosite. In some embodiments, the glycosylation feature is a specific monosaccharide or a polysaccharide epitope. In some embodiments, the specific monosaccharide is mannose, sialic acid, fucose, D-glucose (Glc), D-galactose (Gal), N-acetylglucosamine (GlcNAc), N-acetylgalactosamine (GalNAc), D-mannose (Man), N-acetylneuraminic acid (Neu5Ac), N-glycolylneuraminic acid (Neu5Gc), neuraminic acid (Neu), 2-keto-3-deoxynononic acid or 3-deoxy-D-glycero-D-galacto-nonulosonic acid (KDN), 3-deoxy-D-manno-2 octulopyranosylonic acid (Kdo), D-galacturonic acid (GalA), L-iduronic acid (IdoA), L-rhamnose (Rha), L-fucose (Fuc), D-xylose (Xyl), D-ribose (Rib), L-arabinofuranose (Araf), D-glucuronic acid (GlcA), D-allose (All), D-apiose (Api), D-fructofuranose (Fruf), ascarylose (Asc), or ribitol (Rbo), or a combination thereof.
In some embodiments, the polysaccharide epitope is high-mannose, sialylation, fucosylation, hybrid, complexity, core or distally fucosylation, terminal sialylation, terminal galactosylation, terminal GlcNAc-ylation, GlcNAc-bisection, or poly-sialylation, or a glycosylation feature listed in Table 1, or a combination thereof. In some embodiments, the glycosylation feature is an increase in high-mannose in the variant sequence as compared to the reference sequence, the glycosylation feature is a decrease in high-mannose in the variant sequence as compared to the reference sequence, the glycosylation feature is an increase in sialylation in the variant sequence as compared to the reference sequence, the glycosylation feature is a decrease in sialylation in the variant sequence as compared to the reference sequence, the glycosylation feature is an increase in fucosylation in the variant sequence as compared to the reference sequence, or the glycosylation feature is a decrease in fucosylation in the variant sequence as compared to the reference sequence. In some embodiments, the predicted presence that the glycosite of each variant sequence will have a glycosylation feature is determined at least based on the identity of one or more amino acid sequences varied as compared to the reference sequence. In some embodiments, the pseudo-probability that the glycosite of each variant sequence will have a glycosylation feature is determined at least based on the position of one or more amino acid sequences varied as compared to the reference sequence. In some embodiments, the position is the distance from the glycosite. In some embodiments, each variant sequence has at least one amino acid substitution as compared to the reference sequence. In some embodiments, each variant sequence has at least two amino acid substitutions as compared to the reference sequence. In some embodiments, the glycosite comprises a glycan-bound arginine, asparagine, serine, threonine, or tyrosine.
In some embodiments, the glycosite further comprises one or more amino acids N-terminal to the glycan-bound arginine, asparagine, serine, threonine, or tyrosine. In some embodiments, the glycosite further comprises one or more amino acids C-terminal to the glycan-bound arginine, asparagine, serine, threonine, or tyrosine. In some embodiments, the sequence of a first variant sequence is comprised within a peptide. In some embodiments, the method comprises administering a therapeutically effective amount of the peptide based at least in part on determining the effect of the variation of the reference sequence on glycosylation of the glycosite.
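As a minimal sketch of steps (a) and (b) above, single-substitution variant sequences near a glycosite might be enumerated and scored as follows. The `predict` argument stands in for a trained algorithm and is a hypothetical placeholder supplied by the caller; the function names, the window of five residues, and the choice to leave the glycosite residue itself unchanged are assumptions of this sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def single_substitution_variants(reference, glycosite_index, window=5):
    """Yield (position, new_residue, variant_sequence) for every single
    substitution within `window` residues of the glycosite.

    The glycosite residue itself is left unchanged (an assumption).
    """
    start = max(0, glycosite_index - window)
    stop = min(len(reference), glycosite_index + window + 1)
    for pos in range(start, stop):
        if pos == glycosite_index:
            continue
        for aa in AMINO_ACIDS:
            if aa != reference[pos]:
                yield pos, aa, reference[:pos] + aa + reference[pos + 1:]

def score_variants(reference, glycosite_index, predict, window=5):
    """Return (position, new_residue, delta) for each variant, where delta
    is the variant's predicted presence minus the reference's.

    `predict(sequence, glycosite_index)` is a placeholder for a trained
    algorithm returning a number; it is not defined by this sketch.
    """
    reference_score = predict(reference, glycosite_index)
    return [
        (pos, aa, predict(seq, glycosite_index) - reference_score)
        for pos, aa, seq in single_substitution_variants(
            reference, glycosite_index, window)
    ]
```

A positive delta would suggest that the substitution increases the predicted presence of the glycosylation feature, and a negative delta would suggest a decrease.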

[0132] Further provided herein are computer systems comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application for determining the effect of a variation of a reference sequence and optionally the associated three-dimensional structure on glycosylation of a glycosite in the reference sequence, the application comprising: a module programmed to, for each of a plurality of variant sequences, and optionally the associated three-dimensional structures, having one or more amino acid substitutions as compared to the reference sequence, apply a trained algorithm to calculate the predicted presence of a glycosylation feature at the glycosite of each variant sequence based at least on the amino acid sequence and the optional associated three-dimensional structure of the variant sequence.

[0133] Further provided herein are non-transitory computer-readable mediums comprising machine-executable code that, upon execution by one or more computer processors, implements a method for determining the effect of a variation of a reference sequence on glycosylation of a glycosite in the reference sequence, the method comprising: (a) providing a plurality of sequon sequences comprising (1) the reference sequence and optionally the associated three-dimensional structure, and (2) a plurality of variant sequences, and optionally the associated three-dimensional structures, having one or more amino acid substitutions as compared to the reference sequence; and (b) for each of the plurality of variant sequences, and optionally associated three-dimensional structures: applying a trained algorithm to calculate the predicted presence of a glycosylation feature at the glycosite of each variant sequence and optional associated three-dimensional structure based at least on the amino acid sequence and the optional associated three-dimensional structure of the variant sequence; thereby determining the effect of the variation of the reference sequence on glycosylation of the glycosite.

[0134] Further provided herein are systems for determining the effect of a variation of a reference sequence and optionally the associated three-dimensional structure on glycosylation of a glycosite in the reference sequence, the system comprising: a database comprising a plurality of sequences comprising (1) the reference sequence and optionally the associated three-dimensional structure, and (2) a plurality of variant sequences, and optionally the associated three-dimensional structures, having one or more amino acid substitutions as compared to the reference sequence; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: for each of the plurality of variant sequences, and optionally associated three-dimensional structures: apply a trained algorithm to calculate the predicted presence of a glycosylation feature at the glycosite of each variant sequence and optionally associated three-dimensional structure based at least on the amino acid sequence and the optional associated three-dimensional structure of the variant sequence; thereby determining the effect of the variation of the reference sequence on glycosylation of the glycosite. In some embodiments, the glycosylation feature is a specific monosaccharide or a polysaccharide epitope.
In some embodiments, the specific monosaccharide is mannose, sialic acid, fucose, D-glucose (Glc), D-galactose (Gal), N-acetylglucosamine (GlcNAc), N-acetylgalactosamine (GalNAc), D-mannose (Man), N-acetylneuraminic acid (Neu5Ac), N-glycolylneuraminic acid (Neu5Gc), neuraminic acid (Neu), 2-keto-3-deoxynononic acid or 3-deoxy-D-glycero-D-galacto-nonulosonic acid (KDN), 3-deoxy-D-manno-2 octulopyranosylonic acid (Kdo), D-galacturonic acid (GalA), L-iduronic acid (IdoA), L-rhamnose (Rha), L-fucose (Fuc), D-xylose (Xyl), D-ribose (Rib), L-arabinofuranose (Araf), D-glucuronic acid (GlcA), D-allose (All), D-apiose (Api), D-fructofuranose (Fruf), ascarylose (Asc), or ribitol (Rbo), or a combination thereof. In some embodiments, the polysaccharide epitope is high-mannose, sialylation, fucosylation, hybrid, complexity, core or distally fucosylation, terminal sialylation, terminal galactosylation, terminal GlcNAc-ylation, GlcNAc-bisection, or poly-sialylation, or a glycosylation feature listed in Table 1, or a combination thereof. In some embodiments, the glycosylation feature is an increase in high-mannose in the variant sequence as compared to the reference sequence, the glycosylation feature is a decrease in high-mannose in the variant sequence as compared to the reference sequence, the glycosylation feature is an increase in sialylation in the variant sequence as compared to the reference sequence, the glycosylation feature is a decrease in sialylation in the variant sequence as compared to the reference sequence, the glycosylation feature is an increase in fucosylation in the variant sequence as compared to the reference sequence, or the glycosylation feature is a decrease in fucosylation in the variant sequence as compared to the reference sequence.
In some embodiments, the pseudo-probability that the glycosite of each variant sequence will have a glycosylation feature is determined at least based on the identity of one or more amino acid sequences varied as compared to the reference sequence. In some embodiments, the pseudo-probability that the glycosite of each variant sequence will have a glycosylation feature is determined at least based on the position of one or more amino acid sequences varied as compared to the reference sequence. In some embodiments, the position is the distance from the glycosite. In some embodiments, each variant sequence has one amino acid substitution as compared to the reference sequence. In some embodiments, each variant sequence has at least two amino acid substitutions as compared to the reference sequence. In some embodiments, the glycosite comprises an arginine, asparagine, serine, threonine, or tyrosine. In some embodiments, the glycosite further comprises one or more amino acids N-terminal to the glycan-bound arginine, asparagine, serine, threonine, or tyrosine. In some embodiments, the glycosite further comprises one or more amino acids C-terminal to the glycan-bound arginine, asparagine, serine, threonine, or tyrosine. In some embodiments, the sequence of a variant sequence is comprised within a peptide. Further provided are methods of treatment comprising administering to a subject in need thereof a therapeutically effective amount of the peptide.

[0135] Further provided herein are methods of modifying a reference glycopeptide to alter a glycosylation feature of a glycosite of the reference glycopeptide to produce a modified glycopeptide, the method comprising: identifying a predicted presence of the glycosylation feature at a glycosite of a modified glycopeptide, which modified glycopeptide comprises one or more amino acid substitutions to a sequence of the reference glycopeptide, and generating the modified glycopeptide having the one or more amino acid substitutions in the sequence of the reference glycopeptide if the predicted presence is at least a threshold predicted presence. In some embodiments, the threshold predicted presence (e.g., a threshold pseudo-probability) is about 50%, 60%, 70%, 80%, 90%, or higher. In some embodiments, the predicted presence is determined using a trained algorithm. In some embodiments, the predicted presence is determined at least based on the identity of one or more amino acids varied as compared to the reference sequence. In some embodiments, the predicted presence is determined at least based on the position of one or more amino acids varied as compared to the reference sequence. In some embodiments, the position is the distance from the glycosite.
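The thresholding step described above might, for example, be sketched as follows. The function name, the dictionary input, and the substitution labels used in the usage note are hypothetical and serve only to illustrate selecting candidates whose predicted presence is at least a threshold.

```python
def select_variants(predicted_presence, threshold=0.5):
    """Return the labels of candidate modified glycopeptides whose predicted
    presence is at least `threshold`, ordered from highest to lowest.

    `predicted_presence` maps a hypothetical variant label to a value
    between 0 and 1 produced by some upstream predictor.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    kept = [(label, p) for label, p in predicted_presence.items() if p >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in kept]
```

For instance, given illustrative values of 0.82, 0.41, and 0.90 for three candidate substitutions and a threshold of 0.8, only the candidates scoring 0.90 and 0.82 would be retained for generation.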

[0136] Further provided herein are methods of modifying a reference glycopeptide to alter a glycosylation feature of a glycosite of the reference glycopeptide to produce a modified glycopeptide, the method comprising: substituting one or more amino acids within 15 amino acids of the glycosite to generate the modified glycopeptide. In some embodiments, the glycosylation feature is high-mannose, sialylation, fucosylation, or a combination thereof. In some embodiments, the glycosylation feature is an increase in high-mannose in the modified glycopeptide as compared to the reference glycopeptide, the glycosylation feature is a decrease in high-mannose in the modified glycopeptide as compared to the reference glycopeptide, the glycosylation feature is an increase in sialylation in the modified glycopeptide as compared to the reference glycopeptide, the glycosylation feature is a decrease in sialylation in the modified glycopeptide as compared to the reference glycopeptide, the glycosylation feature is an increase in fucosylation in the modified glycopeptide as compared to the reference glycopeptide, or the glycosylation feature is a decrease in fucosylation in the modified glycopeptide as compared to the reference glycopeptide. In some embodiments, the glycosite comprises an arginine, asparagine, serine, threonine, or tyrosine. In some embodiments, the method further comprises administering a therapeutically effective amount of the modified glycopeptide to a subject in need thereof based at least in part on the altered glycosylation feature of the modified glycopeptide.

[0137] Further provided herein are modified glycopeptides having a first glycosylation feature that is different from a reference glycosylation feature of a glycosite of a reference glycoprotein, wherein the modified glycopeptide has one or more amino acid substitutions in a sequence comprising the glycosite as compared to the reference glycoprotein. In some embodiments, the one or more amino acid substitutions is positioned within 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, or 25 amino acids of the glycosite; or wherein the one or more amino acid substitutions is positioned within a sequon comprising the glycosite. In some embodiments, the first glycosylation feature is a specific monosaccharide or a polysaccharide epitope. In some embodiments, the specific monosaccharide is mannose, sialic acid, fucose, D-glucose (Glc), D-galactose (Gal), N-acetylglucosamine (GlcNAc), N-acetylgalactosamine (GalNAc), D-mannose (Man), N-acetylneuraminic acid (Neu5Ac), N-glycolylneuraminic acid (Neu5Gc), neuraminic acid (Neu), 2-keto-3-deoxynononic acid or 3-deoxy-D-glycero-D-galacto-nonulosonic acid (KDN), 3-deoxy-D-manno-2 octulopyranosylonic acid (Kdo), D-galacturonic acid (GalA), L-iduronic acid (IdoA), L-rhamnose (Rha), L-fucose (Fuc), D-xylose (Xyl), D-ribose (Rib), L-arabinofuranose (Araf), D-glucuronic acid (GlcA), D-allose (All), D-apiose (Api), D-fructofuranose (Fruf), ascarylose (Asc), or ribitol (Rbo), or a combination thereof. In some embodiments, the polysaccharide epitope is high-mannose, sialylation, fucosylation, hybrid, complexity, core or distally fucosylation, terminal sialylation, terminal galactosylation, terminal GlcNAc-ylation, GlcNAc-bisection, or poly-sialylation, or a glycosylation feature listed in Table 1, or a combination thereof.
In some embodiments, the first glycosylation feature is an increase in high-mannose in the modified glycopeptide as compared to the reference glycopeptide, the first glycosylation feature is a decrease in high-mannose in the modified glycopeptide as compared to the reference glycopeptide, the first glycosylation feature is an increase in sialylation in the modified glycopeptide as compared to the reference glycopeptide, the first glycosylation feature is a decrease in sialylation in the modified glycopeptide as compared to the reference glycopeptide, the first glycosylation feature is an increase in fucosylation in the modified glycopeptide as compared to the reference glycopeptide, or the first glycosylation feature is a decrease in fucosylation in the modified glycopeptide as compared to the reference glycopeptide. In some embodiments, the glycosite comprises an arginine, asparagine, serine, threonine, or tyrosine. Further provided is a method comprising administering a therapeutically effective amount of the modified glycopeptide to a subject in need thereof based at least in part on the first glycosylation feature of the modified glycopeptide.

[0138] Further provided herein are methods for determining the likelihood that one or more glycans from a plurality of candidate glycans will be found at a glycosite of a sequence, the method comprising: (a) providing the sequence (and optionally the associated three-dimensional structure) and the plurality of candidate glycans; (b) for each of the plurality of candidate glycans: applying a trained algorithm to calculate a predicted presence for each glycan at the glycosite of the sequence determined at least based on one or more amino acids in the sequence (and optionally the associated three-dimensional structure); and (c) computer processing the predicted presence for each of the plurality of candidate glycans to determine the likelihood that the one or more glycans will be found at the glycosite of the sequence. In some embodiments, the one or more glycans comprises at least one glycan of Table 1. In some embodiments, the predicted presence of the glycan at the glycosite is determined at least based on the identity of the one or more amino acids in the sequence. In some embodiments, the predicted presence of the glycan at the glycosite is determined at least based on the position of the one or more amino acids in the sequence. In some embodiments, the predicted presence of the glycan at the glycosite is determined at least based on the identity and position of the one or more amino acids in the sequence. In some embodiments, the one or more amino acids in the sequence is located within 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 amino acids of the glycosite. In some embodiments, the glycosite comprises an arginine, asparagine, serine, threonine, or tyrosine. In some embodiments, the glycosite further comprises one or more amino acids N-terminal to the glycan-bound arginine, asparagine, serine, threonine, or tyrosine.
In some embodiments, the glycosite further comprises one or more amino acids C-terminal to the glycan-bound arginine, asparagine, serine, threonine, or tyrosine. In some embodiments, the sequence is comprised within a peptide. In some embodiments, precursors of the one or more glycans are glycans present in a host cell during production of the peptide. In some embodiments, precursors of the one or more glycans are glycans present in a host cell medium during production of the peptide. In some embodiments, the method further comprises administering a therapeutically effective amount of the peptide based at least in part on determining whether the one or more glycans will be found at the glycosite of the sequence.
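One possible way to carry out step (c) above, offered only as a sketch, is to normalize the predicted presence values across the candidate glycans and rank the candidates. Normalizing to a sum of one is an assumption of this sketch, since the disclosure does not specify how the predicted presences are processed, and the glycan labels used in the usage note are illustrative rather than entries of Table 1.

```python
def rank_candidate_glycans(predicted_presence):
    """Convert per-glycan predicted presence values into relative
    likelihoods that sum to one, sorted from most to least likely.

    `predicted_presence` maps a glycan label to a non-negative score from
    some upstream trained algorithm (not defined here).
    """
    if any(score < 0 for score in predicted_presence.values()):
        raise ValueError("predicted presence values must be non-negative")
    total = sum(predicted_presence.values())
    if total <= 0:
        raise ValueError("at least one predicted presence must be positive")
    likelihoods = {glycan: score / total
                   for glycan, score in predicted_presence.items()}
    return sorted(likelihoods.items(), key=lambda item: item[1], reverse=True)
```

For example, scores of 1.0 and 3.0 for two illustrative candidates labeled "Man5" and "G0F" would yield relative likelihoods of 0.25 and 0.75, with "G0F" ranked first.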

[0139] Further provided herein are computer systems comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application for determining the likelihood that one or more glycans from a plurality of candidate glycans will be found at a glycosite of a sequence, the application comprising: (a) a module programmed to, for each of the plurality of candidate glycans, apply a trained algorithm to calculate a predicted presence for each glycan at the glycosite of the sequence determined at least based on one or more amino acids in the sequence (and optionally the associated three-dimensional structure of the sequence) to generate a plurality of predicted presences; and (b) a processing module programmed to process the plurality of predicted presences to determine the likelihood that the one or more glycans will be found at the glycosite of the sequence.

[0140] Further provided herein are non-transitory computer-readable mediums comprising machine-executable code that, upon execution by one or more computer processors, implements a method for determining the likelihood that one or more glycans from a plurality of candidate glycans will be found at a glycosite of a sequence, the method comprising: (a) providing the sequence (and optionally the associated three-dimensional structure) and the plurality of candidate glycans; (b) for each of the plurality of candidate glycans: applying a trained algorithm to calculate a predicted presence for each glycan at the glycosite of the sequence determined at least based on one or more amino acids in the sequence (and optionally the associated three-dimensional structure); and (c) computer processing the predicted presence for each of the plurality of candidate glycans to determine the likelihood that the one or more glycans will be found at the glycosite of the sequence.

[0141] Further provided herein are systems for determining the likelihood that one or more glycans from a plurality of candidate glycans will be found at a glycosite of a sequence, the system comprising: a database comprising the plurality of candidate glycans; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (a) for each of the plurality of candidate glycans: apply a trained algorithm to calculate a predicted presence for each glycan at the glycosite of the sequence determined at least based on one or more amino acids in the sequence (and optionally the associated three-dimensional structure of the sequence); and (b) process the predicted presence for each of the plurality of candidate glycans to determine the likelihood that the one or more glycans will be found at the glycosite of the sequence. In some embodiments, the one or more glycans comprises at least one glycan of Table 1. In some embodiments, the predicted presence for the glycan at the glycosite of the sequence is determined at least based on the identity of the one or more amino acids in the sequence. In some embodiments, the predicted presence for the glycan at the glycosite of the sequence is determined at least based on the position of the one or more amino acids in the sequence. In some embodiments, the predicted presence for the glycan at the glycosite of the sequence is determined at least based on the identity and position of the one or more amino acids in the sequence. In some embodiments, the one or more amino acids in the sequence is located within 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 amino acids of the glycosite. In some embodiments, the glycosite comprises an arginine, asparagine, serine, threonine, or tyrosine. In some embodiments, the glycosite further comprises one or more amino acids N-terminal to the glycan-bound arginine, asparagine, serine, threonine, or tyrosine.
In some embodiments, the glycosite further comprises one or more amino acids C-terminal to the glycan-bound arginine, asparagine, serine, threonine, or tyrosine. In some embodiments, the sequence is comprised within a peptide. In some embodiments, precursors of the one or more glycans are glycans present in a host cell during production of the peptide. In some embodiments, precursors of the one or more glycans are glycans present in a host cell medium during production of the peptide.

[0142] Further provided herein are methods of modifying a reference glycopeptide to alter a glycan substructure of a glycosite of the reference glycopeptide to produce a modified glycopeptide, the method comprising: calculating whether there is a positive or negative IMR association between one or more amino acid substitutions of a protein feature proximal to the glycosite and the glycan substructure, and generating the modified glycopeptide having the one or more amino acid substitutions if a magnitude of the IMR association is at least a threshold value. In some embodiments, the threshold value is about 50%, 60%, 70%, 80%, 90%, or higher. In some embodiments, the IMR is a generalized estimating equation (GEE) IMR. In some embodiments, the IMR is a Fisher’s exact test IMR. In some embodiments, the IMR is significant if it has a false discovery rate (FDR) correction less than about 0.1. In some embodiments, the IMR is significant if it has a p-value less than about 0.05. In some embodiments, the IMR comprises a logarithm of an odds ratio (logOR) with a magnitude greater than about 1. In some embodiments, the IMR comprises a logOR with a magnitude greater than about 0.5. In some embodiments, the IMR comprises a logOR with a magnitude greater than about 0.1. In some embodiments, the IMR association is determined using a matrix describing the expected glycoimpact of the one or more amino acid substitutions.
In some embodiments, the IMR association is determined at least based on the identity of one or more amino acids. In some embodiments, the IMR association is determined at least based on the proximity of the one or more amino acids to the glycosite. In some embodiments, the proximity is the distance from the glycosite as measured in angstroms. In some embodiments, the proximity is less than or equal to about 6 angstroms to about 25 angstroms. In some embodiments, the proximity is the number of amino acids between each of the one or more amino acids and the glycosite. In some embodiments, the distance is about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 amino acids.
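As an illustrative sketch of the significance criteria described above, IMR associations might be filtered by an FDR-corrected p-value and by the magnitude of the logOR as follows. The use of the Benjamini-Hochberg procedure is an assumption, since the disclosure does not name a particular FDR correction; the function names, the record layout, and the labels in the usage note are likewise hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def significant_associations(records, fdr_cutoff=0.1, min_abs_log_or=1.0):
    """Keep associations with adjusted p-value below `fdr_cutoff` and
    |logOR| above `min_abs_log_or`.

    `records` is a list of (label, log_or, p_value) tuples; the sign of
    log_or distinguishes positive from negative associations.
    """
    q_values = benjamini_hochberg([p for _, _, p in records])
    return [
        (label, log_or, q)
        for (label, log_or, _), q in zip(records, q_values)
        if q < fdr_cutoff and abs(log_or) > min_abs_log_or
    ]
```

In this sketch, an association with a logOR of -2.0 and an adjusted p-value of about 0.053 would be retained as a significant negative association, whereas one with a logOR of 0.3 would be discarded regardless of its p-value.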

[0143] Further provided are methods of modifying a reference glycopeptide to alter a glycan substructure of a glycosite of the reference glycopeptide to produce a modified glycopeptide, the method comprising: substituting one or more amino acids of a protein feature proximal to the glycosite to generate the modified glycopeptide. In some embodiments, the protein feature proximal to the glycosite comprises a structural feature. In some embodiments, the structural feature is less than or equal to about 6 angstroms to about 25 angstroms from the glycosite. In some embodiments, the structural feature is a secondary structure comprising a beta strand, alpha helix, extended strand, beta-bridge, turn, or bend, or a combination of two or more thereof. In some embodiments, the protein feature proximal to the glycosite comprises an amino acid within about 6 amino acids of the glycosite in the N- or C-terminal direction. In some embodiments, the glycan substructure is selected from Table 2 or Table 3. In some embodiments, the method employs a computational approach. In some embodiments, the structure of the reference glycopeptide is or has been determined using X-ray crystallography, homology modeling, and/or de novo prediction based on primary amino acid sequence. In some embodiments, the method further comprises administering a therapeutically effective amount of the modified glycopeptide to a subject in need thereof based at least in part on the altered glycan substructure of the modified glycopeptide.
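To illustrate the structural proximity criterion described above, residues lying within a given distance of the glycosite might be identified from three-dimensional coordinates as in the following sketch. The use of a single representative atom per residue (for example, the alpha carbon), the dictionary input, and the function name are assumptions of this sketch.

```python
import math

def residues_within(coords, glycosite, cutoff_angstroms=6.0):
    """Return residue identifiers whose representative atom lies within
    `cutoff_angstroms` of the glycosite's representative atom.

    `coords` maps a residue identifier to an (x, y, z) tuple in angstroms.
    The glycosite itself is excluded from the result.
    """
    site = coords[glycosite]
    return sorted(
        residue for residue, xyz in coords.items()
        if residue != glycosite and math.dist(site, xyz) <= cutoff_angstroms
    )
```

Residues returned by such a search may be distant from the glycosite in the primary sequence yet close in space, which is why a structural cutoff can identify candidates that a sequence window alone would miss.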

[0144] Further provided are modified glycopeptides having a first glycan substructure that is different from a reference glycan substructure of a glycosite of a reference glycoprotein, wherein the modified glycopeptide has one or more amino acid substitutions of a protein feature proximal to the glycosite as compared to the reference glycoprotein. In some embodiments, the protein feature proximal to the glycosite comprises a structural feature. In some embodiments, the structural feature is less than or equal to about 6 angstroms to about 15 angstroms from the glycosite. In some embodiments, the structural feature is a secondary structure comprising a beta strand, alpha helix, extended strand, beta-bridge, turn, or bend, or a combination of two or more thereof. In some embodiments, the protein feature proximal to the glycosite comprises an amino acid within about 6 amino acids of the glycosite in the N- or C-terminal direction. In some embodiments, the glycan substructure is selected from Table 2 or Table 3. In some embodiments, the protein feature is selected from Table 2 or Table 3.

[0145] Further provided are methods comprising administering a therapeutically effective amount of the modified glycopeptide of any one of claims 117-123 to a subject in need thereof based at least in part on the first glycan substructure of the modified glycopeptide.

[0146] Further provided are modified glycopeptides having an increase, decrease, or change in a glycan structure at a glycosite of the modified glycopeptide as compared to a reference glycopeptide, as determined based on the associations of Table 2 and/or Table 3 (e.g., wherein the modified glycoprotein has a Phe within 5 amino acids upstream of the glycosite, and the reference glycopeptide does not have a Phe within 5 amino acids upstream of the glycosite of the reference glycopeptide).

[0147] Further provided are methods comprising administering a therapeutically effective amount of the modified glycopeptide of claim 125 to a subject in need thereof based at least in part on the increase, decrease, or change in any of the glycan features selected from Table 1.

[0148] Further provided are methods for determining the effect of a variation of a reference sequence on glycosylation of a first glycosite in the reference sequence, wherein the reference sequence comprises the first glycosite and a second glycosite, the method comprising: (a) providing a plurality of sequences comprising (1) the reference sequence and (2) a plurality of variant sequences each having a different glycosylation feature at the second glycosite as compared to the reference sequence; and (b) for each of the plurality of variant sequences: applying a trained algorithm to calculate the predicted presence of a glycosylation feature at the first glycosite based at least on the identity of the glycosylation feature at the second glycosite; thereby determining the effect of the variation of the reference sequence on glycosylation of the first glycosite.

[0149] Further provided are methods for determining the effect of a variation of the structure of a reference sequence on glycosylation of a glycosite in the reference sequence, the method comprising: (a) providing a plurality of sequences comprising (1) the reference sequence and (2) a plurality of variant sequences having one or more amino acid substitutions as compared to the reference sequence; and (b) for each of the plurality of variant sequences: applying a trained algorithm to calculate the predicted presence of a glycosylation feature at the glycosite of each variant sequence based at least on the structure of the variant sequence; thereby determining the effect of the variation of the reference sequence structure on glycosylation of the glycosite. In some embodiments, the structure is secondary structure, tertiary structure, or quaternary structure, or a combination of two or more thereof.
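By way of non-limiting illustration, the comparison of a reference sequence against a plurality of variant sequences may be sketched as follows. The trained algorithm is represented by an arbitrary callable; the `toy_predictor` shown here is a hypothetical placeholder with no biological meaning and is not the trained algorithm of the disclosure.

```python
from typing import Callable, Dict, List

def score_variants(reference: str,
                   variants: List[str],
                   predict: Callable[[str], float]) -> Dict[str, float]:
    """Return, for each variant, the change in predicted presence of a
    glycosylation feature relative to the reference sequence."""
    ref_score = predict(reference)
    return {variant: predict(variant) - ref_score for variant in variants}

def toy_predictor(sequence: str) -> float:
    """Hypothetical stand-in for a trained algorithm: fraction of Phe residues.
    It exists only so that the sketch runs end to end."""
    return sequence.count("F") / len(sequence)

effects = score_variants("ANAT", ["FNAT", "ANFT"], toy_predictor)
for variant, delta in effects.items():
    print(variant, round(delta, 3))
```

A positive value indicates a higher predicted presence in the variant than in the reference, and a negative value indicates a lower predicted presence.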

[0150] In any of the methods and systems herein, the sequence may be a viral sequence. For instance, in some embodiments, provided herein are methods for determining the likelihood that one or more glycans from a plurality of candidate glycans will be found at a glycosite of a viral sequence, the method comprising: (a) providing the viral sequence and the plurality of candidate glycans, observed glycans, desired glycans, or undesired glycans; (b) for each of the plurality of candidate, observed, desired, or undesired glycans at each glycosite: applying a trained algorithm to calculate a predicted presence for each glycan at the glycosite of the sequence; and (c) computer processing the predicted presence for each of the plurality of candidate, observed, desired, or undesired glycans to determine the likelihood that the one or more glycans will be found at the glycosite of the sequence.

[0151] In any of the methods and systems herein, the likelihood of a disease or disorder associated with a glycoprotein in an individual may be determined. For instance, in some embodiments, provided herein are methods of determining a likelihood of a disease or disorder associated with a glycoprotein in an individual, the method comprising: calculating a first IMR association between a glycosite of the glycoprotein and a glycosylation feature; calculating a second IMR association between said glycosylation feature and a glycosite of a modified glycoprotein, wherein said modified glycoprotein comprises one or more amino acid substitutions relative to the glycoprotein; and determining said likelihood based on the difference of said first IMR and said second IMR.

[0152] In some embodiments, methods and systems include and/or utilize: (1) a trained algorithm applied to calculate a predicted presence for each glycan, a plurality of glycans, a plurality of glycans containing a glycan feature, or a plurality of glycans lacking a glycan feature at the glycosite of the sequence, (2) IMRs to estimate the likelihood of a glycan feature existing at the glycosite of a sequence, and/or (3) a comparison of the predicted presence from a trained algorithm, or of IMRs, between a multiplicity of sequences (e.g., a reference sequence and a variant sequence) to determine the relative likelihood of a glycan feature being present.

[0153] In some embodiments, methods are provided for determining the importance of a glycosite by examining co-evolution (e.g., evolutionary coupling) or conservation (e.g., glycan- or composition-associated conservation around a glycosite) with defining glycosite features (e.g., the N, T, or S where the glycan attaches, or the T or S at position N+2). In some embodiments, the importance of a glycosite is determined by identifying evolutionary coupling with a defining glycosite feature (e.g., the N, T, or S where the glycan attaches, or the T or S at position N+2) that is elevated relative to baseline evolutionary coupling or conservation with any amino acid, any asparagine, any serine, or any threonine.
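By way of non-limiting illustration, the defining features of an N-linked glycosite (an N with a T or S at position N+2) may be located in a primary sequence as sketched below. The exclusion of proline at position N+1 follows the commonly used N-X-S/T sequon convention and is an assumption of this sketch rather than a statement of the disclosure; the function name and example sequence are likewise hypothetical.

```python
import re

# N, then any residue except P (assumed convention), then S or T.
# The lookahead allows overlapping sequons to be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")

def find_n_sequons(sequence: str) -> list:
    """Return 0-based positions of the N in each candidate N-X-S/T sequon."""
    return [match.start() for match in SEQUON.finditer(sequence)]

print(find_n_sequons("MNGTANPSNNST"))  # [1, 8, 9]
```

The positions returned may then serve as the anchor sites against which conservation or evolutionary coupling is evaluated.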

[0154] Any of the methods and systems herein may be used to: (1) determine the likelihood a glycan feature can be added to a glycosite, (2) determine the likelihood a glycan feature can be enriched at a glycosite, (3) determine the likelihood a glycan feature can be removed from a glycosite, (4) determine the likelihood a glycan feature can be depleted at a glycosite, (5) aid in synthetic methods of glycan biosynthesis, (6) determine the change in glycosylation due to mutation as it may relate to pathogenicity, (7) determine the change in glycosylation due to sequence differences between a reference and mutant sequence, (8) explain the biological methods underlying glycan-modulated pathologies, (9) add, remove, enrich, or deplete a glycan or glycan feature for the purposes of genetic therapy wherein the modified glycoconjugate is pathogenic, corrective of a pathogenic mechanism, or competitively inhibitory of a pathogenic mechanism, (10) predict glycosylation on any glycoconjugate including proteins, peptides, sequences, nucleotides, lipids, sugars, phosphates, functional groups, or small molecules, (11) glycoengineer surface proteins or targeting receptors for the purpose of developing cell therapies (e.g., CAR-T cells), (12) change the behavior, functional impact, biological importance, or biological mechanism of a glycoconjugate without changing the structure, or (13) any combination of two or more thereof.

Methods and systems for modifying glycopeptides

[0155] Methods and systems as described herein may modify a sequence or sequon comprising a glycosite (in some cases referred to as a glycopeptide). A glycopeptide may be modified by altering the sequence or amino acid composition of the peptide (e.g., by modifying an amino acid sequence proximal to a glycosite). A glycosylation feature may be modified by altering an amino acid close to the glycosite (e.g., structurally proximal to the glycosite). Methods for modifying glycopeptides may comprise identifying the likelihood that one or more changes to the reference glycopeptide will alter a glycosylation feature of the glycosite to give a modified glycopeptide. The likelihood may be determined by applying a trained algorithm as described herein to the reference glycopeptide. Alternatively or additionally, the likelihood may be determined by calculating or looking up an association score between the reference glycopeptide and the glycosylation feature. The association score may be an IMR or combination of IMRs as described herein.

[0156] Changes to the glycopeptide to give a modified glycopeptide may comprise one or more of length, monomer (e.g., amino acid, nucleotide) identity, predicted or observed secondary structure, predicted or observed tertiary structure, glycosite composition, or glycosite position. Based on the likelihood of the presence or absence of the glycosylation features in the original glycopeptide as compared to the modified glycopeptide, a determination of the effect may be made.

[0157] In some embodiments, the likelihood may be expressed as a probability. In some embodiments, the likelihood may be expressed as a pseudo-probability. In some embodiments, the likelihood may be expressed as a ratio or product of one or more probabilities or pseudo-probabilities. In some embodiments, the likelihood may be expressed as a product, quotient, sum or difference of one or more probabilities. In some embodiments, the likelihood may be expressed as an odds ratio. In some embodiments, the likelihood may be expressed as the logarithm of an odds ratio. In some embodiments, the likelihood may be expressed as a mathematical operation performed on one or more odds, odds ratios, log odds ratios, probabilities or pseudo-probabilities.
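By way of non-limiting illustration, the relationship among a probability, an odds, an odds ratio, and the logarithm of an odds ratio may be sketched as follows. The function names and example probabilities are hypothetical, and the natural logarithm is an assumption of this sketch since the disclosure does not fix a base.

```python
import math

def odds(probability: float) -> float:
    """Convert a probability in [0, 1) to an odds."""
    return probability / (1.0 - probability)

def odds_ratio(p_variant: float, p_reference: float) -> float:
    """Ratio of the odds of a glycosylation feature in a variant to the odds in a reference."""
    return odds(p_variant) / odds(p_reference)

def log_odds_ratio(p_variant: float, p_reference: float) -> float:
    """Natural logarithm of the odds ratio; zero means no difference."""
    return math.log(odds_ratio(p_variant, p_reference))

# Hypothetical probabilities of a feature being present at a glycosite.
print(odds(0.8))
print(odds_ratio(0.8, 0.5))
print(log_odds_ratio(0.8, 0.5))
```

A positive log odds ratio indicates that the feature is more likely in the variant than in the reference, and a negative value indicates the converse.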

[0158] In one example embodiment, the predicted presence (a specific instance of likelihood) of the glycosylation feature at a glycosite in a sequence may be based on glycosylation feature structure, glycosylation feature composition, glycosylation feature length, glycosylation feature branching, sequence length, sequence composition, position of a monomer in the sequence, substitution of one or more monomers in the sequence, insertion of one or more monomers in the sequence, deletion of one or more monomers in the sequence, observed or predicted sequence secondary structure, observed or predicted sequence tertiary structure, observed or predicted sequence quaternary structure, or any combination thereof. The predicted presence may be based on a feature of the reference sequence. Alternatively or additionally, the predicted presence may be based on a feature of one or more of the variant sequences.

[0159] The method may further comprise determining that the likelihood is above or below a cutoff or threshold value. Examples of threshold values may include about 1%, about 2%, about 5%, about 10%, about 15%, about 20%, about 25%, about 30%, about 35%, about 40%, about 45%, about 50%, about 55%, about 60%, about 65%, about 70%, about 75%, about 80%, about 85%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, and about 99%.

[0160] In some embodiments, the likelihood is expressed as the logarithm of an odds ratio. The odds ratio may be determined from Fisher’s exact test. The odds ratio may be determined by solving a set of generalized estimating equations. In some embodiments, the association score may be an IMR as described herein. In some embodiments, the IMR may be a generalized estimating equation (GEE) parameter. In some embodiments, the IMR may be an odds ratio as determined by Fisher’s exact test.
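By way of non-limiting illustration, an odds ratio and a two-sided p-value from Fisher's exact test may be computed for a 2x2 contingency table as sketched below. The table layout (feature present or absent versus a given amino acid present or absent near the glycosite), the counts, and the function name are assumptions made for illustration; the generalized estimating equation alternative is not shown.

```python
from math import comb, log

def fisher_exact_2x2(a: int, b: int, c: int, d: int):
    """Return (sample odds ratio, two-sided p-value) for the table [[a, b], [c, d]].
    The p-value sums the hypergeometric probabilities of all tables with the
    same margins that are no more probable than the observed table."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= p_observed * (1 + 1e-9))
    ratio = (a * d) / (b * c) if b * c else float("inf")
    return ratio, min(p_value, 1.0)

# Hypothetical counts: rows = amino acid present/absent near the glycosite,
# columns = glycan substructure observed/not observed.
ratio, p_value = fisher_exact_2x2(8, 2, 1, 5)
print(ratio, log(ratio), p_value)
```

The logarithm of the returned odds ratio corresponds to the log odds ratio form of the likelihood described above.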

[0161] The modified glycopeptide may differ from the reference glycopeptide in amino acid sequence. The difference in amino acid sequence may comprise a difference in one or more amino acid identities or positions. In some embodiments, the variant glycopeptide differs from the reference glycopeptide in the identity of an amino acid 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more positions from the glycosite. In some embodiments, the variant glycopeptide differs from the reference glycopeptide in the identity of an amino acid 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, or fewer positions from the glycosite.
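By way of non-limiting illustration, variant glycopeptides differing from a reference in the identity of a single amino acid within a chosen number of positions of the glycosite may be enumerated as sketched below. The function name, the default window, and the example peptide are hypothetical, and only the twenty standard amino acids are considered.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard amino acids

def single_substitutions(sequence: str, glycosite: int, window: int = 6):
    """Return (position, new_residue, variant_sequence) for every single
    substitution within `window` positions of the glycosite (0-based).
    The glycosite residue itself is left unchanged."""
    start = max(0, glycosite - window)
    end = min(len(sequence), glycosite + window + 1)
    variants = []
    for position in range(start, end):
        if position == glycosite:
            continue
        for residue in AMINO_ACIDS:
            if residue != sequence[position]:
                variant = sequence[:position] + residue + sequence[position + 1:]
                variants.append((position, residue, variant))
    return variants

variants = single_substitutions("AANATAA", glycosite=2, window=1)
print(len(variants))  # 38: two neighboring positions, 19 alternatives each
```

Some substitutions in this enumeration (for example, at position N+2) may remove the sequon altogether; whether such variants are retained is left to the practitioner.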

[0162] The modified glycopeptide may differ from the reference glycopeptide in length of amino acid sequence. The modified glycopeptide may have one or more insertions or deletions with respect to the reference glycopeptide. The modified glycopeptide may comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more insertions or deletions. The insertions or deletions may all be contiguous, or they may comprise one or more subsets across the sequence of the modified glycopeptide. The insertion or deletion may be proximal to a glycosite of the modified glycopeptide. In some embodiments, the insertion or deletion is within 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more sites of the glycosite. In some embodiments, the insertion or deletion is within no more than 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, or fewer sites of the glycosite. In some embodiments, the insertion or deletion is in a site distal to the glycosite.

[0163] The modified glycopeptide may differ from the reference glycopeptide in one or more glycosylation features. The modified glycopeptide may differ from the reference glycopeptide in 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, or more glycosylation features. The one or more glycosylation features may be identical, or they may be independently distinct.

[0164] In some embodiments, the glycopeptide is a naturally occurring glycoprotein or derivative or modification thereof. In some embodiments, the glycopeptide comprises an antibody, glycoconjugate, Fc fusion protein, anticoagulant, blood factor, bone morphogenetic protein, engineered protein scaffold, enzyme, growth factor, hormone, interferon, interleukin or other cytokine, viral (or other pathogen) protein (or antigens), part of proteins (truncated proteins, ectodomains, stem domains, etc.), chimeric protein (e.g., chimeric antigen receptor), or thrombolytic. In some embodiments, the glycopeptide may be used in glycopeptide-based therapies.

[0165] In some embodiments, the modified glycopeptide is a modified form of an antigen or other molecule derived from a pathogen. In some embodiments, the pathogen is selected from the group consisting of a virus, bacterium, prion, fungus, protozoon, viroid, and parasite.

[0166] In some embodiments, the pathogen is selected from the group that causes human disease, which includes but is not limited to Bacillus anthracis (anthrax), Clostridium botulinum toxin (botulism), Yersinia pestis (plague), Variola major (smallpox) and other related pox viruses, Francisella tularensis (tularemia), Viral hemorrhagic fevers, Arenaviruses (e.g., Junin, Machupo, Guanarito, Chapare, Lassa, and/or Lujo), Bunyaviruses (e.g., Hantaviruses causing Hanta Pulmonary syndrome, Rift Valley Fever, and/or Crimean Congo Hemorrhagic Fever), Flaviviruses, Dengue, Filoviruses (e.g., Ebola and Marburg viruses), Burkholderia pseudomallei (melioidosis), Coxiella burnetii (Q fever), Brucella species (brucellosis), Burkholderia mallei (glanders), Chlamydia psittaci (Psittacosis), Ricin toxin (Ricinus communis), Epsilon toxin (Clostridium perfringens), Staphylococcus enterotoxin B (SEB), Typhus fever (Rickettsia prowazekii), Food and water-borne pathogens, Diarrheagenic E. coli, Pathogenic Vibrios, Shigella species, Salmonella, Listeria monocytogenes, Campylobacter jejuni, Yersinia enterocolitica, Caliciviruses, Hepatitis A, Cryptosporidium parvum, Cyclospora cayetanensis, Giardia lamblia, Entamoeba histolytica, Toxoplasma gondii, Naegleria fowleri, Balamuthia mandrillaris, Fungi, Microsporidia, Mosquito-borne viruses (e.g., West Nile virus (WNV), LaCrosse encephalitis (LACV), California encephalitis, Venezuelan equine encephalitis (VEE), Eastern equine encephalitis (EEE), Western equine encephalitis (WEE), Japanese encephalitis virus (JE), St. Louis encephalitis virus (SLEV), Yellow fever virus (YFV), Chikungunya virus, Zika virus), Nipah and Hendra viruses, Additional hantaviruses, Tickborne hemorrhagic fever viruses, Bunyaviruses, Severe Fever with Thrombocytopenia Syndrome virus (SFTSV), Heartland virus, Flaviviruses (e.g., Omsk Hemorrhagic Fever virus, Alkhurma virus, Kyasanur Forest virus), Tickborne encephalitis complex flaviviruses, Tickborne encephalitis viruses, Powassan/Deer Tick virus, Tuberculosis, including drug-resistant Tuberculosis, Influenza virus, Prions, Streptococcus, Pseudomonas, Shigella, Campylobacter, Salmonella, Clostridium, Escherichia, Hepatitis C, papillomavirus, Epstein-Barr virus, varicella, variola, Orthomyxovirus, Severe acute respiratory syndrome associated coronavirus (SARS-CoV), SARS-CoV-2 (COVID-19), MERS-CoV, other highly pathogenic human coronaviruses, or any combination thereof.

[0167] In some embodiments, the virus is a respiratory virus that primarily results in respiratory symptoms including, without limitation, coronaviruses, influenza viruses, adenoviruses, rhinoviruses, coxsackieviruses, and metapneumoviruses. In some embodiments, the virus is an enteric virus that primarily results in digestive symptoms including, without limitation, enteroviruses, noroviruses, hepatoviruses, reoviruses, rotaviruses, parvoviruses, toroviruses, and mastadenovirus. In certain embodiments, the virus is a hemorrhagic fever virus including, without limitation, Ebola virus, Marburg virus, dengue fever virus, yellow fever virus, Rift Valley fever virus, hantavirus, and Lassa fever virus.

[0168] In some embodiments, the pathogen-associated antigen is from an influenza virus. In some embodiments, the pathogen-associated antigen is from an influenza A virus, such as the H5N1 strain. In some embodiments, the pathogen-associated antigen is from an influenza B virus. In some embodiments, the pathogen-associated antigen is an influenza matrix M1 protein or a fragment thereof. In some embodiments, the pathogen-associated antigen is an influenza neuraminidase or a fragment thereof. In some embodiments, the pathogen-associated antigen is an influenza hemagglutinin or a fragment thereof. For example, the pathogen-associated antigen may comprise an entire hemagglutinin, an HA1 domain, an HA2 domain or any antigenic portion thereof.

[0169] In some embodiments, the pathogen-associated antigen is a Coronaviridae antigen. In some embodiments, the Coronaviridae exhibits human tropism. In some cases, the Coronaviridae is selected from the list consisting of SARS Coronavirus (SARS-CoV-1), COVID-19 (SARS-CoV-2), MERS-coronavirus (MERS-CoV), or any combination thereof. In some embodiments, the Coronaviridae comprises SARS Coronavirus (SARS-CoV-1). In some embodiments, the Coronaviridae comprises COVID-19 (SARS-CoV-2). In some embodiments, the Coronaviridae comprises MERS-coronavirus (MERS-CoV). In some embodiments, the Coronaviridae antigen comprises a spike protein, an envelope protein, a nucleocapsid protein, a membrane protein, a membrane glycoprotein, or a non-structural protein. In some embodiments, the Coronaviridae antigen comprises a spike protein, an envelope small membrane protein, a membrane protein, a non-structural protein 6 (NSP6), a nucleoprotein, an ORF10 protein, Protein 3a, Protein 7a, Protein 9b, structural protein 8, uncharacterized protein 4, or any combination thereof.

[0170] The method may comprise an operation of generating the glycopeptide. The glycopeptide may be generated by any suitable biochemical process (e.g., expression in a natural or recombinant host organism or part thereof), chemical synthetic route (e.g., solid-phase glycan synthesis), or combination thereof. In some embodiments, the glycopeptide may be generated if the likelihood is determined to be above a cutoff value. Alternatively, the glycopeptide may be generated if the likelihood is determined to be below a cutoff value.

[0171] In some embodiments, a glycopeptide may be generated by culturing cells in vivo. A cell may comprise a cell membrane, at least one chromosome composed of genetic material, cytoplasm, and various organelles which are adapted or specialized to perform one or more vital functions, such as energy and protein synthesis, respiration, digestion, storage and transportation of nutrients, locomotion, or cell division. A cell may comprise one or a plurality of cells. A cell may comprise a somatic cell, a terminally differentiated cell, a stem cell, or a germ cell. A somatic cell may be any cell forming the body of an organism that is not a germline cell. Mutations in somatic cells may affect the individual organism but are not passed on to offspring. A terminally differentiated cell may refer to any cell that, in the course of acquiring specialized functions, is not able to transform into other types of cells. These cells may constitute most of the mammalian body and may be unable to proliferate.

[0172] In some embodiments, a glycoprotein may be generated by culturing cells in vitro. A cell may comprise a cell membrane, at least one chromosome composed of genetic material, cytoplasm, and various organelles which are adapted or specialized to perform one or more vital functions, such as energy and protein synthesis, respiration, digestion, storage and transportation of nutrients, locomotion, or cell division. A cell may comprise one or a plurality of cells. A cell may comprise a somatic cell, a terminally differentiated cell, a stem cell, a germ cell, or other cell type. A somatic cell may be any cell forming the body of an organism that is not a germline cell. Mutations in somatic cells may affect the individual organism but are not passed on to offspring.

[0173] In some embodiments, a glycoprotein may be generated by a cell-free synthetic process. The cell-free synthetic process may use the constituent biomolecules, enzymes, substrates, cofactors, or reagents of an organism; recombinant or otherwise modified constituent biomolecules, enzymes, substrates, cofactors, or reagents of an organism; or catalysts, reagents, and reaction conditions not associated with biological systems, such as those which may be employed in a chemical laboratory setting.

[0174] In some embodiments, the modified glycoprotein may be used as part of a pharmaceutical composition as described elsewhere herein. In some embodiments, the modified glycoprotein may be used in a vaccine composition. In some embodiments, the vaccine is a viral vaccine.

[0175] Methods and systems as described herein are not limited to modification of glycopeptides. Other biomolecules associated with glycans or glycosylation features may be modified by the methods described herein. In some embodiments, the method may be used to produce a modified glycosylated nucleic acid. In some embodiments, the method may be used to produce a modified glycosylated lipid.

Methods and systems for predicting pathogenicity of a mutation

[0176] Methods and systems as described herein may predict whether a mutation in a glycoprotein is pathogenic. In some cases, the methods and systems may determine the likelihood that a subject (e.g., an individual) has or is predicted to have a disease or disorder associated with a glycoprotein. Methods for determining the likelihood of the individual having the disease or disorder associated with the glycoprotein may comprise calculating a first IMR between a glycosite of the glycoprotein and a glycosylation feature. The methods may further comprise an operation of determining a second IMR based on the glycosylation feature and the glycosite in a modified glycoprotein. The modified glycoprotein may differ from the first glycoprotein in one or more amino acids (e.g., substitutions, deletions, etc.). In some cases, the modified glycoprotein corresponds to a mutant or variant glycoprotein associated with a known disease or disorder. Based on the first IMR and the second IMR, the likelihood of the individual having the disease or disorder may be determined. In some embodiments, any methods and systems provided herein for determining the effect of sequence modification on glycosylation may be used to identify changes in glycosylation. In some embodiments, any methods and systems for modifying glycopeptides provided herein may be used to identify changes in glycosylation.

[0177] Changes to the glycoprotein relative to the modified glycoprotein may comprise one or more of length, monomer (e.g., amino acid) identity, predicted or observed secondary structure, predicted or observed tertiary structure, glycosite composition, or glycosite position. Based on the likelihood of the presence or absence of the glycosylation features in the original glycoprotein as compared to the modified glycoprotein, a determination of the likelihood that the modified glycoprotein is pathogenic or causative of a pathology may be made.
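By way of non-limiting illustration, the comparison of a first IMR and a second IMR may be sketched as a difference of log odds ratios that is then tested against a cutoff. The disclosure does not specify how the difference is mapped to a likelihood of disease, so the cutoff of 1.0, the function name, and the example values are assumptions of this sketch only.

```python
def imr_shift(first_log_or: float, second_log_or: float, cutoff: float = 1.0):
    """Return (difference, flagged) where difference is the second IMR minus
    the first IMR and flagged is True when its magnitude reaches the cutoff.
    Both IMRs are assumed to be expressed as log odds ratios."""
    difference = second_log_or - first_log_or
    return difference, abs(difference) >= cutoff

# Hypothetical IMRs for a reference glycoprotein and a mutant glycoprotein.
difference, flagged = imr_shift(first_log_or=0.4, second_log_or=-1.1)
print(difference, flagged)
```

A flagged difference indicates only that the predicted glycosylation of the modified glycoprotein departs from that of the reference by the chosen amount; any inference of pathogenicity depends on further evidence.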

[0178] In some embodiments, the likelihood may be expressed as a probability. In some embodiments, the likelihood may be expressed as a pseudo-probability. In some embodiments, the likelihood may be expressed as a ratio or product of one or more probabilities or pseudo-probabilities. In some embodiments, the likelihood may be expressed as a product, quotient, sum or difference of one or more probabilities. In some embodiments, the likelihood may be expressed as an odds ratio. In some embodiments, the likelihood may be expressed as the logarithm of an odds ratio. In some embodiments, the likelihood may be expressed as a mathematical operation performed on one or more odds, odds ratios, log odds ratios, probabilities or pseudo-probabilities.

[0179] The predicted presence of the glycosylation feature at a glycosite in a sequence may be based on glycosylation feature structure, glycosylation feature composition, glycosylation feature length, glycosylation feature branching, sequence length, sequence composition, position of a monomer in the sequence, substitution of one or more monomers in the sequence, insertion of one or more monomers in the sequence, deletion of one or more monomers in the sequence, observed or predicted sequence secondary structure, observed or predicted sequence tertiary structure, observed or predicted sequence quaternary structure, or any combination thereof. The predicted presence may be based on a feature of the reference sequence. Alternatively or additionally, the predicted presence may be based on a feature of one or more of the variant sequences.

[0180] The method may comprise determining that the likelihood is above or below a cutoff or threshold value. Examples of threshold values may include about 1%, about 2%, about 5%, about 10%, about 15%, about 20%, about 25%, about 30%, about 35%, about 40%, about 45%, about 50%, about 55%, about 60%, about 65%, about 70%, about 75%, about 80%, about 85%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, and about 99%.

[0181] In some embodiments, the likelihood is expressed as the logarithm of an odds ratio. The odds ratio may be determined from Fisher’s exact test. The odds ratio may be determined by solving a set of generalized estimating equations. In some embodiments, the association score may be an IMR as described herein. In some embodiments, the IMR may be a generalized estimating equation (GEE) parameter. In some embodiments, the IMR may be an odds ratio as determined by Fisher’s exact test.

[0182] The modified glycoprotein may differ from the reference glycoprotein in amino acid sequence. The difference in amino acid sequence may comprise a difference in one or more amino acid identities or positions. In some embodiments, the variant glycoprotein differs from the reference glycoprotein in the identity of an amino acid 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more positions from the glycosite. In some embodiments, the variant glycoprotein differs from the reference glycoprotein in the identity of an amino acid 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, or fewer positions from the glycosite.

[0183] The modified glycoprotein may differ from the reference glycoprotein in length of amino acid sequence. The modified glycoprotein may have one or more insertions or deletions with respect to the reference glycoprotein. The modified glycopeptide may comprise 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more insertions or deletions. The insertions or deletions may all be contiguous, or they may comprise one or more subsets across the sequence of the modified glycoprotein. The insertion or deletion may be proximal to a glycosite of the modified glycoprotein. In some embodiments, the insertion or deletion is within 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more sites of the glycosite. In some embodiments, the insertion or deletion is within no more than 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, or fewer sites of the glycosite. In some embodiments, the insertion or deletion is in a site distal to the glycosite.

[0184] The modified glycoprotein may differ from the reference glycopeptide in one or more glycosylation features. The modified glycopeptide may differ from the reference glycopeptide in 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, or more glycosylation features. The one or more glycosylation features may be identical, or they may be independently distinct.

[0185] The methods and systems may be used to determine the likelihood of the individual having any disease or disorder known to be associated with a glycoprotein or glycosylation of a protein. In some embodiments, the disease or disorder comprises albinism, a prion disease, or Gaucher disease. In some embodiments, the prion disease comprises Creutzfeldt-Jakob Disease (CJD) or Gerstmann-Straussler disease (GSD).

Pharmaceutical compositions and methods of treatment

[0186] Also described herein are pharmaceutical compositions, wherein a pharmaceutical composition may comprise a modified glycoprotein as described herein or a fragment thereof. In some embodiments, a pharmaceutical composition may further comprise a pharmaceutically acceptable carrier, an excipient, or any combination thereof. A “pharmaceutically acceptable carrier or excipient” may comprise one or more molecular entities that do not materially affect the composition or change the active agent(s) contained therein, are physiologically tolerable, and do not typically produce an allergic reaction, or similar untoward reaction, when administered to a subject.

[0187] Also described herein are methods for treating a subject using a formulation or pharmaceutical composition as described herein. Also described herein are methods for prophylactic treatment of a subject using a formulation or pharmaceutical composition as described herein. Pharmaceutical compositions are formulated in a conventional manner using one or more pharmaceutically acceptable excipients that facilitate processing of the active compounds, i.e., modified glycoproteins or functional fragments thereof, into preparations that may be used pharmaceutically. Proper formulation is dependent upon the route of administration chosen. A summary of pharmaceutical compositions described herein may be found, for example, in Remington: The Science and Practice of Pharmacy, Nineteenth Ed. (Easton, Pa.: Mack Publishing Company, 1995); Hoover, John E., Remington’s Pharmaceutical Sciences, Mack Publishing Co., Easton, Pennsylvania 1975; Liberman, H.A. and Lachman, L., Eds., Pharmaceutical Dosage Forms, Marcel Dekker, New York, N.Y., 1980; and Pharmaceutical Dosage Forms and Drug Delivery Systems, Seventh Ed. (Lippincott Williams & Wilkins 1999), herein incorporated by reference for such disclosure.

[0188] Such methods may comprise administering to a subject an effective amount of the pharmaceutical composition or formulation. An effective amount may be determined, for example, based on the KD of a modified glycoprotein within the formulation or pharmaceutical composition, the bioavailability of a modified glycoprotein within the formulation or pharmaceutical composition, the route of administration of the formulation or pharmaceutical composition, other factors, or a combination thereof.

[0189] In some embodiments, a formulation or pharmaceutical composition may further comprise a second therapeutic. For example, a formulation or pharmaceutical composition may further comprise a pain reliever (e.g., ibuprofen or acetaminophen or any other suitable pain reliever), an antiviral compound (e.g., remdesivir or any other suitable antiviral compound), an antibiotic compound (e.g., azithromycin or any other suitable antibiotic compounds) or a steroid (e.g., dexamethasone, corticosteroids, cortisone, hydrocortisone, prednisone, or any other suitable steroids).

[0190] In some embodiments, a method may further comprise administering a pain reliever (e.g., ibuprofen or acetaminophen), an antiviral compound (e.g., remdesivir), an antibiotic compound (e.g., azithromycin) or a steroid (e.g., dexamethasone). In some embodiments, the second therapeutic compositions may be administered prior to the administration of the modified glycopeptides or the functional fragments thereof disclosed herein. In some embodiments, the second therapeutic compositions may be administered subsequent to the administration of the modified glycoproteins or the functional fragments thereof disclosed herein. In some embodiments, the second therapeutic compositions may be administered at the same time as the administration of the modified glycopeptides or the functional fragments thereof disclosed herein. In some embodiments, the second therapeutic may be conjugated to the modified glycopeptide.

Computer systems

[0191] The present disclosure provides computer systems that are programmed to implement methods of the disclosure. FIG. 27 shows a computer system 101 that is programmed or otherwise configured to, for example, (i) train and test a trained algorithm, (ii) use the trained algorithm to predict the presence or absence of a glycosylation feature, (iii) use the trained algorithm to determine the effect of a sequence or structure modification on glycosylation, and (iv) use the trained algorithm to modify a glycopeptide.

[0192] The computer system 101 can regulate various aspects of analysis, calculation, and generation of the present disclosure, such as, for example, (i) training and testing a trained algorithm, (ii) using the trained algorithm to predict the presence or absence of a glycosylation feature, (iii) using the trained algorithm to determine the effect of a sequence or structure modification on glycosylation, and (iv) using the trained algorithm to modify a glycopeptide. The computer system 101 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.

[0193] The computer system 101 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 105, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 101 also includes memory or memory location 104 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 106 (e.g., hard disk), communication interface 108 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 107, such as cache, other memory, data storage and/or electronic display adapters. The memory 104, storage unit 106, interface 108 and peripheral devices 107 are in communication with the CPU 105 through a communication bus (solid lines), such as a motherboard. The storage unit 106 can be a data storage unit (or data repository) for storing data. The computer system 101 can be operatively coupled to a computer network (“network”) 100 with the aid of the communication interface 108. The network 100 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.

[0194] In some embodiments, the network 100 is a telecommunication and/or data network. The network 100 can include one or more computer servers, which can enable distributed computing, such as cloud computing. For example, one or more computer servers may enable cloud computing over the network 100 (“the cloud”) to perform various aspects of analysis, calculation, and generation of the present disclosure, such as, for example, (i) training and testing a trained algorithm, (ii) using the trained algorithm to predict the presence or absence of a glycosylation feature, (iii) using the trained algorithm to determine the effect of a sequence or structure modification on glycosylation, and (iv) using the trained algorithm to modify a glycopeptide. Such cloud computing may be provided by cloud computing platforms such as, for example, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, and IBM cloud. In some embodiments, the network 100, with the aid of the computer system 101, can implement a peer-to-peer network, which may enable devices coupled to the computer system 101 to behave as a client or a server.

[0195] The CPU 105 may comprise one or more computer processors and/or one or more graphics processing units (GPUs). The CPU 105 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 104. The instructions can be directed to the CPU 105, which can subsequently program or otherwise configure the CPU 105 to implement methods of the present disclosure. Examples of operations performed by the CPU 105 can include fetch, decode, execute, and writeback.

[0196] The CPU 105 can be part of a circuit, such as an integrated circuit. One or more other components of the system 101 can be included in the circuit. In some embodiments, the circuit is an application specific integrated circuit (ASIC).

[0197] The storage unit 106 can store files, such as drivers, libraries and saved programs. The storage unit 106 can store user data, e.g., user preferences and user programs. In some embodiments, the computer system 101 can include one or more additional data storage units that are external to the computer system 101, such as located on a remote server that is in communication with the computer system 101 through an intranet or the Internet.

[0198] The computer system 101 can communicate with one or more remote computer systems through the network 100. For instance, the computer system 101 can communicate with a remote computer system of a user. Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PC’s (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 101 via the network 100.

[0199] Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 101, such as, for example, on the memory 104 or electronic storage unit 106. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 105. In some embodiments, the code can be retrieved from the storage unit 106 and stored on the memory 104 for ready access by the processor 105. In some situations, the electronic storage unit 106 can be precluded, and machine-executable instructions are stored on memory 104.

[0200] The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.

[0201] Embodiments of the systems and methods provided herein, such as the computer system 101, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, or disk drives, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.

[0202] Hence, a machine readable medium, such as computer-executable code, may take many forms, including a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.

[0203] The computer system 101 can include or be in communication with an electronic display 102 that comprises a user interface (UI) 103 for providing, for example, (i) a visual display indicative of training and testing a trained algorithm, (ii) a visual display indicative of using the trained algorithm to predict the presence or absence of a glycosylation feature, (iii) a visual display indicative of using the trained algorithm to determine the effect of a sequence or structure modification on glycosylation, and (iv) a visual display indicative of using the trained algorithm to modify a glycopeptide. Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.

[0204] Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 105. The algorithm can, for example, (i) train and test a trained algorithm, (ii) use the trained algorithm to predict the presence or absence of a glycosylation feature, (iii) use the trained algorithm to determine the effect of a sequence or structure modification on glycosylation, and (iv) use the trained algorithm to modify a glycopeptide.

LIST OF EMBODIMENTS

[0205] The following list of embodiments of the invention is to be considered as disclosing various features of the invention, which features can be considered to be specific to the particular embodiment under which they are discussed, or which are combinable with the various other features as listed in other embodiments. Thus, simply because a feature is discussed under one particular embodiment does not necessarily limit the use of that feature to that embodiment.

[0206] Embodiment 1. A method for determining the effect of a variation of a reference sequence on glycosylation of a glycosite in the reference sequence, the method comprising:

(a) providing a plurality of sequences comprising (1) the reference sequence and optionally the associated three-dimensional structure, and (2) a plurality of variant sequences, and optionally the associated three-dimensional structures, having one or more amino acid substitution as compared to the reference sequence; and

(b) for each of the plurality of variant sequences, and optionally associated three-dimensional structures: applying a trained algorithm to calculate the predicted presence of a glycosylation feature at the glycosite of each variant sequence and optional associated three-dimensional structure based at least on the amino acid sequence and the optional associated three-dimensional structure of the variant sequence; thereby determining the effect of the variation of the reference sequence on glycosylation of the glycosite.
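Steps (a) and (b) of Embodiment 1 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `toy_predict` is a hypothetical stand-in for the trained algorithm (it only checks for the well-known N-X-S/T sequon pattern, with X not proline), the variants are limited to single substitutions, and the "effect" is taken to be the difference between each variant's score and the reference score. The disclosure does not specify any of these choices.

```python
from typing import Callable, Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_substitution_variants(reference: str, glycosite_index: int) -> List[str]:
    """Step (a): every variant with exactly one substitution, glycosite residue kept fixed."""
    variants = []
    for i, original in enumerate(reference):
        if i == glycosite_index:
            continue
        for aa in AMINO_ACIDS:
            if aa != original:
                variants.append(reference[:i] + aa + reference[i + 1:])
    return variants


def variation_effects(
    reference: str,
    glycosite_index: int,
    predict: Callable[[str, int], float],
) -> Dict[str, float]:
    """Step (b): score each variant and report its change relative to the reference."""
    baseline = predict(reference, glycosite_index)
    return {
        variant: predict(variant, glycosite_index) - baseline
        for variant in single_substitution_variants(reference, glycosite_index)
    }


def toy_predict(sequence: str, glycosite_index: int) -> float:
    """Hypothetical placeholder for a trained algorithm (NOT the disclosed model).

    Returns 1.0 when the residues starting at the glycosite match N-X-S/T
    with X not proline, and 0.0 otherwise.
    """
    tail = sequence[glycosite_index:glycosite_index + 3]
    if len(tail) == 3 and tail[0] == "N" and tail[1] != "P" and tail[2] in "ST":
        return 1.0
    return 0.0


effects = variation_effects("ANVTA", 1, toy_predict)
```

In practice `predict` would be replaced by a call to whatever trained algorithm is in use; the enumeration and comparison logic does not depend on how the score is produced.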

[0207] Embodiment 2. The method of embodiment 1, wherein the glycosylation feature is a specific monosaccharide or a polysaccharide epitope.

[0208] Embodiment 3. The method of embodiment 2, wherein the specific monosaccharide is mannose, sialic acid, fucose, D-glucose (Glc), D-galactose (Gal), N-acetylglucosamine (GlcNAc), N-acetylgalactosamine (GalNAc), D-mannose (Man), N-acetylneuraminic acid (Neu5Ac), N-glycolylneuraminic acid (Neu5Gc), neuraminic acid (Neu), 2-keto-3-deoxynononic acid or 3-deoxy-D-glycero-D-galacto-nonulosonic acid (KDN), 3-deoxy-D-manno-2 octulopyranosylonic acid (Kdo), D-galacturonic acid (GalA), L-iduronic acid (IdoA), L-rhamnose (Rha), L-fucose (Fuc), D-xylose (Xyl), D-ribose (Rib), L-arabinofuranose (Araf), D-glucuronic acid (GlcA), D-allose (All), D-apiose (Api), D-fructofuranose (Fruf), ascarylose (Asc), or ribitol (Rbo), or a combination thereof.

[0209] Embodiment 4. The method of embodiment 2, wherein the polysaccharide epitope is high-mannose, sialylation, fucosylation, hybrid, complexity, core or distally fucosylation, terminal sialylation, terminal galactosylation, terminal GlcNAc-ylation, GlcNAc-bisection, or poly-sialylation, or a glycosylation feature listed in Table 1, or a combination thereof.

[0210] Embodiment 5. The method of any one of embodiments 1-4, wherein the glycosylation feature is an increase in high-mannose in the variant sequence as compared to the reference sequence.

[0211] Embodiment 6. The method of any one of embodiments 1-4, wherein the glycosylation feature is a decrease in high-mannose in the variant sequence as compared to the reference sequence.

[0212] Embodiment 7. The method of any one of embodiments 1-4, wherein the glycosylation feature is an increase in sialylation in the variant sequence as compared to the reference sequence.

[0213] Embodiment 8. The method of any one of embodiments 1-4, wherein the glycosylation feature is a decrease in sialylation in the variant sequence as compared to the reference sequence.

[0214] Embodiment 9. The method of any one of embodiments 1-4, wherein the glycosylation feature is an increase in fucosylation in the variant sequence as compared to the reference sequence.

[0215] Embodiment 10. The method of any one of embodiments 1-4, wherein the glycosylation feature is a decrease in fucosylation in the variant sequence as compared to the reference sequence.

[0216] Embodiment 11. The method of any one of embodiments 1-10, wherein the predicted presence that the glycosite of each variant sequence will have a glycosylation feature is determined at least based on the identity of one or more amino acid sequences varied as compared to the reference sequence.

[0217] Embodiment 12. The method of any one of embodiments 1-11, wherein the pseudoprobability that the glycosite of each variant sequence will have a glycosylation feature is determined at least based on the position of one or more amino acid sequences varied as compared to the reference sequence.

[0218] Embodiment 13. The method of embodiment 12, wherein the position is the distance from the glycosite.

[0219] Embodiment 14. The method of any one of embodiments 1-13, wherein each variant sequence has at least one amino acid substitution as compared to the reference sequence.

[0220] Embodiment 15. The method of any one of embodiments 1-13, wherein each variant sequence has at least two amino acid substitutions as compared to the reference sequence.

[0221] Embodiment 16. The method of any one of embodiments 1-15, wherein the glycosite comprises a glycan-bound arginine, asparagine, serine, threonine, or tyrosine.

[0222] Embodiment 17. The method of embodiment 16, wherein the glycosite further comprises one or more amino acids N-terminal to the glycan-bound arginine, asparagine, serine, threonine, or tyrosine.

[0223] Embodiment 18. The method of embodiment 16 or embodiment 17, wherein the glycosite further comprises one or more amino acids C-terminal to the glycan-bound arginine, asparagine, serine, threonine, or tyrosine.
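The positional notions in Embodiments 13 and 16-18 (distance of a substitution from the glycosite, and a glycosite that includes flanking N-terminal and C-terminal residues) can be sketched as follows. The function names, the 0-based indexing, and the use of signed offsets (negative for N-terminal, positive for C-terminal) are assumptions made for illustration, not conventions stated in the disclosure.

```python
from typing import List


def glycosite_window(sequence: str, glycosite_index: int, n_flank: int, c_flank: int) -> str:
    """Return the glycan-bound residue with up to n_flank N-terminal and c_flank C-terminal residues."""
    start = max(0, glycosite_index - n_flank)
    end = min(len(sequence), glycosite_index + c_flank + 1)
    return sequence[start:end]


def substitution_offsets(reference: str, variant: str, glycosite_index: int) -> List[int]:
    """Return the signed distance from the glycosite of every substituted position."""
    if len(reference) != len(variant):
        raise ValueError("substitution variants must have the same length as the reference")
    return [
        i - glycosite_index
        for i, (ref_aa, var_aa) in enumerate(zip(reference, variant))
        if ref_aa != var_aa
    ]


# Hypothetical sequences; the asparagine at index 3 is treated as the glycosite.
window = glycosite_window("GGANVTAGG", 3, n_flank=2, c_flank=2)
offsets = substitution_offsets("GGANVTAGG", "AGANVSAGG", 3)
```

Signed offsets keep the N-terminal and C-terminal sides distinguishable; taking the absolute value gives a plain distance when direction does not matter.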

[0224] Embodiment 19. The method of any one of embodiments 1-18, wherein the sequence of a first variant sequence is comprised within a peptide.

[0225] Embodiment 20. The method of embodiment 19, further comprising administering a therapeutically effective amount of the peptide based at least in part on determining the effect of the variation of the reference sequence on glycosylation of the glycosite.

[0226] Embodiment 21. A computer system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application for determining the effect of a variation of a reference sequence and optionally the associated three-dimensional structure on glycosylation of a glycosite in the reference sequence, the application comprising: a module programmed to, for each of a plurality of variant sequences, and optionally the associated three-dimensional structures, having one or more amino acid substitution as compared to the reference sequence, apply a trained algorithm to calculate the predicted presence of a glycosylation feature at the glycosite of each variant sequence based at least on the amino acid sequence and the optional associated three-dimensional structure of the variant sequence.

[0227] Embodiment 22. A non-transitory computer-readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for determining the effect of a variation of a reference sequence on glycosylation of a glycosite in the reference sequence, the method comprising:

(a) providing a plurality of sequon sequences comprising (1) the reference sequence and optionally the associated three-dimensional structure, and (2) a plurality of variant sequences, and optionally the associated three-dimensional structures, having one or more amino acid substitution as compared to the reference sequence; and

(b) for each of the plurality of variant sequences, and optionally associated three-dimensional structures: applying a trained algorithm to calculate the predicted presence of a glycosylation feature at the glycosite of each variant sequence and optional associated three-dimensional structure based at least on the amino acid sequence and the optional associated three-dimensional structure of the variant sequence; thereby determining the effect of the variation of the reference sequence on glycosylation of the glycosite.

[0228] Embodiment 23. A system for determining the effect of a variation of a reference sequence and optionally the associated three-dimensional structure on glycosylation of a glycosite in the reference sequence, the system comprising: a database comprising a plurality of sequences comprising (1) the reference sequence and optionally the associated three-dimensional structure, and (2) a plurality of variant sequences, and optionally the associated three-dimensional structures, having one or more amino acid substitution as compared to the reference sequence; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: for each of the plurality of variant sequences, and optionally associated three-dimensional structures: applying a trained algorithm to calculate the predicted presence of a glycosylation feature at the glycosite of each variant sequence and optionally associated three-dimensional structure based at least on the amino acid sequence and the optional associated three-dimensional structure of the variant sequence; thereby determining the effect of the variation of the reference sequence on glycosylation of the glycosite.

[0229] Embodiment 24. The system of any one of embodiments 21-23, wherein the glycosylation feature is a specific monosaccharide or a polysaccharide epitope.

[0230] Embodiment 25. The system of embodiment 24, wherein the specific monosaccharide is mannose, sialic acid, fucose, D-glucose (Glc), D-galactose (Gal), N-acetylglucosamine (GlcNAc), N-acetylgalactosamine (GalNAc), D-mannose (Man), N-acetylneuraminic acid (Neu5Ac), N-glycolylneuraminic acid (Neu5Gc), neuraminic acid (Neu), 2-keto-3-deoxynononic acid or 3-deoxy-D-glycero-D-galacto-nonulosonic acid (KDN), 3-deoxy-D-manno-2 octulopyranosylonic acid (Kdo), D-galacturonic acid (GalA), L-iduronic acid (IdoA), L-rhamnose (Rha), L-fucose (Fuc), D-xylose (Xyl), D-ribose (Rib), L-arabinofuranose (Araf), D-glucuronic acid (GlcA), D-allose (All), D-apiose (Api), D-fructofuranose (Fruf), ascarylose (Asc), or ribitol (Rbo), or a combination thereof.

[0231] Embodiment 26. The system of embodiment 24, wherein the polysaccharide epitope is high-mannose, sialylation, fucosylation, hybrid, complexity, core or distally fucosylation, terminal sialylation, terminal galactosylation, terminal GlcNAc-ylation, GlcNAc-bisection, or poly-sialylation, or a glycosylation feature listed in Table 1, or a combination thereof.

[0232] Embodiment 27. The system of any one of embodiments 21-26, wherein the glycosylation feature is an increase in high-mannose in the variant sequence as compared to the reference sequence.

[0233] Embodiment 28. The system of any one of embodiments 21-26, wherein the glycosylation feature is a decrease in high-mannose in the variant sequence as compared to the reference sequence.

[0234] Embodiment 29. The system of any one of embodiments 21-26, wherein the glycosylation feature is an increase in sialylation in the variant sequence as compared to the reference sequence.

[0235] Embodiment 30. The system of any one of embodiments 21-26, wherein the glycosylation feature is a decrease in sialylation in the variant sequence as compared to the reference sequence.

[0236] Embodiment 31. The system of any one of embodiments 21-26, wherein the glycosylation feature is an increase in fucosylation in the variant sequence as compared to the reference sequence.

[0237] Embodiment 32. The system of any one of embodiments 21-26, wherein the glycosylation feature is a decrease in fucosylation in the variant sequence as compared to the reference sequence.

[0238] Embodiment 33. The system of any one of embodiments 21-32, wherein the pseudoprobability that the glycosite of each variant sequence will have a glycosylation feature is determined at least based on the identity of one or more amino acid sequences varied as compared to the reference sequence.

[0239] Embodiment 34. The system of any one of embodiments 21-33, wherein the pseudoprobability that the glycosite of each variant sequence will have a glycosylation feature is determined at least based on the position of one or more amino acid sequences varied as compared to the reference sequence.

[0240] Embodiment 35. The system of embodiment 34, wherein the position is the distance from the glycosite.

[0241] Embodiment 36. The system of any one of embodiments 21-35, wherein each variant sequence has one amino acid substitution as compared to the reference sequence.

[0242] Embodiment 37. The system of any one of embodiments 21-35, wherein each variant sequence has at least two amino acid substitutions as compared to the reference sequence.

[0243] Embodiment 38. The system of any one of embodiments 21-37, wherein the glycosite comprises an arginine, asparagine, serine, threonine, or tyrosine.

[0244] Embodiment 39. The system of embodiment 38, wherein the glycosite further comprises one or more amino acids N-terminal to the glycan bound arginine, asparagine, serine, threonine, or tyrosine.

[0245] Embodiment 40. The system of embodiment 38 or embodiment 39, wherein the glycosite further comprises one or more amino acids C-terminal to the glycan bound arginine, asparagine, serine, threonine, or tyrosine.

[0246] Embodiment 41. The system of any one of embodiments 21-40, wherein the sequence of a variant sequence is comprised within a peptide.

[0247] Embodiment 42. A method of treatment comprising administering to a subject in need thereof a therapeutically effective amount of the peptide of embodiment 41.

[0248] Embodiment 43. A method of modifying a reference glycopeptide to alter a glycosylation feature of a glycosite of the reference glycopeptide to produce a modified glycopeptide, the method comprising: identifying a predicted presence of the glycosylation feature at a glycosite of a modified glycopeptide, which modified glycopeptide comprises one or more amino acid substitutions to a sequence of the reference glycopeptide, and generating the modified glycopeptide having the one or more amino acid substitutions in the sequence of the reference glycopeptide if the predicted presence is at least a threshold predicted presence.

[0249] Embodiment 44. The method of embodiment 43, wherein the threshold pseudoprobability is about 50%, 60%, 70%, 80%, 90%, or higher.
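The threshold test of Embodiments 43 and 44 amounts to keeping only those candidate modifications whose predicted presence meets the chosen cutoff. A minimal sketch follows; the substitution labels and scores are made-up illustrative values, not results from the disclosure, and the function name and the highest-first ordering are assumptions.

```python
from typing import Dict, List


def select_modifications(predicted_presence: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return candidate modifications whose predicted presence is at least the threshold, highest first."""
    kept = [name for name, score in predicted_presence.items() if score >= threshold]
    return sorted(kept, key=lambda name: predicted_presence[name], reverse=True)


# Hypothetical candidates: label -> predicted presence of the desired glycosylation feature.
candidates = {"V3A": 0.82, "T4S": 0.55, "A5G": 0.31}
selected = select_modifications(candidates, threshold=0.5)
```

Raising `threshold` toward 0.9 corresponds to the stricter cutoffs listed in Embodiment 44 and shrinks the set of modifications that would be generated.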

[0250] Embodiment 45. The method of embodiment 43 or embodiment 44, wherein the predicted presence is determined using a trained algorithm.

[0251] Embodiment 46. The method of any one of embodiments 43-45, wherein the predicted presence is determined at least based on the identity of one or more amino acids varied as compared to the reference sequence.

[0252] Embodiment 47. The method of any one of embodiments 43-45, wherein the predicted presence is determined at least based on the position of one or more amino acids varied as compared to the reference sequence.

[0253] Embodiment 48. The method of embodiment 47, wherein the position is the distance from the glycosite.

[0254] Embodiment 49. A method of modifying a reference glycopeptide to alter a glycosylation feature of a glycosite of the reference glycopeptide to produce a modified glycopeptide, the method comprising: substituting one or more amino acids within 15 amino acids of the glycosite to generate the modified glycopeptide.
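The "within 15 amino acids of the glycosite" condition of Embodiment 49 can be made concrete with a short sketch that lists the eligible positions. It assumes 0-based indexing, reads "within 15" as inclusive (an absolute offset of at most 15), and excludes the glycan-bound residue itself; these are interpretive assumptions, and the function name is hypothetical.

```python
from typing import List


def positions_within(sequence_length: int, glycosite_index: int, max_distance: int = 15) -> List[int]:
    """Return indices at most max_distance residues from the glycosite, excluding the glycosite itself."""
    start = max(0, glycosite_index - max_distance)
    end = min(sequence_length, glycosite_index + max_distance + 1)
    return [i for i in range(start, end) if i != glycosite_index]


# A hypothetical 40-residue glycopeptide with its glycosite at index 20.
eligible = positions_within(40, 20)
```

Near either terminus the window is simply truncated, so a glycosite close to the start of a sequence has fewer eligible positions on its N-terminal side.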

[0255] Embodiment 50. The method of any one of embodiments 43-49, wherein the glycosylation feature is high-mannose, sialylation, fucosylation, or a combination thereof.

[0256] Embodiment 51. The method of any one of embodiments 43-50, wherein the glycosylation feature is an increase in high-mannose in the modified glycopeptide as compared to the reference glycopeptide.

[0257] Embodiment 52. The method of any one of embodiments 43-50, wherein the glycosylation feature is a decrease in high-mannose in the modified glycopeptide as compared to the reference glycopeptide.

[0258] Embodiment 53. The method of any one of embodiments 43-50, wherein the glycosylation feature is an increase in sialylation in the modified glycopeptide as compared to the reference glycopeptide.

[0259] Embodiment 54. The method of any one of embodiments 43-50, wherein the glycosylation feature is a decrease in sialylation in the modified glycopeptide as compared to the reference glycopeptide.

[0260] Embodiment 55. The method of any one of embodiments 43-50, wherein the glycosylation feature is an increase in fucosylation in the modified glycopeptide as compared to the reference glycopeptide.

[0261] Embodiment 56. The method of any one of embodiments 43-50, wherein the glycosylation feature is a decrease in fucosylation in the modified glycopeptide as compared to the reference glycopeptide.

[0262] Embodiment 57. The method of any one of embodiments 43-56, wherein the glycosite comprises an arginine, asparagine, serine, threonine, or tyrosine.

[0263] Embodiment 58. The method of any one of embodiments 43-57, further comprising administering a therapeutically effective amount of the modified glycopeptide to a subject in need thereof based at least in part on the altered glycosylation feature of the modified glycopeptide.

[0264] Embodiment 59. A modified glycopeptide having a first glycosylation feature that is different from a reference glycosylation feature of a glycosite of a reference glycoprotein, wherein the modified glycopeptide has one or more amino acid substitutions in a sequence comprising the glycosite as compared to the reference glycoprotein.

[0265] Embodiment 60. The modified glycopeptide of embodiment 59, wherein the one or more amino acid substitutions is positioned within 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, or 25 amino acids of the glycosite; or wherein the one or more amino acid substitutions is positioned within a sequon comprising the glycosite.

[0266] Embodiment 61. The modified glycopeptide of embodiment 59 or embodiment 60, wherein the first glycosylation feature is a specific monosaccharide or a polysaccharide epitope.

[0267] Embodiment 62. The modified glycopeptide of embodiment 61, wherein the specific monosaccharide is mannose, sialic acid, fucose, D-glucose (Glc), D-galactose (Gal), N-acetylglucosamine (GlcNAc), N-acetylgalactosamine (GalNAc), D-mannose (Man), N-acetylneuraminic acid (Neu5Ac), N-glycolylneuraminic acid (Neu5Gc), neuraminic acid (Neu), 2-keto-3-deoxynononic acid or 3-deoxy-D-glycero-D-galacto-nonulosonic acid (KDN), 3-deoxy-D-manno-2 octulopyranosylonic acid (Kdo), D-galacturonic acid (GalA), L-iduronic acid (IdoA), L-rhamnose (Rha), L-fucose (Fuc), D-xylose (Xyl), D-ribose (Rib), L-arabinofuranose (Araf), D-glucuronic acid (GlcA), D-allose (All), D-apiose (Api), D-fructofuranose (Fruf), ascarylose (Asc), or ribitol (Rbo), or a combination thereof.

[0268] Embodiment 63. The modified glycopeptide of embodiment 61, wherein the polysaccharide epitope is high-mannose, sialylation, fucosylation, hybrid, complexity, core or distally fucosylation, terminal sialylation, terminal galactosylation, terminal GlcNAc-ylation, GlcNAc-bisection, or poly-sialylation, or a glycosylation feature listed in Table 1, or a combination thereof.

[0269] Embodiment 64. The modified glycopeptide of any one of embodiments 59-63, wherein the first glycosylation feature is an increase in high-mannose in the modified glycopeptide as compared to the reference glycopeptide.

[0270] Embodiment 65. The modified glycopeptide of any one of embodiments 59-63, wherein the first glycosylation feature is a decrease in high-mannose in the modified glycopeptide as compared to the reference glycopeptide.

[0271] Embodiment 66. The modified glycopeptide of any one of embodiments 59-63, wherein the first glycosylation feature is an increase in sialylation in the modified glycopeptide as compared to the reference glycopeptide.

[0272] Embodiment 67. The modified glycopeptide of any one of embodiments 59-63, wherein the first glycosylation feature is a decrease in sialylation in the modified glycopeptide as compared to the reference glycopeptide.

[0273] Embodiment 68. The modified glycopeptide of any one of embodiments 59-63, wherein the first glycosylation feature is an increase in fucosylation in the modified glycopeptide as compared to the reference glycopeptide.

[0274] Embodiment 69. The modified glycopeptide of any one of embodiments 59-63, wherein the first glycosylation feature is a decrease in fucosylation in the modified glycopeptide as compared to the reference glycopeptide.

[0275] Embodiment 70. The modified glycopeptide of any one of embodiments 59-69, wherein the glycosite comprises an arginine, asparagine, serine, threonine, or tyrosine.

[0276] Embodiment 71. A method comprising administering a therapeutically effective amount of the modified glycopeptide of any one of embodiments 59-70 to a subject in need thereof based at least in part on the first glycosylation feature of the modified glycopeptide.

[0277] Embodiment 72. A method for determining the likelihood that one or more glycans from a plurality of candidate glycans will be found at a glycosite of a sequence, the method comprising:

(a) providing the sequence (and optionally the associated three-dimensional structure) and the plurality of candidate glycans;

(b) for each of the plurality of candidate glycans: applying a trained algorithm to calculate a predicted presence for each glycan at the glycosite of the sequence determined at least based on one or more amino acids in the sequence (and optionally the associated three-dimensional structure); and

(c) computer processing the predicted presence for each of the plurality of candidate glycans to determine the likelihood that the one or more glycans will be found at the glycosite of the sequence.

[0278] Embodiment 73. The method of embodiment 72, wherein the one or more glycans comprises at least one glycan of Table 1.

[0279] Embodiment 74. The method of embodiment 72 or embodiment 73, wherein the predicted presence of the glycan at the glycosite is determined at least based on the identity of the one or more amino acids in the sequence.

[0280] Embodiment 75. The method of embodiment 72 or embodiment 73, wherein the predicted presence of the glycan at the glycosite is determined at least based on the position of the one or more amino acids in the sequence.

[0281] Embodiment 76. The method of embodiment 72 or embodiment 73, wherein the predicted presence of the glycan at the glycosite is determined at least based on the identity and position of the one or more amino acids in the sequence.

[0282] Embodiment 77. The method of any one of embodiments 72-76, wherein the one or more amino acids in the sequence is located within 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 amino acids of the glycosite.

[0283] Embodiment 78. The method of any one of embodiments 72-77, wherein the glycosite comprises an arginine, asparagine, serine, threonine, or tyrosine.

[0284] Embodiment 79. The method of embodiment 78, wherein the glycosite further comprises one or more amino acids N-terminal to the glycan bound arginine, asparagine, serine, threonine, or tyrosine.

[0285] Embodiment 80. The method of embodiment 78 or embodiment 79, wherein the glycosite further comprises one or more amino acids C-terminal to the glycan bound arginine, asparagine, serine, threonine, or tyrosine.

[0286] Embodiment 81. The method of any one of embodiments 72-80, wherein the sequence is comprised within a peptide.

[0287] Embodiment 82. The method of embodiment 81, wherein precursors of the one or more glycans are glycans present in a host cell during production of the peptide.

[0288] Embodiment 83. The method of embodiment 81 or embodiment 82, wherein precursors of the one or more glycans are glycans present in a host cell medium during production of the peptide.

[0289] Embodiment 84. The method of any one of embodiments 81-83, further comprising administering a therapeutically effective amount of the peptide based at least in part on determining whether the one or more glycans will be found at the glycosite of the sequence.

[0290] Embodiment 85. A computer system comprising a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create an application for determining the likelihood that one or more glycans from a plurality of candidate glycans will be found at a glycosite of a sequence, the application comprising:

(a) a module programmed to, for each of the plurality of candidate glycans, apply a trained algorithm to calculate a predicted presence for each glycan at the glycosite of the sequence determined at least based on one or more amino acids in the sequence (and optionally the associated three-dimensional structure of the sequence) to generate a plurality of predicted presences; and

(b) a processing module programmed to process the plurality of predicted presences to determine the likelihood that the one or more glycans will be found at the glycosite of the sequence.

[0291] Embodiment 86. A non-transitory computer-readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements a method for determining the likelihood that one or more glycans from a plurality of candidate glycans will be found at a glycosite of a sequence, the method comprising:

(a) providing the sequence (and optionally the associated three-dimensional structure) and the plurality of candidate glycans;

(b) for each of the plurality of candidate glycans: applying a trained algorithm to calculate a predicted presence for each glycan at the glycosite of the sequence determined at least based on one or more amino acids in the sequence (and optionally the associated three-dimensional structure); and

(c) computer processing the predicted presence for each of the plurality of candidate glycans to determine the likelihood that the one or more glycans will be found at the glycosite of the sequence.

[0292] Embodiment 87. A system for determining the likelihood that one or more glycans from a plurality of candidate glycans will be found at a glycosite of a sequence, the system comprising: a database comprising the plurality of candidate glycans; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to:

(a) for each of the plurality of candidate glycans: apply a trained algorithm to calculate a predicted presence for each glycan at the glycosite of the sequence determined at least based on one or more amino acids in the sequence (and optionally the associated three-dimensional structure of the sequence); and

(b) process the predicted presence for each of the plurality of candidate glycans to determine the likelihood that the one or more glycans will be found at the glycosite of the sequence.

[0293] Embodiment 88. The system of any one of embodiments 85-87, wherein the one or more glycans comprises at least one glycan of Table 1.

[0294] Embodiment 89. The system of any one of embodiments 85-88, wherein the predicted presence for the glycan at the glycosite of the sequence is determined at least based on the identity of the one or more amino acids in the sequence.

[0295] Embodiment 90. The system of any one of embodiments 85-89, wherein the predicted presence for the glycan at the glycosite of the sequence is determined at least based on the position of the one or more amino acids in the sequence.

[0296] Embodiment 91. The system of any one of embodiments 85-90, wherein the predicted presence for the glycan at the glycosite of the sequence is determined at least based on the identity and position of the one or more amino acids in the sequence.

[0297] Embodiment 92. The system of any one of embodiments 85-91, wherein the one or more amino acids in the sequence is located within 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 amino acids of the glycosite.

[0298] Embodiment 93. The system of any one of embodiments 85-92, wherein the glycosite comprises an arginine, asparagine, serine, threonine, or tyrosine.

[0299] Embodiment 94. The system of embodiment 93, wherein the glycosite further comprises one or more amino acids N-terminal to the glycan bound arginine, asparagine, serine, threonine, or tyrosine.

[0300] Embodiment 95. The system of embodiment 93 or embodiment 94, wherein the glycosite further comprises one or more amino acids C-terminal to the glycan bound arginine, asparagine, serine, threonine, or tyrosine.

[0301] Embodiment 96. The system of any one of embodiments 85-95, wherein the sequence is comprised within a peptide.

[0302] Embodiment 97. The system of embodiment 96, wherein precursors of the one or more glycans are glycans present in a host cell during production of the peptide.

[0303] Embodiment 98. The system of embodiment 96 or embodiment 97, wherein precursors of the one or more glycans are glycans present in a host cell medium during production of the peptide.

Examples

Example 1. Discovery of templating principles for glycoprotein synthesis

[0304] Glycan biosynthesis, unlike DNA, RNA, and protein biosynthesis, has been described previously as “non-templated.” Through structural analysis of site-specific glycosylation data, described herein are protein-sequence and structural features that may predict specific glycan structures. Differences in sequence-predicted glycosylation, which may be referred to as "glycoimpact" herein, increase when the PAM and BLOSUM substitution matrices disagree. High-glycoimpact amino acids may also co-evolve with glycosites. Similarly, high-glycoimpact ClinVar variants observed close to glycosites may be associated with glycan-actuated diseases such as albinism and prion disease. More broadly, glycoimpact may predict disagreement between multiple pathogenicity predictions (e.g., VEP).

[0305] DNA, RNA, and protein sequences may be predictably templated by DNA and codon templates, respectively. Distinctly, glycosylation is often described as a non-templated process. However, protein primary sequence influences glycan diversity and identity. In the initial description of the N-glycosylation sequon, glycans were found to covalently bind asparagine (N) residues with a downstream (N+2) serine (S) or threonine (T) separated by any amino acid (AA) but proline (NXS/T). Variation at N+1 may impact glycan complexity. Glycosylation of a sequon ending in threonine is approximately 40 times more efficient than that of a sequon ending in serine. Upstream of the glycosite, a phenylalanine to alanine substitution in human IgG3 increased bigalactose structures with a core-fucose. Additionally, influenza evolves glycosylation sites to evade immune detection. Tools like GlycoSiteAlign and mutagenesis studies have offered expansions of the primary sequon structure, including the enhanced aromatic sequon: an aromatic residue upstream of the glycosite (N-2) that can influence glycan complexity with a variable impact given the N+1 variation.
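The sequon rule described above (asparagine, then any residue except proline, then serine or threonine) can be illustrated with a short sketch. The function names are hypothetical, and the choice of F, W, and Y as the aromatic set for the N-2 check is an assumption made for illustration rather than a definition taken from this disclosure.

```python
import re

# N-X-S/T sequon: asparagine, any residue except proline, then serine or threonine.
# The lookahead keeps the match zero-width after N, so overlapping sequons
# (e.g. "NNTS") are all reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(seq):
    """Return 0-based positions of asparagines that start an N-X-S/T sequon."""
    return [m.start() for m in SEQUON.finditer(seq.upper())]


def n_minus_2_aromatic(seq, pos):
    """True if the residue two positions upstream (N-2) is F, W, or Y (assumed set)."""
    return pos >= 2 and seq[pos - 2].upper() in "FWY"
```

Applied to the consensus sequence PWQAKVVSRHNLTQGATLLNE (SEQ ID NO: 64), `find_sequons` returns only the central asparagine at 0-based position 10; the second asparagine is not followed by a serine or threonine at N+2.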

[0306] Glycoconjugate influence may be observed beyond the primary sequence. Secondary structures, including β-sheets and α-helices, may influence glycosylation.

[0307] Glycosylation may be determined by both cellular metabolic environments and site-specific, glycoconjugate-defined microenvironments. However, these have not been consolidated in a clear and interoperable mapping from the genome to the glycome. As described herein, glycan biosynthesis was integrated with site-specific protein structure features to generalize the template for glycosylation. Fundamental to this is precursor-limited templating, i.e., a templated process wherein the substrate requested is not always available; thus, the template is difficult to observe without biosynthetic knowledge defining all possible intermediates and the final possible glycans. A glycosylation template may be described as a mapping from “glycoimpactful” protein structure to expected glycan substructures. These glycoimpact relations were validated by comparison to evolutionary substitution matrices and mutation pathogenicity scores. Further, pathogenic glycoimpactful mutations were found enriched near glycosylation sites. Finally, glycoimpact was used to accurately predict changes in glycan complexity, galactosylation, sialylation and functional glycosylation. A model of glycosylation consistent with this work is illustrated in FIG. 1.

[0308] Results

[0309] First, glycan structure was tested to see if it correlates with protein structure. To do this, the Protein-Glycan Enriched Structure Database (PGES-DB), a compendium of glycosylation sites on proteins and their experimentally measured glycans, was built leveraging the UnicarbKB and GlyConnect databases. Glycan structures were decomposed into their substructures that describe intermediates in their biosynthesis (termed “substructures”) using GlyCompare. Features of the glycosite-proximal protein structure were annotated using the Structural System Biology (SSBio) toolkit. Briefly, PGES-DB contains protein structures (empirical (PDB), curated homology models (SWISSMOD) and ab initio homology models (I-TASSER)). This includes 98 glycoproteins with N-glycosylation sites and 38 glycoproteins with O-glycosylation sites including 3,563 N-glycosylation and 700 O-glycosylation events. First, the glycosite-structure annotation was checked to be representative of typical variation in glycosites; the structure-annotated glycosites from the input database were used to train a dimensionality reduction. The results of this Factor Analysis with Mixed Data (FAMD) are shown in FIG. 2A. Glycosites throughout the human secretome were then projected into the reduced space, as illustrated in FIG. 2B. Using a multivariate Gaussian, the probability that each non-input glycosite was within the distribution of input glycosite variation was determined. After a False Discovery Rate (FDR) correction, no outlying glycosite structures were found, as shown in FIG. 2C, indicating that the PGES-DB glycosites are representative of the space of glycosite structure.
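The FDR correction step can be sketched as follows. The text does not state which FDR procedure was applied; the Benjamini-Hochberg step-up procedure is assumed here purely for illustration, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: BH is used as the FDR procedure; the source only says
    "False Discovery Rate (FDR) correction".
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Under this sketch, a glycosite would be flagged as an outlier when its adjusted p-value falls below the chosen FDR level (for example 0.1, the level used elsewhere in this Example).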

[0310] With a representative mapping of the glycosite structure manifold, the associations between glycan substructures (e.g., tri-mannose) and glycosite-proximal protein features (e.g., proximal tyrosine) within PGES-DB were examined (FIG. 3A). The Fisher exact test was used to estimate odds ratios (OR) for intermolecular relations (IMR), which may comprise quantitative associations between glycan substructures and protein features. To further probe these IMR, the augmentation to conditional-marginal probability difference (CMd) and Kullback-Leibler divergence (KLd) between conditional and marginal distributions of protein structure when glycan structure was fixed (G=1 or G=0) and glycan structure when protein structure was fixed (P=1 or P=0) were investigated.
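A minimal sketch of this kind of association test on a 2x2 contingency table is given below. The cell layout (a: feature and substructure both present; b: feature only; c: substructure only; d: neither), the function names, and the 0.5 pseudocount used to keep the log odds ratio finite are all assumptions for illustration, not details taken from this disclosure.

```python
from math import comb, log


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def log_odds_ratio(a, b, c, d, pseudocount=0.5):
    """Natural-log odds ratio; the pseudocount (an assumption) avoids division by zero."""
    a, b, c, d = (v + pseudocount for v in (a, b, c, d))
    return log((a * d) / (b * c))
```

A positive log odds ratio corresponds to co-occurrence of the protein feature and the glycan substructure, and a negative value to anticorrelation.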

[0311] Of 259,114 potential substructure-IMRs, 50,842 relationships were found to be substantial (|Fisher-OR|>0.1) and significant (Fisher-FDR<0.1), as were 10,111 of 26,404 motif-IMRs. Of the 10,111 selected motif-IMRs, 9,296 and 815 showed significant correlation and anticorrelation, respectively. Representative significant IMRs included correlations with glycosylation-site-proximal (within 5 Å) alanine and cysteine and anticorrelations with proximal arginine and valine (FIG. 3B). To further probe the motif-IMRs, the CMd and KLd when either protein structure or glycan structure was known were investigated. KLd was low for significant IMRs (Fisher-FDR<0.1) when protein structure or glycan structures were not present (Mean KLd(P=0) = 0.0038, Mean KLd(G=0) = 0.0054). KLd increased ~10-fold when protein structure was present (Mean KLd(P=1) = 0.032) and ~25-fold when glycan structure was present (Mean KLd(G=1) = 0.138). Overall, the presence of glycan structure and the protein structure may provide substantial information about glycan structure.

[0312] Next, motif-IMRs were examined by estimating the conditional probability between protein and glycan structures. Conditional probability diverged significantly (Fisher-FDR < 0.1) from corresponding marginal probabilities, indicating non-independence. Conditional glycan probabilities (FIG. 3D) and conditional protein structure probabilities (FIG. 3E) show symmetric (Loess estimation) difference from respective marginal probabilities, suggesting no global bias for either condition. The glycan-protein structure non-independence (the absolute difference between conditional and marginal probabilities) was also investigated, stratified by glycan motif size (number of monosaccharides, FIG. 3F); each motif-size bin contains 279-10,658 IMRs (Table 4). For monomeric-motifs, the change from marginal to conditional probability is 2.3-fold greater when glycan structure is known than when protein structure is known (Mean CMd(P|G) = 0.091, CMd(G|P) = 0.040). As motif size increases, the fold-change between glycan and protein specified CMd grows to 34.2-fold increase at 21-mers (Mean CMd(G|P) = 0.038, CMd(P|G) = 0.10), suggesting larger glycans are less clearly informed by protein structures, but in turn, play a larger role in informing protein structure. Despite being post-translational, glycosylation is known to influence protein folding.
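The conditional-marginal difference and the Kullback-Leibler divergence can be sketched for binary presence/absence variables as follows. This is only an illustration: the function names, the Bernoulli treatment, and the use of natural logarithms are assumptions, and the sketch covers just one direction (glycan presence conditioned on the protein feature being present).

```python
from math import log


def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats, taking 0*log(0) as 0."""
    total = 0.0
    for pi, qi in ((p, q), (1 - p, 1 - q)):
        if pi > 0:
            total += pi * log(pi / qi)
    return total


def cmd_and_kld(n_pg, n_p_only, n_g_only, n_neither):
    """Return (CMd, KLd) for glycan presence given the protein feature is present.

    Counts are site-level co-occurrences: both present, feature only,
    glycan substructure only, neither. Requires n_pg + n_p_only > 0.
    """
    n = n_pg + n_p_only + n_g_only + n_neither
    marginal_g = (n_pg + n_g_only) / n           # Pr(G=1)
    conditional_g = n_pg / (n_pg + n_p_only)     # Pr(G=1 | P=1)
    cmd = abs(conditional_g - marginal_g)
    return cmd, bernoulli_kl(conditional_g, marginal_g)
```

As a usage example with counts reported later in this Example (1553 of 1725 N-glycosylation events carrying the hybrid/complex substructure, and all 75 events with a downstream tryptophan carrying it), and assuming those 75 events are a subset of the 1725, `cmd_and_kld(75, 0, 1478, 172)` gives a CMd of 1 - 1553/1725.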

Table 4. Distribution and sample size of IMR conditional probabilities 21 protein 0.124301 0.101235 118 161 279

[0313] Many AA-proximal IMRs are high-confidence, wherein Pr(G|P) is close to 1 or 0 (within 0.001). Confidence in glycan presence increases when a specific protein structure feature is present (FIG. 4A). Of IMRs involving a spatially proximal AA, 20.2% are highly deterministic of glycan substructures. Additionally, 32.4% and 17.5% of down- and upstream AA IMRs (+/-6 aa) are highly deterministic. The certainty with which an AA determines a glycan structure decreases substantially when proximal AAs are absent, to 5%, 0.3% and 0.33% for spatially, downstream and upstream-proximal residues respectively (FIG. 4A). These high-confidence IMRs do not appear to be dominated by small numbers of motifs, as the high-confidence IMR count is proportional to the number of unique substructures (FIG. 4B).

[0314] Among the highly deterministic protein-glycan relations (among 1725 N-glycosylation events), 1553 glycans were observed to contain a GlcNAc on the β-1,6-mannose branch (Glc2NAc(β1-6)Man(α1-6)Man(β1-4)Glc2NAc), indicative of a hybrid or complex N-glycan. All 75 glycosylation events with a downstream tryptophan included glycans with the hybrid/complex substructure (Table 4). These data suggest tryptophan may be sufficient to result in a decrease of oligomannosylation. Similarly, in 454 O-glycosylation events, 237 contain the sialyl-T antigen (Neu5Ac(α2-6)Gal(β1-3)GalNAc). Of 6 events containing a sequence-proximal tryptophan, every event also contained a sialyl-T antigen (Table 4).

[0315] The mapping between glycan and protein structures was described quantitatively. The specific IMRs were quantified using univariate logistic generalized estimating equations (GEE) to probe the site-matched glycan-protein co-occurrences in PGES-DB and control for protein identity effects as described elsewhere herein. The resulting odds ratios (OR) estimate the probability a glycan substructure will appear given a proximal protein structure-feature. Therefore, a list of ORs (protein-glycan structure co-occurrence likelihoods) for each glycan substructure association with one protein structure-feature describes the typical glycosylation observed close to a given protein structure: the “expected substructure abundance” or “expected glycoprofile” for that protein structure. Therefore, when expected glycoprofiles for all glycan substructures are compared across protein structure-features, e.g., the expected glycoprofile change between alanine and isoleucine, the glycosylation impact, or “glycoimpact,” of an AA-substitution may be estimated by considering the difference in expected glycoprofiles across all glycan substructures.

[0316] 1,715 (FDR<0.1, |log(OR)|>0.1) N-glycan IMRs were discovered. Many IMRs associated with structure-proximal (e.g., N+6Å) and sequence-proximal (e.g., N+/-5 residues) AAs were found. Stratifying sequence-proximal effects, approximately twice as many IMRs involving upstream (N-5) than downstream AAs were observed. Among the downstream AA effects, tryptophan, alanine, serine and phenylalanine are most impactful (99, 55, 55, and 48 IMRs respectively). Tryptophan also has many IMRs when downstream and physically proximal (26). Arginine and glutamine are the largest effectors when structurally proximal (70) or downstream (35). Finally, glycosylation sites on turns have the most IMRs (61) (FIG. 5A).

[0317] Turn-associated IMRs include >3-fold increases in di- and tri-sialylated tetra-antennary and >2-fold increases in mono- and di-galactosylated structures with core fucose; all positively correlated structures have at least one galactose while not all are core (FIG. 5B). Structurally proximal glutamine is associated with a >20-fold increase in monosialylated triantennary structures and a 10-fold decrease in tetra-antennary structures (FIG. 5C). Histidine, threonine and valine show increasing correlation with GalNAc[4S] (FIG. 5D). Spearman correlation biclustering between the number of monosaccharides per substructure and protein structure-features in an IMR suggests there may be two major types of protein structure influence mirroring the well-known N-glycan/O-glycan dichotomy in glycosylation. Providing clues to the elusive O-glycosylation site, proximal alanine is negatively correlated with galactose and GlcNAc but positively correlated with GalNAc. Conversely, threonine and histidine are positively associated with GlcNAc and galactose but negatively correlated with β-GalNAc. GlcNAc and Gal (complex-glycan substituents) follow similar trends to Neu5Ac. However, Neu5Ac, GlcNAc and Gal trends diverge near proline, cysteine and valine, suggesting these AAs may be limiters of high complexity (FIG. 3E).

[0318] Given the expected glycoprofile for each protein-feature, the “glycoimpact” of a variety of protein structural transformations (e.g., AA-substitutions) may be explored, i.e., the impact of that transformation on glycan biosynthesis. More specifically, glycoimpact may be defined as the difference between two expected glycoprofiles: the expected difference across two protein structure-features resulting from transformation between those protein structure-features. For example, the relative impact of phenylalanine and tryptophan, two structurally similar aromatics, was compared. An upstream phenylalanine was associated (>3-fold) with core fucosylated tri- and biantennary structures with variable galactosylation. Tryptophan was marginally associated (<2-fold) with core-fucosylated biantennary structures too but more associated (>2-fold) with tetra-antennary structure, suggesting a phenylalanine/tryptophan substitution could impact branch number and core-fucosylation. Upstream phenylalanine was associated (>3-fold) with a Man7 substructure but anticorrelated with a Man6 substructure (>2-fold), suggesting that upstream phenylalanine, in some contexts, prefers larger oligomannosidic structures (FIG. 5F). At the structural level, proximal phenylalanine and tryptophan show related effects. Structure-proximal phenylalanine was correlated (>10-fold) with an increase in sialylation on tri-antennary core-fucosylated structures while Trp was correlated with distal fucosylation (>2-fold) (FIG. 5G). Measured as the normalized Euclidean distance between tryptophan and phenylalanine expected glycoprofiles, the tryptophan/phenylalanine substitution is predicted to be highly glycoimpactful (>4-fold). All AA-substitution glycoimpact scores were calculated as the difference in expected glycoprofiles between each AA-pair. Glycoimpact was calculated at multiple IMR thresholds (FIG. 6). Representative substitution events are shown in FIG. 5H. The glycoimpact AA-substitution matrix may be referred to herein as the BLOSUM-PAM Orthology matrix (BLAMO X:Y); X and Y refer to the logOR and FDR thresholds respectively. LogORs insignificant or unsubstantial by the X:Y threshold are excluded from the glycoimpact calculation. For example, FIG. 5H displays a subset of BLAMO 0.5:0.1 relations.
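A glycoimpact score of this kind can be sketched as a thresholded distance between two expected glycoprofiles. The sketch below rests on several assumptions that the text does not specify: excluded logORs are treated as zero, the Euclidean distance is normalized by the number of substructures compared, and the function and argument names are hypothetical.

```python
from math import sqrt


def glycoimpact(profile_a, profile_b, fdr_a, fdr_b, min_abs_logor=0.5, max_fdr=0.1):
    """Normalized Euclidean distance between two thresholded expected glycoprofiles.

    profile_*: dict mapping glycan substructure -> logOR for one amino acid.
    fdr_*:     dict mapping glycan substructure -> FDR for that logOR.
    Defaults mirror the BLAMO 0.5:0.1 thresholds (logOR 0.5, FDR 0.1).
    """
    def masked(profile, fdr):
        # Assumption: an excluded (unsubstantial or insignificant) logOR counts as 0.
        return {s: (v if abs(v) >= min_abs_logor and fdr.get(s, 1.0) < max_fdr else 0.0)
                for s, v in profile.items()}

    a, b = masked(profile_a, fdr_a), masked(profile_b, fdr_b)
    substructures = sorted(set(a) | set(b))
    if not substructures:
        return 0.0
    squared = sum((a.get(s, 0.0) - b.get(s, 0.0)) ** 2 for s in substructures)
    # Assumption: "normalized" means dividing by the number of substructures.
    return sqrt(squared / len(substructures))
```

Computing this score for every pair of amino acids would yield a symmetric substitution matrix analogous in layout to the BLAMO X:Y matrix described above.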

[0319] To further establish the relevance of glycoimpact, it was compared to established measures of amino acid substitution impact. The PAM and BLOSUM matrices are popular but distinct amino acid substitution matrices. PAM is based on global alignments within a protein, focusing it on evolution and function. Meanwhile, BLOSUM uses local alignment across proteins to highlight structure and conserved domains. Glycoimpact (BLAMO 0.5:0.1) was compared to the divergence between the function-focused PAM and structure-focused BLOSUM matrices. Comparing PAM and BLOSUM scores at multiple thresholds (RMSE(PAMij, BLOSUMij), FIG. 7A, FIG. 6), error in 4 of 5 PAM-BLOSUM comparisons was found to be significantly correlated to glycoimpact for impactful (z>2.5) substitutions; correlation diminished for null-glycoimpact substitutions (FIG. 7A, FIG. 6). The correlation between high-glycoimpact substitutions and PAM-BLOSUM error was maintained for most PAM and BLOSUM thresholds (FIG. 8). These results suggest a positive relationship between glycoimpact and the failure of structure (BLOSUM) to explain function (PAM). Given this relationship, the glycoimpact substitution predictions may be referred to herein as the BLOSUM-PAM Orthology matrix or “BLAMO.”

[0320] Glycoimpact as a measure of pathogenicity of ClinVar mutations within 20 Å (3D min-distance; minimum distance between any two atoms) of a glycosite annotated in UniprotKB was examined. Null and impactful glycoimpact (BLAMO 0.5:0.1) were examined, and glycoimpact was found to be significantly higher (Wilcoxon p=2.2e-7) for ClinVar-pathogenic mutations close to glycosylation sites. The difference trends towards inversion for null glycoimpact values (Wilcoxon p=0.079) (FIG. 7B). One example of a high-glycoimpact and glycosite-proximal mutation is tyrosinase/A355V (P14679), a glycosylation-associated causal mutation in albinism.
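The 3D min-distance used above (the minimum distance between any two atoms of two residues) can be sketched as below. The function names and the representation of each residue as a list of (x, y, z) coordinates in ångströms are assumptions for illustration, as is treating the 20 Å cutoff as inclusive.

```python
from math import dist


def min_distance(atoms_a, atoms_b):
    """Minimum Euclidean distance between any atom of residue A and any atom of residue B."""
    return min(dist(p, q) for p in atoms_a for q in atoms_b)


def near_glycosite(variant_atoms, glycosite_atoms, cutoff=20.0):
    """True if the variant residue lies within `cutoff` angstroms of the glycosite."""
    return min_distance(variant_atoms, glycosite_atoms) <= cutoff
```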

[0321] To determine the proximity of prion disease causing mutations to glycosites, the 3D min-distance from all positions in human PrP (including mutations causing Creutzfeldt-Jakob disease (CJD) and Gerstmann-Straussler disease (GSD)) to the two PrP glycosylation sites, N181 and N197, was measured (FIG. 7D). CJD-causing mutations were approximately twice as close to glycosylation sites as the background distribution of all PrP sites (One-sided Wilcoxon p=0.0003). GSD-causative mutants were also found to trend closer (One-sided Wilcoxon p=0.07) (FIG. 7D, FIG. 9A-9B). Low expression mutants, an indication of possible aberrant glycosylation, were found to trend closer to site N180 (One-sided Wilcoxon p=0.16) and appeared further from N196 (One-sided Wilcoxon p=0.04) (FIG. 9C-9D). Thus, searching for glycan-modulated pathogenic events may be possible using their glycosylation sites or known mutations as a reference point.

[0322] To test if differences in prediction scores across variants could be explained by glycoimpact (BLAMO 0.5:0.1) of their corresponding amino acid changes (see Methods), the disagreement (RMSE) between each of 27 rank-normalized functional impact prediction tools (precomputed with dbNSFP) was measured across 3,549,910 nonsynonymous mutations; impact prediction divergence was correlated with glycoimpact. After hierarchical clustering on the correlation coefficients, tools were separated into two main clusters: one that primarily contained conservation and sequence and/or epigenetic-based tools and another that contained nearly all (6/7) of the protein-structure based tools (FIG. 7C). Nearly all variant impact score differences across the two clusters demonstrated marginally significant correlations with glycodistance. However, these correlations and clustering structure disappear when glycoimpact scores are shuffled (FIG. 10). Ablation by shuffling suggests that glycoimpact explains functional discrepancies between prediction scores.

[0323] To explore evolutionary pressures acting near the glycosite, evolutionary coupling (EC) scores (i.e., the likelihood that amino acids will co-occur in a protein) were calculated from functional-domain alignments of 2,005 glycoproteins. Coupling scores for top-ranked amino acid pairs were examined using multiple score cutoffs from L/5 to 4L (L is the protein alignment length). The number of high-ranking ECs between any amino acid with N-glycosylation sites (GN) was examined. At multiple thresholds, significantly more high-ranking glycosite-coupled ECs (GN) were found than Asn-coupled (N, p<0.025, one-sided Wilcoxon-test) or all background ECs (X, p<0.0013, one-sided Wilcoxon-test, FIG. 7E). GN, N, and X couplings were compared with specific amino acids i-positions N-terminal (N-i) or C-terminal (N+i). Glycosite-coupling with another position-specific amino acid was tested either as an increased GN-coupling probability relative to N or X at a given rank threshold (one-sided Wilcoxon test, FIG. 7F) or as an increased proportion of high-ranking GN-coupled events relative to N or X (FIG. 7G); the increased proportion was measured by hypergeometric enrichment at multiple rank thresholds, then pooled using Fisher’s method, then corrected for multiple testing. Serine and threonine were found to be significantly more coupled with glycosylation sites at the N+2 position. The glycosite-coupling enrichment with serine and threonine was significant as measured by the relative distributions of coupling probabilities (One-sided Wilcoxon p<0.05, FIG. 7F) and the relative number of high-ranking couplings at multiple thresholds (pooled hypergeometric FDR<0.1, FIG. 7H). Several additional position-specific glycosite-coupled residues were found, including phenylalanine at N-2 (hypergeometric p<0.005, FIG. 7G; pooled hypergeometric FDR < 0.1, FIG. 7H), tyrosine at N-1 (pooled hypergeometric FDR < 0.1, FIG. 7H), and tryptophan at N-2 and N-3 (hypergeometric p<0.2 at multiple rank-thresholds).

[0324] Examining position-specific glycosite couplings (hypergeometric enrichment for high-rank ECs pooled across rank-thresholds, FIG. 7G-7H), 13 of 20 amino acids were found to have at least one significant (FDR<0.1) increase in co-occurrence with glycosites over other asparagines (N, red-square) or any amino acid (X, black triangle); position-specific glycosite coupling events enriched over N and X may expand the definition of the sequon while those with either N or X enrichment may be more indicative of glycosite emergence. Seven of ten amino acids implicated in upstream glycosite interactions (those visible in FIG. 2A) show enriched coupling with glycosites; specifically alanine (N-1, 2, 4, 6), aspartic acid (N-2), phenylalanine (N-2), isoleucine (N-2), lysine (N-2), leucine (N-1, 3, 5), and serine (N-2). Several glycosite-coupling events were enriched over either N or X but not both. When coupling probability rank was pooled for each glycosite-relative position, evidence of larger co-coupling events was found. Sequons (e.g., N+/-6) were masked by EC score (only rank<4L were retained), then motifs were clustered and motifs were constructed for 5 motif-clusters (FIG. 7I) and 25 motif-clusters. The N+2 aspartic acid enriched in the univariate analysis (FIG. 7H) co-occurs with an N-2 lysine (FIG. 7I, motif 1). Alternatively, glutamic acid was found more likely to co-occur with other glutamic acid residues (N-4,+1,+3) with an N+2 threonine sequon (FIG. 7I, motif 4). These couplings, reflective of evolutionary pressures surrounding the glycosylation sites, suggest a dramatic expansion of the N-glycosylation site structure.
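The enrichment-and-pooling procedure described in the two preceding paragraphs (hypergeometric enrichment at each rank threshold, pooled with Fisher's method) can be sketched as follows. The function names are hypothetical, the p-values are assumed to be strictly positive, and Fisher's method as written treats the per-threshold p-values as independent.

```python
from math import comb, exp, log


def hypergeom_sf(k, population, successes, draws):
    """Pr(X >= k) when `draws` items are taken without replacement (k >= 0)."""
    upper = min(successes, draws)
    favorable = sum(comb(successes, x) * comb(population - successes, draws - x)
                    for x in range(k, upper + 1))
    return favorable / comb(population, draws)


def fisher_pool(pvalues):
    """Fisher's method: combine p-values via a chi-square statistic with 2k d.o.f."""
    k = len(pvalues)
    half_stat = -sum(log(p) for p in pvalues)  # equals the chi-square statistic / 2
    # Closed-form chi-square survival function for even degrees of freedom:
    # exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, exp(-half_stat) * total)
```

In this sketch, `hypergeom_sf` would be evaluated once per rank threshold (how many high-ranking couplings involve the residue of interest, relative to all couplings), and the resulting p-values passed to `fisher_pool` before multiple-testing correction.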

[0325] Finally, a glycosite-centered alignment of glycosites permitting a tetra-antennary N-glycan with no fucose or sialic acids was examined (FIG. 7J). The glycosite alignment was examined for consistency with high-influence amino acids (of 20 amino acids, 10 upstream residues, and 8 downstream residues, FIG. 5A) and those significantly coupled with glycosylation sites (1.9 residues per position, FIG. 7H). Sixteen of 20 glycosite-flanking amino acids show consistency between the first or second most common amino acids and either the high-influence or glycosite-coupled residues. In the primary glycosite consensus sequence (PWQAKVVSRHNLTQGATLLNE (SEQ ID NO: 64), N+/-10, FIG. 7J), the 5 high-influence residues appearing upstream (S, K, A, Q, W; binomial N=10, p=10/20, Pr(X>5)=0.377) and the 6 high-influence amino acids appearing downstream (T, Q, G, T, L, L; binomial N=9, p=8/20, Pr(X>6)=0.025) indicated an enrichment of highly glyco-influential downstream residues in the consensus. Glycosite-coupled residues in the primary consensus sequence were enriched upstream in the glycosite alignment (P, A, V, S, H; binomial N=10, p=1.9/10, Pr(X>5)=0.00488). At nearly every glycosite-flanking residue (N+/-10), indications of consistency were seen between these three analyses.
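The binomial tail probabilities quoted above can be reproduced with a few lines of code. The function name is hypothetical; the parameters (N, p, and the strict inequality Pr(X>k)) are taken directly from the text.

```python
from math import comb


def binomial_tail_gt(n, p, k):
    """Pr(X > k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1, n + 1))


upstream_influence = binomial_tail_gt(10, 10 / 20, 5)    # approx. 0.377
downstream_influence = binomial_tail_gt(9, 8 / 20, 6)    # approx. 0.025
upstream_coupled = binomial_tail_gt(10, 1.9 / 10, 5)     # approx. 0.0049
```

Note that the quoted values correspond to the strict tail Pr(X>k), i.e., observing more than k matching residues.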

[0326] Finally, to validate the specificity and portability of the predictions, PGES-DB-calculated IMRs were compared to well-studied glycosylation on specific glycoproteins.

[0327] The HIV envelope proteins present several distinct N-glycans. The consistency between previously measured IMRs and HIV glycosylation was examined. PGES-DB-measured IMRs suggest that downstream glutamine was most significantly and substantially (FDR<1e-8; OR<0.5) predictive of complexity while structure-proximal Pro and Lys were weak but significant distinguishers (FDR<1e-3, FDR<0.1 respectively, FIG. 11A). These predictions were compared with site-specific glycan complexity measurements in HIV ENV gp160 (FIG. 11B). As predicted, proline-proximal (within 6 Å) gp160 glycosites presented more oligomannose (two-sided Wilcoxon p=0.0033), whereas glycosites with C-terminus-proximal glutamine were of higher complexity (two-sided Wilcoxon p=1e-4, FIG. 11C). Structure-proximal lysine, a less significant and lower magnitude prediction (FIG. 11C), revealed a nonlinear impact on glycan complexity in HIV gp160, first increasing with one proximal lysine then decreasing with two. Both of the most significant IMRs predicted from PGES-DB were consistent with the site-specific glycosylation observed in HIV gp160.

[0328] Further, differential glycosylation across an Ighg1 missense mutation (Phe299Ile) in the IgG1 heavy chain was predicted. C57BL/6 and CD1 mouse strains expressing the IgG1:Phe299Ile substitution have significantly lower IgG1 sialylation and di-galactosylation than strains (e.g., BALB/c and C3H) expressing wt IgG1. Interestingly, within several BALB/c animals heterozygous for Ighg1 alleles, the Fc-linked N-glycoprofiles of IgG1:Phe299Ile were more similar to those of IgG1:Phe299Ile expressed in C57BL/6 mice than to those of IgG1:Phe299 expressed in the same BALB/c animals (FIG. 11E-11F). The Fc-linked N-glycans of IgG1:Phe299Ile in both BALB/c and C57BL/6 animals presented increased agalactosylation (Mann-Whitney p=1.02e-6) and lower levels of di-galactosylation, mono-, di- and total sialylation (Mann-Whitney p<0.0073), as compared to IgG1:Phe299 expressed in the same animals (FIG. 11E, Table 5). The increase in galactosylation in IgG1:Phe299 is consistent with PGES-DB predicted IMRs for upstream (N-terminal) phenylalanine (FIG. 11D). Upstream phenylalanine is associated with increased di-galactosylated biantennary structures (OR>2), while upstream isoleucine is associated with tetra-antennary galactosylation. Since only bi-antennary structures are generally permitted on IgG, the Gal promotion function of upstream Ile may be unrealized in IgG. The increased sialylation in IgG1:Phe299 is also consistent with PGES-DB IMRs which show an association between structurally proximal phenylalanine and di-sialylated structures (OR>10). These results suggest that glycoimpact can accurately predict the degree of glycan complexity.

Table 5. P-values and FDR correction for two-sample Mann-Whitney tests of glycan abundance distributions

[0329] The SARS-CoV-2 spike S1 subunit in the original 2019 strain was compared to the Gamma and Delta variants. AlphaFold2-predicted S1 subunits were compared using PyMOL (v2.5) root mean square distance (RMSD). RMSD between the 2019, Gamma, Delta and the full trimer (PDB:6VXX) were marginal (RMSD(2019,Gamma)=2.098, RMSD(2019,Delta)=6.387, RMSD(2019,Trimer)=10.435). Measuring Euclidean distance in 3 dimensions, multiple glycosite-proximal mutations (within 15 Angstroms) were found. In the Gamma spike S1, three mutations (L18F, T20N, & D138Y) appeared close to N17, two mutations (P26S & R190S) close to N61, three mutations (L18F, D138Y, & R190S) close to N122, D615G was extremely close to N616, and H655Y was extremely close to N657. In the Delta spike S1, N17 and N122 had 5 and 4 proximal mutations, respectively, and N165 and N616 each had one high-proximity mutation. Of the glycosite-proximal substitutions, only one substitution in each strain had high glycoimpact: L18F in Gamma and F157V in Delta. L18F appeared within 15 Angstroms of N17, N74, and N122 in Gamma. Similarly, F157V appeared within 15 Angstroms of N17, N122, and N165 in Delta. Both high-impact substitutions appeared close to N17 and N122.

[0330] SARS-CoV-2 spike S1 proteins were expressed in HEK293 then glycoprofiled using the proteomics-digestion method DeGlyPHER to determine glycan occupancy (unoccupied, complex, and oligomannose/hybrid) at each glycosylation site. Two independent replicate analyses of the original 2019 strain compared to the Gamma and Delta S1 variants were performed. The proportions of unoccupied, complex, and oligomannose/hybrid observations were compared using a Mann-Whitney test, and p-values were pooled across the two independent replicate analyses using the Fisher method. Three significant differential glycosylation events were observed (FIG. 11G). At N122, high-mannose/hybrid structures replaced complex glycans in both Delta (oligomannose/hybrid observations increased nearly 4-fold from 13.9% in S1 to 52.5%; FDR=3.3e-9) and Gamma (oligomannose/hybrid observations nearly doubled to 27.6%; FDR=7.9e-4). At N331, complex structures increased marginally to replace oligomannose/hybrid glycans in Delta (complex structures increased from 93.7% to 99.7%; FDR=0.031). At N657, complex glycosites became unoccupied in Gamma S1 (complex observations decreased by over 2-fold from 53.3% to 21.1%; FDR=1.27e-3). The N17 site was inconsistently cleaved with the signal peptide, precluding stable measurement in these recombinant products. The Gamma S1 monomer was consistently expressed with two novel complex glycosylation sites at N20 and N188.
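Pooling p-values across replicates with the Fisher method can be sketched as follows. This is an illustrative, dependency-free sketch rather than the code used for the analysis: it relies on the closed-form survival function of a chi-squared distribution with an even number of degrees of freedom, and the input p-values are hypothetical.

```python
from math import exp, log

def fisher_pool(p_values):
    """Pool independent p-values (each > 0) with Fisher's method.

    X = -2 * sum(ln p) follows a chi-squared distribution with 2k degrees
    of freedom under the null. For even degrees of freedom the survival
    function is exp(-X/2) * sum_{i<k} (X/2)^i / i!, which is used here.
    """
    k = len(p_values)
    half_x = -sum(log(p) for p in p_values)  # X / 2
    term, total = 1.0, 1.0                   # i = 0 term of the series
    for i in range(1, k):
        term *= half_x / i
        total += term
    return exp(-half_x) * total

# Hypothetical p-values from two independent replicate analyses.
pooled = fisher_pool([0.01, 0.04])
print(pooled)
```

For two replicates the pooled value reduces to p1*p2*(1 - ln(p1*p2)), which gives a quick sanity check on the result.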

[0331] Eleven of the twelve canonical S1 glycosites (excluding N17) in the original 2019 strain and the Gamma and Delta variants were examined. Significant differential glycosylation was found at N122 in both strains, N657 in Gamma, and N331 in Delta. Based on proximal high-glycoimpact substitutions, changes were predicted at N17, N74 (Gamma only), N122, and N165 (Delta only). Two of four (N122 in Gamma and Delta) predicted differential glycosylation events were consistent with the four observed changes (sensitivity = 0.5), while 15 sites where no change was predicted were consistent with the 17 sites where no change was observed (specificity = 0.88). The significant and substantial differential glycosylation event at N122 was correctly predicted.

[0332] Discussion

[0333] In developing the Protein-Glycan Enriched Structure Database (PGES-DB), the correlation between protein and glycan structure was quantified, described herein as “glycoimpact.” Glycoimpact signatures were validated by comparison to substitution matrices, evolutionary couplings, and pathogenicity scores. Further validation of the glycoimpact predictions was done through comparison to glycosylation on PrP, HIV gp160, and IgG glycoproteins.

[0334] In PGES-DB, an enrichment in protein-glycan associations was inconsistent with independence. The median information gain (KLd) was substantially larger when a protein or glycan structure was present. Consistent with established glycan influence on protein folding, glycan structure may provide information-gain regarding protein structure. Yet, on average, glycans may be less determined by protein structure. Glycan size as a proxy for the influence of metabolic demand on glycan biosynthesis and steric hindrance on protein folding was examined; larger glycans (more monosaccharides) contain more opportunities for precursor limitation. An increased conditional-marginal divergence (CMd) with glycan size was found in glycan-conditioned protein structure. The increased divergence is consistent with previous findings that glycan sterics impact protein folding. Conversely, the protein-conditioned glycan CMd decreased with glycan size, suggesting that the metabolic and processive dependencies may have an inverse impact on the predictability of glycan structure from protein structure.

[0335] Many sequence-proximal amino acid IMRs were found upstream of the glycosite, a region not previously interrogated. For example, the glycoimpact for upstream phenylalanine predicts an increase in structures containing Man7 (seven-mannose low-complexity N-glycan) and a decrease in structures containing Man6, suggesting an increase in larger high-mannose structures.

[0336] Methods

[0337] Enrichment of glycan-protein site-matched data to generate the Protein-Glycan Enriched Structure Database (PGES-DB)

[0338] Starting from site-specific glycosylation events, the annotation of each glycosylation site and glycan was expanded to include detailed site-specific protein structural annotation, and the number of times each substructure appeared in each glycan was recorded. Only human glycoproteins were analyzed. The final database includes 111 proteins, 306 glycosylation sites and 4263 glycans. Initially, site-specific glycosylation events documented in UniCarbKB were used. Later and current work was informed by glycosylation events documented in GlyConnect with supplemental information from GlyGen. Empirical site-specific glycosylation events from UniCarbKB and GlyConnect were used to inform much of the core analysis.

[0339] The protein structure annotation was done using the Structural Systems Biology (SSBio) package in Python. The package uses several tools to perform a variety of annotations. For each human protein, empirical and homology modeled structures were collected from the Protein Data Bank (PDB) and SWISS-MODEL, respectively. Proteins without existing models were modelled using I-TASSER. Protein structures and chemistry close to the glycosylation sites were annotated with multiple software packages through SSBio: sequence properties (EMBOSS pepstats), sequence alignment (EMBOSS needle), secondary structure (DSSP, SCRATCH::SSpro, and SCRATCH::SSpro8), solvent accessibility (DSSP and FreeSASA), and residue depth (MSMS). Additional amino acid aggregate features were calculated using R::seqinr. Spatial proximity was defined using the “min-distance” between two amino acids: the minimum distance between any pair of atoms spanning the amino acids.
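The “min-distance” definition can be illustrated with a small sketch. The coordinates, the residue labels and the 6 Angstrom cutoff below are hypothetical placeholders chosen for illustration; in practice the atom coordinates would come from the annotated structure files.

```python
from itertools import product
from math import dist

def min_distance(residue_a, residue_b):
    """Minimum Euclidean distance between any atom of residue_a and any atom of residue_b.

    Each residue is a list of (x, y, z) atom coordinates in Angstroms.
    """
    return min(dist(a, b) for a, b in product(residue_a, residue_b))

# Hypothetical atom coordinates for a glycosylated Asn and a nearby Phe.
asn = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
phe = [(4.5, 4.0, 0.0), (9.0, 0.0, 0.0)]

d = min_distance(asn, phe)
is_proximal = d <= 6.0  # illustrative proximity cutoff
print(d, is_proximal)
```

Taking the minimum over all atom pairs, rather than the distance between alpha carbons, makes the measure sensitive to long side chains that reach toward the glycosite.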

[0340] Glycan structures were annotated using a combination of glypy (Klein and Zaia 2019) and GlyCompare, for structure parsing and comparison respectively. All glycan substructures, a connected subset of monosaccharides with and without linkage information, were extracted from each glycan, merged to make a superset of substructures, then mapped to each glycan, resulting in a mapping from every glycan in the input database to shared substructures.

[0341] Software and packages

[0342] Protein structure analysis was performed in Python v2.7.15 using SSBIO v0.9.9.8 to retrieve and calculate: existing empirical and homology models from PDB and SWISS-MODEL (PDBe SIFTS), de novo homology models (I-TASSER v5.1), sequence properties (EMBOSS v6.6.0.0 pepstats), sequence alignment (EMBOSS v6.6.0.0 needle), secondary structure (DSSP v3.0.0, SCRATCH v1.1::SSpro and SCRATCH v1.1::SSpro8), solvent accessibility (DSSP v3.0.0 and FreeSASA v2.0.2), and residue depth (MSMS v2.2.6.1). Additional amino acid aggregate features were calculated using R::seqinr.

[0343] Statistical analysis was performed in R v3.6.1. R::entropy v1.2.1 was used for entropy, Kullback-Leibler divergence and other information theoretic calculations. Generalized Estimating Equations (GEE) were fit using R::geepack v1.3.1. Gaussian Mixture Models were used to z-score normalize the glycoimpact using R::mixtools v1.1.0. BLOSUM and PAM substitution matrices were accessed from R::Biostrings v2.52.

[0344] Probability event space, information gain and conditional probability

[0345] An event (a row in the enriched glycosylation-glycosite database) was defined as “the observation of a glycan at a glycosylation site in an experiment.” If two separate experiments in the input database both observe the same glycan at the same site on the same protein, that event was considered to have occurred twice. Within each event, it was considered if the glycan structure random variable (the presence or absence of a specific glycan substructure) is present or absent in the observed glycosylation event and if the protein structure random variable (a proximal amino acid, a secondary structure or another discrete protein structure) is present or absent. A Fisher exact test (R::base::fisher.test) was used to estimate the odds ratio (OR) and significance (p) of each inter-molecular relation (IMR). P-values were corrected for False Discovery Rate (FDR, q) permitting 10% false discovery (q<0.1). Conditional probability was calculated by dividing joint probability by the marginal probability of protein and glycan structure presence. Kullback-Leibler divergence (KLd, R::entropy::KL.Dirichlet, pseudocount=1/6) was calculated by comparing the conditional probability distribution to the marginal probability distributions.
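The Fisher exact test and FDR correction described above can be sketched in plain Python. This is an illustrative sketch and not the R code used in the analysis. Two assumptions are worth stating: the odds ratio below is the simple sample odds ratio (R's fisher.test reports a conditional maximum-likelihood estimate, which generally differs slightly), and the FDR correction is taken to be the Benjamini-Hochberg procedure. The example table and p-values are hypothetical.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact test for the table [[a, b], [c, d]].

    Rows: protein feature present/absent; columns: glycan substructure
    present/absent. Returns (sample odds ratio, p-value), where the p-value
    sums hypergeometric probabilities no larger than the observed one.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    p_obs = prob(a)
    p_value = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(1.0, p_value)

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

odds_ratio, p = fisher_exact_2x2(3, 1, 1, 3)           # hypothetical counts
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])  # hypothetical p-values
significant = [q < 0.1 for q in q_values]               # q<0.1, as in the text
print(odds_ratio, p, q_values, significant)
```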

[0346] Quantitative characterization of Inter-Molecular Relations (IMR) using Generalized Estimating Equations (GEE)

[0347] To characterize the IMRs in the PGES-DB while controlling for protein-specific confounding effects and handling nonlinear relations, a population-averaging approach was used: logistic Generalized Estimating Equations (GEE) with glycoprotein identity as the cluster identifier. An exchangeable correlation structure was used to describe and balance the in-protein similarity. Models were fit to predict glycan substructure binary (presence or absence) from either z-score normalized continuous or binary (presence or absence) protein structures. For each model, the data from PGES-DB was isolated for one glycan-type (N-glycan or O-glycan), one glycan substructure and one protein structure. Incomplete observations (events/rows) were removed and then several checks on each data-slice were run to minimize overfitting. Glycan substructures were excluded from modelling if standard deviation was less than 1e-6 or if there were fewer than 5 observations of the structure within the pertinent data-slice. Discrete protein structure features were excluded if there were fewer than 4 observations within the data-slice. Models were excluded if there were fewer than 4 instances in any cell (of the 2x2 absence/occurrence matrix) or if the chi-squared expected value of any cell was less than or equal to 5. Observations were weighted by the reciprocal count of the corresponding label type to balance label contributions to the model and scaled by exponentiated c-score to maximize the contribution of high-quality protein structure models (w = exp(c)/n); c is the c-score given by I-TASSER and n is the number of times a structure is present (1) or absent (0). Models with |log(OR)|> 50 were excluded as likely overfit. Quasi-likelihood under the independence model criterion (QIC) and the Wald tests were used to evaluate the significance and magnitude of the estimated IMRs. 
This analysis was run using publication identifiers as a control variable to account for researcher and group biases; this produced similar results likely because protein identity is strongly correlated with the publications in which they appear.
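The data-slice checks and observation weights described above can be sketched as follows. This is a minimal illustration, not the original R implementation: the weight form exp(c)/n is an assumption inferred from the prose (reciprocal label count scaled by exponentiated c-score), and the function names and example values are hypothetical.

```python
from math import exp

def observation_weights(labels, c_scores):
    """Weight each event by exp(c-score) / count of its label (assumed form).

    labels: 1 if the glycan substructure is present in the event, else 0.
    c_scores: structure-model confidence score for the event's protein.
    """
    counts = {0: labels.count(0), 1: labels.count(1)}
    return [exp(c) / counts[y] for y, c in zip(labels, c_scores)]

def passes_cell_checks(table, min_count=4, min_expected=5.0):
    """Apply the 2x2 exclusion rules stated in the text.

    A model is kept only if every cell has at least min_count instances and
    every chi-squared expected cell value is strictly greater than min_expected.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    expected = [r * k / n for r in rows for k in cols]
    return min(a, b, c, d) >= min_count and min(expected) > min_expected

weights = observation_weights([1, 1, 0, 1], [0.0, 0.0, 0.0, 0.0])
keep_model = passes_cell_checks([[10, 12], [9, 11]])
print(weights, keep_model)
```

Weights of this kind could be passed to a GEE fitting routine as per-observation weights, so that rare labels and well-modelled structures contribute more to the fit.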

[0348] Calculating glycoimpact from IMRs and populating a BLAMO matrix

[0349] Glycoimpact is calculated for every pair of AAs as the Euclidean distance between the significant and substantial logORs for each AA, that is, the Euclidean distance between the expected glycoprofiles of the two AAs. The substantial (log(OR)>X) and significant (FDR<Y) log(OR) values are retained while insignificant or unsubstantial log(OR) values are set to zero. The resulting matrix describes the expected glycoimpact due to each AA-substitution, termed the BLAMO XY matrix where X and Y denote the log(OR) and FDR thresholds respectively.

[0350] Glycoimpact values from a BLAMO XY matrix may then be z-score normalized to a Gaussian Mixture Model estimated null distribution. z=2.5 may be used as a heuristic, but stringent, cutoff between “impactful” (z>2.5) and “null” (z<2.5) substitutions.
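The construction of a BLAMO XY matrix can be sketched as follows. This is an illustration under stated assumptions: the magnitude threshold is applied to the absolute value of log(OR) so that strong negative associations are also retained, and the per-amino-acid log(OR) and FDR values are hypothetical. Function and variable names are not from the original implementation.

```python
from math import sqrt

def threshold(log_ors, fdrs, x, y):
    """Keep log(OR) values with |log(OR)| > x and FDR < y; set the rest to zero."""
    return [v if abs(v) > x and q < y else 0.0 for v, q in zip(log_ors, fdrs)]

def blamo(profiles):
    """Pairwise Euclidean distance between thresholded per-amino-acid glycoprofiles."""
    aas = sorted(profiles)
    return {
        (a, b): sqrt(sum((u - v) ** 2 for u, v in zip(profiles[a], profiles[b])))
        for a in aas
        for b in aas
    }

# Hypothetical log(OR) and FDR values over three glycan substructures.
profiles = {
    "F": threshold([2.0, 0.3, -1.5], [0.01, 0.01, 0.20], x=1.0, y=0.1),
    "I": threshold([0.5, 1.2, -2.0], [0.01, 0.05, 0.01], x=1.0, y=0.1),
}
matrix = blamo(profiles)
print(matrix[("F", "I")])
```

Each off-diagonal entry is the glycoimpact of substituting one amino acid for the other; the diagonal is zero by construction.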

[0351] Comparison of SNP pathogenicity scores with glycoimpact

[0352] Functional prediction rank normalized scores were obtained from dbNSFP (v3.2) for the following 27 tools: SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, 2x PhyloP, MetaSVM, MetaLR, CADD, VEST3, PROVEAN, 4x fitCons scores, fathmm-MKL, DANN, 2x phastCons, GenoCanyon, Eigen and Eigen-PC. Variants were excluded from the analysis if they had more than 3 missing functional score predictions, did not result in an amino acid change, or were not on proteins that had known glycosylation sites.

[0353] Assignments of “prediction-type” and “structure-usage” were adapted from classifications provided by dbNSFP.

[0354] Estimation and Analysis of Evolutionary Coupling (EC)

[0355] For EVCouplings calculation, hits with more than 50% gaps were filtered from the alignment, and sequences with homologs more than 80% identical were downweighted to compute Neff, the effective number of sequences. ECs were calculated using pseudolikelihood maximization, as implemented previously. The λJ term was scaled by the number of amino acids minus one times the number of sites in the model minus one. Pre- and postprocessing was performed using the EVCouplings Python package.
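The downweighting used to compute Neff can be illustrated with a short sketch. It assumes the commonly used scheme in which each sequence receives a weight equal to the reciprocal of the number of aligned sequences (itself included) that are more than 80% identical to it; this is not the EVCouplings implementation, and the toy alignment is hypothetical.

```python
def effective_sequences(alignment, identity_threshold=0.8):
    """Sum of per-sequence weights, where weight = 1 / (number of sequences
    in the alignment more than identity_threshold identical to it).

    alignment: list of equal-length aligned sequences.
    """
    def identity(s, t):
        return sum(a == b for a, b in zip(s, t)) / len(s)

    total = 0.0
    for s in alignment:
        # Every sequence is 100% identical to itself, so the count is at least 1.
        neighbours = sum(identity(s, t) > identity_threshold for t in alignment)
        total += 1.0 / neighbours
    return total

# Two near-identical sequences (90% identity) and one unrelated sequence.
toy_alignment = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY"]
n_eff = effective_sequences(toy_alignment)
print(n_eff)
```

The two near-duplicates share a single unit of weight, so the three sequences count as two effective sequences.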

[0356] High-ranking EC events are generally considered those ranking less than L, the alignment length within the corresponding protein. Multiple high-rank thresholds between L/5 and 3L were explored. To explore the increased coupling with glycosylation sites, couplings between each amino acid with glycosites (GN), asparagines (N) and any amino acid (AA) were examined. The number of high-ranking coupling events, the distributions of EC probabilities and the relative numbers of high- and low-ranking ECs for each group were compared with various amino acids at relative positions N+/-6. Distributions were compared with a one-sided Wilcoxon test, and high/low-ranking counts were compared with hypergeometric enrichment. The hypergeometric enrichment of glycosite-coupling was performed at multiple high-rank thresholds (L/3, L/2, L, 2L, 3L), and p-values were pooled for each amino acid at each relative position across ranks using Fisher’s method. Finally, the pooled p-values were corrected for multiple tests using the Benjamini-Hochberg method.

[0357] To examine larger structures in ECs, EC rank was used to mask extended sequons (N+/-6). The sequons were then clustered and motifs extracted. For each sequon, the residues were retained if the residue-glycosite coupling rank was less than L4. The extended and masked sequons were distinguished using a Hamming distance (DECIPHER v2.18.1) then clustered using agglomerative hierarchical clustering (factoextra::hcut v1.0.7). Motif logos were generated using custom-scaled position-specific scoring matrices (Wagih 2017) reflecting the cumulative rank of amino acids at each glycosite-relative position.

Specifically, the aggregate score, S, for each amino acid, a, at each position, p, was aggregated over EC score ranks, r, within each extended-masked sequons, s, in a cluster, c, such that

[0358] Mouse breeding and Samples

[0359] The Collaborative Cross (CC) recombinant inbred mouse strains (N = 333, 95 strains, age 20-117 weeks) were produced by Geniad Pty Ltd and housed at Animal Resources Centre (Murdoch, WA, Australia). The CC strains were genotyped using the MegaMUGA platform (GeneSeek; Lincoln, NE). C57BL/6 (N = 10) and BALB/c mice (N = 10), sex- and age-matched (10 weeks old, 1:1 male:female) were obtained from Elevage Janvier (Le Genest-Saint-Isle, France). The studies received appropriate ethics approvals from the Animal Ethics Committee of the Animal Resources Centre and the Ethical Committee of the District Government of Lower Franconia.

[0360] Liquid Chromatography - Mass Spectrometry (LC-MS), Normalization and Statistical Analysis of Mouse Fc-linked IgG N-glycopeptides

[0361] Immunoglobulin G was isolated from 100-500 µl of mouse serum on 96-well Protein G monolithic plates (BIA Separations) as described previously. LC-MS analysis of tryptic Fc-glycopeptides was performed as described previously. In brief, approximately 10-20 µg of isolated IgG was digested with 200 ng trypsin (Worthington, USA). The resulting glycopeptides were purified by reverse-phase solid phase extraction using Chromabond C18ec beads (Macherey-Nagel, Germany). Tryptic digests were analyzed on a nanoACQUITY UPLC system (Waters, USA) coupled to a Compact mass spectrometer (Bruker Daltonics, Germany). Peak areas were calculated by summing areas for doubly and triply charged ions determined with LaCyTools v1.0.1 b.7 software and normalized to the total integrated area per IgG subclass.

[0362] Batch correction was performed on the log-transformed values using the ComBat method (R package “sva”) to remove possible experimental variations due to LC-MS analysis having been performed on several 96-well plates within each cohort. Derived glycosylation traits describing relative abundance of N-glycans sharing specific structural features (agalactosylated, galactosylated, sialylated, monogalactosylated, digalactosylated, monosialylated, disialylated structures, structures with bisecting GlcNAc) were calculated in a subclass-specific manner. Statistical analysis and data visualization were performed using R programming language v4.0.3.

Example 2. Prediction of changes in glycosylation of anti-Ebola Fc antibody

[0363] To determine if predicted changes in glycosylation can also predict glycan-modulated behaviors, the increases in antibody-dependent cellular cytotoxicity (ADCC) were predicted from predicted decreases in core-fucosylation. Anti-Ebola virus antibodies from convalescent plasma were characterized for differential immune modulation. Of these antibodies, five (R292P, S298A, Y300L, V305I, T307A) contained Fc-variation close to the N297 (P0DOX5) glycosylation site and showed increases in ADCC or FcRn binding. For each allotype, the relative glycoimpact (Fisher-OR estimated structure and sequence-proximal IMRs) was queried for each wt and mutant amino acid (e.g., R and P respectively upstream of a glycosite) on the N-glycan core motif with (Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc(b1-4)-Asn; X183) or without (Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-4)-Asn; X69) a core fucose.

Wildtype and mutant glycoimpacts were plotted against each other for structure-proximal (FIG. 16A) and sequence-proximal (FIG. 16B) effects. In 4 of 5 allotypes, structure-proximal IMRs indicate an increase in afucosylated structures (above y=x, equal likelihood in wt and mutant) while core-fucosylation likelihood remains close to equal (FIG. 16A); 3 of 5 allotypes show the same behavior in sequence-proximal IMRs (FIG. 16B). The relative increase in predicted afucosylation for each ADCC-increasing allotype was consistent with the increase in ADCC observed.

Example 3. A trained algorithm for predicting glycans

[0364] Described herein are trained algorithms (e.g., machine learning algorithms) such as the Interloping Saccharide Neural Network Extrapolation (InSaNNE), a machine learning model predicting glycosylation given glycosite-proximal protein structure. Using long short-term memory (LSTM) units within a recurrent neural network, the functional and biosynthetic glycan encodings of SweetTalk, SweetNet and GlyCompare were leveraged to generate an accurate mapping to protein sequence and structure. Several protein structure models and resolutions were explored for performance including empirical, curated and ab initio homology models. The glycosite-glycan pairing model was trained and validated on empirically observed site-specific glycosylation events from UniCarbKB and GlyConnect respectively. Predictions on a small number of important glycosylation events on the coronavirus spike protein, immunoglobulin, and the enhanced aromatic sequon validate the trained model.

[0365] Results

[0366] To predict which glycans will or could be present on a protein, a model was developed to predict whether a given sequon will be glycosylated with a given glycan structure. Provided with such a trained matching model, a list of available glycans (for instance tailored to the species of interest) could then be used as input to determine which glycans are predicted to be permissible with a sequon of interest. Representative models investigated are shown in FIG. 17A.

[0367] To analyze the protein sequences, long short-term memory (LSTM) units, a module used in recurrent neural networks that treats protein sequences as a biological language, were used (FIG. 17B). Two separate LSTM-based modules were integrated into the model for analyzing the sequon and spatially proximal amino acids, respectively. For the analysis of the glycan component, several different modules were tested: (1) a fully connected neural network that used the GlyCompare features of a glycan as input, (2) a glycan-based language model in the style of SweetTalk, and (3) a graph convolutional neural network based on SweetNet.

[0368] On average, the model based on GlyCompare features achieved an accuracy of ~79.7% in predicting whether a given glycan could be found on a specific glycosylation site (Table 6). Models based on recurrent neural networks (~84.5%) or graph convolutional neural networks (~87.1%) further improved on this competitive performance, demonstrating that optimizing the module analyzing glycan sequences yields a substantial increase in prediction performance. This result points to the information-richness of glycans that can be leveraged with state-of-the-art machine learning algorithms. Choosing the SweetNet-based model for further refinement, stochastic weight averaging (SWA) was used to further optimize performance. In this technique, models at multiple training time points were averaged to yield more robust and generalizable models at the end of the training process. Indeed, SweetNet-based models enhanced with SWA achieved an average prediction accuracy of ~90.6% and represented the final model, InSaNNe, which was used for downstream analyses.
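The core operation of stochastic weight averaging, averaging model parameters captured at multiple training time points, can be sketched without any deep learning framework. The parameter names and values below are hypothetical, and this sketch shows only the averaging step, not the training schedule or any normalization-layer updates that a full implementation may require.

```python
def average_weights(snapshots):
    """Average parameter values element-wise across training snapshots.

    snapshots: list of dicts mapping a parameter name to a flat list of floats,
    one dict per training time point, all with identical keys and shapes.
    """
    n = len(snapshots)
    return {
        name: [sum(vals) / n for vals in zip(*(s[name] for s in snapshots))]
        for name in snapshots[0]
    }

# Hypothetical parameter snapshots taken at three training time points.
snapshots = [
    {"layer.weight": [0.2, -0.4], "layer.bias": [0.1]},
    {"layer.weight": [0.4, -0.2], "layer.bias": [0.3]},
    {"layer.weight": [0.6, -0.6], "layer.bias": [0.2]},
]
swa_params = average_weights(snapshots)
print(swa_params)
```

The averaged parameters would then be loaded into a copy of the model and used for prediction in place of the final training checkpoint.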

[0369] After optimizing the analysis of glycan sequences for the model, the role of protein sequences on prediction performance was analyzed. For this, a model was trained that only had access to the glycan sequences and sequons, without additional spatially proximal amino acids. Compared to the full InSaNNe model (~90.6%), this ablated model achieved a slightly worse performance (~89.1%, Table 6, “-Environ”), suggesting there may be relevant information in the three-dimensional context of glycosylation. However, even without access to the spatially proximal amino acids, InSaNNe still retained most of its predictive performance. Such models may be useful in cases where no structural information is available.

[0370] A model was further trained that, in addition to sequon, proximal amino acids, and glycan sequences, also had access to the full protein sequences. The effect of this additional information on performance is shown in Table 6 below (~90.8% accuracy, Table 6, “Whole”).

Table 6. Developing a model for glycan-sequon matching to predict permissible glycans on a glycosylation site.

[0371] Given the remarkable diversity of glycans, performance of the model on different classes of glycans was analyzed. Considering differences in sequons associated with N- and O-linked glycans, the trained model was first tested on both N- as well as O-linked glycans. InSaNNe achieved approximately equally high performance on both N- (96.2% accuracy) as well as O-linked glycans (99.8%) in the dataset, despite the consensus glycosylation sequence of O-linked glycans exhibiting far more diversity than that for N-linked glycans. The high accuracy in these cases, compared to the overall accuracy of ~90.6% that also included predicting non-matches, suggests that InSaNNe exhibits high recall - recovering nearly all permissible glycans at a given sequon.

[0372] To determine if any glycan motif was especially difficult to predict, the average prediction accuracy of the trained InSaNNe model for each GlyCompare feature was calculated. Rare GlyCompare features exhibited a lower prediction accuracy than more frequent motifs, as InSaNNe had more examples and data to learn from for the latter case (FIG. 18A). However, very few examples could result in good prediction performance, and InSaNNe exhibited a predictive accuracy of more than 90% for nearly all motifs (FIG. 18B). In one case, ten examples were sufficient to ensure an accuracy of at least 90% with the trained model. As GlyCompare features represent a hierarchical feature set, rare motifs with low prediction accuracy may not be independent from each other and demonstrated clusters based on their sequence similarity (FIG. 18C). GlyCompare features with lower predictive performance were enriched in a1-3-linked fucose, a glycan modification commonly found in plants and O-linked glycans. Analogous to the glycan features, most sequons exhibited an aggregate predictive accuracy beyond 90% (FIG. 18D); a correlation of prediction performance with the number of observed glycans for that sequon was again observed (FIG. 29). Widespread redundancy in the information extracted from sequon sequences was observed, as removal of single amino acids or short motifs only had a negligible impact on prediction performance (FIG. 30). The flanking residues, rather than the central region that contained the glycosylation consensus sequence, were found to inform model predictions, with a potentially stronger impact of the upstream flank (FIG. 30).

[0373] To illustrate the capabilities of InSaNNe, the sequon GTVLTRNETHATYS (SEQ ID NO: 62) from human uromodulin, the most abundant protein in human urine and relevant for chronic kidney disease, was selected as an example. The trained InSaNNe model was used to predict the probability of all glycans in the dataset to decorate this sequon. All 61 experimentally observed glycans were placed in the top 100 predicted glycans (FIG. 18E). Additionally, glycans in the top 100 that were not yet experimentally reported in conjunction with this sequon shared features with the observed glycans, such as a strong negative charge via sialylation and/or sulfation. These results demonstrate that glycosylation range may be accurately predicted by models described herein.

[0374] Next to predicting specific glycans, the model was also able to predict motifs that are likely to be present in glycans at a given sequon. This feature is relevant as it may permit users to select motif-specific lectins to probe biological samples. The sequon PVQINCTRPN (SEQ ID NO: 63) from human immunodeficiency virus (HIV-1) envelope glycoprotein was used as the input for the trained InSaNNe model. As sequons from HIV-1 proteins were not part of the dataset for developing InSaNNe, this served to validate the model. This procedure resulted in structurally related clusters of predicted glycans (FIG. 18F) that can be compared between different sequons (FIG. 18G).

[0375] Additionally, directly predicting glycan motifs or features facilitates statistical enrichment analyses to determine which motifs are significantly enriched or depleted in this specific glycosylation site. The GlyCompare features of all glycans in the dataset were determined, and the InSaNNe model was trained to predict which GlyCompare features were present in a glycan associated with a sequon. Then, the enrichment of motifs was assessed by a one-sided Wilcoxon rank-sum test, a non-parametric test in which the ranks of predicted glycans with that feature were compared against the null hypothesis that they are randomly distributed across all ranks. If a GlyCompare feature obtained a p-value smaller than 0.05 (corrected by the Benjamini-Hochberg false discovery rate correction), it was deemed significantly enriched in glycans predicted to be present at that sequon.
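The rank-based enrichment test can be sketched as follows. This is an illustrative approximation rather than the original analysis code: it uses the normal approximation to the Wilcoxon rank-sum statistic without a continuity correction, assumes every glycan has a distinct prediction rank from 1 (most likely) to the number of glycans, and uses hypothetical ranks. For small feature sets an exact test would be preferable.

```python
from math import erfc, sqrt

def rank_sum_enrichment(feature_ranks, n_total):
    """One-sided p-value that glycans carrying a feature rank nearer the top.

    feature_ranks: prediction ranks (1 = most likely) of glycans with the feature.
    n_total: total number of ranked glycans.
    Normal approximation to the rank-sum statistic; no tie or continuity correction.
    """
    n1 = len(feature_ranks)
    n2 = n_total - n1
    w = sum(feature_ranks)
    mean = n1 * (n_total + 1) / 2            # expected rank sum under the null
    sd = sqrt(n1 * n2 * (n_total + 1) / 12)  # its standard deviation
    z = (mean - w) / sd                      # positive when ranks are near the top
    return 0.5 * erfc(z / sqrt(2))           # upper-tail normal probability

# Hypothetical example: five feature-carrying glycans among 100 ranked glycans.
p_enriched = rank_sum_enrichment([1, 2, 3, 5, 8], 100)
p_neutral = rank_sum_enrichment([50, 51], 100)
print(p_enriched, p_neutral)
```

The resulting p-values for all features would then be adjusted for multiple testing before applying the 0.05 cutoff.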

[0376] The trained InSaNNe model, which has learned the relationship between sequons and glycan ranges, was then used to probe what effect changes to the sequon would have on predicted glycans.

[0377] As O-linked glycans exhibit a considerably different sequon architecture, this analysis focused on N-linked sequons. Then, for each position in a 14-amino acid sequon, all amino acids were iteratively replaced with a given amino acid in all sequons in the dataset. These modified sequons were used as inputs for the InSaNNe model, and the changes in predicted glycans compared to the wild-type sequence were assessed. To aid interpretation, glycans were grouped into “high-mannose”, “sialylated”, and “fucosylated” to analyze common features that are relevant for human glycobiology. Changes in predicted probability for each of these features when modifying a given amino acid at different positions of the sequon were thus tracked (FIG. 19).
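
The substitution scan described above can be sketched as follows. The `predict` callable is a hypothetical stand-in for the trained model (returning, for example, the predicted probability of one glycan group), and the optional `protected` positions are an assumption added for illustration; the text does not state whether any position was held fixed.

```python
def substitution_scan(sequon, new_residue, protected=()):
    """Yield (position, mutated_sequon) for every position that changes.

    Positions are 0-based. Positions listed in `protected` are left alone.
    """
    for pos, old in enumerate(sequon):
        if pos in protected or old == new_residue:
            continue
        yield pos, sequon[:pos] + new_residue + sequon[pos + 1:]

def mean_shift(sequons, new_residue, predict, protected=()):
    """Average change in a predicted score per position, relative to wild type.

    `predict` maps a sequon string to a number (hypothetical model wrapper).
    """
    totals, counts = {}, {}
    for seq in sequons:
        baseline = predict(seq)
        for pos, mutant in substitution_scan(seq, new_residue, protected):
            totals[pos] = totals.get(pos, 0.0) + predict(mutant) - baseline
            counts[pos] = counts.get(pos, 0) + 1
    return {pos: totals[pos] / counts[pos] for pos in totals}
```

Running `mean_shift` once per amino acid and per glycan group would give a position-by-residue table of shifts comparable in shape to the one summarized in FIG. 19.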

[0378] For multiple amino acids, distinct changes in the predicted glycosylation range of modified sequons were observed, with a clear difference between changes to upstream and downstream regions. While the introduction of some amino acids (e.g., tyrosine) had the same qualitative effect regardless of where they were introduced, other amino acids (e.g., cysteine) seemed to have diverging effects, with an increase in high-mannose glycans when introduced upstream and a decrease when introduced downstream. While there is a clear difference between upstream and downstream, within these two regions the effect seemed to be largely monotonic, albeit with some changes in the direction of predicted effects (e.g., glutamate first increasing high-mannose when introduced downstream and then, further downstream, decreasing high-mannose).

[0379] One possible advantage of predicting glycosylation is a considerable increase in scale and speed for characterizing protein glycosylation, especially in the context of newly discovered proteins, for which obtaining glycan information might otherwise take several months. So, the utility of InSaNNe as an annotation tool for protein glycosylation was next demonstrated.

[0380] One possible prediction in this realm is the high-confidence prediction of sequons that will be modified by N-linked glycosylation. To leverage and extend this capability, the glycosylation ranges of 2,763 human N-linked sequons deposited in the database GlyConnect were predicted, focusing on human proteins given that the model was trained on human data. For this, annotated glycosylation sites together with the six upstream and seven downstream amino acids were extracted, resulting in sequons that were used as inputs for the trained InSaNNe model. Then, for each sequon, the likelihood of each of the 199 N-linked glycans in the dataset was predicted. A threshold of 0.6, corresponding to a false-positive rate below 10% while still maintaining a true-positive rate above 75%, was chosen (FIG. 28A).
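
The window extraction and thresholding described above can be sketched as follows. The function names are hypothetical, and padding with "X" at the protein termini is an assumption; the text does not say how sites near a terminus were handled.

```python
def extract_sequon(protein_seq, site, upstream=6, downstream=7, pad="X"):
    """Return the 14-residue window around a 1-based glycosylation site.

    Six residues upstream, the glycosylated residue, and seven downstream.
    Where the window runs past a terminus it is padded with `pad`
    (an assumption made for this sketch).
    """
    idx = site - 1
    if not 0 <= idx < len(protein_seq):
        raise ValueError("site lies outside the protein sequence")
    left = protein_seq[max(0, idx - upstream):idx]
    right = protein_seq[idx + 1:idx + 1 + downstream]
    return (pad * (upstream - len(left)) + left + protein_seq[idx]
            + right + pad * (downstream - len(right)))

def call_glycans(predictions, threshold=0.6):
    """Keep glycans whose predicted likelihood reaches the threshold."""
    return sorted(g for g, p in predictions.items() if p >= threshold)
```

With the uromodulin example, a site at the central asparagine of a longer sequence yields the sequon GTVLTRNETHATYS.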

[0381] With that, the hit rate (also known as recall or sensitivity) of the predictions within GlyConnect was assessed by considering how many already recorded structures were matched by predicted structures (FIG. 28B) and how many compositions could be explained and refined by InSaNNe predictions (FIG. 28C).

[0382] One of the most fundamental distinctions between glycans is between highly processed, complex glycans and immature, oligomannose glycans. It has been reported that an aromatic residue two positions N-terminal from a glycosylation site would decrease complexity at the site; this sequon has been termed the enhanced aromatic sequon. An L to F substitution two residues upstream of the CD2 glycosylation site may transform the site from predominantly complex and hybrid structures to oligomannose structures. When InSaNNe evaluates the same sequences, the F allele sequence shows significantly higher predicted presence for higher-mannose structures. An individual enrichment for 7-mannose structures (one-sided Mann-Whitney-Wilcoxon, p=0.017) and an overall increase in oligomannose structure predicted-presence for the F allele (linear model; Wald, p<0.001; F-statistic, p=7.44×10⁻⁵; FIG. 20A) were observed. A corresponding decrease in predicted-presence for sialylated structures in the F allele (one-sided Mann-Whitney-Wilcoxon, p<1e-4; FIG. 20B) was also observed.

[0383] The model could also capture variation in SARS-CoV-2 glycan complexity. Oligomannose at N234 is consistently high (80-100%) and appears necessary to support the open ACE2-binding spike conformation. Predictions made by models described herein (e.g., InSaNNe) show a strong preference for Man5 and Man9 structures and a strong preference against sialylation (FIG. 20C). The next-most immature sites, N717 and N801 (30-55%), see a near-complete obliteration of predicted sialylation (FIG. 20C). Predictions for all glycosylation sites were mostly consistent with empirical observations (FIG. 21). The spike of new strains was also examined to see if their glycosylation could be predicted.

[0384] Examining site N616 in a simulated D614G variant (FIG. 22) and N717 in a T714I variant (FIG. 20D), distinct changes in predicted glycosylation were found. To focus attention on relevant changes, those with non-negligible wt predicted-presence (>0.1) and substantial fold change (|logFC|>1) relative to the wt were further examined. At site N717 in the T714I variant, many asialylated sugars with between 1 and 3 galactose decreased relative to wt. Additionally, a small number of sugars with 0-2 sialic acids and 1-4 galactose increased. Though InSaNNe predicts that site N717 becomes variably hospitable to mono-, di-, tri- and tetra-antennary sialylated and asialylated structures, empirically it is an oligomannose site, suggesting these terminal galactoses may not be visible without additional mutations to the site. By contrast, InSaNNe reveals few confident changes at site N616 in the D614G variant (FIG. 22).
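
The filter described above can be sketched as follows. The function name and dictionary layout are hypothetical, and the base of the logarithm is an assumption (base 2 is used here), since the text does not specify it.

```python
import math

def notable_changes(wt, variant, min_wt=0.1, min_abs_logfc=1.0):
    """Return {glycan: logFC} for glycans passing both filters.

    wt and variant map glycan names to predicted-presence values.
    A glycan is kept when its wt predicted-presence exceeds min_wt and
    the absolute log2 fold change exceeds min_abs_logfc.
    """
    kept = {}
    for glycan, wt_p in wt.items():
        var_p = variant.get(glycan)
        # Skip glycans missing from the variant, negligible in wt,
        # or with a non-positive variant value (log undefined).
        if var_p is None or wt_p <= min_wt or var_p <= 0:
            continue
        logfc = math.log2(var_p / wt_p)
        if abs(logfc) > min_abs_logfc:
            kept[glycan] = logfc
    return kept
```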

[0385] Mutations resulting in differential glycosylation in IgG3 have been reported. These data measured the abundance of 8 complex biantennary structures in human IgG3 for wt and glycosite-proximal (N297; P01860:N227) mutants. The wt IgG3 shows a preference for core-fucose and b1-6-branch galactose, R301A increased all terminal galactose, and Y296A accepted no galactosylation (FIG. 23A). As these data show, primary protein structure can profoundly influence glycosylation.

[0386] InSaNNe predictions for the R301A and Y296A mutants were compared with these measurements, and background-adjusted predicted-presence and change in adjusted predicted-presence were found to correlate with empirical occupancy. Predicted-presence was highly predictive of R301A abundance (R²=0.876; FIG. 23B) and moderately predictive of wt abundance (R²=0.25; FIG. 23B). Predicted-presence was a moderate predictor of measured IgG3:N297 abundance in the Y296A mutant (R²=0.33; FIG. 23B). Prediction accuracy increased when changes in predicted and observed values were compared. The log fold-change in predicted-presence in R301A relative to wt was highly correlated with measured log-abundance (R²=0.87; FIG. 23C). Yet, the consistency in predicted vs. observed change for Y296A decreased dramatically (R<0, R²=0.27; FIG. 23C). To further probe the failure to predict change in Y296A, glycans with a predicted absolute log fold-change less than 1 were removed, and it was found that abundance prediction accuracy for wt (R²=0.52), R301A (R²=0.99), and log fold-change (R²=0.95) increased (FIG. 23D-23E) while nearly all predictions for Y296A dropped out. These results suggest that InSaNNe can predict changes in abundance, not only presence.

[0387] Methods

[0388] Site-specific glycosylation training set construction

[0389] As described elsewhere herein, empirical site-specific glycosylation data from humans was obtained from UniCarbKB and GlyConnect with supplemental information from GlyGen. The protein structure annotation was done using the Structural Systems Biology (SSBio) package in Python. Protein structure analysis was performed in Python v2.7.15 using SSBio v0.9.9.8 to retrieve and calculate existing empirical and homology models from PDB and SWISSMOD (PDBe SIFTS), de novo homology models (I-TASSER v5.1), sequence properties (EMBOSS v6.6.0.0 pepstats), sequence alignment (EMBOSS v6.6.0.0 needle), secondary structure (DSSP v3.0.0, SCRATCH v1.1::SSpro and SCRATCH v1.1::SSpro8), solvent accessibility (DSSP v3.0.0 and FreeSASA v2.0.2), and residue depth (MSMS v2.2.6.1). Additional amino acid aggregate features were calculated using R::seqinr. Glycan structures were annotated using a combination of glypy and GlyCompare for structure parsing and comparison, respectively. All glycan substructures, each a connected subset of monosaccharides with and without linkage information, were extracted from each glycan, merged to make a superset of substructures, then mapped to each glycan, thus resulting in a mapping from every glycan in the input database to shared substructures.

[0390] For the dataset used to train SweetNet, 2,313 unique glycosylation events were extracted from UniCarb. This included the glycan sequence that was observed and the sequon (14 amino acids, with the glycosylated amino acid in the center) and structural information in the form of additional amino acids within 6 Å if structural simulations converged. As negative examples, the same number of combinations of sequons and glycans that have not been observed was generated.
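
One way to generate such negative examples is sketched below. The function name is hypothetical, and drawing negatives uniformly at random from all unobserved sequon-glycan combinations is an assumption; the text only states that an equal number of unobserved combinations was generated.

```python
import random

def sample_negatives(observed_pairs, seed=0):
    """Draw as many unobserved (sequon, glycan) pairs as there are observed ones.

    observed_pairs: iterable of (sequon, glycan) tuples seen in the data.
    Candidates are all combinations of the observed sequons and glycans
    that do not appear in observed_pairs.
    """
    observed = set(observed_pairs)
    sequons = sorted({s for s, _ in observed})
    glycans = sorted({g for _, g in observed})
    candidates = [(s, g) for s in sequons for g in glycans
                  if (s, g) not in observed]
    if len(candidates) < len(observed):
        raise ValueError("not enough unobserved combinations to sample from")
    # A seeded generator keeps the sketch reproducible.
    return random.Random(seed).sample(candidates, len(observed))
```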

[0391] Model construction

[0392] All glycan-sequon matching models comprised (1) a recurrent neural network that analyzed the amino acid sequence of the sequon, (2) another recurrent neural network analyzing the amino acids of the three-dimensional sequon surroundings, (3) a model analyzing the glycan sequence, described below, and (4) a part consisting of fully connected layers to use the concatenated features generated by the previous modules to predict whether a glycan is permissible on a sequon. The recurrent neural networks consisted of a 128-dimensional embedding layer followed by two bidirectional long short-term memory (LSTM) layers. The fully connected model part consisted of a linear layer, a leaky ReLU (rectified linear unit) activation function, a batch normalization layer, and a multi-sample dropout scheme followed by a sigmoid function.

[0393] Three different model architectures for the glycan analysis module were compared. For assessing GlyCompare, the glycan analysis module comprised a fully connected neural network using the 12,259 GlyCompare features as inputs for two linear layers interspersed with dropout, leaky ReLU, and batch normalization layers. For the model containing a SweetTalk-based language model for glycan analysis, glycans were converted to glycowords, and a bidirectional recurrent neural network was used. For the SweetNet-based model, glycans were converted to graphs by constructing a list of nodes (representing monosaccharides or linkages) and edges to denote graph connectivity. The corresponding model contained an embedding layer and three graph convolutional layers, interspersed by leaky ReLUs, Top-K pooling layers, and both global mean and global maximum pooling operations. Model architectures and hyperparameters were optimized using cross-validation.

[0394] Model training and prediction

[0395] All models were trained with an NVIDIA® Tesla® K80 GPU using PyTorch. The data were split on a protein level into 80% for training and 20% for testing. For the RNNs, all protein and glycan sequences were brought to the same length by padding. Linear layers and RNNs were initialized using Xavier initialization while SweetNet-type models were initialized using a sparse initialization scheme with a sparsity of 10%.
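
A protein-level split of the kind described above can be sketched as follows. The record layout (dictionaries with a "protein" key) and the function name are hypothetical. Note that this sketch assigns 20% of proteins, not exactly 20% of records, to the test set; the text does not say which of the two was intended.

```python
import random

def split_by_protein(records, test_fraction=0.2, seed=0):
    """Split records so that no protein appears in both train and test.

    records: list of dicts, each with a 'protein' key (hypothetical layout).
    """
    proteins = sorted({r["protein"] for r in records})
    random.Random(seed).shuffle(proteins)
    n_test = max(1, round(len(proteins) * test_fraction))
    test_proteins = set(proteins[:n_test])
    train = [r for r in records if r["protein"] not in test_proteins]
    test = [r for r in records if r["protein"] in test_proteins]
    return train, test
```

Splitting at the protein level keeps all sequons from one protein on the same side of the split, so test performance is not inflated by protein identity.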

[0396] A batch size of 32 was used for all models. As an optimizer, adaptive moment estimation (ADAM) was used with a weight decay value of 0.00001 and a starting learning rate of 0.00001, which decayed according to a cosine function over 170 epochs. Models were trained for a maximum of 250 epochs, with an early stopping criterion of 25 epochs without a decrease in validation loss. Binary cross-entropy was used as a loss function. Beginning from epoch 150, stochastic weight averaging was additionally employed with a learning rate of 0.0001.
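
The learning-rate schedule and early-stopping rule described above can be sketched in framework-independent form. The names below are hypothetical, and decaying to a minimum learning rate of zero and holding it there after epoch 170 are assumptions; the stochastic weight averaging phase is not shown.

```python
import math

def cosine_lr(epoch, base_lr=1e-5, decay_epochs=170, min_lr=0.0):
    """Cosine-decayed learning rate; held at min_lr after decay_epochs."""
    t = min(epoch, decay_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t / decay_epochs))

class EarlyStopping:
    """Signal a stop after `patience` epochs without a lower validation loss."""

    def __init__(self, patience=25):
        self.patience = patience
        self.best = math.inf
        self.stale = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```

In a training loop, `cosine_lr(epoch)` would set the optimizer's learning rate at the start of each epoch, and `step` would be called once per epoch with the validation loss, for at most 250 epochs.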

[0397] Presence or absence of each glycan may be predicted from the trained InSaNNe model. To heuristically boost signal for glycans with limited representation in the training set, a naturalistic background of predicted presence was generated for each glycan. Predictions were generated from all training-set sequons to capture the biases and variation of the dataset as a background predicted-presence distribution for each glycan. The background-adjusted predicted-presence is the product of predicted presence and the predicted-presence cumulative probability (statsmodels::ECDF v0.12.2) relative to the naturalistic background for that glycan.
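
The background adjustment described above can be sketched with a plain empirical cumulative distribution function. The function names are hypothetical; the ECDF here returns the fraction of background values less than or equal to the query, which is assumed to match the default behavior of the statsmodels ECDF.

```python
from bisect import bisect_right

def ecdf(background):
    """Return a function giving the fraction of background values <= x."""
    ordered = sorted(background)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf

def background_adjusted(pred, background):
    """Predicted presence times its cumulative probability in the background.

    background: predicted-presence values for the same glycan across all
    training-set sequons.
    """
    return pred * ecdf(background)(pred)
```

For example, a prediction of 0.3 against a background of [0.1, 0.2, 0.3, 0.4] has a cumulative probability of 0.75 and an adjusted value of 0.225.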

[0398] Supplemental Results

[0399] Protein structure optimization and ablation demonstrate that all included feature types support predictive performance

[0400] To determine suitable dataset preparations, a random forest classifier was trained to distinguish oligomannose from complex glycosylation sites, given protein structure and surface data. Dataset preparations included protein structure model type (ab initio [I-TASSER], curated [SWISSMOD], or empirical [PDB]), and proximity radius defining “proximal” amino acids (4-10 Å). Hyperparameters were optimized using 500 iterations of grid-search. Labels were balanced using up-sampling and performance was evaluated using area under the receiver operating characteristic curve (AUROC), sensitivity and specificity on two iterations of six-fold cross-validation; each fold contained non-overlapping groups of proteins to avoid overfitting due to protein identity.

[0401] By varying these parameters, the optimal protein model and annotation resolution were evaluated (FIG. 24). Models trained on I-TASSER protein structures with a 6 Å annotation resolution showed the highest performance across all three metrics relative to random forest models trained on I-TASSER proteins annotated at other resolutions. Among models trained using PDB protein structures, those trained on data annotated at 8 Å performed well across all three metrics. The best PDB-trained models showed, on average, comparable AUROC, a 3.5% decline in sensitivity and a 13% decline in specificity compared to I-TASSER-trained models. SWISSMOD-trained models did not have a clear best resolution, though average scores were mostly comparable to those trained using PDB or I-TASSER structures.

[0402] To determine the importance of different protein structure annotations, an ablation analysis was performed by removing major types of data from the training set and comparing performance to models trained on all data (FIG. 25). The significance of each depletion in performance relative to models trained on all data was pooled (2-sample t-test, Fisher’s method for pooling p-values); p-values were pooled within each ablation and performance metric across models trained on I-TASSER, PDB and SWISSMOD protein structures at all resolutions. AUROC, sensitivity and specificity are all sensitive to ablations in secondary structures, depth, and upstream amino acids (FDR<0.01475). Overall, each major data type appears to help maintain performance across all three metrics.
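
Fisher's method for pooling p-values, as used in the ablation analysis, can be sketched as follows. The function name is hypothetical. The statistic is -2 times the sum of the log p-values, which under the null hypothesis follows a chi-square distribution with 2k degrees of freedom; for even degrees of freedom the upper-tail probability has the closed form used here.

```python
import math

def fisher_pool(pvalues):
    """Pool independent p-values with Fisher's method.

    Returns the upper-tail probability of a chi-square distribution with
    2k degrees of freedom evaluated at -2 * sum(log p).
    """
    if not pvalues or any(not 0.0 < p <= 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # half of the Fisher statistic
    # For 2k degrees of freedom: sf(x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

The method assumes the pooled p-values are independent, which may hold only approximately when models share training data.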

Example 4: Trained algorithm for predicting core fucosylation

[0403] An algorithm as described elsewhere herein was trained on glycoprotein data to predict the likelihood of core fucosylation at a glycosite in a given protein sequence. Input glycoprotein data was split into a training set to train the network and a validation set for cross-validation of the trained algorithm. FIG. 26A shows the validation loss as a function of training epoch, indicating that the trained algorithm is a generalizable model for predicting core fucosylation of previously unseen protein sequences. FIG. 26B shows the validation accuracy as a function of training epoch, demonstrating that the trained algorithm may correctly predict core fucosylation of previously unseen protein sequences.

[0404] While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.