Title:
METHOD FOR DETERMINING THE DEGREE OF PHOSPHORYLATION AND THE DEGREE OF GLYCOSYLATION OF A PROTEIN IN A PROTEIN SAMPLE
Document Type and Number:
WIPO Patent Application WO/2016/139335
Kind Code:
A1
Abstract:
The present invention relates to methods for determining at least the degree of glycosylation and the degree of phosphorylation of a protein in a protein sample. In particular, the present invention relates to methods for determining at least a degree of phosphorylation and a degree of glycosylation of a protein in a protein sample, the method comprising the steps of: (a) receiving an infrared (IR) spectrum obtained by Fourier transform infrared (FTIR) spectroscopy of the protein sample, and (b) determining from the same IR spectrum the degree of phosphorylation of the protein and the degree of glycosylation of the protein in the protein sample.

Inventors:
DERENNE ALLISON (BE)
GOORMAGHTIGH ERIK (BE)
Application Number:
PCT/EP2016/054619
Publication Date:
September 09, 2016
Filing Date:
March 04, 2016
Assignee:
UNIVERSITÉ LIBRE DE BRUXELLES (BE)
International Classes:
G01N33/68; G06F19/16
Other References:
GUERRERO ANDRES ET AL: "Top-Down Analysis of Highly Post-Translationally Modified Peptides by Fourier Transform Ion Cyclotron Resonance Mass Spectrometry", JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY, ELSEVIER SCIENCE INC, US, vol. 26, no. 3, 18 November 2014 (2014-11-18), pages 453 - 459, XP035454372, ISSN: 1044-0305, [retrieved on 20141118], DOI: 10.1007/S13361-014-1034-5
BARTH ET AL: "Infrared spectroscopy of proteins", BIOCHIMICA ET BIOPHYSICA ACTA. BIOENERGETICS, AMSTERDAM, NL, vol. 1767, no. 9, 28 August 2007 (2007-08-28), pages 1073 - 1101, XP022217976, ISSN: 0005-2728, DOI: 10.1016/J.BBABIO.2007.06.004
A. GOLDSZTEIN ET AL: "Gastric ATPase phosphorylation/dephosphorylation monitored by new FTIR-based BIA-ATR biosensors", SPECTROSCOPY, vol. 24, no. 3-4, 1 January 2010 (2010-01-01), XP055210325, DOI: 10.3233/SPE-2010-0402
MARK R. EMMETT: "Determination of post-translational modifications of proteins by high-sensitivity, high-resolution Fourier transform ion cyclotron resonance mass spectrometry", JOURNAL OF CHROMATOGRAPHY A, vol. 1013, no. 1-2, 1 September 2003 (2003-09-01), pages 203 - 213, XP055210326, ISSN: 0021-9673, DOI: 10.1016/S0021-9673(03)01127-0
KHAJEHPOUR ET AL: "Infrared spectroscopy used to evaluate glycosylation of proteins", ANALYTICAL BIOCHEMISTRY, ACADEMIC PRESS INC, NEW YORK, vol. 348, no. 1, 1 January 2006 (2006-01-01), pages 40 - 48, XP005206937, ISSN: 0003-2697, DOI: 10.1016/J.AB.2005.10.009
GAIGNEAUX ET AL., APPLIED SPECTROSCOPY, vol. 60, no. 9, 2006, pages 1022 - 1029
ALTSCHUL ET AL., J MOL BIOL, vol. 215, 1990, pages 403 - 10
TATUSOVA; MADDEN, FEMS MICROBIOL LETT, vol. 174, 1999, pages 247 - 250
GELADI ET AL., ANALYTICA CHIMICA ACTA, vol. 185, 1986, pages 1 - 17
WOLD ET AL., CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, vol. 58, 2001, pages 109 - 130
BYLER ET AL., BIOPOLYMERS, vol. 25, 1986, pages 469 - 87
JACKSON ET AL., CRIT. REV. BIOCHEM. MOL. BIOL., vol. 30, 1995, pages 95 - 120
GOORMAGHTIGH ET AL., BIOPHYS J., vol. 90, 2006, pages 2946 - 57
GOORMAGHTIGH ET AL., ADV. BIOMED. SPECTROSC., vol. 2, 2009, pages 104 - 128
GOORMAGHTIGH ET AL., SPECTROCHIM. ACTA., vol. 50, 1994, pages 2137 - 2144
Attorney, Agent or Firm:
GYI, Jeffrey et al. (9830 Sint-Martens-Latem, BE)
Claims:
CLAIMS

1. A method for determining at least a degree of glycosylation and a degree of phosphorylation of a protein in a protein sample, the method comprising the steps of:

(a) receiving an infrared (IR) spectrum obtained by Fourier transform infrared (FTIR) spectroscopy of the protein sample, and

(b) determining from the same IR spectrum the degree of phosphorylation of the protein and the degree of glycosylation of the protein in the protein sample.

2. Method according to claim 1, wherein step (b) is performed by calculating from the IR spectrum the degree of glycosylation of the protein using a first mathematical model configured to determine the degree of glycosylation of a protein, and calculating from the IR spectrum the degree of phosphorylation of the protein using a second mathematical model configured to determine the degree of phosphorylation of a protein.

3. Method according to claim 2, wherein the first mathematical model is prepared from standard IR spectra obtained by FTIR spectroscopy of a first set of reference protein samples using a statistical tool, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein at least one, preferably each, reference protein sample has a degree of glycosylation which is different from the degree of glycosylation of at least one, preferably each, other reference protein sample in the first set, and wherein the second mathematical model is prepared from standard IR spectra obtained by FTIR spectroscopy of a second set of reference protein samples using a statistical tool, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein at least one, preferably each, reference protein sample has a degree of phosphorylation which is different from the degree of phosphorylation of at least one, preferably each, other reference protein sample in the second set.

4. Method according to any one of claims 1 to 3, further comprising the step of determining from the same IR spectrum the protein concentration in the protein sample.

5. Method according to claim 4, wherein the protein concentration is determined by calculating the concentration of the protein from the IR spectrum using a third mathematical model configured to determine a protein concentration.

6. Method according to claim 5, wherein the third mathematical model is prepared from standard IR spectra obtained by FTIR spectroscopy of a third set of reference protein samples using a statistical tool, wherein each reference protein sample has a known protein concentration and wherein at least one, preferably each, reference protein sample has a protein concentration which is different from the protein concentration of at least one, preferably each, other reference protein sample in the third set.

7. Method according to any one of claims 3 to 6, wherein the statistical tool is selected from the group consisting of partial least square (PLS) regression, classical least square (CLS) regression, inverse least square (ILS) regression, principal component regression (PCR), multiple linear regression (MLR), and support vector machine (SVM), preferably wherein the statistical tool is PLS regression.

8. Method according to any one of claims 1 to 7, wherein the protein sample contains an internal reference compound.

9. Method according to claim 8, wherein the internal reference compound is an azide, ferrocyanide, deuterated lipid, or nitrile.

10. Method according to any one of claims 1 to 9, wherein determining the degree of glycosylation comprises determining the degree of one or more types of glycosylation.

11. Method according to any one of claims 1 to 10, further comprising the step of determining from the same IR spectrum of the protein sample the degree of aggregation of the protein and/or the degree of denaturation of the protein in the protein sample.

12. Method according to claim 11, wherein the degree of aggregation and/or the degree of denaturation is determined by comparing the IR spectrum of the protein sample with IR spectra of the same protein at progressively increasing temperatures.

13. Method according to any one of claims 1 to 12, further comprising the step of determining from the same IR spectrum of the protein sample one or more secondary structure characteristics of the protein in the protein sample.

14. Method according to claim 13, wherein the secondary structure characteristic is selected from the group consisting of alpha helix, beta sheet, beta turn, 3₁₀ helix, and random structure.

15. Method according to any one of claims 1 to 14, for determining the degree of phosphorylation, the degree of glycosylation, the concentration, the degree of one or more types of glycosylation, the degree of aggregation, the degree of denaturation, and the one or more secondary structure characteristics of the protein in the protein sample.

Description:
METHOD FOR DETERMINING THE DEGREE OF PHOSPHORYLATION AND THE DEGREE OF GLYCOSYLATION OF A PROTEIN IN A PROTEIN SAMPLE

FIELD OF THE INVENTION

The invention is broadly in the field of quality assurance (QA), more precisely in the field of quality assurance of protein products. In particular, the invention concerns methods for determining at least the degree of phosphorylation and the degree of glycosylation of protein in protein products.

BACKGROUND OF THE INVENTION

During the last decades, the number of protein products in development and entering the market has dramatically increased. Yet, a rapid commercialization of these protein products remains a challenge notably due to their chemical and physical instabilities. The inherent complexity of proteins requires the development of analytical strategies to characterize and ensure the quality and the safety of these protein products.

During therapeutic protein development and protein manufacturing, the quality of protein products regularly has to be assured. In particular, there is a constant need for QA in development and manufacturing of protein products such as protein therapeutics. There is also a need for monitoring shelf-life, storage conditions, and/or stress testing of proteins (e.g. stored under different conditions such as for different time periods and/or at different temperatures) and for quality assurance during optimization of process conditions (e.g. upscaling a reactor).

Chemical and physical properties of proteins such as protein aggregation, protein denaturation, secondary structure of a protein, protein concentration, glycosylation, and phosphorylation, are a major concern because they significantly impact the biological activity and the immunogenicity of the protein product. For instance, protein aggregation is a common source of protein instability. Moreover, misfolding or alterations in the three-dimensional structure of proteins can be responsible for a loss of activity and can elicit immune responses. Posttranslational modifications such as glycosylation and phosphorylation are also important issues. Phosphorylation is the reversible addition of a phosphate group on a polypeptide chain. It generally influences structural properties, dynamics and binding, and is often involved in signalling pathways. Glycosylation is a modification that attaches a carbohydrate to a protein. This modification is involved in protein folding, interaction, stability, signal transduction and mobility. In the context of protein production, protein glycosylation is subject to a high degree of heterogeneity depending on the manufacturing conditions, which underscores the importance of evaluating the proportion of sugar in a commercial protein product. As these chemical and physical properties affect the efficacy and the safety of protein products, they must be carefully and systematically monitored.

Accordingly, a need exists to develop further and improved methods for QA in therapeutic protein development and in protein manufacture such as methods for determining physical and chemical properties of a protein in a protein sample.

SUMMARY OF THE INVENTION

The invention provides technology adapted to the determination, including quantitative analysis, of at least the degree of phosphorylation and the degree of glycosylation of a protein in a protein sample.

In a first aspect, the invention provides a method for determining at least a degree of phosphorylation and a degree of glycosylation of a protein in a protein sample, the method comprising the steps of:

(a) receiving an infrared (IR) spectrum obtained by Fourier transform infrared (FTIR) spectroscopy of the protein sample, and

(b) determining from the IR spectrum the degree of phosphorylation of the protein and the degree of glycosylation of the protein in the protein sample.

As illustrated in the example section, the inventors have found that a single IR spectrum obtained by FTIR spectroscopy allows the determination of the degree of phosphorylation and the degree of glycosylation despite the fact that both post-translational modifications (i.e., glycosylation and phosphorylation) absorb in overlapping spectral regions. Both the degree of phosphorylation and the degree of glycosylation may be determined simultaneously or essentially simultaneously. Preferably, the determining is performed using only one IR spectrum of the protein sample. Advantageously, the present method allows determining a degree of phosphorylation and a degree of glycosylation of a protein in a protein sample based on a single IR spectrum of the protein obtained by a single analysis. Such method also advantageously allows the quantification of protein phosphorylation and protein glycosylation with a low amount of sample and in a short measurement time.
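By way of illustration only, the following minimal sketch shows how two separately calibrated regression models might be applied to one and the same spectrum when the two modifications absorb in overlapping regions. It assumes single-response PLS regression (NIPALS form), the statistical tool named as preferred in the claims. The band positions, band widths, wavenumber grid, training levels and all function names are hypothetical stand-ins chosen for the sketch. They are not data or code from the invention, and the synthetic spectra omit protein bands, noise and baseline effects.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_pls1(X, y, n_components):
    """Fit a single-response PLS model (NIPALS) on spectra X and reference values y."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered spectra
    f = [v - y_mean for v in y]                                # centered response
    comps = []
    for _ in range(n_components):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:  # nothing left to explain; stop early
            break
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]       # scores
        tt = dot(t, t)
        p_load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = dot(f, t) / tt
        # Deflate spectra and response before extracting the next component.
        E = [[E[i][j] - t[i] * p_load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, p_load, q))
    return x_mean, y_mean, comps

def predict_pls1(model, x):
    """Predict the response for one spectrum x using a fitted PLS model."""
    x_mean, y_mean, comps = model
    e = [a - m for a, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, p_load, q in comps:
        t = dot(e, w)
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, p_load)]
    return y_hat

# Hypothetical synthetic data: two overlapping Gaussian bands on an arbitrary grid.
grid = list(range(950, 1201, 10))  # wavenumbers in cm^-1, illustrative only

def band(center, width):
    return [math.exp(-((v - center) / width) ** 2) for v in grid]

carb_band, phos_band = band(1050, 40), band(1090, 40)

def spectrum(carb, phos):
    return [carb * a + phos * b for a, b in zip(carb_band, phos_band)]

levels = [(c, ph) for c in (0.0, 5.0, 10.0, 20.0) for ph in (0.0, 1.0, 2.0)]
X = [spectrum(c, ph) for c, ph in levels]
glyco_model = fit_pls1(X, [c for c, _ in levels], n_components=2)  # first model
phos_model = fit_pls1(X, [ph for _, ph in levels], n_components=2)  # second model

unknown = spectrum(12.0, 1.5)  # one spectrum, read by both models
carb_pred = predict_pls1(glyco_model, unknown)
phos_pred = predict_pls1(phos_model, unknown)
```

In this noise-free toy case two components are enough for each model to recover both contents from the single spectrum. With measured spectra the number of components would have to be chosen by validation.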

The present method also advantageously obviates the need to generate reference protein samples and measure calibration IR spectra each time a protein sample is analysed for determining a degree of phosphorylation and a degree of glycosylation, for instance during therapeutic protein development or during QA of a manufactured protein product.

The above and other characteristics, features and advantages of the present invention will become apparent from the following detailed description, which illustrates, by way of example, the principles of the invention.

DESCRIPTION OF THE FIGURES

Figure 1 represents a graph illustrating the Root Mean Square Error of Cross-Validation (RMSECV) as a function of the number of PLS components included in a first mathematical model according to an embodiment of the invention.

Figure 2 represents a graph illustrating the PLS regression with a first mathematical model according to an embodiment of the invention to determine the carbohydrate content (the mass percentage of carbohydrate). Each number represents one IR spectrum used to build the model. Calibration parameters of the regression were r₁ = 0.9951 and RMSECV₁ = 0.2110.

Figure 3 represents a graph illustrating the Root Mean Square Error of Cross-Validation (RMSECV) as a function of the number of PLS components included in a second mathematical model according to an embodiment of the invention.

Figure 4 represents a graph illustrating PLS regression with a second mathematical model according to an embodiment of the invention to determine phosphate content (the mass percentage of phosphate). Each number represents one IR spectrum used to build the model. Calibration parameters of the regression were r₂ = 0.9959 and RMSECV₂ = 0.0725.

Figure 5 represents a graph illustrating the predicted values of the carbohydrate content (in mass percentage of carbohydrate) of protein samples as a function of the true values of the carbohydrate content (in mass percentage of carbohydrate) of the protein samples. Each point represents one IR spectrum used to validate a first mathematical model according to an embodiment of the invention.

Figure 6 represents a graph illustrating the predicted values of the phosphate content (in mass percentage of phosphate) of protein samples as a function of the true values of the phosphate content (in mass percentage of phosphate) of the protein samples. Each point represents one IR spectrum used to validate a second mathematical model according to an embodiment of the invention.

Figure 7 represents a graph illustrating the Root Mean Square Error of Cross-Validation (RMSECV) as a function of the number of PLS components included in a third mathematical model according to an embodiment of the invention.

Figure 8 represents a graph illustrating PLS regression with a third mathematical model according to an embodiment of the invention to determine protein concentration. Each number represents one IR spectrum used to build the model. Calibration parameters of the regression were r₃ = 0.9803 and RMSECV₃ = 0.6786.

Figure 9 represents a graph illustrating the predicted values of the protein concentration of protein samples as a function of the true values of protein concentration of the protein samples. Each point represents one IR spectrum used to validate a third mathematical model according to an embodiment of the invention.

Figure 10 represents a graph illustrating the differences in absorbance (i.e., difference spectra) between mean IR spectra of albumin for each temperature (as indicated in the right margin) and the mean IR spectra of albumin at room temperature (control). A Student t-test was computed at every wavenumber (expressed in cm⁻¹) with a significance level of α = 0.1%. Each marked wavenumber (black stars) indicates a statistically significant difference in absorbance between the means. The Y axis represents the intensity of the absorbance. For better readability, the difference spectra were offset along the absorbance axis.
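For illustration, the wavenumber-by-wavenumber comparison described for Figure 10 can be sketched as follows. The sketch assumes the equal-variance (pooled) two-sample form of the Student t-test and takes the critical value as a fixed number, because the Python standard library offers no t-distribution. The function name, the toy absorbance values and the use of three spectra per group are hypothetical. The critical value 8.610 is the tabulated two-sided value for 4 degrees of freedom at a significance level of 0.1%.

```python
import math
from statistics import mean, variance

def difference_spectrum_t(control, treated):
    """Return (difference spectrum, t statistics) for two groups of spectra.

    control, treated: lists of spectra, each a list of absorbances recorded on
    one shared wavenumber grid. Uses the pooled-variance Student t statistic.
    """
    n1, n2 = len(control), len(treated)
    diffs, t_values = [], []
    for j in range(len(control[0])):
        a = [s[j] for s in control]
        b = [s[j] for s in treated]
        diff = mean(b) - mean(a)
        pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
        std_err = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
        if std_err > 0:
            t = diff / std_err
        else:  # no spread in either group at this wavenumber
            t = 0.0 if diff == 0 else math.copysign(math.inf, diff)
        diffs.append(diff)
        t_values.append(t)
    return diffs, t_values

# Hypothetical absorbances at two wavenumbers, three spectra per group.
control = [[0.10, 0.50], [0.12, 0.50], [0.08, 0.50]]
heated = [[0.11, 0.80], [0.09, 0.82], [0.10, 0.78]]

T_CRITICAL = 8.610  # two-sided, alpha = 0.1 %, 4 degrees of freedom (t table)
diffs, t_values = difference_spectrum_t(control, heated)
significant = [abs(t) > T_CRITICAL for t in t_values]  # the "black stars"
```

Only the second wavenumber is flagged in this toy example. A different group size changes the degrees of freedom and therefore the critical value.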

Figure 11 represents a graph illustrating the stability index of albumin obtained by FTIR spectroscopy as a function of temperature.
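The RMSECV plotted in Figures 1, 3 and 7 can be understood through the following minimal sketch. It assumes leave-one-out cross-validation, which is one common scheme and is not specified in the figure descriptions. The model used here is a deliberately trivial stand-in (it predicts the mean of the training values) so that the block stays self-contained. All names and numbers are hypothetical.

```python
import math

def rmsecv(X, y, fit, predict):
    """Leave-one-out root mean square error of cross-validation.

    fit(X_train, y_train) returns a model; predict(model, x) returns a value.
    """
    squared_errors = []
    for i in range(len(X)):
        # Hold out sample i, calibrate on the rest, predict the held-out sample.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        squared_errors.append((predict(model, X[i]) - y[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Stand-in model (assumption): ignores the spectrum, predicts the training mean.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, x):
    return model

spectra = [[0.1], [0.2], [0.3]]   # hypothetical one-point "spectra"
contents = [1.0, 2.0, 3.0]        # hypothetical reference values
baseline_rmsecv = rmsecv(spectra, contents, fit_mean, predict_mean)
```

To obtain a curve of the kind shown in the figures, the same function would be called once per candidate number of components, each time passing a fit function that calibrates a PLS model with that number of components. The component count giving a low RMSECV without overfitting would then be retained.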

DETAILED DESCRIPTION OF THE INVENTION

Before the present method and devices used in the invention are described, it is to be understood that this invention is not limited to particular methods, components, or devices described, as such methods, components, and devices may, of course, vary. It is also to be understood that the terminology used herein is not intended to be limiting, since the scope of the present invention will be limited only by the appended claims.

As used herein, the singular forms "a", "an", and "the" include both singular and plural referents unless the context clearly dictates otherwise.

The terms "comprising", "comprises" and "comprised of" as used herein are synonymous with "including", "includes" or "containing", "contains", and are inclusive or open-ended and do not exclude additional, non-recited members, elements or method steps. The terms "comprising", "comprises" and "comprised of" also include the term "consisting of".

The recitation of numerical ranges by endpoints includes all numbers and fractions subsumed within the respective ranges, as well as the recited endpoints. The term "about" as used herein when referring to a measurable value such as a parameter, an amount, a temporal duration, and the like, is meant to encompass variations of +/-10% or less, preferably +/-5% or less, more preferably +/-1% or less, and still more preferably +/-0.1% or less of and from the specified value, insofar as such variations are appropriate to perform in the disclosed invention. It is to be understood that the value to which the modifier "about" refers is itself also specifically, and preferably, disclosed.

Unless otherwise defined, all terms used in disclosing the invention, including technical and scientific terms, have the meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. By means of further guidance, definitions for the terms used in the description are included to better appreciate the teaching of the present invention.

Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment, but may. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to a person skilled in the art from this disclosure, in one or more embodiments. Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention, and form different embodiments, as would be understood by those in the art. For example, in the following statements, any of the claimed embodiments can be used in any combination.

The present invention relates to a method for determining at least a degree of glycosylation and a degree of phosphorylation of a protein in a protein sample, the method comprising the steps of: (a) receiving an infrared (IR) spectrum obtained by Fourier transform infrared (FTIR) spectroscopy of the protein sample, and (b) determining from the same IR spectrum the degree of glycosylation of the protein and the degree of phosphorylation of the protein in the protein sample.

The terms "determining" and "predicting" may be used interchangeably herein. The terms "determined" and "predicted" may be used interchangeably herein. The terms "determination" and "prediction" may be used interchangeably herein. The recitation "determining a degree of a given modification (such as glycosylation, phosphorylation, or a type of glycosylation)" as used herein refers to determining the presence or absence of the modification, determining an amount of the modification, and/or monitoring the modification.

The phrase "determining an amount of a given modification (such as glycosylation, phosphorylation, or a type of glycosylation)" refers to determining the presence of the modification within a discrete range, or determining the absolute amount of the modification. The discrete range may indicate, for instance, "high" or "low" modification content.

The phrase "determining an amount of glycosylation and phosphorylation in a protein sample" may encompass determining the presence of glycosylation and phosphorylation in a protein sample within a discrete range. The discrete range may indicate, for instance, "high" or "low" glycosylation and phosphorylation content (i.e., semi-quantitative measure). The phrase may also encompass determining the absolute amount of (i.e., quantitatively) glycosylation and phosphorylation in a protein sample.

The terms "determining an amount" or "quantifying" may be used interchangeably herein and refer to determining the presence within a discrete range (i.e., semi-quantitative determination) or determining the absolute amount (i.e., quantitative determination).

The terms "amount" or "quantity" may be used interchangeably herein.

Accordingly, in certain embodiments, the present invention relates to a method for quantifying at least the phosphorylation and the glycosylation of a protein in a protein sample, the method comprising the steps of: (a) receiving an IR spectrum obtained by FTIR spectroscopy of the protein sample, and (b) quantifying from the same (i.e., a single) IR spectrum the phosphorylation of the protein and the glycosylation of the protein in the protein sample.

The term "glycosylation" as used herein refers to a carbohydrate, in particular a glycan, attached to a protein such as in a glycoprotein. Protein glycosylation is one type of modification, in particular one type of post-translational modification.

The carbohydrate may be a monosaccharide, an oligosaccharide or a polysaccharide. The term "monosaccharide" generally refers to a single sugar unit without glycosidic connection to other such units. The term "oligosaccharide" generally refers to compounds in which 2 to 20 monosaccharide units are joined by glycosidic linkages. According to the number of units, they are called disaccharides, trisaccharides, tetrasaccharides, pentasaccharides etc. The term "polysaccharide" generally refers to a polymer or macromolecule consisting of monosaccharide units joined together by glycosidic bonds.

The oligosaccharide or polysaccharide may be linear or branched. The oligosaccharide or polysaccharide may contain a substantial proportion of amino sugar residues.

The following aspects of glycosylation can be modified: the glycosidic bond (i.e., the site of glycan linkage), the glycan composition (i.e., the types of sugars that are linked to a given protein), the glycan structure (i.e., unbranched or branched chains of sugars), and the glycan length (i.e., short- or long-chain oligosaccharides or polysaccharides).

The main types of glycosylation include N-linked glycosylation and O-linked glycosylation. In N-linked glycosylation, the addition of sugar can happen at the amide nitrogen on the side-chain of the asparagine. In O-linked glycosylation, the addition of sugar can happen on the hydroxyl oxygen on the side-chain of hydroxylysine, hydroxyproline, serine, or threonine.

The main sugars (type, abbreviation) found in human glycoproteins include β-D-glucose (Hexose, Glc), β-D-galactose (Hexose, Gal), β-D-mannose (Hexose, Man), α-L-fucose (Deoxyhexose, Fuc), N-Acetylgalactosamine (Aminohexose, GalNAc), N-Acetylglucosamine (Aminohexose, GlcNAc), N-Acetylneuraminic acid (Aminononulosonic acid or Sialic acid, NeuNAc), and xylose (Pentose, Xyl).

The term "degree of glycosylation" as used herein refers to the presence or absence of glycosylation and/or the amount of glycosylation.

In certain embodiments, the degree of glycosylation may be the amount of glycosylation. The amount of glycosylation is also referred to herein as the carbohydrate content (e.g., expressed in mass percentage).

The "carbohydrate content" refers to the fraction of mass corresponding to carbohydrate in a protein sample divided by the total mass of the protein (including mass of carbohydrate) in the protein sample. The mass may be expressed in grams. The carbohydrate content may be expressed in mass percentage.

The carbohydrate content is related to the total mass of protein. From the spectroscopic point of view, a series of absorption bands related to the carbohydrate is ratioed to a protein absorption band.

In certain embodiments, the degree of glycosylation (e.g., expressed as the carbohydrate content) may be equal to or greater than zero. In certain embodiments, the degree of glycosylation (e.g., expressed as the carbohydrate content) may range from 0 to 100%. For instance, when the protein in the protein sample is a peptide containing only few amino acids, the carbohydrate content may be up to 100%. A degree of glycosylation (e.g., expressed as the carbohydrate content) may be 100% corresponding to purified glycans, for example after hydrolysis.

The term "phosphorylation" as used herein refers to a phosphate (PO₄³⁻) group attached to a protein such as in a phosphoprotein. Protein phosphorylation is one type of modification, in particular one type of post-translational modification.

The term "degree of phosphorylation" as used herein refers to the presence or absence of phosphorylation, and/or the amount of phosphorylation.

In certain embodiments, the degree of phosphorylation may be the amount of phosphorylation. The amount of phosphorylation is also referred to herein as the phosphate content (e.g., expressed in mass percentage).

The "phosphate content" refers to the fraction of mass corresponding to phosphate in a protein sample divided by the total mass of the protein (including mass of phosphate) in the protein sample. The mass may be expressed in grams. The phosphate content may be expressed in mass percentage.

The phosphate content is related to the total mass of protein. From the spectroscopic point of view, a series of absorption bands related to the phosphate is ratioed to a protein absorption band.

In certain embodiments, the degree of phosphorylation (e.g., expressed as the phosphate content) may be equal to or greater than zero. In certain embodiments, the degree of phosphorylation (e.g., expressed as the phosphate content) may range from 0 to 100%.
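The carbohydrate content and the phosphate content defined above are the same mass-fraction calculation applied to different modifications. A minimal sketch follows. The function name and the example masses are hypothetical, and the range check simply mirrors the 0 to 100% range stated in the text.

```python
def mass_percentage(modification_mass, total_protein_mass):
    """Mass percentage of a modification relative to the total protein mass.

    Both masses must use the same unit (e.g., grams). The total protein mass
    includes the mass of the modification itself, as in the definitions above.
    """
    if total_protein_mass <= 0:
        raise ValueError("total protein mass must be positive")
    if not 0 <= modification_mass <= total_protein_mass:
        raise ValueError("modification mass must lie between 0 and the total mass")
    return 100.0 * modification_mass / total_protein_mass

# Hypothetical example: 0.5 mg of carbohydrate in 2.0 mg of glycoprotein.
carbohydrate_content = mass_percentage(0.5, 2.0)   # 25 % carbohydrate by mass
# Hypothetical example: purified glycans, i.e. the whole mass is carbohydrate.
purified_glycans = mass_percentage(1.0, 1.0)       # 100 %
```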

The total mass of a protein sample may be determined by weighing the total mass of a dry (e.g., freeze-dried) protein sample, in particular of a dry protein sample consisting essentially of, or consisting of, protein. The total mass of a protein sample may be determined by calculating the total mass from the volume and protein concentration of a liquid protein sample. The protein concentration of a liquid protein sample may be determined using the methods as taught herein or using a method as known in the art for determining the concentration of a protein in solution, such as UV spectroscopy involving measuring the absorbance at 280 nm, Bradford assay (e.g., Quick Start™ Bradford Protein Assay, Bio-Rad Laboratories, Inc., CA, USA; or Coomassie (Bradford) Protein Assay Kit, Pierce Biotechnology, IL, USA), BCA assay (Pierce™ BCA Protein Assay Kit, Pierce Biotechnology, IL, USA), or NanoDrop (Desjardins, et al., 2009, J. Vis. Exp., 33, 1610).

The term "monitoring" generally refers to determining the degree of a modification (such as glycosylation, phosphorylation, or one or more types of glycosylation) in a protein sample over time. For instance, monitoring a modification (e.g., glycosylation and phosphorylation) in a protein sample may be performed by determining the degree of the modification (e.g., glycosylation and phosphorylation) in the protein sample at one or more successive time points. For instance, the method of the invention may be performed on the protein sample a plurality of times, each method performed at a different interval in time, for instance, at an interval of one or more days, weeks, months or years.

The terms "peptide", "polypeptide" or "protein" are interchangeably used herein and relate to any natural, synthetic or recombinant molecule comprising amino acids joined together by peptide bonds between adjacent amino acid residues. A "peptide bond", "peptide link" or "amide bond" is a covalent bond formed between two amino acids when the carboxyl group of one amino acid reacts with the amino group of the other amino acid, thereby releasing a molecule of water. The protein can be from any source, e.g. a naturally occurring protein, a chemically synthesized protein, a protein produced by recombinant molecular genetic techniques or a peptide from a cell or translation system. The protein may be a linear chain or may be folded into a globular form. Furthermore, it is not intended that a protein be limited by possessing or not possessing any particular biological activity.

The term "a protein" in the recitation "a protein in a protein sample" refers to the totality of protein in a protein sample. The (totality of) protein may consist of one or more protein species, such as one protein species, or two or more protein species, e.g., three, four, five, six, seven, eight, or more protein species.

In certain embodiments, the protein in a protein sample may be a mixture of proteins. In certain embodiments, the protein in a protein sample may be known. Generally, in therapeutic protein development and in QA in protein manufacturing, the one or more protein species in the protein sample are known.

In certain embodiments, the protein sample may be solid or liquid. The protein may be dissolved in water or an aqueous solution such as a buffer solution.

In the methods as taught herein, the protein sample comprises, consists essentially of, or consists of one or more protein species, such as one protein species, or two or more protein species, e.g., three, four, five, six, seven, eight, or more protein species.

The protein sample may further comprise one or more additional components such as one or more components for stabilizing the protein. The one or more additional components generally should not give rise to IR spectral signals that interfere with IR spectral signals attributable to the protein sample. For instance, the protein sample may further comprise a buffer solution. The buffer solution may be any buffer solution known in the art such as bicarbonate buffer, citrate buffer, acetate buffer, and borate buffer. Preferably, the additional component, such as the buffer solution, is not a phosphate-containing component. In certain embodiments, the additional component is not a phosphate-containing buffer solution. It will be understood that such a phosphate-containing component, such as a phosphate-containing buffer, may interfere with the determination of the degree of phosphorylation in the present method.

In certain embodiments, the protein sample may further comprise one or more additional components such as detergents (e.g., polyoxyethylene derivatives of fatty acids, partial esters of sorbitol anhydrides, for example, those products known commercially as "Tween 80" or "polysorbate 80" (polyoxyethylene (20) sorbitan monooleate), "Tween 20" or "polysorbate 20" (polyoxyethylene (20) sorbitan monolaurate), and nonionic oil soluble water detergents such as that sold commercially under the trademark "Triton X 100" (oxyethylated alkylphenol, octyl phenol ethoxylate)), cryoprotectants (e.g., glycerol or ethylene glycol), anti-microbial agents (e.g., sodium azide or thimerosal), metal chelators, reducing agents (e.g., dithiothreitol (DTT) or 2-mercaptoethanol (2-ME)), and residual components (e.g., nucleic acids or lipids), for instance resulting from protein purification. It is to be understood that additional components can interfere with the present method, for instance depending on the quantity and/or the absorbance of the additional components (e.g., if the additional components absorb in a region similar to the absorbance of glycosylation and/or phosphorylation). Accordingly, in certain embodiments, the methods as taught herein may comprise a purification or dialysis step prior to the measurement in order to remove the additional components.

The term "Fourier transform infrared spectroscopy" or "FTIR" generally refers to a technique which is used to obtain an infrared spectrum of absorption, emission, photoconductivity, or Raman scattering of a solid, liquid, or gas. The term Fourier transform infrared spectroscopy originates from the fact that a Fourier transform (i.e., a mathematical process) is required to convert the raw data into the actual spectrum.

For a given sample which may be solid, liquid, or gaseous, the method or technique of (Fourier transform) infrared spectroscopy uses an instrument called a (Fourier transform) infrared spectrometer or spectrophotometer to produce an infrared spectrum. The term "infrared" refers to electromagnetic radiation with wavelengths from 700 nm to 1 mm.

The term "infrared spectrum" or "IR spectrum" is as generally known in the art and essentially refers to a graph of infrared light absorbance (or transmittance) on the vertical axis versus frequency or wavelength on the horizontal axis. Typical units of frequency used in IR spectra are reciprocal centimeters (or wave numbers), with the symbol cm⁻¹.

In certain embodiments, the frequency may be from 4000 to 400 cm⁻¹ (corresponding to a wavelength of from 2.5 to 25 μm). In certain embodiments, the frequency may be from 4000 to 600 cm⁻¹.
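
By way of illustration only, the correspondence between wavenumber and wavelength stated above follows from the relation wavelength (μm) = 10,000 / wavenumber (cm⁻¹). A minimal Python sketch is given below; the function name is hypothetical and is not part of the methods as taught herein.

```python
def wavenumber_to_wavelength_um(wavenumber_cm1: float) -> float:
    """Convert a wavenumber in cm^-1 to a wavelength in micrometers."""
    if wavenumber_cm1 <= 0:
        raise ValueError("wavenumber must be positive")
    # 1 cm = 10,000 micrometers, so wavelength (um) = 10,000 / wavenumber (cm^-1)
    return 1e4 / wavenumber_cm1


# Limits of the spectral ranges mentioned in the text
for nu in (4000.0, 1500.0, 800.0, 600.0, 400.0):
    print(f"{nu:6.0f} cm^-1 -> {wavenumber_to_wavelength_um(nu):6.2f} um")
```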

In certain embodiments, the frequency may be from 1500 to 800 cm⁻¹. This spectral region allows the degree of glycosylation and the degree of phosphorylation to be determined.

In certain embodiments of the methods, as taught herein, the FTIR spectroscopy may be performed in attenuated total reflectance (ATR) mode, in transmission mode, with an imaging system, or with a high-throughput screening (HTS) system.

In certain preferred embodiments, the FTIR spectroscopy may be performed in ATR mode. Generally, for ATR FTIR spectroscopy, a liquid protein sample may be deposited on a crystal and dried. Subsequently, FTIR spectroscopy may be performed on the dried protein sample (i.e., solid protein sample).

The term "Attenuated total reflection (ATR)" refers to a measuring mode of infrared spectroscopy in which samples can be examined directly in the solid or liquid state without laborious sample preparation. ATR is based on the absorption of a probing light beam which undergoes total internal reflection at the interface of a high refractive index medium and a sample under test with a lower refractive index.

Total internal reflection occurs when the angle of incidence of a light beam on an interface is bigger than a critical angle, which is determined by the refractive index difference between the high refractive index medium and the sample under test. Upon undergoing total internal reflection, an evanescent wave forms which extends from the surface of the high refractive index medium into the sample under test for a distance which depends on the incident light's wavelength, and which is commonly of the order of a few micrometers. Absorption of the evanescent wave attenuates the reflected light, and the spectroscopically resolved attenuation of light can be used to analyze the sample under test in a manner which is similar to the analysis of conventional infrared spectroscopy spectra. Typically, a sensing infrared light beam undergoes multiple total internal reflections at the sensing crystal-sample under test interface for enhanced sensitivity. Because the sensing light beam does not travel through the sample under test, ATR is a technique which is suitable for gathering spectroscopically resolved absorption data from highly absorbing samples.
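
For illustration, the critical angle and the penetration depth of the evanescent wave can be estimated with the standard textbook relations θc = arcsin(n2/n1) and dp = λ / (2π n1 √(sin²θ − (n2/n1)²)), where n1 is the refractive index of the crystal, n2 that of the sample, and θ the angle of incidence. The Python sketch below uses assumed, illustrative values (n1 = 2.4, n2 = 1.5, θ = 45°) which are not taken from the present description; the function names are hypothetical.

```python
import math


def critical_angle_deg(n_crystal: float, n_sample: float) -> float:
    """Critical angle (degrees) for total internal reflection."""
    if n_sample >= n_crystal:
        raise ValueError("the crystal must have the higher refractive index")
    return math.degrees(math.asin(n_sample / n_crystal))


def penetration_depth_um(wavenumber_cm1: float, n_crystal: float,
                         n_sample: float, angle_deg: float) -> float:
    """Penetration depth (micrometers) of the evanescent wave."""
    wavelength_um = 1e4 / wavenumber_cm1
    term = math.sin(math.radians(angle_deg)) ** 2 - (n_sample / n_crystal) ** 2
    if term <= 0:
        raise ValueError("angle of incidence is not above the critical angle")
    return wavelength_um / (2 * math.pi * n_crystal * math.sqrt(term))


# Assumed illustrative values, not taken from the description
N_CRYSTAL, N_SAMPLE, ANGLE = 2.4, 1.5, 45.0

print(f"critical angle: {critical_angle_deg(N_CRYSTAL, N_SAMPLE):.1f} degrees")
for nu in (1500.0, 1000.0, 800.0):
    depth = penetration_depth_um(nu, N_CRYSTAL, N_SAMPLE, ANGLE)
    print(f"{nu:6.0f} cm^-1 -> penetration depth {depth:.2f} um")
```

Under these assumptions the penetration depth grows with wavelength, which is consistent with the wavelength dependence described above.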

In certain preferred embodiments, the FTIR spectroscopy may be performed in transmission mode. Generally, for transmission FTIR spectroscopy, a measurement cell (e.g., transparent measurement cell) may be filled with a liquid protein sample, and FTIR spectroscopy may be performed on the liquid protein sample.

Methods according to the present invention may also be used in conjunction with high-throughput screening (HTS). HTS allows testing hundreds of thousands of samples per day by means of microtiter plates, automation techniques, and advanced metrology.

Methods according to the present invention may also be used in conjunction with an imaging system. An imaging system may be used to analyze proteins. Briefly, spots of protein samples are dried on a plate transparent to infrared radiation (e.g. CaF₂, BaF₂) for transmission measurement or on a plate compatible with reflection measurements (e.g. MirrIR slides from Kevley Technologies, http://www.kevley.com/). Advantageously, an imaging system allows hundreds of proteins to be analyzed simultaneously.

In certain embodiments, the water vapour contribution (with a reference peak at 1956-1935 cm⁻¹) may be subtracted from the IR spectrum.
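
One possible way to carry out such a subtraction is sketched below in Python. It assumes that a separately recorded water vapour reference spectrum is available on the same wavenumber grid, and that the scaling factor is taken as the ratio of the areas of the 1956-1935 cm⁻¹ band above a straight line joining its end points. Both the scaling rule and the function names are assumptions made for illustration; the spectra are synthetic.

```python
import numpy as np


def band_area(wavenumbers, absorbance, low=1935.0, high=1956.0):
    """Area of the band between low and high above a line joining its end points."""
    x = np.asarray(wavenumbers, dtype=float)
    y = np.asarray(absorbance, dtype=float)
    mask = (x >= low) & (x <= high)
    order = np.argsort(x[mask])
    xs, ys = x[mask][order], y[mask][order]
    line = np.interp(xs, [xs[0], xs[-1]], [ys[0], ys[-1]])
    d = ys - line
    # Trapezoidal integration written out explicitly
    return float(np.sum(0.5 * (d[1:] + d[:-1]) * np.diff(xs)))


def subtract_water_vapour(wavenumbers, sample, vapour_reference):
    """Subtract a scaled vapour reference spectrum; returns (corrected, factor)."""
    factor = band_area(wavenumbers, sample) / band_area(wavenumbers, vapour_reference)
    corrected = np.asarray(sample, dtype=float) - factor * np.asarray(vapour_reference, dtype=float)
    return corrected, factor


# Synthetic data: one broad "protein" band plus a scaled "vapour" band
wavenumbers = np.arange(4000.0, 399.0, -1.0)
vapour = np.exp(-0.5 * ((wavenumbers - 1945.0) / 3.0) ** 2)
protein = np.exp(-0.5 * ((wavenumbers - 1650.0) / 30.0) ** 2)
sample = protein + 0.3 * vapour

corrected, factor = subtract_water_vapour(wavenumbers, sample, vapour)
print(f"estimated vapour scaling factor: {factor:.3f}")
```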

In certain embodiments, the IR spectrum may be baseline corrected. Such a baseline correction may involve interpolating straight lines between spectral points. These spectral points may be chosen arbitrarily in low-absorbance areas. Alternatively, baseline correction may involve working with the second derivative of the spectra. Baseline shifts may be observed in FTIR spectra and are not representative of the protein sample. For example, in ATR mode, a baseline shift may be due to small variations of the crystal position. Baseline correction corrects such shifts in order to remove artificial drifts and to obtain comparable spectra.

The various baseline correction methods (i.e., choosing different baseline points or the second derivative) are equally efficient at revealing the differences existing between series of spectra (Gaigneaux et al., 2006, Applied spectroscopy, 60 (9), 1022-1029).
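
The two baseline correction approaches described above may, for instance, be sketched as follows in Python. This is a minimal illustration on synthetic data; the function names and the choice of baseline points (800 and 1500 cm⁻¹) are assumptions and not values prescribed by the present description.

```python
import numpy as np


def baseline_correct(wavenumbers, absorbance, baseline_points):
    """Subtract straight lines interpolated between the chosen baseline points."""
    x = np.asarray(wavenumbers, dtype=float)
    y = np.asarray(absorbance, dtype=float)
    order = np.argsort(x)
    points = np.sort(np.asarray(baseline_points, dtype=float))
    # Absorbance of the spectrum at each chosen baseline point
    point_values = np.interp(points, x[order], y[order])
    # Piecewise-linear baseline; held constant outside the outermost points
    baseline = np.interp(x, points, point_values)
    return y - baseline


def second_derivative(wavenumbers, absorbance):
    """Second derivative of the spectrum with respect to wavenumber."""
    x = np.asarray(wavenumbers, dtype=float)
    y = np.asarray(absorbance, dtype=float)
    return np.gradient(np.gradient(y, x), x)


# Synthetic spectrum: one band on top of a linear drift
wavenumbers = np.linspace(800.0, 1500.0, 701)
peak = np.exp(-0.5 * ((wavenumbers - 1100.0) / 20.0) ** 2)
drift = 0.0002 * wavenumbers + 0.1
spectrum = peak + drift

corrected = baseline_correct(wavenumbers, spectrum, baseline_points=(800.0, 1500.0))
print("largest residual drift:", float(np.max(np.abs(corrected - peak))))
```

A linear drift has a zero second derivative, which is why the second-derivative approach removes it without any baseline points having to be chosen.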

In certain embodiments of the methods as taught herein, comparison and statistical analysis may be realized with spectra which have undergone the same preprocessing and/or baseline correction.

In certain embodiments, the water vapour contribution may be subtracted. In certain embodiments, a normalization step may be realized (e.g., at a specific absorption) on the IR spectrum. Such a normalization step may be performed by scaling methods known in the art such as scaling the spectra on intensity (e.g., each spectrum may be multiplied by the factor needed to obtain the same value at the selected wavenumber), scaling the spectra on area (e.g., a straight line may be interpolated between the two wavenumbers limiting the area and the surface above the line may be calculated; each spectrum may be multiplied by the factor needed to equalize this surface for all the spectra), and vector normalization (e.g., the sum of all the absorbance may be made constant over a given spectral region).

In certain embodiments, a normalization step may be realized on the IR spectrum by one or more methods selected from scaling the IR spectra on intensity, scaling the IR spectra on area, and vector normalization. In certain embodiments, a normalization step may be realized on the IR spectrum by scaling the IR spectra on area.
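
The three scaling methods mentioned above may be sketched as follows in Python. This is a minimal illustration on synthetic spectra; the function names, the selected wavenumber (1100 cm⁻¹) and the area limits (1000 to 1200 cm⁻¹) are assumptions. The third function follows the description given above, in which the sum of the absorbance over a region is made constant.

```python
import numpy as np


def scale_on_intensity(wavenumbers, spectrum, at, target=1.0):
    """Scale so that the absorbance at the selected wavenumber equals target."""
    idx = int(np.argmin(np.abs(wavenumbers - at)))
    return spectrum * (target / spectrum[idx])


def scale_on_area(wavenumbers, spectrum, low, high, target=1.0):
    """Scale so that the area above a line joining the two limits equals target."""
    mask = (wavenumbers >= low) & (wavenumbers <= high)
    order = np.argsort(wavenumbers[mask])
    x, y = wavenumbers[mask][order], spectrum[mask][order]
    d = y - np.interp(x, [x[0], x[-1]], [y[0], y[-1]])
    area = np.sum(0.5 * (d[1:] + d[:-1]) * np.diff(x))
    return spectrum * (target / area)


def scale_on_sum(wavenumbers, spectrum, low, high, target=1.0):
    """Scale so that the sum of the absorbance over the region equals target."""
    mask = (wavenumbers >= low) & (wavenumbers <= high)
    return spectrum * (target / spectrum[mask].sum())


# Two synthetic spectra with the same shape but different overall intensity
wavenumbers = np.linspace(800.0, 1500.0, 701)
shape = 0.05 + np.exp(-0.5 * ((wavenumbers - 1100.0) / 30.0) ** 2)
spectrum_a, spectrum_b = 1.0 * shape, 2.5 * shape

for name, normalize in (
    ("intensity", lambda s: scale_on_intensity(wavenumbers, s, at=1100.0)),
    ("area", lambda s: scale_on_area(wavenumbers, s, 1000.0, 1200.0)),
    ("sum", lambda s: scale_on_sum(wavenumbers, s, 800.0, 1500.0)),
):
    same = np.allclose(normalize(spectrum_a), normalize(spectrum_b))
    print(f"{name}: spectra coincide after normalization = {same}")
```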

In certain embodiments of the methods, as taught herein, step (b) may be performed by calculating from the IR spectrum the degree of glycosylation of the protein using a first mathematical model configured to determine the degree of glycosylation of a protein, and calculating from the IR spectrum the degree of phosphorylation of the protein using a second mathematical model configured to determine the degree of phosphorylation of a protein.

Accordingly, in certain embodiments, the step of determining from the same IR spectrum the degree of glycosylation of the protein and the degree of phosphorylation of the protein in the protein sample may be performed by calculating from the IR spectrum the degree of glycosylation of the protein using a first mathematical model configured to determine the degree of glycosylation of a protein, and calculating from the IR spectrum the degree of phosphorylation of the protein using a second mathematical model configured to determine the degree of phosphorylation of a protein.

The term "mathematical model" or "quantitative model" generally refers to a description of a system using mathematical concepts and language.

The recitation "first mathematical model configured to determine the degree of glycosylation of a protein", as used herein, refers to a description of the glycosylation of a protein using mathematical equations.

The recitation "second mathematical model configured to determine the degree of phosphorylation of a protein", as used herein, refers to a description of the phosphorylation of a protein using mathematical equations.

In certain embodiments of the methods, as taught herein, the first mathematical model may be prepared from standard IR spectra obtained by FTIR spectroscopy of a first set of reference protein samples using a statistical tool, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein at least one, preferably each, reference protein sample has a degree of glycosylation which is different from the degree of glycosylation of at least one, preferably each, other reference protein sample in the first set, and wherein the second mathematical model is prepared from standard IR spectra obtained by FTIR spectroscopy of a second set of reference protein samples using a statistical tool, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein at least one, preferably each, reference protein sample has a degree of phosphorylation which is different from the degree of phosphorylation of at least one, preferably each, other reference protein sample in the second set.

In certain embodiments of the methods, as taught herein, the first mathematical model may be prepared from standard IR spectra obtained by FTIR spectroscopy of a first set of reference protein samples using a statistical tool, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein at least one reference protein sample has a degree of glycosylation which is different from the degree of glycosylation of at least one other reference protein sample in the first set, and wherein the second mathematical model is prepared from standard IR spectra obtained by FTIR spectroscopy of a second set of reference protein samples using a statistical tool, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein at least one reference protein sample has a degree of phosphorylation which is different from the degree of phosphorylation of at least one other reference protein sample in the second set.

In certain embodiments of the methods, as taught herein, the first mathematical model may be prepared from standard IR spectra obtained by FTIR spectroscopy of a first set of reference protein samples using a statistical tool, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein each reference protein sample has a degree of glycosylation which is different from the degree of glycosylation of each other reference protein sample in the first set, and wherein the second mathematical model is prepared from standard IR spectra obtained by FTIR spectroscopy of a second set of reference protein samples using a statistical tool, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein each reference protein sample has a degree of phosphorylation which is different from the degree of phosphorylation of each other reference protein sample in the second set.

The term "standard IR spectra", as used herein, refers to IR spectra obtained by FTIR spectroscopy of a set of reference protein samples.

The terms "reference protein samples", "standard protein samples", or "standards" may be used interchangeably herein.

In certain embodiments, the mathematical models as defined herein may be based on the full IR spectrum (e.g., 4000 to 400 cm⁻¹). In certain embodiments, the mathematical models may be based on a spectral region (e.g., 4000 to 400 cm⁻¹, or 1500 to 800 cm⁻¹). In certain embodiments, the mathematical models may also be based on the spectral region of 1500 to 800 cm⁻¹. Using the full IR spectrum or a spectral region (e.g., 1500 to 800 cm⁻¹) instead of the absorbance at one or two specific wavenumbers advantageously yields mathematical models with very good predictive performance (and low error values).
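
Purely as an illustration of how two mathematical models may be prepared from standard IR spectra and then applied to one and the same IR spectrum, the Python sketch below fits two ridge-regularized linear regressions over the 1500 to 800 cm⁻¹ region. Ridge regression is used here only as a simple stand-in for the statistical tool, which is not specified in this passage. The spectra are synthetic, the band positions and intensities are arbitrary, a single set of reference samples with both degrees known is used for both models, and all names are hypothetical.

```python
import numpy as np


def fit_model(spectra, known_degrees, ridge=1e-6):
    """Fit a ridge-regularized linear model mapping spectra to a known degree."""
    X = np.asarray(spectra, dtype=float)
    y = np.asarray(known_degrees, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Dual form of ridge regression: practical when there are far fewer
    # reference samples than spectral points.
    gram = Xc @ Xc.T
    alpha = np.linalg.solve(gram + ridge * np.eye(len(y)), yc)
    return Xc.T @ alpha, x_mean, y_mean


def predict(model, spectrum):
    """Apply a fitted model to one spectrum."""
    weights, x_mean, y_mean = model
    return float((np.asarray(spectrum, dtype=float) - x_mean) @ weights + y_mean)


# Synthetic spectra over 1500 to 800 cm^-1 (arbitrary band positions)
wavenumbers = np.arange(800.0, 1501.0, 2.0)


def band(center, width):
    return np.exp(-0.5 * ((wavenumbers - center) / width) ** 2)


def synthetic_spectrum(glyco_pct, phos_pct):
    return (band(1400.0, 40.0)
            + 0.01 * glyco_pct * band(1050.0, 25.0)
            + 0.05 * phos_pct * band(950.0, 25.0))


# Reference samples with known degrees of glycosylation and phosphorylation
glyco_known = [0.0, 5.0, 10.0, 20.0, 30.0]
phos_known = [2.0, 0.0, 4.0, 1.0, 3.0]
reference_spectra = [synthetic_spectrum(g, p) for g, p in zip(glyco_known, phos_known)]

glyco_model = fit_model(reference_spectra, glyco_known)  # first mathematical model
phos_model = fit_model(reference_spectra, phos_known)    # second mathematical model

# Both degrees are determined from the same spectrum of the sample to be tested
test_spectrum = synthetic_spectrum(15.0, 2.5)
print("glycosylation:", round(predict(glyco_model, test_spectrum), 2))
print("phosphorylation:", round(predict(phos_model, test_spectrum), 2))
```

Because each model uses the whole spectral region rather than the absorbance at one or two wavenumbers, the contribution of the other modification is accounted for by the fitted weights in this noise-free example.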

In certain embodiments, a reference protein sample may comprise, consist essentially of, or consist of one or more protein species, such as one protein species, or two or more protein species, e.g., three, four, five, six, seven, eight, or more protein species. The reference protein sample may further comprise one or more additional components such as one or more components for stabilizing the protein. For instance, the protein sample may further comprise a buffer solution. The one or more additional components generally should not give rise to IR spectral signals that interfere with IR spectral signals attributable to the protein sample.

In certain embodiments, a reference protein sample may further comprise one or more additional components such as detergents (e.g., polyoxyethylene derivatives of fatty acids, partial esters of sorbitol anhydrides, for example, those products known commercially as "Tween 80" or "polysorbate 80" (polyoxyethylene (20) sorbitan monooleate), "Tween 20" or "polysorbate 20" (polyoxyethylene (20) sorbitan monolaurate), and nonionic oil soluble water detergents such as that sold commercially under the trademark "Triton X 100" (oxyethylated alkylphenol, octyl phenol ethoxylate)), cryoprotectants (e.g., glycerol or ethylene glycol), anti-microbial agents (e.g., sodium azide or thimerosal), metal chelators, reducing agents (e.g., dithiothreitol (DTT) or 2-mercaptoethanol (2-ME)), and residual components (e.g., nucleic acids or lipids).

In certain embodiments of the methods as taught herein, a reference protein sample and a protein sample (to be tested) may comprise, consist essentially of, or consist of the same or substantially the same protein species. For instance, in therapeutic protein development and in QA in protein manufacturing, the protein species in the reference protein samples used to build the mathematical models are the same as or substantially the same as the protein species in the protein samples which are to be analysed. The protein species in a reference protein sample and a protein sample (to be tested) may differ in the degree of glycosylation of the protein, the degree of phosphorylation of the protein, and/or the protein concentration.

In certain embodiments, at least one protein species in a reference protein sample may be the same or substantially the same as a protein species in the protein sample to be tested. In certain embodiments, each protein species in a reference protein sample may be the same or substantially the same as a protein species in the protein sample to be tested. If the reference protein samples and the protein samples (to be tested) comprise the same or substantially the same protein species, the mathematical model advantageously allows the carbohydrate content and phosphate content in the protein sample to be quantified more accurately.

As used herein, a first protein species (e.g., in a reference protein sample) is said to be substantially the same as a second protein species (e.g., in a protein sample to be tested) when the protein sequence (i.e., amino acid sequence) of the first protein species is substantially identical (i.e., largely but not wholly identical) to the protein sequence (i.e., amino acid sequence) of the second protein species, e.g., when the protein sequence (i.e., amino acid sequence) of the first protein species is at least about 30% identical to the protein sequence (i.e., amino acid sequence) of the second protein species. For instance, a first protein species (e.g., in a reference protein sample) is said to be substantially the same as a second protein species (e.g., in a protein sample to be tested) when the protein sequence (i.e., amino acid sequence) of the first protein species is at least about 35% identical, at least about 40% identical, at least about 45% identical, at least about 50% identical, at least about 55% identical, at least about 60% identical, at least about 65% identical, at least about 70% identical, or at least about 75% identical to the protein sequence (i.e., amino acid sequence) of the second protein species. 
For instance, a first protein species (e.g., in a reference protein sample) is said to be substantially the same as a second protein species (e.g., in a protein sample to be tested) when the protein sequence (i.e., amino acid sequence) of the first protein species is at least about 80% identical or at least about 85% identical to the protein sequence (i.e., amino acid sequence) of the second protein species, e.g., preferably when the protein sequence (i.e., amino acid sequence) of the first protein species is at least about 90% identical, e.g., at least about 91% identical, at least about 92% identical, at least about 93% identical, at least about 94% identical, at least about 95% identical, at least about 96% identical, at least about 97% identical, at least about 98% identical, or at least about 99% identical to the protein sequence (i.e., amino acid sequence) of the second protein species.

Preferably, a first protein species may display such degrees of identity to a second protein species when the whole sequences of the protein species are queried in the sequence alignment (i.e., overall sequence identity).

Sequence identity may be determined using suitable algorithms for performing sequence alignments and determination of sequence identity as known per se. Exemplary but non-limiting algorithms include those based on the Basic Local Alignment Search Tool (BLAST) originally described by Altschul et al. 1990 (J Mol Biol 215: 403-10), such as the "Blast 2 sequences" algorithm described by Tatusova and Madden 1999 (FEMS Microbiol Lett 174: 247-250), for example using the published default settings or other suitable settings (such as, e.g., for the BLASTN algorithm: cost to open a gap = 5, cost to extend a gap = 2, penalty for a mismatch = -2, reward for a match = 1, gap x_dropoff = 50, expectation value = 10.0, word size = 28; or for the BLASTP algorithm: matrix = Blosum62, cost to open a gap = 11, cost to extend a gap = 1, expectation value = 10.0, word size = 3).
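
Once an alignment has been obtained with such an algorithm, the percentage of identity can be computed from the aligned sequences. The Python sketch below is a simplified illustration and is not an implementation of BLAST. It assumes that the two sequences are already aligned, with "-" marking gaps, and it adopts one possible convention, namely identical positions divided by the number of aligned columns. The function name and the example sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns in which both sequences carry a gap
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("alignment contains no residues")
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


# Hypothetical aligned fragments: one gap and one mismatch over nine columns
identity = percent_identity("MKTA-YIAK", "MKTAGYIAR")
print(f"{identity:.1f}% identical")
```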

In certain embodiments, a reference protein sample and a protein sample (to be tested) may comprise the same additional components such as the same buffer solution. Advantageously, if the additional component is present in the reference protein samples and in the protein samples (to be tested), the mathematical model takes its presence into account and allows the carbohydrate content and phosphate content in the protein sample to be quantified more accurately. For instance, if the additional component is present in the reference protein samples and in the protein samples (to be tested) at the same concentration, the mathematical model takes this into account and allows the carbohydrate content and phosphate content in the protein sample to be quantified more accurately.

In certain embodiments, the known degree of glycosylation (e.g., expressed as the carbohydrate content) is equal to or greater than zero. In certain embodiments, the known degree of glycosylation (e.g., expressed as the carbohydrate content) may range from 0 to 100%. In certain embodiments, the known degree of phosphorylation (e.g., expressed as the phosphate content) is equal to or greater than zero. In certain embodiments, the known degree of phosphorylation (e.g., expressed as the phosphate content) may range from 0 to 100%.

In certain embodiments, the first set of reference protein samples may comprise at least one reference protein sample having a degree of glycosylation which is zero and which is different from the degree of glycosylation of at least one other reference protein sample in the first set.

In certain embodiments, the first set of reference protein samples may comprise at least one reference protein sample having a degree of glycosylation which is zero and which is different from the degree of glycosylation of each other reference protein sample in the first set.

In certain embodiments, the first set of reference protein samples may comprise at least two reference protein samples. For example, the first set of reference protein samples may consist of two reference protein samples, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein one reference protein sample has a degree of glycosylation which is different from the degree of glycosylation of the other reference protein sample in the first set. Such a first set of reference protein samples may be sufficient to prepare the first mathematical model.

In certain embodiments, the first set of reference protein samples may comprise at least two reference protein samples, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein at least one reference protein sample has a degree of glycosylation which is different from the degree of glycosylation of at least one other reference protein sample in the first set.

In certain embodiments, the first set of reference protein samples may comprise at least two reference protein samples, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein at least one reference protein sample has a degree of glycosylation which is different from the degree of glycosylation of each other reference protein sample in the first set.

In certain embodiments, the first set of reference protein samples may comprise at least two reference protein samples, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein at least one reference protein sample has a degree of glycosylation which is zero and which is different from the degree of glycosylation of at least one other reference protein sample in the first set.

In certain embodiments, the first set of reference protein samples may comprise at least two reference protein samples, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein at least one reference protein sample has a degree of glycosylation which is zero and which is different from the degree of glycosylation of each other reference protein sample in the first set.

In certain embodiments, the first set of reference protein samples may comprise at least three reference protein samples. For example, the first set of reference protein samples may comprise at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred reference protein samples. A set of reference protein samples comprising at least three reference protein samples allows better performance of the first mathematical model.

In certain embodiments, the first set of reference protein samples may comprise at least three, such as at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred, reference protein samples, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein at least one reference protein sample has a degree of glycosylation which is different from the degree of glycosylation of at least one other reference protein sample in the first set.

In certain embodiments, the first set of reference protein samples may comprise at least three, such as at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred, reference protein samples, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein each reference protein sample has a degree of glycosylation which is different from the degree of glycosylation of each other reference protein sample in the first set.

In certain embodiments, the first set of reference protein samples may comprise at least three, such as at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred, reference protein samples, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein at least one reference protein sample has a degree of glycosylation which is zero and which is different from the degree of glycosylation of at least one other reference protein sample in the first set.

In certain embodiments, the first set of reference protein samples may comprise at least three, such as at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred, reference protein samples, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein at least one reference protein sample has a degree of glycosylation which is zero and which is different from the degree of glycosylation of each other reference protein sample in the first set.

In certain embodiments, the second set of reference protein samples may comprise at least one reference protein sample having a degree of phosphorylation which is zero and which is different from the degree of phosphorylation of at least one other reference protein sample in the second set.

In certain embodiments, the second set of reference protein samples may comprise at least one reference protein sample having a degree of phosphorylation which is zero and which is different from the degree of phosphorylation of each other reference protein sample in the second set.

In certain embodiments, the second set of reference protein samples may comprise at least two reference protein samples. For example, the second set of reference protein samples may consist of two reference protein samples, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein one reference protein sample has a degree of phosphorylation which is different from the degree of phosphorylation of the other reference protein sample in the second set. Such a second set of reference protein samples may be sufficient to prepare the second mathematical model.

In certain embodiments, the second set of reference protein samples may comprise at least two reference protein samples, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein at least one reference protein sample has a degree of phosphorylation which is different from the degree of phosphorylation of at least one other reference protein sample in the second set.

In certain embodiments, the second set of reference protein samples may comprise at least two reference protein samples, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein at least one reference protein sample has a degree of phosphorylation which is different from the degree of phosphorylation of each other reference protein sample in the second set.

In certain embodiments, the second set of reference protein samples may comprise at least two reference protein samples, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein at least one reference protein sample has a degree of phosphorylation which is zero and which is different from the degree of phosphorylation of at least one other reference protein sample in the second set.

In certain embodiments, the second set of reference protein samples may comprise at least two reference protein samples, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein at least one reference protein sample has a degree of phosphorylation which is zero and which is different from the degree of phosphorylation of each other reference protein sample in the second set.

In certain embodiments, the second set of reference protein samples may comprise at least three reference protein samples. For example, the second set of reference protein samples may comprise at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred reference protein samples. A set of reference protein samples comprising at least three reference protein samples allows better performance of the second mathematical model.

In certain embodiments, the second set of reference protein samples may comprise at least three, such as at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred, reference protein samples, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein at least one reference protein sample has a degree of phosphorylation which is different from the degree of phosphorylation of at least one other reference protein sample in the second set.

In certain embodiments, the second set of reference protein samples may comprise at least three, such as at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred, reference protein samples, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein each reference protein sample has a degree of phosphorylation which is different from the degree of phosphorylation of each other reference protein sample in the second set.

In certain embodiments, the second set of reference protein samples may comprise at least three, such as at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred, reference protein samples, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein at least one reference protein sample has a degree of phosphorylation which is zero and which is different from the degree of phosphorylation of at least one other reference protein sample in the second set.

In certain embodiments, the second set of reference protein samples may comprise at least three, such as at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred, reference protein samples, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein at least one reference protein sample has a degree of phosphorylation which is zero and which is different from the degree of phosphorylation of each other reference protein sample in the second set.

In certain embodiments, the methods, as taught herein, may further comprise the step of determining from the (same) IR spectrum the concentration of the protein in the protein sample. As exemplified herein, the inventors have found that the degree of glycosylation, the degree of phosphorylation, and the protein concentration of a protein sample can all be determined from the same IR spectrum obtained by FTIR spectroscopy. The degree of phosphorylation, the degree of glycosylation, and the protein concentration may be determined simultaneously or essentially simultaneously. The term "protein concentration" or "concentration of a protein" refers to the abundance of a protein in a protein sample divided by the total volume of the protein sample.

In certain embodiments, the protein concentration may be a mass protein concentration.

The term "mass protein concentration" refers to the mass of a protein in a protein sample divided by the volume of the protein sample. The mass protein concentration may be expressed in g/l (equal to the SI unit kg/m³).

In certain embodiments, the protein concentration (e.g., expressed as the mass protein concentration) may be equal to or greater than zero.

In certain embodiments of the methods, as taught herein, the protein concentration of the protein may be determined by calculating the concentration of the protein from the IR spectrum using a third mathematical model configured to determine a protein concentration.

The recitation "third mathematical model configured to determine a protein concentration", as used herein, refers to a description of the protein concentration using mathematical equations.

In certain embodiments of the methods, as taught herein, the third mathematical model may be prepared from standard IR spectra obtained by FTIR spectroscopy of a third set of reference protein samples using a statistical tool, wherein each reference protein sample has a known protein concentration and wherein at least one, preferably each, reference protein sample has a protein concentration which is different from the protein concentration of at least one, preferably each, other reference protein sample in the third set.

In certain embodiments, the third mathematical model may be prepared from standard IR spectra obtained by FTIR spectroscopy of a third set of reference protein samples using a statistical tool, wherein each reference protein sample has a known protein concentration and wherein at least one reference protein sample has a protein concentration which is different from the protein concentration of at least one other reference protein sample in the third set.

In certain embodiments, the third mathematical model may be prepared from standard IR spectra obtained by FTIR spectroscopy of a third set of reference protein samples using a statistical tool, wherein each reference protein sample has a known protein concentration and wherein at least one, preferably each, reference protein sample has a protein concentration which is different from the protein concentration of at least one, preferably each, other reference protein sample in the third set.

In certain embodiments, the known protein concentration (e.g., expressed as the mass protein concentration) may be equal to or greater than zero.

In certain embodiments, the third set of reference protein samples may comprise at least one reference protein sample having a protein concentration which is zero and which is different from the protein concentration of at least one other reference protein sample in the third set.

In certain embodiments, the third set of reference protein samples may comprise at least one reference protein sample having a protein concentration which is zero and which is different from the protein concentration of each other reference protein sample in the third set.

In certain embodiments, the third set of reference protein samples may comprise at least two reference protein samples. For example, the third set of reference protein samples may consist of two reference protein samples, wherein each reference protein sample has a known protein concentration, and wherein one reference protein sample has a protein concentration which is different from the protein concentration of the other reference protein sample in the third set. Such a third set of reference protein samples may be sufficient to prepare the third mathematical model.
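By way of illustration only, the following minimal sketch shows why two reference protein samples having different known protein concentrations can be sufficient to fix a calibration: two points determine a straight line. The sketch assumes, purely for illustration, that the IR spectrum has been reduced to a single scalar feature (for example an integrated band area) which responds linearly to the protein concentration. The function names and numerical values are hypothetical and are not taken from the present disclosure.

```python
def two_point_calibration(feature_a, conc_a, feature_b, conc_b):
    """Return (slope, intercept) of the line through two reference points."""
    if feature_a == feature_b:
        raise ValueError("reference samples must give different feature values")
    slope = (conc_b - conc_a) / (feature_b - feature_a)
    intercept = conc_a - slope * feature_a
    return slope, intercept


def predict_concentration(feature, slope, intercept):
    """Apply the calibration line to the feature of a sample to be tested."""
    return slope * feature + intercept


# Hypothetical references: a blank (0 g/l) and a 10 g/l sample.
slope, intercept = two_point_calibration(0.02, 0.0, 0.52, 10.0)
estimate = predict_concentration(0.27, slope, intercept)
```

A model built from only two points cannot reveal non-linearity or noise, which is consistent with larger reference sets allowing better performance of the model.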

In certain embodiments, the third set of reference protein samples may comprise at least two reference protein samples, wherein each reference protein sample has a known protein concentration, and wherein at least one reference protein sample has a protein concentration which is different from the protein concentration of at least one other reference protein sample in the third set.

In certain embodiments, the third set of reference protein samples may comprise at least two reference protein samples, wherein each reference protein sample has a known protein concentration, and wherein at least one reference protein sample has a protein concentration which is different from the protein concentration of each other reference protein sample in the third set.

In certain embodiments, the third set of reference protein samples may comprise at least two reference protein samples, wherein each reference protein sample has a known protein concentration, and wherein at least one reference protein sample has a protein concentration which is zero and which is different from the protein concentration of at least one other reference protein sample in the third set. In certain embodiments, the third set of reference protein samples may comprise at least two reference protein samples, wherein each reference protein sample has a known protein concentration, and wherein at least one reference protein sample has a protein concentration which is zero and which is different from the protein concentration of each other reference protein sample in the third set.

In certain embodiments, the third set of reference protein samples may comprise at least three reference protein samples. For example, the third set of reference protein samples may comprise at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred reference protein samples. A set of reference protein samples comprising at least three reference protein samples allows better performance of the third mathematical model.

In certain embodiments, the third set of reference protein samples may comprise at least three, such as at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred, reference protein samples, wherein each reference protein sample has a known protein concentration, and wherein at least one reference protein sample has a protein concentration which is different from the protein concentration of at least one other reference protein sample in the third set.

In certain embodiments, the third set of reference protein samples may comprise at least three, such as at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred, reference protein samples, wherein each reference protein sample has a known protein concentration, and wherein each reference protein sample has a protein concentration which is different from the protein concentration of each other reference protein sample in the third set.

In certain embodiments, the third set of reference protein samples may comprise at least three, such as at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred, reference protein samples, wherein each reference protein sample has a known protein concentration, and wherein at least one reference protein sample has a protein concentration which is zero and which is different from the protein concentration of at least one other reference protein sample in the third set. In certain embodiments, the third set of reference protein samples may comprise at least three, such as at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least twenty, at least thirty, at least forty, at least fifty, at least sixty, at least seventy, at least eighty, at least ninety, or at least one hundred, reference protein samples, wherein each reference protein sample has a known protein concentration, and wherein at least one reference protein sample has a protein concentration which is zero and which is different from the protein concentration of each other reference protein sample in the third set.
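The conditions recited above for a set of reference protein samples (a minimum number of samples, known values which differ between samples, and optionally a sample having a value of zero) may be checked programmatically before a model is prepared. The following is a minimal sketch under that reading; the function name and its parameters are hypothetical and serve only as an illustration.

```python
def check_reference_set(known_values, min_samples=2, require_zero=False,
                        all_distinct=False):
    """Check a list of known reference values against the stated conditions."""
    if len(known_values) < min_samples:
        return False
    if any(v < 0 for v in known_values):
        return False  # known values are equal to or greater than zero
    if require_zero and 0 not in known_values:
        return False  # a sample with a value of zero was requested
    if all_distinct:
        # each sample differs from each other sample
        return len(set(known_values)) == len(known_values)
    # at least one sample differs from at least one other sample
    return len(set(known_values)) >= 2


# Hypothetical third set: a blank and two protein concentrations in g/l.
is_valid = check_reference_set([0.0, 5.0, 10.0], min_samples=3,
                               require_zero=True, all_distinct=True)
```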

In certain embodiments, the first set of reference protein samples, the second set of reference protein samples, and the third set of reference protein samples may each consist of different reference protein samples. In certain embodiments, the first set of reference protein samples, the second set of reference protein samples, and the third set of reference protein samples may each comprise different reference protein samples.

In certain preferred embodiments, the first set of reference protein samples, the second set of reference protein samples, and the third set of reference protein samples may comprise the same reference protein samples.

In certain preferred embodiments, the first set of reference protein samples, the second set of reference protein samples, and the third set of reference protein samples may consist of the same reference protein samples. This advantageously allows preparing the first, second, and third mathematical model with one (i.e., a single, the same) set of reference protein samples.

The terms "statistical tool" or "statistical test" may be used interchangeably.

In certain embodiments of the methods, as taught herein, the statistical tool may be any statistical tool used in the field of chemometrics, in particular in the field of extracting information from chemical systems, such as the degree of protein glycosylation and the degree of protein phosphorylation, by data-driven means.

In certain embodiments of the methods, as taught herein, the statistical tool may be regression analysis.

In certain embodiments of the methods, as taught herein, the statistical tool may be selected from the group consisting of partial least squares (PLS) regression, classical least squares (CLS) regression, inverse least squares (ILS) regression, principal component regression (PCR), multiple linear regression (MLR), and support vector machine (SVM). In preferred embodiments, the statistical tool is PLS regression. PLS regression advantageously improves the preparation of the mathematical models as defined herein.

In certain embodiments of the methods, as taught herein, determining the degree of glycosylation and the degree of phosphorylation of a protein in a protein sample may be accomplished using Partial Least Squares (PLS) regression. In certain embodiments of the methods, as taught herein, determining the protein concentration in a protein sample may be accomplished using PLS. The basic aspects of the PLS regression method are explained in Geladi et al. (1986, Analytica Chimica Acta, 185, 1-17), which is hereby incorporated by reference in its entirety. A further review of PLS regression is provided in Wold et al. (2001, Chemometrics and Intelligent Laboratory Systems, 58, 109-130). The PLS regression method has the advantage of being particularly robust, which implies that the model parameters do not change very much when new calibration samples are taken from a population.

The PLS method allows building a mathematical model, for instance for the determination of the degree of glycosylation or for the determination of the degree of phosphorylation, from an IR spectrum obtained by FTIR spectroscopy.

Partial Least Squares Regression may be of particular interest because it can analyze data with strongly collinear (correlated), noisy, and numerous X-variables. Partial Least Squares Regression has the advantage of taking into account the entire spectral range. As such, it does not rely on a single absorbance measurement but on a pattern of absorbance that can encompass the entire spectrum.

In certain embodiments, the mathematical models as defined herein, such as the first, second, and/or third mathematical model as defined herein, may be written in matrix notation as:

Y = XB + E (Eq. 1a),

in which Y is a vector comprising the response variable (e.g., the degree of glycosylation, the degree of phosphorylation, or the protein concentration), X is a measured IR spectrum, B comprises regression coefficients that are determined during a calibration step, and E is an error term. In the calibration step in which B is built, new artificial variables (PLS components) are defined. The PLS components may correspond to linear combinations of the original IR spectra:

T = XW (Eq. 2a),

in which T is the matrix comprising the PLS factors, and W is a matrix which defines the combination of the original IR spectra X. In other words, W defines the linear transformation applied to X when constructing T. The matrix comprising the PLS factors, T, may be seen as a combination or "mixing" of the original spectra X. Also, just like X, T is a predictor of Y, e.g., the degree of phosphorylation, the degree of glycosylation, or the protein concentration. Therefore, in certain embodiments, the mathematical models as defined herein (Eq. 1a), such as the first, second, and/or third mathematical model as defined herein, may alternatively be written as:

Y = TQ + E (Eq. 3a),

in which Q is the regression coefficient matrix. The regression coefficient matrix Q is calculated with the training set.

Combination of Equations 1a to 3a yields:

B = WQ (Eq. 4a)
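The relations Y = XB + E, T = XW, Y = TQ + E, and B = WQ can be followed step by step in a small numerical sketch. The code below is a simplified, single-component PLS regression for one response variable. It is offered only as an illustration under stated assumptions: the data are mean-centred before fitting (a common convention which is not specified in the present text), only one PLS component is extracted, and the spectra are synthetic. All function and variable names are hypothetical.

```python
import math


def fit_pls1_one_component(X, y):
    """One-component PLS for a single response, on mean-centred data.

    X: list of spectra (one row per reference sample), y: known values.
    Returns (x_mean, y_mean, B) with B = W * Q.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]

    # Weight vector W: spectral direction of maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("spectra carry no covariance with the response")
    w = [v / norm for v in w]

    # Scores T = X W.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]

    # Regression of y on the scores: Y = T Q + E.
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)

    # Regression coefficients B = W Q.
    B = [wj * q for wj in w]
    return x_mean, y_mean, B


def predict(spectrum, x_mean, y_mean, B):
    """Estimate the response for a new spectrum (Y = X B on centred data)."""
    return y_mean + sum((spectrum[j] - x_mean[j]) * B[j] for j in range(len(B)))


# Synthetic example: a hypothetical band shape scaled by the known value
# of each reference sample.
band = [0.1, 0.4, 1.0, 0.4, 0.1]
known = [0.0, 1.0, 2.0, 4.0]
spectra = [[c * a for a in band] for c in known]
model = fit_pls1_one_component(spectra, known)
estimate = predict([3.0 * a for a in band], *model)
```

With more than one PLS component, each further weight vector is usually computed after removing from X the part already explained by the previous components; that deflation step is omitted here for brevity.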

In certain embodiments, a cross validation may be applied when building the PLS model. Among the n spectra used to establish a mathematical model, one spectrum may be left out of the calibration set, and the model built with the remaining spectra may be tested on the spectrum that did not contribute to the calibration. This procedure may be repeated until each IR spectrum has been excluded once from the building of the mathematical model. Each left-out spectrum may be used to determine the variable, such as the degree of phosphorylation, the degree of glycosylation, or the protein concentration, with the mathematical model established from the other spectra.

In certain embodiments, a cross-validation based on the root mean square error of cross-validation is applied when building a PLS model according to the following procedure: among the n spectra used to establish the model, one of them is left out from the calibration set and the model built with the remaining spectra is tested on the spectrum that did not contribute to the calibration. This procedure is repeated until each spectrum is excluded from the model building. The left-out spectrum is used to predict the desired characteristic (either the protein concentration, the degree of glycosylation, or the degree of phosphorylation) with the model built from the other spectra. Subsequently, the root mean square error of cross validation (RMSECV) may be calculated as:

RMSECV = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} (Eq. 5a)

wherein n is the number of spectra for the calibration, y_i is the degree of phosphorylation, the degree of glycosylation, or the protein concentration present in the sample corresponding to the spectrum i, and \hat{y}_i is the degree of phosphorylation, the degree of glycosylation, or the protein concentration determined for the left-out spectrum i. In particular, the lower the RMSECV, the more reliable the PLS model.
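The leave-one-out procedure and the RMSECV may be sketched as follows. For brevity the sketch reduces each spectrum to one scalar feature and uses an ordinary least-squares line in place of a PLS model; both simplifications are assumptions made for illustration, as are all names and data. The RMSECV is computed in its usual form, the square root of the mean squared difference between the value predicted for each left-out spectrum and its known value.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares line; stands in for any calibration model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def leave_one_out_predictions(xs, ys, fit):
    """Predict each sample with a model built from all the other samples."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(model(xs[i]))
    return predictions


def rmsecv(y_true, y_cv):
    """Root mean square error between known and cross-validated values."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_cv, y_true)) / n)


# Hypothetical data: one scalar feature per spectrum and its known value.
features = [0.0, 1.0, 2.0, 3.0, 4.0]
values = [0.1, 0.9, 2.1, 2.9, 4.1]
cv_predictions = leave_one_out_predictions(features, values, fit_line)
error = rmsecv(values, cv_predictions)
```

The same loop can be repeated for models built with different numbers of PLS components, retaining the model with the lowest RMSECV.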

In certain embodiments, the correlation coefficient r or R may be calculated as follows:

(Eq. 6a)

wherein n is the number of spectra for the calibration, y_i is the degree of phosphorylation, the degree of glycosylation, or the protein concentration present in the mixture corresponding to the spectrum i, \hat{y}_i is the degree of phosphorylation, the degree of glycosylation, or the protein concentration estimated by the model with spectrum i, and \bar{y} is the average of all reference measurement values in the calibration set.

The correlation coefficient r may be used to evaluate the performance of a specific model. In addition, the correlation coefficient r may be used to compare different models which were built using, for example, various initial spectral ranges or varying numbers of PLS components.
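As an illustration of how such a figure of merit may be used to compare models, the sketch below computes a coefficient of determination from the known values, the values estimated by the model, and the average of the known values. It is an assumption that this form, 1 - sum((\hat{y}_i - y_i)^2) / sum((y_i - \bar{y})^2), corresponds to the coefficient intended here; it is merely one common definition built from the same three quantities. The model labels and numerical values are hypothetical.

```python
def determination_coefficient(y_true, y_pred):
    """Return 1 - sum((yhat_i - y_i)^2) / sum((y_i - ybar)^2)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("reference values must not all be equal")
    return 1.0 - ss_res / ss_tot


def compare_models(y_true, predictions_by_model):
    """Return the label of the best-scoring model and all scores."""
    scores = {name: determination_coefficient(y_true, pred)
              for name, pred in predictions_by_model.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical estimates from two models built with different settings.
reference = [1.0, 2.0, 3.0, 4.0]
best, scores = compare_models(reference, {
    "2 PLS components": [1.5, 1.5, 3.5, 3.5],
    "3 PLS components": [1.1, 1.9, 3.2, 3.8],
})
```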

In certain embodiments of the methods, as taught herein, the protein sample contains an internal reference compound.

The terms "internal reference compound" or "internal standard" may be used interchangeably herein and refer to a chemical substance that is added in a constant amount to the protein sample and the reference protein samples before performing the FTIR spectroscopy.

Preferably, the internal reference is soluble and not volatile.

In the methods as taught herein, it is intended that the internal reference compound absorbs in a spectral range different from that of biological molecules such as proteins (1800-900 cm⁻¹). In certain embodiments, the reference protein samples as defined herein contain an internal reference compound. In certain embodiments, the concentration of the internal reference compound is the same in the protein sample (to be tested) and in the reference protein samples. In certain embodiments, the concentration of the internal reference compound is the same in the protein sample (to be tested) and in the reference protein samples used to prepare at least the third mathematical model as defined herein. In certain embodiments, the concentration of the internal reference compound is the same in the protein sample (to be tested) and in the reference protein samples used to prepare the first, second, and third mathematical models, as defined herein. In certain embodiments, the concentration of the internal reference compound is the same in the protein sample (to be tested) and in the reference protein samples used to prepare any of the mathematical models as defined herein. The internal reference compound advantageously calibrates the volume of the protein sample (to be tested) and the reference protein samples.
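One way in which an internal reference compound present at a constant concentration can compensate for differences in the amount of sample measured is to scale each spectrum by the integrated absorbance of the internal reference band. The following minimal sketch illustrates this idea. The scaling approach itself, the window of 2000-2100 cm⁻¹, and the numerical values are assumptions chosen for illustration and are not specified in the present text.

```python
def band_area(wavenumbers, absorbances, low, high):
    """Trapezoidal area of the absorbance between two wavenumbers."""
    points = sorted((w, a) for w, a in zip(wavenumbers, absorbances)
                    if low <= w <= high)
    return sum((w2 - w1) * (a1 + a2) / 2.0
               for (w1, a1), (w2, a2) in zip(points, points[1:]))


def normalise_to_internal_reference(wavenumbers, absorbances, low, high):
    """Scale a spectrum so that the internal reference band has unit area."""
    area = band_area(wavenumbers, absorbances, low, high)
    if area <= 0:
        raise ValueError("no internal reference signal in the chosen window")
    return [a / area for a in absorbances]


# Hypothetical spectra of the same sample, the second measured on twice
# the amount of material.
wavenumbers = [1500, 1600, 1700, 2000, 2050, 2100]
spectrum_a = [0.2, 0.8, 0.1, 0.0, 0.4, 0.0]
spectrum_b = [2.0 * a for a in spectrum_a]
norm_a = normalise_to_internal_reference(wavenumbers, spectrum_a, 2000, 2100)
norm_b = normalise_to_internal_reference(wavenumbers, spectrum_b, 2000, 2100)
```

After scaling, the two spectra coincide, so that differences remaining between samples reflect composition rather than the amount measured.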

In certain embodiments of the methods, as taught herein, the internal reference compound may be an azide, ferrocyanide, deuterated lipid, or nitrile. Advantageously, such internal reference avoids interference with the protein and its modifications.

Suitable non-limiting examples of azides include sodium azide (Sigma Aldrich, S8032, or Carl Roth, K305).

Suitable non-limiting examples of ferrocyanides include potassium ferrocyanide (Sigma Aldrich, P3289, or Carl Roth, 7974).

Suitable non-limiting examples of nitriles include propionitrile (Sigma Aldrich, 76671) and butyronitrile (Sigma Aldrich, 08436).

Suitable non-limiting examples of deuterated lipids include deuterated sphingolipids (Avanti Polar Lipids, Inc., AL, USA), deuterated phospholipids (Avanti Polar Lipids, Inc., AL, USA), and deuterated sterols (Avanti Polar Lipids, Inc., AL, USA). Examples of deuterated sphingolipids include sphingosine-d7 (860657, Avanti Polar Lipids, Inc., AL, USA), sphingosine-1-phosphate-d7 (860659, Avanti Polar Lipids, Inc., AL, USA), sphinganine-d7 (860658, Avanti Polar Lipids, Inc., AL, USA), 1-deoxysphinganine-d3 (860474, Avanti Polar Lipids, Inc., AL, USA), 1-deoxy-L-threo-sphinganine-d3 (860475, Avanti Polar Lipids, Inc., AL, USA), 1-desoxymethylsphinganine-d5 (860476, Avanti Polar Lipids, Inc., AL, USA), 16:0-d31 Ceramide (868516, Avanti Polar Lipids, Inc., AL, USA), 16:0-d31 SM (868584, Avanti Polar Lipids, Inc., AL, USA), glucosyl(β) sphingosine-d5 (860636, Avanti Polar Lipids, Inc., AL, USA), galactosyl(β) sphingosine-d5 (860637, Avanti Polar Lipids, Inc., AL, USA), and C18 GlcCer-d5 (860638, Avanti Polar Lipids, Inc., AL, USA).

In certain embodiments of the methods, as taught herein, determining the degree of glycosylation comprises determining the degree of one or more types of glycosylation. In other words, the method may be applied to a protein sample containing a mixture of different carbohydrates present on one or more protein species in the protein sample, in order to determine the amount of each carbohydrate.

In certain embodiments, the step of determining from the same IR spectrum the degree of one or more types of glycosylation and the degree of phosphorylation of the protein in the protein sample may be performed by calculating from the IR spectrum the degree of a first type of glycosylation of the protein using a fourth mathematical model configured to determine the degree of a first type of glycosylation of a protein, and optionally the degree of one or more further types of glycosylation of the protein using one or more further mathematical models (e.g. fifth, sixth mathematical models) configured to determine the degree of the one or more further types of glycosylation of a protein, and calculating from the IR spectrum the degree of phosphorylation of the protein using a second mathematical model configured to determine the degree of phosphorylation of a protein.

The recitation "mathematical model configured to determine the degree of a certain type of glycosylation of a protein (e.g., first type of glycosylation of a protein)", as used herein, refers to a description of a certain type of glycosylation of a protein (e.g., first type of glycosylation of a protein) using mathematical equations.

In certain embodiments of the methods, as taught herein, a mathematical model configured to determine the degree of a certain type of glycosylation of a protein (e.g., first type of glycosylation of a protein) may be prepared from standard IR spectra obtained by FTIR spectroscopy of a set of reference protein samples using a statistical tool, wherein each reference protein sample has a known degree of the certain type of glycosylation (e.g., first type of glycosylation) and optionally a known degree of phosphorylation, and wherein at least one, preferably each, reference protein sample has a degree of the certain type of glycosylation (e.g., first type of glycosylation) which is different from the degree of the certain type of glycosylation (e.g., first type of glycosylation) of at least one, preferably each, other reference protein sample in the first set, and wherein the second mathematical model is prepared from standard IR spectra obtained by FTIR spectroscopy of a second set of reference protein samples using a statistical tool, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein at least one, preferably each, reference protein sample has a degree of phosphorylation which is different from the degree of phosphorylation of at least one, preferably each, other reference protein sample in the second set.

In certain embodiments of the methods as taught herein, a reference protein sample and a protein sample (to be tested) may comprise the same types of glycosylation. For instance, in therapeutic protein development and in QA in protein manufacturing, the protein species in the reference protein samples used to build the mathematical models contain the same types of glycosylation as the protein species in the protein samples which are to be analysed. The protein species in a reference protein sample and a protein sample (to be tested) may differ in the degree of each type of glycosylation of the protein, the degree of phosphorylation of the protein, and/or the protein concentration.

In certain embodiments, the methods, as taught herein, may further comprise the step of determining from the (same) IR spectrum of the protein sample the degree of aggregation of the protein and/or the degree of denaturation of the protein in the protein sample.

The method embodying the principles of the present invention advantageously allows determining, from the same IR spectrum obtained by FTIR spectroscopy, the degree of glycosylation, the degree of phosphorylation, and the structural integrity of a protein in a protein sample.

The recitation "degree of aggregation of the protein and/or degree of denaturation of the protein" as used herein refers to the structural integrity of the protein.

In certain embodiments of the methods, as taught herein, the degree of aggregation and/or the degree of denaturation may be determined by comparing the IR spectrum of the protein sample with IR spectra of the (same) protein at progressively increasing temperatures.

In certain embodiments of the methods, as taught herein, the degree of aggregation and/or the degree of denaturation (i.e., the structural integrity) may be determined by comparing the stability index of the protein in the protein sample with the stability indexes of the (same) protein at progressively increasing temperatures.

In certain embodiments, the stability index of the protein at progressively increasing temperatures may be calculated from the absorbance at a specific wavenumber, from a relation of absorbance at various wavenumbers, from an integration of absorbance between two wavenumbers, or from a combination or a relation between two or more peak areas which are each calculated by an integration of absorbance between two wavenumbers, of the IR spectra of the protein at progressively increasing temperatures. In certain embodiments, the stability index of the protein in the protein sample may be calculated from the absorbance at a specific wavenumber, from a relation of absorbance at various wavenumbers, from an integration of absorbance between two wavenumbers, or from a combination or a relation between two or more peak areas which are each calculated by an integration of absorbance between two wavenumbers of the IR spectrum of the protein in the protein sample.
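By way of example, a stability index defined as a relation between two peak areas, and its comparison with the stability indexes of the same protein at progressively increasing temperatures, may be sketched as follows. The two integration windows, the temperature series, and all numerical values are hypothetical choices made for illustration; the appropriate windows depend on the protein under study.

```python
def band_area(wavenumbers, absorbances, low, high):
    """Integrate the absorbance between two wavenumbers (trapezoidal rule)."""
    points = sorted((w, a) for w, a in zip(wavenumbers, absorbances)
                    if low <= w <= high)
    return sum((w2 - w1) * (a1 + a2) / 2.0
               for (w1, a1), (w2, a2) in zip(points, points[1:]))


def stability_index(wavenumbers, absorbances, band_1, band_2):
    """Ratio of two peak areas, each integrated between two wavenumbers."""
    return (band_area(wavenumbers, absorbances, *band_1)
            / band_area(wavenumbers, absorbances, *band_2))


def closest_reference_temperature(sample_index, reference_indexes):
    """Temperature whose reference stability index is nearest the sample's."""
    return min(reference_indexes,
               key=lambda t: abs(reference_indexes[t] - sample_index))


# Hypothetical spectrum of the protein sample to be tested.
wavenumbers = [1610, 1620, 1630, 1640, 1650, 1660]
absorbances = [0.1, 0.2, 0.4, 0.5, 0.6, 0.6]
index = stability_index(wavenumbers, absorbances, (1620, 1630), (1650, 1660))

# Hypothetical stability indexes of the same protein at increasing
# temperatures (degrees Celsius).
reference_indexes = {25: 0.30, 45: 0.45, 65: 0.80, 85: 1.40}
nearest = closest_reference_temperature(index, reference_indexes)
```

A sample whose stability index matches that of the protein at an elevated temperature would, under this reading, be indicative of partial denaturation and/or aggregation.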

In certain preferred embodiments of the methods, as taught herein, the degree of aggregation and/or the degree of denaturation (i.e., the structural integrity) may be determined by comparing the absorbance at a specific wavenumber of the protein in the protein sample with the absorbance at a specific wavenumber of the (same) protein at progressively increasing temperatures.

In certain embodiments, the degree of aggregation and/or the degree of denaturation (i.e., the structural integrity) may be determined by comparing the IR spectrum of the protein sample with IR spectra obtained by FTIR spectroscopy of the (same) protein without (any) degradation and/or without (any) aggregation of the protein.

In certain embodiments of the methods, as taught herein, the degree of aggregation may be determined by using a further mathematical model configured to determine the degree of aggregation of a protein. In certain embodiments of the methods, as taught herein, the degree of denaturation may be determined by using a still further mathematical model configured to determine the degree of denaturation of a protein. The respective models are made according to the methods described herein, using standard IR spectra of sets of reference proteins having different degrees of protein aggregation and/or of protein denaturation.

In certain embodiments, the methods, as taught herein, may further comprise the step of determining from the same IR spectrum of the protein sample one or more secondary structure characteristics of the protein in the protein sample.

In certain embodiments of the methods, as taught herein, the secondary structure characteristic may be selected from the group consisting of alpha helix, beta sheet, beta turn, 3₁₀ helix, and random structure.

Using FTIR spectroscopy to determine secondary structure and/or secondary structure characteristics of proteins is known in the art (Byler et al., 1986, Biopolymers, 25, 469-87; Jackson et al., 1995, Crit. Rev. Biochem. Mol. Biol., 30, 95-120).

A method and an algorithm were developed which demonstrated that, using a stepwise approach, only three wavenumbers in an FTIR spectrum contain all the information that is necessary to determine the secondary structure of proteins. The method is explained in detail in the literature (Goormaghtigh et al., 2006, Biophys J., 90, 2946-57).

In certain embodiments of the methods, as taught herein, the one or more secondary structure characteristics may be determined by using a yet further mathematical model configured to determine one or more secondary structure characteristics of a protein. The yet further model is made according to the methods described herein, using standard IR spectra of sets of reference proteins having different secondary structure characteristics.

In certain embodiments of the methods, as taught herein, the method may be for determining the degree of phosphorylation, the degree of glycosylation, the concentration, the degree of aggregation, the degree of denaturation, and the one or more secondary structure characteristics of the protein in the protein sample. Such method advantageously allows the simultaneous determination of the degree of phosphorylation, the degree of glycosylation, the concentration, the degree of aggregation, the degree of denaturation, and the one or more secondary structure characteristics of the protein in a protein sample based on a single IR spectrum of the protein obtained by a single FTIR analysis.
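The determination of several characteristics from a single IR spectrum may be pictured as the application of several independently prepared linear models to the same vector of absorbances. The sketch below assumes that each model has been reduced to an intercept and a vector of regression coefficients; the model names, coefficients, and spectrum are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def apply_models(spectrum, models):
    """Apply every model to the same spectrum.

    models maps a characteristic name to (intercept, coefficients).
    """
    results = {}
    for name, (intercept, coefficients) in models.items():
        if len(coefficients) != len(spectrum):
            raise ValueError(f"model {name!r} does not match the spectrum length")
        results[name] = intercept + sum(
            x * b for x, b in zip(spectrum, coefficients))
    return results


# Hypothetical three-point spectrum and three hypothetical models.
spectrum = [0.5, 1.0, 0.25]
models = {
    "degree_of_glycosylation": (0.0, [2.0, 0.0, 0.0]),
    "degree_of_phosphorylation": (0.5, [0.0, 0.0, 2.0]),
    "protein_concentration": (0.0, [0.0, 3.0, 0.0]),
}
results = apply_models(spectrum, models)
```

Because every model reads the same spectrum, a single FTIR measurement suffices, and further models (for example for aggregation or denaturation) can be added to the dictionary without any new measurement.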

In certain embodiments of the methods, as taught herein, the first, second, and/or third and/or further mathematical models may be stored on a database.

In certain embodiments of the methods, as taught herein, the method may be implemented over the Internet, and/or comprise the use of Hypertext Transfer Protocol (HTTP).

In certain embodiments of the methods, as taught herein, the method may further comprise the step of outputting or displaying (the result of) the degree of glycosylation and the degree of phosphorylation of the protein in the protein sample.

In certain embodiments of the methods, as taught herein, the method may use a webserver.

In a second aspect the present invention relates to a computer apparatus configured for executing a method as defined herein. Hence, in a further aspect, the present invention relates to a computer apparatus configured for executing a method for determining at least the degree of glycosylation and the degree of phosphorylation of a protein in a protein sample, the method comprising the steps of: (a) receiving an IR spectrum obtained by FTIR spectroscopy of the protein sample, and (b) determining from the same IR spectrum the degree of phosphorylation of the protein and the degree of glycosylation of the protein in the protein sample.

In certain embodiments, the present invention relates to an apparatus comprising a processing unit and a computer readable medium configured to perform a method as defined herein. Hence, in certain embodiments, the present invention relates to an apparatus comprising a processing unit and a computer readable medium configured to perform a method for determining at least the degree of glycosylation and the degree of phosphorylation of a protein in a protein sample, the method comprising the steps of: (a) receiving an IR spectrum obtained by FTIR spectroscopy of the protein sample, and (b) determining from the same IR spectrum the degree of phosphorylation of the protein and the degree of glycosylation of the protein in the protein sample.

In a further aspect the present invention relates to a storage medium for executing a method as defined herein. Accordingly, in a further aspect, the present invention relates to a storage medium for executing a method for determining at least the degree of glycosylation and the degree of phosphorylation of a protein in a protein sample, the method comprising the steps of: (a) receiving an IR spectrum obtained by FTIR spectroscopy of the protein sample, and (b) determining from the same IR spectrum the degree of phosphorylation of the protein and the degree of glycosylation of the protein in the protein sample.

In certain embodiments, the present invention relates to a computer program product comprising computer program code means stored on a computer readable medium configured to perform a method as defined herein when said program is run on a computer. In particular, in certain embodiments, the present invention relates to a computer program product comprising computer program code means stored on a computer readable medium configured to perform a method for determining at least the degree of glycosylation and the degree of phosphorylation of a protein in a protein sample, the method comprising the steps of: (a) receiving an IR spectrum obtained by FTIR spectroscopy of the protein sample, and (b) determining from the same IR spectrum the degree of phosphorylation of the protein and the degree of glycosylation of the protein in the protein sample, when said program is run on a computer.

In a further aspect, the present invention relates to a database containing a first mathematical model configured to determine the degree of glycosylation of a protein and a second mathematical model configured to determine the degree of phosphorylation of a protein. The present database advantageously allows simultaneously determining the degree of phosphorylation and the degree of glycosylation of a protein in a protein sample. The database embodying the principles of the present invention allows the quantification of protein phosphorylation and protein glycosylation based on a single IR spectrum of the protein. The present database advantageously obviates the need to generate calibration IR spectra each time a protein sample is analysed for determining the degree of phosphorylation and the degree of glycosylation of the protein, e.g. in therapeutic protein development or during QA in protein manufacturing.

In yet a further aspect the present invention relates to a computer program stored on a computer-readable medium comprising the database as defined herein. Accordingly, in a further aspect, the present invention relates to a computer program stored on a computer-readable medium comprising the database containing a first mathematical model configured to determine the degree of glycosylation of a protein and a second mathematical model configured to determine the degree of phosphorylation of a protein.

In certain embodiments, the present invention relates to a computer program causing a computer processor to execute a method as taught herein. In certain embodiments, the present invention relates to a computer program causing a computer processor to execute a method for determining at least the degree of glycosylation and the degree of phosphorylation of a protein in a protein sample, the method comprising the steps of: (a) receiving an IR spectrum obtained by FTIR spectroscopy of the protein sample, and (b) determining from the same IR spectrum the degree of phosphorylation of the protein and the degree of glycosylation of the protein in the protein sample. In certain embodiments, the computer program may be stored on a computer readable medium.

In certain embodiments of the databases or computer programs stored on a computer-readable medium, as taught herein, the first mathematical model may be prepared from standard IR spectra obtained by FTIR spectroscopy of a first set of reference protein samples using a statistical tool, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein at least one reference protein sample has a degree of glycosylation which is different from the degree of glycosylation of at least one other reference protein sample in the first set, and wherein the second mathematical model may be prepared from standard IR spectra obtained by FTIR spectroscopy of a second set of reference protein samples using a statistical tool, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein at least one reference protein sample has a degree of phosphorylation which is different from the degree of phosphorylation of at least one other reference protein sample in the second set.

In certain embodiments of the databases or computer programs stored on a computer-readable medium, as taught herein, the first mathematical model may be prepared from standard IR spectra obtained by FTIR spectroscopy of a first set of reference protein samples using a statistical tool, wherein each reference protein sample has a known degree of glycosylation and optionally a known degree of phosphorylation, and wherein each reference protein sample has a degree of glycosylation which is different from the degree of glycosylation of each other reference protein sample in the first set, and wherein the second mathematical model may be prepared from standard IR spectra obtained by FTIR spectroscopy of a second set of reference protein samples using a statistical tool, wherein each reference protein sample has a known degree of phosphorylation and optionally a known degree of glycosylation, and wherein each reference protein sample has a degree of phosphorylation which is different from the degree of phosphorylation of each other reference protein sample in the second set.

In certain embodiments of the databases or computer programs stored on a computer-readable medium, as taught herein, the database may further comprise a third mathematical model configured to determine the protein concentration.

In certain embodiments of the databases or computer programs stored on a computer-readable medium, as taught herein, the third mathematical model may be prepared from standard IR spectra obtained by FTIR spectroscopy of a third set of reference protein samples using a statistical tool, wherein each reference protein sample has a known protein concentration and wherein at least one reference protein sample has a protein concentration which is different from the protein concentration of at least one other reference protein sample in the third set.

In certain embodiments of the databases or computer programs stored on a computer-readable medium, as taught herein, the third mathematical model may be prepared from standard IR spectra obtained by FTIR spectroscopy of a third set of reference protein samples using a statistical tool, wherein each reference protein sample has a known protein concentration and wherein each reference protein sample has a protein concentration which is different from the protein concentration of each other reference protein sample in the third set.

In certain embodiments of the databases or computer programs stored on a computer-readable medium, as taught herein, the statistical tool may be selected from the group consisting of partial least square (PLS) regression, classical least square (CLS) regression, inverse least square (ILS) regression, principal component regression (PCR), multiple linear regression (MLR), and support vector machine (SVM).

In certain embodiments of the databases or computer programs stored on a computer-readable medium, as taught herein, the statistical tool is PLS regression.

The present invention can be further illustrated by the following examples, although it will be understood that these examples are included merely for purposes of illustration and are not intended to limit the scope of the invention unless otherwise specifically indicated.

EXAMPLES

Example 1 : Simultaneous quantification of phosphorylation and glycosylation in protein samples with a method according to an embodiment of the invention

1.1 Sample preparation

Albumin (A3782, Sigma-Aldrich, Bornem, Belgium) and four peptides ordered from Eurogentec (AS-20292, AS-24537, AS-61329 and AS-61332, Liege, Belgium) were used. Two peptides (AS-20292 and AS-24537) were derived from the insulin receptor tyrosine kinase. AS-24537 is the unphosphorylated form of the peptide (TRDIYETDYYRK, SEQ ID NO: 1) and AS-20292 is phosphorylated once (TRDI-pY-ETDYYRK, SEQ ID NO: 2). The peptide AS-61329 (GTTPSPVPTTSTTSAP, SEQ ID NO: 3) was derived from the human mucin MUC5AC gene sequence. The MUC5AC gene is mainly expressed in gastric and tracheo-bronchial mucosae and in some tumors. It exhibits two kinds of deduced peptide domains, one of which is a tandemly repeated 8-amino-acid domain with the consensus peptide TTSTTSAP. AS-61332 is a glycopeptide with the same sequence as AS-61329 in which two sites (Thr3 and Thr13) are labelled with N-acetylgalactosamine.

All samples were purchased powdered and diluted in purified water at a concentration of 5 mg/ml. After dilution, samples were stored at -20°C.

The phosphate content in the sample was evaluated in mass percentage, which means the fraction of mass corresponding to phosphate divided by the total mass of the sample. AS-24537 contained 0 % (w/w) phosphate and AS-20292 contained 4.69 % (w/w) phosphate.

The carbohydrate content in the sample was evaluated in mass percentage which means the fraction of mass corresponding to carbohydrate divided by the total mass of the sample. For the pure peptides AS-61329 and AS-61332, the carbohydrate contents are respectively 0 % (w/w) and 21.28% (w/w) carbohydrate.

The peptides were mixed with albumin to obtain a large range of carbohydrate and phosphate contents as described in Table 1.
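The mass-percentage arithmetic behind such mixtures can be sketched in Python. This is a minimal illustration only: the function name `mixture_content` and the masses used below are hypothetical, and albumin is assumed to contribute no phosphate and no carbohydrate. Only the pure-peptide contents (4.69 % and 21.28 % w/w) come from the text.

```python
def mixture_content(components):
    """Mass percentage (w/w) of a modification in a mixture.

    components: list of (mass_mg, content_percent) pairs, one per component,
    where content_percent is the w/w content of the modification in that
    pure component.
    """
    total_mass = sum(mass for mass, _ in components)
    if total_mass <= 0:
        raise ValueError("total mass must be positive")
    modification_mass = sum(mass * pct / 100.0 for mass, pct in components)
    return 100.0 * modification_mass / total_mass


# Hypothetical masses in mg; they are NOT the values of Table 1.
albumin_mg, phospho_peptide_mg, glyco_peptide_mg = 2.0, 1.0, 1.0

# Assumption: albumin carries neither phosphate nor carbohydrate.
phosphate_pct = mixture_content(
    [(albumin_mg, 0.0), (phospho_peptide_mg, 4.69), (glyco_peptide_mg, 0.0)]
)
carbohydrate_pct = mixture_content(
    [(albumin_mg, 0.0), (phospho_peptide_mg, 0.0), (glyco_peptide_mg, 21.28)]
)
print(phosphate_pct, carbohydrate_pct)
```

Because each peptide makes up one quarter of the total mass in this hypothetical mixture, each content is one quarter of the pure-peptide value.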

Table 1 : Data description for the samples 1 to 7 with glycosylated peptide AS-61332 and phosphorylated peptide AS-20292. The phosphate content corresponds to the mass percentage of phosphate. The carbohydrate content corresponds to the mass percentage of carbohydrate

Corresponding samples were prepared without phosphorylation and without glycosylation using the peptides AS-24537 and AS-61329. The same quantity of peptides was added to albumin (Table 2).

Table 2: Data description for the samples 8 to 14 with unglycosylated peptide AS-61329 and unphosphorylated peptide AS-24537

Potassium ferrocyanide was added in all samples as an internal reference at a concentration of 0.1 mg/ml to use the third mathematical model according to an embodiment of the invention to determine the protein concentration (see Example 2). All samples were prepared six times on a different day with new albumin solution and new potassium ferrocyanide solution.

The first three preparations were used to build the mathematical models. The other three preparations were used to validate these mathematical models. For the validation samples, new vials of peptides from Eurogentec were ordered and new solutions of peptides were prepared.

1.2 FTIR spectroscopy

All measurements were carried out on a Bruker Tensor 27 FTIR spectrometer (Bruker, Karlsruhe, Germany) equipped with a liquid N2-refrigerated Mercury Cadmium Telluride detector. All spectra were recorded by attenuated total reflection (ATR). A diamond internal reflection element was used on a Golden Gate Micro-ATR from Specac (Orpington, UK). The angle of incidence was 45 degrees. A 0.5 μl amount of the protein samples was deposited on the diamond crystal. The sample was quickly evaporated in N2 flux to obtain a homogenous film of proteins. The FTIR measurements were recorded between 4000 and 600 cm⁻¹. Each spectrum was obtained by averaging 128 scans recorded at a resolution of 2 cm⁻¹. For each protein sample, five FTIR spectra were recorded.

1.3 Data analysis

All the spectra were pre-processed as follows. The water vapour contribution was subtracted with 1956-1935 cm⁻¹ as reference peak (as described in Goormaghtigh et al., 2009, Adv. Biomed. Spectrosc, 2, 104-128; Goormaghtigh et al., 1994, Spectrochim. Acta., 50, 2137-2144). The spectra were then baseline-corrected; straight lines were interpolated between the following frequencies: 3700 cm⁻¹, 3002 cm⁻¹, 2395 cm⁻¹, 2247 cm⁻¹, 1702 cm⁻¹, 1586 cm⁻¹, 1480 cm⁻¹, 1355 cm⁻¹, 1220 cm⁻¹, 1190 cm⁻¹, 1156 cm⁻¹, 1000 cm⁻¹, 960 cm⁻¹. Then, they were subtracted from the spectrum. Normalization for equal area was applied between 1702 cm⁻¹ and 1480 cm⁻¹.
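The baseline correction and area normalization described above can be sketched as follows. This is a minimal illustration with NumPy, not the software actually used; the function names are hypothetical, the water vapour subtraction is omitted, and the baseline is assumed to stay constant outside the outermost anchor points. The anchor wavenumbers and the normalization window are those given in the text.

```python
import numpy as np

# Baseline anchor points (cm^-1) taken from the text.
BASELINE_ANCHORS = [3700, 3002, 2395, 2247, 1702, 1586, 1480,
                    1355, 1220, 1190, 1156, 1000, 960]


def baseline_correct(wavenumbers, absorbance, anchors=BASELINE_ANCHORS):
    """Subtract straight lines interpolated between the anchor wavenumbers."""
    wn = np.asarray(wavenumbers, dtype=float)
    ab = np.asarray(absorbance, dtype=float)
    order = np.argsort(wn)                      # np.interp needs ascending x
    anchor_wn = np.sort(np.asarray(anchors, dtype=float))
    anchor_ab = np.interp(anchor_wn, wn[order], ab[order])
    # Assumption: outside the outermost anchors the baseline is held constant.
    baseline = np.interp(wn, anchor_wn, anchor_ab)
    return ab - baseline


def band_area(wavenumbers, absorbance, low, high):
    """Trapezoidal area of the spectrum between two wavenumbers."""
    wn = np.asarray(wavenumbers, dtype=float)
    ab = np.asarray(absorbance, dtype=float)
    order = np.argsort(wn)
    wn, ab = wn[order], ab[order]
    inside = (wn >= low) & (wn <= high)
    x, y = wn[inside], ab[inside]
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def normalize_area(wavenumbers, absorbance, low=1480.0, high=1702.0):
    """Scale the spectrum so that the area between low and high equals 1."""
    area = band_area(wavenumbers, absorbance, low, high)
    if area == 0:
        raise ValueError("zero area in the normalization window")
    return np.asarray(absorbance, dtype=float) / area


# Synthetic spectrum: sloping background plus one band near 1650 cm^-1.
wavenumbers = np.arange(4000.0, 599.0, -1.0)
raw = 1e-4 * wavenumbers + np.exp(-((wavenumbers - 1650.0) ** 2) / (2 * 20.0 ** 2))
processed = normalize_area(wavenumbers, baseline_correct(wavenumbers, raw))
```

Sorting inside each function keeps the sketch independent of whether the instrument exports wavenumbers in ascending or descending order.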

The phosphorylated peptide (AS-20292) and the glycosylated peptide (AS-61332) were diluted at various concentrations in albumin (Table 1). Corresponding samples were prepared without phosphorylation and without glycosylation using AS-24537 and AS-61329 (Table 2). In order to evidence spectral variations correlated to post-translational modifications, for each concentration, the mean IR spectrum of the protein sample without phosphorylation and without glycosylation (e.g. albumin mixed with AS-24537 and AS-61329) was subtracted from the IR spectra of the protein samples containing phosphorylation and glycosylation (e.g. albumin mixed with AS-20292 and AS-61332). Hereby, the "difference spectra" or "difference IR spectra", which represent the actual modifications corresponding to protein post-translational modifications, were obtained. All difference IR spectra were calculated with fully pre-processed spectra (baseline corrected and normalized).
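The subtraction that yields the difference spectra can be sketched as below. This is a minimal illustration with NumPy; the function name `difference_spectra` and the array layout (one pre-processed spectrum per row, all rows recorded at the same concentration) are assumptions made for the example.

```python
import numpy as np


def difference_spectra(modified, unmodified):
    """Subtract the mean unmodified spectrum from each modified spectrum.

    modified:   array (n_modified, n_points), spectra of samples containing
                phosphorylation and glycosylation.
    unmodified: array (n_reference, n_points), spectra of the corresponding
                samples without phosphorylation and without glycosylation.
    """
    modified = np.asarray(modified, dtype=float)
    reference_mean = np.asarray(unmodified, dtype=float).mean(axis=0)
    return modified - reference_mean


# Tiny made-up example with three spectral points per spectrum.
with_ptm = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
without_ptm = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
diff = difference_spectra(with_ptm, without_ptm)
```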

1.4 Partial least square regression

The statistical tool used in the method according to an embodiment of the invention was partial least square (PLS) regression. Partial least square regression (PLSR) advantageously allows handling many X-variables strongly correlated and possibly noisy. PLS regression allows modelling several response variables (Y). PLSR was used to build mathematical models allowing prediction of a desired characteristic (Y) from a measured spectrum (X), in particular allowing prediction of the quantity of glycosylation and the quantity of phosphorylation of a protein from an IR spectrum measured using FTIR spectroscopy.

It was attempted to determine the quantity of protein glycosylation and protein phosphorylation in a sample on the basis of the infrared spectrum of this sample. Therefore, first and second mathematical models illustrating the invention were built.

The first mathematical model illustrating the invention corresponded in matrix notation to:

$$Y_1 = X_1 B_1 + E_1$$

(Equation 1)

wherein $B_1$ contained the regression coefficients that were determined during the calibration step, $X_1$ was the matrix of collected spectra, and $E_1$ was the error term.

To build the first mathematical model according to an embodiment of the invention, new artificial variables or PLS components were defined. The PLS components corresponded to:

$$T_1 = X_1 W_1$$

(Equation 2)

wherein $T_1$ was the matrix of PLS factors and $W_1$ contained the coefficient of the linear combination. The new variables matrix $T_1$ was a combination of the original variable $X_1$ but also a predictor of $Y_1$ variables. The first mathematical model could thus be established:

$$Y_1 = T_1 Q_1 + E_1$$

(Equation 3),

wherein $Q_1$ is the regression coefficient matrix calculated with the training set.

The first mathematical model was thus:

$$Y_1 = X_1 B_1 + E_1$$

(Equation 1)

wherein $B_1 = W_1 Q_1$ (Equation 4).
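The relations between scores, weights and regression coefficients can be illustrated with a small single-response PLS fit. The sketch below uses the NIPALS algorithm with NumPy, which is one common way to compute a PLS model; the text does not state which algorithm or software was used, so this is an assumption, as are the function names and the mean-centering step. In the code, `W_star` plays the role of the matrix W of the text (scores are obtained as T = X W) and `q` plays the role of Q, so that B = W Q.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with the NIPALS algorithm.

    Returns (x_mean, y_mean, W_star, q, B), where W_star maps centered
    spectra to scores (T = Xc @ W_star) and B = W_star @ q.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean            # assumption: mean-centered data
    n_features = X.shape[1]
    W = np.zeros((n_features, n_components))   # NIPALS weight vectors
    P = np.zeros((n_features, n_components))   # X loadings
    q = np.zeros(n_components)                 # y loadings
    for a in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w                             # scores of component a
        tt = t @ t
        p = Xk.T @ t / tt
        q[a] = yk @ t / tt
        Xk = Xk - np.outer(t, p)               # deflate X
        yk = yk - q[a] * t                     # deflate y
        W[:, a], P[:, a] = w, p
    # Rotation so that scores come directly from the undeflated centered X.
    W_star = W @ np.linalg.inv(P.T @ W)
    return x_mean, y_mean, W_star, q, W_star @ q


def pls1_predict(model, X):
    """Predict the response for new spectra from a fitted model."""
    x_mean, y_mean, _, _, B = model
    return (np.asarray(X, dtype=float) - x_mean) @ B + y_mean


# Synthetic data: with as many components as variables, PLS reproduces an
# exactly linear relation.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 3.0
model = pls1_fit(X_demo, y_demo, n_components=3)
```

With real spectra the number of components is far smaller than the number of wavenumbers, and it is chosen from the cross-validation error rather than fixed in advance.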

A cross validation was always applied when building the PLS model. Among the n spectra used to establish the first mathematical model illustrating the invention, one of them was left out from the calibration set and the model built with the remaining spectra was tested on the spectrum that did not contribute to the calibration. This procedure was repeated until each IR spectrum was excluded for the mathematical model building. The left-out spectrum was used to determine the carbohydrate concentration with the established mathematical model. The Root Mean Square Error of Cross-Validation (RMSECV) of the predicted value was calculated as follows:

$$\mathrm{RMSECV}_1 = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_{i1} - \hat{y}_{i1} \right)^2}$$

(Equation 5)

wherein n is the number of spectra for the calibration, $y_{i1}$ is the carbohydrate content present in the sample corresponding to the spectrum i, and $\hat{y}_{i1}$ is the carbohydrate content determined for the left-out spectrum i.
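The leave-one-out procedure can be sketched as follows, assuming the usual root-mean-square definition of the RMSECV. The function `leave_one_out_rmsecv` and its `fit`/`predict` arguments are hypothetical names. To keep the example short, the stand-in model simply predicts the mean of the calibration values, which corresponds to a model with 0 PLS components; a PLS model would be passed in the same way.

```python
import numpy as np


def leave_one_out_rmsecv(X, y, fit, predict):
    """Return (RMSECV, predictions) from leave-one-out cross-validation.

    fit(X_train, y_train) must return a model object and
    predict(model, X_new) must return one prediction per row of X_new.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    predictions = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i               # leave spectrum i out
        model = fit(X[keep], y[keep])
        predictions[i] = predict(model, X[i:i + 1])[0]
    rmsecv = float(np.sqrt(np.mean((y - predictions) ** 2)))
    return rmsecv, predictions


# Stand-in model (not PLS): predict the mean of the calibration values.
def fit_mean(X_train, y_train):
    return float(np.mean(y_train))


def predict_mean(model, X_new):
    return np.full(len(X_new), model)


X_demo = np.zeros((3, 1))                      # spectra are ignored by this model
y_demo = np.array([1.0, 2.0, 3.0])
rmsecv, predictions = leave_one_out_rmsecv(X_demo, y_demo, fit_mean, predict_mean)
```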

The correlation coefficient $r_1$ or $R_1$ was calculated as follows:

$$R_1 = 1 - \frac{\sum_{i=1}^{n} \left( \hat{y}_{i1} - y_{i1} \right)^2}{\sum_{i=1}^{n} \left( y_{i1} - \bar{y}_1 \right)^2}$$

(Equation 6)

wherein n is the number of spectra for the calibration, $y_{i1}$ is the carbohydrate content present in the mixture corresponding to the spectrum i, $\hat{y}_{i1}$ is the quantity of protein glycosylation estimated by the model with spectrum i, and $\bar{y}_1$ is the average of all reference measurements values in the calibration set.
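A coefficient built from the quantities listed above (actual values, estimated values, and the mean of the reference values) can be sketched in plain Python. The sketch assumes the coefficient is one minus the ratio of the residual sum of squares to the total sum of squares, which is one common definition; the function name `calibration_r` is hypothetical.

```python
def calibration_r(actual, predicted):
    """One minus the residual sum of squares over the total sum of squares.

    Assumption: this is the form of the coefficient; it uses the actual
    values, the model estimates, and the mean of the actual values.
    """
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / n
    residual = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    total = sum((y - mean_actual) ** 2 for y in actual)
    if total == 0:
        raise ValueError("actual values must not all be identical")
    return 1.0 - residual / total


perfect = calibration_r([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])    # exact predictions
mean_only = calibration_r([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # always the mean
```

Under this definition a model that always predicts the mean scores 0 and a perfect model scores 1.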

The second mathematical model illustrating the invention corresponded in matrix notation to:

$$Y_2 = X_2 B_2 + E_2$$

(Equation 7)

wherein $B_2$ contained the regression coefficients that were determined during the calibration step, $X_2$ was the matrix of collected spectra, and $E_2$ was the error term.

To build the second mathematical model according to an embodiment of the invention, new artificial variables or PLS components were defined. The PLS components corresponded to:

$$T_2 = X_2 W_2$$

(Equation 8)

wherein $T_2$ was the matrix of PLS factors and $W_2$ contained the coefficient of the linear combination. The new variables matrix $T_2$ was a combination of the original variable $X_2$ but also a predictor of $Y_2$ variables. The second mathematical model could thus be established:

$$Y_2 = T_2 Q_2 + E_2$$

(Equation 9),

wherein $Q_2$ is the regression coefficient matrix calculated with the training set.

The second mathematical model was thus:

$$Y_2 = X_2 B_2 + E_2$$

(Equation 7)

wherein $B_2 = W_2 Q_2$ (Equation 10).

A cross validation was always applied when building the PLS model. Among the n spectra used to establish the mathematical model, one of them was left out from the calibration set and the model built with the remaining spectra was tested on the spectrum that did not contribute to the calibration. This procedure was repeated until each IR spectrum was excluded for the mathematical model building. The left-out spectrum was used to determine the phosphate concentration with the established mathematical model. The RMSECV of the predicted value was calculated as follows:

$$\mathrm{RMSECV}_2 = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_{i2} - \hat{y}_{i2} \right)^2}$$

(Equation 11)

wherein n is the number of spectra for the calibration, $y_{i2}$ is the phosphate content present in the sample corresponding to the spectrum i, and $\hat{y}_{i2}$ is the phosphate content determined for the left-out spectrum i.

The correlation coefficient $r_2$ or $R_2$ was calculated as follows:

$$R_2 = 1 - \frac{\sum_{i=1}^{n} \left( \hat{y}_{i2} - y_{i2} \right)^2}{\sum_{i=1}^{n} \left( y_{i2} - \bar{y}_2 \right)^2}$$

(Equation 12)

wherein n is the number of spectra for the calibration, $y_{i2}$ is the phosphate content present in the mixture corresponding to the spectrum i, $\hat{y}_{i2}$ is the quantity of protein phosphorylation estimated by the model with spectrum i, and $\bar{y}_2$ is the average of all reference measurements values in the calibration set.

1.5 Model calibration to determine the carbohydrate content

Eight concentrations of carbohydrate (7.03%, 3.54%, 1.7%, 1.28%, 0.85%, 0.64%, 0.43%, 0%) were prepared three times on three different days to build the model. As described in Table 1, all the samples also contain phosphate. The same samples were prepared again three other days to validate the first mathematical model illustrating the present invention. The corresponding samples without glycosylation and phosphorylation were prepared with AS-24537 and AS-61329 (same quantity of peptides was mixed with albumin, see Table 2). Five FTIR spectra were recorded per sample resulting in a total of 420 spectra.

The first mathematical model illustrating the invention was built using the difference spectra which correspond to the difference between the IR spectrum of each sample containing carbohydrate and phosphate and the mean IR spectrum of the corresponding sample without glycosylation and without phosphorylation.

A PLS model was built using the 140 difference spectra and considering the spectral region between 1400 and 900 cm⁻¹. Figure 1 describes the evolution of the Root Mean Square Error of Cross-Validation (RMSECV) as a function of the number of PLS components included in the model. The RMSECV is a first indicator of the performance of the model. The RMSECV of a model with 0 PLS components indicates the accuracy reached with predictions based only on average values of quantity associated with calibration spectra. The decrease of RMSECV with more PLS factors demonstrated that building a model improves the quality of the prediction. A drastic reduction of the RMSECV was observed when considering one PLS component in the first mathematical model illustrating the invention (Figure 1). Eight PLS components were chosen to build a first mathematical model according to an embodiment of the invention.

Figure 2 reports the actual carbohydrate content (the mass percentage of carbohydrate) in the various samples as a function of the predicted carbohydrate content using the first mathematical model. Each number represents one IR spectrum. As explained in the previous section, the correlation coefficient R and the RMSECV allow an evaluation of the performance of the first mathematical model illustrating the invention. When the RMSECV is close to zero, the predicted carbohydrate content will be close to the true values. If R is close to 1, the model is a good representation of the data. With the first mathematical model illustrating the invention, RMSECV1 was 0.2110 and $R_1$ was 0.9951.

1.6 Model calibration to determine the phosphate content

Eight concentrations of phosphate (0%, 0.1%, 0.23%, 0.28%, 0.38%, 0.66%, 1.31%, 2.63%) were prepared three times on three different days to build the model. As described in Table 1, all the samples also contain carbohydrate. The same samples were prepared again three other days to validate the second mathematical model illustrating the present invention. The corresponding samples without phosphorylation and glycosylation were prepared with AS-61329 and AS-24537 (same quantity of peptides was mixed with albumin, see Table 2). Five FTIR spectra were recorded per sample resulting in a total of 420 spectra.

The second mathematical model illustrating the invention was built using the difference spectra which correspond to the difference between the IR spectrum of each sample containing phosphate and carbohydrate and the mean IR spectrum of the corresponding sample without phosphorylation and without glycosylation.

A PLS model was built using the 140 difference spectra and considering the spectral region between 1400 and 900 cm⁻¹. Figure 3 describes the evolution of the root mean square error of cross validation (RMSECV) as a function of the number of PLS components included in a second mathematical model according to an embodiment of the invention. It is a first indicator of the performance of the model. The RMSECV of a model with 0 PLS components indicates the accuracy reached with predictions based only on average values of quantity associated with calibration spectra. The decrease of RMSECV with more PLS factors demonstrates that building a model improves the quality of the prediction. A drastic reduction of the RMSECV is observed when considering one PLS component in the second mathematical model illustrating the invention (Figure 3). Ten PLS components were chosen to build a second mathematical model illustrating the present invention.

Figure 4 reports the actual phosphate content (the mass percentage of phosphate) in the various samples as a function of the predicted phosphate content using the second mathematical model illustrating the invention. Each number represents one IR spectrum. As explained in the previous section, the correlation coefficient R and the RMSECV allow an evaluation of the performance of the models. When the RMSECV is close to zero, the predicted phosphate content will be close to the true values. If R is close to 1, the model is a good representation of the data. With the second mathematical model illustrating the invention, RMSECV2 was 0.0725 and $R_2$ was 0.9959. Hence, the second mathematical model according to an embodiment of the invention provided a good representation of the data.

1.7 Validation of the first and second mathematical models according to embodiments of the invention to determine the degree of phosphorylation and the degree of glycosylation

The final step was to validate the first and second mathematical models illustrating the invention on the samples from the three experiments that were not included in the calibration step. The IR spectra were recorded and the difference spectra were calculated as explained previously. Then, the first and second mathematical models illustrating the invention were applied to 105 difference spectra and the results are presented in Figure 5 and Figure 6. With an R² of 0.99 for the first mathematical model illustrating the invention and an R² of 0.98 for the second mathematical model illustrating the invention, the predicted values were highly correlated with the true values, underlining the performance of the first and second mathematical models illustrating the invention. The correlation coefficients, R², correspond to the correlation coefficients shown in Figures 5 and 6. R² corresponds to the correlation between predicted values and true values of the validation set, i.e., the set of spectra used to test and validate the models. Note that R² is distinct from $R_1$ and $R_2$ (Figures 2 and 4). $R_1$ and $R_2$ correspond to the correlation between predicted values and true values of the calibration set, i.e., spectra used to build the models.
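How a validation R² might be obtained from predicted and true values can be sketched as below. The text does not give the formula used for Figures 5 and 6, so the sketch assumes the squared Pearson correlation between the two series; the function name `squared_correlation` and the numbers are made up for illustration.

```python
import numpy as np


def squared_correlation(true_values, predicted_values):
    """Squared Pearson correlation between true and predicted values."""
    true_values = np.asarray(true_values, dtype=float)
    predicted_values = np.asarray(predicted_values, dtype=float)
    r = np.corrcoef(true_values, predicted_values)[0, 1]
    return float(r ** 2)


# Made-up validation values, not data from the examples.
true_demo = [0.5, 1.0, 2.0, 4.0]
noisy_r2 = squared_correlation(true_demo, [0.6, 0.9, 2.2, 3.8])
shifted_r2 = squared_correlation(true_demo, [1.5, 2.0, 3.0, 5.0])
```

A squared correlation is insensitive to a constant offset between predicted and true values (the shifted series above still correlates perfectly), which is why it is usually read together with an error measure.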

Table 3 below indicates the determination of carbohydrate and phosphate content for individual IR spectra for samples 1 to 7. At the bottom, the mean and the corresponding error were calculated. The error was around 5% for carbohydrate contents above 1%. For carbohydrate contents below 1%, the error was higher, exceeding 10%. For phosphate contents above 0.5%, the error was around 5%. The error was higher, up to 10%, for phosphate contents below 0.5% (Table 3).

Table 3: Determination of the carbohydrate content (the mass percentage of carbohydrate) and the phosphate content (the mass percentage of phosphate) using the PLS models for individual IR spectra

These results underline the feasibility of simultaneously quantifying glycosylation and phosphorylation in a protein sample containing both phosphate and carbohydrate.

Example 2: Simultaneous determination of the protein concentration in protein samples with a method according to an embodiment of the invention

Protein samples were prepared as described in Example 1, point 1.1. FTIR spectroscopy was performed as described above (Example 1, point 1.2).

2.1 Data analysis

All the IR spectra were pre-processed as follows. The water vapour contribution was subtracted as described previously (Goormaghtigh et al., 2009, Adv. Biomed. Spectrosc, 2, 104-128; Goormaghtigh et al., 1994, Spectrochim. Acta., 50, 2137-2144) with 1956-1935 cm⁻¹ as reference peak. The spectra were then baseline-corrected; straight lines were interpolated between the following frequencies: 3700 cm⁻¹, 3002 cm⁻¹, 2800 cm⁻¹, 2395 cm⁻¹, 2247 cm⁻¹, 1724 cm⁻¹, 1586 cm⁻¹, 1480 cm⁻¹, 1355 cm⁻¹, 1190 cm⁻¹, 960 cm⁻¹. Then, they were subtracted from the spectrum. Normalization for equal area was applied between 2080 and 2000 cm⁻¹.

2.2. Building the third mathematical model according to an embodiment of the invention to determine the protein concentration

2.2.1 Sample preparations

Albumin from human serum (A3782, Sigma-Aldrich, Bornem, Belgium), also used in the previous example, was purchased powdered. Albumin was weighed and diluted in purified water to prepare a solution of 10 mg/ml. Based on this solution, 7 concentrations of albumin were prepared: 9.5 mg/ml, 7.5 mg/ml, 5 mg/ml, 2.5 mg/ml, 1 mg/ml, 0.5 mg/ml, and 0.1 mg/ml. These samples also contained 0.1 mg/ml of potassium ferrocyanide as internal reference. They were prepared three times, each time on a different day with a new solution of albumin at 10 mg/ml.
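The volumes needed for such a dilution series follow from conservation of mass (stock concentration times stock volume equals target concentration times final volume). The sketch below is illustrative only: the function name `dilution_volumes` and the final volume of 100 μl are hypothetical, and the volume taken up by the potassium ferrocyanide addition is ignored.

```python
def dilution_volumes(stock_mg_per_ml, target_mg_per_ml, final_volume_ul):
    """Return (stock volume, diluent volume) in microlitres for one dilution."""
    if not 0 < target_mg_per_ml <= stock_mg_per_ml:
        raise ValueError("target must be positive and not exceed the stock")
    stock_ul = final_volume_ul * target_mg_per_ml / stock_mg_per_ml
    return stock_ul, final_volume_ul - stock_ul


STOCK = 10.0                                       # mg/ml, from the text
TARGETS = [9.5, 7.5, 5.0, 2.5, 1.0, 0.5, 0.1]      # mg/ml, from the text
FINAL_VOLUME_UL = 100.0                            # hypothetical final volume

series = {c: dilution_volumes(STOCK, c, FINAL_VOLUME_UL) for c in TARGETS}
for concentration, (stock_ul, water_ul) in series.items():
    print(f"{concentration} mg/ml: {stock_ul:.1f} ul stock + {water_ul:.1f} ul water")
```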

2.2.2 FTIR spectroscopy

All measurements were carried out on a Bruker Tensor 27 FTIR spectrometer (Bruker, Karlsruhe, Germany) equipped with a liquid N2-refrigerated Mercury Cadmium Telluride detector. All spectra were recorded by attenuated total reflection (ATR). A diamond internal reflection element was used on a Golden Gate Micro-ATR from Specac (Orpington, UK). The angle of incidence was 45 degrees. A 0.5 μl amount of the proteins was deposited on the diamond crystal. The sample was quickly evaporated in N2 flux to obtain a homogenous film of proteins. The FTIR measurements were recorded between 4000 and 600 cm⁻¹. Each spectrum was obtained by averaging 128 scans recorded at a resolution of 2 cm⁻¹. For each sample, between 3 and 5 FTIR spectra were recorded.

2.2.3 Data analysis

All IR spectra were pre-processed as follows. The water vapour contribution was subtracted as described previously (Goormaghtigh et al., 2009, Adv. Biomed. Spectrosc, 2, 104-128; Goormaghtigh et al., 1994, Spectrochim. Acta., 50, 2137-2144) with 1956-1935 cm⁻¹ as reference peak. The IR spectra were then baseline-corrected; straight lines were interpolated between the following frequencies: 3700 cm⁻¹, 3002 cm⁻¹, 2800 cm⁻¹, 2395 cm⁻¹, 2247 cm⁻¹, 1724 cm⁻¹, 1586 cm⁻¹, 1480 cm⁻¹, 1355 cm⁻¹, 1190 cm⁻¹, 960 cm⁻¹. Then, they were subtracted from the spectrum. Normalization for equal area was applied between 2080 and 2000 cm⁻¹.

2.2.4 Partial Least Square Regression

The statistical tool used in the method according to an embodiment of the invention was partial least square (PLS) regression. PLS regression was used to build a mathematical model allowing prediction of a desired characteristic (Y) from a measured spectrum (X), for instance allowing prediction of the protein concentration from an IR spectrum measured using FTIR spectroscopy.

It was attempted to determine the quantity of protein in a sample on the basis of the infrared spectrum of this sample. Therefore, a third mathematical model illustrating the invention was built.

The third mathematical model illustrating the invention corresponded in matrix notation to:

$$Y_3 = X_3 B_3 + E_3$$

(Equation 13)

wherein $B_3$ contained the regression coefficients that were determined during the calibration step, $X_3$ was the matrix of collected spectra, and $E_3$ was the error term.

To build the third mathematical model according to an embodiment of the invention, new artificial variables or PLS components were defined. The PLS components corresponded to:

$$T_3 = X_3 W_3$$

(Equation 14)

wherein $T_3$ was the matrix of PLS factors and $W_3$ contained the coefficients of the linear combination. The new variable matrix $T_3$ was a combination of the original variables $X_3$ but also a predictor of the $Y_3$ variables. The third mathematical model could thus be established:

$$Y_3 = T_3 Q_3 + E_3$$

(Equation 15)

wherein $Q_3$ is the regression coefficient matrix calculated with the training set.

The third mathematical model was thus:

$$Y_3 = X_3 B_3 + E_3$$

(Equation 13)

wherein $B_3 = W_3 Q_3$ (Equation 16). Cross-validation was always applied when building the PLS model. Among the n spectra used to establish the mathematical model, one was left out of the calibration set, and the model built with the remaining spectra was tested on the spectrum that did not contribute to the calibration. This procedure was repeated until each IR spectrum had been left out of the model building once. Each left-out spectrum was used to determine the protein concentration with the established mathematical model. The RMSECV of the predicted values was calculated as follows:

$$\mathrm{RMSECV}_3 = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i3} - \hat{y}_{i3}\right)^2}$$

(Equation 17)

wherein n is the number of spectra for the calibration, $y_{i3}$ is the protein concentration present in the sample corresponding to the spectrum i, and $\hat{y}_{i3}$ is the protein concentration determined for the left-out spectrum i.
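The leave-one-out procedure and the RMSECV can be illustrated with a short sketch. It assumes the usual definition of the RMSECV (square root of the mean squared difference between reference and cross-validated predicted values) and, to stay short, uses a one-component PLS1 model on mean-centred data, for which the regression vector is exactly the product of the weight vector and the scalar regression coefficient. The function names are hypothetical, and the model actually reported in the text uses more than one PLS component.

```python
import math


def fit_pls1_one_component(X, y):
    """Fit a one-component PLS1 model on mean-centred data.

    Returns (x_mean, y_mean, b) where b = w * q.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector w: direction of maximal covariance between X and y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores t = X w, then regress y on t to obtain q.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return x_mean, y_mean, [wj * q for wj in w]


def predict(model, x):
    """Predict the concentration for one spectrum x."""
    x_mean, y_mean, b = model
    return y_mean + sum((xj - mj) * bj for xj, mj, bj in zip(x, x_mean, b))


def rmsecv_leave_one_out(X, y):
    """Leave each spectrum out once, predict it, and return the RMSECV."""
    sq_err = 0.0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        y_hat = predict(fit_pls1_one_component(X_train, y_train), X[i])
        sq_err += (y[i] - y_hat) ** 2
    return math.sqrt(sq_err / len(X))
```

With spectra that scale exactly with concentration, every left-out spectrum is predicted without error and the RMSECV is zero; measured spectra give a positive value.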

The correlation coefficient $r_3$ or $R_3$ is calculated as follows:

(Equation 18)

wherein n is the number of spectra for the calibration, $y_{i3}$ is the protein concentration present in the mixture corresponding to the spectrum i, $\hat{y}_{i3}$ is the quantity of protein estimated by the model with spectrum i, and $\bar{y}$ is the average of all reference measurement values in the calibration set.
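One common coefficient that uses exactly these quantities (reference values, estimated values, and the mean of the reference values) is the coefficient of determination. The sketch below assumes that form; whether it matches Equation 18 term for term cannot be confirmed from the text, and the function name is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot
```

A model that reproduces every reference value gives 1, and a model that always predicts the mean of the reference values gives 0.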

2.2.5 Calibration of the third mathematical model illustrating the invention

A global PLS model was built considering the spectral region between 1800 cm⁻¹ and 1400 cm⁻¹ and using the IR spectra of all the samples prepared as described above in point 2.2.1. Figure 7 describes the evolution of the root mean square error of cross validation (RMSECV) as a function of the number of PLS components included in the third mathematical model illustrating the invention. The RMSECV is an indicator of the performance of the model. The RMSECV of a model with 0 PLS components indicates the accuracy reached with predictions based only on the average of the quantities associated with the calibration spectra. The decrease of RMSECV with more PLS factors demonstrates that building a model improves the quality of the prediction. A drastic reduction of the RMSECV was observed when considering one PLS component (Figure 7). Three PLS components were selected to build the third mathematical model according to an embodiment of the invention. Figure 8 illustrates the actual protein concentration in the various samples as a function of the protein concentration determined using the third mathematical model according to an embodiment of the invention. Each number represents one IR spectrum. With the third mathematical model illustrating the invention, $\mathrm{RMSECV}_3$ was 0.6766 and $R_3$ was 0.9803.

2.2.6 Validation of the third mathematical model illustrating the invention

The third mathematical model illustrating the invention was tested on samples that were not included in the calibration step. The same procedure as described in points 2.2.1 and 2.2.2 was followed to prepare the new samples. The third mathematical model was applied to the FTIR spectra of these new samples. Figure 9 presents the results of the application of the third mathematical model. With an $R^2$ of 0.96, the predicted values are highly correlated with the true values, underlining the performance of this third mathematical model according to an embodiment of the invention.

2.3 Determination of the protein concentration using the third mathematical model according to an embodiment of the present invention

The third mathematical model according to an embodiment of the invention, established as described in point 2.2, was applied to samples 3 to 5 (Table 1). The results are presented in Table 4. The protein concentration of all the samples was 5 mg/ml.

Table 4: Determined protein concentration (mg/ml) and error (%) for samples 3, 4 and 5 using the third mathematical model according to an embodiment of the invention

Experiment   Sample     Protein concentration (mg/ml)   Error (%)
1            Sample 5   4.76                            4.71
1            Sample 4   4.83                            3.38
1            Sample 3   4.48                            10.44
2            Sample 5   4.90                            2.02
2            Sample 4   4.29                            14.26
2            Sample 3   4.35                            13.03
3            Sample 5   4.67                            6.56
3            Sample 4   4.34                            13.18
3            Sample 3   4.33                            13.45
4            Sample 5   4.64                            7.29
4            Sample 4   4.44                            11.13
4            Sample 3   3.91                            21.79
5            Sample 5   4.64                            7.24
5            Sample 4   5.75                            15.09
5            Sample 3   5.50                            10.01
6            Sample 5   5.04                            0.82
6            Sample 4   4.89                            2.25
6            Sample 3   4.75                            5.08

The error varied between 0.82% and 21.79%. The method according to an embodiment of the invention may be optimized, for instance, by varying the choice of the internal reference and/or the number of IR spectra used to build the model. The third mathematical model according to an embodiment of the invention was built only with pure albumin (see Example 2, 2.2), whereas the protein concentrations in Table 4 were determined for mixtures of albumin and peptides. This might also have been a source of error.
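The error column of Table 4 is consistent with a relative deviation from the nominal concentration of 5 mg/ml. A minimal sketch, assuming that definition (the text does not state the formula, and the function name is hypothetical):

```python
def percent_error(determined, nominal=5.0):
    """Relative deviation from the nominal concentration, in percent."""
    return abs(determined - nominal) / nominal * 100.0
```

For example, a determined concentration of 4.48 mg/ml gives about 10.4%, against 10.44% in Table 4. The small differences are expected if the tabulated errors were computed from unrounded concentrations.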

Example 3: Simultaneous determination of the degree of aggregation and the degree of denaturation in protein samples with a method according to an embodiment of the invention

Protein samples were prepared as described in Example 1, point 1.1. FTIR spectroscopy was performed as described above (Example 1, point 1.2).

3.1. Data analysis

All the IR spectra were pre-processed as follows. The water vapour contribution was subtracted with 1956-1935 cm⁻¹ as reference peak. The IR spectra were then baseline-corrected; straight lines were interpolated between the following frequencies: 3700 cm⁻¹, 3002 cm⁻¹, 2800 cm⁻¹, 2395 cm⁻¹, 2247 cm⁻¹, 1724 cm⁻¹, 1586 cm⁻¹, 1480 cm⁻¹, 1355 cm⁻¹, 1190 cm⁻¹, 960 cm⁻¹. These lines were then subtracted from the spectrum. Normalization for equal area was applied between 1724 and 1480 cm⁻¹. Finally, the second derivative was calculated using Savitzky-Golay smoothing.
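The Savitzky-Golay second derivative can be sketched as below. The window size and polynomial order are not given in the text; a 5-point window with a quadratic fit is assumed here purely for illustration, using the standard convolution coefficients (2, -1, -2, -1, 2)/7 for that case. The function name is hypothetical.

```python
# 5-point quadratic Savitzky-Golay coefficients for the second derivative
# (unit spacing). Window size and polynomial order are assumptions.
SG2_COEFFS = (2.0, -1.0, -2.0, -1.0, 2.0)
SG2_NORM = 7.0


def savgol_second_derivative(y, step=1.0):
    """Second derivative at interior points of an evenly spaced spectrum.

    The two points at each end are returned as None because the 5-point
    window does not fit there.
    """
    out = [None] * len(y)
    for i in range(2, len(y) - 2):
        acc = sum(c * y[i + k - 2] for k, c in enumerate(SG2_COEFFS))
        out[i] = acc / (SG2_NORM * step * step)
    return out
```

Applied to samples of a parabola y = x², the interior values equal 2 regardless of the sampling step, which is a quick sanity check of the coefficients.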

3.2. Establishing the stability index of albumin

To verify the aggregation and denaturation of the protein, referred to herein as the structural integrity of the protein, the absorbance at 1622 cm⁻¹ of the protein sample was compared with the absorbance of the protein without any degradation and/or aggregation. A stability index for albumin was developed to establish that 1622 cm⁻¹ is the wavenumber at which the absorbance should be compared.

3.2.1 Sample preparation

Albumin from human serum (A3782, Sigma-Aldrich, Bornem, Belgium) was used to demonstrate the possibility of assessing the structural integrity by FTIR spectroscopy. The protein was purchased as a powder and used without further purification. The protein was dissolved in 0.05 M NaCl at a concentration of 5 mg/ml. Powders were stored at 4°C and solutions at -20°C.

The protein was progressively heated from room temperature to 90°C to induce conformational changes and aggregation. Two heating protocols were applied. For the first protocol, the protein was successively exposed for 20 minutes to 40°C, 50°C, 55°C, 60°C, 65°C, 70°C, 75°C, 80°C and 90°C. Measurements were recorded at all the aforesaid temperatures and for a sample of unheated proteins (control). For the second protocol, the protein was consecutively heated for 20 minutes at 40°C, 50°C, 55°C, 57°C, 59°C, 61°C, 63°C, 65°C, 67°C and 69°C. Measurements were recorded at all temperatures between 55°C and 69°C and for a sample of unheated proteins (control). Both protocols were applied three times. After the heating and before the measurements, all samples were cooled to room temperature.

3.2.2 FTIR Spectroscopy

All measurements were carried out on a Bruker Tensor 27 FTIR spectrometer (Bruker, Karlsruhe, Germany) equipped with a liquid N₂-refrigerated Mercury Cadmium Telluride detector. All spectra were recorded by attenuated total reflection (ATR). A diamond internal reflection element was used on a Golden Gate Micro-ATR from Specac (Orpington, UK). The angle of incidence was 45 degrees. A 0.5 µl amount of the protein sample was deposited on the diamond crystal. The sample was quickly evaporated under N₂ flux to obtain a homogeneous film of proteins. The FTIR measurements were recorded between 4000 and 600 cm⁻¹. Each IR spectrum was obtained by averaging 128 scans recorded at a resolution of 2 cm⁻¹. For each temperature, between 3 and 6 independent samples were prepared. For each sample, between 3 and 5 FTIR spectra were recorded.

3.2.3 Data analysis

All the spectra were pre-processed as follows. The water vapour contribution was subtracted with 1956-1935 cm⁻¹ as reference peak. The spectra were then baseline-corrected; straight lines were interpolated between the following frequencies: 3700 cm⁻¹, 3002 cm⁻¹, 2800 cm⁻¹, 2395 cm⁻¹, 2247 cm⁻¹, 1724 cm⁻¹, 1586 cm⁻¹, 1480 cm⁻¹, 1355 cm⁻¹, 1190 cm⁻¹, 960 cm⁻¹. These lines were then subtracted from the spectrum. Normalization for equal area was applied between 1724 and 1480 cm⁻¹. Finally, the second derivative was calculated using Savitzky-Golay smoothing.

In order to reveal the spectral variations induced by the heating, the mean IR spectra of unheated proteins (controls) were subtracted from the mean IR spectra of proteins at every temperature. In this way, "difference spectra" were obtained, which represent the actual modifications caused by conformational changes and aggregation (Figure 10). All difference spectra were calculated with fully pre-processed IR spectra (second derivative, baseline-corrected and normalized). Student's t-tests were computed at every wavenumber and allowed a statistical comparison between the spectra of degraded proteins and the control proteins.
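The difference spectra and the wavenumber-by-wavenumber comparison can be sketched as follows. The sketch assumes a two-sample Student t statistic with pooled variance (the text does not say which variant was used) and stops at the statistic itself; converting it to a p-value requires the t distribution with n1 + n2 - 2 degrees of freedom. Function names are hypothetical.

```python
import math
from statistics import mean, variance


def difference_spectrum(heated, control):
    """Mean heated spectrum minus mean control spectrum, per wavenumber.

    heated, control: lists of spectra, each a list of values on the same
    wavenumber grid.
    """
    return [mean(col_h) - mean(col_c)
            for col_h, col_c in zip(zip(*heated), zip(*control))]


def t_statistics(heated, control):
    """Two-sample Student t statistic (pooled variance) per wavenumber."""
    n1, n2 = len(heated), len(control)
    out = []
    for col_h, col_c in zip(zip(*heated), zip(*control)):
        diff = mean(col_h) - mean(col_c)
        pooled = ((n1 - 1) * variance(col_h)
                  + (n2 - 1) * variance(col_c)) / (n1 + n2 - 2)
        se = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
        if se == 0.0:
            # No spread in either group: the statistic is 0 or unbounded.
            out.append(0.0 if diff == 0.0 else math.copysign(math.inf, diff))
        else:
            out.append(diff / se)
    return out
```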

Figure 10 illustrates the IR spectral range from 1800 to 1400 cm⁻¹, as these wavenumbers are the most representative of the protein structure. The shape of the difference spectra reveals spectral variations relating to the conformational modification caused by the rising temperature. Wavenumbers where a significant difference occurred (with a significance level α = 0.1 % as calculated by Student's t-test) are indicated by black stars.

For albumin, it can be noticed that significant variations appeared from 57°C. Moreover, spectral changes were observed at similar wavenumbers and were more intense with increasing temperature (Figure 10).

For albumin, the most intense variation found in the difference spectra corresponded to the peak between 1630 and 1610 cm⁻¹, which is marked by a rectangle (Figure 10). The absorbance at the maximum of this band (1622 cm⁻¹) was monitored. The mean of this absorbance and the standard deviation were calculated for each temperature and are presented in Figure 11. The mean of this absorbance represented the stability index of albumin. Figure 11 illustrates the possibility of using FTIR spectroscopy to monitor the structural integrity of proteins.
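Computing the stability index from replicate spectra can be sketched as below. The sketch assumes that the value is read at the grid point closest to 1622 cm⁻¹ and that the sample standard deviation is used; neither detail is specified in the text, and the function name is hypothetical.

```python
from statistics import mean, stdev


def stability_index(wavenumbers, spectra, target=1622.0):
    """Mean and sample standard deviation of the value at the grid point
    closest to `target`, taken across replicate spectra."""
    idx = min(range(len(wavenumbers)),
              key=lambda i: abs(wavenumbers[i] - target))
    values = [s[idx] for s in spectra]
    return mean(values), stdev(values)
```

Calling the function once per temperature on the replicate spectra recorded at that temperature yields the kind of mean and spread that Figure 11 is described as showing.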

3.3 Results of the structural integrity

The mean absorbance at 1622 cm⁻¹ was calculated for samples 3, 4 and 5 (Table 1) of the six experiments. As the peptides in the samples have no secondary structure, the samples containing less peptide (i.e., samples 3, 4, and 5) were selected for analysis. The results of the stability index (i.e., mean absorbance at 1622 cm⁻¹) are presented in Table 5.

Table 5: Mean of absorbance at 1622 cm⁻¹ for samples 3, 4, and 5 of the six experiments

Experiment   Sample     Stability index
1            Sample 5   0.11
1            Sample 4   0.10
1            Sample 3   0.09
2            Sample 5   0.10
2            Sample 4   0.11
2            Sample 3   0.10
3            Sample 5   0.12
3            Sample 4   0.09
3            Sample 3   0.09
4            Sample 5   0.12
4            Sample 4   0.12
4            Sample 3   0.12
5            Sample 5   0.10
5            Sample 4   0.13
5            Sample 3   0.10
6            Sample 5   0.12
6            Sample 4   0.12
6            Sample 3   0.12

The mean absorbance at 1622 cm⁻¹ ranged between 0.09 and 0.12 (Table 5). When comparing the results of samples 3, 4, and 5 with the stability index (i.e., mean absorbance at 1622 cm⁻¹) of albumin (Figure 11), it can be concluded that the proteins in samples 3, 4, and 5 present an intact structure (without protein aggregation or denaturation).

Example 4: Building a PLS model

The present example serves to illustrate the building of a PLS model.

The first step is a calibration step. In this particular example, 140 difference spectra are used during the calibration step. Difference spectra correspond to the difference between the IR spectra of the protein samples containing phosphorylation and glycosylation (e.g. albumin mixed with AS-20292 and AS-61332) and the mean IR spectra of the protein sample without phosphorylation and without glycosylation (e.g. albumin mixed with AS-24537 and AS-61329). Corresponding samples were prepared with phosphorylations and glycosylations (e.g. albumin mixed with AS-20292 and AS-61332) and without phosphorylation and glycosylation (e.g. albumin mixed with AS-24537 and AS-61329) (Table 1 and Table 2).

The difference spectra used in the calibration step are referred to as calibration spectra. The calibration spectra comprise absorbance data in the wavenumber range between 900 and 3000 cm⁻¹. Each calibration spectrum was measured on a sample with a particular, known carbohydrate content. The spectra and associated carbohydrate contents are loaded on a computer which is configured to build a PLS model.

In a second step, a user selects the spectral region which will be used for building the PLS model. The computer then receives the spectral region as input from the user. The spectral region is, for example, the entire range of wavenumbers which have been measured, i.e. 900-3000 cm⁻¹ in the present example.

In a third step, twelve new variables called PLS factors are calculated as a linear combination of the calibration spectra.

The fourth step is an optimization step in which the optimum spectral range and the optimum number of PLS factors are selected. The RMSECV is used to determine the optimum spectral range for extracting the carbohydrate content in a particular sample. In particular, this optimum spectral range is determined by dividing the entire range of measured wavenumbers into various intervals. Several numbers of intervals are tested in order to calibrate the optimal model (i.e., to obtain the minimal RMSECV, with r as close to 1 as possible). The optimum spectral range is the spectral range having the lowest RMSECV. The optimum number of PLS factors is selected by considering the RMSECV of PLS models having 0 to 12 PLS factors. The first minimum in the RMSECV as a function of the number of PLS factors is selected as the optimum number of PLS factors. As a general rule, the number of PLS factors may be selected to be as small as possible while achieving a prediction that is as good as possible, i.e. corresponding to r being about 1.
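The rule of taking the first minimum of the RMSECV curve can be sketched as a simple scan. The function name is hypothetical, and ties or noisy plateaus are not treated specially.

```python
def first_minimum(rmsecv):
    """Number of PLS factors at the first minimum of the RMSECV curve.

    rmsecv[k] is the RMSECV of the model with k PLS factors (k = 0, 1, ...).
    The scan stops at the first point where adding a factor no longer
    lowers the RMSECV.
    """
    k = 0
    while k + 1 < len(rmsecv) and rmsecv[k + 1] < rmsecv[k]:
        k += 1
    return k
```

For an illustrative (made-up) curve `[3.0, 1.2, 0.9, 0.7, 0.75, 0.6]`, the function returns 3: the curve rises again at four factors, so the later, lower value at five factors is ignored in favour of the smaller model.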

In a fifth and final step, the PLS model is calculated using the optimum spectral zone, 900-1400 cm⁻¹, and using 8 PLS factors. The PLS factors are linear combinations of the calibration spectra, wherein the weighting factors in the linear combination are chosen such that the RMSECV is minimized.