METHODS AND APPARATUSES INVOLVING MASS SPECTROMETRY TO IDENTIFY PROTEINS IN A SAMPLE

Title:

METHODS AND APPARATUSES INVOLVING MASS SPECTROMETRY TO IDENTIFY PROTEINS IN A SAMPLE

Document Type and Number:

WIPO Patent Application WO/2014/116711

Kind Code:

Abstract:

A mass spectrometry system comprises a mass spectrometer and a controller that includes a computer-readable medium on which programming is encoded, configured to: (i) direct the mass spectrometer to acquire a precursor ion spectrum of a sample stream; (ii) analyze, in real-time, the precursor ion spectrum to determine whether the precursor ion is derived from a first species; and (iii) if the precursor ion is derived from the first species, direct the mass spectrometer to not analyze additional precursor ions corresponding to the first species. The controller acquires a precursor ion spectrum and determines whether the precursor ion corresponds to a protein on an exclusion list; when it does not, the controller identifies a first peptide corresponding to a first protein from which the precursor ion is derived, increments a peptide count corresponding to the first protein, and, when the peptide count for the first protein reaches a predetermined threshold, adds the first protein to the exclusion list.

Inventors:

VOLCHENBOUM SAM (US)
KRON STEPHEN J (US)
MAYAMPURATH ANOOP M (US)

Application Number:

PCT/US2014/012564

Publication Date:

July 31, 2014

Filing Date:

January 22, 2014

Assignee:

UNIV CHICAGO (US)

International Classes:

G01N33/00; G16B40/10; G06G7/58; H01J49/04

Foreign References:

US20060243900A1	2006-11-02
US20020102610A1	2002-08-01
US20120109530A1	2012-05-03
US20090189063A1	2009-07-30
US20060247865A1	2006-11-02
US20050288865A1	2005-12-29
US20030124606A1	2003-07-03

Attorney, Agent or Firm:

SHISHIMA, Gina, N. (98 San Jacinto Blvd. Suite 110, Austin TX, US)

Claims:

CLAIMS

1. A system, comprising:

a mass spectrometer; and

a controller connected to the mass spectrometer including a computer-readable medium on which programming is encoded, configured to:

direct the mass spectrometer to acquire a precursor ion spectrum of a sample stream; determine whether the precursor ion corresponds to a protein on an exclusion list; and when the precursor ion does not correspond to a protein on the exclusion list:

analyze the precursor ion spectrum to identify a first peptide corresponding to a first protein from which the precursor ion is derived;

increment a peptide count corresponding to the first protein; and

when the peptide count for the first protein reaches a predetermined threshold, add the first protein to an exclusion list.

2. The system of claim 1, wherein the first protein corresponds to a first species, and wherein the controller is configured, through the programming on the computer readable medium, to add to the exclusion list a plurality of proteins corresponding to the first species when the peptide count for the first protein reaches the predetermined threshold.

3. The system of claim 2, wherein the first species is Homo sapiens.

4. The system of claim 2, wherein the first species is a human pathogen.

5. The system of claim 2, wherein the first species is a virus, fungus or bacterium.

6. The system of claim 2, wherein the controller is further configured to: analyze, in real-time, a second precursor ion spectrum for a second precursor ion to determine a second peptide corresponding to a second protein from which the second precursor ion is derived; identify the second protein corresponding to the second peptide and increment a peptide count corresponding to the second protein; and when the peptide count for the second protein reaches the predetermined threshold, add the second protein to the exclusion list.

7. The system of claim 2, wherein the first species is a plant.

8. The system of claim 2, wherein the first species is a cow, pig or chicken.

9. The system of claim 2, wherein the first species is a fish.

10. The system of claim 2, wherein the controller is further configured to add the first species to a database.

11. The system of claim 2, further comprising a database of masses of peptides derived from the first species, wherein the controller is further configured to add to the exclusion list the masses of peptides derived from the first species when the peptide count for at least one of the masses of peptides reaches the predetermined threshold.

12. The system of claim 2, wherein the controller is further configured to identify one or more microbes.

13. The system of claim 2, wherein the controller is further configured to, when the peptide count reaches the predetermined threshold and the first protein corresponds to a set of proteins forming a molecular pathway, add to the exclusion list at least one additional protein forming the same molecular pathway as the first protein.

14. The system of claim 13, wherein the molecular pathway is a cancer pathway or an inflammation pathway.

15. The system of claim 1, wherein the step of analyzing the precursor ion spectrum is performed in real-time.

16. A method for controlling a mass spectrometer, comprising:

directing a mass spectrometer to acquire a precursor ion spectrum of a sample stream; determining whether the precursor ion corresponds to a protein on an exclusion list; and

when the precursor ion does not correspond to a protein on the exclusion list:

analyzing, in real-time, the precursor ion spectrum to identify a first peptide corresponding to a first protein from which the precursor ion is derived;

incrementing a peptide count corresponding to the first protein; and

when the peptide count for the first protein reaches a predetermined threshold, adding the first protein to an exclusion list.

17. The method of claim 16, wherein the first protein corresponds to a first species, and wherein the controller is configured, through the programming on the computer-readable medium, to add to the exclusion list a plurality of proteins corresponding to the first species when the peptide count for the first protein reaches the predetermined threshold.

18. The method of claim 17, wherein the first species is at least one of a group comprising Homo sapiens, a human pathogen, a virus, a fungus, a bacterium, a plant, a cow, a pig, a chicken, and a fish.

19. The method of claim 17, further comprising:

analyzing, in real-time, a second precursor ion spectrum for a second precursor ion to determine a second peptide corresponding to a second protein from which the second precursor ion is derived;

identifying the second protein corresponding to the second peptide and incrementing a peptide count corresponding to the second protein; and

when the peptide count for the second protein reaches the predetermined threshold, adding the second protein to the exclusion list.

20. A computer program product, comprising:

a non-transitory computer-readable medium comprising instructions for performing the steps of: directing a mass spectrometer to acquire a precursor ion spectrum of a sample stream;

determining whether the precursor ion corresponds to a protein on an exclusion list; and

when the precursor ion does not correspond to a protein on the exclusion list:

analyzing, in real-time, the precursor ion spectrum to identify a first peptide corresponding to a first protein from which the precursor ion is derived;

incrementing a peptide count corresponding to the first protein; and when the peptide count for the first protein reaches a predetermined threshold, adding the first protein to an exclusion list.

21. The computer program product of claim 20, wherein the first protein corresponds to a first species, and wherein the controller is configured, through the programming on the computer-readable medium, to add to the exclusion list a plurality of proteins corresponding to the first species when the peptide count for the first protein reaches the predetermined threshold.

22. The computer program product of claim 21, wherein the first species is at least one of a group comprising Homo sapiens, a human pathogen, a virus, a fungus, a bacterium, a plant, a cow, a pig, a chicken, and a fish.

Description:

DESCRIPTION

METHODS AND APPARATUSES INVOLVING MASS SPECTROMETRY TO IDENTIFY PROTEINS IN A SAMPLE

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application claims the benefit of priority to U.S. Provisional Patent Application Serial No. 61/755,101, filed January 22, 2013 and U.S. Provisional Patent Application Serial No. 61/908,600, filed November 25, 2013, both of which are hereby incorporated by reference in their entirety.

BACKGROUND OF THE INVENTION

I. FIELD OF THE INVENTION

[0002] The invention is in the field of medicine; more specifically, it is in the fields of mass spectroscopy and pathology. Methods and systems involve a mass spectrometer to detect foreign protein(s) in a subject, such as pathogens that cause or are related to a disease or condition.

II. DESCRIPTION OF THE RELATED ART

[0003] Rapid detection of bacterial pathogens from biologic specimens presents considerable challenges to clinicians, as highly abundant host biomolecules and other normal flora obscure microbial signatures. Once identified, treatment of bacterial infections is challenging due to the widening array of antibiotic resistance that has emerged. Several days may lapse while bacteria and their resistance patterns are identified using traditional culture-based techniques. A method that offers high selectivity and dynamic range that can rapidly characterize the microbial diversity in clinical samples is desired.

[0004] For comprehensive analysis of biological samples it is preferred to have the ability to fully characterize the constituent proteins. Proteomic analysis of complex mixtures is most often accomplished through chromatographic separation followed by tandem mass spectrometry (LC-MS/MS). Surprisingly, even using state-of-the-art instrumentation and methods, current approaches only yield tentative identifications of a few hundred proteins per run, far short of the expected thousands of proteins. The net effect is that conventional analysis of a complex sample leads to an inadequate list of proteins, many of which are identified with low certainty.

SUMMARY OF THE INVENTION

[0005] Embodiments concern systems, components, programs, computer readable medium, and methods involving mass spectroscopy. Certain embodiments involve ways to use or control a mass spectroscopy system.

[0006] In some embodiments, there is a mass spectrometry system comprising: (a) a mass spectrometer; and (b) a controller connected to the mass spectrometer including a computer readable medium on which programming is encoded, configured to: (i) direct the mass spectrometer to acquire a precursor ion spectrum of a sample stream; (ii) analyze, in real-time, the precursor ion spectrum to determine whether the precursor ion is derived from a first species; and (iii) if the precursor ion is derived from the first species, direct the mass spectrometer to not analyze additional precursor ions corresponding to the first species. There is also a system, comprising:

[0007] a mass spectrometer; and

[0008] a controller connected to the mass spectrometer including a computer-readable medium on which programming is encoded, configured to:

[0009] direct the mass spectrometer to acquire a precursor ion spectrum of a sample stream;

[0010] determine whether the precursor ion corresponds to a protein on an exclusion list; and

[0011] when the precursor ion does not correspond to a protein on the exclusion list:

[0012] analyze the precursor ion spectrum to identify a first peptide corresponding to a first protein from which the precursor ion is derived;

[0013] increment a peptide count corresponding to the first protein; and

[0014] when the peptide count for the first protein reaches a predetermined threshold, add the first protein to an exclusion list.
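
The counting-and-exclusion steps listed above can be sketched in a few lines. This is a minimal illustration and not the applicants' implementation: the class name `ExclusionController`, the threshold value of 2, and the dictionary `peptide_to_protein` (a stand-in for real-time spectrum analysis and database lookup) are all assumptions introduced here.

```python
from collections import defaultdict


class ExclusionController:
    """Counts identified peptides per protein and excludes a protein at a threshold."""

    def __init__(self, threshold=2):
        # Assumed value; the text only says "predetermined threshold".
        self.threshold = threshold
        self.peptide_counts = defaultdict(int)
        self.exclusion_list = set()

    def process(self, peptide, peptide_to_protein):
        """Handle one identified peptide and report the action taken."""
        # Hypothetical lookup standing in for peptide-to-protein identification.
        protein = peptide_to_protein.get(peptide)
        if protein is None:
            return "unidentified"
        if protein in self.exclusion_list:
            return "skipped"      # protein already excluded: do not analyze further
        self.peptide_counts[protein] += 1
        if self.peptide_counts[protein] >= self.threshold:
            self.exclusion_list.add(protein)
            return "excluded"     # threshold reached on this peptide
        return "counted"
```

With a threshold of 2, the second peptide matched to a protein places that protein on the exclusion list, and later peptides from the same protein are skipped.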

[0015] In certain embodiments, the first protein corresponds to a first species, and the controller is configured, through the programming on the computer readable medium, to add to the exclusion list a plurality of proteins corresponding to the first species when the peptide count for the first protein reaches the predetermined threshold. In particular embodiments, the first species is Homo sapiens. In other embodiments, the first species is a mammal, such as a cow, pig, sheep, horse, bull, or monkey or any other commercially used mammal. In other embodiments, the first species is a bird, such as a chicken, duck, goose, or other bird consumed by humans or populating urban areas. In further embodiments the first species is a plant or tree (or a product of the plant or the tree), including but not limited to, corn, soybean, wheat, rye, barley, sugarcane, or sorghum; or vegetables such as spinach, asparagus, broccoli, carrots, cauliflower, lettuce, cabbage, green onions, squash, alfalfa sprouts, brussels sprouts; or fruits such as oranges, apples, strawberries, raspberries, blueberries, grapes, cantaloupe, honeydew, watermelon, apricots, plums, peaches, nectarines; or legumes. In other embodiments, the first species is a fish or other aquatic organism, including fish or other aquatic organisms consumed by humans, or such organisms consumed by marine animals consumed by humans. In some embodiments, the first species is a human pathogen. In some cases, the first species is a virus, fungus or bacterium.

[0016] In some embodiments, the controller of the system is further configured to:

[0017] (iv) analyze, in real-time, the precursor ion spectrum to determine whether the precursor ion is derived from a second species; and

[0018] (v) if the precursor ion is derived from the second species, direct the mass spectrometer to not analyze additional precursor ions corresponding to the second species. In some embodiments, the controller is further configured to: analyze, in real-time, a second precursor ion spectrum for a second precursor ion to determine a second peptide corresponding to a second protein from which the second precursor ion is derived; identify the second protein corresponding to the second peptide and increment a peptide count corresponding to the second protein; and add the second protein to the exclusion list when the peptide count for the second protein reaches the predetermined threshold.

[0019] In some cases, the controller is further configured to: (iv) analyze, in real-time, the precursor ion spectrum to determine whether the precursor ion is derived from any one of a preselected plurality of species; and (v) if the precursor ion is derived from the preselected plurality of species, direct the mass spectrometer to not analyze additional precursor ions corresponding to the preselected plurality of species. Not analyzing additional precursor ions may involve excluding or ignoring the additional precursor ions or, in some embodiments, subtracting out such analysis.

[0020] In some embodiments, the controller is further configured to (iv) add the first species to a database, such as in a separate field or as a specific designation in the database.

[0021] In additional embodiments, the system may also include a database of masses of peptides derived from the first species. In further embodiments the controller is further configured to add to the exclusion list the masses of peptides derived from the first species when the peptide count for at least one of the masses of peptides reaches the predetermined threshold. In certain embodiments, the controller is further configured to identify one or more microbes.

[0022] In additional embodiments, the controller is further configured to, when the peptide count reaches the predetermined threshold and the first protein corresponds to a set of proteins forming a molecular pathway, add to the exclusion list at least one additional protein forming the same molecular pathway as the first protein. In some cases, the molecular pathway is a cancer pathway or an inflammation pathway, or it may be a set of proteins corresponding to a particular disease, condition, or development stage. It may also relate to comparing proteomes before and after a particular event, including but not limited to implementation of a therapy or treatment, exposure to a chemical compound, or other change in physical environment. In some embodiments, the molecular pathway is any predetermined set of interacting proteins. In additional embodiments, the molecular pathway is any set of proteins determined through lookup of static databases. It is contemplated that in some cases, the molecular pathway is any set of proteins determined through real-time lookup of databases, either locally or over the Internet.
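
As an illustration of the pathway-based exclusion just described, the sketch below adds every co-member of a pathway once one member is excluded. The text requires only "at least one additional protein", so excluding all members is one possible choice, assumed here. The function name, the pathway table, and the protein labels P1 to P5 are hypothetical placeholders, and the table stands in for a static or real-time database lookup.

```python
def exclude_with_pathway(protein, pathways, exclusion_list):
    """Add `protein`, plus all proteins sharing a pathway with it, to the exclusion list."""
    exclusion_list.add(protein)
    for members in pathways.values():
        if protein in members:
            # Assumed policy: exclude every member of each matching pathway.
            exclusion_list.update(members)
    return exclusion_list


# Hypothetical pathway table with placeholder protein labels.
PATHWAYS = {
    "pathway_A": {"P1", "P2", "P3"},
    "pathway_B": {"P4", "P5"},
}

excluded = exclude_with_pathway("P2", PATHWAYS, set())
```

Here `excluded` contains P1, P2 and P3, while the unrelated pathway's proteins remain eligible for analysis.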

[0023] In particular embodiments, the step of analyzing the precursor ion spectrum is performed in real-time by the system.

[0024] Additional embodiments include a mass spectrometry system comprising: (a) a mass spectrometer; and (b) a controller connected to the mass spectrometer including a computer readable medium on which programming is encoded, configured to acquire a first set of masses, and configured to not acquire a second set of masses, wherein the second set of masses includes the masses of at least 1,000 proteins of the proteome of a species. In some embodiments, the second set of masses includes the masses of at least or at most five thousand, ten thousand, twenty thousand or one hundred thousand proteins of the proteome of the species, or any range derivable therein. In certain embodiments the proteome is specifically a human proteome.

[0025] In other embodiments there is a mass spectrometry system comprising: (a) a mass spectrometer; and (b) a controller connected to the mass spectrometer including a computer readable medium on which programming is encoded, configured to: (i) direct the mass spectrometer to acquire a precursor ion spectrum of a sample stream; (ii) analyze, in real-time, the precursor ion spectrum to determine whether the precursor ion is derived from a first species; and (iii) if the precursor ion is derived from a first protein selected from a set of proteins forming a molecular pathway, direct the mass spectrometer to not analyze subsequent additional precursor ions unless those subsequent additional precursor ions are also found in the set of proteins forming the same molecular pathway. In specific embodiments, the molecular pathway is a cancer pathway or an inflammation pathway, or any other pathway described herein.

[0026] In particular embodiments there is a mass spectrometry system comprising: (a) a mass spectrometer; and (b) a controller connected to the mass spectrometer including a computer readable medium on which programming is encoded, configured to acquire a first set of masses, and configured to not acquire a second set of masses, wherein the second set of masses includes the masses of at least 1,000 peptides derived from the digestion of proteins of the proteome of a species. In some cases, the second set of masses includes the masses of at least 5,000, 10,000, 50,000, 100,000, 500,000, 1,000,000 or 2,000,000 peptides obtained or derived from the digestion of proteins of the proteome of the species. In some embodiments, the second set of masses each has a width of, of at least, or of at most 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 ppm (or any range derivable therein), each centered on the mass of the peptide derived from the digestion of proteins of the proteome of the species.

[0027] Embodiments also concern methods of using the system described herein. In some embodiments, there are methods for controlling a mass spectrometer, comprising: (i) directing the mass spectrometer to acquire a precursor ion spectrum of a sample stream; (ii) analyzing, in real-time, the precursor ion spectrum to determine whether the precursor ion is derived from a first species; and (iii) if the precursor ion is derived from the first species, directing the mass spectrometer to not analyze additional precursor ions corresponding to the first species. Other methods include controlling a mass spectrometer, comprising: a) directing a mass spectrometer to acquire a precursor ion spectrum of a sample stream; b) determining whether the precursor ion corresponds to a protein on an exclusion list; and c) when the precursor ion does not correspond to a protein on the exclusion list, i) analyzing in real-time the precursor ion spectrum to identify a first peptide corresponding to a first protein from which the precursor ion is derived; ii) incrementing a peptide count corresponding to the first protein; and iii) adding the first protein to an exclusion list when the peptide count for the first protein reaches a predetermined threshold. In some embodiments, the first protein corresponds to a first species, and the controller is configured, through the programming on the computer-readable medium, to add to the exclusion list a plurality of proteins corresponding to the first species when the peptide count for the first protein reaches the predetermined threshold. In other embodiments, the first species is at least one of a group comprising Homo sapiens, a mammal, a human pathogen, a virus, a fungus, a bacterium, a plant, a cow, a pig, a chicken, and a fish, or any other species described herein.
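
The ppm-wide mass windows of paragraph [0026] can be illustrated with a short calculation. The sketch below assumes that the stated width is the full width of the window, so a 10 ppm window extends 5 ppm to either side of the peptide mass. The function names and the example masses are hypothetical.

```python
def ppm_window(mass, width_ppm):
    """Return (low, high) for a window `width_ppm` wide, centered on `mass`."""
    half_width = mass * width_ppm / 2 / 1e6
    return (mass - half_width, mass + half_width)


def is_excluded(observed_mass, excluded_masses, width_ppm=10):
    """True if the observed mass falls inside any excluded mass window."""
    for mass in excluded_masses:
        low, high = ppm_window(mass, width_ppm)
        if low <= observed_mass <= high:
            return True
    return False
```

For a peptide mass of 1000.0 Da and a 10 ppm window, the half-width is 0.005 Da, so an observed mass of 1000.004 Da would be excluded while 1000.006 Da would not.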

[0028] In some embodiments, methods include analyzing, in real-time, a second precursor ion spectrum for a second precursor ion to determine a second peptide corresponding to a second protein from which the second precursor ion is derived; identifying the second protein corresponding to the second peptide and incrementing a peptide count corresponding to the second protein; and when the peptide count for the second protein reaches the predetermined threshold, adding the second protein to the exclusion list.

[0029] There are other methods for acquiring data on a mass spectrometer, comprising: (a) acquiring a first set of masses, while not acquiring a second set of masses, wherein the second set of masses includes the masses of at least 1,000 proteins of the proteome of a species. Additional methods include methods for identifying one or more microbes in a biological sample comprising subjecting a biological sample to a mass spectrometry system and identifying one or more microbes in the biological sample by mass spectroscopy. It is specifically contemplated that a pathological microbe may be identified.

[0030] In some embodiments, there is a method for acquiring data on a mass spectrometer, comprising: (a) directing the mass spectrometer to acquire a precursor ion spectrum of a sample stream; (b) analyzing, in real-time, the precursor ion spectrum to determine whether the precursor ion is derived from a first species; and (c) if the precursor ion is derived from a first protein selected from a set of proteins forming a molecular pathway, directing the mass spectrometer to not analyze subsequent additional precursor ions unless those subsequent additional precursor ions are also found in the set of proteins forming the same molecular pathway.

[0031] In certain embodiments, a proteome may be defined not with respect to a subject's entire proteome without any qualification, but with respect to a subject's proteome in a specific context, such as one qualified by, for example, the biological sample or location in the body; age or development of the subject; or condition or health of the subject. For instance, the presence of bacteria may be foreign/pathogenic on one part of the body, but not another. A similar concept may apply with respect to age or gender, or whether the subject has a particular health condition or disease. Methods may include a step of obtaining a biological sample from a subject. Methods may include taking a swab, swatch, swipe or other type of sample that may contain biological material from a subject, surface, inanimate object, composition, liquid, semi-solid, plant, animal, fruit, organism, or other object to be tested.

[0032] Embodiments also include a computer readable medium encoded with instructions which, when loaded on at least one computer, establish processes for the method described herein. In some embodiments there is a computer program product, comprising: a non-transitory computer-readable medium comprising instructions for performing the steps of: directing a mass spectrometer to acquire a precursor ion spectrum of a sample stream; determining whether the precursor ion corresponds to a protein on an exclusion list; and, when the precursor ion does not correspond to a protein on the exclusion list: analyzing, in real-time, the precursor ion spectrum to identify a first peptide corresponding to a first protein from which the precursor ion is derived; incrementing a peptide count corresponding to the first protein; and, when the peptide count for the first protein reaches a predetermined threshold, adding the first protein to an exclusion list. In some embodiments, the first protein corresponds to a first species, and the controller is configured, through the programming on the computer-readable medium, to add to the exclusion list a plurality of proteins corresponding to the first species when the peptide count for the first protein reaches the predetermined threshold. In specific embodiments, the first species is at least one of a group comprising Homo sapiens, a human pathogen, a virus, a fungus, a bacterium, a plant, a cow, a pig, a chicken, and a fish.

[0033] In some embodiments, there is at least one computer readable medium encoded with instructions which when loaded on at least one computer, establish processes for the method described herein.

[0034] Any feature discussed in the context of one embodiment herein is contemplated for use in any other embodiment discussed herein. Accordingly, compositions discussed in the disclosure may be used in any method discussed in the disclosure, and vice versa.

BRIEF DESCRIPTION OF THE DRAWINGS

[0035] The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present invention. The invention may be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein.

[0036] FIG. 1 - Comparison of fragmentation spectra from the non-labeled and isotope-labeled form of a peptide allows the identification of non-shifting (b) and shifting (y) fragmentation ions.

[0037] FIG. 2a-2b - A. Comparison of all peptides from a Mascot search (grey) and peptides filtered by Validator (black). Validator successfully filters out most low-scoring peptides while retaining those with high scores. B. ROC curve demonstrating the low sensitivity and specificity throughout the range of cutoff scores for all Mascot queries (stars) and the improved sensitivity and specificity after filtering with Validator (triangles).

[0038] FIG. 3a-3b - A. Validator iterates through the list of monoisotopic masses generated by Decon2LS. When an isotopic pair is found, Validator compares the two fragmentation spectra to identify non-shifting (b) and shifting (y) ions. The tryptic database is queried and a set of candidate peptides is extracted (blue bar). The fragmentation spectrum of each candidate peptide is calculated and compared to the b and y ions generated by Validator (purple). The sum of the number of b ion and y ion matches is the peptide score, and the peptide with the highest score is the "winner" and likely correct peptide identity. B. Example from the Mann dataset, showing [1] the identification of an isotopic pair, [2] peptide selection from the tryptic database, [3] Validator-based deconvolution of the fragmentation spectra, and [4] scoring and [5] selection of the winning peptide. The identity of this peptide was verified in the original dataset and in an independent database search. This figure also depicts SEQ ID NOs: 1-15.

[0039] FIG. 4A-4B - MS/MS data from LTQ ion trap (Orbi-LTQ, A) and Orbitrap (Orbi-Orbi, B) from the ¹⁸O "monoisotopic" peak of LFVGGIKEDTEEHHLR. Higher resolution and mass accuracy in Orbi-Orbi data allow much more confident matching to theoretical spectra (Table 1). In turn, fewer noise peaks appear in the Orbi-Orbi spectra, corresponding to significantly higher signal to noise.

[0040] FIG. 5A-5C - Simulation of a tandem MS analysis of a sample with 5000 proteins. A. The standard "top 5" approach is employed in which the top five most abundant peaks are chosen for fragmentation. B. The method of Dynamic Exclusion, as it is employed by Thermo Scientific. As each peptide is fragmented, the mass of the peptide is added to an exclusion list, preventing a similar peak from being fragmented for 180 seconds. C. When intelligent, protein-based mass exclusion is applied, many more proteins are identified throughout the run. In addition, the abundance of the peptides chosen for fragmentation declines dramatically and represents the entire dynamic range. Black line = proteins identified, blue dots = peptide abundance.
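
The time-based Dynamic Exclusion of panel B can be sketched as a table of expiry times keyed by mass. This is an illustration under stated assumptions rather than the vendor's implementation: the class and method names are hypothetical, masses are matched exactly instead of within a tolerance, and time is passed in as seconds since the start of the run.

```python
class DynamicExclusion:
    """Blocks re-fragmentation of a mass for a fixed time after it is fragmented."""

    def __init__(self, duration_s=180.0):
        self.duration_s = duration_s   # 180 seconds, as in the figure description
        self._expiry = {}              # mass -> time at which its exclusion ends

    def try_fragment(self, mass, now_s):
        """Return True if `mass` may be fragmented at `now_s`; if so, start its exclusion."""
        if self._expiry.get(mass, float("-inf")) > now_s:
            return False               # still inside the exclusion period
        self._expiry[mass] = now_s + self.duration_s
        return True
```

Unlike the protein-based exclusion of panel C, this scheme knows nothing about which protein a peptide came from, so other peptides of an already identified protein remain eligible for fragmentation.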

[0041] FIG. 6 - Scheme for trypsin-mediated carboxyl terminal oxygen exchange as a route to isotopic labeling of tryptic fragments. Incubation in H₂¹⁸O water and trypsin results in sequential exchange of two ¹⁸O atoms at the carboxyl terminal of each peptide, yielding a 4.008 Da shift between ¹⁶O and ¹⁸O forms.
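
The 4.008 Da figure follows from the isotope masses. As a worked check, using the standard monoisotopic masses of the two oxygen isotopes (values supplied here, not taken from the application) and writing z for the charge state of the ion:

```latex
\begin{align*}
\Delta m &= 2\,\bigl[m(^{18}\mathrm{O}) - m(^{16}\mathrm{O})\bigr] \\
         &\approx 2\,(17.99916 - 15.99491)\ \mathrm{Da} \\
         &\approx 4.0085\ \mathrm{Da}, \\
\Delta(m/z) &= \frac{\Delta m}{z}.
\end{align*}
```

For a doubly charged peptide, for example, the light and heavy forms would appear about 2.004 m/z units apart.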

[0042] FIG. 7 - Isotopic envelopes of stable isotope-labeled peptides. LC-MS/MS analysis of 1:1 mixture of SILAC labeled yeast peptides. Note the characteristic doublet of isotopic envelopes. ¹⁶O and ¹⁸O peptides would look identical except for a smaller separation in mass (8.02 Da in this SILAC experiment, 4.008 in ¹⁸O labeling).

[0043] FIG. 8A-8B - Automated workflow incorporating Validator, Identifier and Quantitator. A. Validator iterates through the spectra, finds isotopic pairs and compares the two ("light" and "heavy") fragmentation spectra to identify non-shifting (b) and shifting (y) ions. Identifier queries a tryptic database and extracts a set of candidate peptides (list). The theoretical fragmentation spectrum of each candidate peptide is compared to the non-shifting (b) and shifting (y) ions identified by Validator (purple). A match score is calculated and the highest score is the "winner" if it is statistically higher than a random population. This figure also shows SEQ ID NOs. 1-13. B. Quantitator identifies the chromatographic elution profile of each peptide, compares the theoretical isotopic pattern to the experimental spectra (based on average peptide mass for the size of the peptide or based on sequence) and identifies the most informative spectra for high quality quantitation. The first spectrum that passes the fit score is labeled with a star. The monoisotopic mass of that "light" peptide is either selected for fragmentation or excluded if already identified.

[0044] FIG. 9 - Phylogenic mapping of tryptic peptides. 15 bacterial species/strains were selected from NIAID "High Potential of Bioengineering" and trypsin digested in silico. Values associated with species/strains represent the number of peptide masses unique to that species/strain. Values at branching points represent the number of peptides shared exclusively by species/strains upstream of the branching point. Peptides exclusively shared by polyphyletic and paraphyletic groups are not represented here (except for Brucella strains, see Venn diagram in top left corner).
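
An in silico tryptic digest of the kind used for FIG. 9 can be sketched as follows. The sketch applies the common trypsin rule (cleave after K or R unless the next residue is P), allows no missed cleavages, and compares peptide sequences rather than peptide masses, all of which are simplifying assumptions. The function names and the toy sequences are hypothetical.

```python
import re


def tryptic_digest(sequence):
    """Cleave after K or R, except before P; no missed cleavages."""
    # Zero-width split points: preceded by K/R and not followed by P.
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", sequence) if pep]


def unique_peptides(proteomes):
    """Map each species to the peptides that occur in no other species."""
    digests = {
        species: {pep for seq in seqs for pep in tryptic_digest(seq)}
        for species, seqs in proteomes.items()
    }
    unique = {}
    for species, peptides in digests.items():
        others = set()
        for other, other_peptides in digests.items():
            if other != species:
                others |= other_peptides
        unique[species] = peptides - others
    return unique


# Toy proteomes with made-up sequences, for illustration only.
toy = {"species_A": ["MKTAYR", "GGK"], "species_B": ["MKLLR", "GGK"]}
result = unique_peptides(toy)
```

In this toy case the peptides MK and GGK are shared, so only TAYR is unique to the first species and only LLR to the second.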

[0045] FIG. 10A-10C - Simulation of a tandem MS analysis of a sample with 5000 proteins. A. The standard "top 5" approach is employed in which the top five most abundant peaks are chosen for fragmentation. B. Dynamic Exclusion. As each ion is fragmented, the mass of the peptide is added to an exclusion list, preventing an ion with the same mass from being fragmented for 180 seconds. C. When dynamic, protein-based mass exclusion is applied, many more proteins are identified throughout the run. In addition, the abundance of the peptides chosen for fragmentation decreases dramatically and represents the entire dynamic range. Black line = proteins remaining un-identified, blue dots = peptide abundance.

[0046] FIG. 11 - Intelligent Inclusion of proteins. By looking at bacterial (target) peptide matches to the human (background) proteome, we can identify target peptides of interest that a) are only present in the target proteome (unique peptides) or b) have higher detectability than the matching background peptide (inclusive peptides). Unique and inclusive peptides together make up an intelligent inclusion list.

[0047] FIG. 12 - Genes connected to subsystems and their categories (43).

[0048] FIG. 13 - Scheme for trypsin-mediated carboxyl terminal oxygen exchange as a route to isotopic labeling of tryptic fragments. Incubation in H₂¹⁸O water and trypsin results in sequential exchange of two ¹⁸O atoms at the carboxyl terminal of each peptide, yielding a 4.008 Da shift between ¹⁶O and ¹⁸O forms.

[0049] FIG. 14 - Isotopic envelopes of stable isotope-labeled peptides. LC-MS/MS analysis of 1:1 mixture of SILAC labeled yeast peptides. Note the characteristic doublet of isotopic envelopes. ¹⁶O and ¹⁸O peptides would look identical except for a smaller separation in mass (8.02 Da in this SILAC experiment, 4.008 in ¹⁸O labeling).

[0050] FIG. 15A-B - Automated workflow incorporating Validator, Identifier and Quantitator. A. Validator iterates through the spectra, finds isotopic pairs and compares the two ("light" and "heavy") fragmentation spectra to identify non-shifting (b) and shifting (y) ions. Identifier queries a tryptic database and extracts a set of candidate peptides (list). The theoretical fragmentation spectrum of each candidate peptide is compared to the non-shifting (b) and shifting (y) ions identified by Validator (purple). A match score is calculated and the highest score is the "winner" if it is statistically higher than a random population. This figure also depicts SEQ ID NOs: 1-15. B. Quantitator identifies the chromatographic elution profile of each peptide, compares the theoretical isotopic pattern to the experimental spectra (based on average peptide mass for the size of the peptide or based on sequence) and identifies the most informative spectra for high quality quantitation. The first spectrum that passes the fit score is labeled with a star. The monoisotopic mass of that "light" peptide is either selected for fragmentation or excluded if already identified.

[0051] FIG. 16A-16C - Simulation of a tandem MS analysis of a sample with

[0052] FIG. 17 - shows a standard proteomics analysis workflow. The standard proteomics workflow is diagrammed in Galaxy. The tools have been "wrapped" and instantiated into the publicly-available Galaxy instance. When fully functional, users will be able to use Globus Online to upload content and then analyze it using a variety of proteomics workflows. Results can then be automatically downloaded to the user.

[0053] FIG. 18 - This diagram illustrates how unique bacterial peptides are. Values associated with species/strains represent the number of peptides unique to that species/strain. Values at branching point represent the number of peptides shared exclusively by species/strains upstream of the branching point. Peptides exclusively shared by polyphyletic and paraphyletic groups are not represented here (except for Brucella strains).

[0054] FIG. 19 - This figure shows how the uniqueness facilitates Whole Proteome Exclusion.

DETAILED DESCRIPTION

[0055] PART 1

[0056] The current shortcomings of LC-MS/MS proteomics cannot be assigned to any single flaw, but represent the combined effects of multiple constraints. An obvious limitation is set by choice of mass spectrometer, which defines the limits on precision, resolution, sensitivity and speed. Nonetheless, both data acquisition and data processing may be more significant weaknesses. In turn, each offers potential for significant improvements. Two opportunities for major advances in performance are to improve the algorithms that select peptide ions for fragmentation and that match the resulting fragment spectra to candidate peptides as a route to their identification. Both challenges can be addressed with a coordinated computational approach, yielding a next-generation LC-MS/MS approach allowing rapid, comprehensive and quantitative analysis of complex samples.

[0057] The prevailing method for peptide and protein identification requires extensive searches through databases of theoretical mass spectra to obtain statistically significant matches. This approach is slow and error-prone, suffers from considerable false positive and false negative rates, and requires an experienced investigator to manually validate each identification. Most importantly, in its current implementation, the method does not take advantage of the remarkable mass accuracy afforded by contemporary instruments. Here, we describe a novel approach that streamlines peptide identification while enhancing sensitivity and specificity of protein detection.

[0058] Second, we address the current limitations in instrument runtime control. Standard LC-MS/MS data acquisition methods sequentially select the most intense peptide ions for fragmentation, without regard to whether the peptide might be anticipated by others already analyzed. This strategy is poorly suited to analysis of complex biological mixtures, where dozens of peptides derive from each protein component and protein abundance varies over several orders of magnitude. Here, we describe using on-the-fly analysis of peptides and proteins to skew subsequent selection of peptide ions and dramatically increase dynamic range.

[0059] By combining these two approaches, peptides and proteins can be identified in real-time, and these data can be exploited to direct data acquisition towards non-redundant and lower-abundance peptides, resulting in the confident identification of thousands of proteins during an individual run.

[0060] A tool to rapidly and confidently identify peptides and their parent proteins from high mass accuracy raw LC-MS/MS data.

[0061] Software to deconvolute raw mass spectra. Validator^RAW - mines and deconvolutes raw, unsearched MS data, identifying b and y ions.

[0062] Software to identify peptides directly from raw MS data. Identifier software - determines peptide sequence given an accurate precursor mass and corresponding b and y ions.

[0063] Software to include peptides with post-translational modifications. Enhanced Identifier algorithm - includes peptides that contain a wide array of post-translational modifications.

[0064] Enhanced Validator^RAW and Identifier for a label-free system. A system that does not require isotope label-dependent spectrum deconvolution, capable of analysis of fragmentation data directly with high mass accuracy to ascertain peptide sequence.

[0065] Intelligent, knowledge-based tools for rapid, data-dependent protein identification.

[0066] Query protein interaction databases to augment the assignment of peptides to parent proteins. Protein interaction databases are queried to better predict protein identity from the constituent peptides.

[0067] Protein-protein interaction analysis to validate ProteinMiner algorithm. Using cell line-based systems, ProteinMiner enhances protein identification.

[0068] LC-MS/MS control software.

[0069] Implements peptide/protein identification in a real-time, data-dependent fashion. Validator^RAW / Identifier can identify peptides in real-time, as the data are collected from the MS.

[0070] Algorithms for intelligent data-dependent protein-based peptide inclusion and exclusion lists. Software to utilize real-time peptide identification to influence the peaks picked for fragmentation, thus vastly increasing the dynamic range of protein identification.

[0071] Implement algorithms on an available LC-MS/MS hardware platform. Using available hardware, software modifications are integrated into the MS, enhancing the dynamic range.

[0072] RAPID AND COMPREHENSIVE PROTEOMICS The ongoing development of proteomics has been greatly advanced by the continuing improvements in database search, where MS/MS spectra are compared to those predicted from any peptide that could be present, based on organism and sample preparation (Rappsilber, et al, 2002). The resulting list of confident peptide ID's is then used to obtain a list of protein ID's. The current automated pattern-matching algorithms (e.g., Mascot, SEQUEST, OMSSA, X!Tandem, etc.) (Rappsilber, et al, 2002; Aebersold & Mann, 2003; Craig, 2004; Fenyo, et al, 2003; Frank, et al, 2007; Gras & Muller, 2001; Hunt, et al, 1986; Lin, 2003; Nesvizhskii, et al, 2007; Perkins, et al., 1999; Yates, et al., 1995) determine a score for each comparison and pick a best match, but how scores are assigned and interpreted distinguishes them (Deutsch, et al, 2008), giving rise to a literature on statistical methods to evaluate significance of match scores (Gras, et al, 2001). Despite these recent advances, analyzing mass spectrometry data remains slow and computer-processor intensive. One approach to streamlining database search is to apply parallel computing and other distributed systems (Bogdan, et al, 2008; Duncan, et al, 2005; Wang, et al, 2010). Exploiting prior results to speed identification via "peptide recognition" can improve performance considerably. In Accurate Mass and Time (AMT) analysis, the range of potential matches is limited: database search is replaced with a list of previously identified, high-confidence peptides that can be recognized by a characteristic elution time and parent ion mass (Liu, et al., 2007; Pasa-Tolic, et al., 2004; Smith, et al., 2002). Nonetheless, database search remains highly susceptible to both over-reporting false positives (low specificity) and under-reporting true positives (low sensitivity), requiring extensive manual validation.

[0073] A more fundamental limitation not addressed in any of these methods is that MS data are highly biased by the "system software" that controls the LC-MS/MS run. The critical step of selecting ions for fragmentation is a common weak point. Typically, the spectrometer is programmed to select the most intense ions in the current MS spectrum for fragmentation. As a result, during a run, a high proportion of the selected peptides will derive from a small number of highly abundant proteins and, even then, the most highly abundant peptides may be subjected to repeated fragmentations. Thus, irrespective of the pattern-matching method to be used, the proteins identified with highest confidence will inevitably be dominated by those found in the highest abundance. Medium and low-abundance proteins will go unidentified, despite having yielded multiple detectable ions in the MS spectrum, because of a failure to select any of their peptides for fragmentation. Thus, the technologies described herein for the rapid and accurate identification of peptides and proteins from raw MS data and control of data acquisition represent a disruptive change. No longer will the mass spectrometer be considered a tool for expert users only, requiring manual data analysis and/or extensive sample handling to simplify complex samples into "manageable" sub-proteomes. Confident reproducible identification of thousands of proteins during a single LC-MS/MS run significantly increases the value of MS as a tool for basic research and enables it for use in the clinic. Examples of clinical applications include, but are not limited to (1) detection of biomarkers from patient samples to stratify malignancies and thus aid in treatment decisions, (2) monitoring disease progression and therapy efficacy, and (3) the development of better therapeutics and improvements in disease outcomes.

[0074] IN-DEPTH PROTEOMICS Because the data-dependent MS/MS approach selects the most abundant ions for fragmentation, the results are skewed towards identification of abundant proteins. In fact, it is not uncommon for dozens of peptides from a single abundant protein to be identified. Thus, the mass spectrometer spends time identifying the same protein over and over again at the expense of missing low-abundance peptides/proteins. Being able to control the MS during the run, perform rapid peptide and protein identification, and dictate which ions are selected for fragmentation based on the current list of identified proteins significantly improves dynamic range. Previously, the computation required to perform analyses was far too slow for real-time implementation. Therefore, analysis could only be achieved off-line, after data had been collected from the mass spectrometer. Advances in processing power have resulted in several orders of magnitude improvement in computing speed, making real-time analysis of MS data feasible. Real-time protein decision-based mass spectrometry can be implemented on, for example, the Waters SYNAPT G2 Q-TOF and using the Real Time Databank Searching (RTDS) platform. Identifier, a peptide and protein identification tool, can be used to perform on-the-fly protein identification. Proteins are compiled into lists and used to dictate which ions are selected for or excluded from fragmentation. The software facilitates the confident identification of thousands of proteins from an unknown complex sample in real time, representing an order of magnitude improvement in the speed and accuracy of mass spectrometry-based proteomics. This enables the implementation of mass spectrometry for rapid and clinically useful proteomic analysis, permitting MS-based clinical decision-making within hours of sample collection. 
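The protein-based exclusion logic described above can be sketched as follows. This is a minimal illustration in Python, not the software of the embodiments: the class name ProteinExclusionList, the threshold of two peptides per protein, the 5 ppm matching tolerance and the "top N" selection are all assumptions chosen for the example.

```python
from collections import defaultdict


class ProteinExclusionList:
    """Hypothetical helper: count identified peptides per protein and, once a
    protein reaches the threshold, exclude all of its predicted peptide masses."""

    def __init__(self, protein_peptide_masses, threshold=2, ppm=5.0):
        # protein_peptide_masses: protein name -> list of predicted peptide masses
        self.protein_peptide_masses = protein_peptide_masses
        self.threshold = threshold
        self.ppm = ppm
        self.peptide_counts = defaultdict(int)
        self.excluded_masses = []

    def record_identification(self, protein):
        """Increment the peptide count; exclude the protein at the threshold."""
        self.peptide_counts[protein] += 1
        if self.peptide_counts[protein] == self.threshold:
            self.excluded_masses.extend(self.protein_peptide_masses[protein])

    def is_excluded(self, mass):
        """True if the mass matches any excluded peptide mass within ppm."""
        return any(abs(mass - m) / m * 1e6 <= self.ppm
                   for m in self.excluded_masses)

    def select_precursors(self, peaks, top_n=5):
        """peaks: list of (mass, intensity). Return the top_n most intense
        masses that are not on the exclusion list."""
        allowed = [p for p in peaks if not self.is_excluded(p[0])]
        allowed.sort(key=lambda p: p[1], reverse=True)
        return [mass for mass, _ in allowed[:top_n]]
```

In this sketch, once two peptides of an abundant protein have been identified, its remaining peptides are skipped and acquisition time shifts to lower-abundance ions.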
[0075] Weaknesses in database search, probability-based peptide and protein identification, and data acquisition compound each other, so that analysis of complex mixtures of proteins has been far from the fast, accurate, and deep investigation required for clinical proteomics and other future applications. Embodiments include software to deconvolute raw MS data (Validator^RAW) and determine peptide identifications (Identifier), using isotopic labeling, incorporating detection of post-translational modifications (PTMs), or from unlabeled samples. Embodiments include knowledge-dependent methods for identifying proteins from peptide data (ProteinMiner). Additional embodiments include software to enable confident real-time peptide and protein identification.

[0076] IDENTIFYING PEPTIDES AND THEIR PARENT PROTEINS FROM RAW LC-MS/MS DATA

[0077] Stable isotope labeling is a standard method for quantifying relative protein abundance (Ong & Mann, 2005) whereby carboxyl-terminal labeling results in mixtures of pairs of chemically identical but isotopically distinct peptides. Several strategies allow differential labeling with isotopic tags including ¹⁸O-labeling of peptides mediated by proteolytic oxygen exchange (Stewart, et al, 2001; Bonenfant, et al, 2003; Heller, et al, 2003; Miyagi, et al, 2007) and stable isotope labeling with ¹³C and ¹⁵N-labelled amino acids in cell culture (SILAC) (Ong, et al, 2003; Ong, et al, 2002; Amanchy, et al, 2005; Ong, et al, 2007; Mann, 2006). Unlabeled and stable isotope-labeled peptides co-elute as pairs during LC-MS/MS, yielding offset isotopic envelopes, typically 4-10 Da, in the MS¹ scan. Informatic analysis can then be used to compare the intensity of the isotopic forms to quantify relative abundance (e.g., Mason, et al, 2007; Wang, et al, 2006).

[0078] VALIDATOR CAN IMPROVE MASCOT SEARCH RESULTS In a typical isotopic labeling experiment, both "light" and "heavy" isotopologues of a peptide will often be selected for collision-induced dissociation (CID) fragmentation. Significantly, fragmentation patterns can be readily distinguished by the differential effect of the carboxyl-terminal label on resulting b and y ions (Scoble & Martin, 1990; Takao, et al., 1991). The C-terminal fragments (y ions) appear as light and heavy forms, while N-terminal fragments (b ions) display a single shared mass (Fig. 1). Validator exploits this pattern to improve peptide identification.
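The shifting/non-shifting pattern can be illustrated with a short Python sketch. It is not the Validator code: the function names, the default 1000 ppm tolerance and the assumption of singly charged fragments (so that the mass shift equals the label mass) are choices made for this example, and the intensity comparison is omitted.

```python
def within_ppm(observed, expected, ppm):
    """True if observed lies within +/- ppm of expected."""
    return abs(observed - expected) <= expected * ppm / 1e6


def classify_fragments(light, heavy, label_shift, ppm=1000.0):
    """Compare light and heavy fragment masses peak by peak.

    A light peak with a heavy partner at the same mass is called a b ion
    (non-shifting); one with a partner offset by label_shift is called a
    y ion (shifting). Returns (b_ions, y_ions) as lists of light masses.
    """
    b_ions, y_ions = [], []
    for light_mass in light:
        for heavy_mass in heavy:
            if within_ppm(heavy_mass, light_mass, ppm):
                b_ions.append(light_mass)
                break
            if within_ppm(heavy_mass, light_mass + label_shift, ppm):
                y_ions.append(light_mass)
                break
    return b_ions, y_ions


# Example with a 13C6,15N2-lysine label (8.01420 Da heavier than light lysine).
LYS8_SHIFT = 8.01420
b, y = classify_fragments([175.119, 300.155, 500.250],
                          [300.155, 183.133, 650.000],
                          LYS8_SHIFT)
```

Here the peak at 300.155 is shared by both spectra and is taken as a b ion, the peak at 175.119 reappears 8.014 Da higher and is taken as a y ion, and the peak at 500.250 has no partner and is left unassigned.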

[0079] The Mascot search engine attempts to assign each fragmented ion to a candidate peptide match, but the majority of matches are considered false positives. As such, only peptide ID's with a score of 30-40 are considered significant based on a 5% false discovery rate (FDR) threshold for high-confidence identification. The FDR at a given threshold is calculated as the quotient of the decoy peptides and target peptides identified with scores exceeding the threshold. Using a complex stable isotope-labeled yeast lysate sample, Mascot identified 17,200 peptides but only 2,308 (13.4%) had scores over 35 (Fig. 2A, grey bars). Validator (Volchenboum, et al, 2009) has been developed to mine Mascot data files, find pairs of light and heavy peptides, and deconvolute the fragmentation spectra to evaluate the validity of the Mascot ID. By comparing the Validator-identified b and y ions with the in silico fragmentation patterns of the Mascot-identified peptides, the software is able to corroborate almost all isotopic pairs for high-scoring peptides. For low-scoring peptides, comparing observed b and y ions with the predicted fragmentation pattern improved the Mascot score at which the FDR was 5% from 36 to 22, significantly increasing both sensitivity and specificity. The Validator software processed the 100 Mb Mascot .DAT file in less than five minutes, revealing high-confidence peptide identifications without regard to Mascot score, far faster than manual or other independent validation methods. Receiver operating characteristic curve (ROC) analysis of the full set of Mascot-searched data demonstrates poor sensitivity and specificity throughout (Fig. 2B, stars). When the Validator filtering algorithm is applied to the data (Fig. 2B, triangles), the ROC curve demonstrates a sensitivity of 80% and specificity of 89% at a threshold score of 35.

[0080] OBVIATING TRADITIONAL SEARCH A two-pronged approach overcomes the need for database search to identify peptides. First, an extension of the Validator algorithm works on unsearched data (Validator^RAW). Second, the accurate precursor mass can be used to interrogate a mass-sorted species-specific database of tryptic peptides to generate a list of candidates, from which a match is chosen based on the degree of similarity between the in silico fragmentation pattern and the unknown spectrum (Identifier). The algorithm can include functionality to identify post-translational modifications. Data collected with high mass accuracy in MS² allows for identification of b and y ions without requiring stable isotope labeling.
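The FDR calculation described above for Mascot score thresholds (decoy identifications divided by target identifications above the threshold) can be written out as a short Python sketch. The function names and the example scores are hypothetical and serve only to make the arithmetic concrete.

```python
def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """FDR = decoys above threshold / targets above threshold.

    Returns None when no target exceeds the threshold (FDR undefined).
    """
    targets = sum(1 for s in target_scores if s > threshold)
    decoys = sum(1 for s in decoy_scores if s > threshold)
    if targets == 0:
        return None
    return decoys / targets


def lowest_threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.05):
    """Scan observed scores in ascending order and return the first
    threshold at which the FDR is at or below max_fdr, or None."""
    for threshold in sorted(set(target_scores) | set(decoy_scores)):
        fdr = fdr_at_threshold(target_scores, decoy_scores, threshold)
        if fdr is not None and fdr <= max_fdr:
            return threshold
    return None


# Illustrative scores only, not data from the experiments described here.
example_targets = [10, 20, 30, 40, 50, 60]
example_decoys = [5, 15, 25]
example_cutoff = lowest_threshold_for_fdr(example_targets, example_decoys)
```

Lowering the score at which the 5% FDR is reached, as reported above for Validator-filtered data, admits more target peptides at the same error rate.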

[0081] SOFTWARE TO DECONVOLUTE RAW MASS SPECTRA

[0082] Validator^RAW identifies isotopic pairs and deconvolutes spectra directly from raw unsearched data. Validator^RAW works on unsearched MS data, independently of Mascot (or any other search engine). All software is coded in Python 2.6 and tested on standard desktop hardware. First, raw data files collected from a Thermo Scientific MS are converted to a flat-text mzXML file by means of ReAdW.exe (a command-line program to convert Xcalibur native acquisition (.RAW) files to mzXML). Then accurate monoisotopic masses are extracted to a text file using the Horn Mass Transform algorithm (Horn, et al., 2000) within Decon2LS ("Open Source Tools for the Accurate Mass and Time (AMT) Tag Proteomics Pipeline", N. Jaitly, et al, ASMS, 2006 poster) using default options with the exception of a minimum SNR of 1 for peak detection, a peptide background ratio of 1 and using the 'complete fit' option. The software then reads the monoisotopic data completely into memory and creates a dictionary containing sets of 400 sequential scans (about 2.5 min) spaced 10 scans apart. Each 400 scan window is then searched for any two masses that differ by the mass of ¹³C₆,¹⁵N₄-Arg (10.00827) or ¹³C₆,¹⁵N₂-Lys (8.01420) or some combination (Arg/Arg 20.01654, Lys/Lys 16.0284, Arg/Lys 18.02247). Pairs of scans that match this difference within a 3 ppm tolerance are stored in an array. Potential duplicates are identified and stored in the same array. Once all the pairs are extracted from each scan window, the mzXML file is searched with each light and heavy monoisotopic mass in order to find fragmentation scans that arose from these peaks. Where available, the fragmentation data are stored in the pair array. For the scans for which both light and heavy precursors have at least one set of fragmentation data, the light and heavy MS² data are compared through simple peak-to-peak iteration to find peaks that match (±1000 ppm) based on having the same mass (non-shifting b ions) or a mass difference equal to one of the stable-isotope masses outlined above (shifting y ions) and having intensities within 25% of one another. At the end of the iteration for each scan window, there is now an array of precursor light/heavy scans with monoisotopic masses, corresponding MS² scan data, and a set of b and y ions.
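The pair-finding step within one scan window can be sketched in Python as below. The label mass differences are those given above. The function name, the use of a sorted list with binary search, and the choice to apply the 3 ppm tolerance relative to the expected heavy mass are assumptions of this sketch, not details taken from the software.

```python
import bisect

# Mass differences between heavy and light forms, in Da, as listed in the text.
LABEL_SHIFTS = {
    "Lys": 8.01420,
    "Arg": 10.00827,
    "Lys/Lys": 16.02840,
    "Arg/Lys": 18.02247,
    "Arg/Arg": 20.01654,
}


def find_isotopic_pairs(masses, ppm=3.0):
    """Find (light, heavy, label) triples among the monoisotopic masses of
    one scan window whose difference matches a label shift within ppm."""
    ordered = sorted(masses)
    pairs = []
    for light in ordered:
        for label, shift in LABEL_SHIFTS.items():
            target = light + shift
            tolerance = target * ppm / 1e6
            # Binary search for heavy partners inside the tolerance window.
            lo = bisect.bisect_left(ordered, target - tolerance)
            hi = bisect.bisect_right(ordered, target + tolerance)
            for heavy in ordered[lo:hi]:
                pairs.append((light, heavy, label))
    return pairs


window_masses = [1500.7500, 1508.7642, 1510.7583, 2000.0000]
window_pairs = find_isotopic_pairs(window_masses)
```

In this example the mass 1500.7500 pairs with 1508.7642 (one labeled lysine) and with 1510.7583 (one labeled arginine); in practice such ambiguous cases would be resolved by the fragmentation comparison that follows.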

[0083] VALIDATOR^RAW The algorithm is tested using a well-characterized published LTQ-Orbitrap SILAC data set (Cox & Mann, 2007). When data from one 380 Mb slice of raw data was analyzed, Validator^RAW found 18,118 potential isotopic pairs, but only 2,775 (12.5%) had corresponding MS² data available for both members of the pair. Of the pairs, there were 1137 "light" peaks and 1672 "heavy" peaks for which no fragmentation data were available. There were 546 windows of 400 scans, with an average number of 1377 precursor scans in each, after redundant monoisotopic masses were merged. The average processing time per scan was 8.2 ms, and the average number of isotopic pairs identified per 400-scan window was 199.6. Thus, the Validator^RAW algorithm can process raw MS data, finding and deconvoluting isotopic pairs.

[0084] ADDITIONAL VALIDATOR^RAW EMBODIMENTS In certain embodiments, the code can also load either the published list of peptides or search engine-generated data (from Mascot and X!Tandem). As described above, one can iterate through each peak and compare them based on a tolerance of 1000 ppm, scoring each one by summing the number of b and y ion matches. For each peptide, one can repeat this exercise with thirty random sequence-scrambled identical-composition decoy peptides and determine the statistical significance for each peptide match by calculating the 95% confidence interval for the decoy scores and observing if the score for the peptide is outside this range. Based on the number of pairs that exceed statistical significance, one can calculate sensitivity and specificity for the deconvolution algorithm. To the end of optimizing pair matching, one can then modulate the MS¹ (1-5 ppm) and MS² (200-2000 ppm) match tolerances and the size of the scan window (100-400 scans), and re-run Validator^RAW, generating ROC curves for each variable in order to determine the optimum conditions.
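The decoy-based significance test can be sketched in Python as follows. The text does not state how the 95% confidence interval is computed; this sketch assumes a normal approximation (mean plus 1.96 standard deviations of the decoy scores) and tests only the upper bound. The function names and the caller-supplied score_fn are hypothetical.

```python
import random
import statistics


def scrambled_decoys(peptide, n=30, rng=None):
    """Return n random sequence-scrambled decoys with identical composition."""
    rng = rng or random.Random()
    decoys = []
    for _ in range(n):
        residues = list(peptide)
        rng.shuffle(residues)
        decoys.append("".join(residues))
    return decoys


def is_significant(peptide, score_fn, n_decoys=30, rng=None):
    """True if the peptide's score exceeds the upper 95% bound of the
    decoy score distribution (normal approximation, an assumption here).

    score_fn: callable mapping a sequence to its b/y ion match score.
    """
    decoy_scores = [score_fn(d) for d in scrambled_decoys(peptide, n_decoys, rng)]
    mean = statistics.mean(decoy_scores)
    spread = statistics.stdev(decoy_scores)
    upper_bound = mean + 1.96 * spread
    return score_fn(peptide) > upper_bound
```

Any scoring function can be passed in, for example one that counts the b and y ions of the sequence that match the deconvoluted spectrum within 1000 ppm.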

[0085] ISOTOPIC ENVELOPE ANALYSIS In certain embodiments, the Validator algorithm does not use all the information encoded in the sequence of scans corresponding to one chromatographic peak. In these embodiments for instance, while looking for the isotopic pairs, it compares monoisotopic masses for each isotopic envelope found in a scan. This method is fairly robust in the case of highly abundant peaks, but those of low magnitude are often registered with their envelopes incomplete or distorted, making the determination of their monoisotopic mass inaccurate or even impossible. There is simply not enough information in a single scan to make such determinations. But the necessary information may still be available, spread across multiple scans. It was shown that by integration of the peak data over a range of successive scans, as well as by using the information from the peaks corresponding to different charge states of the same peptide, the accuracy of mass determination of the less abundant peptides can be significantly increased to sub-ppm values (Cox & Mann, 2009). When the inventors employed the method of Cox and Mann, they observed that even a short sequence of chromatographic events forming a 3-D ridge can reduce the mass error of individual low-abundance peaks from several ppm to well within 1 ppm. By adding a 3-D integration algorithm to Validator, one achieves a more reliable detection of isotopic pairs, since the MS¹ tolerance can be restricted even further. The method described by Cox and Mann can be employed, enabling integration of all peak data for any SILAC pair (Cox & Mann, 2009). To accomplish this, one can allow certain embodiments of Validator^RAW to find pairs as before from the monoisotopic data, but including the extra step of finding all measured m/z values from the mzXML file, over all scans in which the pair appeared. The entire complement of scans, representing all the measurements for the SILAC pair, is then subjected to error correction according to the Cox and Mann algorithm.

[0086] COMPARING ENTIRE FRAGMENTATION PATTERNS In certain embodiments, Validator only compares peak intensity and m/z. A preferred analysis relies on comparing the full fragmentation spectra using robust pattern matching techniques analogous to those employed in radio signal detection, speech recognition, image processing and conventional liquid or gas-phase chromatography. All such algorithms, which can be summarily characterized as "holographic", endeavor to extract useful information from the entire run-length of the signal - in contrast to simple "analytical" methods commonly used in the instrumentation firmware, which are tuned to recognize a small set of characteristic events, such as threshold-crossing in the signal itself and in a few of its derivatives. An advantage of the holographic approach is that it is less sensitive to local distortion of the peaks by random noise or drift, and is more successful in comparing the peptide fragmentation spectra of low magnitude. Algorithms for these approaches can be based on deterministic spectra convolution or on a stochastic classifier. A statistical learning machine can be used. A deterministic transform-based algorithm, such as convolution filtering, can be used. Various convolution kernels that transform the MS² fragmentation spectra into an abstract functional domain can be used to facilitate the comparison between the two projections of the original frequency spectra. Alternatively, a statistical classifier based on a support vector machine ("SVM") can be used. SVMs have been successfully applied to problems of spectra recognition, and can be used in a similar way to solve the task of comparing any two spectra. It is preferred to minimize the number of parameters required to represent the spectra. One can train the SVM using additional binary input from a human expert confirming whether or not the two spectra originate from the same SILAC pair. This approach requires care in formulating the model and depends on the quality of human input, but allows an efficient algorithm to be built without programming.

[0087] Certain embodiments use proprietary libraries of others to read and convert raw data. It may be that the data are transformed during this process, and information may be mutated or lost. Tools to interrogate the raw data files can be used, obviating the need for complex and time-consuming data conversions.

SOFTWARE TO IDENTIFY PEPTIDES DIRECTLY FROM RAW MS DATA

[0088] IDENTIFIER A peptide-spectrum matching algorithm, Identifier, is built upon the hypothesis that given a highly accurate precursor mass, there are only a small number of tryptic peptide candidates (Conrads, et al, 2000; Liu, et al, 2007; Mayampurath, et al, 2008; Strittmatter, et al, 2003). Identifier queries a mass-ordered tryptic database over a mass range to assemble a list of candidate peptides and applies the fragmentation comparison module from the Validator^RAW program to test each for the number of b and y ion matches. An array is maintained with each peptide and its b/y ion scores. Once all are tallied, the peptide with the highest total number of matches is considered the "winner," if there are more than 3 b and y ions and the highest score is more than 50% greater than the next highest score. To demonstrate the feasibility of the approach, the Consensus CDS (CCDS) core human protein library (20090902, build 9606) file was loaded into memory (23,730 proteins, 14.2 Mb) and using regular expressions, the software parsed the proteins and created tryptic peptides, allowing for one missed cleavage. To store the masses and peptides, a dictionary data structure was chosen, since lookups can take place in constant time. 1,082,556 peptides greater than 5 residues linked to their parent protein were stored in the dictionary and keyed on the accurate monoisotopic mass. Searching the entire mass-sorted tryptic database, a mean of only 104 peptides fall within a ± 0.05 Da window, and only 22 for ± 0.01 Da. The workflow outlined in Fig. 3A has been tested by varying stringency for finding pairs and for matching candidate peptides. When very tight stringency was used, Identifier found 522 peptides with an overlap of 309 peptides of the 1,119 peptides found after traditional database search with X!Tandem and Mascot for this same slice (27.7%). Of the 218 proteins found through database search, Identifier identified 157 (72.0%). Interestingly, an additional 225 proteins were also identified with high confidence (allowing for non-unique peptide matches). When the criteria for pair finding, spectra deconvolution and peptide matching as outlined above were relaxed (MS¹ error 5 ppm, MS² error 2,000 ppm, no filter for picking the correct peptide match), Identifier found 2,226 peptides, of which 522 were from the 1,119 found by Mann (46.6%). Of the 218 proteins found by database search, Identifier found 193 (88.5%), along with many others. A specific example of a peptide match is shown in Figure 3B.
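The candidate lookup and "winner" rule described above can be sketched in Python. This is an illustration under stated assumptions, not the Identifier code: the function names are hypothetical, cleavage after K or R ignores the proline rule, a mass-sorted list with binary search stands in for the dictionary, and only the 20 standard residues are supported.

```python
import re
from bisect import bisect_left, bisect_right

# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein, missed_cleavages=1, min_length=6):
    """In silico digest: cleave after K or R, allow missed cleavages,
    keep peptides longer than 5 residues."""
    fragments = re.findall(r".*?[KR]|.+$", protein)
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


def monoisotopic_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def build_index(proteins):
    """proteins: name -> sequence. Returns a mass-sorted list of
    (mass, peptide, protein) tuples."""
    return sorted((monoisotopic_mass(p), p, name)
                  for name, seq in proteins.items()
                  for p in tryptic_peptides(seq))


def candidates(index, precursor_mass, window=0.05):
    """All index entries within +/- window Da of the precursor mass."""
    masses = [entry[0] for entry in index]
    lo = bisect_left(masses, precursor_mass - window)
    hi = bisect_right(masses, precursor_mass + window)
    return index[lo:hi]


def pick_winner(scores, min_ions=3, margin=1.5):
    """scores: peptide -> number of matched b and y ions. The top peptide
    wins only with more than min_ions matches and a score more than 50%
    greater than the runner-up."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    best, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    if best_score > min_ions and best_score > margin * runner_up:
        return best
    return None
```

A sorted list makes the window query a pair of binary searches; a dictionary keyed on mass, as described above, gives constant-time exact lookups but requires binning or a separate ordering for range queries.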

[0089] PEPTIDE SELECTION OPTIMIZATION The peptide selection process can be optimized using the Mann test data as described above. Preferably, those isotopic pairs for which the peptide identity is highly certain are considered. One can rescore candidate peptide matches testing several match score equations, such as total score divided by length of peptide, sum of matches divided by precursor mass error, and combinations of these variables, looking for the combination that affords the greatest ability to discriminate the correct match.

[0090] CANDIDATE PEPTIDE SELECTION AND TESTING One can use a precursor mass error range of 0.05 Da, which is 50x larger than needed (given a mass error of 1-3 ppm). Manual inspection of the peptide candidate lists generated reveals that the "winning" peptide is usually within 0.005 Da of the precursor mass. In fact, as published by Mann (36), by employing the peak integration method described, one can improve the precursor mass error to 1 ppm or less. Modeling the database of tryptic peptides to study this effect, at a mass error of 1 ppm, indicates that the average number of candidate peptides falls to less than 5. Rather than arbitrarily choose a mass window, one can use a more sophisticated algorithm for candidate testing based on false discovery rate (FDR) which is faster, more robust, and allows for statistical validation independent of other peptide identifications. Different embodiments of the Identifier software can include the following variations: The tryptic database can be queried with the accurate precursor mass as determined by Validator^RAW, and the two nearest peptides can be chosen for testing. The match score for each candidate peptide and a scrambled but identical composition decoy peptide can be calculated using the optimized formula derived above. The score distribution can be kept in an array, and additional peptides (and corresponding decoys) can be tested and logged. A suitable stopping criterion for this testing is when a peptide score in the group has reached statistical significance by being outside the 95% confidence interval for all the peptides tested. This process has the advantage of building statistical validation right into the method. Far fewer peptide comparisons may be required to find a "best match" using this method, facilitating more rapid identifications.

[0091] In certain embodiments, a bottleneck for Identifier is the process of iterating through each candidate peptide, determining the theoretical fragmentation pattern and comparing it to the calculated b and y ions, a subroutine that runs over 10 million times during a typical analysis. A speed improvement can be realized when this repeated, processor-intensive code is transcoded into C++ and compiled. In addition, as outlined above, fragmentation-matching algorithms based on machine learning can be used, and can be implemented in, for example, compiled C++.

IDENTIFIER EMBODIMENTS FOR RECOGNIZING PEPTIDES WITH POST-TRANSLATIONAL MODIFICATIONS

[0092] POST-TRANSLATIONAL MODIFICATIONS Identifying peptides with post-translational modifications (PTMs) remains a central problem in proteomics. Over 200 PTMs have been identified, common ones being acetylation, farnesylation, phosphorylation, and oxidation. Traditional search engines approach this by brute force, comparing the experimental spectrum to the database of possible matches dictated by the user-specified list of potential PTMs. The search time is significantly extended for each additional PTM considered.

[0093] PTM IDENTIFICATION The approach to identifying peptides with PTMs is a natural extension of the strategy to match the deconvoluted spectrum to candidate peptides outlined above. If no peptide match is found within the range of the instrumental mass error, the search is expanded to include modifications. This can be coded in the following way: A dictionary of PTMs is created, based on their biological prevalence. In certain embodiments, the user can manipulate the dictionary. The search for a candidate peptide match can proceed as above, iterating from the peptides closest in mass to the unknown spectrum. Once the mass error between the next candidate and the unknown spectrum exceeds the mass error of the MS (e.g., 1-3 ppm), the "unmodified peptide" search can cease and the hunt for a modified peptide can commence. The dictionary of PTMs is queried, and the PTMs can be considered individually and in combinations based on their prevalence. For each PTM, the mass is subtracted from the monoisotopic mass of the unknown peptide, and the peptide candidates for the modification around this new mass are queried. Each candidate peptide is modified and fragmented in silico and compared to the pattern of b and y ions. For instance, if phosphorylation is being considered, 79.9799 Da is subtracted from the monoisotopic mass, and the tryptic peptide dictionary is queried for peptides that have possible sites of phosphorylation at this mass. Each candidate "phosphopeptide" is then modified on each Ser, Thr or Tyr, fragmented and compared to the Validator results. The search continues evaluating each modification or combination until either a match is found or the search is abandoned. From a programmatic perspective, this approach is attractive, because on a multi-core machine, the software can spawn a separate thread, performing this search in parallel, while other identifications continue.

[0094] Exemplary modifications and their associated masses include oxidation (15.9994 Da), N-acetylation (42.0373 Da), and pyroglutamic acid (17.0306 Da). These can be considered in sequence and in combination for those fragmentation spectra for which no suitable candidate has been found.
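The modified-peptide search described in paragraphs [0093] and [0094] can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Identifier implementation: the function names (`candidates_in_window`, `ptm_search`), the 3 ppm default tolerance, and the limit of two modifications per combination are hypothetical. The mass shifts are the values quoted in the text, and the dictionary order merely stands in for a prevalence ranking.

```python
from bisect import bisect_left, bisect_right
from itertools import combinations

# Mass shifts (Da) as quoted in the text; insertion order is an
# assumed stand-in for "biological prevalence".
PTM_MASSES = {
    "phosphorylation": 79.9799,
    "oxidation": 15.9994,
    "N-acetylation": 42.0373,
}


def candidates_in_window(sorted_masses, target, ppm):
    """Return database masses within +/- ppm of target (list must be sorted)."""
    tol = target * ppm * 1e-6
    lo = bisect_left(sorted_masses, target - tol)
    hi = bisect_right(sorted_masses, target + tol)
    return sorted_masses[lo:hi]


def ptm_search(sorted_masses, precursor_mass, ppm=3.0, max_combo=2):
    """Subtract each PTM (or combination) from the precursor mass and
    look up unmodified tryptic peptides near the shifted mass."""
    hits = []
    names = list(PTM_MASSES)
    for k in range(1, max_combo + 1):
        for combo in combinations(names, k):
            shifted = precursor_mass - sum(PTM_MASSES[n] for n in combo)
            for mass in candidates_in_window(sorted_masses, shifted, ppm):
                hits.append((combo, mass))
    return hits
```

Each hit would still need to be modified and fragmented in silico and scored against the observed b and y ions, as the text describes; the sketch covers only the candidate lookup.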

[0095] In some cases, even using decoy search and FDR determination, there will be cases where an incorrect peptide from the candidate list has a reasonable score that separates it from the remainder, but where a search for a modified (correct) peptide would find a match with a much higher score. This may not be evident from the first pass with Identifier. Therefore, it is preferable to devise a knowledge-based and data-dependent peptide-scoring scheme to recognize a correct peptide ID. Beyond simply testing each peptide individually, one can incorporate functionality described below (ProteinMiner) to better predict which peptides are likely to be present, thus strengthening the likelihood of a match being correct.

VALIDATOR^RAW AND IDENTIFIER EMBODIMENTS USING A LABEL-FREE SYSTEM

[0096] A common criticism of isotopic labeling is that it can increase the complexity of the sample and decrease dynamic range, suggesting significant advantages to a label-free approach. Modern mass analyzers such as the Orbitrap from Thermo-Fisher and the SYNAPT G2 Q-TOF from Waters offer mass accuracy approaching 1 ppm. If this precision is applied to analysis of fragmentation patterns, this can obviate the need for isotopic labeling to obtain high-confidence identifications using Identifier.

[0097] CID FRAGMENT ANALYSIS IN THE ORBITRAP Several systems were compared, in collaboration with Dr. Robert Bergen at the Mayo Proteomics Research Center: LC-MS/MS analysis of a complex sample on the LTQ-Orbitrap where MS² is analyzed asynchronously in the LTQ linear ion trap at ~1500 resolution and where the Orbitrap is used to analyze MS¹ at 60,000 and MS² at 7,500 resolution (so-called "Orbi-Orbi" analysis, Fig. 4). The latter approach improved identifications (Table 1) and improved SNR by eliminating most background noise. The cost of Orbi-Orbi analysis is a decrease by ~1/2 in the number of ions fragmented, but this may be offset by the far greater yield of confident peptide identifications.

Table 1 - Six y-ions from LFVGGIKEDTEEHHLR identified by Mascot in Orbi-LTQ and Orbi-Orbi fragmentation spectra (Fig. 2) of ¹⁶O (cols. 1 to 5) and ¹⁸O (cols. 6 to 10) parent ions. Shown are calculated (cols. 1 and 6) and measured m/z values (cols. 2, 3, 7 and 8) with mass deviations (cols. 4, 5, 9, and 10). Note ten-fold smaller deviations from theoretical in Orbi-Orbi data (cols. 5 and 10).

      ¹⁶O parent ion                                         ¹⁸O parent ion
      Calc. m/z  Measured m/z          Deviation             Calc. m/z  Measured m/z          Deviation
Ion              Orbi-LTQ   Orbi-Orbi  Orbi-LTQ   Orbi-Orbi             Orbi-LTQ   Orbi-Orbi  Orbi-LTQ   Orbi-Orbi
Col.  1          2          3          4          5          6          7          8          9          10
y3    425.2619   425.3048   425.2604   -0.0429    0.0015     429.2704   429.3807   429.2796   -0.1103    -0.0092
y4    562.3208   562.3345   562.3190   -0.0137    0.0018     566.3293   566.3345   566.3272   -0.0052    0.0021
y5    691.3634   691.3732   691.3597   -0.0098    0.0037     695.3719   695.3729   695.3707   -0.0010    0.0012
y8    1036.4806  1036.4756  1036.4778  0.0050     0.0028     1040.4891  1040.4392  1040.4880  0.0499     0.0011
y9    1165.5232  1165.4774  1165.5217  0.0458     0.0015     1169.5317  1169.3828  1169.5300  0.1489     0.0017
y10   1293.6182  1293.5477  1293.6161  0.0705     0.0021     1297.6267  1297.6151  1297.6227  0.0116     0.0040
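To relate the absolute deviations in Table 1 to the ppm figures used elsewhere in this description, the conversion can be written out as a short Python sketch. The helper name `ppm_error` is illustrative only; the sign convention (calculated minus measured) follows the deviation columns of the table, and the numbers are the y3 values for the ¹⁶O parent ion.

```python
def ppm_error(calc_mz, measured_mz):
    """Deviation (calculated - measured) expressed in parts per million."""
    return (calc_mz - measured_mz) / calc_mz * 1e6


# y3 ion of the 16O parent, values taken from Table 1
Y3_CALC = 425.2619
y3_orbi_ltq = ppm_error(Y3_CALC, 425.3048)   # ion-trap MS2 measurement
y3_orbi_orbi = ppm_error(Y3_CALC, 425.2604)  # Orbitrap MS2 measurement
```

For this ion the ion-trap measurement is off by roughly 100 ppm, while the Orbitrap measurement is within a few ppm of the calculated value.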

[0098] IDENTIFYING PEPTIDES IN A LABEL-FREE SYSTEM One can model label-free fragment pattern matching, thereby agnostic to b vs. y ions, to determine a scoring function that weights the match score based on the error in mass between the measured and predicted fragments. One can determine a threshold mass accuracy for the fragments required to distinguish among the <10 candidate peptides to be considered over a 1 ppm precursor mass error with certainty equal to or better than that provided by isotopic pair analysis.

[0099] A drawback of the LTQ-Orbitrap is that high resolution analysis of peptide fragmentation incurs a significant decrease in yield of spectra. However, in, for example, the Waters SYNAPT G2 Q-TOF, both MS¹ and MS² are analyzed at 40,000 resolution and at 1 ppm mass accuracy, yielding up to 20 fragmentations per second, making this a suitable platform for implementation of label-free peptide identification. One can create embodiments of the Validator^RAW program to search through the raw data, finding peaks for which fragmentation data are available. The accurate precursor mass can be used to query the database of tryptic peptides, and the search for a candidate match can proceed iteratively as described above. Whereas in embodiments described above, the matching was between the theoretical fragmentation pattern for the candidate and the Validator-identified b and y ions, the comparison in a label-free system is between the candidate and measured fragmentation spectra. The peaks can be iteratively compared with a low tolerance (~1 ppm) to determine which fragmentation spectrum best matches the unknown based on a match score and employing the statistical methods described above. Alternatively, the pattern-matching algorithm developed in Aim IB can be employed to facilitate and speed up spectrum comparison.
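A label-free match score that weights each fragment match by its mass error, as suggested in paragraph [0098], might look like the following Python sketch. The linear weighting (1 at zero error, falling to 0 at the tolerance) and the name `match_score` are assumptions made for illustration; the text does not specify the scoring function, which is to be determined by modeling.

```python
from bisect import bisect_left


def match_score(theoretical_mz, measured_mz, tol_ppm=1.0):
    """Sum of error-weighted matches between theoretical fragment m/z
    values (b and y treated alike) and a measured fragment spectrum."""
    measured = sorted(measured_mz)
    score = 0.0
    for mz in theoretical_mz:
        i = bisect_left(measured, mz)
        # The nearest measured peak is either just below or just above mz.
        neighbours = measured[max(i - 1, 0):i + 1]
        if not neighbours:
            continue
        nearest = min(neighbours, key=lambda x: abs(x - mz))
        err_ppm = abs(nearest - mz) / mz * 1e6
        if err_ppm <= tol_ppm:
            # Assumed weighting: closer matches contribute more.
            score += 1.0 - err_ppm / tol_ppm
    return score
```

Each candidate peptide from the precursor-mass window would be fragmented in silico and passed through such a function, with the statistical stopping rule described above applied to the resulting score distribution.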

[00100] One can test this embodiment of the Validator^RAW algorithm in different ways. For example, one can analyze results from running a 48-protein mix on the SYNAPT G2. As the complete complement of this test proteome is known, one can confirm that the correct peptides are being identified. For peptides that are misidentified or for proteins that remain unidentified, one can examine the detailed log files from the run to determine which aspects (peak picking, candidate peptide selection, fragmentation comparison) need further optimization and refinement. One can also evaluate the algorithm by side-by-side MS analysis of similar complex samples on two different platforms. For instance, SILAC and unlabeled samples from Drs. Kron and Kristjansdottir's studies in breast cancer cells can be independently analyzed on the SYNAPT G2 at Waters and in the core facility at the Mayo Clinic on the LTQ-Orbitrap. One can then apply the label-free Identifier algorithm to the SYNAPT G2 data, comparing the peak selection and downstream analysis to that of other embodiments of the Validator^RAW / Identifier system, as it is applied to the Orbitrap SILAC data. Again, analyzing the log files that detail the pair picking, peptide candidate selection, and fragmentation comparison can be illustrative in not only demonstrating the performance in the label-free system, but also in helping to optimize the approach.

[00101] Calculations indicate that peptide ID can be performed directly from high accuracy fragmentation spectra quickly.

[00102] TOOLS FOR RAPID, DATA-DEPENDENT PROTEIN IDENTIFICATION. Current methods for protein identification from constituent peptides are normally based on parsimony. That is, the simplest explanation for the population of peptides present is taken to be the correct one. A commonly used tool to enhance protein identification is ProteinProphet (Nesvizhskii, et al., 2003). This open-source module in the proteomics pipeline developed by the Institute for Systems Biology (ISB, Seattle, WA) works by adjusting probabilities for single-hit peptides and also seeks to find the simplest set of proteins to explain the peptides present (parsimony). Nevertheless, this strategy is naive to the theoretical protein-protein interactions and the underlying biologic pathways involved. One can use knowledge-based tools for protein identification based not only on parsimony but also on information-based strategies that rapidly mine existing database resources for information that can augment traditional means.

[00103] EMBODIMENTS THAT QUERY INTERACTION DATABASES TO AUGMENT ASSIGNMENT TO PARENT PROTEINS

[00104] PROTEIN INTERACTION DATABASES One can employ a systems biology approach to the development of an algorithm for the rapid assignment of peptides to their parent proteins. Given the richness and prevalence of protein interaction databases, it is possible to mine these resources to predict which proteins are likely to be present, once some information about the system is known. For example, the effects of MYCN amplification on tumor aggressiveness in neuroblastoma have been studied. Nucleophosmin (NPM1) has been identified as an important protein in this system. According to the BioGRID protein interaction database (Stark, et al., Nucl. Acids Res. (2011) 39 (suppl 1): D698-D704.), there are at least 25 unique interactors with NPM1. Therefore, once NPM1 has been identified, one can expect to see other proteins from this complex, and this can augment the protein identification algorithms.

[00105] PROTEINMINER SOFTWARE One can build a protein identification module that focuses not only on parsimony but also interrogates one or more protein interaction databases and builds up information about which proteins are likely to be present in the given sample. For subsequent protein identifications, this database of interactions can be considered along with parsimony information. Using the dataset from the Mann group, for example, one can apply the Validator^RAW / Identifier algorithms to obtain a set of peptide identifications as described above. Proteotypic peptides (peptides that match only one protein) with very high match scores can be used to identify a core set of proteins that are highly likely to be correct. By using the urllib functions in Python, one can code ProteinMiner to query various protein interaction databases, such as BioGRID, IntAct (Kerrien, et al., Nucl. Acids Res. 
(2012) 40 (D1): D841-D846.), MIPS (Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stumpflen V, Mewes HW, Ruepp A, Frishman D. The MIPS mammalian protein-protein interaction database. Bioinformatics 2005; 21(6):832-834; [Epub 2004 Nov 5]), and STRING (Jensen et al. Nucleic Acids Res. 2009, 37(Database issue):D412-6), among others. Alternatively, the entire interactome from many of these sites can be downloaded and stored locally for faster searching (e.g., BioGRID). The interactome generated can be used to enrich the remaining peptides for likely protein identifications. Modestly scored peptides that match an interacting protein can be selected as a correct identification. Conversely, peptides with modest or low scores for which no parent protein is found in the interactome can be re-searched, using either a larger window or with the inclusion of a set of modifications (as discussed above). The process can continue iteratively until the pool of possible spectrum-peptide-protein matches is exhausted. This process can yield a much richer set of protein identifications than a parsimony-based system.

[00106] One can optimize the system based on analysis of, for example, the Sigma 48-protein mix with peptides from potentially interacting proteins, identified from BioGRID, spiked into the sample. For instance, lactotransferrin, a constituent of the 48-protein mix, has five potential protein interactions (ceruloplasmin, mucin, glucocorticoid kinase, among others). Tryptic peptides from these proteins can be synthesized, spiked into the 48-protein mixture at varying concentrations, and analyzed on the MS. Analysis of the raw data with Validator^RAW, Identifier and ProteinMiner can be monitored by studying the program logs and determining if the protein interaction database queries are successful in finding and helping validate the interacting proteins. 
Based on these data, one can fine-tune parameters for several potential variables, such as how low a peptide score can be and still be considered correct, so long as a potential protein interaction is identified.
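The iterative ProteinMiner logic of paragraphs [00105] and [00106] can be illustrated with a small Python sketch that works on a locally stored interactome. Everything named here is an assumption for illustration: the function `protein_miner`, the score thresholds `high` and `modest`, the representation of the interactome as a dictionary of sets, and the placeholder protein names. No real interaction database is queried.

```python
def protein_miner(matches, interactome, high=0.9, modest=0.5):
    """matches: list of (protein, peptide_score) pairs.
    interactome: dict mapping a protein to the set of its interactors.
    Returns (accepted proteins, matches left over for re-searching)."""
    # Core set: proteins supported by very high-scoring peptides.
    accepted = {protein for protein, score in matches if score >= high}
    changed = True
    while changed:  # iterate until no further protein can be added
        changed = False
        expected = set().union(
            *(interactome.get(p, set()) for p in accepted)
        )
        for protein, score in matches:
            if (protein not in accepted and score >= modest
                    and protein in expected):
                accepted.add(protein)
                changed = True
    # Anything not accepted is a candidate for a wider or PTM-aware search.
    research = [(p, s) for p, s in matches if p not in accepted]
    return accepted, research
```

The `modest` threshold corresponds to the tunable parameter mentioned in the text, namely how low a peptide score can be while still being accepted on the strength of a predicted interaction.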

[00107] BETTER LC-MS/MS CONTROL SOFTWARE. Current methods for the selection of peptides for fragmentation are based on relative abundance of the peptide in the precursor scan. In the basic "top 5" approach, the mass spectrometer selects the five most abundant peptides for fragmentation from the precursor scan. In Dynamic Exclusion, any m/z selected is added to an exclusion list, preventing selection in subsequent scans. Although this reduces re-selection and re-fragmentation of a single peptide, selection of other peptides from the same protein is not affected. A simplified simulation of mass spectrometry demonstrates determinants of dynamic range. 5000 proteins are chosen at random from the CCDS database (build 9606) and assigned a random "abundance" over a wide dynamic range. Each protein is trypsinized in silico, and the mass of each peptide is calculated. Each peptide is assigned a random "ionizability" so that the "intensity" of each peptide peak is the product of the abundance and the ionizability. Peptides are represented by a single m/z, representing the monoisotopic mass of a singly charged ion. Peptides are assigned a random scan number, biased so that most peptides "elute" in the middle 90% of the run. Each peptide is programmed to elute over 30-180 s with a triangular profile. A "scan" is then generated once per second for the 120 min run time.
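The in silico digestion and mass calculation used in this simulation can be sketched in Python. This is a generic illustration rather than the simulator's code: the function names are hypothetical, the cleavage rule is the commonly used one for trypsin (after K or R, except before P, with no missed cleavages), and the residue masses are standard monoisotopic values rounded to five decimals.

```python
# Standard monoisotopic residue masses (Da), rounded to five decimals.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def trypsinize(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and nxt != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def singly_charged_mz(peptide):
    """m/z of the singly protonated peptide, as used in the simulation."""
    return monoisotopic_mass(peptide) + PROTON
```

As a consistency check, `singly_charged_mz("HLR")` agrees with the calculated y3 m/z of 425.2619 listed in Table 1, since a y ion has the composition of the corresponding C-terminal peptide plus a proton.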

[00108] PEPTIDE EXCLUSION VS. PROTEIN EXCLUSION Implementing a simple "top 5" approach, the scans are successively parsed, and the 5 most intense peaks are chosen for fragmentation. Using an FDR of 5% (5% incorrect identifications) and a requirement for two peptides to identify a protein, the simulator identified ~800 proteins in a 2 hour run. As shown in Figure 5A, identified peptides are greatly skewed to the highest "intensities" (blue dots). To simulate Dynamic Exclusion, an exclusion list was created where each ion selected is added to a list of up to 500 ion masses (± a tolerance) and remains there for 180 scans, preventing any re-selection. With Dynamic Exclusion, ~1500 proteins are identified over the two hour run, but the mean intensity of selected peptides remains high (Fig. 5B). Then, Intelligent Exclusion was simulated, whereby the entire protein is excluded from further identification once it is identified. Here, once two unique peptides identify a protein, all other peptides from that protein are added to an unlimited exclusion list. Implementing this method in the simulation had a dramatic effect (Fig. 5C). Even early in the run, peptides with lower intensities are selected (blue), leading to a significant increase in identifications to ~4500 of the 5000 proteins (90%, black). Intelligent Exclusion was highly tolerant both to increasing the FDR and the width of the exclusion mass tolerance (data not shown). These data confirm that exclusion of redundant peptides is an effective way to dramatically increase the dynamic range and yield during the mass spectrometer run.
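The contrast between the three selection strategies can be reproduced qualitatively with a much-reduced toy model in Python. The sketch below is not the simulator described above and makes strong simplifying assumptions, all of them hypothetical: every peptide is visible in every scan (no elution profile), identification is perfect and instantaneous (no FDR), excluded peptides never return to the pool, and the default sizes are far smaller than 5000 proteins. The function name `simulate` and the mode labels are illustrative.

```python
import random


def simulate(mode, n_proteins=50, peptides_per_protein=10,
             n_scans=20, top_n=5, seed=0):
    """Count proteins identified by >= 2 distinct peptides.
    mode: "none" (plain top-N), "peptide" (dynamic exclusion of picked
    peptides) or "protein" (exclude a protein once it is identified)."""
    rng = random.Random(seed)
    peptides = []  # (intensity, protein id, peptide id)
    for prot in range(n_proteins):
        abundance = 10 ** rng.uniform(0, 6)  # wide dynamic range
        for pep in range(peptides_per_protein):
            ionizability = rng.random()
            peptides.append((abundance * ionizability, prot, pep))
    peptides.sort(reverse=True)  # most intense first

    excluded_peptides, excluded_proteins = set(), set()
    seen = {}  # protein id -> set of identified peptide ids
    for _ in range(n_scans):
        picked = 0
        for _intensity, prot, pep in peptides:
            if picked == top_n:
                break
            if (prot, pep) in excluded_peptides or prot in excluded_proteins:
                continue
            picked += 1
            seen.setdefault(prot, set()).add(pep)
            if mode in ("peptide", "protein"):
                excluded_peptides.add((prot, pep))
            if mode == "protein" and len(seen[prot]) >= 2:
                excluded_proteins.add(prot)
    return sum(1 for peps in seen.values() if len(peps) >= 2)
```

Calling `simulate` with each of the three modes shows the same ordering as the figures discussed above: plain top-N keeps re-selecting the same few peaks, peptide-level exclusion moves down the intensity list, and protein-level exclusion spends at most two fragmentation events per protein, so with the default sizes it reaches every protein in the toy sample.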

[00109] Preferably, Intelligent Exclusion relies on rapid and faithful real-time peptide and protein identification from mass spectrometry data as it is being accumulated. Thus, implementation of real-time peptide identification and development of computational methods for run-time control are desirable.

IMPLEMENTING PEPTIDE/PROTEIN IDENTIFICATION IN A REAL-TIME, DATA-DEPENDENT FASHION

[00110] The Identifier software is well suited to the task of reliably and quickly identifying peptides from unsearched, raw MS data as it is being collected. Data demonstrate that real-time peptide identification is already feasible with the method described herein. For many proteins, identification is conclusive from one or two proteotypic peptides, but it is a natural extension to apply the ProteinMiner algorithm to facilitate protein identification in real-time. One can build an embodiment of the Identifier software, named Identifier^PROT, to accommodate the changes needed to facilitate real-time identification from streaming mass spectrometry data and to incorporate the protein identification module described above.

[00111] IDENTIFIER^PROT Using, for example, the same test data from the Mann group, one can simulate an MS run by "streaming" the raw data through the processing pipeline. Simulation software can be modified to load an entire data slice into memory and then stream the scans to Validator^RAW and Identifier in sequential order. Using the implementation of the Cox and Mann algorithm outlined above, one can build an embodiment of Validator^RAW to recognize and integrate an isotopic envelope over its elution, determining the accurate precursor monoisotopic mass, but retaining the measured mass as well. These can be stored in an ever-growing array keyed on accurate monoisotopic mass, and the list can be continually scanned for the presence of an isotopic pair. In addition, the software can maintain a database of MS² data and keep it associated with the precursor mass of the peptide from which it was derived. Importantly, any membership the precursor achieves in an isotopic envelope can also be noted. Once a pair is identified and the existence of fragmentation data for both members is confirmed, Validator^RAW can be harnessed to determine b and y ion identity as outlined above. At this point, the identity of the peptide can be determined by the Identifier software as described above. The software can then try to associate the peptide with its parent protein using the ProteinMiner algorithm described above.
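The ever-growing, mass-keyed array that is scanned for isotopic pairs can be sketched in Python with a sorted list. The class name `PairFinder`, the 3 ppm tolerance, and the default mass difference are assumptions for illustration; the default of 4.0085 Da corresponds to the ¹⁶O/¹⁸O spacing seen between the calculated values in Table 1, and a different labeling scheme would use a different spacing.

```python
import bisect


class PairFinder:
    """Keeps precursor masses sorted and reports isotopic partners
    as each new mass streams in."""

    def __init__(self, delta=4.0085, tol_ppm=3.0):
        self.delta = delta      # assumed light/heavy mass difference (Da)
        self.tol_ppm = tol_ppm  # assumed matching tolerance
        self.masses = []        # sorted accurate monoisotopic masses

    def add(self, mass):
        """Insert a new mass; return stored masses that pair with it."""
        partners = []
        for target in (mass - self.delta, mass + self.delta):
            tol = target * self.tol_ppm * 1e-6
            lo = bisect.bisect_left(self.masses, target - tol)
            hi = bisect.bisect_right(self.masses, target + tol)
            partners.extend(self.masses[lo:hi])
        bisect.insort(self.masses, mass)
        return partners
```

In a fuller implementation each stored mass would also carry a reference to its MS² data, so that a detected pair can be handed to Validator^RAW only when fragmentation spectra exist for both members.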

ALGORITHMS FOR INTELLIGENT PEPTIDE INCLUSION AND EXCLUSION

[00112] ADAPTIVE PEAK PICKING ENGINE (APPE) - a comprehensive artificial intelligence platform that can dynamically change inclusion and exclusion criteria for selection of ions for fragmentation. In certain embodiments, if a single peptide has been identified that can be a constituent of two proteins, the masses (± a tolerance) of the other peptides of both candidate proteins can be added to an inclusion list for preferential selection in order to identify the protein conclusively. Once the protein is identified, the peptide masses from the other candidate proteins can be removed from the inclusion list, and the masses of constituent peptides of the identified protein can be added to the exclusion list. Therefore, the inclusion list "comb" grows and shrinks, while the exclusion list comb continually gets larger. One can simulate the inclusion and exclusion combs in the following way. For each precursor scan, the simulation software iterates through the masses in the inclusion list, tests each peak in the scan for membership in one of the mass ranges, and if found, marks the peak for possible selection for fragmentation. A maximum of one inclusion peak per protein is allowed in order to avoid redundancy, and the software maintains this tally as the peaks are chosen. If fewer than, for example, five peaks have been chosen through inclusion for fragmentation, the exclusion comb can be applied and peaks found in the exclusion windows can be removed from the scan and no longer considered for fragmentation. Among the remaining peaks, the most intense can be chosen for fragmentation and ID, and the process repeats itself.
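The per-scan logic of the inclusion and exclusion combs can be written as a short Python sketch. The function name `select_peaks`, the absolute tolerance in m/z units, and the data layout (peaks as `(mz, intensity)` tuples, inclusion entries as `(mz, protein)` tuples) are assumptions made for illustration, as is the choice to consider inclusion peaks in order of decreasing intensity.

```python
def select_peaks(scan, inclusion, exclusion, tol=0.01, top_n=5):
    """scan: list of (mz, intensity); inclusion: list of (mz, protein);
    exclusion: list of mz. Returns the peaks chosen for fragmentation."""
    chosen, proteins_used = [], set()

    # 1. Inclusion comb: at most one peak per candidate protein.
    for mz, intensity in sorted(scan, key=lambda p: -p[1]):
        if len(chosen) == top_n:
            break
        for inc_mz, protein in inclusion:
            if abs(mz - inc_mz) <= tol and protein not in proteins_used:
                chosen.append((mz, intensity))
                proteins_used.add(protein)
                break

    # 2. Exclusion comb, then fill the remaining slots by intensity.
    if len(chosen) < top_n:
        remaining = [
            peak for peak in scan
            if peak not in chosen
            and not any(abs(peak[0] - ex) <= tol for ex in exclusion)
        ]
        remaining.sort(key=lambda p: -p[1])
        chosen.extend(remaining[:top_n - len(chosen)])
    return chosen
```

After each fragmentation and identification, the caller would update both lists: remove the inclusion masses of ruled-out proteins and append the peptide masses of newly identified proteins to the exclusion list.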

[00113] INTELLIGENT INCLUSION In addition to including peptide masses on the inclusion list for protein validation, one can also utilize the ProteinMiner software to predict other proteins likely to be found within the same sample and place their constituent peptide masses on the inclusion list for preferential selection. The APPE can be extended to include this modification. When a protein is identified at some desired confidence (e.g., two peptides), ProteinMiner can be used to predict the set of possible interacting proteins. Each protein can be trypsinized in silico, and the masses of the peptides can be calculated and added to the inclusion list. One can model this method by substituting a large set of known interacting proteins into the simulation set, some at low abundance. Therefore, when one of these proteins is identified, the others can be preferentially found in subsequent scans based on the masses being present on the inclusion list.

[00114] Exemplary hardware suitable for implementing the software includes an LC-MS/MS system consisting of a Waters SYNAPT G2 High Definition Mass Spectrometer (HDMS) quadrupole time-of-flight (Q-TOF) mass spectrometer. The SYNAPT G2 HDMS system can be operated in data-dependent mode like traditional ESI-LC/MS/MS mass spectrometers or in Mass Spectrometry of Everything (MSE) mode. To operate the instrument in real time, one can focus on the traditional data-dependent acquisition (DDA) mode. In DDA mode the MS duty cycle consists of an MS scan followed by selection of a set number of precursor ions, typically 6 to 12, for MS/MS fragmentation. This has the drawback of sampling only the most abundant ions and missing the ions present in lower amounts. This can result in a significant loss of potential data on the lower abundance peptides in the sample. Waters Real Time Databank Searching (RTDS) allows proteins to be identified as MS/MS spectra are acquired using rapid MassLynx database searching. Since 1-2 peptides usually are sufficient for protein identification, it is advantageous to focus on peptides from different proteins. Once proteins are identified, RTDS prevents subsequent peptides from the same protein from being selected for MS/MS. This is a suitable system for the protein-based exclusion algorithm. The program includes an interface to the data acquisition buffers and user-controlled ion selection. The APPE is the rapid protein identification system (to replace the database search).

[00115] It is important to note that the SYNAPT G2 typically only collects fragmentation data at high mass accuracy, so one can take advantage of this by implementing the label-free version of Identifier described above. One can analyze the samples from complex systems and compare the results to those from traditional search and from other platforms, such as the Thermo LTQ-Orbitrap. One can further verify the sensitivity of the system by spiking in very low amounts of a ¹³C-labeled test protein. In those cases in which a peak is in local proximity to an intense peak and cannot be chosen for fragmentation, one can place it back onto the inclusion list for later selection.

PART 2

[00116] An attractive alternative to conventional methods of identification is metaproteomics, where liquid chromatography-tandem mass spectrometry (LC-MS/MS) is used to recognize tryptic peptides that are uniquely characteristic of a particular microbial species. Typical MS instrumentation and analysis software favor the most abundant peptides in a sample, leading to highly redundant identification of the few most highly expressed (and often uninformative) proteins. Consequently, rare organisms and their gene products are unlikely to be detected amongst the abundant host proteins in the sample.

[00117] Herein is described an approach to LC-MS/MS proteomics to enhance the identification of organisms in complex samples. Typical instruments are limited in their ability to exclude peptides from repeated fragmentation, but one can extend this paradigm to dramatically increase performance. Informatic tools can be used to rapidly and confidently identify and quantify peptides and their parent proteins from high-resolution mass spectrometry data. One can identify peptides in real-time, excluding all other possible peptides of the parent protein from subsequent analysis (dynamic protein exclusion). This facilitates comprehensive and rapid identification of bacterial proteomes and their antibiotic resistance patterns. For more complex samples, real-time identification of peptides from the human host or specific bacterial species permits exclusion of the remaining proteome of that species as non-informative and results in identification of peptides from other species (dynamic proteome exclusion). Finally, real-time interrogation of biologic pathways can be used to inform subsequent peptide selection on-the-fly. Herein is described an approach to bacterial identification based on recognizing bacterial species and their resistance patterns in real-time and focusing LC-MS/MS instrumentation on unidentified components of the sample. Modeling demonstrates that pathogenic bacteria amid a complex array of other bacterial and human proteins can be conclusively identified with high confidence in a single, one-hour experiment using current instrumentation running the software described herein.

[00118] Software to rapidly and confidently identify and control selection of peptides during mass spectrometry - a single automated informatics workflow that rapidly and confidently identifies peptides and proteins from high-accuracy mass spectrometry data. 
The tool is based on a pipeline of software for direct peptide identification (Identifier), validation of peptide assignment (Validator), and peptide and protein quantitation (Quantitator). Algorithms for dynamic protein and proteome-based peptide inclusion and exclusion are described herein. One can build software to utilize real-time peptide identification to dictate peaks picked for fragmentation and model the benefits of protein and proteome exclusion for the identification of specific bacteria and their resistance patterns in complex biologic samples.

[00119] Algorithms for dynamic protein and proteome exclusion on a high-resolution mass spectrometer. Implementation of protein exclusion on a high-performance mass spectrometer. One can demonstrate dynamic peptide exclusion using a prepared sample of human proteins and a single bacterial species. Using B. subtilis and F. tularensis as model proteomes, one can demonstrate identification of the bacterial species and its antibiotic resistance pattern from a complex biologic specimen during a one-hour run. One can demonstrate the capabilities with mixtures of known bacteria mixed with human proteins.

[00120] Algorithms for real-time biologic pathway analysis with rapid knowledge integration to improve peptide selection during mass spectrometry - a complex query engine (Interrogator) capable of interrogating multiple knowledge bases simultaneously to make real-time predictions to inform protein inclusion and exclusion. In addition to extending dynamic range, these algorithms can facilitate the selection of highly relevant proteins, in effect creating a system that "learns" during the run and applies the knowledge to future peptide selections.

[00121] Both globally and in the US, infections with multi-drug resistant organisms (MDROs) such as MRSA, VRE and others that fail to respond to most available antibiotics are increasing dramatically. MDROs are particularly common in healthcare-associated infections, thus affecting high-risk populations, resulting in prolonged hospitalizations, debilitation and death (Figueiredo, 2008; Giamarellou, 2010; Chan-Tompkins, 2011; Kallen, et al., 2010; Kallen, 2010; Kumarasamy, et al., 2010; McGath, et al., 2010; Perez, et al., 2010; Pfeifer, et al., 2010; Woodford, et al., 2011). Among bacterial pathogens, antimicrobial resistance in Gram-negative bacteria has become especially problematic. Failure to act decisively during the short window of opportunity to treat the organism leads to untreatable bloodstream infections (bacteremia), sepsis, and rapid death. In turn, empiric use of antibiotics only exacerbates the problem, selecting for further drug resistance. The most commonly encountered highly resistant Gram-negative pathogens are described in Table 2. Given the emergence of such highly resistant bacteria, the need for rapid and accurate determination of the species and antimicrobial resistance pattern of a clinical isolate is becoming increasingly important.

[00122] Thus, the timely and accurate determination of antimicrobial susceptibility of a clinical isolate is essential for the optimal antimicrobial therapy of infected patients (Fluit, et al., 1999; Fluit, et al., 2000; Fluit, et al., 2001). Current methods to determine identity and resistance patterns of pathogenic organisms require a 5 to 10 ml sample of blood obtained from the patient and inoculated into aerobic and anaerobic bacterial growth media. Once bacterial growth is recognized, a sample of the culture medium is removed and a Gram-stain is performed. Based on the staining and appearance of the growing organism (e.g. Gram-positive cocci, Gram-negative rods), the isolate is plated onto selective media or injected into biochemical cards that are used for identification of the organism. After the species of a pathogenic bacterium is determined, growth of the organism is assessed in the presence of a selected panel of antimicrobial agents. Those agents that effectively suppress bacterial growth in vitro are typically used to treat infections in vivo. The time period between obtaining the blood sample, identification of the infecting organism, and evaluation of antibiotic sensitivity pattern is typically three to five days, but may reach seven to ten days for common bacteria, and may be weeks for slow growing bacteria, such as the mycobacteria. This delay ensures that many patients will receive overly broad and/or inadequate empiric antimicrobial therapy, which compromises a patient's care, and leads to increased cost and risks of toxicities. In turn, promiscuous use of broad-spectrum antibiotics often acts at cross-purposes to the desired outcome.

[00123] Even where pathogen identification is straightforward, rapid determination of a bacterium's antimicrobial resistance pattern remains problematic. PCR-based detection of bacterial resistance genes is well established, but prediction of antimicrobial resistance suffers from several shortcomings: identification of resistance genes requires accurate recognition of the pathogen; mechanisms of resistance vary widely between bacterial species; in many cases the number of different genes makes generating an assay too costly; correlations between genomic findings and resistance phenotype are variable; and resistance may arise as a consequence of reduced or increased expression of normal bacterial genes. In addition, proper quality control for molecular assays poses a problem for many laboratories, and this results in questionable results at best. Metagenomics based on massively parallel DNA sequencing offers a powerful tool that may partly address this challenge, but DNA-based prediction of antimicrobial resistance will remain limited and has yet to show broad clinical value.

[00124] In efforts to shorten the time period between isolation of a pathogen and administration of appropriate therapy, investigators are developing rapid methods for identifying pathogens and determining their antimicrobial resistance patterns. Complementary to nucleic acid-based approaches, various forms of mass spectrometry have been used. Indeed, MALDI-TOF mass spectrometry instruments have entered diagnostic laboratories as tools for "rapid" identification of bacterial pathogens (Kok, et al., 2011). In addition to the need for organisms to grow in culture, another limitation of the methodology is that only the most abundant peptide ions are used for signature recognition from a bacterial colony. In addition to being unsuited to analysis of complex samples such as blood or tissue, the method is prone to misidentification and cannot detect virulence factors. 
Metaproteomics using LC-MS/MS has the potential to identify low abundance proteins and rare organisms in complex samples. The ability to follow specific peptides permits the differentiation of closely related organisms and can enable determination of expression of specific virulence genes, resistance factors, or other features.

[00125] The dynamic range of LC-MS/MS mass spectrometry experiments can be improved by introducing real-time peptide identification and subsequent on-the-fly protein and whole proteome exclusion. This is useful for the in-depth interrogation of complex samples required for metaproteomics. Certain embodiments of the present invention are proteomics devices for the hospital lab exploiting protein and proteome exclusion for automated and in-depth coverage of complex samples such as blood or tissue. These can facilitate rapid pathogen detection and characterization of antibiotic resistance patterns, dramatically decrease the response times for initiating appropriate antibiotic coverage, and result in decreased morbidity and mortality, lower rates of multi-drug resistant pathogen infection, and significant cost savings.

[00126] Microbial identification from complex samples. Current techniques do not permit rapid and comprehensive microbial identification from complex samples, such as blood, tissue, exudates, secretions, stool, or other patient samples which may include multiple benign microorganisms along with host cells and proteins that can obscure identification of the pathogen(s). Antibody- and PCR-based methods are sensitive and specific, but limited to a small number of specific markers and thus can only identify targeted species and known variants. Total nucleic acid sequencing (metagenomics) offers potentially comprehensive analysis with sufficient sensitivity and specificity, and may detect low abundance agents, but requires extensive sample handling and informatic analysis to obtain results. Current mass spectrometry methods are perhaps the least attractive option, insofar as the dynamic range of the approach is far too low to offer needed sensitivity or specificity, let alone detection of low-abundance organisms.
Even using state-of-the-art LC-MS/MS instrumentation and informatics, it remains impractical to detect bacteria in blood at a concentration compatible with a living patient. Successful implementation of a rapid, sensitive and specific proteomic assay that can identify multiple organisms at a wide range of abundance within a complex sample immediately advances metaproteomics to the forefront of emerging tools for clinical microbiology.

[00127] Toward in-depth proteomics. Because the conventional data-dependent MS/MS approach selects the most abundant ions for fragmentation, the results are skewed towards identification of abundant proteins. In fact, it is not uncommon for dozens of peptides from a single abundant protein to be identified. Thus, the mass spectrometer spends time identifying the same protein over and over again at the expense of missing low-abundance proteins. As a result, 20,000 peptide fragmentation events may result in only 500 protein identifications. Being able to control the MS during the run, perform rapid peptide and protein identification, and dictate which ions should or should not be selected for fragmentation can significantly improve dynamic range. This approach preferably uses real-time, on-the-fly peptide identification.

[00128] Previously, the computation required to perform analyses was far too slow for real-time implementation. Therefore, analysis could only be achieved off-line, after data had been collected from the mass spectrometer. Recent advances in mass spectrometry instrumentation and computing speed have made real-time analysis of MS data feasible, as shown by recent studies from the Mann, Coon and other labs (Graumann, et al., 2012; Bailey, et al, 2012). However, while these real-time methods yield results far faster than conventional off-line analysis, their performance overall has not offered even two-fold improvements in protein identification or other metrics.

[00129] With feasibility of confident, on-the-fly peptide identification already demonstrated, one can take full advantage of real-time data analysis to improve the efficiency and dynamic range of tandem mass spectrometry. Elements of a computational pipeline can be used to dramatically enhance LC-MS/MS performance. One can take advantage of Identifier, a high-performance peptide and protein identification tool, that can perform confident, on-the-fly peptide identification from high-resolution data. Once multiple peptides unique to a specific protein are in hand, one can perform data-dependent protein exclusion, based on the rationale that once a protein is identified, further peptides likely to derive from that protein are non-informative. This strategy can increase the speed and depth of protein identification in LC-MS/MS by at least ten fold over "conventional" real-time methods.
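The data-dependent protein exclusion rationale above can be illustrated with a minimal sketch. It assumes a precomputed table of theoretical peptide masses per protein and a threshold of two unique peptides; the class name, its methods, and the 10 ppm matching tolerance are hypothetical choices made for illustration and are not taken from the software described herein.

```python
from collections import defaultdict


class ProteinExclusionTracker:
    """Toy model of data-dependent protein exclusion (illustrative only)."""

    def __init__(self, peptide_masses_by_protein, threshold=2, tol_ppm=10.0):
        # protein name -> list of theoretical peptide masses (assumed input)
        self.peptide_masses_by_protein = peptide_masses_by_protein
        self.threshold = threshold      # unique peptides needed to call a protein
        self.tol_ppm = tol_ppm          # assumed matching tolerance
        self.identified = defaultdict(set)
        self.excluded_proteins = set()
        self.excluded_masses = []

    def record_identification(self, protein, peptide):
        """Count a unique peptide; return True when the protein becomes excluded."""
        if protein in self.excluded_proteins:
            return False
        self.identified[protein].add(peptide)
        if len(self.identified[protein]) >= self.threshold:
            # Protein is considered identified: exclude all of its peptide masses.
            self.excluded_proteins.add(protein)
            self.excluded_masses.extend(self.peptide_masses_by_protein[protein])
            return True
        return False

    def is_excluded(self, mass):
        """True if a precursor mass falls within tolerance of an excluded mass."""
        return any(abs(mass - m) / m * 1e6 <= self.tol_ppm
                   for m in self.excluded_masses)
```

A controller could call `is_excluded` on each candidate precursor before selecting it for fragmentation. A linear scan is used here for brevity; a sorted structure would be preferable for large exclusion lists.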

[00130] Extending this paradigm to metaproteomics, one can use peptides unique to a specific proteome to confidently identify a component organism and then exclude other peptides that would be informationally redundant, leading to data-dependent proteome exclusion. As such, even highly similar microorganisms can be distinguished in a complex sample, based on detection of unique peptides. Further, unanticipated components, such as plasmid borne resistance factors, can be readily detected.

[00131] By these innovative approaches, one can identify and quantify thousands of proteins and/or tens of organisms from complex samples, yielding a dramatic improvement in speed and accuracy of proteomics, permitting MS-based clinical decision-making within hours of sample collection.

[00132] Peptide and protein identification paradigm. Mass spectrometry (MS) is well matched to proteomics as a primary analytical tool. Each unmodified or modified amino acid has a characteristic mass, and a typical commercial mass spectrometer can measure the mass-to-charge ratio (m/z) with high precision and resolution for ions over the range of 100 to 10,000 Daltons. Typically, proteins are digested with trypsin and the resulting Lys- and Arg-terminal peptides are separated by reverse-phase liquid chromatography (LC). The eluent is injected into a tandem mass spectrometer and peptides are ionized by electrospray ionization (ESI), yielding doubly or triply charged peptide ions of five to twenty-five residues. Peptides are selected for MS/MS and fragmented via collision-induced dissociation (CID) to create nested series of amino-terminal (b-ion) and carboxyl-terminal (y-ion) fragments separated by the mass of the amino acid residues.
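As an illustration of the nested b- and y-ion series described above, the following sketch computes singly charged fragment m/z values for an unmodified peptide. The residue masses are standard monoisotopic values rounded to five decimals; the function name is hypothetical, and modifications and higher charge states are deliberately ignored.

```python
# Monoisotopic residue masses in Daltons (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O
PROTON = 1.007276   # charge carrier


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1), from the amino terminus
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1), from the carboxyl terminus
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions
```

Adjacent members of each series differ by exactly one residue mass, which is the property that lets a fragmentation spectrum be read as a sequence ladder.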

[00133] Current approaches to comparative/quantitative proteomics. Although challenging, quantitation of proteins by conventional LC-MS/MS is of considerable interest, in that mRNA expression is often a poor predictor of protein abundance (Gygi, et al., 1999). Comparison to standards and/or detection of differences between samples after stable-isotope labeling remains the preferred approach. As heavy isotope-labeled peptides co-elute with their unlabeled partners, a direct comparison of ion counts between the heavy and light forms of each peptide results in a reliable measure of relative abundance. Several strategies facilitate consistent differential labeling with isotopic tags, including trypsin-catalyzed ¹⁸O exchange (Stewart, et al., 2001; Bonenfant, et al., 2003; Heller, et al., 2003; Miyagi, et al., 2007) (Fig. 6) and stable isotope labeling with ¹³C- and ¹⁵N-labeled lysine and/or arginine amino acids in cell culture (SILAC) (Barnes, et al., 2003; Bonardi, et al., 2002; Amanchy, et al., 2005; Abba, et al., 2007; Andersen, et al., 2006). These carboxyl-terminal-labeling strategies result in mixtures of pairs of chemically identical, but isotopically distinct, peptides. The unlabeled and stable isotope-labeled peptides co-elute as pairs during LC-MS/MS, yielding isotopic envelopes offset by 4-10 Da in the MS scan (Fig. 7). Informatic analysis is used to compare the intensity of the isotopic forms to quantify relative abundance (e.g., Mason, et al., 2007; Wang, et al., 2006).

[00134] Current approaches to peptide identification. The prevailing peptide identification technology (reviewed in Nesvizhskii, et al., 2007) requires the measured fragmentation spectrum be compared by an automated pattern-matching algorithm to a database of species-specific peptide masses and their theoretical MS/MS fragmentation.
Combining these approaches with state-of-the-art mass spectrometers that offer high scan speeds (>20 MS/MS scans per second) and high mass-accuracy (<1 ppm) such as the Orbitrap Velos (Thermo) and SYNAPT G2S High Definition Mass Spectrometer (HDMS, Waters), confident identification of hundreds of proteins can be obtained from a single sixty minute experiment.

[00135] Several well-known limitations to standard database search approaches conspire to decrease the yield of identified peptides, as the majority of ions selected for fragmentation fail to lead to a confident identification. Background noise from other peptide fragments, poor CID efficiency, and incorrect pattern matching all contribute to low specificity, and so it is useful to calculate a false-discovery rate based on the score-frequency distribution of random peptides. A number of authors (Ulintz, et al., 2006) have suggested practical score "cut-off" thresholds for automated acceptance of database search results, but peptides with scores below the cut-off may be correct and those above incorrect.

[00136] But by far the most important limitation on yield of peptide identifications is that commercial LC-MS/MS systems automatically select the most intense ions in the MS spectrum for fragmentation. Since peptides from high-abundance proteins will be subjected to fragmentation multiple times, many of the selected peptides will derive from a few abundant proteins, and low abundance peptides are never selected and identified. Dynamic (peptide) exclusion is a standard approach to increasing dynamic range by reducing redundant selection of ions with a particular monoisotopic mass, but this strategy does not solve the problem of multiple peptides deriving from a single abundant protein. Consequently, medium- and low- abundance proteins are often not identified, even though they may have yielded multiple detectable ions in the MS spectrum. Analysis of the output from a published dataset (Cox, et al, 2007) shows that of 378 identified proteins, the top 17 (5%) accounted for 25% of the matched spectra, with half of the spectra being used to identify only 14% of the proteins.

[00137] Potential for increased dynamic range with protein exclusion

Significantly improving dynamic range preferably employs the real-time identification of proteins, alleviating repeated fragmentation of peptides from already-identified proteins, while allowing fragmentation of peptides not yet assigned to a protein. Advances in processing power have resulted in several orders-of-magnitude improvement in computing speed, making real-time analysis of MS data feasible as shown by several recent studies (Graumann, et al., 2012; Bailey, et al, 2012).

[00138] Potential for enhanced identification with precision proteomics. State-of-the-art mass spectrometers (Thermo Orbitrap, Waters SYNAPT QTOF) can now achieve high mass-accuracy (1 ppm) at both the MS and MS/MS level. The use, abuse, and underuse of high mass-accuracy/high resolution mass spectrometry data in peptide identification has been discussed (Gorschkov, et al., 2005; Mann & Kelleher, 2008). Mass-accuracy and resolution directly contribute to peptide identification by constraining both the precursor ion charge state and the monoisotopic m/z, thereby limiting the range of possible matches. In the case of the bacterium Francisella tularensis, the proteome consists of 1603 predicted proteins, leading to 78,279 tryptic peptides with 64,306 unique masses. This gives an average of only 2 peptides per ~50 ppm interval, indicating that accurate mass alone would be sufficient to confidently identify any F. tularensis protein.

[00139] Applications to identification of bacteria. Rapid identification of specific bacterial species in a complex sample depends on successful detection of species-specific peptides and proteins. Informatic analysis of bacterial proteomes demonstrates that even highly conserved proteins, such as ribosomal proteins, are likely to differ in amino acid sequence in several of the peptides and could therefore be used to identify bacterial species using mass spectrometry. Since the mass spectrometer detects peptides based on their mass-to-charge ratio, and state-of-the-art instruments such as the Orbitrap and SYNAPT G2S Q-TOF can achieve 1 ppm accuracy or 0.0001% of the peptide mass, a sequence difference of one amino acid or a modification in a peptide is easily detectable.

[00140] Current proteomics workflows can include a list of peptides to target (include) or ignore (exclude) during the mass spectrometry run (Bailey, et al., 2012; Hoopmann, et al., 2009), but this information is typically obtained beforehand, effectively doubling the analysis time and cost. Herein an informatic approach to identify information-rich peptides that can positively identify a species or family of bacteria is described. One can perform peptide identification in real time and control the selection of peptides for fragmentation based on the preceding identifications and the information contained in each peptide. One can interrogate available knowledge bases to further inform peptide selection, thus dramatically increasing the dynamic range of detection. By selectively excluding redundant, low information peptides and by including those peptides likely to yield high-value protein identifications, one can rapidly and confidently identify single pathogenic bacterial species and their antibiotic resistance profiles from complex biologic samples.

[00141] Software to rapidly and confidently identify and control selection of peptides during mass spectrometry. A single automated informatics workflow that rapidly and confidently identifies peptides and proteins from high-accuracy mass spectrometry data.

[00142] Real-time proteomics preferably includes accurate identification and quantitation without manual validation or post-run statistical analysis, while identifying peptides over the full dynamic range of the >25,000 peptides during a single 90-minute LC-MS/MS run. Enabling development of a real-time workflow are the spectral deconvolution software Validator (Volchenboum, et al., 2009) and software for direct peptide identification (Identifier) and relative quantitation (Quantitator, Fig. 8). These software packages exploit the embedded information from stable isotope labeling. Briefly, isotopic peptide pairs are identified directly from the precursor (MS) scan and Validator deconvolutes the fragmentation spectra, identifying potential b- and y-ions. Identifier relies on the high-accuracy precursor mass and the Validator-assigned potential b- and y-ions to rapidly and confidently assign a peptide sequence selected from a mass-sorted species-specific tryptic database. Quantitator then calculates the peptide pair ratio. Each step occurs considerably faster than the mass spectrometer can fragment a new peptide, making it feasible to generate inclusion and exclusion peptide lists for subsequent scans in real-time as described below.

[00143] Identification of peptide pairs and potential b- and y-ions (Validator)

Carboxyl-terminal stable isotope-labeling methods (SILAC, ¹⁸O exchange) result in a mixture of pairs of chemically identical, but isotopically distinct, peptides that co-elute from the HPLC as pairs that are readily resolved by the MS and identified by Validator (Fig. 7). Raw data files are converted to mzXML using ReAdW.exe followed by the extraction of monoisotopic masses using the Horn Mass Transform algorithm (Horn, et al., 2000) within Decon2LS. The "light" and "heavy" fragmentation spectra are compared, and b-ions and y-ions are identified as having the same m/z in both scans (non-shifting) or having a mass difference corresponding to the isotope used (shifting), respectively, resulting in a set of ion pairs for each scan window.
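The comparison of "light" and "heavy" fragmentation spectra described above can be sketched as follows. This is a simplified illustration, not the Validator implementation: it assumes centroided, singly charged fragment m/z lists, a fixed absolute tolerance, and a known label shift supplied by the caller (about 4.0085 Da for exchange of two ¹⁸O atoms). The function name and the tolerance value are hypothetical.

```python
def classify_fragments(light, heavy, shift, tol=0.01):
    """Split light-spectrum peaks into non-shifting and shifting candidates.

    light, heavy: lists of fragment m/z values (assumed singly charged).
    shift: mass difference introduced by the carboxyl-terminal label.
    tol: absolute m/z tolerance (an assumed value for this sketch).
    Returns (non_shifting, shifting): candidate b-ions and y-ions.
    """
    non_shifting, shifting = [], []
    for mz in light:
        if any(abs(h - mz) <= tol for h in heavy):
            # Same m/z in both spectra: fragment lacks the labeled C-terminus.
            non_shifting.append(mz)
        elif any(abs(h - (mz + shift)) <= tol for h in heavy):
            # Offset by the label mass: fragment retains the C-terminus.
            shifting.append(mz)
        # Peaks matching neither pattern are treated as background.
    return non_shifting, shifting
```

Peaks that fall into neither class are discarded, which is how this kind of pairing filters background fragment ions before any database comparison.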

[00144] Validator was tested on output from the conventional database search engine, Mascot. Validator identified potential b- (non-shifting) and y- (shifting) ions from the fragmentation spectra and compared these to the b- and y-ions calculated from the Mascot peptide sequence (Volchenboum, et al., 2009). In a complex yeast sample, Validator analysis confirmed the identities of 89% of peptides found through traditional database search and post-processing with Peptide and Protein Prophet. Validator also identified potentially valid low-scoring peptides that would otherwise have been discarded, increasing both sensitivity and specificity.

[00145] Direct peptide identification (Identifier). Direct peptide identification software, Identifier, was designed to use the accurate mass of a peptide pair member to identify a range of candidate peptides from a mass-sorted species-specific tryptic database of the proteome(s) of the organism(s) of interest. Each measured experimental mass is compared to the database to identify peptides within a close range (e.g., +/- 10 ppm), and the b- and y-ions from each peptide sequence are compared to the potential b- and y-ions identified by Validator (Fig. 8A). Each potential match is scored according to the number of matching shifting and non-shifting ions, along with a metric to include the number of consecutive matches. The threshold score for each match is determined by comparing the score to a distribution of scores from 1000 randomly generated peptides of similar mass and composition. The 99% cutoff score determines which peptide (if any) is the "winner." Identifier was tested on a yeast whole cell lysate digest expected to contain around 5000 proteins. Identifier identified 1,700 proteins and found 80% of "high quality" Mascot identifications (minimum 2 peptides with 95% Peptide Prophet score, 99% Protein Prophet score). Using a published dataset of high-quality data (Cox, et al., 2008), Identifier was rapidly able to identify 95% of the proteins found through a traditional database search.
These results indicate that reliable peptide identifications can be obtained using only the mass and inferred b- and y-ions, demonstrating the feasibility of real-time mass spectrometry.

[00146] Quantitator. Relative quantitation using trypsin-catalyzed ¹⁸O exchange involves directly comparing the "light" and "heavy" peptide peaks at the MS level (Fig. 7). Our quantitation software, Quantitator, uses the peptide sequence assigned by Identifier to calculate an expected isotope distribution via the isotope pattern calculator (IPC) module (Nolting, et al., 2005). The fit of the experimental spectra to the theoretical model is then calculated (Ramos-Fernandez, et al., 2007) to yield a "fit score," which identifies the most informative scans for accurate differential quantitation (Fig. 8B). The extent of ¹⁸O exchange is calculated from the fit, allowing for correction of quantitative values for incompletely labeled samples. Quantitator was tested on data from a series of ¹⁸O-labeled standards (90% purity) mixed with unlabeled sample at ratios from 1:20 to 20:1. Quantitator showed high inter-sample correlation (r=0.91) and tolerance for incomplete labeling.
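The candidate lookup attributed to Identifier in paragraph [00145], in which a measured mass is compared against a mass-sorted tryptic database within a ppm window, can be sketched with a binary search. The function names and the data layout (a list of mass and sequence tuples) are assumptions made for illustration; the scoring of candidates against Validator-assigned ions is not shown.

```python
import bisect


def build_index(peptides):
    """Sort (mass, sequence) tuples by mass and return (masses, entries)."""
    entries = sorted(peptides)
    masses = [mass for mass, _ in entries]
    return masses, entries


def candidates_within_ppm(masses, entries, measured_mass, tol_ppm=10.0):
    """Return sequences whose mass lies within +/- tol_ppm of measured_mass."""
    delta = measured_mass * tol_ppm / 1e6
    lo = bisect.bisect_left(masses, measured_mass - delta)
    hi = bisect.bisect_right(masses, measured_mass + delta)
    return [seq for _, seq in entries[lo:hi]]
```

Because the database is sorted once up front, each lookup costs only two binary searches, which is consistent with the requirement that identification keep pace with acquisition.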

[00147] Unfinnegan. A set of C libraries to provide access to the raw data contained within the file generated by the Thermo MS has been designed. As the conversion of this file to an open-source consumable format generally requires a proprietary set of libraries, the availability of a fast algorithm for accessing the raw data is essential to ensure reliable pair picking and subsequently analysis steps.

[00148] Software speed: Software was written in Python 2.7 or Perl 5.1 and run on standard laptop and desktop hardware. For a 200 megabyte raw Orbitrap file, conversion to mzXML occurs in under 10 sec. The pair-picking algorithm finds and corroborates all potential pairs from the ~20,000 scans in the resulting 500,000 line mzXML file in under 3 minutes. Using this database, Identifier can check over 2500 potential peptides in < 4 min, or < 100 ms per match.

[00149] Certain embodiments of the software report an extensive array of metrics describing the isotopic pairs, how they were scored, and how the algorithm matched a peptide to the spectrum. Unfinnegan and Quantitator can be integrated into the analysis pipeline, so that the pathway can be from RAW MS data to confidently identified peptides and proteins.

[00150] One can validate this approach against well-curated and searched data sets, for example by using a large set of 72 MS runs from normal human HeLa cells (Cox, et al., 2007). Analyzing representative sections of these data using traditional search methods such as Mascot, X! Tandem, and Scaffold provides a benchmark against which to compare the performance of our software. In addition, one can generate complex mixtures of proteins of known composition and quantity in order to accurately model the false-discovery rate of the methods as well as the accuracy of the quantitation. In certain embodiments, it has been found that low-pass filtering can have a dramatic effect on the system's sensitivity and specificity.

[00151] For shorter peptides or when MS2 fragmentation data has low signal-to-noise ratio, it is conceivable that the peptide score is not high enough to declare one peptide candidate as the "winner." Additionally, it is conceivable that some peptides will exist in only the light or heavy form or that only one of these will be fragmented. Indeed, in looking through the small list of peptides found by Scaffold but not by Identifier, we see both of these scenarios. Nevertheless, most proteins have enough peptides so that others from the same protein are fragmented and identified.

[00152] As protein exclusion is database-dependent, it is subject to unexpected contaminants, unannotated peptides from novel splicing events or modifications, and other potential confounding features. Any of these unaccounted-for peptides might be selected for fragmentation using the approach described herein. However, the Identifier algorithm is extraordinarily robust and highly tolerant of this type of "contamination" and only rarely reports a false positive identification. The method derives its specificity through two rigorous physical filters, first by differentiating shifting and non-shifting ions by comparing light and heavy fragmentation patterns, and second, by scoring the theoretical fragmentation patterns of similarly-sized tryptic peptides and comparing the categorized ions to the experimentally-derived deconvoluted spectrum. This strategy can eliminate a large number of potential errors that confound typical database search algorithms that cannot differentiate between b-type, y-type and background fragment ions in fragmentation spectra.

[00153] Algorithms for dynamic, data-dependent protein and proteome-based peptide inclusion and exclusion.
One can build software to utilize real-time peptide identification to influence the peaks picked for fragmentation and model the benefits of protein and proteome exclusion for the identification of specific bacteria and their resistance patterns in complex biologic samples. Because different proteins are conserved to varying degrees, one can model the specificity of peptides across a large number of bacterial species. Based on these studies, one can identify which peptides are sufficient to identify single bacterial strains (bacteriotypic) or a bacteria family (family-typic). Furthermore, one can derive lists of bacterial resistance genes and develop a database of unique peptides for inclusion. By eliminating large groups of peptides from selection for fragmentation, the mass spectrometer is able to select lower abundance, less common peptides for fragmentation and identification.

[00154] Estimation of high-information peptides. Fifteen bacterial species/strains from the NIAID "High Potential of Bioengineering" (Chen, T., Yu, W-Han, Izard, J., Baranova, O.V., Lakshmanan, A., Dewhirst, F.E. (2010) The Human Oral Microbiome Database: a web accessible resource for investigating oral microbe taxonomic and genomic information. Database, Vol. 2010, Article ID baq013, doi: 10.1093/database/baq013) were trypsin-digested in silico and the masses of peptides with more than 8 residues were cross-referenced by species/strain. The results were mapped onto a phylogeny tree from the Microbial Rosetta Stone (MRS, The Microbial Rosetta Stone Database: A compilation of global and emerging infectious microorganisms and bioterrorist threat agents by: David J. Ecker, Rangarajan Sampath, Paul Willett, Jacqueline R. Wyatt, Vivek Samant, Christian Massire, Thomas A. Hall, Kumar Hari, John A. McNeil, Cornelia Buchen-Osmond, Bruce Budowle BMC Microbiology, Vol. 5, No. 1. (2005), 19, doi: 10.1186/1471-2180-5-19 Key: citeulike:8901640, Fig. 9), demonstrating a large number of bacteriotypic and family-typic peptides. As expected, the Venn diagram of three Brucella strains (suis, melitensis and melitensis biovar Abortus) shows high homology between the peptides, with around 170,000 peptides homologous in at least two of the three strains (Fig. 3, top left). Surprisingly, each of the three highly related strains contains over 16,000 unique peptides that could potentially be used to identify and distinguish the strains/species from each other and from other species tested in this experiment.
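The in silico digestion and cross-referencing described above can be sketched as follows. This is a deliberately simplified model: it cleaves after every Lys or Arg with no missed cleavages, and it cross-references peptides by sequence, whereas the analysis above cross-referenced peptide masses. The function names and the dictionary layout of the proteomes are hypothetical.

```python
from collections import defaultdict


def tryptic_peptides(protein, min_len=9):
    """Cleave after every K or R (simplified rule) and keep long peptides.

    min_len=9 corresponds to "more than 8 residues".
    """
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])   # C-terminal peptide
    return [p for p in peptides if len(p) >= min_len]


def species_specific_peptides(proteomes, min_len=9):
    """Map each species to the peptides found in no other species.

    proteomes: dict of species name -> list of protein sequences.
    """
    owners = defaultdict(set)
    for species, proteins in proteomes.items():
        for protein in proteins:
            for pep in tryptic_peptides(protein, min_len):
                owners[pep].add(species)
    unique = defaultdict(set)
    for pep, species_set in owners.items():
        if len(species_set) == 1:
            unique[next(iter(species_set))].add(pep)
    return dict(unique)
```

Peptides shared by several species remain in `owners` and could be grouped by the set of species that contain them to approximate the family-typic class.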

[00155] An in silico trypsin digestion of the human proteome (Uniprot, release 2011_11) yields 2,257,696 distinct peptides of four or more residues from 73,559 proteins, representing 632,169 unique sequences. Were all of these peptide masses determined and combined into an exclusion comb using a conservative tine width of 10 ppm, it would "mask off" less than 700 Dalton of the 300-2500 Dalton range in a typical precursor (MS1) scan. Contributing to the small size of the mask, peptides consisting of the same amino acids in different orders yield tines that superimpose. Many other peptides yield tines that overlap. To test whether the remaining 1500 Dalton of "open space" in the MS1 scan could be used to find bacterial peptides, an in silico trypsin digestion of the Bacillus subtilis proteome (The complete genome sequence of the Gram-positive bacterium Bacillus subtilis. Nature 390:249-256 (1997), 4188 proteins) was performed. After masking off the whole human proteome, about 69,700 bacterial peptides with masses within 5 ppm of a human peptide would be "combed out." However, 662 B. subtilis peptides would remain available for fragmentation and subsequent identification, of which only a fraction would have to be detected to confirm contamination/infection with B. subtilis. Using the width of 5 ppm offers ~1400 B. subtilis peptides as targets. If one applies a 2 ppm tine width for the exclusion comb, the number of detectable B. subtilis peptides would be nearly 4000 (Table 3). These simulations demonstrate the ability to quickly and accurately detect multiple bacterial species present in a complex sample.
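The exclusion comb described above can be modeled as a set of merged mass intervals. The sketch below interprets the tine width as the full width of an interval centered on each peptide mass, which is an assumption; the function names are hypothetical. Superimposed and overlapping tines are merged, so the masked width grows more slowly than the number of peptides.

```python
import bisect


def merge_tines(masses, tine_ppm=10.0):
    """Build merged [start, end] intervals, one tine per peptide mass."""
    merged = []
    for m in sorted(masses):
        half = m * tine_ppm * 1e-6 / 2.0   # half of the full tine width
        start, end = m - half, m + half
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)   # overlapping tines fuse
        else:
            merged.append([start, end])
    return merged


def masked_width(merged, lo=300.0, hi=2500.0):
    """Total width in Daltons masked by the comb within the scan range."""
    return sum(max(0.0, min(e, hi) - max(s, lo)) for s, e in merged)


def is_masked(merged, mass):
    """True if a mass falls inside any tine of the comb."""
    starts = [s for s, _ in merged]
    i = bisect.bisect_right(starts, mass) - 1
    return i >= 0 and mass <= merged[i][1]
```

Running `is_masked` over a bacterial peptide list with combs built at different tine widths would reproduce the kind of comparison summarized in Table 3, given the corresponding proteome data.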

Table 3 - Generation of human proteome comb and number of detectable B. subtilis peptides

[00156] Using mass, NET and detectability as a means of extracting high-information peptides. While Table 2 depicts the idea of obtaining peptides unique (in terms of mass) to bacterial strains, we can achieve a higher number of inclusion list candidates using mass, Normalized Elution Time (NET) and detectability. Normalized Elution Time is essentially the time (indicated on a scale of 0 to 1 signifying start and end of run) at which the peptide is expected to elute in the run. This value can be predicted using machine-learning algorithms (Petritis et al. Anal.Chem 2006 and US Patent application US 10/846,188, incorporated by reference herein in its entirety) and is often used in traditional proteomics to reduce the number of candidates. Peptide detectability is defined as the probability of a peptide being observed in a run, which is also predicted from amino acid composition (Li et al., J Proteome Research 2010, 9, 6288-6297). Alternatively, proteotypicity is also often used in the field to describe the probability of a peptide being observed (See US Patent 8,501,421 and US Patent application 12/466,045, both incorporated by reference herein in their entirety). We propose a novel system for intelligent inclusion of proteins of interest, wherein machine learning algorithms can be used to predict the "inclusivity" of a bacterial peptide given the peptide's mass, NET and detectability along with a matched (i.e., within mass and NET tolerances) human peptide mass, NET and detectability. Peptides unique to the bacteria under study make up the first layer of inclusion list candidates. In the next layer, bacterial peptides with high detectability as compared to their human counterparts are utilized in the inclusion list. Thus, if an ion of certain mass and NET is observed in the dataset and the corresponding bacterial peptide match has a higher detectability (or proteotypicity) than the human peptide match, then the ion is likely from the bacterial species and hence is of interest.
We hypothesize that the inclusion list built using high-informative peptides will be more effective than using an all-encompassing inclusion list built from all bacterial peptides. Figure 12 illustrates this concept of using detectability, mass and NET for creating better inclusion lists. Theoretically, the intelligent inclusion list algorithms presented here can be used for targeting any protein set (e.g. markers) within a complex background (e.g. serum). Additionally, mass spectrometry control software can be modulated to report MS ion abundance which can be used to calculate effective detectability which can be used to predict inclusivity of ion based on trained algorithms. The inclusivity value can be used to determine whether the ion must be fragmented or not. This opens a new avenue of research for real-time mass spectrometry applications for specifically targeting proteins of interest amidst a complex background.
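The two-layer inclusion list described in paragraph [00156] can be illustrated with a simple rule-based sketch. It is not the trained machine-learning predictor of inclusivity proposed above; it only compares precomputed detectability values directly. The `Peptide` record, the function name, and the mass and NET tolerances are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Peptide:
    seq: str
    mass: float            # monoisotopic mass in Daltons
    net: float             # normalized elution time, 0 to 1
    detectability: float   # predicted probability of observation


def build_inclusion_layers(bacterial, human, mass_tol_ppm=10.0, net_tol=0.05):
    """Return (layer1, layer2) inclusion candidates.

    layer1: bacterial peptides with no human match in mass and NET.
    layer2: bacterial peptides more detectable than every matched human peptide.
    Tolerances are assumed values for this sketch.
    """
    layer1, layer2 = [], []
    for b in bacterial:
        matches = [
            h for h in human
            if abs(h.mass - b.mass) / b.mass * 1e6 <= mass_tol_ppm
            and abs(h.net - b.net) <= net_tol
        ]
        if not matches:
            layer1.append(b)
        elif all(b.detectability > h.detectability for h in matches):
            layer2.append(b)
    return layer1, layer2
```

A learned model could replace the direct detectability comparison in the second branch while leaving the layering unchanged.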

[00157] Simulation of dynamic protein exclusion One can create a comprehensive complex simulation environment to model LC-MS/MS to study determinants of dynamic range. Five thousand proteins are chosen at random from the CCDS database (build 9606) of human proteins and assigned a random "abundance" over a wide dynamic range. Each protein is trypsinized in silico, and the mass of each peptide is calculated and assigned a random "ionizability." "Intensity" of each peptide peak is the product of the abundance and the ionizability. Peptides appear as a single m/z, representing the monoisotopic mass of a singly charged ion, and are assigned a random scan number. Each peptide is programmed to elute over 30-180 seconds with a triangular profile. A "scan" is then generated every second for a 120 min run.

[00158] Implementing a simple "top 5" approach, the scans are successively parsed, and the five most intense peaks are always chosen for fragmentation. Using an FDR of 5% (5% incorrect identifications) and a requirement for two peptides to identify a protein, the simulator identified about 800 proteins in a 2-hour run. As shown in Fig. 10A, identified peptides are greatly skewed to the highest "intensities" (blue dots). To simulate standard dynamic peptide exclusion, as implemented by Thermo, Inc., one can create an exclusion list where each selected ion mass is added to a list of up to 500 ion masses and remains there for 180 scans, preventing any re-selection of a similar peptide. Roughly 1500 proteins are identified over the 2-hour run, but the mean intensity of selected peptides remains high (Fig. 10B). Using a dynamic protein exclusion algorithm, once two unique peptides are found and identify a protein, all other peptides from that protein are added to an unlimited exclusion list. Implementing this method had a dramatic effect (Fig. 10C). Even early in the run, peptides with lower intensities are selected (blue), leading to a significant increase in identifications to ~4500 of the 5000 proteins (90%, black). Increasing the FDR and the width of the exclusion mass tolerance had little effect on the number of proteins identified (data not shown). These data confirm that exclusion of non-informative peptides has the potential to dramatically increase protein dynamic range and yield during LC-MS/MS.
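The comparison between plain "top N" selection and dynamic protein exclusion can be illustrated with a much smaller toy model. The following is a minimal sketch only, not the simulator described above: the function name `simulate`, the protein and scan counts, the rectangular (rather than triangular) elution window, and the assumption that every fragmentation yields a correct identification (no FDR) are all simplifications introduced here for illustration.

```python
import heapq
import random
from collections import defaultdict


def simulate(protein_exclusion, n_proteins=300, peptides_per_protein=6,
             n_scans=600, top_n=5, min_peptides=2, seed=1):
    """Return the number of proteins identified in one simulated run."""
    rng = random.Random(seed)
    peptides = []  # (protein id, intensity, first scan, last scan)
    for protein in range(n_proteins):
        abundance = 10 ** rng.uniform(0, 6)  # random abundance over six decades
        for _ in range(peptides_per_protein):
            intensity = abundance * rng.random()  # abundance x "ionizability"
            first = rng.randrange(n_scans)
            peptides.append((protein, intensity, first, first + rng.randint(3, 18)))

    seen = defaultdict(set)  # protein id -> indices of its fragmented peptides
    identified = set()
    for scan in range(n_scans):
        # Precursors eluting in this scan; with protein exclusion enabled,
        # every peptide of an already identified protein is masked out.
        candidates = [
            (intensity, idx)
            for idx, (protein, intensity, first, last) in enumerate(peptides)
            if first <= scan <= last
            and not (protein_exclusion and protein in identified)
        ]
        # "Top N": fragment the N most intense eligible precursors.
        for _, idx in heapq.nlargest(top_n, candidates):
            protein = peptides[idx][0]
            seen[protein].add(idx)
            if len(seen[protein]) >= min_peptides:
                identified.add(protein)
    return len(identified)


if __name__ == "__main__":
    print("top-N only:        ", simulate(protein_exclusion=False))
    print("protein exclusion: ", simulate(protein_exclusion=True))
```

Because both runs use the same seed, they see the same peptides; masking peptides of identified proteins can only free fragmentation slots for less intense precursors, so in this toy model the protein-exclusion run never identifies fewer proteins than the plain top-N run.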

[00159] One can generate a complete tryptic bacterial peptidome database, including missed cleavages and multiple charge states, using data from the National Microbial Pathogen Data Resource (NMPDR) (McNeil LK, Reich C, Aziz RK, Bartels D, Cohoon M, Disz T, Edwards RA, Gerdes SY, Hwang K, Kubal M, Margaryan GR, Meyer F, Mihalo W, Olsen GJ, Olson R, Osterman AL, Paarmann D, Paczian T, Parrello B, Pusch GD, Rodionov DA, Shi X, Vassieva O, Vonstein V, Zagnitko OP, Xia F, Zinner J, Overbeek R, Stevens R. Nucleic Acids Res. 2007 Jan;35(Database issue):D347-53.), a database of curated annotations for comparative analysis of genomes and biological subsystems (McNeil, et al., 2007). The peptides can be sorted into low- (highly homologous), medium- (family-typic) and high-information peptides (species/strain specific). When found, the high-information peptides conclusively demonstrate the presence of the corresponding bacterial strain. As each protein is identified according to predefined criteria, the remaining constituent peptides can be added to an ever-growing exclusion list, and future precursors with the corresponding mass will not be subjected to fragmentation. In a similar way, one can develop a database of proteins and peptides that constitute the bacterial resistance genes found in human pathogens. Bacterial resistance can occur by several complicated yet predictable mechanisms, and these are encoded on plasmids within the bacteria. One can use, for example, the data within the Antibiotic Resistance Genes Database (ARDB, Liu B, Pop M. ARDB-Antibiotic Resistance Genes Database. Nucleic Acids Res. 2009 Jan;37(Database issue):D443-7) to develop a list of peptides exclusive to bacteria. If a pathogen is identified, peptides from resistance genes specific to that bacterium can be added to the inclusion list for preferential fragmentation by the MS. If a unique resistance gene is identified first, peptides unique to the parent bacterial species will be added to the inclusion list. In this way, the MS can rapidly and confidently identify the pathogen and its susceptibility profile.

[00160] Adaptive Peak Picking Engine (APPE) Through simulations, the dramatic effect of dynamic protein exclusion on dynamic range (Fig. 10C) is shown. A real-time algorithm can be built that uses a comprehensive artificial intelligence engine that can dynamically change inclusion and exclusion criteria for selection of ions for fragmentation. For instance, if a single peptide has been identified that can be a constituent of two proteins, the masses (± tolerance) of the remaining peptides of both candidate proteins can be added to an inclusion list for preferential selection in order to identify the protein conclusively. Once the protein is identified, the peptide masses from the other candidate proteins will be removed from the inclusion list, and the masses of constituent peptides of the identified protein will be added to the exclusion list. Therefore, the inclusion list intervals, similar to the teeth of a comb, will grow and shrink, while the exclusion list comb will continually get larger. One can model this using the 5000 protein simulation environment outlined above, allowing for control of every aspect of the APPE and characterizing the effects of dynamic protein exclusion on dynamic range. One can extend the simulation environment to include multiple charge states, complex post-translational modifications, and partial or incomplete isotopic labeling. One can exclude entire bacterial proteomes or even branches of the phylogenetic tree.

[00161] In certain embodiments using dynamic inclusion and exclusion mass lists, once two unique tryptic peptides from a protein are identified, the rest of the protein can be excluded from further consideration, significantly increasing the number of proteins identified.
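The growing and shrinking of the inclusion "comb" and the monotonic growth of the exclusion "comb" can be sketched as simple bookkeeping. This is an illustrative sketch under stated assumptions, not the APPE itself: the class name `AdaptiveLists`, the toy peptide names and masses, and the rule that a rival is any candidate sharing a matched peptide with the identified protein are all hypothetical choices made for the example.

```python
from collections import defaultdict


class AdaptiveLists:
    """Toy bookkeeping for inclusion/exclusion mass lists (hypothetical helper)."""

    def __init__(self, proteins, min_peptides=2):
        self.proteins = proteins         # protein -> {peptide sequence: mass}
        self.min_peptides = min_peptides
        self.matched = defaultdict(set)  # protein -> peptides identified so far
        self.pending = {}                # candidate protein -> masses still wanted
        self.exclusion = set()           # masses never to be fragmented again
        self.identified = set()

    @property
    def inclusion(self):
        """Masses preferentially selected for fragmentation."""
        wanted = set().union(*self.pending.values())
        return wanted - self.exclusion

    def observe(self, peptide):
        """Record an identified peptide and update both lists."""
        candidates = sorted(p for p, peps in self.proteins.items()
                            if peptide in peps and p not in self.identified)
        for p in candidates:
            self.matched[p].add(peptide)
            # Target the not-yet-seen peptides of every candidate protein.
            self.pending[p] = {m for pep, m in self.proteins[p].items()
                               if pep not in self.matched[p]}
        for p in candidates:
            if p in self.pending and len(self.matched[p]) >= self.min_peptides:
                self._identify(p)

    def _identify(self, protein):
        self.identified.add(protein)
        # All constituent peptide masses of the identified protein are excluded.
        self.exclusion.update(self.proteins[protein].values())
        # Drop inclusion targets of the protein itself and of rival candidates
        # that were only suggested by a peptide shared with it.
        rivals = [q for q in self.pending
                  if self.matched[q] & self.matched[protein]]
        for q in rivals:
            del self.pending[q]

    def is_excluded(self, mz, tol=0.01):
        """True if mz lies within +/- tol of any excluded mass."""
        return any(abs(mz - m) <= tol for m in self.exclusion)
```

In this sketch, observing a peptide shared by two candidate proteins widens the inclusion list with the remaining masses of both; a second, unique peptide then identifies one protein, empties its rival's inclusion entries, and moves the identified protein's masses to the exclusion list.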

[00162] Two simple bacterial samples consisting of media-grown B. subtilis or F. tularensis were analyzed using regular DDA LC-MS/MS settings. These results provide a starting point for comparison to the protein exclusion method. Briefly, B. subtilis was grown in super-rich liquid media (25 g/l yeast extract, 15 g/l tryptose, 3 g/l KH₂PO₄, pH 7.5), harvested in mid-log phase and lysed with lysozyme in 100 mM NaCl, 50 mM Tris pH 7.5, 1% Triton X-100, 10 mM EDTA. Thirty μg of lysate was run into an SDS-PAGE gel, the whole sample collected, and proteins were extracted and trypsinized. Eluted peptides were subjected to ¹⁸O exchange using immobilized trypsin and H₂¹⁸O (99%, Cambridge Isotope Laboratories). Unlabeled and ¹⁸O-labeled samples were stored at -80°C and mixed 1:1 immediately before mass spectrometry analysis. The sample was analyzed on three high-mass accuracy instruments: the Orbitrap (Mayo Proteomics Research Center), Orbitrap Velos (Northwestern University Proteomics Core) and SYNAPT G2 (the predecessor of the G2S, analyzed by Waters). The data were processed using the conventional database search engine Mascot and validated using Peptide and Protein Prophet. Over 375 proteins were identified by each instrument with 65% protein overlap between instruments. The SYNAPT G2 identified the highest number of proteins (556). Notably, the efficiency of identification was low (6%), with only 1500 unique peptides identified from >25,000 scans (Orbitrap Velos). Further, the B. subtilis proteome is 4188 proteins, and only ~10% of the proteome was identified in this unfractionated sample.

[00163] F. tularensis was analyzed using similar methods, except that the lysate was divided into membrane (SDS-soluble) and cytoplasmic proteins, separated on SDS-PAGE and cut into 15 fractions. A total of 1101 proteins were identified, representing 63% coverage of the proteome. In addition, comparing different growth conditions for F. tularensis allowed identification and quantitative analysis of Intracellular growth locus proteins A, B and C, which have been shown to be essential for virulence (Ludu, et al, 2008).

[00164] One can implement real-time proteomics algorithms on a dedicated Waters SYNAPT G2S mass spectrometer or another available system such as a Thermo Orbitrap-based device. Identifier software is well suited to the task of reliably and quickly identifying peptides from unsearched, raw MS data as it is being collected. Real-time peptide identification is feasible with Identifier. Identifier software can be extended to accommodate the changes needed to facilitate real-time identification from streaming mass spectrometry data on the Waters SYNAPT G2S mass spectrometer.

[00165] As implemented by Thermo and others, dynamic peptide exclusion does increase dynamic range of protein identification (Fig. 10B), but the gains are small compared to those that might be realized were protein inclusion and exclusion successfully implemented (Fig. 10C). True data-dependent run-time control has not been embraced or implemented by most manufacturers of mass spectrometers. Recent reports of real-time peptide and protein identification using Thermo instruments have shown the feasibility of this approach; however, real-time identification in itself does not increase proteome coverage (Graumann, et al., 2012). Inclusion and exclusion lists have been utilized and do increase coverage but require off-line processing and re-runs of the sample to generate the inclusion lists. Significant improvements in proteome coverage in real-time will require dynamic protein (and proteome) exclusion. Importantly, Waters established feasibility for protein exclusion on the SYNAPT by implementing Real Time Databank Searching (RTDS). In this working prototype, rapid MassLynx database searching identifies proteins during data acquisition, so that other peptides from the same protein can be excluded from MS/MS.

[00166] The Adaptive Peak Picking Engine (APPE) built on the Identifier/Validator/Quantitator pipeline can be implemented on a Waters SYNAPT G2S advanced Q-TOF LC-MS/MS system. On the SYNAPT both MS and MS/MS spectra are obtained at the same high resolution and mass accuracy, which enhances the reliability of pattern matching via Identifier. One can adapt the RTDS software by exploiting user-controlled ion selection during data acquisition but replacing peptide identification by the MassLynx database search with the faster and more robust APPE.

[00167] For any system that is too slow for real-time performance, one can take an approach similar to Mann, where their initial database search has less stringent identification criteria to allow fast identification, followed by a more stringent validation step after the run is over (15). Given that only 6% of identifications were informative in the B. subtilis sample, even in the presence of some false identifications, one can obtain an enhancement of performance.

[00168] Closely related bacteria can vary by hundreds to thousands of peptides (Fig. 9), and a subset of these can be readily detected even in the presence of the human proteome (Table 3).

[00169] Computational methods and databases of bacterial genomes have been used to predict bacterial proteomes and identify candidate biomarker peptides that are linked to a bacterial class, genus, species or strain, based on their unique mass. The FASTA protein databases (ORF) of approximately 800 bacterial species from the Human Oral Microbiome Database (Chen, T., Yu, W-Han, Izard, J., Baranova, O.V., Lakshmanan, A., Dewhirst, F.E. (2010) The Human Oral Microbiome Database: a web accessible resource for investigating oral microbe taxonomic and genomic information. Database, Vol. 2010) along with the NCBI human genome protein database were subjected to in silico tryptic digestion, yielding a total of 187,801,193 distinct peptide sequences. Extracting the peptide sequences belonging to only one bacterial species returned 168,044,697 species-specific peptide sequences, with an average of 205,685 species-specific peptide sequences per species. Removing from this list all peptides with mass within 10 ppm or 1 ppm of any tryptic peptide from the Human protein database left 6,038,683 (1 ppm) and 392,222 (10 ppm) species-specific peptide sequences, an average of 7,391 (1 ppm) and 480 (10 ppm) mass-distinct peptide sequences per species. Utilizing the example from Aim IB, there are now 10,746, 20,483 and 3,057 unique peptides that can distinguish Brucella suis, melitensis and melitensis biovar Abortus, respectively. Remarkably, each of the three strains has >136 peptides at 1 ppm and >8 at 10 ppm that can identify the species even in the presence of the complete human proteome.
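The three steps described above (in silico tryptic digestion, extraction of species-specific sequences, and removal of peptides whose mass falls within a ppm window of any human peptide) can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function names, the minimum peptide length, the cleavage rule (after K or R, not before P, no missed cleavages), and the toy sequences in the usage note are assumptions made here. The residue masses are standard monoisotopic values.

```python
import re
from bisect import bisect_left
from collections import defaultdict

# Standard monoisotopic residue masses (Da) and the mass of water.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(sequence, min_len=5):
    """Cleave after K or R (not before P); no missed cleavages in this sketch."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return {p for p in pieces if len(p) >= min_len}


def peptide_mass(peptide):
    """Monoisotopic mass of the neutral peptide."""
    return sum(RESIDUE[aa] for aa in peptide) + WATER


def species_specific(proteomes, min_len=5):
    """Map each species to the peptide sequences found in that species only."""
    owners = defaultdict(set)
    for species, proteins in proteomes.items():
        for protein in proteins:
            for pep in tryptic_peptides(protein, min_len):
                owners[pep].add(species)
    specific = {species: set() for species in proteomes}
    for pep, who in owners.items():
        if len(who) == 1:
            specific[next(iter(who))].add(pep)
    return specific


def mass_distinct(peptides, background_masses, ppm):
    """Keep peptides with no background mass within +/- ppm of their own mass."""
    background = sorted(background_masses)
    kept = set()
    for pep in peptides:
        mass = peptide_mass(pep)
        tol = mass * ppm * 1e-6
        i = bisect_left(background, mass - tol)  # first background mass >= lower bound
        if i == len(background) or background[i] > mass + tol:
            kept.add(pep)
    return kept
```

As a toy usage example, given two made-up "species" with proteins `GASPVKTTLLRAAAAK` and `GASPVKWWWWR`, the shared peptide `GASPVK` is discarded as non-specific; if the background contains `LLTTR` (same composition as `TTLLR`, hence the same mass), `TTLLR` is also removed by the mass filter even though its sequence is species-specific.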

[00170] One can test complex samples including, for example, any combination of Bacillus subtilis, Acinetobacter baumannii, Stenotrophomonas maltophilia, Burkholderia cepacia, Klebsiella pneumoniae, and Escherichia coli. An example of closely related strains to add to the mixture would be Brucella suis, melitensis and melitensis biovar Abortus. Identification of species and antimicrobial resistance genes can then be performed.

[00171] In certain embodiments, one can more heavily weight inclusion lists, including cross-referencing proteomes and selecting "marker" peptides of species. One can also mine existing data and perform extensive control runs to create a list of human proteins that are not detected because they are theoretical, expressed at very low abundance, or are tissue-restricted. The removal of these proteins decreases the combed-out region in the proteome exclusion model. In certain embodiments, antimicrobial resistance genes can be added to inclusion lists.

[00172] Algorithms for real-time biologic pathway analysis with rapid knowledge integration to improve peptide selection during mass spectrometry

[00173] A complex query engine (Interrogator) can interrogate multiple knowledgebases simultaneously to make real-time predictions that inform protein inclusion and exclusion.

[00174] The ability to exclude large numbers of peptides from selection and fragmentation results in an order-of-magnitude increase in dynamic range in simulations. One can use real-time interrogation of other system-wide datasets in order to predict which other proteins are likely to be seen during the run. One can use other dynamic inclusion and exclusion lists of peptides to inform the mass spectrometer during the precursor scan in order to mask out or preferentially select certain peptides for fragmentation. This provides at least two benefits. First, as demonstrated with dynamic protein and proteome exclusion, it can further increase the dynamic range of peptide detection. Whereas certain embodiments use exclusion and inclusion based on gene sequence-inferred proteome data, others use information about other relevant pathways and interactions that might otherwise be overlooked. Second, this level of complex orthogonal analysis facilitates the reporting of a richer set of data than current methods. In addition to the peptides and proteins identified, an integrated system can report enriched and modulated pathways, likely protein-protein interactions, and other system-wide information not otherwise easily accessible or readily apparent.

[00175] The Stevens group at Argonne Laboratory and University of Chicago helped develop the National Microbial Pathogen Data Resource (NMPDR), a database of curated annotations for comparative analysis of genomes and biological subsystems (McNeil, et al., 2007). The NMPDR and its successors, the PubSEED (Overbeek et al, Nucleic Acids Res 33(17), 2005 (Supplementary material)) and PATRIC ("PATRIC: The Comprehensive Bacterial Bioinformatics Resource with a Focus on Human Pathogenic Species" Infect. Immun 79 (11): 4286-98.) contain complete and whole genome shotgun (WGS) genomes of over 3900 bacteria to support extensive comparative analysis. The underlying Sprout Database which supports these systems includes extensive cross-reference data and contains 34.6 billion characters of information in 2.8 Gb of search indices. Most recently, Dr. Stevens has led the Model SEED project (Henry, et al, 2010), in which the group developed a system for the automated generation of metabolic models from genomic data. Currently, the database contains over 3000 public models and 15,000 private models from over 1900 bacterial species. Part of this initiative was the development of the RAST server (Rapid Annotations using Subsystems Technology), an automated service for annotating bacterial genomes, identifying protein encoding and RNA genes, assigning functions to the genes and predicting subsystem representation (Fig 12) (Biemann, et al, 1988; Meyer, et al, 2008). RAST has been used to annotate over 40,000 genomes since 2007 with 12,000 registered users.

[00176] As proteins are identified in the databases, possible and probable interactions and pathways can be established and the peptide inclusion and exclusion lists will grow and shrink as appropriate. As the run progresses, the system will have "learned" more about the organisms identified and ultimately, will present the user with a list of protein identifications based not only on peptide-spectrum matches, but also built on the rich layers of data retrieved from, for example, the PubSEED and other databases.

[00177] Multiple protein subunits of multiprotein complexes are commonly present in cells in consistent relative abundance due to control of gene expression and protein stability as well as other mechanisms. Some subunits are common to multiple complexes while others are unique. An example is TRRAP, an adaptor protein shared among multiple histone acetylation complexes that serves to link histone acetyltransferases to distinct DNA binding proteins. Identification of TRRAP associated with another protein, such as MYC, that binds to DNA, raises the question of which histone acetyltransferase(s) and/or other subunits may be present. In practice, identification of TRRAP would add all known partners to the inclusion list, and the identification of one or another protein on this list would provide functional information on the activity of TRRAP as an adaptor for the DNA binding factor.
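The step of adding all known partners of an identified protein to the inclusion list amounts to a lookup in an interaction knowledgebase. The following is a minimal sketch under stated assumptions: the function name `expand_inclusion`, the in-memory dictionaries standing in for a knowledgebase, and the partner names and peptide masses used in the usage note are placeholders, not real interaction data.

```python
def expand_inclusion(identified, partners, peptide_masses, inclusion=None):
    """Return the inclusion list extended with peptides of known partners.

    identified     -- name of the protein just identified
    partners       -- dict: protein -> iterable of known partner proteins
    peptide_masses -- dict: protein -> iterable of its tryptic peptide masses
    inclusion      -- optional existing collection of included masses
    """
    result = set() if inclusion is None else set(inclusion)
    for partner in partners.get(identified, ()):
        # Partners with no entry in the peptide table contribute nothing.
        result.update(peptide_masses.get(partner, ()))
    return result
```

For example, with a placeholder table such as `{"TRRAP": ["HAT_X", "HAT_Y"]}` and placeholder masses for `HAT_X` and `HAT_Y`, identifying TRRAP would add the peptide masses of both hypothetical acetyltransferases to the inclusion list, while identifying a protein with no listed partners would leave the list unchanged.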

[00178] Screening populations for cancer remains problematic. Markers must have sufficient sensitivity and specificity to allow low false negatives and low false positives. Each different tissue of origin gives rise to a distinct pattern of disease, and tissue-specific biomarkers must be anticipated. To date, candidate biomarkers of cancer, such as proteins or peptides in serum, urine, or other biofluids obtained for screening, have demonstrated low statistical significance on their own because, though some are sufficiently sensitive, most are not specific. For example, in the detection of ovarian cancer, serum CA125 is not satisfactory on its own, insofar as its expression is a function of peritoneal irritation rather than tumor growth per se. Similarly, serum PSA is not specific to prostate cancer but increases in prostatitis. CEACAM1 is a candidate marker for pancreatic cancer that is also observed in pancreatitis. Detection of such a biomarker may gain value if detected in combination with measurements of other proteins that may rule in or rule out alternative diseases, segregating patients and directing them toward distinct treatments. Thus, detection of a candidate biomarker at a level considered potentially significant would then lead to inclusion of other proteins that would be differentially associated with cancer, inflammation, or other processes.

[00179] One can run these analyses on, for example, the Waters SYNAPT G2S having implemented Interrogator into the workflow. One can adapt already-developed complex presentation and visualization tools to summarize the results of organism identification and virulence factor analysis in an easily accessible form for the purpose of making rapid clinical decisions. One can generate reports that use heuristics to inform the clinicians as to the most appropriate intervention, similar to reports now generated by conventional clinical microbiology laboratory workflows, but at a significantly accelerated rate and with potentially much richer data.

[00180] With full integration of a suite of tools for rapid protein identification, dynamic protein and proteome exclusion, and fully integrated and robust orthogonal biologic pathway analysis, one can quickly and confidently analyze a complex biologic sample containing an array of proteins and bacteria, both normal and pathogenic, to fully characterize the identity of the pathogens and their antibiotic susceptibility patterns. One preferably develops a standard procedure for sample handling and processing that will facilitate analysis on our platform.

[00181] Mass spectrometry instrumentation and software suitable for use in the present invention are described in, for example, US Patents 8,053,723, 8,110,793, 7,009,174, 7,351,956, 7,297,941, 7,417,223, 8,168,943, 4,736,101, 8,384,022, 7,199,361, 6,744,043, 7,737,396 and 7,982,181, International Patent Applications PCT/US2005/027074 and PCT/IB2013/000384 and published US Patent Applications 11/777,926, 12/785,705, 11/884,676 and 13/090,120, all of which are hereby incorporated by reference herein.

PART 3

[00182] Multiple myeloma (MM) is the second most common hematologic malignancy, responsible for over 20,000 new cases in 2012. Despite intensive therapy, MM remains incurable for most patients, resulting in over 10,000 deaths each year in the U.S. Outcomes using the proteasome inhibitor bortezomib and immunomodulatory agents such as thalidomide and its analog lenalidomide (Revlimid) have been promising. Recent studies have demonstrated possible mediators of resistance to these therapies, but traditional genomic studies have failed to reveal reliable predictors or mechanisms. A method that offers high selectivity and dynamic range that can rapidly characterize the proteome of MM tumor cells in response to therapy can enhance discovery and lead to better diagnostic strategies and therapies.

[00183] Proteomic analysis of complex mixtures is most often accomplished through chromatographic separation followed by tandem mass spectrometry (LC-MS/MS). Data analysis remains time-consuming and computationally expensive for most users. Even using state-of-the-art instrumentation and methods, current approaches yield tentative identifications of only a few hundred proteins, far fewer than expected. The net effect is an inadequate list of proteins, many of which are either identified with low certainty or are the most highly abundant (and often least interesting) proteins.

[00184] A novel approach to LC-MS/MS proteomics is described to enhance the identification of proteins in patient-derived MM samples. The mass spectrometer's ability to exclude peptides and proteins from repeated fragmentation can be used to dramatically increase performance. Informatic tools have been developed to rapidly and confidently identify and quantify peptides and their parent proteins from high-resolution mass spectrometry data, and one can adapt these algorithms to identify peptides in real-time during acquisition, excluding all other possible peptides of the parent protein from subsequent analysis (dynamic protein exclusion). This facilitates comprehensive and rapid identification of relevant and interesting proteins from complex biologic samples, dramatically increasing the dynamic range of detection. Tools for real-time interrogation of biologic pathways and other orthogonal information are described; these data can be used to inform subsequent peptide selection on-the-fly.

[00185] The approach for multiple myeloma target discovery and pathway identification is based on modulating dynamic inclusion and exclusion peptide lists and focusing LC-MS/MS instrumentation on unidentified, low-abundance components of the sample. Modeling shows that the dynamic range of protein identification can be extended by two orders of magnitude, facilitating the confident identification of thousands of proteins during a single experiment using current instrumentation running the software described herein.

[00186] Barriers to comprehensive clinical proteomics Though progress has been made in the molecular characterization of diseases at the genomic level, it is becoming clear that proteomic analysis is necessary to complement this effort. With the magnitude of changes being identified at the gene transcription level, it becomes necessary to investigate the importance of translational and post-translational regulation. Recent advances in bioinformatics and protein mass spectrometry have fueled the emergence of proteomics as a possible tool that can be used to answer challenging questions in order to fully characterize cancer at a molecular level. A method that offers high selectivity and dynamic range that can rapidly characterize the proteome of tumor cells at diagnosis and in response to therapy can enhance discovery and potentially lead to better diagnostic strategies and therapies.

[00187] Multiple myeloma is largely incurable Multiple myeloma (MM) is the second most common hematological malignancy in the U.S. after non-Hodgkin lymphoma, accounting for 10% of all blood cancers, and was responsible for over 20,000 new cases in 2012. It is characterized by clonal proliferation of plasma cells in the bone marrow with elevated serum or urine monoclonal paraprotein. As it advances, it is associated with severe clinical manifestations including lytic bone lesions, anemia, immunodeficiency and renal impairment. Multiple myeloma is typically preceded by an age-progressive disease called monoclonal gammopathy of undetermined significance (MGUS), present in 1% of adults over the age of 25, that progresses to multiple myeloma at a rate of 0.5%-3% per year.

[00188] Even with the use of high-dose chemotherapy with stem cell support, MM remains largely incurable, and over half of patients with MM will succumb to their disease, resulting in over 10,000 deaths each year in the U.S. Improved response rates have been achieved in refractory and relapsed patients with novel agents, including thalidomide, the immunoregulator lenalidomide (Revlimid), and the proteasome inhibitor bortezomib. Recent clinical outcomes using the thalidomide analog lenalidomide have been promising. Thalidomide also remains an important treatment option for patients not eligible for autologous stem cell transplant (ASCT) and for those who have refractory or relapsed disease.

[00189] Multiple myeloma as a paradigm system The achievement of a very good partial response (VGPR) or complete response (CR) to initial treatment of myeloma is associated with longer remission and overall survival. Using a predictive marker of response prior to initiation of therapy will facilitate the identification of patients likely to have a suboptimal response, allowing the tailoring of treatment to improve outcomes.

[00190] There have been a limited number of multiple myeloma studies that have utilized mass spectrometry-based proteomics. A proteomic study of 39 newly-diagnosed MM patients treated with thalidomide revealed that responders had elevated levels of vitamin D-binding protein, zinc-α-2-glycoprotein, β-2-microglobulin, and serum amyloid A protein, while there were lower levels of haptoglobin fragment. SELDI and MALDI have been used to discover panels of spectra that differentiate different types of MM from normal. Multiple myeloma cells sensitive to dexamethasone were found by proteomic analysis to have over-expression of FKBP5, which is possibly involved in signaling pathways that induce dexamethasone-mediated apoptosis. Further dexamethasone effects were studied by comparing the proteomic profiles of MM cells and normal plasma cells by 2-D electrophoresis and MS. Forty-three differentially expressed proteins were identified, and functional studies were performed demonstrating that annexin A1 knockdown induced lethality and potentiated the effects of dexamethasone. A study of tumor reversion used SILAC quantitative proteomics to compare parental and revertant MM cells and revealed 379 proteins activated or inhibited, including down-regulation of STAT3, TCTP, CDC2, BAG2, and PCNA. MALDI studies using arsenic trioxide, known to cause growth inhibition in MM cells and to have clinical activity, revealed up-regulation of HSP90 and down-regulation of 14-3-3ζ protein and members of the ubiquitin-proteasome system in arsenic-treated cells. The proteasome inhibitor PS-341 induced apoptosis in MM cells, but sub-toxic levels appear to sensitize resistant MM cell lines to chemotherapy. Proteomic analysis was used to demonstrate that PS-341 down-regulates several effectors involved in the cellular response to stress, leading to increased sensitivity. Post-translational modifications have also been studied. Phosphorylation appears to be responsible for regulation of several MM proteins, including FGFR3. Since FGFR3 is a drug target in some MM and is activated by mutation in several other cancers, MS was employed to identify phosphotyrosine sites modulated by FGFR3 activation and inhibition. Forty drug-sensitive phosphotyrosine sites identified were found to be co-modulated by FGF1. Selective reaction monitoring was used to determine the phosphorylation stoichiometries of two phosphorylation sites on Lyn kinase, the predominant Src family protein-tyrosine kinase in B cells and a protein implicated in B-cell related malignancies such as MM. A large-scale analysis of phosphorylation in MM cells was performed using titanium dioxide enrichment, and 530 phosphorylation sites were identified from 325 unique phosphopeptides corresponding to 260 proteins.

[00191] Proteomic profiling has recently identified TXNDC5 (thioredoxin domain containing protein 5), which codes for a protein disulfide isomerase involved in regulation of oxidative stress, as one potential marker of response to two bortezomib-containing regimens. TXNDC5 gene expression has been found to be up-regulated in certain cancers and stimulates cancer cell growth and proliferation in vitro. Although this mechanism is not fully understood, higher levels of TXNDC5 may play a role in protection of tumor cells from apoptosis and increasing their resistance to therapy. In particular, levels of TXNDC5 may affect activity of proteasome inhibitors, leading to altered production of reactive oxygen species. However, it has not been established whether levels of TXNDC5 are predictive of response to bortezomib versus other components of the treatment regimens. Cereblon is an intracellular protein that is a direct target of immunomodulatory drugs (IMiDs), including thalidomide and lenalidomide, and is required for their activity. A recent study demonstrated that myeloma cells from patients who are resistant to immunomodulatory agents had lower levels of cereblon, suggesting that cereblon depletion is a possible mechanism of resistance to these agents. For patients treated with bortezomib, doxorubicin, and dexamethasone, pre-treatment levels of cereblon were significantly lower in patients with CR or VGPR compared to non-responders. Based on these results, cereblon may potentially be used as a marker of response to IMiDs. In this study, one can evaluate the levels of TXNDC5 protein and cereblon in samples from multiple myeloma patients collected prior to treatment with regimens containing proteasome inhibitors and immunomodulatory agents, and based on these levels, create a plausible predictive model of response to combination therapies.

[00192] All of these MM proteomics studies suffer from compromises made to accommodate the narrow dynamic range of current MS analysis. The use of labeled cell lines, phosphoproteome enrichment, and protein depletion are the most common methods employed to increase the likelihood of finding a low-abundance protein. Unfortunately, these procedures mask or eliminate potentially interesting low-abundance proteins. These restrictions can be overcome by implementing intelligent protein exclusion on a high-resolution mass spectrometer.

[00193] Toward rapid and comprehensive proteomics Real-time peptide identification and subsequent on-the-fly protein exclusion and protein interaction and biological pathway prediction are described herein. Such a transformative advancement in mass spectrometry is preferred for the interrogation of complex samples required for in-depth proteomics. Certain embodiments comprise a proteomics device in the hospital lab exploiting protein exclusion for automated and in-depth coverage of complex samples such as blood or tumor tissue. This facilitates rapid detection and characterization of low-abundance disease-related proteins, dramatically decreasing the response times for initiating appropriate therapy, resulting in decreased morbidity and mortality and significant cost savings. In addition to improved stratification, in-depth proteomics enables better measures of responses to therapy, increases the sensitivity of disease surveillance, and drives discovery of potential therapeutic targets.

[00194] Identification of low-abundance proteins from complex samples Present technology does not permit rapid and comprehensive identification of the full complement of proteins in complex samples, such as blood, tissue, urine, or other patient samples which may contain disease-specific proteins. Antibody-based methods are sensitive and specific but limited to a small number of specific markers and thus can only identify targeted protein species and known variants.

[00195] Current mass spectrometry methods are perhaps the least attractive option, insofar as the dynamic range of the approach is far too low to offer needed sensitivity or specificity, let alone detection of low-abundance proteins. Even using state-of-the-art LC-MS/MS instrumentation and informatics, it remains impractical to detect these low-abundance proteins in blood at physiologic concentrations. Methods for sample depletion or enrichment may improve detection of certain classes of proteins, but do so at the expense of overall sensitivity. Successful implementation of a rapid, sensitive, and specific proteomic assay that can identify rare proteins over a wide range of abundances within a complex sample advances proteomics to the forefront of emerging clinical tools.

[00196] In-depth proteomics Because the conventional data-dependent MS/MS approach selects the most abundant ions for fragmentation, the results are skewed towards identification of abundant proteins. In fact, it is not uncommon for dozens of peptides from a single abundant protein to be identified. Thus, the mass spectrometer spends time identifying the same protein repeatedly at the expense of missing low-abundance proteins. As a result, 20,000 peptide fragmentation events may result in only 500 protein identifications. Being able to control the MS during the run, perform rapid peptide and protein identification, and dictate which ions should or should not be selected for fragmentation can significantly improve dynamic range. Real-time, on-the-fly peptide identification is preferred for this approach.

[00197] Previously, the computation required to perform analyses was far too slow for real-time implementation. Therefore, analysis could only be achieved off-line, after data had been collected from the mass spectrometer. Recent advances in mass spectrometry instrumentation and computing speed have made real-time analysis of MS data feasible, as shown by recent studies from the Mann, Coon and other labs. While these real-time methods yield results far faster than conventional off-line analysis, their performance overall has not offered even two-fold improvements in protein identification or other metrics.

[00198] With feasibility of confident, on-the-fly peptide identification already demonstrated, one can take full advantage of real-time data analysis to improve the efficiency and dynamic range of tandem mass spectrometry. One can take advantage of Identifier, a high-performance peptide and protein identification tool that can perform confident, on-the-fly peptide identification from high-resolution data. Once multiple peptides unique to a specific protein are in hand, one can perform data-dependent protein exclusion, based on the rationale that once a protein is identified, further peptides likely to derive from that protein are non-informative. Extensive simulations show that this strategy can increase the speed and depth of protein identification in LC-MS/MS by at least tenfold over "conventional" real-time methods.

[00199] Extending this paradigm to cancer proteomics, one can use peptides unique to a specific pathway to build inclusion and exclusion lists to modulate the selection of peptides in favor of those that confirm other pathway members, while excluding those that are informationally redundant. As such, signaling pathways important in cancer development and progression can be fully characterized in complex samples. Further, unanticipated components, such as previously unknown targets and other biomarkers, can be readily detected.

[00200] By these innovative approaches, confident identification and quantitation of thousands of proteins from complex samples are possible, yielding a dramatic improvement in speed and accuracy of proteomics, permitting MS-based clinical decision-making within hours of sample collection.

[00201] Peptide and protein identification paradigm Mass spectrometry (MS) is well matched to proteomics as a primary analytical tool. Each unmodified or modified amino acid has a characteristic mass, and a typical commercial mass spectrometer can measure the mass-to-charge ratio (m/z) with high precision and resolution for ions over the range of 100 to 10,000 Daltons. Typically, proteins are digested with trypsin and the resulting Lys- and Arg-terminal peptides are separated by reverse-phase liquid chromatography (LC). The eluent is injected into a tandem mass spectrometer and peptides are ionized by electrospray ionization (ESI), yielding doubly or triply charged peptide ions of five to twenty-five residues. Peptides are selected for MS/MS and fragmented via collision-induced dissociation (CID) to create nested series of amino-terminal (b-ion) and carboxyl-terminal (y-ion) fragments separated by the mass of the amino acid residues.
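The nested b- and y-ion series described above can be illustrated with a short calculation. The following is a minimal sketch, not part of the described software: it assumes unmodified residues, singly charged fragments, and standard monoisotopic residue masses, and the function name `fragment_ions` is hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton, Da
WATER = 18.010565   # mass of H2O, Da


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    Assumption: unmodified peptide, one charge per fragment.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # b-ions grow from the amino terminus
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # y-ions grow from the carboxyl terminus
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


b, y = fragment_ions("GAK")
```

Adjacent members of each series differ by exactly one residue mass, which is what allows a sequence to be read from, or checked against, a fragmentation spectrum.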

[00202] Current approaches to comparative/quantitative proteomics Although challenging, quantitation of proteins by conventional LC-MS/MS is of considerable interest in that mRNA expression is often a poor predictor of protein abundance. Comparison to standards and/or detection of differences between samples after stable-isotope labeling remains the preferred approach. As heavy isotope-labeled peptides co-elute with their unlabeled partners, a direct comparison of ion counts between the heavy and light forms of each peptide results in a reliable measure of relative abundance. Several strategies facilitate consistent differential labeling with isotopic tags, including trypsin-catalyzed ¹⁸O exchange (Fig. 13) and stable isotope-labeling with ¹³C and ¹⁵N-labeled lysine and/or arginine amino acids in cell culture (SILAC). These carboxyl-terminal-labeling strategies result in mixtures of pairs of chemically identical, but isotopically distinct, peptides. The unlabeled and stable isotope-labeled peptides co-elute as pairs during LC-MS/MS, yielding isotopic envelopes offset by 4-10 Da in the MS scan (Fig. 14). Informatic analysis is used to compare the intensity of the isotopic forms to quantify relative abundance.
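As an illustration of the pair-based comparison described above, the sketch below finds light/heavy partners in a single MS scan and computes their intensity ratio. It is a simplified example under stated assumptions: the peak list is hypothetical, the charge state is taken as known, only monoisotopic peaks are considered, and the shift of 4.008491 Da corresponds to exchange of two ¹⁶O atoms for ¹⁸O. The function names are not from the described software.

```python
O18_DOUBLE_EXCHANGE_SHIFT = 4.008491  # Da, two 16O -> 18O substitutions


def find_pairs(peaks, shift_da, charge, tol_ppm=10.0):
    """Pair peaks whose m/z values differ by shift_da / charge.

    peaks: list of (mz, intensity) tuples from one MS scan.
    Returns a list of (light_peak, heavy_peak) tuples.
    """
    pairs = []
    delta = shift_da / charge
    for light in peaks:
        expected = light[0] + delta
        tol = expected * tol_ppm * 1e-6
        for heavy in peaks:
            if abs(heavy[0] - expected) <= tol:
                pairs.append((light, heavy))
    return pairs


def heavy_to_light_ratio(pair):
    """Relative abundance as heavy intensity divided by light intensity."""
    light, heavy = pair
    return heavy[1] / light[1]


# Hypothetical doubly charged pair plus one unrelated peak.
scan = [(500.0, 1000.0), (502.0042455, 2000.0), (650.3, 50.0)]
pairs = find_pairs(scan, O18_DOUBLE_EXCHANGE_SHIFT, charge=2)
ratios = [heavy_to_light_ratio(p) for p in pairs]
```

A practical implementation would compare whole isotopic envelopes across co-eluting scans rather than single peaks; this sketch only shows the mass-offset logic.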

[00203] Current approaches to peptide identification The prevailing peptide identification technology requires the measured fragmentation spectrum be compared by an automated pattern-matching algorithm to a database of species-specific peptide masses and their theoretical MS/MS fragmentation. When combining these approaches with state-of-the-art mass spectrometers that offer high scan speeds (>20 MS/MS scans per second) and high mass-accuracy (<1 ppm) such as the Thermo Orbitrap Velos, Waters SYNAPT G2S High Definition Mass Spectrometer, or the Agilent 6500 Series Accurate-Mass Quadrupole Time-of-Flight (Q-TOF) LC/MS, confident identification of hundreds of proteins can be obtained from a single sixty-minute experiment.

[00204] Several well-known limitations to standard database search approaches conspire to decrease the yield of identified peptides, as the majority of ions selected for fragmentation fail to lead to a confident identification. Background noise from other peptide fragments, poor CID efficiency, and incorrect pattern matching all contribute to low specificity, and so it is useful to calculate a false-discovery rate based on the score-frequency distribution of random peptides. A number of authors have suggested practical score "cutoff" thresholds for automated acceptance of database search results, but peptides with scores below the cut-off may be correct and those above incorrect.

[00205] But by far the most important limitation on yield of peptide identifications is that commercial LC-MS/MS systems automatically select the most intense ions in the MS spectrum for fragmentation. Since peptides from high-abundance proteins will be subjected to fragmentation multiple times, many of the selected peptides will derive from a few abundant proteins, and low-abundance peptides are never selected and identified. Dynamic (peptide) exclusion is a standard approach to increasing dynamic range by reducing redundant selection of ions with a particular monoisotopic mass, but this strategy does not solve the problem of multiple peptides deriving from a single abundant protein. Consequently, medium- and low-abundance proteins are often not identified, even though they may have yielded multiple detectable ions in the MS spectrum. Analysis of the output from a published dataset shows that of 378 identified proteins, the top 17 (5%) accounted for 25% of the matched spectra, with half of the spectra being used to identify only 14% of the proteins.

[00206] Increased dynamic range with protein exclusion Significantly improving dynamic range requires the real-time identification of proteins, alleviating repeated fragmentation of peptides from already-identified proteins, while allowing fragmentation of peptides not yet assigned to a protein. Advances in processing power have resulted in several orders-of-magnitude improvement in computing speed, making real-time analysis of MS data feasible as shown by several recent studies.

[00207] Potential for enhanced identification with precision proteomics State-of-the-art mass spectrometers (e.g., Thermo Orbitrap, Waters SYNAPT QTOF, Agilent QTOF) can now achieve high mass-accuracy (1 ppm) at both the MS and MS/MS level. The use, abuse, and underuse of high mass-accuracy/high resolution mass spectrometry data in peptide identification have been discussed. Mass accuracy and resolution directly contribute to peptide identification by constraining both the precursor ion charge state and the monoisotopic m/z, thereby limiting the range of possible matches. In the case of the human proteome (NCBI, release 2011 11) there are 87,612 predicted proteins, leading to 1,859,727 distinct tryptic peptides with 3,327,950 unique masses, owing to multiple charge states. This gives an average of only 17 peptides per ~50 ppm interval, indicating that accurate mass alone can be sufficient to confidently identify a human peptide.
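To make the role of mass accuracy concrete, the following sketch converts a ppm tolerance into an absolute mass window and tests whether an observed mass falls inside it. The 1500 Da peptide mass is an illustrative value, not one taken from the studies above, and the function names are hypothetical.

```python
def ppm_window(mass, tol_ppm):
    """Return (low, high) bounds of a +/- tol_ppm window around mass."""
    half_width = mass * tol_ppm * 1e-6
    return mass - half_width, mass + half_width


def within_ppm(observed, theoretical, tol_ppm):
    """True if observed lies within tol_ppm of theoretical."""
    return abs(observed - theoretical) <= theoretical * tol_ppm * 1e-6


# At 1 ppm, a 1500 Da peptide is constrained to +/- 0.0015 Da.
low, high = ppm_window(1500.0, 1.0)

# A Lys -> Gln substitution changes the mass by about 0.0364 Da
# (128.09496 - 128.05858), roughly 24 ppm at 1500 Da, so the two
# peptides fall in different windows even at 10 ppm tolerance.
lys_gln_difference = 128.09496 - 128.05858
distinguishable = not within_ppm(1500.0 + lys_gln_difference, 1500.0, 10.0)
```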

[00208] Applications to cancer Rapid identification of cancer-specific species in a complex sample depends on successful detection of low-abundance peptides and proteins. Informatic analysis of the human proteome demonstrates that even highly conserved proteins are likely to differ in amino acid sequence in several of the peptides and can thus be differentiated using mass spectrometry. Since the mass spectrometer detects peptides based on their mass-to-charge ratio, and state-of-the-art instruments such as the Agilent 6500, Thermo Orbitrap, and SYNAPT G2S Q-TOF can achieve 1 ppm accuracy or 0.0001% of the peptide mass, a sequence difference of one amino acid or a modification in a peptide is easily detectable.

[00209] Current proteomics workflows can include a list of peptides to target (include) or ignore (exclude) during the mass spectrometry run, but this information must be obtained beforehand, effectively doubling the analysis time and cost. Certain embodiments of the present invention include an informatic approach to identify information-rich peptides in samples from patients with multiple myeloma. One can perform peptide identification in real time and control the selection of peptides for fragmentation based on the preceding identifications and the information contained in each peptide. One can interrogate available knowledgebases to further inform our peptide selection, thus dramatically increasing the dynamic range of detection. By selectively excluding redundant, low-information peptides and by including those peptides likely to yield high-value protein identifications, one can rapidly and confidently identify important, low-abundance proteins from complex biologic samples.

[00210] Software to rapidly and confidently identify and control selection of peptides during mass spectrometry. A single automated informatics workflow that rapidly and confidently identifies peptides and proteins from high-accuracy mass spectrometry data. Real-time proteomics requires accurate identification and quantitation without manual validation or post-run statistical analysis, while identifying peptides over the full dynamic range of the almost 2 million predicted tryptic human peptides during a single 90-minute LC-MS/MS run. Enabling development of a real-time workflow are the spectral deconvolution software (Validator) and software for direct peptide identification (Identifier) and relative quantitation (Quantitator, Fig. 15). These software packages exploit the embedded information from stable isotope labeling. Briefly, isotopic peptide pairs are identified directly from the precursor (MS) scan and Validator deconvolutes the fragmentation spectra, identifying potential b- and y-ions. Identifier relies on the high-accuracy precursor mass and the Validator-assigned potential b- and y-ions to rapidly and confidently assign a peptide sequence selected from a mass-sorted species-specific tryptic database. Quantitator then calculates the peptide pair ratio. Each step occurs considerably faster than the mass spectrometer can fragment a new peptide, making it feasible to generate inclusion and exclusion peptide lists for subsequent scans in real-time as described below.

[00211] Identification of peptide pairs and potential b- and y-ions (Validator)

Carboxyl-terminal stable isotope-labeling methods (SILAC, ¹⁸O exchange) result in a mixture of pairs of chemically identical, but isotopically distinct, peptides that co-elute from the HPLC as pairs that are readily resolved by the MS and identified by Validator (Fig. 14). Raw data files are converted to mzXML using MSConvert within ProteoWizard, followed by the extraction of monoisotopic masses using the Horn Mass Transform algorithm within Decon2LS. The "light" and "heavy" fragmentation spectra are compared, and b-ions and y-ions are identified as having the same m/z in both scans (non-shifting) or having a mass difference corresponding to the isotope used (shifting), respectively, resulting in a set of ion pairs for each scan window.
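The shifting versus non-shifting comparison can be sketched as follows. This is an illustrative simplification and not the Validator implementation: it assumes singly charged, already deconvoluted fragment masses, a single known label shift, and an absolute matching tolerance in Da, and the function name is hypothetical.

```python
def classify_fragments(light_mz, heavy_mz, label_shift, tol=0.01):
    """Split light-spectrum fragments into candidate b- and y-ions.

    light_mz, heavy_mz: fragment m/z values from the light and heavy
        members of one peptide pair (assumed singly charged).
    label_shift: mass added by the carboxyl-terminal label, in Da.
    tol: absolute matching tolerance in Da (an assumed value).
    """
    def has_match(target):
        return any(abs(h - target) <= tol for h in heavy_mz)

    b_candidates, y_candidates = [], []
    for mz in light_mz:
        if has_match(mz):
            # Same m/z in both spectra: lacks the labeled C-terminus.
            b_candidates.append(mz)
        elif has_match(mz + label_shift):
            # Offset by the label mass: carries the labeled C-terminus.
            y_candidates.append(mz)
        # Fragments matching neither rule are treated as background.
    return b_candidates, y_candidates


light = [129.0658, 147.1128, 218.1499, 300.0]
heavy = [129.0658, 151.1213, 222.1584]
b_ions, y_ions = classify_fragments(light, heavy, label_shift=4.0085)
```

In this hypothetical example the peak at 300.0 has no partner in the heavy spectrum and is discarded, which is how the comparison separates sequence ions from background.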

[00212] Validator was tested on output from the conventional database search engine, Mascot. Validator identified potential b- (non-shifting) and y- (shifting) ions from the fragmentation spectra and compared these to the b- and y-ions calculated from the Mascot peptide sequence. In a complex yeast sample, Validator analysis confirmed the identities of 89% of peptides found through traditional database search and post-processing with Peptide and Protein Prophet. Validator also identified potentially valid low-scoring peptides that would otherwise have been discarded, increasing both sensitivity and specificity.

[00213] Direct peptide identification (Identifier) Direct peptide identification software, Identifier, has been developed. It uses the accurate mass of a peptide pair member to identify a range of candidate peptides from a mass-sorted species-specific tryptic database of the proteome(s) of the organism(s) of interest. Each measured experimental mass is compared to the database to identify peptides within a close range (e.g., ±10 ppm), and the b- and y-ions from each peptide sequence are compared to the potential b- and y-ions identified by Validator (Fig. 15A). Each potential match is scored according to the number of matching shifting and non-shifting ions, along with a metric to include the number of consecutive matches. The threshold score for each match is determined by comparing the score to a distribution of scores from 1000 randomly generated peptides of similar mass and composition. The 99% cutoff score determines which peptide (if any) is the "winner." Identifier was tested on a yeast whole cell lysate digest expected to contain around 5000 proteins. Identifier identified 1,700 proteins and found 80% of "high quality" Mascot identifications (minimum 2 peptides with 95% Peptide Prophet score, 99% Protein Prophet score). Using a published dataset of high-quality data, Identifier was rapidly able to identify 95% of the proteins found through traditional database search. These results indicate that reliable peptide identifications can be obtained using only the mass and inferred b- and y-ions, demonstrating the feasibility of real-time mass spectrometry.
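Two steps of this procedure lend themselves to a short sketch: the lookup in a mass-sorted database and the comparison of a score against a random-peptide distribution. The code below is an assumption-laden illustration rather than the Identifier implementation. The scoring function itself is omitted, the function names are hypothetical, and the 99% cutoff is taken as a simple empirical quantile.

```python
import bisect


def candidates_within_ppm(sorted_masses, observed_mass, tol_ppm=10.0):
    """Return the index range of database masses within tol_ppm.

    sorted_masses must be in ascending order, so two binary searches
    bound the candidates without scanning the whole database.
    """
    half_width = observed_mass * tol_ppm * 1e-6
    lo = bisect.bisect_left(sorted_masses, observed_mass - half_width)
    hi = bisect.bisect_right(sorted_masses, observed_mass + half_width)
    return range(lo, hi)


def passes_cutoff(score, random_scores, quantile=0.99):
    """True if score exceeds the given quantile of random-peptide scores."""
    ranked = sorted(random_scores)
    cutoff = ranked[int(quantile * (len(ranked) - 1))]
    return score > cutoff


# Hypothetical database masses (Da), already sorted.
database = [999.5, 1000.000, 1000.005, 1000.02, 1500.0]
hits = list(candidates_within_ppm(database, 1000.0, tol_ppm=10.0))
```

Keeping the database sorted by mass is what makes the lookup logarithmic in database size, which matters when each match must complete within the instrument's duty cycle.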

[00214] Quantitator Relative quantitation using trypsin-catalyzed ¹⁸O exchange involves directly comparing the "light" and "heavy" peptide peaks at the MS level (Fig. 14). Our quantitation software, Quantitator (in preparation), uses the peptide sequence assigned by Identifier to calculate an expected isotope distribution via the isotope pattern calculator (IPC) module. The fit of the experimental spectra to the theoretical model is then calculated to yield a "fit score," which identifies the most informative scans for accurate differential quantitation (Fig. 15B). The extent of ¹⁸O exchange is calculated from the fit, allowing for correction of quantitative values for incompletely labeled samples. Quantitator was tested on data from a series of ¹⁸O-labeled standards (90% purity) mixed with unlabeled sample at ratios from 1:20 to 20:1. Quantitator showed high inter-sample correlation (r=0.91) and tolerance for incomplete labeling.

[00215] Unfinnegan A set of C libraries to provide access to the raw data contained within the file generated by the Thermo MS has been designed. As the conversion of this file to an open-source consumable format generally requires a proprietary set of libraries, the availability of a fast algorithm for accessing the raw data is essential to ensure reliable pair picking and subsequent analysis steps.

[00216] Software speed Software was written in Python 2.7 or Perl 5.1 and run on standard laptop and desktop hardware. For a 200 megabyte raw Orbitrap file, conversion to mzXML occurs in under 10 sec. The pair-picking algorithm finds and corroborates all potential pairs from the ~20,000 scans in the resulting 500,000 line mzXML file in under 3 minutes. Using this database, Identifier can check over 2500 potential peptides in < 4 min, or < 100 ms per match.

[00217] The approach is tested against well-curated and searched data sets. As before, a large set of 72 MS runs from human HeLa cells is used. Representative sections of these data are analyzed using traditional search methods such as Mascot, X! Tandem, and Scaffold, and these results are used as a metric to which to compare the performance of the software.

[00218] For shorter peptides or when MS2 fragmentation data has low signal-to-noise ratio, at times the peptide score is not high enough to declare one peptide candidate as the "winner." It is also possible that some peptides exist in only the light or heavy form, or that only one of these is fragmented. In looking through the small list of peptides found by Scaffold but not by Identifier, both of these scenarios are seen. Nevertheless, most proteins have enough peptides so that others from the same protein are fragmented and identified.

[00219] As protein exclusion is database-dependent, it typically does not lead to the elimination of unexpected contaminants, unannotated peptides from novel splicing events or modifications, and any number of other potential confounding features. Any of these unaccounted-for peptides might be selected for fragmentation using our approach and compared to the proteome-specific peptide list. However, the Identifier algorithm is extraordinarily robust and highly tolerant of this type of "contamination" and only rarely reports a false positive identification. The method derives its specificity through two rigorous physical filters, first by differentiating shifting and non-shifting ions by comparing light and heavy fragmentation patterns, and second, by scoring the theoretical fragmentation patterns of similarly-sized tryptic peptides and comparing the categorized ions to the experimentally-derived deconvoluted spectrum. This strategy eliminates a large number of potential errors that confound typical database search algorithms that cannot differentiate between b-type, y-type and background fragment ions in fragmentation spectra.

[00220] Algorithms for dynamic protein-based peptide inclusion and exclusion.

Software to utilize real-time peptide identification to intelligently select the peaks picked for fragmentation is described. The benefits of protein exclusion for the identification of low-abundance proteins in complex samples are modeled.

[00221] STUDIES: Estimation of high-information peptides An in silico trypsin digestion of the human proteome (NCBI, release 2011 11) yields 3,327,950 distinct peptide masses of four or more residues from 87,612 proteins. Were all of these peptide masses combined into an exclusion comb using a conservative tine width of 10 ppm, it would "mask off" less than 700 Daltons of the 300-2500 Dalton range in a typical precursor (MS1) scan. Contributing to the small size of the mask, peptides consisting of the same amino acids in different orders yield tines that superimpose. Many other peptides yield tines that overlap. It was determined whether the remaining 1500 Daltons of "open space" in the MS1 scan can be used to find the two proteins of interest, TXNDC5 and cereblon. An in silico trypsin digestion of the human proteome was performed as before, excluding the two proteins. After masking off the remainder of the whole human proteome, 395 peptides with masses within 5 ppm of a human peptide would be "combed out." However, 7 unique peptides would remain available for fragmentation and subsequent identification, of which only one or two would have to be detected to confirm and quantify these two proteins. If one takes advantage of the full capabilities of modern mass spectrometers and applies a 2 ppm or 1 ppm tine width for the exclusion comb, the number of detectable peptides would be 15 and 27, respectively. These simulations suggest the potential to achieve a major advance in the ability to quickly and accurately detect rare proteins present in a complex sample.

[00222] Simulation of dynamic protein exclusion A comprehensive complex simulation environment was created to model LC-MS/MS to study determinants of dynamic range. Five thousand proteins are chosen at random from the CCDS database (build 9606) of human proteins and assigned a random "abundance" over a wide dynamic range. Each protein is trypsinized in silico, and the mass of each peptide is calculated and assigned a random "ionizability." "Intensity" of each peptide peak is the product of the abundance and the ionizability. Peptides appear as a single m/z, representing the monoisotopic mass of a singly charged ion, and are assigned a random scan number. Each peptide is programmed to elute over 30-180 seconds with a triangular profile. A "scan" is then generated every second for a 120 min run.
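The peptide model used in the simulation can be sketched in a few lines. This is an illustrative reconstruction from the description above and not the original simulation code. It assumes that the random scan number marks the apex of a symmetric triangular profile and that ionizability is drawn uniformly from 0 to 1; both choices, and all function names, are assumptions.

```python
import random

RUN_SCANS = 120 * 60  # one scan per second over a 120 min run


def triangular_profile(scan, apex_scan, width):
    """Relative signal (0 to 1) of a peptide eluting around apex_scan."""
    half = width / 2.0
    distance = abs(scan - apex_scan)
    if distance >= half:
        return 0.0
    return 1.0 - distance / half


def make_peptide(rng, protein_abundance, mass):
    """Create one simulated peptide with random ionizability and elution."""
    ionizability = rng.random()
    return {
        "mass": mass,
        "intensity": protein_abundance * ionizability,
        "apex_scan": rng.randrange(0, RUN_SCANS),
        "width": rng.uniform(30, 180),   # elution duration in seconds
    }


def signal_at(peptide, scan):
    """Observed intensity of the peptide in a given scan."""
    shape = triangular_profile(scan, peptide["apex_scan"], peptide["width"])
    return peptide["intensity"] * shape


rng = random.Random(0)
peptide = make_peptide(rng, protein_abundance=1000.0, mass=1234.5)
```

Each simulated scan is then the set of peptides with nonzero `signal_at` for that scan number.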

[00223] Implementing a simple "top 5" approach, the scans are successively parsed, and the five most intense peaks are always chosen for fragmentation. Using an FDR of 5% (5% incorrect identifications) and a requirement for two peptides to identify a protein, the simulator identified about 800 proteins in a 2-hour run. As shown in Fig. 16A, identified peptides are greatly skewed to the highest "intensities" (blue dots). To simulate standard dynamic peptide exclusion, as implemented by Thermo, Inc., an exclusion list was created wherein each selected ion mass is added to a list of up to 500 ion masses and remains there for 180 scans, preventing any re-selection of a similar peptide. Roughly 1500 proteins are identified over the 2-hour run, but the mean intensity of selected peptides remains high (Fig. 16B). With the dynamic protein exclusion algorithm, once two unique peptides are found and identify a protein, all other peptides from that protein are added to an unlimited exclusion list. Implementing this method had a dramatic effect (Fig. 16C). Even early in the run, peptides with lower intensities are selected (blue), leading to a significant increase in identifications to ~4500 of the 5000 proteins (90%, black). Increasing the FDR and the width of the exclusion mass tolerance had little effect on the number of proteins identified. These data confirm that exclusion of non-informative peptides has the potential to dramatically increase protein dynamic range and yield during LC-MS/MS.
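The difference between plain top-N selection and dynamic protein exclusion can be shown with a toy acquisition loop. The sketch below is a deliberately reduced model and not the simulator described above: it assumes every fragmentation yields a correct identification (no FDR), excludes by peptide identity rather than by mass tolerance, and omits dynamic peptide exclusion. The function name and data are hypothetical.

```python
def run_top_n(scans, peptide_to_protein, top_n=5, min_peptides=2,
              protein_exclusion=False):
    """Toy top-N acquisition loop.

    scans: list of scans, each a list of (peptide_id, intensity) tuples.
    peptide_to_protein: maps peptide_id -> protein_id (unique peptides).
    Returns the set of proteins identified by >= min_peptides peptides.
    """
    seen = {}            # protein_id -> set of identified peptide ids
    identified = set()
    for scan in scans:
        # With protein exclusion, peaks from identified proteins are masked.
        eligible = [
            (pid, inten) for pid, inten in scan
            if not (protein_exclusion
                    and peptide_to_protein[pid] in identified)
        ]
        eligible.sort(key=lambda peak: peak[1], reverse=True)
        for pid, _ in eligible[:top_n]:
            protein = peptide_to_protein[pid]
            seen.setdefault(protein, set()).add(pid)
            if len(seen[protein]) >= min_peptides:
                identified.add(protein)
    return identified


# One abundant protein (A) and one low-abundance protein (B) co-eluting.
mapping = {"A1": "A", "A2": "A", "A3": "A", "A4": "A", "B1": "B", "B2": "B"}
scan = [("A1", 100), ("A2", 90), ("A3", 80), ("A4", 70), ("B1", 5), ("B2", 4)]
scans = [scan] * 3

plain = run_top_n(scans, mapping, top_n=2)
excluded = run_top_n(scans, mapping, top_n=2, protein_exclusion=True)
```

In this example the plain loop keeps re-selecting the two most intense peptides of protein A, while the loop with protein exclusion masks A after its second peptide and reaches protein B in the next scan.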

[00224] Adaptive Peak Picking Engine (APPE) Through simulations, the dramatic effect of dynamic protein exclusion on dynamic range was demonstrated (Fig. 16C). The heart of the real-time algorithm is a comprehensive artificial intelligence engine that can dynamically change inclusion and exclusion criteria for selection of ions for fragmentation. For instance, if a single peptide has been identified that can be a constituent of two proteins, the masses (± tolerance) of the other peptides from both candidate proteins can be added to an inclusion list for preferential selection in order to identify the protein conclusively. Once the protein is identified, the peptide masses from the other candidate proteins can be removed from the inclusion list, and the masses of constituent peptides of the identified protein can be added to the exclusion list. Therefore, the inclusion list intervals, similar to the teeth of a comb, grow and shrink, while the exclusion list comb continually gets larger. One can model this using the 5000 protein simulation environment outlined above, allowing for control of every aspect of the APPE and characterizing the effects of dynamic protein exclusion on dynamic range. Of particular importance is when to add proteins to the exclusion list. One can extend the simulation environment to include multiple charge states, complex post-translational modifications, and partial or incomplete isotopic labeling.
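The comb analogy can be made concrete by representing each list as a set of mass intervals ("tines"). The following is a minimal sketch of that bookkeeping, not the APPE itself: the class and method names are hypothetical, tolerances are symmetric ppm windows, and the decision logic for when to confirm a protein is left out.

```python
def tine(mass, tol_ppm):
    """Interval of +/- tol_ppm around a peptide mass."""
    half = mass * tol_ppm * 1e-6
    return (mass - half, mass + half)


def merge_tines(tines):
    """Merge overlapping (low, high) intervals into a sorted comb."""
    merged = []
    for low, high in sorted(tines):
        if merged and low <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged


def masked_width(comb):
    """Total mass range covered by the comb, in Da."""
    return sum(high - low for low, high in comb)


class PeakPickingLists:
    """Inclusion list that grows and shrinks; exclusion comb that only grows."""

    def __init__(self, tol_ppm=10.0):
        self.tol_ppm = tol_ppm
        self.inclusion = {}   # protein_id -> peptide masses to prefer
        self.exclusion = []   # merged list of (low, high) tines

    def add_candidates(self, candidates):
        """candidates: protein_id -> peptide masses of an ambiguous match."""
        self.inclusion.update(candidates)

    def confirm(self, protein_id, peptide_masses, rival_ids=()):
        """Protein identified: drop it and its rivals, exclude its peptides."""
        self.inclusion.pop(protein_id, None)
        for rival in rival_ids:
            self.inclusion.pop(rival, None)
        new_tines = [tine(m, self.tol_ppm) for m in peptide_masses]
        self.exclusion = merge_tines(self.exclusion + new_tines)

    def is_excluded(self, mass):
        return any(low <= mass <= high for low, high in self.exclusion)


comb = merge_tines([tine(1000.0, 10), tine(1000.005, 10), tine(2000.0, 10)])
```

Merging overlapping tines is also why a comb built from millions of peptide masses covers less of the scan range than the sum of the individual windows would suggest.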

[00225] In certain embodiments, one can significantly decrease the exclusion list by using the expected elution time of peptides to exclude or include only over a given time range.

[00226] Algorithms for dynamic protein and proteome exclusion on a high-resolution mass spectrometer. Using samples of known composition, one can demonstrate the identification of low-abundance proteins using intelligent protein exclusion. Using dynamic exclusion mass lists, one can show that once two unique tryptic peptides from a protein are identified, the rest of the protein can be excluded from further consideration, significantly increasing the number of proteins identified.
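The elution-time restriction mentioned in paragraph [00225] can be sketched as a filter over the exclusion entries. This is an illustrative assumption about how such gating might look: the expected retention times, the 60-second window, and the function name are all hypothetical.

```python
def active_exclusions(entries, current_time, window=60.0):
    """Return masses whose expected elution is near the current time.

    entries: list of (mass, expected_retention_time_s) tuples.
    current_time: elapsed run time in seconds.
    window: half-width of the active time range in seconds (assumed).
    """
    return [mass for mass, expected_rt in entries
            if abs(expected_rt - current_time) <= window]


entries = [(800.4, 100.0), (900.5, 500.0), (1200.6, 130.0)]
active = active_exclusions(entries, current_time=120.0)
```

Only the entries expected to elute near the current time need to be checked against each scan, which keeps the working list short.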

[00227] PRELIMINARY STUDIES: As implemented by Thermo and others, dynamic peptide exclusion does increase dynamic range of protein identification (Fig. 16B), but the gains are small compared to those that might be realized were protein inclusion and exclusion successfully implemented (Fig. 16C). True data-dependent run-time control has not been embraced or implemented by most manufacturers of mass spectrometers. Recent reports of real-time peptide and protein identification using Thermo instruments have shown the feasibility of this approach; however, real-time identification in itself does not increase proteome coverage. Inclusion and exclusion lists have been utilized and do increase coverage but require off-line processing and re-runs of the sample to generate the inclusion lists. Significant improvements in proteome coverage in real-time will require dynamic protein exclusion.

[00228] Implementation of the APPE One can implement the real-time data acquisition algorithm on, for example, a dedicated Agilent 6500 device. The Identifier software is already well-suited to the task of reliably and quickly identifying peptides from unsearched, raw MS data as it is being collected. In fact, our preliminary data effectively demonstrate that real-time peptide identification is already feasible with Identifier. The Adaptive Peak Picking Engine (APPE) built on the Identifier/Validator/Quantitator pipeline will be implemented on an Agilent 6500 Q-TOF LC-MS/MS system. An advantage of the 6500 is the high accuracy at which both MS and MS/MS spectra are obtained, which greatly enhances the reliability of pattern matching via Identifier. A second advantage is the control software for which Agilent holds a patent. One can adapt their control software by exploiting machine-controlled ion selection during data acquisition but replacing database-driven peptide identification with our much faster and more robust APPE. One can build on prior simulations by performing experiments with real-world samples to directly compare performance with peptide and/or protein exclusion. Using samples of known composition such as the Universal Proteomics Standard 48-human protein mix (Sigma-Aldrich), and running the 6550 with our software, one can optimize protein exclusion parameters to obtain the greatest yield and dynamic range of proteins. In effect, the complex simulation environment (Fig. 16) already models the real-time application of the software, as the algorithm analyzes spectra on-the-fly as they are streamed from the simulated dataset.

[00229] One can focus more heavily on inclusion lists, including cross-referencing likely proteins with interaction databases. One can also mine existing data and perform extensive control runs to create a list of proteins that are not detected because they are theoretical, expressed at very low abundance, or are tissue-restricted. The removal of these proteins decreases the combed-out region in the exclusion model.

[00230] As outlined above, proteomic techniques have been applied to study the differences in cell lines in response to treatment with immunomodulatory agents and proteasome inhibitors. A phosphoproteomic analysis of myeloma cells using SILAC revealed only 233 quantified phosphoproteins, of which 72 demonstrated differential expression after bortezomib treatment. One site on the protein stathmin was found to be phosphorylated in response to bortezomib therapy. When genomic studies failed to predict response to bortezomib in MM patients, a proteomics approach was employed to identify potential biomarkers in bortezomib-sensitive and bortezomib-resistant MM cell lines, leading to the identification of the MARCKS protein as associated with bortezomib resistance. A differential proteomic analysis of multiple myeloma plasma cells from patients revealed several proteins, including TXNDC5, were expressed at different levels in patients who had responded to frontline therapy versus those who did not. Cereblon is a recently-identified thalidomide target that appears to be required for toxicity and is depleted in thalidomide- and lenalidomide-resistant cells. Cereblon appears to play a role in the proteasome system, but its function remains unclear. Patient samples with known response status to thalidomide were subjected to 2-D difference gel electrophoresis and five differentially expressed proteins were identified by mass spectrometry (Thermo LTQ). One can use the system to define the proteins found to be differentially expressed and predictive of outcome in response to immunomodulatory agents and proteasome inhibitors. By focusing on these and other proteins that have been found to be differentially expressed, one can home in on dysregulated pathways and altered protein interactions that may herald disease progression.

[00231] Algorithms for real-time biologic pathway analysis with rapid knowledge integration to improve peptide selection during mass spectrometry. Develop a complex query engine (Interrogator) capable of interrogating multiple knowledgebases simultaneously to make real-time predictions to inform protein inclusion and exclusion. The ability to exclude large numbers of peptides from selection and fragmentation results in an order-of-magnitude increase in dynamic range in simulations. One can use real-time interrogation of other system-wide datasets in order to predict which other proteins are likely to be seen during the run. One can build on the model of dynamic inclusion and exclusion lists of peptides that informs the mass spectrometer during the precursor scan in order to mask out or preferentially select certain peptides for fragmentation. One can increase the dynamic range of peptide detection. Whereas certain embodiments of the invention use exclusion and inclusion based on gene sequence-inferred proteome data, other embodiments extend this paradigm to include information about other relevant pathways and interactions that might otherwise be overlooked. This level of complex orthogonal analysis facilitates the reporting of a far richer set of data than current methods. In addition to the peptides and proteins identified, an integrated system can report enriched and modulated pathways, likely protein-protein interactions, and other system-wide information not otherwise easily accessible or readily apparent.

[00232] Stevens and his group at Argonne Laboratory have developed a large-scale metabolic dataset for bacteria, the National Microbial Pathogen Data Resource (NMPDR), a database of curated annotations for comparative analysis of genomes and biological subsystems. The NMPDR and its successors, the PubSEED and PATRIC, contain complete and whole genome shotgun (WGS) genomes of over 3900 bacteria to support extensive comparative analysis. The underlying Sprout Database, which supports these systems, includes extensive cross-reference data and contains 34.6 billion characters of information in 2.8 Gb of search indices. Most recently, Dr. Stevens has led the Model SEED project, in which the group developed a system for the automated generation of metabolic models from genomic data. Currently, the database contains over 3000 public models and 15,000 private models from over 1900 bacterial species. Part of this initiative was the development of the RAST server (Rapid Annotations using Subsystems Technology), an automated service for annotating bacterial genomes, identifying protein-encoding and RNA genes, assigning functions to the genes, and predicting subsystem representation (FIG. 12). RAST has been used to annotate over 40,000 genomes since 2007 and has 12,000 registered users. The Stevens group is now extending the metabolic pathway, regulatory network, and signaling pathway databases developed in the SEED project to include eukaryotic proteins and pathways. This work is being done as part of the systems biology knowledge base project. With these extensions one can create estimates of the biochemical and cellular context for a given protein and focus the search on those candidates most likely to be co-occurring in the sample. The Stevens group can now compute a protein family's co-occurrence likelihood table that estimates the probability of observing one protein given the presence of other proteins. This co-occurrence table can be input to the improved search algorithm.
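
The specification does not state how the co-occurrence likelihood table is computed. As one hedged possibility, a pairwise table could be estimated from the protein content of previously characterized samples by simple conditional frequency, as in the sketch below. The function name and the input format (a list of sets of protein identifiers) are assumptions made for this example.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence_table(samples):
    """Estimate P(b observed | a observed) for every ordered protein pair.

    samples: list of sets, each holding the protein ids seen in one sample.
    Returns a dict mapping (a, b) -> fraction of samples containing a
    that also contain b. Pairs never seen together are absent.
    """
    count = defaultdict(int)        # number of samples containing a
    pair_count = defaultdict(int)   # number of samples containing both a and b
    for proteins in samples:
        for a in proteins:
            count[a] += 1
        for a, b in permutations(proteins, 2):
            pair_count[(a, b)] += 1
    return {(a, b): n / count[a] for (a, b), n in pair_count.items()}


table = cooccurrence_table([{"A", "B"}, {"A", "B", "C"}, {"A"}, {"C"}])
```

In this toy input, protein B never appears without A, so the estimate for A given B is 1.0, while B given A is lower because A also occurs alone.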

[00233] One can develop hypothesis-generation algorithms to create the queries from mass spectrometry data, as they are accumulated. Concurrently, the Interrogator query-engine can interact with the data in a protein co-occurrence table. The process can be iterative in that as proteins are identified, possible and probable interactions and pathways are established and the peptide inclusion and exclusion lists will grow and shrink as appropriate. Tools to query existing literature and other protein interaction databases in order to further inform peptide selection can be used. One can use simultaneous genomic search during peptide acquisition and identification.
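
The iterative behavior described above, in which the inclusion list grows and shrinks as proteins are identified, can be illustrated with the following minimal sketch. It assumes a pairwise co-occurrence table keyed by (given protein, candidate protein) and a hypothetical probability cutoff of 0.5. Neither the function name nor the cutoff comes from the specification, and the actual Interrogator query engine is not described at this level of detail.

```python
def update_inclusion(identified, table, cutoff=0.5):
    """Return proteins predicted to co-occur with the identified proteins.

    identified: set of protein ids already identified in the run.
    table: dict mapping (given, candidate) -> estimated P(candidate | given).
    A candidate is included if any identified protein predicts it with
    probability at or above the cutoff. Identified proteins are omitted,
    since they no longer need to be targeted.
    """
    scores = {}
    for (given, candidate), p in table.items():
        if given in identified and candidate not in identified:
            scores[candidate] = max(scores.get(candidate, 0.0), p)
    return {c for c, p in scores.items() if p >= cutoff}


# Hypothetical table values, for illustration only.
table = {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.6, ("D", "E"): 0.99}
first_pass = update_inclusion({"A"}, table)
second_pass = update_inclusion({"A", "B"}, table)
```

Once B is identified it leaves the inclusion list, and C enters because B predicts it more strongly than A does.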

[00234] Some embodiments specifically involve identifying bacteria in a biological sample from a nonbacterial organism. In certain embodiments, the nonbacterial organism is a mammal or human, and the bacterium may be one that is pathogenic. In some cases, the bacterium is one that is not part of the human biota and/or is not commensal to humans. In further embodiments, the bacterium is one that is considered pathogenic to the organism being tested or one whose presence can be identified from a background of the organism's proteome. Proteomes of different bacterial strains and/or species differ enough, and the resulting fragmented peptides differ enough, that once peptides from a given branch of the tree have been identified, one no longer needs to look for those peptides, i.e., they can be masked out, increasing the S/N for the remainder. If the bacteria are nonpathogenic, they can be screened out from the start. FIGs. 18 and 19 illustrate how the uniqueness of the bacterial peptides allows the bacteria to be identified.
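
As a rough sketch of the masking idea in the preceding paragraph, the following example finds peptides unique to each taxon, calls a taxon identified when enough of its unique peptides are observed, and then masks all of that taxon's peptides. The function names, the flat taxon-to-peptide map (standing in for a branch of the tree), and the minimum of two unique peptides are assumptions for illustration. They are not drawn from the specification or from FIGs. 18 and 19.

```python
def unique_peptides(taxon_peptides):
    """Map each taxon to the peptides that occur in no other taxon."""
    unique = {}
    for taxon, peptides in taxon_peptides.items():
        others = set().union(
            *(p for t, p in taxon_peptides.items() if t != taxon)
        )
        unique[taxon] = set(peptides) - others
    return unique


def identify_and_mask(observed, taxon_peptides, min_unique=2):
    """Identify taxa from unique peptides and return the peptides to mask.

    observed: set of peptide sequences identified so far.
    Returns (identified taxa, peptides that no longer need to be selected).
    """
    unique = unique_peptides(taxon_peptides)
    identified = {t for t, u in unique.items()
                  if len(u & observed) >= min_unique}
    masked = set().union(*(taxon_peptides[t] for t in identified))
    return identified, masked


# Hypothetical peptide labels, for illustration only.
taxa = {
    "taxon_1": {"p1", "p2", "p3", "shared"},
    "taxon_2": {"q1", "q2", "shared"},
}
identified, masked = identify_and_mask({"p1", "p2", "shared"}, taxa)
```

The shared peptide alone cannot identify either taxon, but once taxon_1 is called from its unique peptides, the shared peptide is masked along with the rest of that taxon's peptides.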

REFERENCES

The following references and any others listed herein, to the extent that they provide exemplary procedural or other details supplementary to those set forth herein, are specifically incorporated herein by reference in their entirety.

Aebersold & Mann, Nature. 422(6928): 198-207, 2003.

Abba, et al, Mol. Cancer Res. 5(9):881-90, 2007.

Amanchy, et al, J Proteome Res. 4(5):1661-71, 2005.

Amanchy, et al, Sci Stke. 2005(267):P12, 2005.

Andersen & Mann, Embo Rep. 7(9):874-9, 2006.

Bailey, et al, Instant Spectral Assignment For Advanced Decision Tree-Driven Mass Spectrometry. PNAS USA, 2012.

Barnes, et al, Mol Cancer Ther. 2(4):345-51, 2003.

Biemann, Biomed Environ Mass Spectrom. 16(1-12):99-111, 1988.

Bogdan, et al, Bioinformatics. 24(13):1498-502, 2008.

Bonardi, et al, Appl Radiat Isot. 57(5):647-55, 2002.

Bonenfant, et al, PNAS USA. 100(3):880-5, 2003.

Chan-Tompkins, Crit Care Nurs Q. 34(2):87-100, 2011.

Conrads, et al, Anal Chem. 72(14):3349-54, 2000.

Cox & Mann, J Am Soc Mass Spectrom. 20(8):1477-85, 2009.

Cox & Mann, Cell. 130(3):395-8, 2007.

Craig, Bioinformatics. 20(9):1466-7, 2004.

Deutsch, et al, Physiol Genomics. 33(1):18-25, 2008.

Duncan, et al, J Proteome Res. 4(5):1842-7, 2005.

Fenyo & Beavis, Analytical Chemistry. 75(4):768-74, 2003.

Figueiredo, Shock. 30(Suppl 1):23-9, 2008.

Fluit, et al, Clin. Infect. Dis. 30(3):454-60, 2000.

Fluit, et al, Int. J. Infect. Dis. 3(3):153-6, 1999.

Fluit, et al, Int. J. Antimicrob. Agents. 18(2):147-60, 2001.

Frank, et al, J Proteome Res. 6(1):114-23, 2007.

Giamarellou, Int. J. Antimicrob. Agents. 36:S50-4, 2010.

Gorshkov & Zubarev, Rapid Commun Mass Spectrom. 19(24):3755-8, 2005.

Gras & Müller, Computational Aspects Of Protein Identification By Mass Spectrometry. Curr Opin Mol Ther. 3(6):526-32, 2001.

Graumann, et al, Molecular & Cellular Proteomics. 11(3):M111.013185, 2012.

Gygi, et al, Mol Cell Biol 19(3):1720-30, 1999.

Heller, et al, J Am Soc Mass Spectrom. 14(7):704-18, 2003.

Henry, et al, Nat Biotechnol 28(9):969-74, 2010.

Hoopmann, et al, J Proteome Res. 8(4):1870-5, 2009.

Horn, et al, PNAS USA. 97(19):10313-7, 2000.

Hunt, et al, PNAS USA. 83(17):6233-7, 1986.

Kallen, et al, Infect Control Hosp Epidemiol. 31(5):528-31, 2010.

Kallen, et al, Infect Control Hosp Epidemiol. 31(S1):S51-4, 2010.

Kok, et al, Plos One. 6(8):E23285, 2011.

Kristjansdottir, et al, J Proteome Res. 7(7):2812-24, 2008.

Kumarasamy, et al, The Lancet Infectious Diseases. 10(9):597-602, 2010.

Lin, Biochimica Et Biophysica Acta (BBA) - Proteins & Proteomics. 1646(1-2):1-10, 2003.

Liu, et al, Accurate Mass Measurements In Proteomics. Chem Rev. 2007.

Ludu, et al, J. Bacteriol. 190(13):4584-95, 2008.

Mann & Kelleher, Proc Natl Acad Sci USA. 105(47):18132-8, 2008.

Mann, Nat Rev Mol Cell Biol. 7(12):952-8, 2006.

Mason, et al, Mol Cell Proteomics. 6(2):305-18, 2007.

Mayampurath, et al, Bioinformatics. 24(7):1021-3, 2008.

Mcgrath & Asmar, Indian J Pediatr. 78(2):176-84, 2010.

Mcneil, et al, Nucleic Acids Res. 35(Database):D347-53, 2007.

Meyer, et al, Bmc Bioinformatics. 9(1):386, 2008.

Miyagi & Rao, Mass Spectrom Rev. 26(1):121-36, 2007.

Nesvizhskii, et al, Analytical Chemistry. 75(17):4646-58, 2003.

Nesvizhskii, et al, Nat Meth. 4(10):787-97, 2007.

Nolting, Isotope Pattern Calculator. Http://Sourceforge.Net/Projects/Isotopatcalc/. Sourceforge.Net, 2005.

Ong, et al, Mol Cell Proteomics. 1(5):376-86, 2002.

Ong, et al, Methods. 29(2):124-30, 2003.

Ong, et al, Nat Chem Biol. 1(5):252-62, 2005.

Ong, et al, Methods Mol Biol. 359:37-52, 2007.

Pasa-Tolic, et al, Biotechniques. 37(4):621-4, 6-33, 2004.

Perez, et al, Journal Of Antimicrobial Chemotherapy. 65(8):1807-18, 2010.

Perkins, et al, Electrophoresis. 20(18):3551-67, 1999.

Pfeifer, et al, International Journal Of Medical Microbiology. 300(6):371-9, 2010.

Ramos-Fernandez, et al, Mol Cell Proteomics. 6(7):1274-86, 2007.

Rappsilber & Mann, Trends Biochem Sci. 27(2):74-8, 2002.

Scoble & Martin, Meth Enzymol. 193:519-36, 1990.

Smith, et al., Proteomics. 2:513-23, 2002.

Stewart, et al, Rapid Commun Mass Spectrom. 15(24):2456-65, 2001.

Strittmatter, et al, J Am Soc Mass Spectrom. 14(9):980-91, 2003.

Takao, et al, Rapid Commun Mass Spectrom. 5(7):312-5, 1991.

Ulintz, et al, Mol Cell Proteomics. 5(3):497-509, 2006.

Volchenboum, et al, Molecular & Cellular Proteomics. 8(8):2011-22, 2009.

Wang, et al, J Proteome Res. 5(5):1214-23, 2006.

Wang, et al, Rapid Commun Mass Spectrom. 24(12):1791-8, 2010.

Woodford, et al, FEMS Microbiology Reviews. 35(5):736-55, 2011.

Yates, et al, Analytical Chemistry. 67(8):1426-36, 1995.
