Title:
METHODS AND SYSTEMS FOR ANALYSIS
Document Type and Number:
WIPO Patent Application WO/2019/175568
Kind Code:
A1
Abstract:
A method of analysing a structure of a composition of matter in a sample comprising: a) obtaining a plurality of tandem mass spectra derived from a first parent ion of a first $m_p/z_p$; b) dividing each spectrum into a plurality of m/z bins; c) determining a covariance or a partial covariance between different bins across the plurality of spectra and correlating the fluctuations of measured intensities in each bin; d) determining a statistical significance of each correlation to identify one or more true ion correlation peaks; e) obtaining a plurality of ion fragmentation patterns for one or more candidate parent ions; f) comparing the true ion correlation peaks to the candidate parent ion fragmentation patterns to determine if the candidate parent ion and the first parent ion are the same.

Inventors:
EDELSON-AVERBUKH MARINA (GB)
AVERBUKH VITALI (GB)
DRIVER TARAN (GB)
AYERS RUTH (GB)
FRASINSKI LESZEK (GB)
KLUG DAVID (GB)
MARANGOS JON (GB)
Application Number:
PCT/GB2019/050690
Publication Date:
September 19, 2019
Filing Date:
March 12, 2019
Assignee:
IMPERIAL COLLEGE SCI TECH & MEDICINE (GB)
IMPERIAL INNOVATIONS LTD (GB)
IP2IPO INNOVATIONS LTD (GB)
International Classes:
H01J49/00
Domestic Patent References:
WO2018051120A2    2018-03-22
Foreign References:
US6017693A    2000-01-25
Other References:
LESZEK J FRASINSKI: "Covariance mapping techniques", JOURNAL OF PHYSICS B, ATOMIC MOLECULAR AND OPTICAL PHYSICS, INSTITUTE OF PHYSICS PUBLISHING, BRISTOL, GB, vol. 49, no. 15, 5 July 2016 (2016-07-05), pages 152004, XP020307485, ISSN: 0953-4075, [retrieved on 20160705], DOI: 10.1088/0953-4075/49/15/152004
Attorney, Agent or Firm:
CURTIS, Simon Paul (GB)
Claims:
1. A method of analysing a structure of a composition of matter in a sample comprising:

a) obtaining a plurality of tandem mass spectra derived from a first parent ion of a first $m_p/z_p$;

b) dividing each spectrum into a plurality of m/z bins;

c) determining a covariance or a partial covariance between different bins across the plurality of spectra and correlating the fluctuations of measured intensities in each bin;

d) determining a statistical significance of each correlation to identify one or more true ion correlation peaks;

e) obtaining a plurality of ion fragmentation patterns for one or more candidate parent ions;

f) comparing the true ion correlation peaks to the candidate parent ion fragmentation patterns to determine if the candidate parent ion and the first parent ion are the same.

2. A method according to claim 1, wherein the first parent ion is or is derived from a biological sample.

3. A method according to claim 1 or claim 2, wherein the one or more candidate parent ions are selected so as to have m/z within less than 1 Da of the first $m_p/z_p$.

4. A method according to any of claims 1 to 3 wherein step (c) comprises determining the covariance of different m/z bins according to the formula:

$$\mathrm{Cov}(Y, X) = \langle YX \rangle - \langle Y \rangle \langle X \rangle,$$

where $\langle \cdots \rangle$ represents an average over the plurality of spectra and where $X$ and $Y$ each represent spectrum intensity for each bin.

5. A method according to any of claims 1 to 3 wherein step (c) comprises determining a control parameter or parameters indicative of synchronised fluctuations in signal intensity across some or all bins, resulting in universal correlation between said bins; and determining the partial covariance pCov(X, Y; I) according to the equation:

$$\mathrm{pCov}(X, Y; I) = \mathrm{Cov}(X, Y) - \mathrm{Cov}(X, I)\,\mathrm{Cov}(I^{T}, I)^{-1}\,\mathrm{Cov}(I^{T}, Y),$$

where $I$ represents a length $k$ row vector in which each element is one of the $k$ ($k \geq 1$) fluctuating parameters, the matrix $\mathrm{Cov}(I^{T}, I)^{-1}$ is the inverse of the $k \times k$ dispersion matrix of parameters $I$, $X$ and $Y$ each represent the spectrum intensity for each bin, and

$$\mathrm{Cov}(Y, X) = \langle YX \rangle - \langle Y \rangle \langle X \rangle,$$

where $\langle \cdots \rangle$ represents an average over the plurality of spectra.

6. A method according to any preceding claim further comprising two-dimensional mapping the covariance or partial covariance between said different bins of the spectra.

7. A method according to claim 6 in which mapping the partial covariance comprises two-dimensional mapping the correlation of the fluctuation of intensities in the spectra, the correlation being corrected according to the values of the control parameters.

8. A method according to any preceding claim in which determining a statistical significance of each peak or bin comprises computing a statistical significance S(X, Y) according to the equation

$$S(X, Y) = \frac{V[\mathrm{pCov}(X, Y; I)]}{\sigma(V)}$$

or

$$S(X, Y) = \frac{V[\mathrm{Cov}(X, Y)]}{\sigma(V)}$$

where $V$ is a volume under a covariance or partial covariance peak or a volume of a section of the covariance function Cov(X, Y) or the partial covariance function pCov(X, Y; I), and $\sigma(V)$ comprises a measure of the variance of the volume under the peak or the variance of a volume under the section, for example under jackknife resampling.

9. A method according to any of claims 4 to 8 in which determining a statistical significance of each peak or bin comprises computing a statistical significance S(X, Y) according to the equation

$$S(X, Y) = \frac{\mathrm{pCov}(X, Y; I)}{\sigma(\mathrm{pCov}(X, Y; I))}$$

or

$$S(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma(\mathrm{Cov}(X, Y))}$$

where pCov(X, Y; I) or Cov(X, Y) is the value of the partial covariance or covariance respectively between bin X and bin Y, or a measure of the combined partial covariance or covariance between bin or bins X and bin or bins Y, and $\sigma(\mathrm{pCov}(X, Y; I))$ or $\sigma(\mathrm{Cov}(X, Y))$ comprises a measure of the variance of the value of the partial covariance or covariance between bins X and Y, or a measure of the variance of a measure of the combined partial covariance or covariance between bin or bins X and bin or bins Y.

10. A method according to any of claims 4 to 9 in which the control parameters comprise an operating parameter or parameters of the apparatus generating the data sets and/or one or more measures of the experimental conditions under which the plurality of spectra was generated, for example mechanical, electrical, chemical, magnetic, optical and/or thermal conditions.

11. A method according to claim 10 wherein the control parameter comprises a measure of any of the following operating parameters: ion current for each spectrum; a total number of ions generated for each spectrum; a total number of ions subjected to analysis for each spectrum; a measure of intensity over one or more parts of the spectrum; a prescan ion current; a relative sample density in a mass analyser; a pressure of gas in an ion trap, ion guide and/or collision cell; a rate of flow of ions into a mass analyser; an intensity and/or pulse duration of ionising radiation; electrospray ionisation capillary voltage; rf and dc voltages applied to an ion trap; ion trap q-value; a voltage applied to one or more of a tube lens, gate lens, focusing lens, ion tunnel or multipole ion guide of the mass spectrometer; a time for which a voltage is applied to one or more of a tube lens, gate lens, focusing lens, ion tunnel or multipole ion guide of the mass spectrometer.

12. A method according to any of claims 4 to 11 in which the control parameter comprises a measure of intensity of at least a selected portion of each of the spectra.

13. A method according to any of claims 4 to 12 in which the control parameters are derived from an integration over at least a portion of each spectrum, for example an integration of the spectrum at one or more m/z values or an integration of the spectrum across all detected m/z values.

14. A method according to any preceding claim wherein step (d) comprises ranking the statistical significance of each spectral correlation relative to the most statistically significant correlation peak.

15. A method according to claim 14 wherein the ranking provides information indicative of the probability of a covariance or partial covariance signal representing a true correlation between fragment ions, a true correlation between fragment ions providing information indicative of the origin of one or more daughter or granddaughter ions or decomposition products of such ions.

16. A method according to any preceding claim wherein step (f) comprises determining a similarity score between the first parent ion and the one or more candidate ions.

17. A method according to claim 16 wherein step (f) comprises determining a similarity score between the first parent ion and a plurality of candidate ions and wherein one of the candidate parent ions is determined to be the most likely identity of the first parent ion on the basis of the similarity score.

18. A method according to claim 16 or claim 17 wherein the calculation of the similarity score comprises classifying at least some of the true ion correlation peaks within one or more ion correlation classifications selected from a list comprising:

(i) complementary ion correlations;

(ii) correlations involving internal ions;

(iii) correlations between non-complementary terminal ions;

(iv) correlations involving ions formed by neutral losses.

19. A method according to any preceding claim wherein the candidate ion fragmentation pattern of step (f) comprises an in silico simulation of one or more covariance or partial covariance peaks to provide one or more true candidate ion correlation peaks.

20. A method according to claim 19 wherein the calculation of the or a similarity score comprises classifying at least some of the true candidate ion correlation peaks within one or more ion correlation classifications selected from a list comprising:

(i) complementary ion correlations;

(ii) correlations involving internal ions;

(iii) correlations between non-complementary terminal ions;

(iv) correlations involving ions formed by neutral losses.

21. A method according to claim 20 wherein the calculation of the similarity score comprises: initialising (e.g. at zero) a similarity score for each of the ion correlation classifications; for at least one of the ion correlation classifications of true candidate ion correlation peaks, identifying true ion correlation peaks of m/z within ±X Da of a candidate ion correlation peak within that classification, such peaks representing a correlation match; and, for each correlation match, incrementing the similarity score for that ion classification.

22. A method according to claim 21 wherein the similarity score is incremented by a set value for each correlation match.

23. A method according to claim 21 wherein the similarity score is incremented by a weighted value for each correlation match.

24. A method according to claim 23 wherein the weighted value is a function of the statistical significance of the true ion correlation peak, the height of the true ion correlation peak and/or the volume of the true ion correlation peak.

25. A method according to any of claims 20 to 24 comprising repeating the method of claims 20 to 24 for a plurality of (preferably all) true candidate ion correlation peaks within a plurality of (preferably all) candidate ion correlation classifications.

26. A method according to claim 25 comprising calculating a parent ion similarity score by calculating a weighted sum of the similarity scores for each ion classification, whereby each of the ion classifications is ascribed an individual weight in the sum.

27. A method according to claim 26 wherein the ion classifications carrying the greatest weight in the parent ion similarity score are internal ion correlations.

28. A method according to any of claims 26 or 27 wherein the method is repeated for a plurality of candidate parent ions and the candidate parent ion having the highest parent ion similarity score is determined to comprise the most likely structure of the first parent ion.

29. A method according to any preceding claim wherein the candidate parent ion is or is derived directly or indirectly from a molecule selected from a database of molecular (e.g. biomolecular) structures.

30. A method according to claim 29 wherein the candidate parent ion is derived from an in silico digest of the selected molecule.

31. A method according to claim 29 or 30 wherein the database is derived from the application of one or more de novo sequencing algorithms to mass spectrometry data relating to the sample.

32. A method according to any of claims 29 to 31 wherein the database of biomolecular structures contains structural information relating to one or more of proteins, peptides, DNA sequences, RNA sequences, lipids or metabolites.

33. A method according to any of claims 29 to 32 wherein the candidate parent ion is selected from the database or the in silico digest by identifying ions having an m/z within 10 ppm of the first parent ion.

34. A method according to any preceding claim wherein the first parent ion is a peptide ion derived from a first peptide.

35. A method according to claim 34 wherein the first peptide is obtained by in vitro digestion of a protein, for example by means of a tryptic digest, prior to analysis by mass spectrometry.

36. A method according to any of claims 1 to 33 wherein the first parent ion is a DNA or RNA oligomer ion.

37. A method according to claim 36 wherein the DNA or RNA oligomer ion is derived from a DNA or RNA oligomer obtained from a larger DNA or RNA molecule, for example by digestion of the larger molecule.

38. A method according to any of claims 1 to 33 wherein the first parent ion is a lipid ion.

39. A method according to claim 6 or claim 7 wherein the mapping comprises plotting a map of $m_x/z_x$ against $m_y/z_y$ against the value of the covariance or partial covariance or statistical significance.

40. A method according to claim 39 wherein all correlations located on the map on at least one mass conservation line of equation $z_x \cdot (m_x/z_x) + z_y \cdot (m_y/z_y) = m_i$ are identified as fragment-complementary ions, where $m_i$ is less than or equal to $m_p$.

41. A method according to claim 40 wherein one or more mass conservation lines (e.g. where $m_i/z_i$ is less than $m_p/z_p$) are identified using a Hough transform technique.

42. A method according to claim 40 or claim 41 wherein $m_i$ is equal to $m_p$, and correlations falling on mass conservation lines where $z_x \cdot (m_x/z_x) + z_y \cdot (m_y/z_y) = m_j$ and $m_j$ is less than $m_p$ are identified as complementary ion pairs where one or both of the detected correlated ions has undergone a neutral or charged loss.

43. A method according to any of claims 40 to 42 wherein, where the correlations are located on a line of equation $z_a \cdot (m_a/z_a) + z_b \cdot (m_b/z_b) + m_i = m_j$, $m_i$ is identified as the total mass of the neutral or charge loss fragment(s) from one or both of the complementary ion pair.

44. A method according to any of claims 39 to 43 comprising identifying, for at least one correlation of $m_1/z_1$ and $m_2/z_2$, at least one correlation of a third mass to charge ratio $m_3/z_3$ and $m_1/z_1$, up to an m/z tolerance of Y Da.

45. A method according to claim 44, wherein the difference between $z_3 \cdot (m_3/z_3)$ and $z_2 \cdot (m_2/z_2)$ is determined to be indicative of a loss from the larger of $m_3$ and $m_2$; and/or wherein, where the charge states $z_2$ and $z_3$ are measured or assumed to be equal, the corresponding loss is identified as a neutral loss.

46. A method according to claim 45 comprising identifying the structure of the neutral loss from the larger of $m_2$ and $m_3$ by the magnitude of the difference between $m_3$ and $m_2$, the difference being $m_n$ and being indicative of expected neutral loss structures for a class of molecule under analysis.

47. A method according to claim 45 wherein the expected neutral loss structures comprise one or more structures selected from the group:

Ammonia, water, CO, HPO3, H3PO4, cytosine, uracil, thymine, adenine, guanine.

48. A method according to claim 46 or 47 comprising identifying the charge state of the complementary ions by determining the ratio of the mass of the expected neutral loss structure to $|m_2/z_2 - m_3/z_3|$.

49. A method according to any of claims 40 to 48 comprising identifying the charge state of a pair of correlated complementary ions having mass to charge ratios $(m/z)_x$ and $(m/z)_y$ by a method comprising determining the gradient $G$ of a mass conservation line on which the complementary ions are correlated, the gradient being related to the charge state according to the formula $G = -z_x/z_y$.

50. A method according to claim 49 wherein at least one of $z_x$ and $z_y$ is determined by deriving the ratio of $z_x$ to $z_y$ by:

identifying a mass conservation line upon which the two m/z values $(m/z)_x$ and $(m/z)_y$ are correlated by the method of claim 41;

finding the gradient $G$ of the mass conservation line;

finding the ratio of $z_x$ to $z_y$ using the formula $G = -z_x/z_y$;

using the ratio of $z_x$ to $z_y$ to determine absolute values of $z_x$ and $z_y$ by finding values that satisfy $z_x + z_y \leq z_p$, where $z_p$ is the experimentally measured or estimated parent ion charge.

51. A method according to claim 49 wherein at least one of $z_x$ and $z_y$ is determined by deriving the ratio of $z_x$ to $z_y$ by:

identifying a mass conservation line upon which the two m/z values $(m/z)_x$ and $(m/z)_y$ are correlated using initial knowledge of the parent ion mass $m_p$ and charge state $z_p$;

finding the gradient $G$ of the mass conservation line;

finding the ratio of $z_x$ to $z_y$ using the formula $G = -z_x/z_y$;

using the ratio of $z_x$ to $z_y$ to determine absolute values of $z_x$ and $z_y$ by finding values that satisfy $z_x + z_y \leq z_p$, where $z_p$ is the experimentally measured or estimated parent ion charge.

52. A method according to any of claims 40 to 51 comprising identifying complementary ions derived from different parent ions of substantially identical mass to charge ratio (e.g. structural isomers).

53. A method according to claim 52 wherein the identification of complementary ions derived from different parent ions comprises

identifying the smallest likely mass difference $m_{min}$ between two fragment complementary ion masses for a class of biomolecule under analysis,

identifying where 3 or more correlation peaks are present on the mass conservation line within $m_{min}$.

54. A method according to claim 53 wherein the biomolecule under analysis is a protein or peptide and the smallest likely fragment ion is a glycine residue.

55. A method according to any preceding claim wherein the plurality of mass spectra are each derived from a plurality of simultaneously analysed parent ions of different m/z.

56. A method according to claim 55 wherein correlations located on the map on a mass conservation line of equation $z_x \cdot (m_x/z_x) + z_y \cdot (m_y/z_y) = m_p$ are identified as complementary ions and wherein a different mass conservation line is identified for each parent ion of different m/z.

57. A method of analysing a structure of a composition of matter in a sample comprising:

a) obtaining a plurality of tandem mass spectra derived from a first parent ion of a first $m_p/z_p$;

b) dividing each spectrum into a plurality of m/z bins;

c) determining a covariance or a partial covariance of different bins across the plurality of spectra and correlating the fluctuations of measured intensities in each bin;

d) determining a statistical significance of each correlation to identify one or more true ion correlation peaks;

e) two-dimensionally mapping the covariance or partial covariance between said different bins of the spectra;

f) identifying as fragment complementary ions all correlations located on the map on at least one mass conservation line of equation $z_x \cdot (m_x/z_x) + z_y \cdot (m_y/z_y) = z_i \cdot (m_i/z_i)$, where $m_i/z_i$ is less than or equal to $m_p/z_p$.

58. A method according to claim 57 wherein one or more mass conservation lines (e.g. where $m_i/z_i$ is less than $m_p/z_p$) are identified using a Hough transform technique.

59. A method according to claim 57 or claim 58 wherein $m_i$ is equal to $m_p$, and correlations falling on mass conservation lines where $z_x \cdot (m_x/z_x) + z_y \cdot (m_y/z_y) = m_j$ and $m_j$ is less than $m_p$ are identified as complementary ion pairs where one or both of the detected correlated ions has undergone a neutral or charged loss.

60. A method according to any of claims 57 to 58 wherein, where the correlations are located on a line of equation $z_a \cdot (m_a/z_a) + z_b \cdot (m_b/z_b) + m_i = m_j$, $m_i$ is identified as the total mass of the neutral or charge loss fragment(s) from one or both of the complementary ion pair.

61. A method according to any of claims 57 to 60 comprising identifying, for at least one correlation of $m_1/z_1$ and $m_2/z_2$, at least one correlation of a third mass to charge ratio $m_3/z_3$ and $m_1/z_1$, up to an m/z tolerance of Y Da.

62. A method according to claim 61, wherein the difference between $z_3 \cdot (m_3/z_3)$ and $z_2 \cdot (m_2/z_2)$ is determined to be indicative of a loss from the larger of $m_3$ and $m_2$; and/or wherein, where the charge states $z_2$ and $z_3$ are measured or assumed to be equal, the corresponding loss is identified as a neutral loss.

63. A method according to claim 62 comprising identifying the structure of the neutral loss from the larger of $m_2$ and $m_3$ by the magnitude of the difference between $m_3$ and $m_2$, the difference being $m_n$ and being indicative of expected neutral loss structures for a class of molecule under analysis.

64. A method according to claim 63 wherein the expected neutral loss structures comprise one or more structures selected from the group:

Ammonia, water, CO, HPO3, H3PO4, cytosine, uracil, thymine, adenine, guanine.

65. A method according to claim 63 or 64 comprising identifying the charge state of the complementary ions by determining the ratio of the mass of the expected neutral loss structure to $|m_2/z_2 - m_3/z_3|$.

66. A method according to any of claims 57 to 65 comprising identifying the charge state of a pair of correlated complementary ions having mass to charge ratios $(m/z)_x$ and $(m/z)_y$ by a method comprising determining the gradient $G$ of a mass conservation line on which the complementary ions are correlated, the gradient being related to the charge state according to the formula $G = -z_x/z_y$.

67. A method according to claim 66 wherein at least one of $z_x$ and $z_y$ is determined by deriving the ratio of $z_x$ to $z_y$ by:

identifying a mass conservation line upon which the two m/z values $(m/z)_x$ and $(m/z)_y$ are correlated by the method of claim 41;

finding the gradient $G$ of the mass conservation line;

finding the ratio of $z_x$ to $z_y$ using the formula $G = -z_x/z_y$;

using the ratio of $z_x$ to $z_y$ and the relative rates of change of $(m/z)_x$ and $(m/z)_y$ along the mass conservation line to determine the absolute values of $z_x$ and $z_y$.

68. A method according to claim 66 wherein at least one of $z_x$ and $z_y$ is determined by deriving the ratio of $z_x$ to $z_y$ by:

identifying a mass conservation line upon which the two m/z values $(m/z)_x$ and $(m/z)_y$ are correlated using initial knowledge of the parent ion mass $m_p$ and charge state $z_p$;

finding the gradient $G$ of the mass conservation line;

finding the ratio of $z_x$ to $z_y$ using the formula $G = -z_x/z_y$;

determining the absolute values of $z_x$ and $z_y$ by solving the equation $z_x + z_y = z_p$.

69. A method according to any of claims 57 to 68 comprising identifying complementary ions derived from different parent ions of substantially identical mass to charge ratio (e.g. structural isomers).

70. A method according to claim 69 wherein the identification of complementary ions derived from different parent ions comprises

identifying the smallest likely mass difference $m_{min}$ between two fragment complementary ion masses for a class of biomolecule under analysis,

identifying where 3 or more correlation peaks are present on the mass conservation line within $m_{min}$.

71. A method according to claim 70 wherein the biomolecule under analysis is a protein or peptide and the smallest likely fragment ion is a glycine residue.

72. A method according to any of claims 57 to 71 wherein the plurality of mass spectra are each derived from a plurality of simultaneously analysed parent ions of different m/z.

73. A method according to claim 72 wherein correlations located on the map on a mass conservation line of equation $z_x \cdot (m_x/z_x) + z_y \cdot (m_y/z_y) = m_p$ are identified as complementary ions and wherein a different mass conservation line is identified for each parent ion of different m/z.

74. A method according to any preceding claim comprising determining the volume of a first correlation peak for a first pair of detected ions and deriving from the volume of the first correlation peak information relating to the concentration of the first pair of ions or any parent ions thereof in the sample.

75. A method according to claim 74 wherein the relative concentration of the first pair of ions or parent ions thereof is determined by comparing the volume of the first correlation peak to the volume of one or more other correlation peaks.

76. A method according to claim 74 or claim 75 wherein the absolute concentration of the first pair of ions or parent ions thereof is determined by comparing the volume of the first correlation peak to the volume of a peak derived from one or more standards (e.g. internal standards) of known concentration.

77. A method of sequencing a biomolecule comprising performing a method according to any preceding claim.

78. A method according to claim 77 wherein the biomolecule is selected from a protein, peptide, nucleotide, DNA, RNA, lipid, metabolite.

79. A method according to claim 77 or 78 wherein the biomolecule is subject to enzymatic digestion before performing the method of claims 1 to 76.

80. Computer software configured to perform a method according to any preceding claim.

81. A hardware module configured to perform the method of any of claims 1 to 76.

82. A mass spectrometry system comprising a hardware module according to claim 81.
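By way of illustration only, and without limitation, the partial covariance recited in claim 5 might be sketched in Python with NumPy as follows. The function names, the simulated data and the choice of a separately recorded scaling factor as the single control parameter are assumptions made for this sketch and are not taken from the claims.

```python
import numpy as np

def cov(a, b):
    """Cov(A, B) = <A^T B> - <A>^T <B>; rows are spectra, columns are variables."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = a.shape[0]
    return a.T @ b / n - np.outer(a.mean(axis=0), b.mean(axis=0))

def partial_covariance_map(spectra, controls):
    """pCov(X, Y; I) = Cov(X, Y) - Cov(X, I) Cov(I^T, I)^-1 Cov(I^T, Y).

    spectra:  (n_scans, n_bins) binned intensities.
    controls: (n_scans,) or (n_scans, k) fluctuating control parameters, k >= 1.
    """
    s = np.asarray(spectra, dtype=float)
    i = np.asarray(controls, dtype=float).reshape(len(s), -1)
    c_si = cov(s, i)                      # (n_bins, k)
    c_ii = cov(i, i)                      # (k, k) dispersion matrix
    correction = c_si @ np.linalg.solve(c_ii, c_si.T)
    return cov(s, s) - correction

# Simulated data (hypothetical): a common scan-to-scan factor `scale`
# multiplies every bin. Bins 0 and 1 share a true common origin `a`;
# bin 2 comes from an unrelated source `b`.
rng = np.random.default_rng(0)
n_scans = 5000
scale = rng.uniform(0.5, 1.5, n_scans)
a = rng.normal(10.0, 1.0, n_scans)
b = rng.normal(10.0, 1.0, n_scans)
spectra = np.column_stack([scale * a, 0.5 * scale * a, scale * b])

plain = cov(spectra, spectra)                      # simple covariance map
partial = partial_covariance_map(spectra, scale)   # corrected for `scale`
```

In this toy case the simple covariance between bins 0 and 2 is large solely because both bins follow the common factor, whereas the partial covariance between them is close to zero; the correlation between bins 0 and 1 survives the correction.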

Description:
Methods and Systems for Analysis

The present invention relates to methods of analysing chemical and/or biological samples to determine the structure of one or more of the component parts of the sample. The present invention also relates to systems and apparatus for performing such methods.

Laboratory analytical techniques and equipment have advanced to such a degree of power and sensitivity that they are able to produce large quantities of data when analysing even relatively simple samples. The challenge for the analyst is then in reviewing this data to derive useful information about the sample. This is particularly the case when the sample under analysis is a complex mixture of materials and/or represents a sample of biological origin, such as one or more proteins or peptides, nucleic acids, lipids or metabolites.

Mass spectrometry (MS) is a popular and effective method for analysing the structure of biomolecules such as proteins, nucleic acids, lipids or metabolites. The major applications of biomolecular MS in clinical biology are in protein studies (proteomics). The primary aim of proteomic MS analysis is to establish the sequence of the biomolecular building blocks (i.e. amino acids) and the covalent post-translational modifications (PTMs) of particular proteins. To do so, the biomolecules are typically first cut into smaller fragments, e.g. by acting on proteins with enzymes to obtain peptides. The peptides are then sent to a mass spectrometer via soft ionisation techniques [such as electrospray ionisation (ESI) or matrix assisted laser desorption-ionisation (MALDI)], where, in tandem mass spectrometry (MS/MS), they are fragmented even further (often by collision induced dissociation, CID) to obtain more detailed information on the peptide structure. Finally, the spectral information is pieced together to deduce the full structure of the original biomolecule.

The crucial step of the protein MS workflow is the deduction of the protein amino acid sequence and its possible PTMs from the tandem mass spectra. This task can be accomplished using a range of algorithms that rely either on matching the measured spectra to “theoretical” (expected) ones obtained from a combination of protein sequence databases and a set of standard generalised peptide fragmentation rules, on matching the acquired MS/MS spectra to spectral libraries, or on performing a first-principles structural reconstruction using the measured spectrum and the fragmentation rules only (so-called de novo algorithms). Whichever method is chosen for the data interpretation, normally only up to 60% of the measured fragment mass spectra are successfully interpreted and matched to the correct peptide and protein sequences. The high rate of spectrum-to-structure assignment failure is a result of the strong variability of the peptide fragmentation patterns, which are affected by the particular amino acid sequence, presence of PTMs, peptide length etc., frequently leading to highly complicated fragmentation patterns and considerable deviation of experimental mass spectra from those predicted by the simplified peptide fragmentation rules. In addition, many of the key canonical peptide fragments (e.g. b- and y-type ions for CID) which appear at very low relative abundances are often missed during the mass spectra matching procedures, causing false assignments even of mass spectra displaying standard fragmentation patterns. Finally, the standard MS analysis is compromised by false identifications or unresolvable ambiguities for the peptide structure, caused by fragments matching isobaric (within the applied mass tolerance) and isomeric ions of incorrect structures, even when measured with an arbitrarily high mass accuracy. This significantly limits the analytical capability of state-of-the-art MS.

These issues are amplified in the analysis of, for example, nucleic acids, which, in their unmodified form, consist of combinations of four fundamental bases (A, C, G and T for DNA and A, C, G and U for RNA) linked by symmetric phosphodiester bonds. Here the limited nature of the constituent parts increases the chance of generating isobaric fragments.

To alleviate this ambiguity, mass resolution has been increased far beyond integer mass-to-charge (m/z) ratio to identify atoms of MS fragments from their accurate masses, e.g. to tell apart two nitrogen atoms (28.006 Da) from one carbon and one oxygen atom (27.995 Da). Although this high resolution MS considerably reduces the number of prospective hits (e.g. peptide sequence options in the case of proteomics) generated by matching the experimental data to databases or deriving sequences directly from the mass spectra (such as in de novo sequencing), it does not by itself provide any experimental evidence for the origin of the biomolecular fragments, which is inferred from the observed mass-to-charge ratios. This leads to multiple false positive/negative results in the identification of fragments characterised by highly accurate m/z, significantly limiting the capability and reliability of biomolecular structural analysis using MS.
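By way of illustration only, the arithmetic behind the nitrogen versus carbon monoxide example can be checked with a short Python sketch. The monoisotopic masses below are standard reference values rounded to six decimal places; the resolving-power estimate m/Δm is an assumption introduced here as a rough indicator and does not form part of the present disclosure.

```python
# Monoisotopic masses in Da (standard reference values, rounded).
MASS = {"C": 12.000000, "N": 14.003074, "O": 15.994915}

n2 = 2 * MASS["N"]                 # two nitrogen atoms
co = MASS["C"] + MASS["O"]         # one carbon and one oxygen atom
delta = n2 - co                    # about 0.0112 Da

# Rough resolving power (m / delta m) needed to separate the two,
# on the order of a few thousand at m/z 28 for singly charged species.
resolving_power = n2 / delta

print(f"N2 = {n2:.3f} Da, CO = {co:.3f} Da, difference = {delta:.4f} Da")
print(f"approximate resolving power needed: {resolving_power:.0f}")
```

The two compositions share the same integer mass and differ only in the second decimal place, which is why accurate-mass measurement is needed to distinguish them.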

Fragment mass spectra of biomolecules commonly display signals of unusual origin caused by the strong dependence of the fragmentation patterns on amino acid sequence, peptide length, charge state, modifying groups and other factors. A significant proportion of these fragment ions are not identified or are not correctly interpreted, frequently causing spectrum-to-structure matching to fail. Furthermore, low relative peak intensities (“relative abundances”) and poor signal-to-noise ratios of the standard fragments of well-known origin are also very common, which leads to them being missed by the existing MS interpretation algorithms. Indeed, any interpretation of a mass spectrum must employ some form of threshold on the relative abundance of spectral peaks, below which those peaks within a certain mass-to-charge (m/z) range are not taken into consideration. As a result of the finite signal-to-noise ratio, structure determining algorithms work using a limited number of spectral signals (‘good peaks’), which are currently defined purely on the grounds of relative intensities. If low-intensity MS peaks bearing crucial structural information are not taken into consideration, the algorithm will produce a low-confidence structural assignment that eventually leads to a lack of successful spectral interpretation and correct sample component analysis. Increasing the mass accuracy and mass resolution of the MS instruments can only partially solve this problem by resolving overlapping signals and reducing the number of candidate fragment ions for relatively intense “unknown” fragments. So, to an extent, can alternative fragmentation techniques, by causing different fragmentation patterns. However, a general solution for the poor interpretation efficiency of tandem mass spectra, independent of the fragment origin complexity or fragment ion relative abundance, is clearly lacking.

It is therefore desired to provide means for analysing data such as that produced by a mass spectrometer to reduce wasted data sets and/or to more accurately and reliably determine the structure of compounds under analysis.

It is also desired that the means for analysing data produced by a mass spectrometer can be deployed to sequence biomolecules such as proteins, nucleic acids, lipids and metabolites and their constituent parts.

In a first aspect of the invention, there is provided a method of analysing a structure of a composition of matter in a sample comprising:

a) obtaining a plurality of tandem mass spectra derived from a first parent ion of a first $m_p/z_p$;

b) dividing each spectra into a plurality of m/z bins;

c) determining a covariance or a partial covariance between different bins across the plurality of spectra and correlating the fluctuations of measured intensities in each bin;

d) determining a statistical significance of each correlation to identify one or more true ion correlation peaks;

e) obtaining a plurality of ion fragmentation patterns for one or more candidate parent ions;

f) comparing the true ion correlation peaks to the candidate parent ion fragmentation patterns to determine if the candidate parent ion and the first parent ion are the same.
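By way of non-limiting illustration, the binning of step (b) might be sketched in Python as follows. The helper name `bin_spectrum`, the use of equal-width bins and the choice to sum the intensities falling within each bin are assumptions made for this sketch only and are not prescribed by the method.

```python
import numpy as np

def bin_spectrum(mz, intensity, mz_min, mz_max, bin_width):
    """Sum peak intensities into equal-width m/z bins (hypothetical helper)."""
    n_bins = int(round((mz_max - mz_min) / bin_width))
    edges = np.linspace(mz_min, mz_max, n_bins + 1)
    # Each bin receives the summed intensity of the peaks whose m/z falls in it.
    binned, _ = np.histogram(mz, bins=edges, weights=intensity)
    return edges, binned

# Illustrative values only, not measured data.
edges, binned = bin_spectrum([100.2, 100.7, 101.5, 103.9],
                             [1.0, 2.0, 3.0, 4.0],
                             mz_min=100.0, mz_max=104.0, bin_width=1.0)
```

Applying the same bin edges to every spectrum yields one row per scan, which is the layout assumed for the covariance calculation of step (c).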

The term “candidate parent ion” refers to experimentally or in silico derived ions of known m/z and a known or derivable chemical structure. Such ions may, for example, be derived from a structure of a protein or peptide listed in a database or otherwise in literature.

Preferably, the first parent ion is or is derived from a biological sample.

In some embodiments, the one or more candidate parent ions may be selected so as to have m/z within a small m/z tolerance (e.g. less than 1 Da) of the first $m_p/z_p$.

In some embodiments, step (c) comprises determining the covariance of different m/z bins according to the formula:

$$\mathrm{Cov}(Y, X) = \langle YX \rangle - \langle Y \rangle \langle X \rangle,$$

where $\langle \ldots \rangle$ represents an average over the plurality of spectra and where X and Y each represent spectral intensity for each bin.
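Purely as an illustrative sketch, the covariance between every pair of bins may be computed in a single matrix operation. The function name `covariance_map` and the array layout (one binned spectrum per row) are assumptions of this example; the toy intensities are invented.

```python
import numpy as np

def covariance_map(spectra):
    """Cov(Y, X) = <YX> - <Y><X> for every pair of m/z bins.

    spectra: array of shape (n_scans, n_bins), one binned spectrum per row.
    """
    s = np.asarray(spectra, dtype=float)
    n_scans = s.shape[0]
    mean = s.mean(axis=0)               # <X> for each bin
    mean_product = s.T @ s / n_scans    # <YX> for each pair of bins
    return mean_product - np.outer(mean, mean)

# Toy data: bins 0 and 1 rise and fall together, bin 2 is constant.
toy = np.array([[1.0, 2.0, 5.0],
                [3.0, 6.0, 5.0],
                [2.0, 4.0, 5.0]])
cov = covariance_map(toy)
```

In this toy case the element `cov[0, 1]` is positive, whereas the elements involving the constant bin vanish.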

Preferably, step (c) comprises determining a control parameter or parameters indicative of synchronised fluctuations in signal intensity across some or all bins, resulting in universal correlation between said bins; and determining the partial covariance pCov(X, Y; I) according to the equation:

$$\mathrm{pCov}(X, Y; I) = \mathrm{Cov}(X, Y) - \mathrm{Cov}(X, I)\,\mathrm{Cov}(I^{T}, I)^{-1}\,\mathrm{Cov}(I^{T}, Y)$$

where I represents a length k row vector in which each element is one of the k (k ≥ 1) fluctuating parameters and the matrix $\mathrm{Cov}(I^{T}, I)^{-1}$ is the inverse of the k × k dispersion matrix of parameters I, and X and Y each represent the spectrum intensity for each bin, where $\langle \ldots \rangle$ represents an average over the plurality of spectra.
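A minimal sketch of the partial covariance calculation for k control parameters is given below for illustration. The function name `partial_covariance_map` and the array shapes are assumptions of this example, and a linear solve is used in place of an explicit matrix inverse as a numerical convenience.

```python
import numpy as np

def partial_covariance_map(spectra, params):
    """pCov(X, Y; I) for all bin pairs.

    spectra: (n_scans, n_bins) binned intensities.
    params:  (n_scans,) or (n_scans, k) control parameters I, one row per scan.
    """
    s = np.asarray(spectra, dtype=float)
    p = np.asarray(params, dtype=float)
    if p.ndim == 1:
        p = p[:, None]                  # treat a single parameter as k = 1
    n = s.shape[0]
    ds = s - s.mean(axis=0)
    dp = p - p.mean(axis=0)
    cov_xy = ds.T @ ds / n              # Cov(X, Y), shape (n_bins, n_bins)
    cov_xi = ds.T @ dp / n              # Cov(X, I), shape (n_bins, k)
    cov_ii = dp.T @ dp / n              # k x k dispersion matrix of I
    # Cov(X, I) Cov(I^T, I)^-1 Cov(I^T, Y), computed without forming the inverse.
    correction = cov_xi @ np.linalg.solve(cov_ii, cov_xi.T)
    return cov_xy - correction

# Sanity check on random data: using bin 0 itself as the control parameter
# removes every correlation involving bin 0.
rng = np.random.default_rng(0)
demo = rng.random((50, 4))
pcov_demo = partial_covariance_map(demo, demo[:, 0])
```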

Preferably, the method comprises the two-dimensional mapping of the covariance or partial covariance between said different bins of the spectra. This mapping may include the preparation of a plot or graphical representation of the covariance or partial covariance or, for example, an equivalent numerical representation of that data such that it may be processed and/or interpreted by a user, e.g. with a computer.

Preferably, the mapping of the partial covariance comprises two-dimensional mapping of the correlation between the fluctuations of intensities in the spectra, the correlation being corrected according to the values of the control parameters.

In some embodiments, the method includes the determination of a statistical significance of each peak or bin and comprises computing a statistical significance S(X, Y) according to the equation

$$S(X, Y) = \frac{V[\mathrm{pCov}(X, Y; I)]}{\sigma(V)}$$

or

$$S(X, Y) = \frac{V[\mathrm{Cov}(X, Y)]}{\sigma(V)}$$

where V is a volume under a covariance or partial covariance peak or a volume of a section of the covariance function Cov(X, Y) or the partial covariance function pCov(X, Y; I), and σ(V) comprises a measure of the variance of the volume under the peak or the variance of a volume under the section, for example under jackknife resampling.

In some embodiments, the method includes the determination of a statistical significance of each peak or bin and comprises computing a statistical significance S(X, Y) according to the equation

$$S(X, Y) = \frac{\mathrm{pCov}(X, Y; I)}{\sigma(\mathrm{pCov}(X, Y; I))}$$

or

$$S(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma(\mathrm{Cov}(X, Y))}$$

where pCov(X, Y; I) or Cov(X, Y) is the value of the partial covariance or covariance respectively between bin X and bin Y or a measure of the combined partial covariance or covariance between bin or bins X and bin or bins Y, and σ(pCov(X, Y; I)) or σ(Cov(X, Y)) comprises a measure of the variance of the value of the partial covariance or covariance between bins X and Y or a measure of the variance of a measure of the combined partial covariance or covariance between bin or bins X and bin or bins Y, for example under jackknife resampling.
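One possible realisation of such a significance is sketched below for a single pair of bins. It assumes, for illustration only, that the measure σ is the leave-one-out jackknife standard error of the covariance; the function names and the simulated intensities are hypothetical.

```python
import numpy as np

def cov_pair(x, y):
    """Covariance <XY> - <X><Y> between two bins."""
    return np.mean(x * y) - np.mean(x) * np.mean(y)

def jackknife_significance(x, y, stat=cov_pair):
    """S = statistic / jackknife standard error (one assumed choice of sigma)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    full = stat(x, y)
    keep = ~np.eye(n, dtype=bool)       # row i omits scan i
    loo = np.array([stat(x[k], y[k]) for k in keep])
    sigma = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))
    return full / sigma

# Simulated intensities: x and y share a common fluctuation, z does not.
rng = np.random.default_rng(0)
common = rng.normal(size=200)
x = common + 0.1 * rng.normal(size=200)
y = common + 0.1 * rng.normal(size=200)
z = rng.normal(size=200)
s_true = jackknife_significance(x, y)
s_false = jackknife_significance(x, z)
```

The same function could be applied to a partial covariance or to a peak volume by passing a different `stat` callable.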

Preferably, the control parameters comprise an operating parameter or parameters of the apparatus generating the data sets and/or one or more measures of the experimental conditions under which the plurality of spectra was generated, for example mechanical, electrical, chemical, magnetic, optical and/or thermal conditions.

The control parameter may comprise a measure of any of the following operating parameters: ion current for each spectrum; a total number of ions generated for each spectrum; a total number of ions subjected to analysis for each spectrum, a measure of intensity over one or more parts of the spectrum; a prescan ion current; a relative sample density in a mass analyser; a pressure of gas in an ion trap, ion guide and/or collision cell; a rate of flow of ions into a mass analyser; an intensity and/or pulse duration of ionising radiation; electrospray ionisation capillary voltage; rf and dc voltages applied to an ion trap; ion trap q-value; a voltage applied to one or more of a tube lens, gate lens, focusing lens, ion tunnel or multipole ion guide of the mass spectrometer, a time for which a voltage is applied to one or more of a tube lens, gate lens, focusing lens, ion tunnel or multipole ion guide of the mass spectrometer.

Preferably, the control parameter comprises a measure of intensity of at least a selected portion of each of the spectra.

In some embodiments, the control parameters are derived from an integration over at least a portion of each spectrum, for example an integration of the spectrum at one or more m/z values or an integration of the spectrum across all detected m/z values.

In some embodiments, following the step (c) the covariance or partial covariance between bins Y and X corresponding to m/z values which are separated by less than a predefined m/z value (e.g. 5 Da) may be neglected (e.g. set to zero), because this value represents the structurally uninformative autocorrelation of a spectral signal with itself.
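For illustration, the suppression of the autocorrelation region may be sketched as a simple mask over the map. The function name `suppress_autocorrelation` and the representation of each bin by its centre m/z are assumptions of this example; the 5 Da default follows the exemplary value given above.

```python
import numpy as np

def suppress_autocorrelation(cov_map, mz_centres, min_separation=5.0):
    """Set to zero all map elements whose two m/z values are closer than
    min_separation (in Da); returns a modified copy of the map."""
    mz = np.asarray(mz_centres, dtype=float)
    too_close = np.abs(mz[:, None] - mz[None, :]) < min_separation
    out = np.array(cov_map, dtype=float, copy=True)
    out[too_close] = 0.0
    return out

# Illustrative 3-bin map with bin centres at 100, 102 and 110 Da.
raw_map = np.ones((3, 3))
clean_map = suppress_autocorrelation(raw_map, [100.0, 102.0, 110.0])
```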

Step (d) may comprise ranking the statistical significance of each spectral correlation relative to the most statistically significant correlation peak. Preferably, the ranking provides information indicative of the probability of a covariance or partial covariance signal representing a true correlation between fragment ions, a true correlation between fragment ions providing information indicative of the origin of one or more daughter or granddaughter ions or decomposition products of such ions.

In preferred embodiments, step (f) comprises determining a similarity score between the first parent ion and the one or more candidate ions. Step (f) may also comprise determining a similarity score between the first parent ion and a plurality of candidate ions and wherein one of the candidate parent ions is determined to be the most likely identity of the first parent ion on the basis of the similarity score.

The calculation of the similarity score may comprise classifying at least some of the true ion correlation peaks within one or more ion classifications selected from a list comprising:

(i) complementary ion correlations;

(ii) correlations involving internal ions

(iii) correlations involving non-complementary terminal ions;

(iv) correlations involving ions formed by neutral losses.

In some embodiments, the candidate ion fragmentation pattern of step (f) comprises an in silico simulation of a correlation represented by one or more covariance or partial covariance peaks to provide one or more true candidate ion correlation peaks. Preferably, the calculation of the or a similarity score comprises classifying at least some of the true candidate ion correlation peaks within one or more ion classifications selected from a list comprising:

(i) complementary ion correlations;

(ii) correlations involving internal ions

(iii) correlations involving non-complementary terminal ions;

(iv) correlations involving ions formed by neutral losses.

In some embodiments, the calculation of the similarity score comprises: initialising (e.g. at zero) a similarity score for each of the ion classifications; for at least one of the ion classifications of true candidate ion correlation peaks, identifying true ion correlation peaks of m/z within ±X Da of a candidate ion correlation peak within that classification, such peaks representing a correlation match; and, for each correlation match, incrementing the similarity score for that ion classification. X is preferably less than 3 Da and may be less than 2 Da or 1 Da, for example around 0.8 Da.

The similarity score may be incremented by a set value for each correlation match. Alternatively, the similarity score may be incremented by a weighted value for each correlation match. The weighted value may be a function of the statistical significance of the true ion correlation peak, the height of the true ion correlation peak and/or the volume of the true ion correlation peak. Preferably, the method includes repeating the calculation of the similarity score for a plurality of (preferably all) true candidate ion correlation peaks within a plurality of (preferably all) candidate ion correlation classifications.

Preferably, the method comprises calculating a parent ion similarity score by calculating a weighted sum of the similarity scores for each ion classification, whereby each of the ion classifications is ascribed an individual weight in the sum. It is preferred that the ion classifications carrying the greatest weight in the parent ion similarity score are internal ion correlations. Preferably, the method is repeated for a plurality of candidate parent ions and the candidate parent ion having the highest parent ion similarity score is determined to comprise the most likely structure of the first parent ion.
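The scoring scheme described above may be sketched as follows. All names, m/z values, significances and class weights are hypothetical and chosen only to make the example self-contained; the sketch further assumes that each candidate correlation peak contributes at most one increment and that a match requires both m/z coordinates to agree within the tolerance.

```python
def _ordered(a, b):
    """Order a pair so that (x, y) and (y, x) compare equal on a symmetric map."""
    return (a, b) if a <= b else (b, a)

def class_score(measured, candidate, tol=0.8, weighted=False):
    """measured: list of (mz_x, mz_y, significance); candidate: list of (mz_x, mz_y)."""
    score = 0.0                                   # initialise at zero
    for cand in candidate:
        c_lo, c_hi = _ordered(*cand)
        hits = []
        for mx, my, significance in measured:
            m_lo, m_hi = _ordered(mx, my)
            if abs(m_lo - c_lo) <= tol and abs(m_hi - c_hi) <= tol:
                hits.append(significance)
        if hits:                                  # correlation match
            score += max(hits) if weighted else 1.0
    return score

def parent_ion_score(measured, candidate_by_class, class_weights, tol=0.8):
    """Weighted sum of the per-classification similarity scores."""
    return sum(class_weights[name] * class_score(measured, peaks, tol)
               for name, peaks in candidate_by_class.items())

# Hypothetical measured correlation peaks and two hypothetical candidates.
measured = [(300.2, 500.1, 10.0), (150.0, 420.3, 4.0), (200.0, 250.0, 1.0)]
weights = {"complementary": 1.0, "internal": 2.0, "neutral_loss": 0.5}
candidates = {
    "candidate_A": {"complementary": [(300.0, 500.0)],
                    "internal": [(420.0, 150.1)],
                    "neutral_loss": [(700.0, 800.0)]},
    "candidate_B": {"complementary": [(310.0, 505.0)],
                    "internal": [],
                    "neutral_loss": []},
}
scores = {name: parent_ion_score(measured, pattern, weights)
          for name, pattern in candidates.items()}
best = max(scores, key=scores.get)
```

Here the internal ion classification is given the largest weight, in line with the stated preference; the numerical weights themselves are arbitrary.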

The candidate parent ion may be derived directly or indirectly from a molecule selected from a database of molecular (e.g. biomolecular) structures. The candidate parent ion may be derived from an in silico digest of the selected molecule. The database may be derived from the application of one or more de novo sequencing algorithms to mass spectrometry data relating to the sample. Such algorithms may include those described in e.g. Ma, B. et al., Rapid Communications in Mass Spectrometry, 17(20):2337-42, 2003; He, L. et al., Journal of Bioinformatics and Computational Biology, 8(06):981-994, 1/12/2012; or Johnson, R.S. and J.A. Taylor, (2002) Mol Biotechnol, 22(3): p. 301-15.

The database of biomolecular structures preferably contains structural information relating to one or more of proteins, peptides, DNA sequences, RNA sequences, lipids or metabolites.

Preferably, the candidate parent ion is selected from the database or from the in silico digest by identifying ions having an m/z within 10 ppm of the first parent ion.
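A parts-per-million filter of this kind may be sketched as below. The function names and the candidate list are hypothetical, and the relative deviation is taken with respect to the measured m/z, which is one common convention.

```python
def within_ppm(candidate_mz, measured_mz, tol_ppm=10.0):
    """True if candidate_mz lies within tol_ppm parts per million of measured_mz."""
    return abs(candidate_mz - measured_mz) / measured_mz * 1e6 <= tol_ppm

def select_candidates(candidates, measured_mz, tol_ppm=10.0):
    """candidates: iterable of (name, mz) pairs; returns the names that pass."""
    return [name for name, mz in candidates
            if within_ppm(mz, measured_mz, tol_ppm)]

# Hypothetical candidate ions: deviations of about 8, 12 and 8 ppm respectively.
pool = [("A", 500.004), ("B", 500.006), ("C", 499.996)]
selected = select_candidates(pool, measured_mz=500.0)
```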

In some embodiments, the first parent ion is a peptide ion derived from a first peptide. It may be obtained by in vitro digestion of a protein, for example by means of a tryptic digest, prior to analysis by mass spectrometry.

In other embodiments, the first parent ion is a DNA or RNA oligomer ion. The DNA or RNA oligomer ion may be derived from a DNA or RNA oligomer obtained from a larger DNA or RNA molecule, for example by digestion of the larger molecule. In a further embodiment, the first parent ion is a lipid ion or a metabolite ion.

In preferred embodiments, the mapping comprises plotting a map of $m_x/z_x$ against $m_y/z_y$ against intensity. All correlations located on the map on at least one mass conservation line of equation $z_x \cdot m_x/z_x + z_y \cdot m_y/z_y = z_j \cdot m_j/z_j$ may be identified as fragment-complementary ions, where $z_j \cdot m_j/z_j$ is less than or equal to $z_p \cdot m_p/z_p$.

Fragment-complementary ions are understood to refer to fragment ions which derive directly from the same parent, whether or not this parent constitutes the whole peptide molecular ion or its daughter, granddaughter, great granddaughter etc.
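As a hedged illustration of the mass conservation condition for the case where the two fragments derive directly from the parent ion, the sketch below tests whether a correlated pair of m/z values can lie on the parent mass conservation line for some split of the parent charge. The function name, the tolerance and the numerical values are assumptions of this example, and the quantity z·(m/z) is used exactly as in the line equation above.

```python
def complementary_charge_assignments(mz_x, mz_y, mz_p, z_p, tol=1.0):
    """Return all (z_x, z_y) with z_x + z_y = z_p for which
    z_x*(m/z)_x + z_y*(m/z)_y equals z_p*(m/z)_p within tol."""
    hits = []
    for z_x in range(1, z_p):
        z_y = z_p - z_x
        if abs(z_x * mz_x + z_y * mz_y - z_p * mz_p) <= tol:
            hits.append((z_x, z_y))
    return hits

# Invented values: a triply charged parent at m/z 500 and two correlated fragments.
on_line = complementary_charge_assignments(525.0, 450.0, mz_p=500.0, z_p=3)
off_line = complementary_charge_assignments(525.0, 300.0, mz_p=500.0, z_p=3)
```

A pair for which the list is empty does not lie on the parent line and, under this sketch, would be examined against lines of lower total mass.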

In a further aspect, the invention provides a method of analysing a structure of a composition of matter in a sample comprising:

a) obtaining a plurality of tandem mass spectra derived from a first parent ion of a first $m_p/z_p$;

b) dividing each spectra into a plurality of m/z bins;

c) determining a covariance or a partial covariance of different bins across the plurality of spectra and correlating the fluctuations of measured intensities in each bin;

d) determining a statistical significance of each correlation to identify one or more true ion correlation peaks;

e) two-dimensionally mapping the covariance or partial covariance between said different bins of the spectra;

identifying as fragment-complementary ions all correlations located on the map on at least one mass conservation line of equation $z_x \cdot m_x/z_x + z_y \cdot m_y/z_y = z_j \cdot m_j/z_j$, where $z_j \cdot m_j/z_j$ is less than or equal to $z_p \cdot m_p/z_p$.

While it may be desirable to know or derive the charge states of the ions in the correlations, it is not necessary in order to determine the position of the mass conservation line(s).

One or more mass conservation lines (e.g. where $z_j \cdot m_j/z_j$ is less than or equal to $z_p \cdot m_p/z_p$) may be identified using a Hough transform function.

Where $m_j$ is equal to $m_p$, correlations falling below the mass conservation line (e.g. on mass conservation lines where $z_x \cdot m_x/z_x + z_y \cdot m_y/z_y = m_j$ and $m_j$ is less than $m_p$) may be identified as complementary ion pairs where one or both of the detected correlated ions has undergone a neutral or charged loss. Where the correlations are located on a line of equation $z_a \cdot m_a/z_a + z_b \cdot m_b/z_b + m_i = m_j$, $m_i$ may be identified as the total mass of the neutral or charge loss fragment(s) from one or both of the complementary ion pair.

In some embodiments the method may comprise identifying for each correlation of $m_1/z_1$ and $m_2/z_2$ at least one correlation of a third mass to charge ratio $m_3/z_3$ and $m_1/z_1$, up to an m/z tolerance of Y Da. Y may be predetermined, e.g. according to the type of molecule under analysis. For example, Y may be between 1 Da and 200 Da.

The difference between $z_3 \cdot m_3/z_3$ and $z_2 \cdot m_2/z_2$ is determined to be indicative of a loss from the larger of $m_3$ and $m_2$, and/or, where the charge states $z_2$ and $z_3$ are measured or assumed to be equal, the corresponding loss is identified as a neutral loss. The structure of the neutral loss may be identified from the larger of $m_2/z_2$ and $m_3/z_3$ by the magnitude of the difference between $z_3 \cdot m_3/z_3$ and $z_2 \cdot m_2/z_2$, the difference being $m_n$ and being indicative of expected neutral loss structures for a class of molecule under analysis.

The expected neutral loss structures may comprise one or more structures selected from the group:

ammonia, water, CO, HPO3, H3PO4, cytosine, uracil, thymine, adenine, guanine. Such structures may in some embodiments be identified by their accurate mass. Exemplary accurate masses (in Da) as neutral losses include:

NH3: 17.026549

H2O: 18.010565

CO: 27.994915

HPO3: 79.966333

H3PO4: 97.976898

Adenine: 135.054495

Cytosine: 111.043262

Guanine: 151.049410

Thymine: 126.042927

Uracil: 112.027277
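By way of example only, a neutral loss may be looked up from the exemplary accurate masses listed above as sketched below. The function name, the 0.01 Da tolerance and the assumption that both ions share a known charge state z are choices made for this illustration.

```python
# Exemplary accurate masses (Da) taken from the list above.
NEUTRAL_LOSSES = {
    "NH3": 17.026549, "H2O": 18.010565, "CO": 27.994915,
    "HPO3": 79.966333, "H3PO4": 97.976898,
    "adenine": 135.054495, "cytosine": 111.043262,
    "guanine": 151.049410, "thymine": 126.042927, "uracil": 112.027277,
}

def identify_neutral_loss(mz_a, mz_b, z, tol=0.01):
    """Names of listed losses whose mass matches z * |mz_a - mz_b| within tol Da,
    assuming both ions carry the same charge z."""
    delta = abs(mz_a - mz_b) * z
    return [name for name, mass in NEUTRAL_LOSSES.items()
            if abs(mass - delta) <= tol]

# Invented doubly charged pair separated by half the mass of water.
loss = identify_neutral_loss(500.0, 500.0 - 18.010565 / 2, z=2)
```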

The method may also include identifying the charge state of the complementary ions by determining the ratio of the mass of the expected neutral loss structure to $|m_3/z_3 - m_2/z_2|$. The method may include identifying the charge state of a pair of correlated complementary ions having mass to charge ratios $(m/z)_x$ and $(m/z)_y$ by a method comprising determining the gradient G of a mass conservation line on which the complementary ions are correlated, the gradient being related to the charge state according to the formula $G = -z_x/z_y$.

In some embodiments, at least one of $z_x$ and $z_y$ is determined by deriving the ratio of $z_x$ to $z_y$ by:

identifying a mass conservation line upon which the two m/z values $(m/z)_x$ and $(m/z)_y$ are correlated using a Hough transform technique;

finding the gradient G of the mass conservation line;

finding the ratio of $z_x$ to $z_y$ using the formula $G = -z_x/z_y$;

using the ratio of $z_x$ to $z_y$ to determine absolute values of $z_x$ and $z_y$ by finding values that satisfy $z_x + z_y \le z_p$, where $z_p$ is the experimentally measured or estimated parent ion charge.

In some embodiments, at least one of $z_x$ and $z_y$ is determined by deriving the ratio of $z_x$ to $z_y$ by:

identifying a mass conservation line upon which the two m/z values $(m/z)_x$ and $(m/z)_y$ are correlated using initial knowledge of the parent ion mass $m_p$ and charge state $z_p$;

finding the gradient G of the mass conservation line;

finding the ratio of $z_x$ to $z_y$ using the formula $G = -z_x/z_y$;

using the ratio of $z_x$ to $z_y$ to determine absolute values of $z_x$ and $z_y$ by finding values that satisfy $z_x + z_y \le z_p$, where $z_p$ is the experimentally measured or estimated parent ion charge.
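The relation between gradient and charge follows from rearranging the line equation as $(m/z)_y = M/z_y - (z_x/z_y)\,(m/z)_x$, where M denotes the conserved total. The final step of the procedures above may then be sketched as an enumeration of integer charge pairs. The function name and the 5% relative tolerance on the ratio are assumptions of this illustration; more than one pair may be returned when the ratio alone does not fix the absolute charges.

```python
def charges_from_gradient(gradient, z_p, rel_tol=0.05):
    """All integer (z_x, z_y) with z_x + z_y <= z_p whose ratio z_x/z_y
    matches -gradient within a relative tolerance."""
    ratio = -gradient
    matches = []
    for z_x in range(1, z_p):
        for z_y in range(1, z_p - z_x + 1):
            if abs(z_x / z_y - ratio) <= rel_tol * ratio:
                matches.append((z_x, z_y))
    return matches

unique = charges_from_gradient(-2.0, z_p=3)      # only one pair is possible
ambiguous = charges_from_gradient(-1.0, z_p=4)   # the ratio alone is not enough
```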

In some embodiments, the method comprises identifying complementary ions derived from different parent ions of substantially identical mass to charge ratio (e.g. structural isomers). Preferably, the identification of complementary ions derived from different parent ions comprises:

identifying the smallest likely mass difference $m/z_{min}$ between two complementary ion masses for a class of biomolecule under analysis;

identifying where 3 or more correlation peaks are present on the mass conservation line within the range of $m/z_{min}$.

Preferably, $m/z_{min}$ is the molecular mass of the smallest likely sequence fragment divided by the charge state of the 3 or more correlation peaks, preferably where the charge state is determined according to the method described above. Where a biomolecule under analysis is a protein or peptide, the smallest likely sequence fragment ion may be a glycine amino acid.

In some embodiments, the plurality of mass spectra are derived from a plurality of parent ions of different m/z. Correlations located on the map on a mass conservation line of equation $z_x \cdot m_x/z_x + z_y \cdot m_y/z_y = z_p \cdot m_p/z_p$ may be identified as complementary ions, a different mass conservation line being identified for each parent ion of different $m_p/z_p$. In some embodiments, such as those where a Hough transform is utilised to identify one or more mass conservation lines, the parent ions may be ions derived from two or more complete biomolecules (e.g. proteins). Such methods allow for the deconvolution of complex spectra obtained by top down mass spectrometry.

In some embodiments, the method comprises determining the volume of a first correlation peak for a first pair of detected ions and deriving from the volume of the first correlation peak information relating to the concentration of the first pair of ions or any parent ions thereof in the sample.

In another aspect, the invention provides a method of analysing a structure of a composition of matter in a sample comprising:

a) obtaining a plurality of tandem mass spectra derived from a first parent ion of a first $m_p/z_p$;

b) dividing each spectra into a plurality of m/z bins;

c) determining a covariance or a partial covariance of different bins across the plurality of spectra and correlating the fluctuations of measured intensities in each bin;

d) determining a statistical significance of each correlation to identify one or more true ion correlation peaks;

e) two-dimensionally mapping the covariance or partial covariance between said different bins of the spectra;

f) determining the volume of a first correlation peak for a first pair of detected ions and deriving from the volume of the first correlation peak information relating to the concentration of the first pair of ions or any parent ions thereof in the sample.

The relative concentration of the first pair of ions or parent ions thereof may be determined by comparing the volume of the first correlation peak to the volume of one or more other correlation peaks. The absolute concentration of the first pair of ions or parent ions thereof may be determined by comparing the volume of the first correlation peak to the volume of a peak derived from one or more standards (e.g. internal standards) of known concentration.
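The comparison of peak volumes may be sketched as follows. This example assumes, purely for illustration, that the peak volume is the sum of the map values over a rectangular window multiplied by the bin area and that the volume scales linearly with concentration; neither assumption is asserted to hold in general, and all names and numbers are hypothetical.

```python
import numpy as np

def peak_volume(cov_map, rows, cols, bin_width=1.0):
    """Volume under a map region: summed map values times the bin area."""
    region = np.asarray(cov_map, dtype=float)[rows, cols]
    return float(region.sum()) * bin_width ** 2

def relative_fraction(volume, other_volumes):
    """Share of one peak volume in the total of the compared peaks."""
    return volume / (volume + sum(other_volumes))

def absolute_concentration(volume, standard_volume, standard_concentration):
    """Scale by a standard of known concentration (linear response assumed)."""
    return volume / standard_volume * standard_concentration

# Invented map with a 2 x 2 peak of height 0.5 in the top-left corner.
toy_map = np.zeros((4, 4))
toy_map[:2, :2] = 0.5
v = peak_volume(toy_map, slice(0, 2), slice(0, 2))
```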

In a further aspect, the invention provides a method of sequencing a biomolecule comprising performing a method according to any preceding claim. The biomolecule is preferably selected from a protein, peptide, nucleotide, DNA, RNA, lipid or metabolite. The biomolecule is preferably subject to enzymatic digestion before performing the method.

In a further aspect, the invention provides computer software configured to perform the method described above. The software may be loaded onto a storage medium.

In a further aspect, the invention provides a hardware module, such as a microprocessor, graphics processing unit, reconfigurable computing unit or application-specific integrated circuit configured to perform the method described above. In a further aspect, the invention provides a mass spectrometry system comprising the hardware module.

Embodiments of the present invention will now be described with reference to the following drawings:

Figure 1(a) shows a CID spectrum of [VTIMPK(Ac)DIQLAR+3H]$^{3+}$, with the main fragments annotated;

Figure 1(b) shows a region in the simple 2D covariance map of the same peptide showing both true (intrinsic, shaded region 12) and false (extrinsic, shaded region 10) correlations;

Figure 1(c) shows the same region as Figure 1(b) but of the 2D partial covariance map, revealing full suppression of the false (extrinsic, shaded region 14) correlations and survival of all the true (intrinsic, shaded region 12) correlations;

Figure 1(d) shows a 3D view of the m/z 135 - m/z 610 region of the partial covariance map of [VTIMPKDIQLAR+3H]$^{3+}$ in which the overwhelming majority of the peptide fragment ion correlations are observed (in (b-d) the autocorrelation line signals have been removed for clarity);

Figures 2 to 11 show (a) partial covariance maps of fragmentation of various peptide ions and (b) scatter plots showing the relative abundance and relative significance calculated according to embodiments of the present invention for those fragment ions of those peptide ions;

Figure 12(a) shows a CID spectrum of the [EQFDDsYGHMRF(NH2)+2H]$^{2+}$ sulfopeptide showing prominent SO3 neutral loss that causes a strong suppression of the peptide sequence-specific fragments, sY=sulfotyrosine;

Figure 12(b) shows a logarithmic plot of relative significances derived from the pC2DMS map according to Eq. (3), sequence specific fragment peaks are shown as triangles, other peaks are shown as squares, the two groups of peaks are well separated;

Figure 12(c) shows a logarithmic plot of relative abundances in the 1D spectrum of the [EQFDDsYGHMRF(NH2)+2H]$^{2+}$ ion showing that most of the structure-reporting peaks are at the noise level, i.e. mixed with the square peaks. The relative significances (Fig 12b) of the structure-reporting peptide peaks are enhanced relative to their relative abundances (Fig 12c) by 2-4 orders of magnitude;

Figure 13 shows a series of histograms showing search engine scores produced according to the invention;

Figure 14 shows a spectrum produced according to the invention marked with a mass conservation line;

Figure 15 shows a spectrum produced according to the present invention used to identify a chimera spectrum;

Figure 16 shows a series of simplified 1D and 2D spectra for mixes of isomeric peptides;

Figure 17 shows a histogram comparing false positive rates for identification of b-ions by 1D MS and 2DMS according to the invention;

Figure 18 shows a pC-2DMS map for fragmentation spectrum of co-isolated and cofragmented protein parent ions cytochrome c (13+) and ubiquitin (9+) and its deconvoluted mass spectra;

Figure 19 shows an annotated pC-2DMS map for fragmentation of the deprotonated RNA ion [r(GAUCGU)-3H]$^{3-}$;

Figure 20 shows a series of mass conservation lines on a pC-2DMS map for sextuply protonated ubiquitin;

Figure 21 shows a plot tracking the progressively increased relative molar concentration of the molecule [$^{4}$GKGGKGLGKGGAKR$^{17}$](Ac2) in the acetylated form K$_{5,Ac}$K$_{16,Ac}$ in a mixture of three other isomeric acetylated forms K$_{5,Ac}$K$_{12,Ac}$, K$_{8,Ac}$K$_{12,Ac}$ and K$_{8,Ac}$K$_{16,Ac}$;

Figure 22 shows an annotated pC-2DMS map for fragmentation of the deprotonated RNA ion [r(UGAGCUGGGUUU)-5H]$^{5-}$, where the underlined bases describe the positions of 2'-O-methylated nucleosides;

Figure 23 shows a comparison of the 1D MS and pC-2DMS rates of automated assignment for [r(UGAGCUGGGUUU)-5H]$^{5-}$. Figure 23(a) shows a plot of the number of assignments per signal for the 89 1D m/z signals obtained with top 5 filtering by intensity for bins with a width of 100 Da, Figure 23(b) shows the number of assignments per pC-2DMS pair for the top 50 pC-2DMS peaks as ranked by the pC-2DMS correlation score. In both plots any number of assignments greater than one is shown as a negative number to indicate the negative impact of multiple assignments on fragment identification;

Figure 24 shows the distribution of scores for all possible variations of the base sequence UGAGCUGGGUUU with four 2'-O-methylation modifications, when matched with (a) 1D MS and (b) pC-2DMS experimental data for [r(UGAGCUGGGUUU)-5H]$^{5-}$.

Embodiments of the present invention provide methods for analysing the structure of one or more compounds by obtaining a data set containing data indicative of a physical and/or chemical property of the compound and determining a partial covariance of at least a portion of the data.

Covariance mapping mass spectroscopy was developed as an alternative tool to coincidence techniques for the study of mechanisms of radiation-induced molecular fragmentation. Whilst true coincidence measurements deterministically trace the simultaneously detected fragment ions and electrons to a single parent atom, molecule or cluster, covariance mapping exploits statistical correlations between the shot-to-shot fragment intensities to obtain the same information. This can be used in situations where multiple decompositions occur in each shot, which completely precludes the possibility of coincidence detection.

Covariance mapping rests upon calculation of the covariance function, Cov(X, Y), between the intensities at each pair of different signal channels, X and Y, over N measurements:

$$\mathrm{Cov}(X, Y) = \langle (\langle X \rangle - X)(\langle Y \rangle - Y) \rangle, \qquad (1)$$

where the angular brackets denote averaging over the N measurements, e.g. $\langle X \rangle = \frac{1}{N}\sum_{i=1}^{N} X_i$. Positive covariance means that the two intensities fluctuate together, indicating the common origin from the same parent Z, either directly: $Z \rightarrow X + Y$, or via an intermediate decomposition stage: $Z \rightarrow X + X_1$, $X_1 \rightarrow Y + Y_1$. Zero covariance indicates lack of correlation, meaning the fragments originate from unrelated decomposition processes. Interpretation of a negative covariance, although sometimes assumed to indicate the origin of the fragments from competing processes, is in fact more complicated. It is possible to display the fragment covariance functions as a two-dimensional map, where the x- and y-axes correspond to m/z ratios of the various fragments, while the covariance value may be colour-coded.

A three-dimensional analogue of covariance mapping spectroscopy, exploiting statistical correlations between three fragments, has also been developed for some applications.

Covariance mapping spectroscopy has previously been effective, for example, in unravelling the decomposition mechanisms of so-called ‘hollow atoms’ - unstable states of matter formed by intense X-ray irradiation - or in correlating photoelectron emission with fragmentation of hydrocarbons in intense infrared fields. Nevertheless, covariance mapping is often plagued by spurious correlations stemming from fluctuations in some global parameter that lead to the simultaneous increase or decrease of all fragment abundances.

In laser-induced decomposition experiments, it is most often the intensity of the laser pulse that, by exhibiting pulse-to-pulse instability, causes fragments born in completely different decomposition processes to show positive covariance simply because each such process is highly intensity-dependent. A solution to such physical situations is provided by the partial covariance (pCov) mapping technique, where the universal correlations of all fragments to a vector of measured parameters, I, are mathematically removed:

$$\mathrm{pCov}(X, Y; I) = \mathrm{Cov}(X, Y) - \mathrm{Cov}(X, I)\,\mathrm{Cov}(I^{T}, I)^{-1}\,\mathrm{Cov}(I^{T}, Y) \qquad (2)$$

where I represents a length k row vector in which each element is one of the k (k ≥ 1) fluctuating parameters and the matrix $\mathrm{Cov}(I^{T}, I)^{-1}$ is the inverse of the k × k dispersion matrix of parameters I, and X and Y each represent the spectrum intensity for each bin, where $\langle \ldots \rangle$ represents an average over the plurality of spectra.

The present inventors have found that the application of methods of covariance and partial covariance mapping may be used to deduce a structure of analysed compounds. In the embodiments described below, all synthetic peptide samples were protonated in a solution of 50% acetonitrile/2% formic acid and directly infused into the mass spectrometer.

All measurements were performed with an LTQ XL (Thermo Scientific) linear ion trap mass spectrometer, with peptide ions infused via a nano-electrospray ion source (Thermo Scientific) at a flow rate of 3-5 µl/min. The temperature of the desolvation ion transfer capillary was held constant at 200°C. The peptide ion of interest was isolated in the linear ion trap and fragmented by collision-induced dissociation at a normalised collision energy of either 20% or 35%, an activation time of 30 ms and a Mathieu q-value of 0.25.

1D spectra peak picking was performed by the vendor software, with further deisotoping and conversion to the Mascot general format (mgf) done using the open source ProteoWizard MSConvert software. The parent ion m/z was manually adjusted to mimic the performance of a high-resolution Orbitrap (RTM) mass analyser.

For the analysis according to the present invention, software written in the Python (RTM) language takes the raw data and performs all partial covariance, additional statistical and other required analysis. First, a partial covariance map of the data is calculated, using the total ion count across all m/z channels as a partial covariance parameter. Then those features on the map which may correspond to a true correlation are subjected to analysis of their statistical significance upon jackknife resampling. These features are ordered according to their calculated statistical significance, and further a priori filtering of the features according to the m/z of the parent ion is applied. Finally, this filtered set of features is converted to a peak list of individual mass-to-charge values.

Database searches were performed with a parent ion mass tolerance of 7 parts-per-million (ppm) and a fragment ion mass tolerance of 0.8 Daltons (Da). The searches were performed over the fully annotated SwissProt database; the fixed and variable modifications specified were sequence specific. No restriction was placed on the specificity of enzymatic cleavage. Mascot Server (Matrix Science) and the open source MS-Tag (Protein Prospector) database searching software were utilised.

In one embodiment, the invention provides a method of applying the partial covariance mapping technique to mass spectrometric data, producing two-dimensional mass spectra. This offers a range of advantages over traditional one-dimensional MS in the structural analysis of proteins, for example by collision-induced dissociation. The method provides an analytical application of the partial covariance mapping concept, extending the covariance mapping principle to species as large as peptides with molecular masses of the order of kDa. The method may be performed using industry standard benchtop mass spectrometry instrumentation, enabling immediate utilisation as a practical tool. This embodiment is exemplified by an analysis of a peptide that produces abundant structure-confirming fragment ions. The inventors performed ESI-MS measurements on the Histone H3 peptide VTIMPKDIQLAR, choosing its triply protonated ion [M+3H]3+ for collision-induced dissociation (CID) fragmentation.

Fig. 1a shows the conventional 1D CID mass spectrum of the [VTIMPKDIQLAR+3H]3+ ion, with abundant peaks of so-called b-type and y-type ions, comprising the N-terminus and C-terminus of the peptide, respectively, and resulting from cleavages along the peptide backbone, e.g. $[\mathrm{VTIMPKDIQLAR}+3\mathrm{H}]^{3+} \rightarrow y_8^{2+} + b_4^{+}$. A 2D covariance map can be built for this ion using Eq. (1), where the index i corresponds to one microscan of the linear ion trap. As expected, however, this map exhibits so-called spurious or extrinsic peaks corresponding to correlations between any arbitrary pair of fragments (see Fig. 1b). Without wishing to be bound by any particular theory, it is postulated that this is due to the scan-to-scan fluctuations of a number of experimental parameters affecting the abundance of every fragment ion: helium pressure in the ion trap, spray quality, ion focusing and ion trap voltages, etc.

In a standard MS experiment, none of these parameters is monitored on a scan-to-scan basis and some of them are even unknown, such that a direct application of the partial covariance formula (2) to suppress the spurious correlations seems to be impossible. Nevertheless, the invention provides a simple solution to this difficulty. The fluctuations in experimental conditions eventually lead to fluctuations in the total number of fragment ions detected at each scan (each scan comprising one microscan), and this total is well characterised in a standard MS measurement. The sum of the integrals across each m/z channel, which correlates with the total ion current of the spectrum, is therefore taken as a single fluctuating parameter, I, to be used for partial covariance mapping (see Eq. (2)). In this embodiment the total number of fragment ions detected is used as an internal standard to allow shot-to-shot normalisation of the data and thereby remove extrinsic fluctuations that would otherwise appear as strong correlations, which would in turn mask the correlations due to the fragmentation itself.
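
The calculation described above can be sketched in Python with NumPy. This is a minimal illustration only: it assumes that Eq. (2) takes the standard partial covariance form pCov(X, Y; I) = Cov(X, Y) - Cov(X, I) Cov(I, Y) / Cov(I, I), and that the spectra have already been binned into a common set of m/z channels. The function name `partial_covariance_map` and the array layout are hypothetical and are not taken from the software of the invention.

```python
import numpy as np

def partial_covariance_map(spectra):
    """Partial covariance map with the total ion count as the single parameter I.

    spectra: array of shape (n_scans, n_bins), one row per microscan
    (assumed layout; binning is done beforehand).
    """
    spectra = np.asarray(spectra, dtype=float)
    n_scans = spectra.shape[0]

    # Total ion count of each scan: the fluctuating parameter I.
    tic = spectra.sum(axis=1)

    centred = spectra - spectra.mean(axis=0)
    tic_centred = tic - tic.mean()

    cov_xy = centred.T @ centred / n_scans        # Cov(X, Y), shape (n_bins, n_bins)
    cov_xi = centred.T @ tic_centred / n_scans    # Cov(X, I), shape (n_bins,)
    var_i = tic_centred @ tic_centred / n_scans   # Cov(I, I), scalar

    # Assumed standard form of the partial covariance.
    return cov_xy - np.outer(cov_xi, cov_xi) / var_i
```

Because I is defined here as the sum over all channels, every row of the resulting map sums to zero, which gives a quick sanity check on an implementation.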

Application of the partial covariance formula (2) leads to the result shown in Fig. 1c: all the extrinsically induced (false) correlations become suppressed (shaded regions 10 in (b) become shaded regions 14 in (c)) and a series of peaks reflecting the expected connectivity between corresponding y- and b-type ions and their consecutive decomposition products (neutral losses and further backbone ruptures of the peptide) are revealed as sharp features in the partial covariance spectrum. While the complementary b-ion/y-ion correlations of the peptide are arranged along one of the two mass-conservation lines (each corresponding to a specific charge partition between y- and b-type ion series) of the peptide, consecutive decomposition fragments lead to the formation of horizontal and vertical peak series relative to each main peak, enabling a rapid assignment of both the primary and secondary fragments of the peptide. This demonstrates that the detection efficiency of the linear ion trap system is high enough to enable the application of the partial covariance technique to trapped peptide ions, while showing that the scan-to-scan change in the total ion current is a reliable measure of the significant fluctuations in numerous experimental parameters.

To confirm these conclusions, we have successfully tested the validity of the proposed partial covariance mapping on a representative sample of peptide ions including unmodified structures and peptide sequences bearing various PTMs (phosphorylation, sulphation, nitration, methylation), the data being shown in Figures 2-11.

Each of Figures 2 to 11 shows a 3D plot of a partial covariance map and an illustration of the enhancement, by multiple orders of magnitude, of signal intensity for structurally informative peaks using the partial covariance-based two-dimensional mass spectrometry of the present invention.

In each, part (a) shows a partial covariance map of the fragmentation of the relevant parent peptide molecule upon collision-induced dissociation. The plot is of the partial covariance map with total ion count as the partial covariance parameter. The m/z values of the correlated peaks are plotted along the x- and y-axes whilst the surface represents the partial covariance function values, normalised to the highest peak on the partial covariance map. The autocorrelation line, which trivially correlates each peak to itself, has been manually cut from each map along a width of 5.67 Da. The line graph plotted against the back walls of the partial covariance map is the 1D mass spectrum.

In each, part (b) shows an illustration of the enhancement of structural signals using the method of the present invention. Crosses represent relative abundances of those peptide sequence informative peaks in the 1D spectrum which were identified by the automatic database search engine. Triangles represent those peaks identified as structurally informative by the method of the invention, represented by their calculated relative significance. Diamonds represent signals that were not assigned to an expected peptide fragmentation. It should be noted that relative abundance and relative significance values are plotted on the same logarithmic scale to illustrate the relative amplification of multiple structural signals by several orders of magnitude in the data subjected to the analysis of the invention. Circles represent those peaks in the 1D spectra which could not be identified by the automatic database search engine as structurally informative sequence ions. Dashed lines connect the relative significance signals identified as structurally significant to the corresponding relative abundance signals in the 1D mass spectrum.

The example considered in relation to Fig. 1 shows the principle of the method of the invention using a peptide with abundant sequence-specific fragment ions. The method of the invention has crucial advantages over the standard one-dimensional approach, particularly where fragmentation signals are suppressed and/or their origin is poorly understood.

As a further example, consider the CID spectrum of the doubly protonated perisulfakinin sequence [EQFDDsYGHMRF(NH2)+2H]2+, which is dominated by neutral loss of sulphur trioxide, with the sequence-specific peaks of y- and b-type ions being strongly suppressed (see Figure 12a). This is an example of a peptide ion that is either assigned an incorrect structure by standard automatic analysis tools or else is assigned the correct structure with low confidence, in either case leading to a lack of identification.

The partial covariance mapping procedure according to the invention was applied to the CID spectrum of the [EQFDDsYGHMRF(NH2)+2H]2+ ion using the total fragment ion count as the single fluctuating parameter. The map can be seen at Figure 12(b), showing sequence-specific fragment ions as triangles and other spectral signals as squares. In order to make the map amenable to automatic spectrum-structure assignment, a procedure for automatic peak picking across the resulting 2D map is introduced to create the ranked lists of the correlated fragments. This involves calculating a statistical significance, S(X, Y), of each off-diagonal peak on the partial covariance map,

$$S(X, Y) = \frac{V[\mathrm{pCov}(X, Y; I)]}{\sigma(V)} \tag{3}$$

where $V$ is the volume under a partial covariance peak corresponding to statistical correlation of fragments X and Y and $\sigma(V)$ is the variance of this volume computed upon jackknife resampling.

It is noted that a similar analysis of the statistical significance of the simple covariance may be calculated according to the formula:

$$S(X, Y) = \frac{V[\mathrm{Cov}(X, Y)]}{\sigma(V)} \tag{4}$$

where $V$ is the volume under the corresponding covariance peak, reflecting statistical correlation of fragments X and Y, and $\sigma(V)$ is the variance of this volume computed upon jackknife resampling.
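
As an illustration of formula (3), the following Python sketch computes the significance of one peak by leave-one-scan-out jackknife resampling. Several choices here are assumptions made for the example rather than details given in the text: the peak is treated as a rectangular block of bins, the partial covariance is assumed to have the standard form with the total ion count as parameter, and the spread of the volume is taken as the jackknife standard error (if the variance is intended literally, the square of `sigma` would be used instead). The function names are hypothetical.

```python
import numpy as np

def peak_volume(spectra, x_bins, y_bins):
    """Sum of the partial covariance over a rectangular peak region.

    x_bins and y_bins are lists of bin indices (assumed peak footprint).
    """
    n = spectra.shape[0]
    tic = spectra.sum(axis=1)                      # total ion count per scan
    x = spectra[:, x_bins] - spectra[:, x_bins].mean(axis=0)
    y = spectra[:, y_bins] - spectra[:, y_bins].mean(axis=0)
    i = tic - tic.mean()
    cov_xy = x.T @ y / n
    cov_xi = x.T @ i / n
    cov_yi = y.T @ i / n
    pcov = cov_xy - np.outer(cov_xi, cov_yi) / (i @ i / n)
    return pcov.sum()

def jackknife_significance(spectra, x_bins, y_bins):
    """S = V / sigma(V), with sigma(V) estimated by deleting one scan at a time."""
    spectra = np.asarray(spectra, dtype=float)
    n = spectra.shape[0]
    volume = peak_volume(spectra, x_bins, y_bins)
    leave_one_out = np.array([
        peak_volume(np.delete(spectra, k, axis=0), x_bins, y_bins)
        for k in range(n)
    ])
    # Standard jackknife estimate of the standard error of the volume.
    sigma = np.sqrt((n - 1) / n * np.sum((leave_one_out - leave_one_out.mean()) ** 2))
    return volume / sigma
```

A peak whose volume is stable when any single scan is removed receives a high score, whereas a feature driven by a few outlying scans is penalised.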

The spectral correlations are ranked according to their statistical significances, and each CID fragment is assigned a relative significance: the significance of its highest-ranked spectral correlation, expressed as a percentage of the highest S(X, Y) on the 2D map. The resulting fragment ranking is directly comparable with a standard 1D data ranking, which is done according to the relative ion intensities, also known as relative abundances.
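
The conversion from ranked correlations to per-fragment relative significances might be sketched as follows. The input format, a list of `(mz_x, mz_y, S)` tuples, and the function name are assumptions made for the illustration.

```python
def relative_significance(correlations):
    """Map each fragment m/z to its relative significance in percent.

    correlations: iterable of (mz_x, mz_y, S) tuples (assumed format).
    Each fragment takes the S of its best correlation, scaled so that the
    highest S on the map corresponds to 100.
    """
    correlations = list(correlations)
    s_max = max(s for _, _, s in correlations)
    best = {}
    for mz_x, mz_y, s in correlations:
        for mz in (mz_x, mz_y):
            best[mz] = max(best.get(mz, 0.0), s)
    return {mz: 100.0 * s / s_max for mz, s in best.items()}
```

The resulting dictionary can be written out as a peak list in which relative significance takes the place of relative abundance.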

It is instructive to compare the two CID fragment rankings (see Fig. 12(b) and (c)). Indeed, the relative spectral statistical significances of the structure-confirming fragment ions are two to four orders of magnitude higher than their relative abundances. Noise-level sequence-specific peptide signals of the standard 1D spectrum are therefore shown to give rise to high-significance off-diagonal correlations in the partial covariance map provided by the invention, demonstrating its use for high-confidence peptide identification. It is noteworthy that the relative spectral statistical significances provided by the invention can be used directly by state-of-the-art automatic analysis tools instead of the relative abundances.

When such an identification is attempted for the doubly protonated perisulfakinin, the invention provides a striking result: the scoring algorithms (Mascot and MS-Tag) that misinterpreted the 1D spectrum, or interpreted it with low confidence, provide a clear high-confidence identification of the same peptide on the basis of the relative spectral statistical significances. Further investigation of the method of the invention shows that such an identification pattern is typical for peptides with challenging one-dimensional CID spectra (i.e. with low-abundance sequence-specific peaks).

One embodiment of the invention involves the use of a database to obtain sequence information about biomolecules such as proteins. This provides a search engine that matches a most probable database peptide sequence to the measured pC-2DMS spectrum. The search engine takes the list of top-ranked features in pC-2DMS map as an input and relies on protein databases for possible peptide sequences. The search engine operates according to the following algorithm:

1) Selecting a protein database expected to contain the measured sequence(s).

2) Digesting a number of protein sequences in silico according to a specified digestion selectivity [experimental parameter] to give a list of candidate peptide ions.

3) For any constructed pC-2DMS map:

a) Identify true pC-2DMS correlation features using pC-2DMS peak ranking procedure as described above.

b) For each peptide sequence in database-derived list of candidate peptide ions, take only those sequences which give a peptide ion m/z within specified tolerance of the measured fragmented peptide ion, at the charge state measured for the fragmented peptide ion. This results in the filtered list of candidate peptide ions.

c) For each ion in the filtered list of candidate peptide ions:

i) Use experimentally derived fragmentation rules to generate in silico all possible pC-2DMS correlations for the ion sequence and charge state. These include but are not limited to: complementary b-ion/y-ion correlations, complementary b-ion/y-ion correlations with the b-ion and/or y-ion having suffered one or more small molecule neutral losses of H2O or NH3, complementary b-ion/y-ion correlations with the b-ion having suffered neutral loss of CO, b-ion/internal b-type ion correlations, internal b-type ion/y-ion correlations.

ii) Calculate a similarity score between the experimentally measured peptide ion and the candidate peptide ion according to:

(1) classify all pC-2DMS correlation features as either complementary ion correlations, or correlations involving ions formed by neutral losses, or correlations involving internal ions,

(2) initialise (at 0) a similarity score for each different class of correlation, e.g. b-ion/internal b-type ion,

(3) for each set of each type of candidate peptide ion correlations generated in silico, identify those experimentally measured pC-2DMS correlations that fall within a predetermined m/z tolerance of these correlations (e.g. less than 1 Da); these are the matching correlations,

(4) for each matching correlation, increment the score for this type of correlation by either a) a set value or b) a weighted value according to the peak ranking parameter/volume/another measure of the experimental correlation,

(5) calculate the similarity score between the experimentally measured peptide sequence and each candidate peptide ion by calculating the weighted sum of each similarity score for each different type of correlation. The similarity score for each different type of correlation is afforded a different weight in this sum, which can be decided according to the results of false positive analysis, previously optimised values for optimised similarity scores for measured known peptide sequences, etc. We currently ascribe the largest weight to b-ion/internal b-type ion correlations and internal b-type ion/y-ion correlations.

iii) The candidate peptide sequence with the highest similarity score between itself and the experimentally measured peptide ion is assigned as the most likely sequence for the measured peptide ion.
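
A much reduced sketch of steps c)i) to c)iii) is given below. It is not the search engine of the invention: it covers only a doubly charged parent with singly charged fragments, generates only two classes of correlation (complementary b/y pairs, and those pairs with a loss of H2O or NH3 from one ion), omits the internal-ion classes to which the text gives the largest weight, and uses a fixed increment of 1 per matching correlation. The class weights passed in are hypothetical placeholders. Residue masses are standard monoisotopic values.

```python
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, AMMONIA, PROTON = 18.01056, 17.02655, 1.00728

def by_pairs(seq):
    """Singly charged complementary (b_i, y_(n-i)) m/z pairs for a 2+ parent."""
    masses = [RESIDUE[a] for a in seq]
    total, running, pairs = sum(masses), 0.0, []
    for m in masses[:-1]:
        running += m
        pairs.append((running + PROTON, total - running + WATER + PROTON))
    return pairs

def theoretical_correlations(seq):
    """In silico correlations, grouped by class (reduced rule set)."""
    complementary = by_pairs(seq)
    losses = []
    for b, y in complementary:
        for delta in (WATER, AMMONIA):
            losses += [(b - delta, y), (b, y - delta)]
    return {"complementary": complementary, "neutral_loss": losses}

def pair_matches(theory, measured, tol):
    """True if two correlations agree within tol, in either axis order."""
    (t1, t2), (m1, m2) = theory, measured
    return (abs(t1 - m1) <= tol and abs(t2 - m2) <= tol) or \
           (abs(t1 - m2) <= tol and abs(t2 - m1) <= tol)

def similarity(seq, measured, weights, tol=1.0):
    """Weighted sum over classes of the number of matching measured correlations."""
    score = 0.0
    for cls, theory in theoretical_correlations(seq).items():
        n_match = sum(1 for m in measured
                      if any(pair_matches(t, m, tol) for t in theory))
        score += weights[cls] * n_match
    return score

def best_candidate(candidates, measured, weights, tol=1.0):
    """Candidate sequence with the highest similarity score."""
    return max(candidates, key=lambda s: similarity(s, measured, weights, tol))
```

In use, `measured` would be the list of top-ranked pC-2DMS correlations and `candidates` the filtered list from step b). Replacing the fixed increment with a value weighted by the peak ranking parameter corresponds to option b) of step (4).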

Figure 13 shows a number of histograms of matching sequences as a function of the pC-2DMS sequence score obtained according to the algorithm described above. Histograms (a), (b) and (c) represent instances of doubly and triply charged peptides, both unmodified and modified. In histogram (d), the same peptide as in histogram (c) is shown but in a 50:50 mixture with its reverse sequence isomer, which, because it is unnatural, is not in the searched database. In all cases, the correct sequence obtains the top score, allowing correct identification. However, the 1D Mascot search engine fails to correctly identify the naturally occurring isomer in the case of the mixture (d).

The algorithm also takes into account a series of further calculations and features which have been devised to further enhance the capabilities of the 2DMS system. These are described below:

Classification of ion pair correlations according to their positions on pC-2DMS map

Ion-ion correlations may be classified on the basis of the position of the corresponding feature on the pC-2DMS map. This feature is demonstrated by reference to Figure 14.

Figure 14 shows a pC-2DMS map of the [GSNKAIIGLM+2H]2+ peptide ion with regions corresponding to three different classes of ion correlations shaded. The dotted diagonal line represents a mass conservation line where complementary b-y pairs are found. The lightly shaded area immediately beneath the mass conservation line is where small molecule neutral losses would be observed. The larger area further beneath that line is where correlations involving internal ions appear. The area above the mass conservation line is the mass conservation violation region, where no true correlations are found.

All correlations which fall on the mass conservation line ($z_y \cdot (m/z)_y + z_x \cdot (m/z)_x = z_p \cdot (m/z)_p$, where $z_p \cdot (m/z)_p$ is the mass of the parent ion) correlate two complementary ions, such that these complementary pairs can be immediately and reliably identified and extracted from the pC-2DMS map. Equally, all correlations falling on the lines $z_y \cdot (m/z)_y + z_x \cdot (m/z)_x + 17 = z_p \cdot (m/z)_p$, $z_y \cdot (m/z)_y + z_x \cdot (m/z)_x + 18 = z_p \cdot (m/z)_p$ or $z_y \cdot (m/z)_y + z_x \cdot (m/z)_x + 28 = z_p \cdot (m/z)_p$ can be identified as a complementary pair where one of the complementary ions has suffered a canonical small molecule neutral loss of ammonia, water or carbon monoxide, respectively.

Meanwhile, correlations falling within the region of the map described by $(m/z)_y + (m/z)_x + \Delta = (m/z)_p$, where $\Delta$ is a (loosely defined) number larger than the masses of the canonical small molecule neutral losses, correlate two fragment ions whose masses sum to a significantly lower total mass than the original parent ion. The important internal ion-terminal ion correlations fall within this region, as do the much less common terminal ion-terminal ion correlations where one or both of the terminal ions has undergone a second backbone cleavage to eliminate one or more amino acids, along with correlations where fragment ions have suffered a larger mass neutral loss.

No correlations fall in the mass conservation violation region because they would correlate two m/z values which, multiplied by the corresponding charges, sum to give a mass greater than the mass of the fragmented parent ion. Such a correlation would violate the conservation of mass.

The ability to classify different correlations (complementary ion pairs, internal ion correlations, etc.) allows for greater specificity in matching the candidate peptide sequences to the measured pC-2DMS map through its use in the optimisation of the weights that each class of correlations is given in the matching procedure described above.
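
The position-based classification can be illustrated with a short Python function. The function name, the default tolerance and in particular the `small_loss_max` boundary between the neutral loss region and the internal ion region are assumptions for the example; the text describes that boundary only as loosely defined.

```python
def classify_correlation(mz_x, z_x, mz_y, z_y, parent_total,
                         tol=1.0, small_loss_max=40.0):
    """Classify a correlation by its position relative to the mass conservation line.

    parent_total is z_p * (m/z)_p for the fragmented parent ion.
    tol and small_loss_max (in Da) are hypothetical illustration values.
    """
    # Mass missing from the pair, relative to the parent.
    deficit = parent_total - (z_x * mz_x + z_y * mz_y)
    if deficit < -tol:
        return "mass_conservation_violation"
    if abs(deficit) <= tol:
        return "complementary"
    if deficit <= small_loss_max:
        return "small_neutral_loss"
    return "internal_or_large_loss"
```

The class label returned for each measured feature can then select which class weight applies in the matching procedure.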

Identification and localisation of characteristic neutral losses

The present invention provides experimental verification of both the mass of neutral/charged losses from measured ions and the fragment from which the neutral/charged loss has occurred. The method for doing so is as follows.

- For every feature on a pC-2DMS map correlating two m/z values, $(m/z)_1$ and $(m/z)_2$, identify all other features on the pC-2DMS map correlating any other m/z with either $(m/z)_1$ or $(m/z)_2$, up to a given m/z tolerance. For the purpose of this disclosure, it is assumed that for a feature correlating $(m/z)_1$ and $(m/z)_2$ there is another correlation on the pC-2DMS map correlating $(m/z)_1$ with a new m/z, $(m/z)_3$.

- The feature correlating $(m/z)_1$ with $(m/z)_3$ indicates that both $(m/z)_2$ and $(m/z)_3$ correlate with the same $(m/z)_1$. Depending on the charge states of the fragment ions at $(m/z)_2$ and $(m/z)_3$ and their relative m/z values, this means that either the ion at $(m/z)_2$ is the ion at $(m/z)_3$ having suffered a neutral/charged loss or the ion at $(m/z)_3$ is the ion at $(m/z)_2$ having suffered a neutral/charged loss.

For the simplest case of a 2+ parent ion (where only neutral loss can be measured, because charged loss from a correlated fragment ion would render it an undetectable neutral), this will indicate that the ion at the smaller m/z of $(m/z)_2$ and $(m/z)_3$ is the ion at the larger m/z having suffered a neutral loss of $|(m/z)_2 - (m/z)_3|$ Da.

This embodiment of the invention provides software which automatically performs this identification having loaded a feature list generated from the automatic resampling analysis of features on a pC-2DMS map. This feature enables, amongst other applications, the a priori identification (and localisation) of post-translational modifications which induce a characteristic neutral loss under MS/MS fragmentation (e.g. neutral loss of 98 Da from phosphothreonine), as well as identification of particular fragment ion types according to a characteristic neutral loss under MS/MS (e.g. 28 Da as an indicator of b-ions).
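
For the simplest case of a 2+ parent ion, the neutral loss search over a feature list can be sketched as below. The input format (a list of correlated m/z pairs), the function name and the default tolerance are assumptions for the illustration and are not taken from the software referred to above.

```python
from itertools import combinations

def find_neutral_losses(features, tol=0.5):
    """Find pairs of features that share one m/z and differ in the other.

    features: list of (mz_a, mz_b) correlations (assumed format).
    Returns tuples (shared_mz, precursor_mz, product_mz, loss_in_Da), where
    the product is the ion at the smaller of the two non-shared m/z values.
    """
    results = []
    for (a1, b1), (a2, b2) in combinations(features, 2):
        for shared, other1 in ((a1, b1), (b1, a1)):
            for partner, other2 in ((a2, b2), (b2, a2)):
                same_partner = abs(shared - partner) <= tol
                distinct_others = abs(other1 - other2) > tol
                if same_partner and distinct_others:
                    high, low = max(other1, other2), min(other1, other2)
                    results.append((shared, high, low, high - low))
    return results
```

Filtering the returned losses for characteristic values (for example 98 Da or 28 Da, as mentioned above) then localises the loss to the fragment at `precursor_mz`.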

Identification of chimera spectra

Chimera spectra exist where more than one parent ion is subject to fragmentation in the same spectrum. Where the multiple parent ions are structural isomers, it is extremely challenging to recognise this in the analysis of traditional 1D mass spectra. However, according to methods of the present invention, the reliable identification of correlated complementary ion pairs, thanks to the position of the relevant correlations on the pC-2DMS map (along the mass conservation line), provides a robust diagnostic for the identification of chimera spectra. For example, under the conditions of low-energy CID, the complementary ions which are produced are of b-type (N-terminus) and y-type (C-terminus), corresponding to cleavage of the peptide bond along the peptide backbone. Each successive b-ion or y-ion from the cleavage of a particular sequence is separated in mass by the mass of the next amino acid residue. At 57 Da, glycine is the simplest amino acid (with the R-group being a single H) and therefore has the smallest possible mass of any amino acid. Therefore, in the pC-2DMS map of a doubly charged peptide ion, the presence of three or more complementary ions appearing on the mass conservation diagonal within 57 Da indicates (at least) two consecutive b-ions or y-ions which are within less than 57 Da of each other, meaning that not all complementary ions can come from the same sequence. The spectrum is therefore identified as a chimera spectrum.

This feature is illustrated in Figure 15. Mixtures were measured of the palindromic peptide ions [GSNKGAIIGLM+2H]2+ (P1) and [MLGIIAGKNSG+2H]2+ (P2), each present at different molar ratios, to simulate the situation where two fully isobaric isomeric parent ions are co-isolated. It is not possible to separate the two different co-isolated ions even at arbitrarily high mass resolution. The table in panel (a) demonstrates that the 2D chimera diagnostic is able to identify the mixed spectra as chimera from the 2D fragment ion pC-2DMS map alone, even when the isomeric ion [P1+2H]2+ is at one five-hundredth the relative molar concentration of its counterpart. Panel (b) illustrates the successful identification of 2 chimera flags in the 1:499 mix. Analysis of the two sets of three ions within 57 Da of each other reveals the two chimera flags to consist of the m/z values {276.7 (y3+ of P2), 301.8 (b3+ of P2), 319.7 (y3+ of P1)} and {741.2 (b8+ of P1), 759.4 (y8+ of P2), 784.1 (b8+ of P2)}.
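
The chimera diagnostic for a doubly charged parent ion can be sketched as a sliding window over the m/z values of the ions found on the mass conservation diagonal. The function name is hypothetical, and measurement tolerance is ignored for simplicity; a practical implementation would presumably narrow the window by the m/z uncertainty.

```python
GLYCINE_RESIDUE = 57.02146  # smallest amino acid residue mass, in Da

def chimera_flags(complementary_mz, window=GLYCINE_RESIDUE):
    """Return triples of complementary-ion m/z values spanning less than `window`.

    complementary_mz: m/z values of singly charged ions taken from correlations
    on the mass conservation diagonal of a 2+ parent (assumed input).
    A single sequence can place at most one b-ion and one y-ion in such a
    window, so any triple flags a chimera spectrum.
    """
    mz = sorted(complementary_mz)
    return [
        (mz[i], mz[i + 1], mz[i + 2])
        for i in range(len(mz) - 2)
        if mz[i + 2] - mz[i] < window
    ]
```

Applied to the six m/z values quoted for the 1:499 mixture, this returns the two triples reported for Figure 15(b).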

The percentage of chimera spectra which have three or more complementary ion pairs within 57 Da of each other was also determined. Numerical simulations were performed to estimate the robustness of the 2D chimera diagnostic of the invention. All protein sequences in the UniProt/Swiss-Prot database (http://www.uniprot.org) were subjected to a tryptic digest in silico (no missed cleavages). 100,000 of these peptide sequences (of length between 6 and 14 residues) were selected at random and, for each randomly selected peptide sequence, all unique sequences which gave a doubly charged peptide ion within 1 Da of the parent ion were identified. This returned a list of 21,747,126 sequences whose doubly charged ions may be co-isolated given a typical isolation window width of 1 Da.
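
The in silico digest step of this simulation might look like the following sketch. The cleavage rule used (after K or R, but not before P) is the common convention for trypsin and is an assumption here, since the text specifies only a tryptic digest with no missed cleavages. The function name is hypothetical.

```python
import re

def tryptic_digest(protein, min_len=6, max_len=14):
    """Tryptic peptides with no missed cleavages, filtered by length.

    Assumed rule: cleave C-terminal to K or R unless the next residue is P.
    """
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    return [p for p in pieces if min_len <= len(p) <= max_len]
```

For example, `tryptic_digest("MKTAYIAKQRPLSR")` yields `["TAYIAK", "QRPLSR"]`: the dipeptide MK is discarded by the length filter and the R-P bond is not cleaved.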

Analysis of a representative set of experimentally measured doubly charged peptide ions indicated that just over 75% of all possible complementary ion correlations appear in the top 50 significance-ranked features for doubly charged ions. Using this assumption to randomly define whether each complementary b-ion/y-ion pair is measured, our numerical simulations indicate that 94.3% of all pairs of possible co-isolated doubly charged peptide ions produce at least one chimera flag, with the mean average number of flags at 6.58. The simulated complementary ion spectra for every one of the 100,000 pure peptides (for which we assumed 100% detection efficiency for all possible complementary b-ion/y-ion pairs) produced zero chimera flags, thereby proving the reliability of the identification of chimera spectra provided by the invention.

Methods of the invention extend this principle to triply- and higher-charged peptide ions. For example, occurrence of five complementary ion correlations within the range of 28.5 Da on a mass conservation diagonal on the pC-2DMS map of a triply charged peptide ion indicates a chimera spectrum. This condition can be relaxed if the charges of the correlated ions are known, for example through analysis of their small molecule neutral losses (e.g., loss of water would be 18 Da for a singly charged ion, but 9 Da for a doubly charged one). In the case of successful charge state identification, occurrence of three correlations involving singly charged ions within the 57 Da range on a mass conservation diagonal or occurrence of three doubly charged ions within the 28.5 Da range on the mass conservation diagonal would mean chimera spectra.

Identification of the same complementary pair with converse charge distribution on the same mass conservation diagonal also allows for confirmation of the charge state of the correlated ions. This is necessary to determine the mass of a molecular ion from its measured m/z value, to achieve higher interpretation rates of MS/MS spectra and for more specific structure-to-spectrum matching. Traditionally, this has required either the unreliable identification of multiple spectral signals corresponding to the same molecule at different charge states, or a well-resolved isotopic envelope that exploits the natural abundance of the 13C isotope (~1.1%) to determine the charge of a fragment ion by measuring the m/z difference between two molecules separated in mass by 1 atomic mass unit. Obtaining a well-resolved isotopic envelope can be challenging, especially for lower mass resolution instruments and/or for highly charged fragment ions.

The arrangement of pC-2DMS signals along mass conservation lines is a result of the physical law of mass conservation, as described above - if one molecule breaks into two smaller molecules, the masses of these molecules sum to the mass of the whole molecule. As well as the primary mass conservation diagonals corresponding to the mass of the entire molecule under analysis, the appearance of secondary mass conservation diagonals, whose signals correlate two secondary fragment ions that sum to the mass of a primary molecular fragment, is also observed.

The fundamental formula governing the location of the mass conservation diagonals is that the masses of both correlated ions, ion x and ion y, must sum to give the mass of the parent ion:

$$m_x + m_y = M$$

where $M$ is the mass of the parent ion in question, which can either be the full molecule under analysis or a fragment of this full molecule which has subsequently undergone secondary fragmentation. The observed quantities in a mass spectrum are not molecular masses but mass-to-charge ratios, m/z. As such, the law governing the position of the mass conservation lines can be written in terms of m/z:

$$z_x \cdot \frac{m_x}{z_x} + z_y \cdot \frac{m_y}{z_y} = M$$

This equation shows that for a mass conservation diagonal on a pC-2DMS map of $m_x/z_x$ vs. $m_y/z_y$, the gradient of the mass conservation diagonal along which a set of fragment ions, ions x (those with m/z value read from the x-axis), and another set of fragment ions, ions y (with m/z value read from the y-axis), are correlated will be equal to $-z_x/z_y$. It is therefore possible to infer the charge state of the two ions correlated along a mass conservation diagonal from the gradient of the diagonal. This can be achieved by, for example, comparing the difference between two or more $m_x/z_x$ values of correlations with the difference between two or more $m_y/z_y$ values of correlations to derive the ratio of $z_x$ to $z_y$. Information relating to the charge state of the overall parent ion, for example that obtained using known techniques involving the accurate mass of the overall parent ion, may also be utilised.
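
As an illustration of reading charge states from the gradient, the sketch below fits a straight line to the correlations lying on one diagonal and converts the gradient to a ratio of small integers. The function name and the `max_charge` bound are assumptions. Only the ratio in lowest terms is recovered (2:1 cannot be told apart from 4:2 by the gradient alone), so the parent ion charge would be needed to fix the absolute values.

```python
from fractions import Fraction

import numpy as np

def charge_ratio_from_diagonal(points, max_charge=6):
    """Estimate z_x : z_y from correlations lying on one mass conservation diagonal.

    points: list of (m_x/z_x, m_y/z_y) pairs (assumed input).
    Returns (z_x, z_y, gradient), with z_x : z_y in lowest terms.
    """
    x, y = np.asarray(points, dtype=float).T
    gradient, _intercept = np.polyfit(x, y, 1)   # least-squares straight line
    # The gradient equals -z_x / z_y; round it to a small-integer ratio.
    ratio = Fraction(float(-gradient)).limit_denominator(max_charge)
    return ratio.numerator, ratio.denominator, float(gradient)
```

For three points with x-values 300, 350, 400 and y-values 900, 800, 700, the gradient is -2, indicating that the ions read from the x-axis carry twice the charge of those read from the y-axis.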

The analysis may be automated, for example by using a Python (RTM) script. As well as identifying mass conservation diagonals from the break-up of the full molecule under analysis, which can be defined a priori given the measured mass and charge state of the full ion, the script is also able to identify mass conservation diagonals resulting from the secondary fragmentation of molecular fragments using a Hough transform routine.

Figure 20 illustrates the success of this analysis for sextuply protonated ubiquitin. The scatter plot shows the m/z values of the top 50 pC-2DMS correlation score-ordered signals, for which 72% have had their charge state identified from the mass conservation diagonals along which they fall (charge state of the correlated ion with m/z value read off each of the two axes is indicated in the legend). Further to the mass conservation diagonals originating from the break-up of the original structure (labelled ‘pre-defined’), the Hough transform has also identified a mass conservation diagonal corresponding to a CO loss from the mass of the full molecule. As well as allowing the charge state of the correlated ions on this mass conservation diagonal to be identified, the Hough transform also identifies the common origin of these signals.

Note that the symmetry of the pC-2DMS maps about $x = y$ means that the same signal is plotted twice on the scatter plot, where for each duplicate the m/z values of the two correlated ions are read off different axes. As a result, the same pC-2DMS signals fall along each of two mass conservation diagonals, where the gradient of one is the inverse of the other because, for the duplicate of each signal, the x and y subscripts in the expression for the gradient ($-z_x/z_y$) are exchanged. The signals falling on the mass conservation diagonals are starred, but only for one of the two duplicates, namely when $z_x > z_y$.

Resolution of structural isomers

Methods of the invention also provide for the resolution of structural isomers. In 1D MS, the resolution of structural isomers is highly challenging and in some cases fundamentally impossible owing to there being no possible reporting 1D fragment ions for distinguishing one structural isomer from another. The present invention solves this problem by producing isomer-specific marker ion pairs.

By way of example, pC-2DMS spectra were measured for 4 different isomers of the naturally occurring diacetylated histone H4 peptide [4GKGGKGLGKGGAKR17](Ac)2, which contains combinatorial PTMs (lysine acetylation), and for their mixtures. Further details are shown in Figure 16, which shows simplified spectra of 1D MS vs 2D MS as applied to those mixtures. The mixtures of 2 and 4 isomeric diacetylated peptides cannot be distinguished using standard 1D MS, as the corresponding 1D spectra are practically identical (top). Using the marker correlations between the internal and the terminal (b-type) ions, pC-2DMS provides unambiguous differentiation between the two mixtures and readily determines which isomers are present in each case (bottom).

Therefore, whilst 1D MS is unable to distinguish, for example, between a mixture of two and a mixture of four such modified histone sequences (see the top panel of Figure 16), pC-2DMS, by revealing the connectivity of the fragment ions, is readily able to resolve and identify each of the two mixtures unambiguously (see the bottom panel of Figure 16). Given all potential modification states of a structural isomer, embodiments of the invention are able to automatically identify all unique marker ion correlations for each modification state and perform automated analysis accordingly.

Quantitative analysis of mass spectra

Under certain assumptions about the fluctuation statistics (for example, assuming they obey the Poisson distribution), the volume of islands on a covariance or partial covariance map is directly proportional to the probability of the fragmentation reaction they correspond to, which is in turn directly proportional to the number of the corresponding parent ions fragmented in the trap. Therefore, the invention can be used for the quantitative analysis of samples. An embodiment of this is measuring the relative concentration of a sample molecule in a mixture with one or more other sample molecules, by comparing a ratio of a measure of the covariance or partial covariance peaks due to a particular component in a sample with a measure of the covariance or partial covariance peaks due to another component in that sample. Absolute quantitation can also be performed if the absolute concentration of one of the components in the sample is established previously.

The relative quantification of a component in a sample is demonstrated in Figure 21 by tracking the progressively increased relative molar concentration of the molecule [4GKGGKGLGKGGAKR17](Ac)2 in the acetylated form K5,Ac K16,Ac in a mixture with three other isomeric acetylated forms: K5,Ac K12,Ac, K8,Ac K12,Ac and K8,Ac K16,Ac. 'Quantity measured' is the ratio between the volume of the highest scoring partial covariance feature unique to the K5,Ac K16,Ac peptide (y5,Ac^+ & b10) and the volume of a partial covariance feature that cannot be produced by the K5,Ac K16,Ac peptide (y5^2+ & b9,2Ac^+, y5^+ & [b9,2Ac - NH3]^2+ or y8^+ & [b9,2Ac - C7H11NO - CO]^+), and which can therefore be taken as a reference. 'Quantity added' represents the actual relative increase in the concentration of peptide K5,Ac K16,Ac. The dashed line represents the best linear fit. These results demonstrate that the volume of the partial covariance peaks provides a highly accurate measure of the quantity of the measured isomer. The error bars show the dependence of the quantification on the particular choice of the reference correlation, demonstrating consistently accurate results for each of the three chosen references.
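The ratio-based quantification described above may be sketched as follows. This is a minimal illustration only: the function names `volume_ratio` and `linear_fit` are hypothetical, and the volumes are made-up numbers in arbitrary units, not values read from Figure 21.

```python
from statistics import mean

def volume_ratio(target_volume: float, reference_volume: float) -> float:
    """Ratio of the volume of a feature unique to the target component
    to the volume of a reference feature from the other components."""
    if reference_volume <= 0:
        raise ValueError("reference volume must be positive")
    return target_volume / reference_volume

def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical island volumes (arbitrary units), for illustration only.
quantity_added = [1.0, 2.0, 3.0, 4.0]
target_volumes = [10.0, 20.0, 30.0, 40.0]
reference_volumes = [5.0, 5.0, 5.0, 5.0]

quantity_measured = [
    volume_ratio(t, r) for t, r in zip(target_volumes, reference_volumes)
]
slope, intercept = linear_fit(quantity_added, quantity_measured)
```

Repeating the fit with each available reference feature gives a spread of slopes, which is one way the dependence on the choice of reference could be examined.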

Resolution of spectra of parent ions of different masses

In embodiments of the invention there is provided the ability to use the pC-2DMS map to identify complementary ion pairs for a peptide of mass M lying along the mass conservation line. In some embodiments, this provides the ability to resolve the mass spectra of several parent ions of different masses fragmented simultaneously. In the simplest case, for n doubly-charged parent ions of different masses M1, M2, ..., Mn, the complementary pairs coming from each peptide ion of mass Mi will lie on their own mass conservation line described by y + x = Mi. This leads to the possibility of fragmenting multiple peptide ions of different masses simultaneously and being able to unambiguously resolve the key structure-informative ions of each peptide, increasing throughput and available data from each biological sample. This embodiment of the invention may also further deconvolve internal ion correlations, etc., according to their position on the map relative to correlations which have already been assigned to a particular sequence.
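For the simplest case above, sorting correlation features onto their mass conservation lines may be sketched as follows. The function name `assign_to_parent`, the tolerance and all numerical values are assumptions made for illustration; the line constants Mi are taken as already known.

```python
def assign_to_parent(pairs, parent_sums, tol=0.8):
    """Assign each (x, y) m/z pair to the parent whose line x + y = M_i
    it lies on, within an assumed tolerance `tol`."""
    assigned = {m: [] for m in parent_sums}
    unassigned = []
    for x, y in pairs:
        # Nearest mass conservation line for this pair.
        best = min(parent_sums, key=lambda m: abs(x + y - m))
        if abs(x + y - best) <= tol:
            assigned[best].append((x, y))
        else:
            unassigned.append((x, y))
    return assigned, unassigned

# Hypothetical line constants and correlation features.
parent_sums = [1000.0, 1200.0]
pairs = [(300.2, 699.9), (450.0, 750.1), (500.0, 500.3), (400.0, 650.0)]
assigned, unassigned = assign_to_parent(pairs, parent_sums)
```

Pairs left unassigned are candidates for internal ion or loss correlations, which may then be placed relative to the pairs already assigned.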

Reduction in false positive rate in database searching for matching of pC-2DMS correlations vs matching of 1D MS fragment ion m/z values

In order to determine the relative success of the methods of the invention versus those using 1D MS, numerical simulations of database searches were performed to demonstrate the reduction in the rate of false positive matches between experimental data and database sequences for pC-2DMS vs 1D MS/MS. The false positive rate (FPR) for matching an experimentally measured 1D fragment ion m/z with a 1D fragment ion m/z generated in silico from a database sequence was compared with the FPR for matching an experimentally measured 2D fragment ion correlation with a 2D fragment ion correlation generated in silico from a database sequence.
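The difference between the two kinds of matching may be illustrated with a toy sketch. This is not the simulation reported here: the function names `match_1d` and `match_2d` are hypothetical, the m/z values are invented, and the candidate values stand for fragments generated from an incorrect database sequence, so that every match counts as a false positive.

```python
def match_1d(measured, candidates, tol):
    """Fraction of candidate fragment m/z values lying within `tol`
    of any measured fragment m/z."""
    hits = sum(any(abs(c - m) <= tol for m in measured) for c in candidates)
    return hits / len(candidates)

def match_2d(measured_pairs, candidate_pairs, tol):
    """Fraction of candidate correlations for which BOTH m/z values
    match the two m/z values of one measured correlation."""
    def same(p, q):
        return abs(p[0] - q[0]) <= tol and abs(p[1] - q[1]) <= tol
    hits = sum(
        any(same(c, m) or same(c, m[::-1]) for m in measured_pairs)
        for c in candidate_pairs
    )
    return hits / len(candidate_pairs)

# Invented measured correlations and wrong-sequence candidate correlations.
measured_pairs = [(200.1, 500.3), (350.2, 410.0)]
candidate_pairs = [(200.5, 410.4), (350.0, 620.0), (129.1, 500.0)]

measured_mz = [mz for pair in measured_pairs for mz in pair]
candidate_mz = [mz for pair in candidate_pairs for mz in pair]

rate_1d = match_1d(measured_mz, candidate_mz, tol=0.8)
rate_2d = match_2d(measured_pairs, candidate_pairs, tol=0.8)
```

In this toy data four of the six candidate m/z values match a measured m/z by chance, yet no candidate pair matches a measured pair, because an accidental 2D match requires both partners to coincide at once.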

Figure 17 shows the false positive rate (FPR) for identification of b-ions by 1D MS (left hand bars, crosshatch), internal ions by 1D MS (central bars, filled with small circles) and y-ions by 1D MS (right hand bars, slanted hatch) at mass tolerances of 0.8 Da (typical for an ion trap mass analyser), 0.05 Da (typical for a time-of-flight mass analyser), 0.02 Da (typical for an Orbitrap (RTM) mass analyser) and 0 Da, corresponding to infinite mass resolution. FPR is averaged over fragment lengths from 2 to 15 amino acids. The dashed line shows the averaged FPR for correlations of internal ions with b-ions and y-ions in pC-2DMS at the typical ion trap m/z tolerance of 0.8 Da.

The pC-2DMS fragment ion matching FPR for 2D b-ion/internal ion correlations and 2D internal ion/y-ion correlations at a fragment ion tolerance of 0.8 Da remains over an order of magnitude lower than the 1D b-ion, 1D internal ion and 1D y-ion FPR as the 1D fragment ion matching m/z tolerance is decreased to 0.02 Da. Remarkably, even at an infinitely tight fragment ion matching tolerance (0 Da), meaning only fully isomeric fragment ions are matched, the FPR for pC-2DMS correlations at the fragment ion tolerance of 0.8 Da is shown to be almost an order of magnitude lower than for the 1D fragment ions. This demonstrates the remarkable specificity of pC-2DMS correlations, and the potential to provide an unprecedentedly specific 2D spectral fingerprint even at moderate mass accuracy. FPR is averaged over all fragment ion lengths between 2 residues and 10 residues for these calculations. It should be noted that the estimated FPR for the matching of complementary b-ion/y-ion correlations at a mass tolerance of 0.8 Da is only slightly lower than the corresponding 1D b-ion and y-ion fragment ion m/z matching FPR (~0.085). This is a result of the pre-selection of candidate database peptide sequences according to parent ion m/z, which means all tested database peptide sequences are very close in mass. As a result, when a 1D b-ion or y-ion m/z of the analysed peptide matches the b-ion or y-ion of a database sequence, the complementary b-ion or y-ion of the analysed peptide will in almost all cases be close enough in m/z to the corresponding complementary ion of the database sequence to give a complementary b-ion/y-ion correlation match.

In Silico Deconvolution of Mixtures of Intact Proteins by pC-2DMS.

Biological samples are typically complex mixtures of more than one protein, and separation of these mixtures prior to top-down 1D MS/MS analysis is essential to avoid the insurmountably difficult task of identifying proteins from the overlapping 1D fragment ion signals resulting from the simultaneous decomposition of several protein molecules. Liquid chromatography (LC) is the preferred method for separation of complex peptide mixtures because it is straightforwardly automated and is able to couple directly to a mass spectrometer for online analysis. Reversed-phase liquid chromatography (RP-LC) is a common technique in the separation of mixtures of peptide molecules (37).

However, different realisations of RP-LC present various difficulties for top-down analysis, and the technique typically suffers from limited separation capabilities and poor protein recovery for intact proteins, although the use of shorter alkyl chains for the stationary phase can improve recovery rates (38, 39). Hydrophilic interaction liquid chromatography (HILIC) has been directly coupled to mass spectrometry systems (40, 41) and is able to separate some protein mixtures, but the technique is unsuitable for molecules which are not easily soluble in high concentrations of organic solvent. The direct coupling of capillary electrophoresis (CE) to mass spectrometric systems (42) can also be used to separate mixtures of intact proteins, but it experiences relatively poor reproducibility, its generality to the separation of intact proteins has not been demonstrated and its sample loading volume is highly limited, reducing sensitivity and dynamic range (38). pC-2DMS allows for the unprecedented in silico separation of protein mixtures which have been co-isolated and co-fragmented, without the costly, wasteful and challenging process of upstream separation.

As described above, in a pC-2DMS map according to the present invention, complementary ions produced by the fragmentation of parent molecules of different mass and/or charge state fall along uniquely defined mass conservation lines. The separation of overlapping fragment ions directly from the pC-2DMS map therefore requires the identification of the different mass conservation lines present. As described above, this may be performed by use of a Hough transform.
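One way a Hough-type vote over mass conservation lines could be set up is sketched below. This is an assumption made for illustration and not necessarily the transform used in the embodiments: the function name `hough_parent_mz`, the bin width, the search window and the synthetic features are all hypothetical. It assumes that the two fragments share the whole parent charge (z1 + z2 = Z) with no neutral loss, in which case z1·x + z2·y = Z·P, where P is the parent ion m/z.

```python
from collections import Counter

def hough_parent_mz(features, charge, rough_mz, window=2.0, bin_width=0.1):
    """Vote for the parent m/z implied by each feature under every
    possible split of the parent charge between the two fragments."""
    votes = Counter()
    for x, y in features:
        for z1 in range(1, charge):
            z2 = charge - z1
            implied = (z1 * x + z2 * y) / charge
            # Keep only votes near the roughly known precursor m/z.
            if abs(implied - rough_mz) <= window:
                votes[round(implied / bin_width)] += 1
    if not votes:
        return None, 0
    best_bin, count = votes.most_common(1)[0]
    return best_bin * bin_width, count

# Synthetic features for a hypothetical 3+ parent ion at m/z 500.0,
# plus one unrelated feature that should attract no votes.
features = [(300.0, 600.0), (700.0, 400.0), (400.0, 700.0), (100.0, 150.0)]
estimate, n_votes = hough_parent_mz(features, charge=3, rough_mz=500.3)
```

Running the vote once per candidate parent ion and charge state gives one set of lines per parent, and the features voting for the winning bin form the deconvolved set for that parent.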

Figure 18 demonstrates the in silico separation of the two co-isolated and co-fragmented intact protein ions, cytochrome c (13+) and ubiquitin (9+). Plotted are the top 200 pC-2DMS correlation score-ranked features, which have been passed to the Hough transform along with the roughly determined parent ion m/z values as measured in the precursor scan in the linear ion trap. The Hough transform has identified two sets of mass conservation lines, corresponding to parent ions of average mass 8572.7 Da* and charge state 9+ (blue) and average mass 12368.4 Da† and charge state 13+ (red). The zoomed-in view of the horizontal 1D MS/MS spectrum illustrates the deconvolution and charge state identification performance of the Hough transform in this particularly congested region of the spectrum. Each set of correlation features lying along the two different sets of mass conservation lines has been individually passed to the pC-2DMS search engine, along with the parent mass and charge state as determined by the Hough transform. As illustrated by the inset histograms in Figure 18, the pC-2DMS search engine unambiguously identifies each of the two mixed proteins from the two sets of deconvolved pC-2DMS features.

Application to oligonucleotide structural analysis

As noted above, oligonucleotides are constructed from a more limited selection of monomers than peptides, with only 4 different fundamental bases linked by symmetric phosphodiester bonds. This increases the chance of generating isobaric fragments that conventional MS cannot differentiate. It is also common for many fragmentation pathways, such as secondary fragmentation or the loss of one or more nucleobases, to be considered uninformative in conventional MS analysis. Embodiments of the invention address these challenges in the following ways.

Isobaric fragments may be identified and distinguished through correlation with their sibling fragment, which is assigned as the other half of their correlated fragment pair. This eliminates matches with other candidate fragments of the same mass.

For the case of nucleobase loss (whether neutral or charged), the fragment which has lost the nucleobase can be assigned by pC-2DMS, because the correlation with the fragment representing the rest of the molecular ion confirms the nature and, in many cases, the location of the loss.

Furthermore, secondary backbone cleavage cases are identified and assigned when the total mass of a correlated pair does not add up to the parent ion mass and the possibility of the fragmentation being explained by any common loss events has been eliminated.

Examples of each of these are included in the annotated pC-2DMS map of the deprotonated RNA ion [r(GAUCGU)-3H]^3- shown in Figure 19, each example used here being underlined on the drawing. The method is also effective in analysing DNA ions.

In Figure 19, the pair of correlated m/z values 537.82 & 599.07 is assigned as [c2 - A]^- & y4^2-. The value of 537.82 could have matched [x2 - U], but since 599.07 is assigned as y4^2-, 537.82 must correspond to [c2 - A]^-, because this is the complementary fragment to y4^2-, which accounts for the correlation between the two values.

The pair of correlated m/z values 601.55 & 516.75 is assigned as a4^2- & [w2 - G]^-. [w2 - G]^- is not likely to have been assigned from the conventional MS spectrum, because the feature is very small and there would have been no way of knowing that a guanine (G) base had been lost from a w2^- type fragment. Here the loss can be deduced because w2 is complementary to a4^2-, which is assigned as the other part of the correlated pair.

In the pair c 2 & X3 2 a secondary correlation has been identified between two terminal fragments and in the pairs c^ ) & y 2 and c 3 & Ci (4 > secondary correlations between a terminal and an internal fragment have been identified. In each of these cases we identify that a secondary fragmentation must have occurred since the parent mass is not conserved in the total mass of the correlated fragments, so the parent ion could not have dissociated to form these fragments in a single fragmentation event.
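The mass bookkeeping behind these assignments may be sketched as follows. The function names and the tolerance are hypothetical, and the parent mass of 1874.28 Da is an approximate monoisotopic value computed for this sketch from standard nucleotide masses; it is not a value stated in this specification. Fragment ions are treated as deprotonated species, so that a neutral mass is recovered as z·(m/z) + z·1.00728.

```python
PROTON = 1.00728  # Da, mass of a proton

# Monoisotopic masses (Da) of the neutral nucleobases that may be lost.
BASE_LOSS = {"A": 135.0545, "G": 151.0494, "C": 111.0433, "U": 112.0273}

def neutral_mass(mz: float, z: int) -> float:
    """Neutral mass of a deprotonated ion observed at m/z `mz`, charge z-."""
    return z * mz + z * PROTON

def classify_pair(parent_mass, frag1, frag2, tol=0.5):
    """Classify a correlated pair; each fragment is an (m/z, charge) tuple."""
    deficit = parent_mass - neutral_mass(*frag1) - neutral_mass(*frag2)
    if abs(deficit) <= tol:
        return "complementary"
    for base, mass in BASE_LOSS.items():
        if abs(deficit - mass) <= tol:
            return f"base loss ({base})"
    # Parent mass not conserved and no listed loss explains the gap.
    return "possible secondary fragmentation"

PARENT = 1874.28  # assumed approximate neutral mass of r(GAUCGU), see above

pair_a = classify_pair(PARENT, (537.82, 1), (599.07, 2))
pair_b = classify_pair(PARENT, (601.55, 2), (516.75, 1))
```

Mass alone shows only that a base has been lost from the pair; deciding which fragment carries the loss relies on the assignment of its partner, as described above.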

The ability of pC-2DMS to characterize the correlations of these fragments, previously assumed to be "uninformative" by the state of the art, means that many more sequence-specific molecular fragmentations contribute to sequence determination than in the standard 1D methodology under the same activation conditions. This gives a much more specific profile of the oligonucleotide and hence a much more confident spectrum-to-structure assignment, either through database search or through de novo sequencing. Figure 22 shows that these principles can also be applied to understanding the fragmentation of oligonucleotides with modifications. In this case the oligoribonucleotide studied was [r(UGAGCUGGGUUU)-5H]^5-, where the modifications are 2'-O-methylated nucleosides, the locations of which are shown as underlined bases. This oligonucleotide is also used to demonstrate the improved specificity achieved by pC-2DMS in Figure 22, which shows a comparison of the 1D MS and pC-2DMS rates of automated assignment. Here we note that the ambiguity in assignment is significantly reduced by pC-2DMS, where multiple assignments only occur for internal fragments and are limited to a maximum of three possible assignments, whereas 1D MS in some cases generates up to twenty-six possible assignments for a single fragment.

We also note that the highest ambiguity in both 1D MS and pC-2DMS occurs when a single backbone fragmentation cannot be assigned. In the present invention, it is then possible to search for matching internal fragments. These fragments have broken at both ends, which means that they do not inherit either end from the parent ion. pC-2DMS of the present invention provides a significant advantage in identifying internal fragments: where one part of the pC-2DMS pair is identified as a product of a typical single backbone fragmentation, this identification narrows the search for the internal fragment to the constituent parts of the complementary terminal fragment. This complementary identification also provides information about the nature of the end of the internal fragment which was attached to the identified terminal fragment. Thus the set of potential internal fragments can be calculated from much narrower initial conditions, leading to fewer incorrect fragment identifications and therefore more reliable identification of oligonucleotide internal fragments than with techniques of the prior art.

Mass spectrometry based modification mapping protocols for RNA utilise liquid chromatography (LC) MS/MS to discover or confirm the positions of modifications. Traditional LC methods can struggle to separate structural isomers; however, the present invention provides a reliable method for doing so. The performance of the pC-2DMS methods of the invention was tested in the challenging task of differentiating a sequence from all possible isomers with a different modification pattern.
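The narrowing of the internal fragment search may be illustrated with a short sketch. It assumes, for illustration only, that the identified partner is a 5' terminal fragment of k residues and that the internal fragment begins at the cleavage site where that terminal fragment ends; the function names are hypothetical.

```python
def all_internal(seq: str):
    """Every contiguous fragment containing neither end of the parent."""
    n = len(seq)
    return [seq[i:j] for i in range(1, n - 1) for j in range(i + 1, n)]

def internal_next_to_5prime_fragment(seq: str, k: int):
    """Internal fragments that start where an identified 5' terminal
    fragment of k residues ends, and stop short of the 3' end."""
    n = len(seq)
    return [seq[k:j] for j in range(k + 1, n)]

unconstrained = all_internal("GAUCGU")
constrained = internal_next_to_5prime_fragment("GAUCGU", 2)
```

For this six-residue example the unconstrained search has ten candidate internal fragments, whereas the constrained search has three.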

The experimental data for the modified oligoribonucleotide studied in Figure 22, [r(UGAGCUGGGUUU)-5H]^5-, was matched against a database of all possible variations of the base sequence UGAGCUGGGUUU with four 2'-O-methylation modifications along the backbone, which gives 495 candidate sequences. A score was assigned for the concurrence of the experimental data and the in silico fragments for each candidate sequence with both pC-2DMS and 1D MS (this is a weighted sum with a contribution of 1 for the RNA characteristic c & y type fragments, 0.75 for the second most common a, a-B & w fragments and 0.5 for all other fragment assignments).
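The size of the candidate database and the weighted sum may be sketched as follows. The weights are those stated above; the names `candidates`, `WEIGHTS` and `score`, and the representation of a candidate as a tuple of modified positions, are assumptions made for illustration.

```python
from itertools import combinations

SEQUENCE = "UGAGCUGGGUUU"
N_MODS = 4

# Each candidate is the set of (0-based) positions carrying a modification.
candidates = list(combinations(range(len(SEQUENCE)), N_MODS))

# Contribution of each matched fragment type to the score.
WEIGHTS = {"c": 1.0, "y": 1.0, "a": 0.75, "a-B": 0.75, "w": 0.75}
OTHER_WEIGHT = 0.5

def score(matched_types):
    """Weighted sum over the fragment types matched for one candidate."""
    return sum(WEIGHTS.get(t, OTHER_WEIGHT) for t in matched_types)

example_score = score(["c", "y", "a-B", "x"])
```

Choosing four modified positions out of twelve gives 12!/(4!·8!) = 495 candidates, in agreement with the number stated above.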

The distribution of these scores is compared in Figure 24. All scores are normalised to the highest recorded score and plotted against the number of pairwise swaps in modification position required to regain the measured sequence, which acts as an indicator of the dissimilarity of the candidate sequence to the measured sequence. The score for the measured sequence can be seen on the far left of each panel and is highlighted in both with a dashed line square. 1D MS fails to assign the highest score to the measured sequence, and a large variety of other sequences are ranked higher than the measured sequence. In contrast, pC-2DMS singles out the measured sequence with the highest score, and a clear downward trend in the score with increasing dissimilarity can be seen in Figure 24. This indicates that pC-2DMS would have successfully identified the correct sequence whereas 1D MS would have failed.

Methods of the present invention therefore provide a new general two-dimensional mass spectrometry based on partial covariance mapping, and it has been demonstrated that the method can be applied to structural analysis in proteomics using a standard mass spectrometer platform. Without requiring any a priori information about the analysed peptides, the partial covariance map shows correlations between the fragment ions formed in the same or in consecutive dissociations, facilitating interpretation of the spectra and matching them to the correct peptide structures. The assignment of relative statistical significances to the CID fragments allows the user to confidently derive correct peptide sequences from spectral peaks, including the unusual, complex-origin and noise-level signals that are routinely misinterpreted or disregarded by traditional one-dimensional mass spectrometry.

The methods of the present invention therefore solve the poor interpretation problem of proteomic mass spectrometry and open new opportunities for characterisation of biomolecules. Such methods could be applied to many other forms of spectroscopy. Other spectroscopic methods are suited to the analysis approach as the data they produce comprise a plurality of spectra that can be divided into bins. In all such methods it is possible to identify a control parameter that is indicative of synchronised fluctuations and that can be employed in the partial covariance analysis to reveal the true statistical correlations between spectral bins.

Preferences and options for a given aspect, feature or parameter of the invention should, unless the context indicates otherwise, be regarded as having been disclosed in combination with any and all preferences and options for all other aspects, features and parameters of the invention.
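The two quantities plotted in Figure 24 may be sketched as follows. Counting a "pairwise swap" as the move of one modification from a wrong position to a correct one is an interpretation made for this sketch, and the function names, positions and scores are hypothetical.

```python
def swaps_to_regain(candidate, measured):
    """Number of modifications that sit at a position not modified in
    the measured sequence and would each need one swap to correct."""
    return len(set(candidate) - set(measured))

def normalise(scores):
    """Scale all scores so that the highest recorded score equals 1."""
    top = max(scores.values())
    return {name: value / top for name, value in scores.items()}

# Hypothetical modification positions and raw scores.
measured = (0, 3, 6, 9)
d_same = swaps_to_regain((0, 3, 6, 9), measured)
d_one = swaps_to_regain((0, 3, 6, 10), measured)
d_all = swaps_to_regain((1, 2, 4, 5), measured)
normalised = normalise({"measured": 8.0, "other": 4.0})
```

Under this interpretation the dissimilarity ranges from 0 for the measured sequence to 4 for a candidate sharing no modification position with it.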

The listing or discussion of background information or an apparently prior-published document in this specification should not necessarily be taken as an acknowledgement that the information or document is part of the state of the art or is common general knowledge.