Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
METHOD FOR RECONSTRUCTION OF SEQUENCES FROM SPECTROMETRY DATA SETS
Document Type and Number:
WIPO Patent Application WO/2023/073064
Kind Code:
A1
Abstract:
A computer implemented method for the reconstruction of peptide sequences from a spectrometry data set obtained by measurement of total masses of multiple diverse peptides in a sample and after breaking the peptides of the sample into fragments measuring the partial masses of said fragments forming together with the total masses the spectrometry data set. The method comprising the step of calculating by a processor a first and a second sequence as complementary sequences of mass pairs wherein the mass pairs in the first sequence are complementary to the mass pairs in the second sequence and both sequences are representing the same reconstructed sequence from the spectrometry data set.

Inventors:
HRUZ TOMAS (CH)
Application Number:
PCT/EP2022/080011
Publication Date:
May 04, 2023
Filing Date:
October 26, 2022
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
ETH ZUERICH (CH)
International Classes:
G16B40/10
Other References:
TING CHEN ET AL: "A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 18 January 2001 (2001-01-18), XP080039374, DOI: 10.1089/10665270152530872
YAN YAN ET AL: "NovoHCD: De novo Peptide Sequencing From HCD Spectra", IEEE TRANSACTIONS ON NANOBIOSCIENCE, IEEE SERVICE CENTER, PISCATAWAY, NY, US, vol. 13, no. 2, 29 May 2014 (2014-05-29), pages 65 - 72, XP011549697, ISSN: 1536-1241, [retrieved on 20140530], DOI: 10.1109/TNB.2014.2316424
BO YAN ET AL: "PRIME: A Mass Spectrum Data Mining Tool for De Nova Sequencing and PTMs Identification", JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, KLUWER ACADEMIC PUBLISHERS, BO, vol. 20, no. 4, 1 July 2005 (2005-07-01), pages 483 - 490, XP019217090, ISSN: 1860-4749, DOI: 10.1007/S11390-005-0483-5
LI HUI ET AL: "Rapid and Accurate Generation of Peptide Sequence Tags with a Graph Search Approach", 27 May 2011, SAT 2015 18TH INTERNATIONAL CONFERENCE, AUSTIN, TX, USA, SEPTEMBER 24-27, 2015; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER, BERLIN, HEIDELBERG, PAGE(S) 253 - 261, ISBN: 978-3-540-74549-5, XP047376561
THILO MUTH ET AL: "Evaluating de novo sequencing in proteomics: already an accurate alternative to database-driven peptide identification?", BRIEFINGS IN BIOINFORMATICS, vol. 19, no. 5, 21 March 2017 (2017-03-21), GB, pages 954 - 970, XP055763951, ISSN: 1467-5463, DOI: 10.1093/bib/bbx033
Attorney, Agent or Firm:
RENTSCH PARTNER AG (CH)
Download PDF:
Claims:
PATENT CLAIMS

1. A computer implemented method for reconstruction of multiple diverse sequences of weighted building blocks from a spectrometry data set, in particular obtained by tandem mass spectrometry, said spectrometry data set comprising a total weight per sequence, partial weights of prefixes and suffixes of the sequences and weight 0, the method comprising the steps of: a. generating by a processor (9) weight pairs (1 ) from the spectrometry data set, each weight pair comprising two weights if the difference between the weights in a weight pair equals to a weight of a sequence of weighted building blocks; b. ensuring by the processor (9) that for each weight pair there exists at least one complementary weight pair (2) with respect to a total weight by confirming i. the difference of the smaller weight and the weight 0 in the given pair is equal to a difference in weight between the larger weight of the complementary pair and the total weight and ii. the difference between the smaller weight and the larger weight of each pair are the same for both weight pairs; and c. calculating by the processor (9) a first and a second sequence as complementary sequences of weight pairs (1 ) wherein the weight pairs in the first sequence are complementary to the weight pairs in the second sequence and both sequences are representing the same reconstructed sequence from the spectrometry data set.

2. The computer implemented method according to claim 1 , wherein the weighted building blocks are amino acids having a specific mass.

3. The computer implemented method according to at least one of the previous claims, wherein the partial weights of the spectrometry data set are amended by the processor (9) for obtaining a set of partial weights symmetric with respect to the total weights by a. calculating symmetric weights (3) for each partial weight with respect to each total weight as the difference of the respective total weight and the respective partial weight; b. comparing each symmetric weight to the partial weights of the data set and in case of no match, the symmetric weight is appended; and c. each symmetric weight is preferably labeled as an added weight, which was not present in the original spectrometry data set.

4. The computer implemented method according to at least one of the previous claims, wherein the method comprises the steps of generating by the processor (9) a directed fragment graph as follows: a. generating for each weight pair two vertices and one edge interconnecting them, said vertices representing the set off all weights (the total weights, the partial weights and zero), and said edges representing the difference in weight of the two vertices in the pair interconnected by the respective edge where the edge represents also the building blocks corresponding to the difference; b. ensuring that every vertex is connected to the vertex corresponding to weight zero and every vertex is connected to a vertex corresponding to one of the total weights; c. ensuring the directed graph consists of complementary edges, wherein a given edge has a complementary edge if the underlying pairs are complementary; and d. calculating paths of edges together with a complementary path where the edges in the first sequence are complementary to the edges in the second sequence representing a reconstructed sequence from the data set.

5. The computer implemented method according to claim 4, wherein generating by the processor (9) a directed fragment graph comprises the step of reducing the edges of the directed fragment graph by a transitive reduction.

6. The computer implemented method according to at least one of the previous claims 4 or 5, wherein the directed fragment graph is generated by the processor (9) by combining directed fragment graphs constructed for every total weight alone.

7. The computer implemented method according to at least one of the previous claims 4 to 6, wherein the method comprises the steps of: a. constructing by the processor (9) a flow network from the directed fragment graph, wherein the flow network construction comprises the steps of: i. generating pairs of peak nodes (4) and convex cost peak edges (5) between said pairs of peak nodes (4) each representing a vertex of the directed fragment graph; ii. generating replica edges (6) which are connecting the peak nodes through zero or more auxiliary nodes and edges (7) on both sides of the replica edge where the replica edges are representing partial sequences; and iii. ensuring a complementary relation (8) between replica edges using the fragment graph complementarity and ensuring that every replica edge has exactly one complementary replica edge; and b. setting by the processor (9) costs and capacities of the edges in the network so that a minimal cost integer flow corresponds to reconstructed sequences from the data set; and such that the minimal cost integral flow fulfills one or both conditions: i. for every replica edge the flow through the edge and the flow through the complementary replica edge are equal; and ii. the sum of inflow through replica edges which have complementary edge with respect to some total weight is equal to the sum of outflow through the edges which have a complementary edge with respect to the same total weight; and c. calculating by the processor (9) a minimal cost integer flow of the flow network corresponding to complementary paths each representing a reconstructed sequence from the data set, wherein a linear programming or mixed-integer programming or an equivalent method for computing the minimal flow is used.

8. The computer implemented method according to claim 7, wherein for each pair of peak nodes (5) the processor (9) assigns a non-separable convex cost condition for the replica edges, in particular contracting the peak edge and one of the peak nodes (5) of the respective pair of peak nodes (5), for counting each vertex in the fragment graph at most once towards an optimum solution.

9. The computer implemented method according to at least one of the previous claims 7 and 8, wherein one or more edges in the flow network are constrained by the processor (9) to be integer flow edges to ensure that the minimal cost flow is integral.

10. A computer implemented method for the reconstruction of a mass spectrum according to at least one of the previous claims, wherein the spectrometry data set is obtained by the tandem mass spectrometry and the spectrometry data set represents the mass spectrum obtained by measurement of total masses of multiple diverse peptides in a sample and measurement after breaking peptides of a sample into fragments measuring the partial masses of said fragments forming together with the total masses and zero mass the spectrometry data set. The computer implemented method according to at least one of the previous claims, wherein the method comprises the steps of a. generating by the processor (9) a message comprising sequence information relating to the reconstructed sequences; and/or b. displaying by a display interconnect to the processor (9) sequence information relating to the reconstructed sequences. A computer system comprising a processor (9) configured to perform the steps according to at least one of the previous claims. A computer program product comprising instructions, when executed on a processor (9) of a computer system, the processor (9) performs the steps of the method according to at least one of the claims 1 to 11 , in particular: a. obtaining by the processor (9) a mass spectrum generated by a mass spectrometer, the mass spectrum comprising total masses of multiple diverse peptides in a sample and partial masses measured after breaking the peptides of the sample into fragments (partial peptides); b. generating by the processor (9) a fragment graph comprising vertices and edges respectively interconnecting two vertices, said vertices representing the total masses and the partial masses, and said edges representing the difference in mass of the two vertices interconnected by the respective edge; c. constructing by the processor (9) a flow network from the fragment graph; d. calculating by the processor (9) a minimal cost flow of the flow network corresponding to sequences of fragment masses representing a reconstruction of the mass spectrum.

14. An analysis system (10) for analyzing of samples comprising multiple diverse peptides, the analysis system (10) comprising: a. a tandem mass spectrometer (11 ) configured to receive the sample and to generate therefrom a spectrometry data set comprising a total weight per sequence, partial weights of prefixes and suffixes of the sequences; and b. a reconstruction unit (12) comprising at least one processor (9) being configured to obtain the spectrometry data set generated by the tandem mass spectrometer (11 ) and to carry out the method according to at least one of the claims 1 to 11 for reconstructing the sequences in the sample.

Description:
Method for Reconstruction of Sequences from Spectrometry Data Sets

FIELD OF THE DISCLOSURE

The present disclosure relates to a method, in particular a computer implemented method, a computer system and a non-transitory memory for reconstruction of spectral data sets, in particular mass spectrometry data sets I measurements.

BACKGROUND OF THE DISCLOSURE

It has been a long known challenge in biology, pharmaceutics etc. to identify proteins in biological samples. A protein being formed by a sequence of amino acids, wherein the number of amino acids of a single protein is typically in the order of 10 15 . An established tool to analyze these samples is liquid chromatography-tandem mass spectrometry (LC- MS/MS), delivering a mass spectrometry data sets which typically consist of simultaneous measurement of around 5 to 20 different peptides (also known as precursors), as well as fragments thereof, from the original sample. Peptides are typically created by breaking the proteins of the original sample into shorter sequences of amino acids. In order to identify and reconstruct these peptides from the mass spectrometry data set, the data set has to be algorithmically processed on a computer system to provide the sequence of amino acids of each peptide. The known methods for the reconstruction of such mass spectrometry data sets have a very high computational complexity, making them unsuitable to reconstruct mass spectrometry data sets measured from more than 2 to 3 different simultaneously fragmented peptides. SUMMARY OF THE DISCLOSURE

A first aspect of the disclosure is directed to a computer implemented method for reconstruction.

In this case we speak of an alphabet (of building blocks I amino acids) as the set of amino acids weighted with their masses and the strings build from that alphabet are protein fragments called peptides. The spectrum is obtained by a mass spectrometry measurement device which measures the total masses of the peptides and then breaks the peptides to prefixes and suffixes. The masses of the prefixes and the suffixes are measured by the device in the second step and together with the total masses they are provided as a mass spectrum (data set).

The problem is to reconstruct the unknown peptides as strings of amino acids. The present disclosure provides a computer procedure to solve this problem implemented on a computer system with a lower computational complexity than the known methods. However, since the approach described herein is not limited to the reconstruction of unknown strings of amino acids a general language is used. Further examples of application include the identification of pharmaceuticals in wastewater or a reconstruction of chemical molecules fragmented in the predefined positions which are consisting of weighted parts, or in information processing, a reconstruction of a paragraph of speech where every world has a weight and the paragraph is fragmented to parts with known total weight. The disclosed method can be used in situations where a physical sequence resp. collection of weighted items is broken to certain subcollections with known weight. Given a finite weighted alphabet A as a set of weighted symbols with positive integer weights we consider the problem how to reconstruct a set of k unknown strings s 1 , s 2 , ..., s k with known total weights 0 < m 1 < m 2 < < m k from a given spectrum, where spectrum is a set of nonnegative integers containing some weights of prefixes and suffixes of the unknown strings. The solution strings are generally not unique; however, we are interested in providing one solution from the set of optimal solutions to the problem. The optimality is understood in a natural sense i.e. that the solution should cover maximum number of given spectral points. Moreover, in applications, the alphabet is known in advance; therefore, the most interesting scenario is the case when the alphabet is fixed.

The problem with a fixed alphabet was solved for k = 1. Since that time the complexity of the problem for k > 1 is not known and no practicable algorithm was at disposal. An illustration of the weighted alphabet spectral reconstruction problem for k = 2 is in Figure 1. In this disclosure we solve the problem with network flow and integer linear programming methods (ILP) which allow to solve the problem for k larger than 3.

We study the problem in an abstract setting for any alphabet, but for example, the problem occurs in mass spectrometry. In this case the alphabet is the set of amino acids weighted with their masses 1 and the strings are proteins or their fragments called peptides. The spectrum is obtained by a mass spectrometry measurement device which measures the total masses of the peptides and then breaks the peptides to prefixes and suffixes. The masses of the prefixes and the suffixes are measured by the device in the second step and together with the total masses they are reported to the biologist as a

1 The alphabet of aminoacids and their weights (masses) as integers with the precision 1 Dalton: G(57), A(71), S(87), P(97), V(99), T(101), C(103), l=L(113), N(114), D(115), Q=K(128), E(129), M(131), 1-1(137), F(147), R(156), Y(163), W(186). In bioinformatics a precision 10 -3 resp. 10 -4 Dalton is often used. mass spectrum. The problem is to reconstruct the unknown peptides as strings of amino acids.

Throughout the disclosure, the set {1,2, ...,n} is denoted as [n], N 0 denotes the set of all positive integers together with 0. Under a path in a directed graph we always mean the directed path. Given a string s = (c 1 ..., c n-1 , c n ) in A*, we denote by r(s) the reverse string r(s) = (c n , c n _ 1 , .... c 1 ). The total weight of the string s denoted w(s) is the sum of all character weights. The weight of a prefix resp. suffix of a string s equals to the weight of the corresponding suffix resp. prefix for the reverse string r(s). W(A) means the set of weights for all alphabet symbols. Spectrum is any finite set of nonnegative integer weights; elements of spectra are called peaks. T(s), called a spectrum of s denotes the set of all prefix and suffix weights of s. Given a k-tuple of strings is defined as a union of spectra T(s i ) for all strings, we sometimes say that x is covered by resp. by As we discuss below, we can suppose without a loss of generality that the largest number in our problem is m k ; therefore, log + m k = logm k loglogm k is used to express the complexities related to arithmetic operations and storage manipulations with the numbers.

The weighted alphabet spectral reconstruction problem (WASR) can be formally described in the following way where we assume for the rest of the disclosure that arbitrary but fixed finite weighted alphabet A is given.

Definition 1 (WASR). Let a spectrum and a set M of k positive integer weights be given. K-tuples of strings fulfilling the constraints w(s i ) = m i ∈ M for all i ∈ [k] are called (feasible) solutions. The weighted alphabet spectral reconstruction problem (X, M) is to find a feasible solution s such that is maximal. Such solution will be called optimal solution of the WASR problem.

There are certain additional assumptions on X and M which can be taken without a loss of generality because these assumptions can be easily verified with a low complexity procedure and they do not constitute the real difficulty of the WASR problem. In the rest of the disclosure we suppose the following. i. For all i, i ∈ [k], there is a string s i such that w(s i ) = m i because if there would be no such string, we would immediately know that there is no solution to the WASR problem. There are several ways how to check this condition; however, we will suppose that the algorithm from Lemma 2.3 is used. ii. 0 ∈ X and because these elements are covered by every solution. Therefore, we add them to X if they are not present. iii. The values in X = {x 1 , ...,x n } and M = {m 1 , ... , m k } are increasingly ordered (sorted). iv. The peak x n = m k because no value larger than m k can be in the spectrum of any solution. Therefore, n < m k + 1, k < m k . v. For any element p in X there are strings s 1 and s 2 such that w(s 1 ) = p and w(s 2 ) = m j - p for some j ∈ [k] . If this condition is not fulfilled then p can not be a part of any solution. The condition can be easily verified with the same procedure as the condition i] above.

The goal of the present disclosure is to solve the WASR problem by converting the problem to a special form of convex cost flow problem. After that, depending on the implementation linear programming and/or mixed-integer programming solvers can be used to provide a solution to the problem. The first step towards this goal is to convert the problem to a graph problem. The resulting graph is called a fragment graph. During this step we establish general features of the WASR problem by proving that the fragment graph is sparse, and providing a low complexity construction algorithm of the graph. Moreover, we introduce one of the central concepts of the disclosure, the edge complementarity.

In the next step we want to search for optimal solutions. To that end we convert the fragment graph to a flow network with convex cost function on certain edges. The flow cost is used to find optimal solutions. Such flow network with additional conditions characterizing the optimal solutions of the problem is called a WASR network. Further we develop an ILP formulation of the min-cost flow network problem together with additional conditions.

One way how to reduce the WASR problem to a graph problem is, roughly said, that we assign vertices to peaks and create edges if the difference between the peak values corresponds to some string in A*. However, we introduce additional concepts which ensure that the graph has low density and that it can be later converted to a network which provides a solution for k > 1.

Complementarity and fragment graph

One of the main ingredients which we contribute is called complementarity relation between the edges of the graph. We discuss the concept later in detail (Definition 2.1 iii], the text after the definition, etc.), but in principle we want to ensure that if an edge represents a part of the string prefix than there must be another edge which represents a corresponding part of the suffix and vice versa, independently from the presence of corresponding peaks in the spectrum. To be able to define the complementarity we have to consider both prefix and suffix peaks even if it is well possible that the spectrum contains only a peak related to the prefix or to the suffix but not both. For this purpose, we use colored vertices in our graph approach. This is also related to the fact that for any string s, T(s) = T(r(s)).

More precisely, let us consider a string s with the total weight m i which is a part of a feasible solution. The weight of a prefix and the weight of the corresponding suffix are equally distant to m i /2, i.e. they are symmetric around m i /2. In fact, this is one way how to see the equality T(s) = T(r(s)). If one value of this symmetric pair (prefix weight, suffix weight) is missing we add the value constructing the set X s = X ∪ {(m i - x)|x ∈ XΛ m i ∈ M Λ ∃s 1 , s 2 (x = w(s 1 ) Λ (m i - x) = w(s 2 ))} called an extended set of peaks. We have to consider all m i because a particular peak in X can be symmetric with peaks for different values of m i . It is possible that in a pair of peaks symmetric around a m i /2 both peaks are in X s - X.

The graph in which we can search for optimal solutions of the WASR problem in the first step is defined below and illustrated in Figure 2.

Definition 2.1 (Fragment graph). Given a WASR problem, a directed graph with labeled edges and labeled colored vertices G(V, E) is called a fragment graph iff it equals to a union of graphs G mi (V mi , E mi ) defined for every m i as follows. For m i the set of vertices equals to the set of pairs (x, c) ∈ X s x {B, R}, x ≤ m i , m i - x ∈ X s where c = B if x ∈ X otherwise c = R (B means black, R red). For a vertex v = (x, c), function l(v) returns the label x. A pair e = ((x i , c i ), (x j , c j )∈ G E mi iff the following conditions are fulfilled: i. there is a string s ∈ A such that w(s) = x j - x j > 0 , and a string s 1 such that w(s 1 ) = m i - x j ii. there is no vertex (x 1; c) such that x i < x l < x j and there are strings s 1 and s 2 for which w(s 1 ) = x l — x i and w(s 2 ) = x j - x l iii. there is a pair ((x l c l ), (x o , c o )) such that m i — x o = x i and m i - x l = x j . Two edges fulfilling this condition are called complementary with respect to m i .

The direction is from the vertex of a smaller peak to the vertex of a larger peak. The label of e is s. In case there are more strings with the weight x j - x i any of these strings can be used as the label; however, in the whole graph the same string is used for this weight.

The additional condition ii] has a consequence (as we show later) that the graph is sparse and the condition iii] introduces the complementarity between the edges. The red color is capturing a situation when the peak is in X s - X. Every vertex in the fragment graph lies on a path from (0, B) to some (m i , B) because of the assumption v] after Definition 1. Additionally, from the same assumption it follows that for every vertex v, l(v) is divisible by g = gcd(W(A)). The fragment graph is a DAG and has a natural topological ordering which is given by the ordering of peaks in N o . The graph has 0(kn) vertices because of the size of X s .

In the rest of the disclosure, when we speak about complementary edges, we omit to mention m if it is clear from the context which m is meant. Given an edge ((x i c i ), (x j , c j ) ) and a complementary edge ((x l c l ), (x o , c o )) with respect to m, it follows that x j - x i = x o - x l independently of m. The complementarity can be equivalently seen as that x i is symmetric around m/2 to x o and x j is symmetric to x l An edge can have more complementary edges with respect to different m i ∈ M as for example the edge ((0, B), (71, B)) in Figure 2, an edge can be complementary to itself as the edge ((87, B), (186, B)) and finally the pair ((99, R), (202, R)) has no complementary pair of nodes; therefore, it is not an edge.

The first question which we have to clarify is how are the feasible and optimal solutions of the WASR problem related to the fragment graph. To that end we need the notion of complementary paths in the fragment graph. A pair of paths in the fragment graph is called complementary (e.g. the paths DAS and SAD in Figure 2) with respect to the first edge e 1 = (v 0 , v 1 ) of p i is complementary to the last edge and the j-th edge of Pi is complementary to the (n-j+1)-th edge of all with respect to the same m i . Concatenation of all edge labels along a path constitute a label of the path.

It follows directly from the definition of the fragment graph and the definition of the complementary paths that if we take any k-tuple of pairs of complementary paths where p i and are complementary with respect to m i , then a k-tuple of path labels is a solution of the WASR problem. Moreover, the next lemma shows, that if we find a k-tuple of pairs of complementary paths which covers maximum number of black vertices then such k-tuple of pairs describes an optimal solution of the WASR problem.

Lemma 2.2: Let (X,M) be a WASR problem and let G(V, E) be the corresponding fragment graph. Let be a k-tuple of complementary pairs of paths such that are complementary paths with respect to m i . Let denotes the set of all black vertices in is maximal among all such k-tuples then p describes an optimal solution s of the WASR problem where s = (s 1 , ..., s n ) and the strings Sj are labels of the paths p i .

To see schematically why the lemma is true we can use the following set of ideas. We argue by contradiction where we consider an optimal solution s = (s 1 , ..., s n ) to the WASR problem. The weights of the prefixes of s ; intersected with X s and the weights of the suffixes of s i intersected with X s must correspond to a path and to a complementary path in G. It is not possible that this k-tuple of pairs of complementary paths covers more black vertices than the best k-tuple of pairs of complementary paths. Also, the set of k-tuples of pairs of complementary paths is not empty.

Knowing one optimal solution of the WASR problem provided by the fragment graph, we can construct application domain specific optimal solutions by assigning different labels with the same weight to the edges of the fragment graph thereby fulfilling additional criteria on the optimal solutions.

As a consequence of the facts above, we can search for optimal solutions in the fragment graph and we can reformulate the WASR problem as a problem to find a k-tuple of pairs of complementary paths which covers the maximum number of black vertices in the fragment graph. As we discussed at the beginning, we reduce the graph problem further to a network flow problem. However, before that we need to construct the fragment graph.

For a construction of the fragment graph (to check the condition i] in Definition 2.1 we need an algorithm which given two peaks returns a string with weight equal to the difference m of the peaks or "@" in case such string does not exist. This algorithm is described in the following lemma. Lemma 2.3: For any alphabet A there is an algorithm with complexity O(log + m) which for any integer input m returns a string from A* with the weight m or "@" if no such string exists.

To see schematically why the lemma is true we can use the following set of ideas. If m is divisible by g, the algorithm uses a repetition of the character with the weight m s (a division with m s ) to reach a weight region from which there is always a string because of Lemma 2.4. If m is not divisible by g there is no string with the weight m.

Apart from the simple algorithm in Lemma 2.3, for different applications there are also other possibilities how to generate the string in Lemma 2.3. The algorithm in Lemma 2.3 is based on known ideas about the Subset-Sum problem and on the theory of unbounded knapsacks.

The following statement is summarizing well known facts.

Lemma 2.4: Let A be a finite weighted alphabet, m s is the smallest and m l is the largest weight of an alphabet element and g = gcd(W(A)). There is a constant c ≤ (m s /g - 1)(m l /g - 1) - 1 such that for every m > eg there is a string with weight m if g divides m, otherwise there is no such string.

The fragment graph can be constructed with a low complexity polynomial procedure. G mi can be understood as constructed in three steps: firstly by using only the condition i] from Definition 2, then computing a transitive reduction (introducing the condition ii]). The third condition follows from the conditions i], ii]. However, a better construction algorithm exists (see Lemma 2.5 below) as a consequence of the fragment graph density.

The density of the fragment graph is relevant in the complexity of the overall algorithm because it has consequences for the density of the network flow problem constructed later. Based on Lemma 2.4 we show in Lemma 2.5 below that the graph is sparse and a fast construction algorithm can be designed.

Lemma 2.5: Let (X, M) be a WASR problem and let G(V, E) be the corresponding fragment graph. i. The number of outgoing and incoming edges for every vertex in G mi is bounded by a constant dependent only on the alphabet. G is sparse having 0(kn) edges. i. For every edge (u, v) in G mi , the size of the set S = {x ∈ G mi |l(u) ≤ l(x) ≤ l(v)} is bounded by a constant. |S| ≤ 2cg + 3. ii. G can be constructed with complexity 0(kn log + m k ).

To see schematically why the lemma is true we can use the following set of ideas. As we noted above it is sufficient to consider only the first two conditions i], ii] from Definition 2. First we consider the statements i], ii] and iii] for G mi The statements i], iii] for G follow from the fact that G is a union of G mi i]: Let us consider a vertex v in G mi and a set of all vertices which have an outgoing edge leading to v. If this set is too large there are two elements in the set connected by an edge and they are also connected to v. This is contradicting the transitive reduction property of G mi ii]: a very similar argument as in i] leads to the fact that an edge can “overjump” only a constant number of nodes in G mi iii]: during a scan of a node set to generate edges, we can always stop the scan after a constant number of nodes because from ii] it follows that we cannot generate any new edge after that stage.

From fragment graph to network flow problem As we discussed at the beginning, our final goal is to approach the search for optimal solution of the WASR problem through a network flow model. As an intermediate step, in the previous section we converted the WASR problem to a problem in the fragment graph and we studied the basic features of the fragment graph. In the next step we reduce further the graph problem to a specific network flow problem.

We start by defining a flow network called a WASR network and reducing from the fragment graph to the network flow problem. The network has a feature that an integral min-cost flow, which is fulfilling certain additional flow constrains, is representing an optimal solution to the fragment graph problem from the previous section which in turn represents an optimal solution to the WASR problem.

The network flow problem can be approached with integer linear programming which allows to unify the description of all conditions. The ILP formulation of the problem can be solved for reasonable size of the instances (depending on the mixed-integer solver and ILP heuristics used, up to around k=20) with MIP solvers. The solution can be easily mapped to a set of k strings giving an optimal solution of the WASR problem.

WASR network

The WASR network is constructed in four steps below where the fragment graph serves as a basis for the construction. Schematically, we follow a goal to construct a network where the k-tuples of pairs of complementary path in the fragment graph are represented by the k-tuples of pairs of unit flows along complementary paths. The precise definition of complementarity in the WASR network is developed below.

The WASR network construction step 1 : we take the fragment graph G as the initial underlying graph for the WASR network and we denote as the vertex for which 0, and as the vertices for which Every edge e of this initial graph has one or more complementary edges (could also be itself) and there can be maximally one complementary edge with respect to a fixed m i . The edge e is replicated one or two times the number of complementary edges with respect to different elements m i ∈ M. In the example in Figure 3 we use the variation using one replica per different element of m i ∈ M. The replica cost will be set later accordingly. Moreover, every such replicated edge is replaced with odd number L (an integer larger than 0) of consecutive directed edges where the middle edge is called replica edge (shortly replica) and the incident nodes are called fragment nodes. In the example in Figure 3, L is chosen to be 7 creating 3 auxiliary edges and 2 auxiliary nodes on each side of the replica. The fragment nodes are labeled with l(u) resp. l(v) where u is the starting node of the path replacing the edge replica and v is the ending node (compare Figure 3, to simplify the figure, the direction of replica replacement edges is illustrated only for the last edge of the seven replacement edges). The replica edge originated from an edge in the fragment graph have the same label as the original edge. There is a supply 2k in the node and the demands 2 in the nodes The capacities and costs are set in a way that a minimum cost integral flow represents an optimal solution to the WASR problem. In the example in Figure 3 the capacities are set to 2 and the costs to 0 except for the replicas for a selfcomplementary edge where the capacity is 1. The replica edges are taking over the string labels from the original fragment graph, all other edges have no label.

The additional auxiliary nodes and edges which separate the replica from the peak nodes originating from the fragment graph represent a sparsification of the matrix towards a construction of a totally unimodular system matrix. If the total unimodularity is not used as in the ILP approach, the auxiliary nodes and edges are not necessary. If a flow is passing through a node between two replica edges, we need to count the coverage of the corresponding peak. For this purpose we use pairs of nodes connected with an edge having a convex cost function in the next step of the construction (see Figure 3).

The WASR network construction step 2: Every node u originating from G is replaced (split) with two nodes u i , u o called peak nodes (see Figure 3). Both nodes are connected with a new edge denoted p(u), called a peak edge and labeled with l(u). The edge p(u) has a capacity 2k and the cost function dependent on color and type of u. In the example of Figure 3, for edges and the cost function is identically 0. For other peak edges, if u is black then the cost is a convex function being 0 for flow 0 and -1 for any larger flow. If u is red, the cost is identically 0. The costs are illustrated in panel C of Figure 3. The outgoing edges from u are redirected to leave u°. Supply is placed in the tail of and demands are in the heads of

The WASR network construction step 3: we replace every peak edge having a convex cost function with a pair of linear cost edges. The first edge has capacity 1 and cost -1 and the second edge has capacity 2k - 1 and cost 0 (see the panel T of Figure 3).

Step 2 and 3 above ensure the counting of the black vertices in the fragment graph with a separable convex function. However, they can be replaced with a non-separable condition on the replica edges being equivalent to the separable convex function, for counting each vertex in the fragment graph at most once towards an optimum solution. Said optimum solutions represent a reconstructed sequence from the spectrometry data set. The WASR network construction step 4: The complementarity for replica edges in the WASR network is defined using the labels l( ) of the fragment nodes of replica edges in the same way as in G; however, when we replicate an edge e from the fragment graph to replica edges e 1 , ..., e j , because it has more complementary edges in the fragment graph with respect to different m i , we fix an assignment of every replica edge e i to some complementary replica edge and we define these two edges as being complementary. The assignment is arbitrary but fixed and it is one-to-one because of the construction of the WASR network (illustrated in Figure 3 with the dotted lines). When we speak about complementarity in the WASR network, we mean this assignment. The complementarity is not defined for any other edges except for the replica edges for obvious reasons. We call the flow network constructed above, with the standard flow conservation rules in the nodes together with a fixed complementarity assignment, a WASR network. When needed, we are using the convention that an entering flow is counted positive and a leaving flow negative.

The replica edge complementarity allows us to define the complementarity for paths (p 1 ,p 2 ) from tail of to head of Analogically to G, the paths in the WASR network are complementary if the first replica edge in p 1 is complementary to the last replica edge in p 2 , j-th replica edge in p 1 is complementary to j-th replica edge from the end of p 2 and the last replica edge in p 1 is complementary to the first replica edge of p 2 . All complementarities in the definition of complementary paths must be with respect to the same m i .

A solution to the WASR problem is represented in the WASR network by a unit flow along a k-tuple of pairs of complementary paths. It is straightforward to conclude that the WASR network and its costs are defined in such a way that an optimal solution in the fragment graph (which covers the maximum number of black vertices) corresponds exactly to a min-cost unit flow along k-tuple of pairs of complementary paths. On the other hand, being only min-cost flow in the WASR network is not sufficient to represent an optimal solution. The reason is that such flow must not be integral and even if the flow is integral, the flow must not be along k pairs of complementary paths. These problems can be addressed with specific constrains to the flow.

First, let us consider what are the necessary conditions for an integral feasible flow such that the flow represents a solution to the WASR problem. Below, in Lemma 2.6 we prove that these conditions are also sufficient.

A flow representing a solution consists of k pairs of unit flows along k complementary pairs of paths. Therefore, the flow through any replica edge e must be the same as the flow through a complementary replica edge e c assigned to it. This leads to the following additional conditions:

Condition (1): For every replica edge e the flow through e and the flow through the complementary edge are equal.

Additionally, given a solution to the WASR problem, the incoming flow to a peak edge through replica edges complementary with respect to the same m i must be equal to the outgoing flow through replica edges which posses any complementary edges with the same m i . This is because the solution is represented by a unit flow along k pairs of paths where every pair is complementary with respect to certain For example in Figure 3, there are two replica edges connecting 0 with 87. The flow through the right one denoted as complementary to some other edge with respect to m 1 , must be equal to the flow through all outgoing replica edges complementary also with m 1 . In this case the outgoing total flow through the replica edge 87-186 denoted e 2 must be the same as incoming flow through e 1 . e 2 is the only outgoing edge complementary to some other edge with respect to m 1 Similarly, for the left replica edge g 1 between 0-87 being complementary to some edge with respect to m 2 , the flow equals to the sum of the flows through which are all complementary with respect to m 2 . This leads to the conditions (with special cases for and

Condition (2): For every peak edge except the peak edges and the sum of the in-flow through replica edges which have complementary edge with respect to m i must be equal to the sum of the out-flow through the edges which have complementary edge with respect to the same m i . For the sum of in-flow through replica edges which have complementary edge with respect to m i must be equal to 2 (equal to demand). For any m j > m i , the same rule must be true for this peak edge as in the previous sentence.

In other words, the conditions (2) enforce additional flow conservation rules for peak edges (which correspond to vertices in the fragment graph) where the conservation of flow is restricted to the edges which are complementary to some edge with respect to m i . For there is no rule (2).

Lemma 2.6: An integral feasible flow in the WASR network which fulfills the conditions (1 , 2) represents a solution to the WASR problem. If the flow is additionally min-cost it represents an optimal solution.

To see schematically why the lemma is true we can use the following set of ideas. We can use induction over k where we show that the conditions (1 , 2) enforce a solution structure on an integral feasible flow in the WASR network. The optimality follows in a straightforward way from the definition of costs in the network.

From WASR network to ILP formulation

A straightforward solution to the WASR problem can be obtained by requesting all edges in the WASR network to be integral and using modern mixed-integer (MIP) solver to provide a solution. Additionally, various heuristics to reduce the number of integer edges can be used according to methods from ILP. Depending on the field of application only a subset of replicas or a subset of peak edges or a subset of both is declared to be integral.

Further aspects of the Disclosure - preferred implementation

A computer implemented method according to the disclosure for reconstruction of multiple diverse sequences of weighted building blocks from a spectrometry data set comprises in a preferred variation the following steps. Typically said spectrometry data set comprises a total weight per sequence, partial weights of prefixes and suffixes of the sequences and weight 0. The steps are: generating by a processor weight pairs from the spectrometry data set, each weight pair comprising two weights if the difference between the weights in a weight pair equals to a weight of a sequence of weighted building blocks; ensuring by the processor that for each weight pair there exists at least one complementary weight pair with respect to a total weight by confirming the difference of the smaller weight and the weight 0 in the given pair is equal to a difference in weight between the larger weight of the complementary pair and the total weight and the difference between the smaller weight and the larger weight are the same for both pairs; and calculating by the processor a first and a second sequence as complementary sequences of weight pairs wherein the weight pairs in the first sequence are complementary to the weight pairs in the second sequence and each sequence is representing a reconstructed sequence from the spectrometry data set.

In some variations of the computer implemented method the weighted building blocks are amino acids having a specific mass.

In some variations of the computer implemented method the partial weights of the spectrometry data set are amended by the processor for obtaining a set of partial weights symmetric with respect to the total weights by calculating symmetric weights for each partial weight with respect to each total weight as the difference of the respective total weight and the respective partial weight; and comparing each symmetric weight to the partial weights of the data set and in case of no match, the symmetric weight is appended. Preferably each symmetric weight is labeled as an added weight, which was not present in the original spectrometry data set.

In some variations of the computer implemented method the method comprises the steps of generating by the processor a directed fragment graph as follows: generating for each weight pair two vertices and one edge interconnecting them, said vertices representing the set off all weights (the total weights, the partial weights and zero), and said edges representing the difference in weight of the two vertices in the pair interconnected by the respective edge where the edge represents also the building blocks corresponding to the difference. During this step the processor further ensures that every vertex is connected to the vertex corresponding to weight zero and every vertex is connected to a vertex corresponding to one of the total weights; and that the directed graph consists of complementary edges, wherein a given edge has a complementary edge if the underlying pairs are complementary. Furthermore the processor preferably calculates paths of edges together with a complementary path where the edges in the first sequence are complementary to the edges in the second sequence representing a reconstructed sequence from the data set; and

In some variations the computer implemented method comprises the step of generating by the processor a directed fragment graph comprises the step of reducing the edges of the directed fragment graph by a transitive reduction.

In some variations of the computer implemented method the directed fragment graph is generated by the processor by combining directed fragment graphs constructed for every total weight alone.

In some variations the computer implemented method comprises the step of constructing by the processor a flow network from the directed fragment graph, wherein the flow network construction comprises the steps of: generating pairs of peak nodes and convex cost peak edges between said pairs of peak nodes each representing a vertex of the directed fragment graph; generating replica edges which are connecting the peak nodes through zero or more auxiliary nodes and edges on both sides of the replica edge where the replica edges are representing partial sequences; and ensuring a complementary relation between replica edges using the fragment graph complementarity and ensuring that every replica edge has exactly one complementary replica edge.

Further, setting by the processor costs and capacities of the edges in the network so that a minimal cost integer flow corresponds to reconstructed sequences from the data set; and such that the minimal cost integral flow fulfills one or both conditions: for every replica edge the flow through the edge and the flow through the complementary replica edge are equal; and the sum of inflow through replica edges which have complementary edge with respect to some total weight is equal to the sum of outflow through the edges which have a complementary edge with respect to the same total weight.

Further, calculating by the processor a minimal cost integer flow of the flow network corresponding to complementary paths each representing a reconstructed sequence from the data set, wherein a linear programming or mixed-integer programming or an equivalent method for computing the minimal flow is used.

In some variations of the computer implemented method the processor assigns to each pair of peak nodes having a flow, a unit value for counting each pair at most once towards an optimum solution.

In some variations of the computer implemented method, for each pair of peak nodes the processor assigns a non-separable convex cost condition for the replica edges, in particular contracting the peak edge and one of the peak nodes of the respective pair of peak nodes, for counting each vertex in the fragment graph at most once towards an optimum solution.

In some variations the computer implemented method one or more edges in the flow network are constrained by the processor to be integer flow edges to ensure that the minimal cost flow is integral. This is typically achieved by setting the corresponding parameters for a solver module used.

In some variations of the computer implemented method the spectrometry data set is obtained by the tandem mass spectrometry and the spectrometry data set represents the mass spectrum obtained by measurement of total masses of multiple diverse peptides in a sample and measurement after breaking peptides of a sample into fragments measuring the partial masses of said fragments forming together with the total masses and zero mass the spectrometry data set.

In some variations the computer implemented method comprises the steps of generating by the processor a message comprising sequence information relating to the reconstructed sequences; and/or displaying by a display interconnect to the processor sequence information relating to the reconstructed sequences.

Another aspect of the disclosure is directed to a computer system comprising a processor configured to perform the computer implemented method as described above.

Another aspect of the disclosure is directed to a computer program product comprising instruction for a processor to carry out the steps as described above. In a preferred variation the computer program product comprises instructions, when executed on a processor of a computer system, the processor performs the steps of: obtaining by the processor a mass spectrum generated by a mass spectrometer, the mass spectrum comprising total masses of multiple diverse peptides in a sample and partial masses measured after breaking the peptides of the sample into fragments (partial peptides); generating by the processor a fragment graph comprising vertices and edges respectively interconnecting two vertices, said vertices representing the total masses and the partial masses, and said edges representing the difference in mass of the two vertices interconnected by the respective edge; constructing by the processor a flow network from the fragment graph; and calculating by the processor a minimal cost flow of the flow network corresponding to sequences of fragment masses representing a reconstruction of the mass spectrum. The computer program product can be stored on a non-transitory memory.

Another aspect of the disclosure is directed to an analysis system for analyzing of samples comprising multiple diverse peptides, the analysis system comprising a tandem mass spectrometer and a reconstruction unit. The tandem mass spectrometer is typically configured to receive the sample and to generate therefrom a spectrometry data set comprising a total weight per sequence, partial weights of prefixes and suffixes of the sequences. The reconstruction unit typically comprises at least one processor being configured to obtain the spectrometry data set generated by the tandem mass spectrometer and to carry out the method as described above for reconstructing the sequences in the sample.

In other variations a computer implemented method according to the disclosure for weighted spectral reconstruction of k sequences with k total weights m 1 , m 2 , ..., m k from weights of prefix and suffix fragments of the sequences, wherein the sequences consist of weighted building blocks. The method typically comprises the step of generating a fragment graph where the fragment graph is a union of graphs constructed for every m i by assigning vertices to prefix and suffix weights, and edges to a weight differences between the vertices if the weight differences correspond to a sum of weights of one or more characters and taking a transitive reduction of the graph for every m i .

The method may further comprise of steps of using complementarity of edges in the fragment graph, where two edges (x i ,x j )(y l ,y o )are complementary with respect to m i if m i — y o = x i and m i — y l = X j . In a preferred variation, the method comprises the step of ensuring in the fragment graph that every edge has exactly one complementary edge.

In a preferred variation, the method comprises the step of building a flow network from the fragment graph where minimal cost flow corresponds to the reconstructed k sequences.

Another aspect of the disclosure is directed to a computer implemented method for the reconstruction of a mass spectrum obtained by measurement of total masses of multiple different peptides in a sample and after breaking the peptides of the sample into fragments measuring the fragment masses of said fragments forming together with the total masses the mass spectrum. The method typically comprising the steps of generating a fragment graph comprising vertices and edges respectively interconnecting two vertices, said vertices representing the total masses and the fragment masses, and said edges representing the difference in mass of the two vertices interconnected by the respective edge. Constructing a flow network from the fragment graph and calculating a minimal cost flow of the flow network corresponding to sequences of reconstructed peptides.

The method may further comprise in a variation constructing the fragment graph the step of reducing the number of edges of the fragment graph, in particular by transitive reduction, thereby obtaining a sparse fragment graph.

Another aspect of the disclosure is directed to a computer system comprising a processor configured to perform the step of obtaining a mass spectrum comprising total masses of multiple different peptides in a sample and fragment masses measured after breaking the peptides of the sample into fragments. Generating a fragment graph comprising vertices and edges respectively interconnecting two vertices, said vertices representing the total masses and the fragment masses, and said edges representing the difference in mass of the two vertices interconnected by the respective edge. Constructing a flow network from the fragment graph. Calculating a minimal cost flow of the flow network corresponding to sequences of reconstructed peptides.

Another aspect of the disclosure is directed to a non-transitory medium having stored thereon instructions, when executed on a processor of a computer system, the processor performs the steps of, obtaining a mass spectrum comprising total masses of multiple different peptides in a sample and fragment masses measured after breaking the peptides of the sample into fragments. Generating a fragment graph comprising vertices and edges respectively interconnecting two vertices, said vertices representing the total masses and the fragment masses, and said edges representing the difference in mass of the two vertices interconnected by the respective edge. Constructing a flow network from the fragment graph. Calculating a minimal cost flow of the flow network corresponding to sequences of reconstructed peptides.

It is to be understood that both the foregoing general description and the following detailed description present embodiments, and are intended to provide an overview or framework for understanding the nature and character of the disclosure. The accompanying drawings are included to provide a further understanding, and are incorporated into and constitute a part of this specification. The drawings illustrate various embodiments, and together with the description serve to explain the principles and operation of the concepts disclosed. BRIEF DESCRIPTION OF THE DRAWINGS

The herein described disclosure will be more fully understood from the detailed description given herein below and the accompanying drawings which should not be considered limiting to the disclosure described in the appended claims. The drawings are showing:

Figure 1 : an illustration of the weighted alphabet spectral reconstruction problem for k = 2 with the alphabet from Footnote 1 (page 3);

Figure 2: a fragment graph of the example in Figure 1 ;

Figure 3: a WASR flow network construction from the fragment graph using a construction from the fragment graph in Figure 2 as an example; and

Figure 4: an analysis system according to the disclosure.

Reference will now be made in detail to certain embodiments, examples of which are illustrated in the accompanying drawings, in which some, but not all features are shown. Indeed, embodiments disclosed herein may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Whenever possible, like reference numbers will be used to refer to like components or parts.

Figure 1 shows an illustration of the weighted alphabet spectral reconstruction problem for k = 2. The spectra can be incomplete; some prefix or suffix weights can be missing. The missing weights are represented with dotted markers. Picture (a) shows spectrum of the string AD and Picture (b) the string DAS. Picture (c) shows the combined spectrum of all strings. The problem is to reconstruct the strings (a), (b) from the spectrum (c).

Figure 2 shows a fragment graph of the example in Figure 1. The edge labels belong to the nearest edge to the left. The vertices corresponding to the string weights nij = 186, m 2 = 273 are marked with a fat circle and the red vertices are marked with a dashed circle. The graph edges are drawn with full lines, the fat lines correspond to the strings AD and DAS. The vertex (99, R) is red because 99 is an extended peak to 87 with respect to m-L = 186, (158, R) is red because 158 is an extended peak to 115 with respect to m 2 = 273. ((0, B), (115, B)) has a complementary edge ((158, R), (273, B)), both emphasized with additional dashed lines. There is no edge between (99, R) and (202, R) because the complementary edge does not exist. The path (0, 115, 186, 273) is complementary to (0, 87, 158, 273). The strings (AD, DAS) are an optimal solution to the WASR problem. Another optimal solution is (AD, ADS). If the peak 158 would be in the spectrum, there would be only one optimal solution, namely (AD, DAS). The strings (SV, DAS) also solve the WASR problem; however, not optimally.

Figure 3 shows the flow network (herein also referred to as WASR network) construction for the fragment graph in Figure 2. In "a:b", "a" is the edge capacity and "b" is the edge cost. In this example the costs and the capacities are the following. The peak edges which count the coverage for the black vertices in the fragment graph have convex cost according to the panel "C" and capacity 2k = 4. The edges corresponding to the red vertices have capacity 2k = 4 and cost 0. The supply is 2k = 4 and the demands are 2. All other edges have capacity 2 and cost 0 except for the selfcomplementary replicas where the capacity is 1 . The transformation from a convex cost network to a linear cost network with parallel edges in the special case of the WASR network is shown in the panel "T". The splitting of edges between the vertices of the fragment graph is described in the text as well as the dotted lines which illustrate the complementarity.

Figure 4 shows an analysis system 10 according to the disclosure. The schematically shown analysis system 10 comprises a tandem mass spectrometer 11 and a reconstruction unit 12. The tandem mass spectrometer 11 is configured to receive a sample comprising typically around fifteen to twenty dissimilar peptides formed as a sequence of amino acids. Typically the sample comprises a multiplicity of the same peptide per for each of the peptides. The peptides are used to generate at least two mass spectra one of the peptide as they are in the sample and one mass spectra after the peptides have been fragmented into prefixes and suffixes. These at least two mass spectra from a spectrometry data set comprising a total mass/weight per peptide/sequence and partial masses/weights of prefixes and suffixes of the peptides/sequences. In the shown variation the spectrometry data set is transmitted to a reconstruction unit 12 for reconstruction of the sequences of amino acids. The reconstruction unit 12 comprises at least one processor 9 configured to process the spectrometry data set by carrying out the method as illustrated in Figures 1 to 3 in order to output sequence information.