

Title:
CENTROIDING OF MASS SCAN DATA OBTAINED FROM HIGH-RESOLUTION MASS SPECTROMETRY (HR-MS) INSTRUMENTS
Document Type and Number:
WIPO Patent Application WO/2023/203584
Kind Code:
A1
Abstract:
The present subject matter is related to deep learning-based centroiding of mass scan data generated by high-resolution mass spectrometry (HR-MS) instruments, operated in standalone mode or when connected to other devices including but not limited to gas or liquid chromatography, capillary electrophoresis or ion mobility separation devices. The raw data or analytical data that is received from the HR-MS instruments is first converted into an open source format such as the mzML format, obvious noise is removed by applying ad hoc filters, regions of interest (ROI) are built from the mass scans, ROIs are classified as peak or noise, peak boundaries identified if classified as peak, and centroids recorded as a tuple of median, mean or weighted average mass to charge ratio (denoted as m/z) and area under the curve. The ROI classification and peak boundary detection steps make use of specially trained deep learning methods.

Inventors:
WANGIKAR PRAMOD PRABHAKAR (IN)
NAKRANI PRAJVAL SUSHIL (IN)
PATIL SACHIN (US)
Application Number:
PCT/IN2023/050400
Publication Date:
October 26, 2023
Filing Date:
April 24, 2023
Assignee:
CLARITY BIO SYSTEMS INDIA PRIVATE LTD (IN)
INDIAN INST TECHNOLOGY BOMBAY (IN)
International Classes:
G01N3/02; G01N30/86
Foreign References:
CN112183677B    2021-02-23
US20210225626A1    2021-07-22
CN105334279A    2016-02-17
Attorney, Agent or Firm:
GUPTA, Ashish (IN)
Claims:
We/I claim:

1. A method for centroiding of a peak in a mass scan, of a sample, the method comprising the steps of: a. identifying a region of interest(ROI) as long ROI, normal ROI or small ROI, wherein, an ROI of size larger than a target size is a long ROI, an ROI of size smaller than the target size is a small ROI and an ROI of size equal to the target size is a normal ROI; b. breaking each of the long ROI into one or more normal ROI and/or one or more small ROIs; c. checking if said ROI contains at least one true peak, as that caused due to the presence of a substance in the sample and not because of an instrument noise, wherein the checking is based upon a first deep learning method trained on a first pre-identified dataset; d. marking a peak boundary of the peak based on a second deep learning method configured to identify the ends of a curve associated with the said peak; and e. identifying a centroid of a peak as a pair of a representative mass by charge value and a representative abundance value for the identified peak, wherein the representative mass by charge value is calculated as the average of the mass by charge values of the peak, within the identified boundary, weighted by the corresponding intensity values and representative abundance value for the identified peak is calculated as the area under the curve of the said peak, within the identified peak boundary.

2. The method as in claim 1, wherein a region of interest is a range of mass by charge in the mass scan, such that the intensity values corresponding to those mass by charge values are either continuously non-zero with zero values, if any, being isolated by being preceded and succeeded by a non-zero value and the ROI having an ROI boundary of a predefined consecutive zero intensity values.

3. The method as in claim 1, wherein the first deep learning method comprises the steps of: a. fetching a set of training mass spectrometry data files as the first pre-identified dataset of first pre-identified data files, focusing on metabolomics, from various vendors available; b. identifying a training region of interest (training ROI) within each of the first pre-identified data files; c. randomly extracting a predefined number of ROIs, being from a spread out range of mass/charge values in said first pre-identified data file; d. repeating steps 3.a to 3.c above for each of the first pre-identified data files chosen in step 3.a; and e. identifying the ROIs as containing at least one true peak, as such, based on identification by a user, otherwise labeling said ROI as noise ROI.

4. The method as in claim 3, wherein the first deep learning method comprises the step of fetching the training mass spectrometry data file focusing on metabolomics from in-house and open sources.

5. The method as in claim 3, wherein the first pre-identified set of data files is of molecules with mass by charge less than 2000 Da.

6. The method as in claim 3, wherein the first pre-identified set of data files is acquired from instruments of more than one vendor of mass spectrometry instruments.

7. The method as in claim 3, further comprising the step of repeating steps 3.a to 3.e till nearly an equal number of examples for both peak and noise labeled ROI are identified.

8. The method as in claim 3, wherein the second deep learning method comprises the steps of: a. fetching the ROIs containing at least one true peak; and b. marking the start and end boundary of all the peaks within the confines of the ROI by a user.

9. The method as in claim 1, wherein a target size of the ROI, n, is calculated based on (i) n that is configured to be large enough so that true peaks are rarely longer than n in real life mass scan data generated by mass scan instruments, and (ii) n is small enough to minimize the complexity of the first and the second deep learning method.

10. The method as in claim 1, wherein ROIs that are of size k (k being less than n) are forced to be of a target size either by zero padding or by interpolating and resampling.

11. The method as in claim 1, wherein ROIs that are larger than a target size are broken down based on the steps of: a. thresholding the ROI by setting intensities below a set threshold as 'zero', hence breaking the ROIs; b. checking if the ROI length is still more than a predefined ROI size or a target size, and if so proceeding to the next step; c. using a 'Sliding and Overlapping Window' approach by sliding a window of a predefined length over a long ROI and storing such windowed regions as ROIs for further analysis.

12. The method as in claim 11, wherein identifying said set threshold for breaking of the ROIs comprises the steps of: a. listing all local maxima in a given mass spectrometry file by assuming every local maximum to be a peak; b. dividing the intensity range into bins of a fixed size and segregating the local maxima based on their intensity; c. identifying a set of bins, particularly at low intensities, where the number of peaks in those bins is significantly more than in other bins; d. marking the identified bins as the true baseline noise level and using the same as a reference of the threshold for noise elimination.

13. The method as in claim 1, further comprising the steps of identifying an unidentified true peak by: a. fetching the ROIs that contain at least one peak; b. zeroing out the true peak of the ROI already identified; and c. repeating the step of identifying the presence of at least one true peak in said ROI.

14. The method as in claim 1, wherein the output of the first deep learning method is binary or multivalent, with the output variable having options comprising at least presence of 1) all Noise, 2) at least one Peak and 3) Borderline cases, being ROIs which have one or more Gaussian characteristics but a low signal to noise ratio.

15. The method as in claim 1, further comprising, prior to identifying the presence of at least one true peak in said ROI, normalizing the intensity values in the ROI by dividing all the intensity values in the ROI by the maximum intensity value in that ROI, ensuring all intensity values in the ROI lie between a value zero and a value one.

16. The method of claim 1, wherein prior to identifying the presence of at least one true peak in said ROI, the identified ROI is preprocessed using the steps of: smoothening using a moving average and producing a smoothened intensity vector; taking a derivative of the smoothened intensity vector, producing a derivative vector; and concatenating said smoothened intensity vector and said derivative vector to together form an input feature vector for the deep learning method.

17. The method of claim 1, comprising the steps of obtaining multiple centroids for each ROI, by obtaining a centroid for each peak in the event of presence of multiple peaks within the ROI.

18. An article of manufacture including a non-transitory computer readable storage medium to tangibly store instructions, which when executed by a computer, cause the computer to: a. identify a region of interest (ROI) as long ROI, normal ROI or small ROI, wherein an ROI of size larger than a target size is a long ROI, an ROI of size smaller than the target size is a small ROI and an ROI of size equal to the target size is a normal ROI; b. break each of the long ROI into one or more normal ROI and/or one or more small ROIs; c. check if said ROI contains at least one true peak, as that caused due to the presence of a substance in the sample and not because of an instrument noise, wherein the checking is based upon a first deep learning method trained on a first pre-identified dataset; d. mark a peak boundary of the peak based on a second deep learning method configured to identify the ends of a curve associated with the said peak; and e. identify a centroid of a peak as a pair of a representative mass by charge value and a representative abundance value for the identified peak, wherein the representative mass by charge value is calculated as the average of the mass by charge values of the peak, within the identified boundary, weighted by the corresponding intensity values and representative abundance value for the identified peak is calculated as the area under the curve of the said peak, within the identified peak boundary.

19. A computer implemented method for centroiding of a peak in a mass scan, of a sample, the method further comprising instructions which when executed by the computer cause the computer to: a. identify a region of interest (ROI) as long ROI, normal ROI or small ROI, wherein an ROI of size larger than a target size is a long ROI, an ROI of size smaller than the target size is a small ROI and an ROI of size equal to the target size is a normal ROI; b. break each of the long ROI into one or more normal ROI and/or one or more small ROIs; c. check if said ROI contains at least one true peak, as that caused due to the presence of a substance in the sample and not because of an instrument noise, wherein the checking is based upon a first deep learning method trained on a first pre-identified dataset; d. mark a peak boundary of the peak based on a second deep learning method configured to identify the ends of a curve associated with the said peak; and e. identify a centroid of a peak as a pair of a representative mass by charge value and a representative abundance value for the identified peak, wherein the representative mass by charge value is calculated as the average of the mass by charge values of the peak, within the identified boundary, weighted by the corresponding intensity values and representative abundance value for the identified peak is calculated as the area under the curve of the said peak, within the identified peak boundary.

20. A system for centroiding of a peak in a mass scan, of a sample, comprising: a. a region of interest (ROI) identification module, configured to identify a region of interest (ROI) as long ROI, normal ROI or small ROI, wherein an ROI of size larger than a target size is a long ROI, an ROI of size smaller than the target size is a small ROI and an ROI of size equal to the target size is a normal ROI; b. an ROI fragmentation module configured to break each of the long ROI into one or more normal ROI and/or one or more small ROIs; c. a first deep learning module, configured to check if said ROI contains at least one true peak, as that caused due to the presence of a substance in the sample and not because of an instrument noise, wherein the checking is based upon a first deep learning method trained on a first pre-identified dataset; d. a second deep learning module, configured to mark a peak boundary of the peak based on a second deep learning method, trained on a first pre-identified dataset in the database, configured to identify the ends of a curve associated with the said peak; and e. a centroid identifying module, configured to identify a centroid of a peak as a pair of a representative mass by charge value and a representative abundance value for the identified peak, wherein the representative mass by charge value is calculated as the average of the mass by charge values of the peak, within the identified boundary, weighted by the corresponding intensity values and representative abundance value for the identified peak is calculated as the area under the curve of the said peak, within the identified peak boundary.

Description:
CENTROIDING OF MASS SCAN DATA OBTAINED FROM HIGH-RESOLUTION MASS SPECTROMETRY (HR-MS) INSTRUMENTS

[0001] The present subject matter, in general, relates to mass spectrometry and more particularly to high-resolution mass spectrometry (HR-MS) instruments.

BACKGROUND OF THE INVENTION

[0002] Mass spectrometry (MS) is an analytical technique that is used to measure the mass-to-charge ratio of ions. The results are presented as a mass spectrum, a plot of intensity as a function of the mass-to-charge ratio. Mass spectrometry is used in many different fields and is applied to pure samples as well as complex mixtures. In a typical MS procedure, a sample, which may be solid, liquid, or gaseous, is ionized, for example by bombarding it with a beam of electrons. This may cause some of the sample's molecules to break up into charged fragments or simply become charged without fragmenting. These are commonly referred to as fragment ions and precursor ions, respectively and collectively as the fragmentation pattern of the molecule.

[0003] These ions are then separated according to their mass-to-charge ratio, for example by accelerating them and subjecting them to an electric or magnetic field: ions of the same mass-to-charge ratio will undergo the same amount of deflection. Results are displayed as spectra of the signal intensity of detected ions as a function of the mass-to-charge ratio (denoted as m/z). The molecules in the sample can be identified by matching the m/z values with those of the known molecules or through a characteristic fragmentation pattern.

[0004] Mass spectrometry has become one of the methods of choice in the analytical field due to its high sensitivity and selectivity to retrieve structural information allowing the unequivocal identification of compounds.

[0005] Mass spectrometers are designed to acquire mass scans that can be viewed as two-dimensional charts of intensity versus the mass-to-charge ratio (denoted as m/z) of the ions generated by an ion source. When connected to a liquid chromatography (LC), gas chromatography (GC), capillary electrophoresis (CE) or ion mobility separation (IMS) device, the mass scans can be collected continually as a function of chromatographic retention time or a measure of ion mobility, thereby making the data three-dimensional. Additionally, attaching two of these devices to an HR-MS instrument has the potential to generate four-dimensional data.

[0006] A Canadian patent CA2465297C proposes a method of identifying molecules of biological origin. The molecules are identified on the basis of the accurately determined mass to charge ratio of the molecules and an additional physico-chemical property such as elution time or charge state. Further physico-chemical properties may be used. The experimentally determined accurate mass and physico-chemical properties can then be compared with a look-up table of information. The look-up table may be generated or physico-chemical properties of data in a conventional database may be calculated. The ability to recognise and preferably identify the same molecules in two different samples may be used to determine whether a particular biological molecule has been expressed differently in an experimental sample relative to a control sample. Centroiding is used to determine the Intensity Value for all isotopes of all charge-states of any ion exceeding the minimum threshold for ion detection.

[0007] Mass scan data generated by HR-MS instruments is sparse. Regions of non-zero intensity values contain noise apart from true signals corresponding to the ionic species generated in the ion source. Intensities in the regions of true signals often show a Gaussian-like distribution around the expected value of m/z of the ion. Identifying a single representative m/z and its corresponding intensity is essential to reduce the signal distribution to a single data point for the ion in a spectrum. This process is commonly referred to as centroiding. Centroiding is an important step for standalone HR-MS instruments as well as those that are connected to GC, LC, CE, or IMS devices or their miniaturized versions such as the nano-LC or nano-CE devices.

[0008] Figure 1 illustrates graphs to assist in understanding the exemplary contemporary centroiding method/s. Representative examples herein illustrate how the current centroiding algorithms fail in delineating true signal from noise in mass scan data obtained from a high resolution mass spectrometer (HR-MS). A-F: each panel represents a small region of the mass scan data, arbitrarily chosen from real data and appropriately zoomed along the x and y-axes. The red dots indicate the centroids detected by a currently available centroiding algorithm, which reports local maxima as the centroids. However, only one true peak is present in each of the panels A, B, and D; these peaks are denoted by labels 101, 102 and 103, respectively. Two peaks, possibly corresponding to two distinct ionic species, are present in panel E (104 and 105). Panels C and F do not have any true peak. The x-axis in the charts represents the mass-to-charge ratio denoted by m/z while the y-axis represents intensity in arbitrary units (a.u.). In certain examples, the y-axis may represent ion count (a.u.).

[0009] Figure 1 also shows that the entire range of m/z is littered with centroids obtained from current methods without regard to whether they belong to a true signal or noise. Noise removal is important as the noise signals may provide misleading results. Any attempts to remove noise by applying ad hoc filters such as stringent noise thresholding may result in data loss that cannot be recovered in the subsequent steps. Thus, the noise removal is often left to the next steps of the data analysis pipeline leading to increased computational costs and human effort.

[00010] A US patent US9043164B2 proposes a method for generating a mass spectrum, e.g. for Fourier transform mass spectrometry, having improved resolving power. The method includes steps of acquiring a plurality of mass spectra from a mass spectrometer using image current detection; determining the centroids of at least some of the peaks which have a sufficient signal-to-noise (S/N) ratio so that the variation of the centroid of each such peak from the plurality of mass spectra is significantly lower than the full-width at half-maximum, dM, of the peak in the m/z domain; and generating a histogram of the centroids determined from the plurality of acquired mass spectra, thereby forming a composite mass spectrum. The resultant composite mass spectrum comprises peaks having full-width at half-maximum, dMC, significantly narrower than the peak width, dM, of the corresponding peaks in the plurality of acquired mass spectra.

[00011] However, the currently available centroiding methods are not adequate. Also, the traditional machine learning methods are usually linear thus leading to both data loss and noise propagation in the next steps of data processing.

[00012] There is therefore a need for accurate centroiding of the mass spectrometry data to meet the measurement objectives of experiments involving high-resolution mass spectrometers. Further, there is a need for noise removal at this stage of data processing to minimize the propagation of noise to the next steps that may increase the time needed for the data analysis.

SUMMARY OF THE INVENTION

[00013] Accordingly, the present invention provides a method for centroiding of a peak in a mass scan, of a sample, the method comprising the steps of identifying a region of interest(ROI) as long ROI, normal ROI or small ROI, wherein, an ROI of size larger than a target size is a long ROI, an ROI of size smaller than the target size is a small ROI and an ROI of size equal to the target size is a normal ROI; breaking each of the long ROI into one or more normal ROI and/or one or more small ROIs; checking if said ROI contains at least one true peak, as that caused due to the presence of a substance in the sample and not because of an instrument noise, wherein the checking is based upon a first deep learning method trained on a first pre-identified dataset; marking a peak boundary of the peak based on a second deep learning method configured to identify the ends of a curve associated with the said peak; and identifying a centroid of a peak as a pair of a representative mass by charge value and a representative abundance value for the identified peak, wherein the representative mass by charge value is calculated as the average of the mass by charge values of the peak, within the identified boundary, weighted by the corresponding intensity values and representative abundance value for the identified peak is calculated as the area under the curve of the said peak, within the identified peak boundary.
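The centroid calculation in the final step above can be illustrated with a minimal sketch. It assumes the peak boundary is already known as a pair of inclusive indices and that the area under the curve is approximated by the trapezoidal rule; the function name `centroid` and the integration rule are illustrative choices, not details taken from the specification.

```python
from typing import Sequence, Tuple


def centroid(mz: Sequence[float], intensity: Sequence[float],
             start: int, end: int) -> Tuple[float, float]:
    """Return (intensity-weighted mean m/z, area under the curve)
    for the points start..end (inclusive) of one peak."""
    xs = list(mz[start:end + 1])
    ys = list(intensity[start:end + 1])
    total = sum(ys)
    if total <= 0:
        raise ValueError("peak has no positive intensity")
    # Representative m/z: average of m/z values weighted by intensity.
    weighted_mz = sum(x * y for x, y in zip(xs, ys)) / total
    # Representative abundance: trapezoidal area (an assumed integration rule).
    area = sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
               for i in range(len(xs) - 1))
    return weighted_mz, area


if __name__ == "__main__":
    mz_axis = [100.0, 100.1, 100.2, 100.3, 100.4]
    signal = [0.0, 1.0, 2.0, 1.0, 0.0]
    print(centroid(mz_axis, signal, 0, 4))
```

For a symmetric peak such as the one in the usage lines, the weighted m/z coincides with the apex position, which is a quick sanity check on the weighting.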

[00014] In an embodiment, the present invention provides a method wherein a region of interest is a range of mass by charge in the mass scan, such that the intensity values corresponding to those mass by charge values are either continuously non-zero with zero values, if any, being isolated by being preceded and succeeded by a non-zero value and the ROI having an ROI boundary of a predefined consecutive zero intensity values.
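One possible reading of this ROI definition is sketched below. It assumes the mass scan is given as a list of intensities ordered by m/z, and that a run of at least `min_zero_run` consecutive zeros closes an ROI while shorter runs (isolated zeros, for the default of 2) stay inside it. The function name `build_rois` and the parameter name are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def build_rois(intensity: Sequence[float],
               min_zero_run: int = 2) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of regions of interest."""
    rois: List[Tuple[int, int]] = []
    start: Optional[int] = None   # first non-zero index of the open ROI
    last_nonzero = 0              # last non-zero index seen in the open ROI
    zero_run = 0                  # length of the current run of zeros
    for i, value in enumerate(intensity):
        if value > 0:
            if start is None:
                start = i
            last_nonzero = i
            zero_run = 0
        elif start is not None:
            zero_run += 1
            if zero_run >= min_zero_run:
                # Enough consecutive zeros: this is an ROI boundary.
                rois.append((start, last_nonzero))
                start = None
    if start is not None:
        rois.append((start, last_nonzero))
    return rois


if __name__ == "__main__":
    scan = [0, 0, 3, 5, 0, 4, 2, 0, 0, 0, 1, 6, 1, 0]
    print(build_rois(scan))
```

In the usage lines, the single zero at index 4 is kept inside the first ROI, while the run of three zeros separates the two ROIs.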

[00015] In yet another embodiment, the present invention provides a method wherein the first deep learning method comprises the steps of fetching a set of training mass spectrometry data files as the first pre-identified dataset of first pre-identified data files, focusing on metabolomics, from various vendors available; identifying a training region of interest (training ROI) within each of the first pre-identified data files; randomly extracting a predefined number of ROIs, being from a spread out range of mass/charge values in said first pre-identified data file; repeating steps 3.a to 3.c above for each of the first pre-identified data files chosen in step 3.a; and identifying the ROIs as containing at least one true peak, as such, based on identification by a user, otherwise labeling said ROI as noise ROI.

[00016] In another embodiment, the present invention provides a method wherein the first deep learning method comprises the step of fetching the training mass spectrometry data file focusing on metabolomics from in-house and open sources. In another embodiment, the first pre-identified set of data files is of molecules with mass by charge less than 2000 Da. Further, the first pre-identified set of data files is acquired from instruments of more than one vendor of mass spectrometry instruments. The method further comprises the step of repeating steps 3.a to 3.e till nearly an equal number of examples for both peak and noise labeled ROI are identified.

[00017] In still another embodiment, the present invention provides a method wherein the second deep learning method comprises the steps of fetching the ROIs containing at least one true peak, and marking the start and end boundary of all the peaks within the confines of the ROI by a user.

[00018] In one embodiment, a target size of the ROI, n, is calculated based on (i) n that is configured to be large enough so that true peaks are rarely longer than n in real life mass scan data generated by mass scan instruments, and (ii) n is small enough to minimize the complexity of the first and the second deep learning method. Further, in one embodiment, the ROIs that are of size k (k being less than n) are forced to be of a target size either by zero padding or by interpolating and resampling.
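The two ways of forcing a short ROI of size k up to the target size n can be sketched as follows. The sketch assumes trailing zeros for the padding variant and linear interpolation on an evenly spaced index grid for the resampling variant; neither detail is stated in the text, and the function names are hypothetical.

```python
from typing import List, Sequence


def zero_pad(values: Sequence[float], n: int) -> List[float]:
    """Pad a short ROI with trailing zeros up to length n (placement assumed)."""
    if len(values) > n:
        raise ValueError("ROI is longer than the target size")
    return list(values) + [0.0] * (n - len(values))


def resample_linear(values: Sequence[float], n: int) -> List[float]:
    """Stretch a short ROI to n points by linear interpolation."""
    k = len(values)
    if k < 2 or n < 2:
        raise ValueError("need at least two input and two output points")
    out: List[float] = []
    for j in range(n):
        pos = j * (k - 1) / (n - 1)   # fractional position in the original ROI
        lo = min(int(pos), k - 2)     # left neighbour, clamped at the last segment
        frac = pos - lo
        out.append(values[lo] * (1.0 - frac) + values[lo + 1] * frac)
    return out


if __name__ == "__main__":
    print(zero_pad([1.0, 2.0, 3.0], 5))
    print(resample_linear([0.0, 2.0, 0.0], 5))
```

Zero padding leaves the recorded intensities untouched, whereas resampling changes the apparent point spacing, so the m/z axis would need to be resampled the same way before any centroid is computed from the stretched ROI.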

[00019] In an embodiment, the present invention provides the method wherein ROIs that are larger than a target size are broken down based on the steps of: (i) thresholding the ROI by setting intensities below a set threshold as 'zero', hence breaking the ROIs; (ii) checking if the ROI length is still more than a predefined ROI size or a target size, and if so proceeding to the next step; (iii) using a 'Sliding and Overlapping Window' approach by sliding a window of a predefined length over a long ROI and storing such windowed regions as ROIs for further analysis.

[00020] In another embodiment, the present invention provides a method wherein identifying said set threshold for breaking of the ROIs comprises the steps of listing all local maxima in a given mass spectrometry file by assuming every local maximum to be a peak; dividing the intensity range into bins of a fixed size and segregating the local maxima based on their intensity; identifying a set of bins, particularly at low intensities, where the number of peaks in those bins is significantly more than in other bins; and marking the identified bins as the true baseline noise level and using the same as a reference of the threshold for noise elimination.
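The threshold estimation, thresholding and windowing steps above might be sketched as below. Several simplifications are assumed: the "significantly more populated" bins are reduced to the single most populated bin (ties going to the lower-intensity bin), the threshold is taken as that bin's upper edge, and the last window is shifted back so that it ends at the ROI end. All function names, the bin width and the step size are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def local_maxima(intensity: Sequence[float]) -> List[float]:
    """Intensities of interior points that are higher than their left neighbour
    and not lower than their right neighbour."""
    return [intensity[i] for i in range(1, len(intensity) - 1)
            if intensity[i - 1] < intensity[i] >= intensity[i + 1]]


def noise_threshold(intensity: Sequence[float], bin_width: float) -> float:
    """Upper edge of the most populated intensity bin of the local maxima."""
    peaks = local_maxima(intensity)
    if not peaks:
        return 0.0
    counts = Counter(int(p // bin_width) for p in peaks)
    busiest = max(counts, key=lambda b: (counts[b], -b))
    return (busiest + 1) * bin_width


def apply_threshold(intensity: Sequence[float], threshold: float) -> List[float]:
    """Set every intensity below the threshold to zero."""
    return [v if v >= threshold else 0.0 for v in intensity]


def sliding_windows(length: int, window: int, step: int) -> List[Tuple[int, int]]:
    """Half-open (start, stop) index ranges of overlapping windows over a long ROI."""
    if length <= window:
        return [(0, length)]
    starts = list(range(0, length - window + 1, step))
    if starts[-1] + window < length:
        starts.append(length - window)   # make sure the tail is covered
    return [(s, s + window) for s in starts]


if __name__ == "__main__":
    scan = [0, 2, 1, 3, 1, 2, 0, 50, 1, 2, 1]
    t = noise_threshold(scan, bin_width=5)
    print(t, apply_threshold(scan, t))
    print(sliding_windows(11, window=4, step=3))
```

Choosing a step smaller than the window length is what makes the windows overlap, so a peak cut by one window edge is still seen whole in a neighbouring window.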

[00021] In still another embodiment, the present invention provides the method further comprising the steps of identifying an unidentified true peak by fetching the ROIs that contain at least one peak; zeroing out the true peak of the ROI already identified; and, repeating the step of identifying the presence of at least one true peak in said ROI.
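The iterative search for further peaks can be sketched as a simple loop. The trained deep learning methods are not reproduced here: `has_peak` and `find_boundary` are placeholder callables, and the toy stand-ins `toy_has_peak` and `toy_boundary` exist only so that the loop can be run. The `max_rounds` safeguard is likewise an assumption.

```python
from typing import Callable, List, Sequence, Tuple


def find_all_peaks(roi: Sequence[float],
                   has_peak: Callable[[Sequence[float]], bool],
                   find_boundary: Callable[[Sequence[float]], Tuple[int, int]],
                   max_rounds: int = 10) -> List[Tuple[int, int]]:
    """Repeatedly classify the ROI, record a peak boundary, and zero that peak out."""
    work = list(roi)
    found: List[Tuple[int, int]] = []
    for _ in range(max_rounds):
        if not has_peak(work):
            break
        start, end = find_boundary(work)
        found.append((start, end))
        for i in range(start, end + 1):
            work[i] = 0.0   # remove the peak already identified
    return found


def toy_has_peak(values: Sequence[float]) -> bool:
    """Stand-in for the first model: 'peak' if any point reaches 5."""
    return max(values) >= 5.0


def toy_boundary(values: Sequence[float]) -> Tuple[int, int]:
    """Stand-in for the second model: walk downhill from the apex on both sides."""
    apex = max(range(len(values)), key=lambda i: values[i])
    left = apex
    while left > 0 and 0 < values[left - 1] <= values[left]:
        left -= 1
    right = apex
    while right < len(values) - 1 and 0 < values[right + 1] <= values[right]:
        right += 1
    return left, right


if __name__ == "__main__":
    roi = [1, 4, 9, 4, 1, 3, 7, 3, 1]
    print(find_all_peaks(roi, toy_has_peak, toy_boundary))
```

Passing the two models in as callables keeps the loop independent of how classification and boundary detection are actually implemented.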

[00022] In an embodiment, the present invention provides a method wherein the output of the first deep learning method is binary or multivalent, with the output variable having options comprising at least presence of 1) all Noise, 2) at least one Peak and 3) Borderline cases, being ROIs which have one or more Gaussian characteristics but a low signal to noise ratio. Further, in an embodiment, the present invention provides a method comprising, prior to identifying the presence of at least one true peak in said ROI, normalizing the intensity values in the ROI by dividing all the intensity values in the ROI by the maximum intensity value in that ROI, ensuring all intensity values in the ROI lie between a value zero and a value one.

[00023] In yet another embodiment, the present invention provides a method wherein, prior to identifying if at least one true peak is present in said ROI, the identified ROI is preprocessed using the steps of smoothening using a moving average and producing a smoothened intensity vector; taking a derivative of the smoothened intensity vector, producing a derivative vector; and concatenating said smoothened intensity vector and said derivative vector to together form an input feature vector for the deep learning method. The method further comprises the steps of obtaining multiple centroids for each ROI, by obtaining a centroid for each peak in the event of presence of multiple peaks within the ROI.
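The normalization and preprocessing steps of the two preceding paragraphs can be combined into a short sketch. The order (normalize, then smooth), the window length of 3, the shrinking window at the edges and the first-difference derivative padded with a leading zero are all assumptions made for illustration; the text does not fix these details.

```python
from typing import List, Sequence


def normalize(values: Sequence[float]) -> List[float]:
    """Divide by the maximum so that all intensities lie between 0 and 1."""
    peak = max(values)
    if peak <= 0:
        return [0.0] * len(values)
    return [v / peak for v in values]


def moving_average(values: Sequence[float], window: int = 3) -> List[float]:
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    out: List[float] = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def derivative(values: Sequence[float]) -> List[float]:
    """First difference, with a leading 0.0 so the length is preserved."""
    return [0.0] + [values[i] - values[i - 1] for i in range(1, len(values))]


def feature_vector(intensity: Sequence[float], window: int = 3) -> List[float]:
    """Concatenate the smoothed intensities with their derivative."""
    smooth = moving_average(normalize(intensity), window)
    return smooth + derivative(smooth)


if __name__ == "__main__":
    print(feature_vector([0.0, 2.0, 4.0, 2.0, 0.0]))
```

The resulting vector is twice the ROI length, so a model consuming it would take 2n inputs for a target ROI size of n.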

[00024] In an embodiment, the present invention provides an article of manufacture including a non-transitory computer readable storage medium to tangibly store instructions, which when executed by a computer, cause the computer to identify a region of interest(ROI) as long ROI, normal ROI or small ROI, wherein, an ROI of size larger than a target size is a long ROI, an ROI of size smaller than the target size is a small ROI and an ROI of size equal to the target size is a normal ROI; break each of the long ROI into one or more normal ROI and/or one or more small ROIs; check if said ROI contains at least one true peak, as that caused due to the presence of a substance in the sample and not because of an instrument noise, wherein the checking is based upon a first deep learning method trained on a first pre-identified dataset; mark a peak boundary of the peak based on a second deep learning method configured to identify the ends of a curve associated with the said peak; and identify a centroid of a peak as a pair of a representative mass by charge value and a representative abundance value for the identified peak, wherein the representative mass by charge value is calculated as the average of the mass by charge values of the peak, within the identified boundary, weighted by the corresponding intensity values and representative abundance value for the identified peak is calculated as the area under the curve of the said peak, within the identified peak boundary.

[00025] In further another embodiment, the present invention provides a computer implemented method for centroiding of a peak in a mass scan, of a sample, the method further comprising instructions which when executed by the computer cause the computer to identify a region of interest(ROI) as long ROI, normal ROI or small ROI, wherein, an ROI of size larger than a target size is a long ROI, an ROI of size smaller than the target size is a small ROI and an ROI of size equal to the target size is a normal ROI; break each of the long ROI into one or more normal ROI and/or one or more small ROIs; check if said ROI contains at least one true peak, as that caused due to the presence of a substance in the sample and not because of an instrument noise, wherein the checking is based upon a first deep learning method trained on a first pre-identified dataset; mark a peak boundary of the peak based on a second deep learning method configured to identify the ends of a curve associated with the said peak; and identify a centroid of a peak as a pair of a representative mass by charge value and a representative abundance value for the identified peak, wherein the representative mass by charge value is calculated as the average of the mass by charge values of the peak, within the identified boundary, weighted by the corresponding intensity values and representative abundance value for the identified peak is calculated as the area under the curve of the said peak, within the identified peak boundary.

[00026] In another embodiment, the present invention provides a system for centroiding of a peak in a mass scan, of a sample, comprising a region of interest(ROI) identification module, configured to identify a region of interest(ROI) as long ROI, normal ROI or small ROI, wherein, an ROI of size larger than a target size is a long ROI, an ROI of size smaller than a target size is a small ROI and an ROI of size equal to the target size is a normal ROI; an ROI fragmentation module configured to break each of the long ROI into one or more normal ROI and/or one or more small ROIs; a first deep learning module, configured to check if said ROI contains at least one true peak, as that caused due to the presence of a substance in the sample and not because of an instrument noise, wherein the checking is based upon a first deep learning method trained on a first pre-identified dataset; a second deep learning module, configured to mark a peak boundary of the peak based on a second deep learning method, trained on a first pre-identified dataset in the database, configured to identify the ends of a curve associated with the said peak; and a centroid identifying module, configured to identify a centroid of a peak as a pair of a representative mass by charge value and a representative abundance value for the identified peak, wherein the representative mass by charge value is calculated as the average of the mass by charge values of the peak, within the identified boundary, weighted by the corresponding intensity values and representative abundance value for the identified peak is calculated as the area under the curve of the said peak, within the identified peak boundary.

BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS

[00027] A clear understanding of the key features of the invention summarized above may be had by reference to the appended drawings and their respective captions, which illustrate the deep learning-based centroiding of mass scan data collected from high-resolution mass spectrometry (HR-MS) instruments. It will be understood that such drawings depict preferred embodiments herein and, therefore, are not to be considered as limiting the scope of the invention with regard to other embodiments that the invention may contemplate. Accordingly:

[00028] Figure 1 illustrates graphs to assist in understanding the exemplary contemporary centroiding method/s.

[00029] Figure 2 illustrates graphs to assist in understanding the working of the deep learning methods as per an embodiment herein.

[00030] Figure 3 depicts a flow chart of a method for centroiding mass scan data obtained from high-resolution mass spectrometry (HR-MS) instruments, using deep learning as per an embodiment herein.

[00031] Figure 4 illustrates examples of regions of interest (ROIs) built from an arbitrarily chosen real life HR-MS data, as per an embodiment herein.

[00032] Figure 5 illustrates an example of a long ROI from an arbitrarily chosen, real life HR-MS data, as per an embodiment herein.

[00033] Figure 6 illustrates graphs assisting in understanding of the method of classifying the ROIs into peak and noise, using deep learning methods as per an embodiment herein.

[00034] Figure 7 illustrates graphs assisting in understanding of the method of identification of peak boundaries in ROIs by using a deep learning-based method, as per an embodiment herein.

[00035] Figure 8 illustrates graphs assisting in understanding of the method of the deep learning-based centroiding of arbitrarily chosen, real life mass scan data obtained from an HR-MS instrument.

[00036] Figure 9 shows a system for centroiding a peak in a mass scan of a sample, as per an embodiment, herein.

DETAILED DESCRIPTION OF THE INVENTION

[00037] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well as the singular forms, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof.

[00038] Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one having ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein. Further, references to deep learning may be understood to refer to machine learning as well.

[00039] In describing the invention, it will be understood that a number of techniques and steps are disclosed. Each of these has individual benefits and each can also be used in conjunction with one or more, or in some cases all, of the other disclosed techniques. Accordingly, for the sake of clarity, this description will refrain from repeating every possible combination of the individual steps in an unnecessary fashion.

[00040] Various embodiments herein disclose machine learning-based centroiding of mass scan data collected using high-resolution mass spectrometry (HR-MS) instruments. References to deep learning-based centroiding of such mass scan data are to be understood as a person skilled in the art would understand them, or otherwise as described herein.

[00041] The centroiding of the mass scan data identified using deep learning methods after thresholding and building regions of interests is described. The input information/data may be obtained from a high-resolution, mass spectrometry (HR-MS) instrument operated in standalone mode or when connected to other devices including but not limited to gas or liquid chromatography, capillary electrophoresis or ion mobility separation devices.

[00042] The embodiments herein also explain the use of certain deep learning techniques for centroiding mass scan data and specially designed feature vectors as input to the method to classify peak and noise in high-resolution mass spectrometers (HR-MS). The deep learning technique uses a feature vector (X) as an input and predicts the class (Y). The feature vector of an ROI can be the intensity vector, the normalized intensity vector (values normalized between 0 and 1), a suitably smoothened, normalized intensity vector or a derivative vector of one of the above vectors, typically obtained by a finite difference method. A person skilled in the art may realize the use of alternative feature vectors with corresponding changes to suit the working of the embodiments herein. The input X can also be a tuple based on a suitable combination of the feature vectors described above. X can have the size of n x 1, n x 2 or n x 3 depending on whether the input comprises one, two or three feature vectors, respectively, from the ones described above. The output Y is a class label such as peak or noise.
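As an illustration of the n x 2 case described above, the following is a minimal sketch that stacks a normalized intensity vector and its finite-difference derivative into the input X. The function names, the forward-difference scheme and the trailing zero used to keep both vectors at length n are assumptions made for this example; the specification does not prescribe them.

```python
def normalize(intensity):
    """Scale intensities to the range [0, 1] by dividing by the ROI maximum."""
    peak = max(intensity)
    if peak <= 0:
        return [0.0 for _ in intensity]
    return [v / peak for v in intensity]


def finite_difference(values):
    """Forward difference; a trailing 0.0 keeps the length equal to n (assumption)."""
    diffs = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    return diffs + [0.0]


def build_feature_matrix(intensity, use_derivative=True):
    """Return X as n rows, each row holding one value per feature vector."""
    norm = normalize(intensity)
    columns = [norm]
    if use_derivative:
        columns.append(finite_difference(norm))
    return [list(row) for row in zip(*columns)]


# Hypothetical ROI intensities, for illustration only.
roi_intensity = [0.0, 50.0, 200.0, 400.0, 200.0, 50.0, 0.0]
X = build_feature_matrix(roi_intensity)
print(len(X), len(X[0]))  # rows n and number of feature vectors
```

Passing use_derivative=False gives the n x 1 case; a third column, for example a smoothened vector, could be appended in the same way for n x 3.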

[00043] The present invention is also related to the detection of peaks and their boundaries in the mass scans and the removal of noise by using deep learning techniques. The method applies a statistically driven threshold and sets values below the threshold to zero, breaks the long vectors of intensity into smaller vectors or regions of interest (ROI), extracts features of the ROIs, submits them to a specially trained deep learning method, predicts peaks and their boundaries, and calculates the median, mean or weighted average m/z and the area under the curve for each peak. This also eliminates the need to specify the mass extraction window (MEW).

[00044] The deep learning tool is trained to perform the task of detecting peaks and their boundaries. The peaks are of variable heights and widths. Further, the peaks are very narrow compared to the whole m/z range of the mass scans. The present invention breaks down the mass scans into smaller regions of interest (ROI), thereby zooming in on the ROI along the m/z axis. Further, the intensity is normalized to the range of 0 to 1, thereby zooming in along the y-axis. A deep learning method classifies the regions of interest (ROIs) into peak and noise. A separate deep learning method then marks the peak boundaries, allowing the calculation of (i) the centroid as the median, mean or weighted average m/z of the peak, (ii) the peak area as the area under the curve between the peak boundaries, in terms of intensity*mDa, and (iii) the FWHM (full width at half maximum).
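The specification does not state how the FWHM is computed. The following is one possible sketch, assuming a single peak between the marked boundaries and linear interpolation between neighbouring data points at the two half-maximum crossings. The function name and the sample values are hypothetical.

```python
def fwhm(mz, intensity):
    """Full width at half maximum of a single peak, by linear interpolation."""
    apex = max(range(len(intensity)), key=lambda i: intensity[i])
    half = intensity[apex] / 2.0

    def crossing(i_below, i_above):
        # Interpolate between a point below half maximum and one at or above it.
        x0, y0 = mz[i_below], intensity[i_below]
        x1, y1 = mz[i_above], intensity[i_above]
        return x0 + (half - y0) * (x1 - x0) / (y1 - y0)

    # Walk outwards from the apex while points stay at or above half maximum.
    left = apex
    while left > 0 and intensity[left - 1] >= half:
        left -= 1
    right = apex
    while right < len(intensity) - 1 and intensity[right + 1] >= half:
        right += 1

    # If the peak never drops below half maximum, fall back to the outermost point.
    left_mz = crossing(left - 1, left) if left > 0 else mz[left]
    right_mz = crossing(right + 1, right) if right < len(intensity) - 1 else mz[right]
    return right_mz - left_mz


# Hypothetical peak sampled at unit spacing, for illustration only.
width = fwhm([0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 40.0, 100.0, 40.0, 0.0])
print(width)
```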

[00045] The deep learning methods use the normalized intensity vector and a vector of first derivatives of the intensity vector of an ROI as inputs and provide classification of the ROI as peak or noise and further provide peak boundaries if classified as a peak. In one embodiment, the method may be primarily based on convolutional neural networks (CNN). A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning method which can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image and be able to differentiate one from the other.

[00046] The Deep learning-based method is trained by using real examples of ROIs that are labeled by human experts. Labeled data is not available in the public domain and needs to be generated specially by using real life HR-MS data. The ROIs may be generated from the real life, liquid chromatography mass spectrometry (LCMS) data available in public domain that has been generated by different LCMS instruments.

[00047] The centroid m/z values and peak areas from all the mass scans are then processed further to build peaks in the chromatographic retention time (RT) dimension. These are equivalent to extracted ion chromatograms (EIC) obtained by the available methods. The presence of noise in the centroided data results in noisy peaks in the retention time dimension.

[00048] Figure 2 illustrates graphs to assist in understanding the working of the deep learning methods as per an embodiment herein. A mass scan from arbitrarily chosen real life HR-MS data is shown as a chart of intensity (a.u.) vs. mass to charge ratio (m/z) [201], which may have a plurality of true peaks and noise. A small region of the mass scan has been zoomed in along the x- and y-axes [202], and yet smaller regions from it have been further zoomed in [203 and 204] until the shapes of individual peaks are identifiable and a distinction can be made between peaks, which have a Gaussian-like shape, and noise. This helps in detecting true peaks from mass scans. The x-axis in the charts represents the mass-to-charge ratio denoted by m/z while the y-axis represents intensity in arbitrary units (a.u.). In certain examples, the y-axis may represent ion count (a.u.). The present invention helps in detecting true peaks from the mass scan data in an automated manner.

[00049] Figure 3 depicts the flow chart of deep learning-based centroiding of mass scan data obtained from high-resolution mass spectrometers (HR-MS), as per an embodiment herein. In an embodiment, the raw data or analytical data that is received from the HR-MS instrument may be converted into an open-source format such as the mzML format by using “MS Convert” or other suitable applications (301). A person skilled in the art may use input from gas/liquid chromatography, capillary electrophoresis or ion mobility separation devices as well. Then the first mass scan is selected for processing (302) and regions of interest (ROI) are built (303). In one embodiment, a region of interest may be identified as a range of mass by charge in the mass scan such that the corresponding intensity values are continuously non-zero, with any zero values being isolated, that is, preceded and succeeded by a non-zero value, and with the ROI boundary being a predefined number of consecutive zero intensity values. In one exemplary embodiment, this predefined number of consecutive zeros may be two. Alternatively, a larger number of consecutive zeros may be set as the criterion in order to be more inclusive. This is further enumerated in reference to figure 4 and related description.
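The ROI-building rule described above (step 303) may be sketched as follows. This is an illustrative reading only: the function name, the (start, end) index representation with an exclusive end, and the sample scan are assumptions, and an actual embodiment may differ.

```python
def build_rois(intensity, boundary_zeros=2):
    """Return (start, end) index pairs, end exclusive, one pair per ROI.

    A run of at least `boundary_zeros` consecutive zero intensities closes the
    open ROI; shorter (isolated) zero runs remain inside it.
    """
    rois = []
    start = None         # index of the first non-zero point of the open ROI
    last_nonzero = None  # index of the latest non-zero point seen
    zero_run = 0
    for i, value in enumerate(intensity):
        if value > 0:
            if start is None:
                start = i
            last_nonzero = i
            zero_run = 0
        elif start is not None:
            zero_run += 1
            if zero_run >= boundary_zeros:
                rois.append((start, last_nonzero + 1))
                start = None
    if start is not None:
        rois.append((start, last_nonzero + 1))
    return rois


# Hypothetical intensity vector of one mass scan, for illustration only.
scan = [0, 0, 5, 9, 0, 7, 3, 0, 0, 0, 4, 0, 0, 2, 6, 1]
rois = build_rois(scan)
print(rois)
```

Raising boundary_zeros makes ROI building more inclusive, since longer zero runs are then tolerated inside a single ROI.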

[00050] In one embodiment, as an optional step, a threshold is applied to the signal in order to break the ROIs into yet smaller ROIs at regions of low intensity (304). Characteristics of a long ROI are further enumerated in reference to figure 5 and related description. The ROIs that are larger than a target size may be broken down by thresholding the ROI, that is, setting intensities below a set threshold to zero, which breaks the ROI at those points. If the ROI length is still more than the predefined ROI size or target size, the method proceeds to the next step. In one embodiment, the predefined ROI size could be 128 units. The next step uses a 'Sliding and Overlapping Window' approach: a window of a predefined length is slid over the long ROI and the windowed regions are stored as ROIs for further analysis. In one embodiment, ROIs that are of size k (k being less than the target size n) may be forced to the target size either by zero padding or by interpolating and resampling. In one embodiment, the target size n may be chosen on the principle that n is large enough that true peaks are rarely longer than n in real life mass scan data generated by mass scan instruments, and small enough to minimize the complexity of the first and the second deep learning methods.
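A minimal sketch of the 'Sliding and Overlapping Window' step is given below. The target size of 128 comes from the embodiment above. The 50% overlap and the extra final window that covers the tail of the ROI are assumptions made for this example, since the specification does not fix them.

```python
def sliding_windows(roi, target_size=128, overlap=0.5):
    """Split a long ROI into overlapping windows of length `target_size`."""
    if len(roi) <= target_size:
        return [list(roi)]
    # Step between window starts; 50% overlap by default (assumption).
    step = max(1, int(target_size * (1.0 - overlap)))
    starts = list(range(0, len(roi) - target_size + 1, step))
    # Add a final window so that the tail of the ROI is always covered.
    if starts[-1] != len(roi) - target_size:
        starts.append(len(roi) - target_size)
    return [list(roi[s:s + target_size]) for s in starts]


# Hypothetical long ROI of 300 points, for illustration only.
long_roi = list(range(300))
windows = sliding_windows(long_roi)
print(len(windows), [w[0] for w in windows])
```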

[00051] As an exemplary reference, the complexity of a model may be defined by the number of parameters of the model. In the case of a CNN, the total number of parameters may be proportional to the depth of the model. The depth of the model may be defined as the number of consecutive layers present in the model. An exemplary architecture of a CNN-based model is as follows: an input layer, followed by a couple of CNN+Pooling layers, followed by flattening of the feature vector, followed by a Fully Connected (FC) layer. For a model with a large input vector length, multiple CNN+Pooling layers are required before the feature vector can be flattened and passed on to the FC layer. This means more intermediate layers have to be used, which in turn increases the depth of the model. In a CNN, the number of parameters is dictated by the shape of the filters (also known as kernels). If the filter has width w, height h and number of channels of the previous layer d, and there are k such filters for a given layer, then the number of parameters for that layer would be ((w*h*d)+1)*k. Hence, the complexity (number of parameters) of the model increases with the depth. The input vector length ‘n’ of the model may therefore be chosen such that it is enough to cover at least a single typical real life peak and not much more than that. Doing so keeps the model complexity in check without sacrificing usability. The complexity of a model dictates the computational power that is required to use the model and the time it takes to generate its output. A machine with high computational power can run the model faster but will cost more, whereas a machine with low computational power will run the model more slowly but will cost less.
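The per-layer count ((w*h*d)+1)*k, where the 1 accounts for one bias term per filter, can be checked with a short calculation. The three-layer stack below is purely hypothetical and is not an architecture disclosed herein; it only shows how the count grows as layers are added.

```python
def conv_layer_params(w, h, d, k):
    """Parameters of one convolutional layer: k filters of shape w x h x d, plus one bias per filter."""
    return (w * h * d + 1) * k


# Hypothetical stack of 1-D convolutions (h = 1) over a 2-channel input:
# each tuple is (w, h, d, k), with d equal to the k of the previous layer.
layers = [(3, 1, 2, 16), (3, 1, 16, 32), (3, 1, 32, 64)]
per_layer = [conv_layer_params(*layer) for layer in layers]
total = sum(per_layer)
print(per_layer, total)  # [112, 1568, 6208] 7888
```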

[00052] In one embodiment, identifying said set threshold for breaking of the ROIs may comprise the steps of listing all local maxima in a given mass spectrometry file, assuming every local maximum to be a peak; dividing the intensity range into bins of a fixed size and segregating the local maxima into the bins based on their intensity; and identifying a set of bins, particularly at low intensities, where the number of peaks is significantly more than in other bins. The identified bins may then be marked as the true baseline noise level and used as a reference for the threshold for noise elimination.
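The threshold estimation described above may be sketched as follows. Two simplifications are assumed for this example and are not taken from the specification: a single most populated bin stands in for the "set of bins", and the upper edge of that bin is used as the threshold. A local maximum is taken to be a point strictly higher than both of its neighbours.

```python
from collections import Counter


def local_maxima(intensity):
    """Intensities of points that are strictly higher than both neighbours."""
    return [intensity[i] for i in range(1, len(intensity) - 1)
            if intensity[i] > intensity[i - 1] and intensity[i] > intensity[i + 1]]


def estimate_noise_threshold(intensity, bin_size=50):
    """Upper edge of the most populated intensity bin of the local maxima."""
    maxima = local_maxima(intensity)
    if not maxima:
        return 0.0
    counts = Counter(int(v // bin_size) for v in maxima)
    # Most populated bin; ties are resolved in favour of the lower intensity bin.
    busiest = max(counts, key=lambda b: (counts[b], -b))
    return (busiest + 1) * bin_size


def apply_threshold(intensity, threshold):
    """Set intensities below the threshold to zero."""
    return [v if v >= threshold else 0.0 for v in intensity]


# Hypothetical scan: baseline maxima near 200 and two tall peaks.
scan = [0, 180, 0, 210, 0, 220, 0, 190, 0, 5000, 0, 230, 0, 8000, 0]
threshold = estimate_noise_threshold(scan)
cleaned = apply_threshold(scan, threshold)
print(threshold, cleaned)
```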

[00053] In one embodiment, the bin size (fixed size for bin) may be such that the baseline noise level is of the same order as that value. For example, if the baseline noise level is, say, 200, a bin of size 50 makes sense. However, if the bin size is kept as 1000, it is impossible to estimate a noise value close to 200. Further, a local maximum may be a point such that the intensity value at that point is more than the intensity values at the points immediately preceding and succeeding it.

[00054] In one embodiment, the ROIs may then be classified as containing a true peak or noise by using specially trained deep learning methods such as a convolutional neural network (CNN) (305). In one embodiment, this may be the first deep learning method. Thus, a check is undertaken of whether said ROI contains at least one true peak, as that caused due to the presence of a substance in the sample and not because of an instrument noise, wherein the identification is based upon a first deep learning method trained on a first pre-identified dataset. This is further enumerated in reference to figure 6 and related description.

[00055] In one embodiment, the intensity values in the ROI may be normalized, i.e. all the intensity values in the ROI are divided by the maximum intensity value in that ROI, thus ensuring that all intensity values in the ROI lie between zero and one. In one embodiment, the output of the first deep learning method may be binary or multivalent, with the output variable having options comprising at least 1) all noise, 2) at least one true peak and 3) borderline cases, being ROIs which have one or more Gaussian characteristics but a low signal to noise ratio. An ROI identified as a borderline case may be processed further to classify it as all noise or as containing at least one true peak.

[00056] Furthermore, in one embodiment, prior to checking if said ROI contains at least one true peak, the identified ROI may be preprocessed using the steps of: smoothening using a moving average to produce a smoothened intensity vector; taking the derivative of the smoothened intensity vector to produce a derivative vector; and concatenating said smoothened intensity vector and said derivative vector to together form an input feature vector for the deep learning method. In one embodiment, the intensity vector of an ROI may be the array of intensity values corresponding to the mass by charge range that the ROI comprises.
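The preprocessing steps above (moving average, derivative, concatenation) may be sketched as follows. The window length of 3, the shortened averaging window at the edges and the trailing zero appended to the derivative are assumptions made for this example.

```python
def moving_average(values, window=3):
    """Centered moving average; edge points average over the neighbours that exist."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def preprocess_roi(intensity, window=3):
    """Concatenate the smoothened intensity vector and its derivative vector."""
    smoothed = moving_average(intensity, window)
    # Forward difference, padded with 0.0 so both halves have equal length.
    derivative = [smoothed[i + 1] - smoothed[i] for i in range(len(smoothed) - 1)] + [0.0]
    return smoothed + derivative


# Hypothetical ROI intensity vector, for illustration only.
features = preprocess_roi([0.0, 3.0, 6.0, 3.0, 0.0])
print(features)
```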

[00057] In one embodiment, the first deep learning method may comprise the steps of fetching a set of training mass spectrometry data files, focusing on metabolomics and obtained from instruments of various vendors, as the first pre-identified dataset of first pre-identified data files. Taking the first pre-identified set of data files from instruments of more than one vendor of mass spectrometry instruments helps make the method and the system vendor agnostic. The training mass spectrometry data files focusing on metabolomics may be taken from in-house and open sources. Further, in one embodiment, the first pre-identified set of data files is of molecules with mass by charge less than 2000 Da.

[00058] In one embodiment, the further steps in the first deep learning method may comprise identifying training regions of interest (training ROIs) within each of the first pre-identified data files, and randomly extracting a predefined number of ROIs from a spread out range of mass/charge values in said first pre-identified data file. The above steps may be repeated for each of the first pre-identified data files chosen out of the dataset.

[00059] Further, in one embodiment, ROIs that contain at least one true peak are labeled as such based on identification by a user, and the remaining ROIs are labeled as noise ROIs. The labeling may be repeated until a nearly equal number of examples of peak-labeled and noise-labeled ROIs is obtained. This is done because a skewed dataset, in which one label has significantly more examples than the other, can hamper the training of the model: if a label is over-represented in the training data, a deep learning model trained on that data is likely to be biased towards that label at the time of inference. In that case, the above exercise should be repeated to procure more examples of the under-represented label until a nearly equal number of examples for both labels (all noise, at least one true peak) is obtained. In an exemplary scenario, a randomly selected fixed number of ROIs (say 500) may be taken for the dataset, ensuring that they come from a spread out range of m/z, so that a lot of examples are not accumulated from a small range of m/z.
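One way to draw ROIs from a spread out range of m/z and to check the balance of the resulting labels is sketched below. The equal-width m/z bins, the fixed random seed and the 10% balance tolerance are assumptions made for this example; the specification only requires a spread out selection and a nearly equal number of examples per label.

```python
import random


def sample_spread_rois(rois, n_samples, n_bins=10, seed=0):
    """Pick up to n_samples ROI ids, drawing evenly from equal-width m/z bins.

    `rois` is a list of (mz_center, roi_id) pairs.
    """
    rng = random.Random(seed)
    lo = min(mz for mz, _ in rois)
    hi = max(mz for mz, _ in rois)
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range
    bins = [[] for _ in range(n_bins)]
    for mz, roi_id in rois:
        index = min(int((mz - lo) / width), n_bins - 1)
        bins[index].append(roi_id)
    per_bin = max(1, n_samples // n_bins)
    picked = []
    for members in bins:
        rng.shuffle(members)
        picked.extend(members[:per_bin])
    return picked[:n_samples]


def is_balanced(labels, tolerance=0.1):
    """True when the peak and noise counts differ by at most `tolerance` of the total."""
    n_peak = sum(1 for label in labels if label == "peak")
    n_noise = sum(1 for label in labels if label == "noise")
    total = n_peak + n_noise
    return total > 0 and abs(n_peak - n_noise) / total <= tolerance


# Hypothetical ROI catalogue: 1000 ROIs with m/z centers from 100 to 1099.
catalogue = [(100.0 + i, i) for i in range(1000)]
picked = sample_spread_rois(catalogue, 500)
print(len(picked))
```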

[00060] Next, peak boundaries are identified for ROIs that are classified as peaks by using a deep learning method such as a CNN (306). In one embodiment, this marking of a peak boundary of the peak may be done based on a second deep learning method configured to identify the ends of a curve associated with the said peak. In one embodiment, training of this method may comprise fetching the ROIs containing at least one true peak, with the start and end boundaries of all the peaks within the confines of the ROI being marked by a user. The principles of choice of dataset followed for the first deep learning method may be followed here as well.

[00061] Further, the median, mean or weighted average m/z value and the area under the curve are recorded as a tuple for each peak (307). A centroid of a peak may thus be identified as a pair of a representative mass by charge value and a representative abundance value for the identified peak. The representative mass by charge value may be calculated as the average of the mass by charge values of the peak, within the identified boundary, weighted by the corresponding intensity values, and the representative abundance value for the identified peak may be calculated as the area under the curve of the said peak, within the identified peak boundary. This is further enumerated in reference to figure 8 and related description.
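The centroid tuple of step 307 may be sketched as follows for the weighted average case. The trapezoidal rule is assumed here for the area under the curve, since the specification does not name an integration method; the function name and sample values are hypothetical.

```python
def centroid(mz, intensity, start, end):
    """Return (intensity-weighted mean m/z, area under the curve).

    `start` and `end` are the inclusive indices of the marked peak boundary.
    """
    xs = mz[start:end + 1]
    ys = intensity[start:end + 1]
    total = sum(ys)
    if total <= 0:
        raise ValueError("peak has no positive intensity")
    weighted_mz = sum(x * y for x, y in zip(xs, ys)) / total
    # Trapezoidal rule over the (possibly unevenly spaced) m/z axis (assumption).
    area = sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
               for i in range(len(xs) - 1))
    return weighted_mz, area


# Hypothetical symmetric peak, for illustration only.
mz_values = [100.0, 100.5, 101.0, 101.5, 102.0]
intensities = [0.0, 10.0, 20.0, 10.0, 0.0]
peak_mz, peak_area = centroid(mz_values, intensities, 0, 4)
print(peak_mz, peak_area)
```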

[00062] The process is repeated for all the scans in the mzML file (308). In one aspect, the method improves upon existing solutions by identifying an unidentified true peak by fetching the ROIs that contain at least one peak, and zeroing out the true peak of the ROI already identified. Further, repeating the step of checking if said ROI contains at least one true peak takes place. This helps in identifying the small peak/s that may have been sitting next to large peak/s hence getting shadowed. This is further enumerated in reference to figure 7 and related description.
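The "zero out and re-check" idea for recovering shadowed peaks may be sketched as a loop. The two toy functions below are hypothetical stand-ins for the trained first and second deep learning methods and exist only so that the loop can run; they do not reflect how the trained models work.

```python
def find_all_peaks(intensity, contains_peak, find_boundary, max_rounds=10):
    """Detect a peak, record its boundary, zero it out, then check the ROI again."""
    work = list(intensity)
    boundaries = []
    for _ in range(max_rounds):
        if not contains_peak(work):
            break
        start, end = find_boundary(work)
        boundaries.append((start, end))
        for i in range(start, end + 1):
            work[i] = 0.0  # remove the peak already identified
    return sorted(boundaries)


def toy_contains_peak(roi, min_height=5.0):
    """Stand-in classifier: any point at or above a fixed height counts as a peak."""
    return max(roi) >= min_height


def toy_find_boundary(roi):
    """Stand-in boundary model: walk downhill from the apex on both sides."""
    apex = max(range(len(roi)), key=lambda i: roi[i])
    start = apex
    while start > 0 and 0 < roi[start - 1] < roi[start]:
        start -= 1
    end = apex
    while end < len(roi) - 1 and 0 < roi[end + 1] < roi[end]:
        end += 1
    return start, end


# Hypothetical ROI with a small peak sitting next to a large one.
roi = [0.0, 2.0, 8.0, 3.0, 40.0, 100.0, 35.0, 2.0, 0.0]
found = find_all_peaks(roi, toy_contains_peak, toy_find_boundary)
print(found)
```

In this toy run the large peak is found first; once it is zeroed out, the small neighbouring peak becomes the tallest remaining feature and is picked up on the second pass.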

[00063] A person skilled in the art may realize the presence of multiple centroids for each ROI, by obtaining a centroid for each peak in the event of presence of multiple peaks within the ROI.

[00064] Steps shown in 301-308 may be automated on a computing device by writing software programs in languages such as, for example, Python or C++. Additionally, public domain libraries such as TensorFlow, NumPy, SciPy and Keras may be used to deploy the steps of 301-308 on a computer. A person skilled in the art may use other public domain libraries with obvious modifications for suitability with the teachings of the embodiments herein. Further, an article of manufacture may be made, including a non-transitory computer readable storage medium to tangibly store instructions which, when executed by a computer, cause the computer to perform the steps as disclosed in the method above. Also, in one embodiment, a computer implemented method for centroiding of a peak in a mass scan of a sample may be provided. The method further comprises instructions which when executed by the computer cause the computer to perform the steps enumerated above.

[00065] Figure 4 provides examples of regions of interest (ROIs) built from an arbitrarily chosen real life HR-MS data. HR-MS data is sparse and thus ROIs are essentially regions of non-zero intensity values. Certain tunable parameters may be used to build ROIs, which may then be subjected to ad hoc filters to eliminate instances of noise that can be identified readily. In the illustrative example shown here, six ROIs are initially formed [401, 402, 403, 404, 405, 406], of which five ROIs [402, 403, 404, 405, 406] are eliminated in an exemplary scenario based on an ad hoc filter of minimum ROI size of 4. The ROI [401] is then passed on to the deep learning method.
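The ad hoc minimum-size filter in this example may be sketched as follows. ROI size is assumed here to mean the number of data points in the ROI, and the six index ranges are hypothetical values chosen to mirror the outcome described for Figure 4, not data read from the figure.

```python
def filter_small_rois(rois, min_size=4):
    """Keep only ROIs, given as (start, end) with end exclusive, of at least min_size points."""
    return [(start, end) for start, end in rois if end - start >= min_size]


# Six hypothetical ROIs; only the first has four or more points.
candidate_rois = [(10, 30), (40, 42), (50, 51), (60, 63), (70, 72), (80, 81)]
kept = filter_small_rois(candidate_rois)
print(kept)
```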

[00066] Figure 5 of an embodiment provides an illustrative example of a long ROI from an arbitrarily chosen, real life HR-MS data. The example shows a region with several true peaks that are separated by long regions of low but non-zero intensity values. The ROIs that are longer than a certain predetermined target length are then broken at natural breaks points or forced to break by a sliding and overlapping window approach. At this stage, a suitably chosen threshold value may be used so that intensities below it are set to zero. The said threshold may be arrived at by analyzing the distribution of the intensity values of the points of local maxima in the mass scan data.

[00067] Figure 6 of this embodiment illustrates example ROIs that are classified into peak or noise by using a deep learning-based method. The classification is solely based on the shape of the ROI and not based on the value of intensity. The figure shows a total of 8 ROIs, 4 containing at least 1 species peak and the other 4 comprising pure noise. The method takes the sequences of smoothened, normalized intensity values (normalized between 0 and 1) and its first derivative, stacked into a 2 x k matrix, with k being a predetermined value. ROIs that are shorter than k are padded with zeros on either side while ROIs that are longer than k are interpolated and resampled or forcefully broken using a sliding window approach. We apply a convolutional neural network with several layers but other architectures of neural networks may be applicable. In an embodiment, the method may be first trained with real life data comprising ROIs that are labeled by humans as peak or noise. Alternatively, a special purpose application or a method using computer programs may be used for the said purpose.
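Forcing an ROI to the predetermined length k may be sketched as follows. Placing any odd extra zero on the right-hand side and using linear interpolation for resampling are assumptions made for this example; the function names are hypothetical.

```python
def pad_to_length(values, k):
    """Zero-pad on both sides to length k; an odd extra zero goes on the right."""
    missing = k - len(values)
    if missing < 0:
        raise ValueError("ROI is longer than k; resample or window it instead")
    left = missing // 2
    return [0.0] * left + list(values) + [0.0] * (missing - left)


def resample_to_length(values, k):
    """Linearly interpolate `values` onto k evenly spaced positions."""
    if k == 1 or len(values) == 1:
        return [float(values[0])] * k
    scale = (len(values) - 1) / (k - 1)
    resampled = []
    for j in range(k):
        position = j * scale
        i = min(int(position), len(values) - 2)
        fraction = position - i
        resampled.append(values[i] * (1 - fraction) + values[i + 1] * fraction)
    return resampled


padded = pad_to_length([1.0, 2.0, 3.0], 8)
shrunk = resample_to_length([0.0, 2.0, 4.0, 6.0, 8.0], 3)
print(padded, shrunk)
```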

[00068] Figure 7 explains the identification of peak boundaries in ROIs by using a deep learning-based method. Once an ROI is classified as a peak, it may contain one or more peaks and the peak boundaries need to be identified correctly. An ROI [701] obtained from an arbitrarily chosen real life mass scan has been used as an illustrative example. The method takes the sequences of smoothened, normalized intensity values (normalized between 0 and 1) [702] and its first derivative [703], stacked into a 2 x k matrix, with k being a predetermined value. ROIs that are shorter than k are padded with zeros on either side while ROIs that are longer than k are interpolated and resampled or forcefully broken using a sliding window approach. We apply a convolutional neural network with several layers but other architectures of neural networks may be applicable. The method is first trained with real life data comprising ROIs that are labeled by humans as peak or noise along with accurately marked peak boundaries. In the present example, the method detects two peaks [704, 705] within the ROI and marks the peak boundaries. Hence, the problem of inaccuracies with respect to marking of a rightly identified centroid may be resolved by training a U-NET (CNN) model on a dataset of, exemplarily, 2500 ROIs, wherein the peak boundaries in each ROI were manually marked by a human expert.

[00069] Figure 8 of this embodiment illustrates examples of the deep learning-based centroiding of arbitrarily chosen, real life mass scan data obtained from an HR-MS instrument. Each panel comprises an ROI that has been classified as a peak by the deep learning-based classification method. The lines represent the original values of intensity (a.u.) on the primary y-axis (shown on the left y-axis in each panel) plotted vs m/z.

[00070] Figure 9 shows a system for centroiding a peak in a mass scan of a sample, as per an embodiment herein. The system may comprise a region of interest (ROI) identification module 901, configured to identify a region of interest (ROI) as long ROI, normal ROI or small ROI, and an ROI fragmentation module 902, configured to break each of the long ROI into small ROIs or normal ROIs. A first deep learning module 903 may be provided, configured to check if said ROI contains at least one true peak, as that caused due to the presence of a substance in the sample and not because of an instrument noise, wherein the identification is based upon a first deep learning method trained on a first pre-identified dataset in a database 920. The said database may be coupled to a processing module 922, which may be configured to take up processing related to the calculations in the identified steps. Further, the preprocessing steps may be configured to be undertaken by the said processing module in communication with the other modules named above as needed. Furthermore, the database 920 may be split into two parts: a cloud based database and a local database. This may be done to provide a cloud service, with the deep learning methods (the first and the second) configured into said cloud based database, which a local database housed within a lab can access remotely.

[00071] In one embodiment, a second deep learning module 904, configured to mark a peak boundary of the peak based on a second deep learning method, trained on a first pre-identified dataset in the database, configured to identify the ends of a curve associated with the said peak may be provided. Further, a centroid identifying module 905, may be provided that may be configured to identify a centroid of a peak as a pair of a representative mass by charge value and a representative abundance value for the identified peak, wherein the representative mass by charge value is calculated as the average of the mass by charge values of the peak, within the identified boundary, weighted by the corresponding intensity values and representative abundance value for the identified peak is calculated as the area under the curve of the said peak, within the identified peak boundary.

[00072] Important aspects of this invention may include but are not limited to: (i) the method to select true peaks in mass scans without any user-defined parameters, (ii) the use of normalized intensity and first derivative of intensities as input vectors for the deep learning method, (iii) the ability to classify peaks and noise and accurately mark peak boundaries and (iv) the ability to remove noise with minimal loss of data and without human intervention. The invention also saves time and effort. Further, the presence of noise in the centroided data results in noisy peaks in the retention time dimension. The present invention significantly reduces the noise in the centroided data, thereby significantly improving the quality of the EICs of biological molecules. This in turn improves quantifiability of the biological molecules.

[00073] While the subject matter may be susceptible to various modifications and alternative forms, specific embodiments have been shown by way of figures/examples in the drawings and have been described herein. Alternate embodiments or modifications may be practiced without departing from the spirit of the subject matter. The drawings shown are schematic drawings and may not be to scale. While the drawings show some features of the subject matter, some features may be omitted. In some other cases, some features may be emphasized while others are not. Further, the methods disclosed herein may be performed in the manner and/or order in which the methods are explained. Alternatively, the methods may be performed in a manner or order different from what is explained without departing from the spirit, metes and bounds of the present subject matter. It should be understood that the subject matter is not intended to be limited to the particular forms disclosed. Rather, the subject matter is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the subject matter as described above and enumerated in claims below.