Title:
METHOD FOR DATA ANALYSIS
Document Type and Number:
WIPO Patent Application WO/2019/094507
Kind Code:
A1
Abstract:
A method of feature selection and machine learning for wide data.

Inventors:
SORENSEN MATTHEW (US)
NILSSON ERIK (US)
Application Number:
PCT/US2018/059681
Publication Date:
May 16, 2019
Filing Date:
November 07, 2018
Assignee:
PATAIGIN LLC (US)
International Classes:
G01R23/16; G06F9/44
Foreign References:
US20060217911A12006-09-28
US20120109537A12012-05-03
US7027933B22006-04-11
Attorney, Agent or Firm:
PEPE, Jeffrey, C. et al. (US)
Claims:
CLAIMS

We claim:

1. A method of transforming data, comprising (i) identifying at least one pattern in a data structure resulting from or resembling data resulting from one or more stochastic mechanisms, wherein the stochastic mechanism comprises at least one stochastic process, the stochastic process comprising a collection of variables, and (ii) transforming the data structure into a different data structure by constructing a synthetic variable from the pattern or selecting a variable from the pattern.

2. The method of claim 1, wherein the data structure comprises data selected from: spectral data; multidimensional data, wherein at least one dimension of the multidimensional data has the properties of a spectrum; a peak list; a vector; or a matrix, wherein at least one dimension of the vector or matrix is a representation of a spectrum or of peaks from a spectrum.

3. The method of claim 1 or 2, wherein the data from the data structure represent a mass spectrum and the one or more stochastic mechanisms result from or resemble one or more of isotopic distribution, charge state convolution, or chemical distribution.

4. The method of claims 1-3, wherein peaks or other patterns in the data are represented in a binary tree data structure, and patterns are identified by searching the binary tree data structure.

5. The method of claims 1-4, wherein the data is derived from tandem mass spectrometry, multidimensional mass spectrometry, ion-mobility mass spectrometry, two-dimensional liquid chromatography-tandem mass spectrometry, an electrospray mass spectrometer, a time-of-flight mass spectrometer, a quadrupole mass spectrometer, a triple quadrupole mass spectrometer, a magnetic sector mass spectrometer, an ion trap mass spectrometer, a quadrupole trap mass spectrometer, an orbitrap mass spectrometer, a gas chromatograph mass spectrometer, a matrix-assisted laser desorption/ionization mass spectrometer, an ion mobility mass spectrometer, a plasma chromatograph, an inductively-coupled plasma mass spectrometer, a mass cytometer, an accelerator mass spectrometer, a Fourier transform mass spectrometer, a Fourier-transform ion cyclotron resonance mass spectrometer, a mass spectrometer using an ambient ionization method such as direct analysis in real time, another type of mass spectrometer, laser induced fluorescence spectroscopy, atomic absorption spectroscopy, atomic emission spectroscopy, flame emission spectroscopy, acoustic resonance spectroscopy, cavity ring down spectroscopy, circular dichroism spectroscopy, Raman spectroscopy, coherent anti-Stokes Raman spectroscopy, cold vapor atomic fluorescence spectroscopy, nuclear magnetic resonance spectroscopy, electrical impedance spectroscopy, electron phenomenological spectroscopy, electron paramagnetic resonance spectroscopy, Fourier-transform spectroscopy, laser-induced breakdown spectroscopy, photoacoustic spectroscopy, photoemission spectroscopy, photothermal spectroscopy, spectrophotometry, vibrational circular dichroism spectroscopy, gamma spectroscopy, flow cytometry, and/or some other type of spectroscopy; or by means of a scintillation detector, scintillation counter, Geiger counter, ionization chamber, gaseous ionization detector, or any combination thereof.

6. A method of identifying features in training data, comprising (i) partitioning a data set into regions by convolution with a kernel, (ii) identifying partitions of the data containing features in at least a minimum number or proportion of samples in the data set, and (iii) constructing synthetic variables of the partitions so identified.

7. The method of claim 6, wherein the data is partitioned into non-overlapping regions of equal or unequal width.

8. A method of analyzing data comprised of spectral data; multidimensional data, wherein at least one dimension of the multidimensional data has the properties of a spectrum; a peak list; a vector; or a matrix, wherein at least one dimension of the vector or matrix is a representation of a spectrum or of peaks from a spectrum; the method comprising applying at least one integral transform to the data with or without other operations on the data.

9. The method of claim 8, wherein the at least one integral transform is selected from a Fourier transform, an inverse Fourier transform, a Laplace transform, a Z-transform, a cepstrum, a real cepstrum, a phase cepstrum, a complex cepstrum, a Riemann integral, or any combination thereof.

10. The method of claims 1-9, wherein the different data structure is applied to machine learning.

11. The method of claims 1-10, further comprising repeating the method by starting with the different data structure to obtain a second different data structure.

12. The method of claim 11, further comprising repeating the method by starting with the second different data structure to obtain a third different data structure.

13. The method of claims 1-12, wherein the method is used to, or is part of a system that is used to, detect, diagnose, predict, measure, measure the severity of, predict relapse of, or monitor treatment of an infectious or noninfectious disease of humans, other animals, plants, or other organisms.

14. The method of claims 1-12, wherein the method is used to or is part of a system that is used to detect, diagnose, predict, measure the severity of, or monitor the presence of an organism in an environmental or situational sample, optionally wherein the environmental or situational sample comprises soil, air, factory surfaces or materials, food, agricultural products, biomanufacturing products, biomanufactured products, pharmaceuticals, cosmetics, soap, detergent, harvest goods or samples, manufacture goods or samples, refining goods or samples, storage goods or samples, transport goods or samples, or consumption goods or samples.

15. The method of claims 1-14, wherein the samples are analyzed by mass spectrometry to detect the presence of pathogens by detecting, or detecting patterns of, proteins, lipids, carbohydrates, nucleic acids, or any combination thereof.

16. The method of claims 1-15, wherein the samples are processed via or derived from a protein fingerprint identification system, optionally a MALDI Biotyper or Vitek-MS.

17. The method of claims 1-15, wherein the samples are processed via or derived from a lipid fingerprint identification system, optionally a BACLIB.

18. The method of any one of claims 1-17, wherein the collection of variables of the stochastic process comprises one or more observed variables, one or more latent variables, or a combination thereof.

19. The method of claim 18, wherein the variables are all latent variables.

Description:
METHOD FOR DATA ANALYSIS

BACKGROUND

[0001] The present invention is in the technical field of Data Analysis.

[0002] Many areas of Data Analysis, such as data mining and machine learning, present problems of overfitting. "Wide" data containing a large number of variables is one type of data prone to overfitting. A large number of variables often causes problems beyond overfitting, and it is often important to reduce the number of variables even if the data is not particularly wide. Thus, methods referred to as regression analysis and feature selection, among other names, are used to reduce the number of variables or to reduce the impact of the number of variables. However, these methods are often insufficient. Situations where current methods are insufficient include data with a large number of variables, correlated variables, and multi-collinear variables.

[0003] By "variable" we mean some numerical or other value that is associated with some unit or sample in a data set. For example, in a set of data recording resting heart rates for some or all subjects in some study, each subject should have associated with them one or more records in a data set, and at least some of these records should have a value associated with the subject's resting heart rate. In this case "resting heart rate" is a variable in that data set. By way of further example, a mass spectrum from a mass spectrometer can be considered a vector of values, where each position in the vector corresponding to a specific mass value is a variable.

SUMMARY

[0004] The present invention is a method for addressing data containing multiple variables (sometimes called "independent variables," "features," or "attributes").

[0005] Often, a subset of the variables will be selected, for example by eliminating some of the variables or by projecting the data into a different variable space. The present invention is a new method of feature selection or feature synthesis comprising multiple methods that can be used individually and in various combinations. The methods can be combined to produce results that none of the methods can produce individually.

[0006] Some aspects of the present invention address data such as mass spectrometry (MS) data containing correlated variables that arise because of stochastic processes, including biological and/or chemical variation in a sample.

[0007] Some aspects of the present invention can be used to select a subset of variables from a data set based on the relationship of these variables to multiple estimators applied to the data set. It is specifically not necessary that all or even any of the estimators are of interest in actually modeling the data, but estimators used in this selection method can also be subsequently used in modeling the data.

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] Fig. 1 shows a mass spectrum for a collection of fatty acid molecules, containing both isotopic distribution peak groups and chemical distribution peak groups.

[0009] Fig. 2 shows a mass spectrum for a collection of fatty acid molecules, containing both isotopic distribution peak groups and peaks not belonging to any apparent peak group.

[0010] Fig. 3 shows mass spectra for 6 samples analyzed separately, wherein a power cepstrum transform has been applied to each spectrum.

[0011] Fig. 4 shows mass spectra for 3 samples, showing various patterns of alignment between peaks of different spectra.

[0012] Figs. 5A and 5B show Python computer code for identifying stochastic peak groups using the binary-tree method.

[0013] Fig. 6 shows a method for feature mapping from spectra.

[0014] Fig. 7 shows a method for applying a feature map to a group of spectra.

[0015] Fig. 8 shows a method for analyzing spectra with one or more integral transforms.

[0016] Fig. 9 shows graphs of the results of consensus feature learning applied to MALDI mass spectrometry data.

[0017] Fig. 10 shows results from subjecting MALDI mass spectrometry data from the BACLIB assay of bacterial lipids to an integral transform, consensus feature learning, and ensemble learning.

DETAILED DESCRIPTION

[0018] The present invention is primarily directed towards the analysis of "wide" data on a multiplicity of samples, wherein the data for each sample of said multiplicity of samples contains a large number of variables. The number of variables is "large" whenever the number of variables is more numerous than is preferred for machine learning. The threshold for "large" is usually greater than 14, but wide data can contain thousands, millions, or even billions of variables.

[0019] An important category of wide data is "spectral" data, in which the variables for a sample are arranged into one or more dimensions, and at least one dimension is a discrete series of measurements of a phenomenon that is or can be conceived of as a continuous phenomenon, or can be classified in terms of position on some scale. For example, the possible wavelengths of electromagnetic radiation can be conceived of as a continuous phenomenon. A modern spectrophotometer measures wavelengths of light, producing a vector of data comprising intensities for a series of discrete wavelength intervals, comprising a spectral dimension. As a further example, the possible mass values of molecules are not a continuous phenomenon, but the possible flight times of molecules in a time-of-flight mass spectrometer can be conceived of as a continuous phenomenon. Furthermore, the mass values of molecules can be classified in terms of position on a mass scale. A mass spectrometer produces a vector of intensities for a series of discrete estimated mass-to-charge (m/z) values, comprising a spectral dimension. As a further example, an instrument may measure the emission of one or more photons from a specific chemical reaction, such as DNA synthesis, taking place in a large number of wells or locations, with a different such reaction taking place at each such location. The operation of the aforementioned instrument is in important respects discrete: reactions only take place at certain locations, and for any given time interval, a reaction either takes place or does not. However, if the number of locations for reactions is large, the data resulting from the aforementioned instrument may nonetheless sufficiently resemble a spectrum so that data from the aforementioned instrument may be treated as spectral data with regard to the present invention. For the purposes of the present invention, spectral data may comprise data relating to measurements of continuous spectra or equally may comprise data relating to measurements of discrete spectra, as those terms are ordinarily used by those skilled in the art.

[0020] Spectral data may have one or multiple dimensions. One, some, or all dimensions of said spectral data may be spectral dimensions. Spectral dimensions may have, but are not required to have, a linear relationship to any physical quantity. It is not necessary for the present invention that a function mapping position in a spectral dimension to a physical quantity be known, that such a function exist, or that it be possible for such a function to exist. Spectral data may be referred to as such, as a "spectrum," or as a particular type of spectrum, for example a mass spectrum.

[0021] In Fig. 1 can be seen a mass spectrum for a collection of fatty acid molecules. The vertical axis of the spectrum is intensity and the horizontal axis is mass. The fatty acid molecules have, for practical purposes, identical structure, except that the number of carbon atoms in one or more carbon chains of the molecule varies, producing 9 groups of peaks 10-26 separated one from the next by approximately 14 Da (Daltons). For example, a dimension marking 28 shows the difference in mass between one of the peaks 16 and another of the peaks 18 as equal to the mass of a carbon atom plus two hydrogen atoms, or C+2H. The mass of C+2H is equal to approximately 14.01468 Da without accounting for binding energy, or, to a lesser precision, approximately 14 Da. Herein, we refer to peaks having the property shown in Fig. 1 for the peaks 10-26 as "chemical distribution peaks" and we refer to a group of such peaks as a "chemical distribution peak group." Herein, we refer to the mixture of molecules that gives rise to the measurement of a chemical distribution peak group as a "chemical distribution."

[0022] The average separation between chemical distribution peaks in a chemical distribution peak group can be called the "spacing" of the chemical distribution peak group. This spacing can also be thought of as the "wavelength" of the chemical distribution peak group.

[0023] It will be appreciated by one skilled in the art that the spacing of a chemical distribution peak group may be approximately 14 Da or may have some other value. It will be further appreciated by one skilled in the art that, in any particular case, the spacing of a chemical distribution peak group depends on the specific biological and/or chemical process that created the chemical distribution and also depends on the specific measurement made, producing data in which the chemical distribution peak group is detected. In some cases, the spacing of the chemical distribution peak group may be less than 14 Da, whereas in other cases the spacing of the chemical distribution peak group may be more than 14 Da.

[0024] Each of the peaks 10-26 is itself comprised of peaks that are sometimes referred to by those skilled in the art as "isotopic peaks"; a group of such peaks will herein be called an "isotopic peak group." The isotopic peaks in this case are 1 Da apart, and are produced by the natural distribution of carbon of about 99% 12C and 1% 13C. The present invention does not require that the mass spectrum contain isotopic peaks, nor does the present invention require a mass spectrum with sufficient resolution to resolve isotopic peaks, nor does the present invention require a sample that contains elements present with a mixture of isotopes, such isotopes potentially resulting in a mass spectrum containing isotopic peaks. However, the presence of isotopic peaks in a spectrum or similar aspects in other types of measurement data does not interfere with the function of the present invention.

[0025] It will be appreciated by one skilled in the art that chemical distribution peaks will be produced by a variety of biological and synthetic samples, not just fatty acids, and herein the molecules are identified as fatty acids to provide a more detailed example, not by way of limiting the present invention to any particular type or types of molecules. It will be appreciated by one skilled in the art that in biological samples, chemical distributions can arise from the natural variation in enzymatic behavior, the action of multiple enzymes with overlapping function, successive or varying post-translational modification, successive or varying degradation of molecules by various mechanisms, or one or more of these and/or other mechanisms. It will be appreciated by one skilled in the art that chemical distributions can arise in synthetic molecules by variation in the synthesis or degradation of molecules, and/or by other mechanisms.

[0026] One skilled in the art will appreciate that molecules being analyzed may also be modified by sample preparation and analysis processes, resulting in a chemical distribution peak group. Such modifications may be confounding variables, and so in some cases it may be desirable to suppress or convolve the signal associated with a chemical distribution peak group. Alternatively, such modifications may provide information about the chemical analytes, and so it may be desirable to extract the signal associated with the chemical distribution peak group. It is possible for both situations to arise simultaneously in a single sample: a chemical distribution peak group provides useful information, while at the same time the same or a different chemical distribution peak group convolves a different signal of interest. Beyond the resolution and range limits of the instruments used, there is in principle no limit to the number of chemical distribution peak groups that can be found in a measurement or measurements of a sample, so beyond the aforementioned instrument limits, there is in principle no limit to the number of chemical distribution peak group signals that can be addressed in measurements from a sample.

[0027] Chemical distributions may be analyzed to gain information on the process that caused one or more chemical distributions to be present in a sample. Alternatively or in addition, the one or more chemical distributions may be deliberately suppressed, to gain information about other aspects of the sample.

[0028] For example, Gram-negative bacteria typically have variation in the carbon chain length in lipid A molecules. The carbon chains vary in length by one or more carbons. Each such varying carbon typically has two hydrogens bonded to it. So, the mass of a natural lipid A typically shows variation by 14 Da. Consequently, lipid A extracted from a sample of a Gram-negative bacterium and analyzed for example by mass spectrometry will typically produce a spectrum with chemical distribution peak groups at a spacing of 14 Da. Other organisms have different typical or characteristic patterns of chemical distributions, which will typically result in different typical chemical distribution peak groups in mass spectra of said other organisms. In at least one embodiment of the present invention, patterns of chemical distribution peak groups can be used to identify a sample as belonging to a species, a genus, a domain, or any other clade or taxonomic group more specific than genus, between genus and domain in specificity, or more general than domain.

[0029] A chemical distribution peak group often has a spacing significantly greater than 1 Da, such as the approximately 14 Da spacing shown in Fig. 1. One skilled in the art will appreciate that such a signal can be readily detected by an instrument or experimental system with mass resolution significantly below 1 Da. For example, modern mass spectrometers typically have sufficient mass resolution to detect any chemical distribution peak groups present. One skilled in the art will appreciate that chemical distribution peak groups can be reliably detected even when the precision of the x-axis measurement is extremely poor. For example, a mass spectrometer with an error in mass measurement of several Da would still produce a detectable chemical distribution peak group as shown in Fig. 1.

[0030] In at least one embodiment of the present invention, analysis of a mass spectrum or similar data in the frequency domain, as by applying a Fourier transform, allows detection of chemical distribution peak groups at extremely low ratios of signal to noise. One skilled in the art will appreciate that while the feature selection technique described herein is most obviously applicable to data that is or can be conceived of as in the time domain, the feature selection technique described herein may be applied to any data representing a series of measurements or the like, even if said series of measurements did not occur over time.

[0031] Fig. 1 graphically shows a vector of a single dimension of measurements. At least one embodiment of the present invention operates on a vector of a single dimension such as in Fig. 1. However, at least one embodiment of the present invention operates on data resulting from multidimensional measurements, such as tandem mass spectrometry, multidimensional mass spectrometry, ion-mobility mass spectrometry, and/or two-dimensional liquid chromatography-tandem mass spectrometry.

[0032] A chemical distribution peak group typically has a spacing that is wider than some important sources of noise in many instrument systems. For example, in mass spectrometry, chemical distribution peak groups typically have a spacing, considered as a wavelength, that is wider than the important wavelength bands for shot noise. Moreover, a chemical distribution peak group typically has a spacing, considered as a wavelength, that is shorter than some important sources of noise in many instrument systems. For example, chemical distribution peak groups typically have a shorter spacing than signals in mass spectrometry that are typically referred to as "baseline disturbances" or simply "the baseline." This typical separation in wavelength between chemical distribution peak groups and noise allows the signals of chemical distribution peak groups to be separated from noise by wavelength, providing a further means of separating chemical distribution peak groups from noise and allowing them to be detected in data with even lower signal-to-noise ratios.
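
By way of illustration only, the following Python sketch shows one way such a wavelength separation could be performed; it is not taken from the figures, and the function name, the use of a discrete Fourier transform, and the example spacing limits are illustrative assumptions rather than part of the disclosure.

    import numpy as np

    def wavelength_bandpass(intensities, mz_step, min_spacing, max_spacing):
        """Keep only frequency components whose period (peak spacing, in Da) lies
        between min_spacing and max_spacing, suppressing both short-wavelength
        noise and long-wavelength baseline drift. Assumes the spectrum has been
        resampled to a uniform m/z step of mz_step Da."""
        x = np.asarray(intensities, dtype=float)
        ft = np.fft.rfft(x)
        freqs = np.fft.rfftfreq(len(x), d=mz_step)          # cycles per Da
        with np.errstate(divide="ignore"):
            periods = np.where(freqs > 0, 1.0 / freqs, np.inf)
        keep = (periods >= min_spacing) & (periods <= max_spacing)
        return np.fft.irfft(ft * keep, n=len(x))

    # e.g. keep peak spacings between about 5 Da and 50 Da:
    # filtered = wavelength_bandpass(intensities, mz_step=0.1, min_spacing=5, max_spacing=50)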

[0033] In Fig. 2 can be seen a mass spectrum for a collection of fatty acid molecules. The vertical axis of the spectrum is intensity and the horizontal axis is mass, as with Fig. 1. The mass resolution of the spectrum in Fig. 2 is too low for isotopic peaks to be resolved. Fig. 2 shows a chemical distribution peak group comprising peaks 30-38. Fig. 2 shows additional peaks 40-44 that are not part of the chemical distribution peak group. From the mass labels on the peaks, it can be seen that peaks 30-38 have a spacing of about 14 Da. Even though the mass resolution of the spectrum in Fig. 2 is low and the mass accuracy of the spectrum in Fig. 2 may be low, the chemical distribution peak group can still be identified by the peak spacing of about 14 Da. In contrast, the distance from each of peaks 40-44 to any other peak in the spectrum is not an integral multiple of 14 Da. By observation, one skilled in the art can appreciate that peaks 30-38 can plausibly comprise a chemical distribution peak group, whereas none of peaks 40-44 is part of the chemical distribution peak group comprising peaks 30-38, nor are any of peaks 40-44 part of any other chemical distribution peak group present in the spectrum in Fig. 2.

[0034] In mass spectrometry, the "principal ion" is the ion composed of the most abundant isotopes of each of the atoms comprising the ion. A "monoisotopic mass peak" is a peak produced by a principal ion. The subject of isotopic convolution is well covered in the literature, so a simplified discussion will suffice here. We will consider spectra of molecules where the only atom present in an isotopic distribution is carbon, and only two isotopes of carbon are present, 12C and 13C. Unless otherwise noted, we assume for purposes of illustration that 99% of carbon in any sample is 12C and the remaining 1% of carbon is 13C. As a practical matter, this simplification is sufficient in many cases for analysis of biological molecules, but at least one embodiment of the present invention operates on multiple stochastic peak groups superimposed in such a way that they are not obviously distinct, such as peaks resulting from multiple isotopic distributions. A natural sample containing a multitude of molecules all with the same chemical formula, said chemical formula containing at least one carbon atom, will produce multiple peaks, often called "isotopic distribution peaks." If there is a single carbon atom, then given sufficiently low noise relative to signal, two peaks will be produced: a monoisotopic peak, and a second peak 1 Da heavier than the monoisotopic peak and roughly 1% of the intensity of the monoisotopic peak. If two carbon atoms are present, then up to three peaks will be produced, each 1 Da heavier than the preceding peak; the second peak about 2% of the intensity of the first peak, and the third about 0.5% of the intensity of the second peak. As the number of carbon atoms in the molecule increases, more peaks are possible, and the relative intensities of the peaks change, so that a line passing through the height of each peak can resemble a bell shape or a discrete pseudo-Gaussian shape, as can be seen in Fig. 1.
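
As a purely illustrative aid (not part of the specification), the relative intensities under this simplified two-isotope model follow a binomial distribution in the number of 13C atoms; the short Python sketch below, with illustrative function and parameter names, reproduces the approximate ratios quoted above.

    from scipy.stats import binom

    def isotopic_intensities(n_carbons, p_13c=0.01, n_peaks=4):
        """Relative intensities of the first n_peaks isotopic peaks for a molecule
        with n_carbons carbon atoms, scaled so the monoisotopic peak is 1.0,
        assuming 1% 13C and no other isotopically mixed elements."""
        pmf = [binom.pmf(k, n_carbons, p_13c) for k in range(n_peaks)]
        return [p / pmf[0] for p in pmf]

    print(isotopic_intensities(1))  # second peak roughly 1% of the monoisotopic peak
    print(isotopic_intensities(2))  # second peak ~2% of the first, third ~0.5% of the second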

[0035] Some forms of mass spectrometry exhibit a further convolution with regard to charge state. The spectra shown in Figs. 1 and 2 appear to all be from singly-charged ions, but many forms of mass spectrometry will produce spectra containing multiply charged ions, where for a charge state higher than 1, the m/z values of an ion's peaks are divided by the charge state, resulting in the spectrum being compressed and shifted to a lower mass value.

[0036] Atomic isotopes of a particular atom differ from each other in mass by very nearly 1 Da. Molecules identical to each other except for charge differ from each other by one integral charge. (Charged molecules can often differ from each other by the mass of a proton as well.) In contrast, the broader category of convolutions considered here, including chemical convolution, involves changes in mass that are not required to be an integral multiple of 1 Da, nor do said convolutions represent the scaling of all of a molecule's peaks by an integral factor.

[0037] The present invention applies to mass spectra that exhibit either or both of charge state and isotopic convolution, but the present invention differs from methods that have been developed to date to specifically address charge state and isotopic convolution, in that the present invention addresses a broader class of convolutions in data, including data not produced by a mass spectrometer. For example, chemical distributions can arise from a situation similar to isotopic distributions, where two or more chemical structures are present in a sample at some ratio, a chemical structure that has a particular adduct added to it a variable number of times, or a similar process. Thus, one important distinction of the present invention is that convolutions at peak spacings other than 1 Da can be considered.

[0038] A further important characteristic of the present invention is that chemical distributions can arise from processes with distribution characteristics that are substantially different from the characteristics of adding integral multiples of some value to an existing spectral value. For example, chemical degradation or post-translational modification can remove one or more carbons from a fatty acid chain, resulting in a distribution of peaks with lower mass or other spectral values than the most prominent peak, or some other pattern, so that the resulting peak distribution has significantly different peak amplitudes compared to processes resembling isotopic convolution.

[0039] A further important characteristic of the present invention is that peaks for a single molecule (or similar single object of interest) can exhibit multiple convolutions with different spacings that are not integral multiples of each other, which can result in a complex pattern of overlain convolved peaks. Furthermore, a single molecule or similar object of interest can be subjected to one or more processes that increase mass (or other spectral value) and to one or more processes that decrease mass or other spectral value, so that the resulting peak distribution has significantly different peak amplitudes compared to processes resembling isotopic convolution, such as a bimodal distribution or some other shape.

[0040] Some processes that produce one or more chemical distributions are stochastic processes. For example, some of the aforementioned processes have the following properties: a chemical species may be modified multiple times, each instance of modification is substantially statistically independent of other instances of the same modification, and each instance of the modification changes the mass of the species by a fixed increment. The aforementioned processes can often be best understood as stochastic processes, such as Lévy processes. We will therefore refer to such mechanisms as "stochastic mechanisms." We refer herein to peaks produced by stochastic mechanisms as "stochastic peak groups." When a stochastic peak group has an exemplar or nominal peak and/or when a stochastic peak group has been replaced by a smaller number of peaks (possibly a single peak), we refer herein to the replacement peaks as "stochastic base peaks."

[0041] One skilled in the art will appreciate that, for any particular stochastic peak group, the actual process or processes comprising the corresponding stochastic mechanism may not be known. For example, an isotopic distribution may be caused by the complex interaction of physical processes in a large area of space over the last several billion years. Said processes are not known for certain and can never be described in great detail. Yet the result is still identifiable as a stochastic process. Stochastic processes, when thought of as collections of variables, may comprise observed variables, latent variables, or a combination of observed and latent variables. Therefore, stochastic mechanisms corresponding to one or more stochastic processes may be attributable to latent variables, observed variables, or both.

[0042] It will be appreciated by one skilled in the art that stochastic mechanisms produce modifications in systems other than chemical systems and in data other than spectral data or mass spectral data.

[0043] In certain preferred embodiments of the present disclosure, a method comprises applying an integral transform to spectral data, wherein the spectral data comprises one or more MALDI mass spectra and the integral transform is an autocorrelation, and peaks in autocorrelation space are selected as features, producing new synthetic variables. In further embodiments, any of the aforementioned methods further comprise applying consensus feature learning to the new synthetic variables. In still further embodiments, any of the aforementioned methods further comprise training support vector machine ensembles on new synthetic data, either with or without application of consensus feature learning, to produce binary classifiers.
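
By way of a minimal, purely illustrative sketch (not the code of Figs. 5A and 5B; the function name and the use of NumPy and SciPy are assumptions), an autocorrelation of a spectrum's intensity vector can be computed and its peaks taken as candidate synthetic variables as follows; the consensus feature learning and support vector machine ensemble steps mentioned above are not shown.

    import numpy as np
    from scipy.signal import find_peaks

    def autocorrelation_features(intensities):
        """Autocorrelation of a (uniformly sampled) spectrum at nonnegative lags,
        normalized to the zero-lag value, with local maxima returned as candidate
        feature positions in autocorrelation space."""
        x = np.asarray(intensities, dtype=float)
        x = x - x.mean()
        ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0, 1, 2, ...
        if ac[0] != 0:
            ac = ac / ac[0]
        peak_lags, _ = find_peaks(ac)
        return ac, peak_lags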

[0044] Processes described herein, such as isotopic substitution of elements with two significant stable isotopes and enzymatic production of fatty-acid chains, are stochastic mechanisms. It will be appreciated by one skilled in the art that stochastic mechanisms normally produce distinctive stochastic peak groups in mass spectra: a series of peaks spaced at approximately fixed mass intervals relative to each other, with the relative intensities of peaks in the peak group tending to approximate some fixed distribution. Many stochastic mechanisms tend to produce a stochastic peak group where the peak intensities follow a unimodal distribution; herein we call these "unimodal stochastic mechanisms." However, some stochastic peak groups will exhibit a distribution that is not unimodal, because of the stochastic mechanism, because the stochastic peak group is caused by a composition of stochastic mechanisms, because of noise, or because of any of various other reasons familiar to those skilled in the art.

[0045] In one embodiment of the present invention, stochastic peak groups are restricted to a unimodal distribution, so that a bimodal stochastic peak group can be recognized as the superposition of two unimodal stochastic peak groups; for example, a bimodal distribution at 1 Da can be recognized as a superposition of two isotopic distributions with two different principal ions. One skilled in the art will appreciate that the use of this restriction depends on a variety of factors, among them the characteristics of the stochastic mechanism, the analytical task, the instrumentation used, and the characteristics of the samples analyzed.

[0046] Returning to Fig. 1 by way of example, each of the peaks 10-26 comprises a stochastic peak group with a spacing of about 1 Da, which can be explained by multiple isotopic distributions in the sample. Further considering Fig. 1 by way of example, the peaks 10-26 as a whole comprise a stochastic peak group with a spacing of about 14 Da, which can be explained by a chemical distribution.

[0047] The relationship between the intensities of the peaks in a stochastic peak group often varies little depending on the absolute intensity of the stochastic peak group, so that for example the average ratio of the intensities between the largest and second largest peaks of a chemical distribution peak group tends to not change very much for different absolute peak intensities. Thus, the individual peaks in stochastic peak groups, considered as features, are often strongly correlated features.

[0048] If a group of peaks is observed to have the characteristics described above for a stochastic peak group, it is often useful to assume the group of peaks is in fact a stochastic peak group and was caused by a stochastic mechanism, even if no stochastic mechanism can be definitely shown to have caused the peak group, and even if no plausible stochastic mechanism can be identified that could have caused the peak group. Therefore, if a group of peaks has the characteristics of a stochastic peak group, it is often useful, when considering the peaks as features, to treat the peaks as likely highly correlated features, even if such correlation is not demonstrated, known, or knowable. Herein, we will treat groups of peaks having the characteristics of stochastic peak groups as actual stochastic peak groups, even if they have not been shown to be caused by a stochastic mechanism.

[0049] The peaks in a stochastic peak group, when considered as features, are likely to be highly correlated variables, and a feature selection process should usually avoid selecting all or many of the peaks in a stochastic peak group as features. In at least one embodiment of the present invention applicable to the aforementioned situation, the highest intensity peak in a stochastic peak group is used alone as a feature. However, for some stochastic processes, because of random variation, the peak in the stochastic peak group with the highest intensity may vary from one spectrum to the next.

[0050] In at least one embodiment of the present invention, "stochastic grouping" is applied to the aforementioned situation: a synthetic variable is constructed from the stochastic peak group, and this synthetic variable is used as a feature. For example, in mass spectrometry, stochastic grouping can be used to construct a synthetic "peak" variable with a peak intensity that is the sum (or some other mathematical function) of the peak intensities of the stochastic peak group, and said synthetic peak has a mass that is equal to the mass value of the "principal" peak of the stochastic peak group; said principal peak may be the median peak, nominal peak, or principal ion of the stochastic peak group, depending on the relative ratios of peak intensities in the stochastic peak group. The peaks in the stochastic peak group can then be replaced in the data with the synthetic peak.
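
A minimal sketch of this replacement step, offered only as illustration (the choice of the most intense peak as the principal peak and of summation as the combining function are assumptions; the specification allows other choices):

    import numpy as np

    def stochastic_group_to_synthetic_peak(masses, intensities, combine=np.sum):
        """Collapse one stochastic peak group into a single synthetic peak whose
        intensity is combine() of the group's intensities and whose mass is taken
        here from the group's most intense peak."""
        masses = np.asarray(masses, dtype=float)
        intensities = np.asarray(intensities, dtype=float)
        principal = int(np.argmax(intensities))
        return float(masses[principal]), float(combine(intensities))

    # e.g. an isotopic group with three peaks 1 Da apart:
    # stochastic_group_to_synthetic_peak([1464.0, 1465.0, 1466.0], [100.0, 55.0, 16.0])
    # -> (1464.0, 171.0)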

[0051] In different embodiments of the present invention, a synthetic variable is constructed by use of one of a variety of mathematical functions. One skilled in the art will appreciate that the choice of the mathematical function used depends on the characteristics of the stochastic mechanism and the resulting stochastic peak group.

[0052] Identifying groups of peaks at fixed intervals as stochastic peak groups may thus facilitate selection, from data containing stochastic peak groups, of a set of features that are not highly correlated, even when the underlying chemical or physical mechanism that produced the stochastic peak group is not known.

[0053] In one embodiment of the present invention, stochastic peak groups are identified in a spectrum or in spectra by first constructing a binary-tree data structure containing a node for every peak in said spectrum or spectra, according to the following rules: the intensity of the peak corresponding to a given node is larger than the intensity corresponding to each of its child nodes; and the mass corresponding to any node is larger than the masses of every node contained in its left subtree, and smaller than the masses of every node contained in its right subtree.
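
For illustration only (not the code of Figs. 5A and 5B; the class and function names are illustrative), a tree satisfying both rules can be built recursively from a peak list sorted by mass, as in the following Python sketch:

    class PeakNode:
        def __init__(self, mass, intensity):
            self.mass, self.intensity = mass, intensity
            self.left = None   # subtree of peaks with smaller mass
            self.right = None  # subtree of peaks with larger mass

    def build_peak_tree(peaks):
        """Build the tree of paragraph [0053] from (mass, intensity) pairs sorted by
        mass: the most intense peak becomes the root (intensity rule), and the peaks
        on either side of it by mass form the left and right subtrees (mass rule)."""
        if not peaks:
            return None
        root_index = max(range(len(peaks)), key=lambda i: peaks[i][1])
        node = PeakNode(*peaks[root_index])
        node.left = build_peak_tree(peaks[:root_index])
        node.right = build_peak_tree(peaks[root_index + 1:])
        return node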

[0054] If a binary tree with the properties described above has been constructed for a spectrum, stochastic peak groups of a particular spacing can be identified by recursively traversing the binary tree. In Fig. 5 can be seen Python computer code for identifying stochastic peak groups using the binary-tree method described herein. One skilled in the art will immediately appreciate that a flow chart may be extracted from the computer code in Fig. 5, beginning with the function periodic_repeat_structure.

[0055] Further describing the operation of the code in Fig. 5 and the equivalent flow chart, each tree traversal identifies stochastic peak groups at a particular spacing. Therefore, the traversal process may be repeated multiple times to identify different stochastic peak groups, without rebuilding the tree. Furthermore, patterns of peaks at a given spacing may be identified, and replaced in the spectra with a new peak with a mass and intensity determined as a function of the corresponding stochastic peak group. After the replacement process is carried out, the process of Fig. 5 may be repeated as many times as desired with different spacings to identify more stochastic peak groups.

[0056] For example, considering some mass spectrum, all stochastic peak groups with a spacing of 1 Da (corresponding to isotopic peak groups, as described above) may be identified and replaced with a peak corresponding to the principal ion of the isotopic distribution, and the resulting spectrum then searched for stochastic peak groups at 14 Da spacing, corresponding to chemical distribution peak groups of one carbon and two hydrogen atoms, such chemical distribution peak groups replaced by a principal peak as described above. In at least one embodiment of the present invention, stochastic peak groups found in a spectrum are deleted from the spectrum, replaced in the data by a single peak, and/or replaced in the data by some other feature or features.

[0057] One skilled in the art will appreciate that the data structure described above as a "binary tree" need not strictly be a binary tree, and different embodiments of the present invention perform the preceding method using one of a variety of data structures.

[0058] As described above, a stochastic peak group in any particular spectrum may not contain certain peaks present in an equivalent stochastic peak group in one or more other spectra, even if all of said spectra are produced by practically identical samples. Additionally, any given sample may produce additional peaks that may appear to be part of a stochastic peak group, as a result of noise in any part of the analysis process, and/or contamination during sample preparation or any other part of the analysis process. Furthermore, the measured mass or other spectral value of any peak will vary from spectrum to spectrum, due to random variation, machine calibration, and other causes known to those skilled in the art. Thus, the peaks or other features used to develop a model can usually be best determined from analysis of multiple spectra from multiple samples.

[0059] One technique for selecting relevant mass regions, as described elsewhere in this disclosure, involves taking the mass of every peak present in a group of spectra, optionally after substitution of stochastic peak groups as described above, and identifying regions in the group of spectra where at least some number of spectra or proportion of spectra contain at least one peak or at least some other number of peaks. For example, a clustering algorithm such as DBSCAN [Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), Evangelos Simoudis, Jiawei Han, and Usama Fayyad (Eds.). AAAI Press, 226-231] can be used to identify spectral intervals in which some number of spectra or some proportion of the spectra contain at least one peak or at least some other number of peaks.
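
As an illustrative sketch only (the use of scikit-learn's DBSCAN, the eps tolerance, and the function name are assumptions, not requirements of the method), 1-dimensional clustering of pooled peak masses might look like the following:

    import numpy as np
    from sklearn.cluster import DBSCAN

    def consensus_mass_regions(peak_masses_per_spectrum, eps=0.5, min_spectra=5):
        """Pool the peak masses from a group of spectra and return (low, high) mass
        regions in which at least min_spectra distinct spectra contribute at least
        one peak; eps is the mass tolerance (here in Da) used by DBSCAN."""
        masses, spectrum_ids = [], []
        for sid, peaks in enumerate(peak_masses_per_spectrum):
            masses.extend(peaks)
            spectrum_ids.extend([sid] * len(peaks))
        masses = np.asarray(masses, dtype=float).reshape(-1, 1)
        spectrum_ids = np.asarray(spectrum_ids)
        labels = DBSCAN(eps=eps, min_samples=min_spectra).fit(masses).labels_
        regions = []
        for label in set(labels) - {-1}:                         # -1 marks noise points
            members = labels == label
            if len(set(spectrum_ids[members])) >= min_spectra:   # count spectra, not peaks
                regions.append((float(masses[members].min()), float(masses[members].max())))
        return sorted(regions)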

[0060] Some partitions of the type described above produce data that can be described as "binned" or "quantized" data.

[0061] In at least one embodiment of the present invention, spectral intervals as described above may be selected by partitioning a spectrum (or equivalently, a spectral dimension) into continuous intervals of equal size, and then selecting intervals meeting the criteria described. In at least one embodiment of the present invention, spectral intervals as described above can be constructed by partitioning a spectrum into intervals of unequal size, disjoint intervals, intervals that overlap other intervals, or intervals defined by some other topology.

[0062] In at least one embodiment of the present invention, a spectrum is partitioned into intervals by convolving the spectrum or some interval of the spectrum with a function or kernel. For example, in at least one embodiment of the present invention, a first spectrum is convolved with a series of Gaussian functions to produce a second spectrum with fewer variables comprising overlapping intervals of the first spectrum. Depending on the Gaussian functions chosen, the second spectrum may have desirable properties including any or all of: less sensitivity to noise or error in the spectral dimension, fewer variables, and less noise in one or more nonspectral dimensions such as amplitude.
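
A minimal sketch of such a Gaussian partitioning (the function name, the choice of kernel centers, and the width parameter are illustrative assumptions):

    import numpy as np

    def gaussian_bin_features(mz, intensity, centers, sigma):
        """Project a spectrum onto a series of Gaussian kernels, one per center,
        producing a shorter vector of overlapping-interval features; each feature
        is the Gaussian-weighted sum of the original intensities."""
        mz = np.asarray(mz, dtype=float)
        intensity = np.asarray(intensity, dtype=float)
        features = []
        for c in centers:
            weights = np.exp(-0.5 * ((mz - c) / sigma) ** 2)
            features.append(float(np.sum(weights * intensity)))
        return np.asarray(features)

    # e.g. one kernel every 2 m/z units across the measured range:
    # features = gaussian_bin_features(mz, intensity,
    #                                  centers=np.arange(mz.min(), mz.max(), 2.0), sigma=1.5)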

[0063] In addition to the 1-dimensional clustering described above, at least one embodiment of the present invention is applied to higher-dimensional data. For example, in at least one embodiment of the present invention, each peak in a mass spectrum is represented as a two-dimensional vector, with coordinates of mass and intensity, with intensity scaled relative to the total ion current of the spectrum, the intensity of the base peak of the spectrum, or some other measure. Further, at least one embodiment of the present invention employs a two-dimensional clustering algorithm to identify two-dimensional regions containing some minimum number of spectra, similar to that described in the 1-dimensional case above. Further, in at least one embodiment of the present invention, each two-dimensional region is then used to determine a corresponding 2-dimensional or 1-dimensional interval, which may or may not be disjoint.

[0064] As a further example, in at least one embodiment of the present invention, an integral transform such as a Fourier transform is applied to a mass spectrum, resulting in a vector of complex numbers. In at least one embodiment of the present invention, the resulting data is treated as a vector of complex variables, but equally, in at least one other embodiment of the present invention, the resulting data is treated as a 2-dimensional array, one axis of the array being frequency and the other axis distinguishing the real and imaginary components at each frequency.
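
By way of illustration (assuming NumPy and a spectrum resampled to a uniform spacing; the function name is illustrative), both treatments can be produced as follows:

    import numpy as np

    def spectrum_fourier_features(intensities):
        """Fourier transform of a uniformly resampled spectrum, returned both as a
        complex vector (one value per frequency) and as a 2-column real array
        holding the real and imaginary parts of each frequency component."""
        x = np.asarray(intensities, dtype=float)
        ft = np.fft.rfft(x)
        as_2d = np.column_stack([ft.real, ft.imag])
        return ft, as_2d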

[0065] As a further example, in at least one embodiment of the present invention, a multi-dimensional dataset is constructed from multidimensional chromatography, from tandem mass spectrometry, from MSn mass spectrometry, from hyphenated systems such as LC-MS, from combinations of such techniques, or from other techniques yielding multidimensional data.

[0066] In at least one embodiment of the present invention, a set of mass regions identified by the method outlined above is used to select features from a spectrum. For example, in at least one embodiment of the present invention, a many-dimensional vector is constructed, with each mass interval corresponding to a component of the vector, the value of that component being the value of the largest-intensity peak present in the spectrum within the corresponding spectral region. Alternately, in at least one other embodiment of the present invention, the aforementioned vector component may be taken to be the sum or other function of all signal or of peak intensities in the spectrum inside the corresponding mass region. In at least one embodiment of the present invention, the aforementioned vector may be normalized to a fixed length using one or more of a variety of techniques, such as various p-norms.
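
A brief illustrative sketch of this feature mapping (the reducing function, the p-norm, and the function name are assumptions chosen for the example; the specification permits other choices):

    import numpy as np

    def region_feature_vector(mz, intensity, regions, reduce=np.max, p=2):
        """One feature per (low, high) mass region: reduce() of the peak intensities
        falling inside the region (maximum by default, zero if the region is empty),
        with the resulting vector normalized to unit p-norm."""
        mz = np.asarray(mz, dtype=float)
        intensity = np.asarray(intensity, dtype=float)
        values = []
        for low, high in regions:
            inside = (mz >= low) & (mz <= high)
            values.append(float(reduce(intensity[inside])) if inside.any() else 0.0)
        v = np.asarray(values)
        length = np.linalg.norm(v, ord=p)
        return v / length if length > 0 else v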

[0067] Alternately, in at least one embodiment of the present invention, the value of a vector component as described above is determined as a synthetic feature, calculated as the definite numerical integral of a spectrum or of spectra over a spectral region, as the definite numerical integral of a function applied to the spectral region, or by numerical calculation of any functional norm of the class commonly referred to as Lebesgue p-norms, including those with various weight factors.

[0068] In Fig. 3 can be seen mass spectra for 6 samples analyzed separately, the resulting spectra shown in one graph to facilitate explanation. In Fig. 3, a power cepstrum transform has been applied to the spectra shown. The 6 samples shown in Fig. 3 predominantly comprised lipid membranes of three different microorganisms, two different samples per organism, as follows: Escherichia coli 46 and 48, Francisella novicida 50 and 52, Pseudomonas aeruginosa 54 and 56. One skilled in the art will appreciate that different vertical offsets have been applied to the spectra for clarity.
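
For orientation only (not the processing actually used to produce Fig. 3; the function name and the small eps guard are assumptions), a conventional power cepstrum of a uniformly resampled spectrum can be computed as:

    import numpy as np

    def power_cepstrum(intensities, eps=1e-12):
        """Power cepstrum of a uniformly resampled spectrum:
        |IFFT(log(|FFT(x)|^2))|^2, with eps guarding against log(0)."""
        x = np.asarray(intensities, dtype=float)
        power = np.abs(np.fft.fft(x)) ** 2
        return np.abs(np.fft.ifft(np.log(power + eps))) ** 2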

[0069] Considering each pair of spectra for a single species, it can be seen in Fig. 3 that some peaks in the transformed spectra for one sample are aligned with peaks in the transformed spectra for the other sample. For example, transformed spectra 54 and 56 have peaks shown at 58 that are aligned. Not all peaks appear in all transformed spectra; for example, transformed spectrum 54 has a peak shown at 60, but transformed spectrum 56 does not. If the amplitude of the original spectrum from which transformed spectrum 56 was derived had been greater, then it is possible that a peak would be seen at 60 for spectrum 56. Regardless, the present invention does not require that transformed spectra such as 54 and 56 have identical peaks.

[0070] It may be further seen in Fig. 3 that some peaks but not all peaks in a single particular transformed spectrum are aligned with peaks in transformed spectra from species different from that of the single particular transformed spectrum. For example, transformed spectrum 54 has a peak at 58, but transformed spectra 46-52 do not have a peak at 58. Transformed spectra 46 and 48 have a peak at 62, but the horizontal value of the peak 62 is offset about 10% from the peak at 58. Some peaks in a particular transformed spectrum are aligned with peaks in one or more transformed spectra from different species. For example, spectrum 56 has a peak at 64 that is aligned with peaks in transformed spectra 46-54.

[0071] One skilled in the art will appreciate that because the spectra in Fig. 3 have been subjected to a transform in the frequency domain, a significant change in the accuracy of the data in a spectral dimension does not produce as significant a change in the transformed spectra. For example, if the original mass spectra used to produce Fig. 3 were shifted left or right, as would happen from a change in calibration of the instrument, the effect on Fig. 3 would be minimal, even if the shift was several Da or more. If the original spectra were compressed or stretched, then the spectra shown in Fig. 3 would show a corresponding compression or stretch, but in at least one embodiment of the present invention, such a compression or stretch is compensated for by a compensating transformation, and such compensating transformation determined for a particular class of sample by the spacing of stochastic peak groups in untransformed spectra.

[0072] One skilled in the art will appreciate that changes in the relative amplitudes of peaks in an original spectrum will have comparatively less effect on spectra such as those shown in Fig. 3 as compared to untransformed spectra. Data such as mass spectrometry and other spectral data often show significant random variation in peak intensity (measured as the height of the peak or in a variety of other ways that will be familiar to one skilled in the art). The method of Fig. 3 is less affected than other approaches by amplitude variation in data, which is an advantage of the present invention.

[0073] In at least one embodiment of the present invention, a simple dot product is used to compare spectra such as those shown in Fig. 3. However, in at least one embodiment of the present invention, spectra such as those in Fig. 3 are compared using other methods comprising the present invention, and/or neural networks, support vector machines, decision trees, random forests, and/or other methods familiar to those skilled in the art.
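
Paragraph [0073] mentions a simple dot product as one comparison; as an illustrative sketch (scaling by vector length, and the function name, are added assumptions), such a comparison can be written as:

    import numpy as np

    def dot_product_similarity(a, b):
        """Dot product of two transformed spectra, scaled by the vectors' lengths so
        the result lies in [-1, 1]; larger values indicate better peak alignment."""
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom > 0 else 0.0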

[0074] In Fig. 4 can be seen mass spectra for 3 samples analyzed separately, the resulting spectra shown in one graph to facilitate explanation. The 3 samples shown in Fig. 4 predominantly comprised certain chemical components of lipid membranes of three different microorganisms as follows: Enterococcus faecalis 66, Staphylococcus epidermidis 68, Escherichia coli 70. One skilled in the art will appreciate that, as is sometimes done for clarity, spectrum 66 has been offset vertically by +700 intensity and spectrum 70 has been inverted and offset vertically by -200 intensity. These offsets are merely to show the spectra clearly without overlapping.

[0075] It can be observed in Fig. 4 that most peaks are aligned between the three spectra. For example, every prominent peak in spectrum 66 is well-aligned with a similar peak in spectrum 70. It can be further observed that not every peak in spectrum 66 is well-aligned with a similar peak in spectrum 68. However, peaks in the three spectra are more aligned than not aligned, even though the spectra were produced by samples of 3 different organisms, and these and other samples can be distinguished, on the basis of these and similar spectra, as to which organisms they contain.

[0076] We say that two peaks in two different spectra are "aligned" when the mass values of the peaks are substantially the same. For example, in two mass spectra, if the difference in mass value between two peaks, one in each of the two spectra, is less than the mass resolution of the two spectra, the two peaks can be said to be "aligned." Equally, two peaks that are aligned can also be said to "correspond." If the difference in mass values between two peaks is less than the difference in mass value between two other peaks or between one of the first two peaks and some other peak, the first two peaks can be said to be "well-aligned" compared to the other pairs of peaks. If the difference in mass values between two peaks in two spectra is less than the average difference in mass value between any peak in either of the two spectra and the nearest peak in the other spectrum, then the peaks may be said to be "well-aligned in the two spectra," or simply to be "well-aligned."
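
A small illustrative sketch of the first definition above (the function name and the nearest-peak search are assumptions; the tolerance is expressed here directly in Da for simplicity):

    def aligned_pairs(masses_a, masses_b, tolerance_da):
        """Pair each peak mass in one spectrum with the nearest peak mass in another
        spectrum and keep only the pairs whose mass difference is below the given
        tolerance, i.e. the pairs of peaks treated as "aligned"."""
        pairs = []
        for ma in masses_a:
            mb = min(masses_b, key=lambda m: abs(m - ma))
            if abs(mb - ma) < tolerance_da:
                pairs.append((ma, mb))
        return pairs

    # e.g. aligned_pairs([1000.2, 1014.3], [1000.4, 1500.0], tolerance_da=0.5)
    # -> [(1000.2, 1000.4)]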

[0077] In discussing Fig. 4 and throughout this application, reference is made to "mass," "mass peaks," and the like. It will be appreciated by one skilled in the art that mass spectra are used in this application by way of example, and that Fig. 4 is a graphical display of a vector of numbers, each position in the vector often referred to as a "variable" in statistics. To make the discussion easier to follow, it is often not noted herein that mass is only one type of category that could comprise the variables of the vector, shown as different horizontal positions in Fig. 4 and elsewhere. For example, the variables could instead be different frequencies of electromagnetic radiation, the graph could be the spectrum produced by a spectrophotometer, and the peaks identified could be emission and/or absorption peaks in the spectrum. As a further example, the variables could be different points in time or intervals of time. As an instance of this further example, the graph could be a chromatogram. Alternatively, the variables could be or represent something else entirely. The present invention applies to all of these cases and to any such variables, and is not limited to mass spectrometry nor to mass spectra.

[0078] It can be observed in Fig. 4 that the most intense peaks in each spectrum are not aligned with the most intense peaks in either of the other two spectra. For example, the largest peak 72 in spectrum 66 is not aligned with the largest peak in either spectrum 68 or 70. As a further example, the largest peak 74 in spectrum 70 is not aligned with any apparent peak in spectrum 66 or 68. It will be appreciated by one skilled in the art that if the method used to produce such spectra results in similar spectra for each species when a sample of that species is analyzed, then consistent differences in the spectra can be used to distinguish a sample containing one species from a sample not containing that species. Further, in at least one embodiment of the present invention, such spectra are used to uniquely identify a species, detect when more than one species is present in a sample, and/or identify some or all of the species present in a sample containing multiple species.

[0079] However, it may happen that not all peaks can be consistently or confidently said to be present in all samples. For example, at 76 can be seen a small increase in signal intensity of spectrum 66. In spectrum 66, it would not normally be possible to tell if there is a peak at 76 or merely noise. If, however, the overall intensity of spectrum 66 was greater or if noise was less, it may be possible to determine if a peak is present at 76. Thus, in the case of Fig. 4, the overall signal intensity of a sample affects which peaks can be identified in a spectrum of the sample. It will be appreciated by one skilled in the art that peaks may be present or absent for any of a variety of other reasons. For example, preparation of the sample may cause chemical degradation, and the chemical degradation may show random variation or be influenced by a wide range of factors such as the temperature of the sample. Furthermore, such chemical degradation may cause peaks that would otherwise be present in a spectrum of the sample to disappear. Alternatively, such chemical degradation may cause peaks to appear that otherwise would not be present in a spectrum of the sample. As a further alternative, degradation may cause both the appearance and the disappearance of peaks. Consequently, variation in chemical degradation may cause variation in what peaks are apparent in a spectrum for a sample. There are other possible mechanisms that may occur in a multitude of combinations instead of or in combination with variation in sample preparation, including variation in a biological specimen or its environment.

[0080] One skilled in the art will thus appreciate that attempting to exactly match a sample mass spectrum to a mass spectrum from a known species will often not identify the species, because the peaks in the two spectra will not always, or perhaps not ever, be identical. One skilled in the art will further appreciate that data such as mass spectra are often wide data sets, typically containing thousands of distinct mass variables, or more. Mass spectra and similar data can contain hundreds of thousands or in some cases even millions of variables, each representing an intensity of ions with a specific mass value. Therefore, although spectra such as 66-70 have obvious differences from each other, if a spectrum similar to those shown in Fig. 4 is obtained from a sample of an unknown species, it is difficult to identify with high sensitivity (low false negative rate) and high specificity (low false positive rate) which species is or are present. If an unknown sample is obtained containing a species that may be one of the three species shown in Fig. 4 but might be some other species, then identifying one of the three species if they are present is even more difficult.

[0081] Returning to Fig. 3, it can be seen that a transform has been applied to Fig. 3 that causes the data in Fig. 3 to have very different properties than the data in Fig. 4. However, it can be seen that some peaks in Fig. 3 are aligned, so the presently discussed technique can be applied to data sets obtained and prepared as in Fig. 3, obtained and prepared as in Fig. 4, or obtained and prepared by a wide variety of other methods.

[0082] An important aspect of the problem described above is "overfitting." For example, a model can be trained on multiple spectra, some from samples containing a specific species and some from samples not containing that species. The model training process typically searches for a set of parameters of the model that best separate positive samples (containing the species to be identified) from negative samples (not containing the species to be identified). When the number of variables is large, and especially when the number of variables is larger than the number of samples, then just by chance some variables will be correlated with positive status, even though in a different data set from different samples related to the same species, the same variables are unlikely to be correlated with positive status. This effect, in which a model distinguishes samples partly or entirely using variables that are correlated by chance in the training data, is often called "overfitting" by those skilled in the art.
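
The following sketch, using randomly generated data and the scikit-learn library, illustrates overfitting on wide data: with far more variables than samples, a model can separate the training data almost perfectly even though the labels are pure noise, while performing at roughly chance level on held-out data. The data and parameters are hypothetical and serve only to illustrate the phenomenon.

```python
# Illustration of overfitting on wide data: many more variables than samples.
# The data are pure noise, so any apparent separation is due to chance.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_variables = 40, 5000          # "wide" data
X = rng.normal(size=(n_samples, n_variables))
y = rng.integers(0, 2, size=n_samples)     # labels unrelated to X

X_test = rng.normal(size=(n_samples, n_variables))
y_test = rng.integers(0, 2, size=n_samples)

model = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", model.score(X, y))            # typically near 1.0
print("held-out accuracy:", model.score(X_test, y_test))  # typically near 0.5
```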

[0083] An important method of reducing overfitting or the chance of overfitting is "feature selection." In feature selection, certain variables are eliminated or reduced in importance in the training data, prior to or as part of the model training process.
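
As one common illustration of feature selection generally (and not of the consensus feature learning described below), the sketch below applies a univariate filter that keeps only a small number of variables before model training; the data, the choice of scikit-learn's SelectKBest, and the value of k are illustrative assumptions.

```python
# Example of a simple feature-selection step applied before model training.
# SelectKBest keeps the k variables with the strongest univariate association
# with the class labels; it is one of many possible selection methods and is
# shown only to illustrate the general idea of feature selection.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5000))            # wide training data (illustrative)
y = rng.integers(0, 2, size=40)            # class labels

selector = SelectKBest(score_func=f_classif, k=20)   # keep only 20 variables
pipeline = make_pipeline(selector, LogisticRegression(max_iter=1000))
pipeline.fit(X, y)
print("selected variable indices:", selector.get_support(indices=True))
```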

[0084] An important aspect of the present invention is "consensus feature learning," a method for feature selection on data upon which at least two models could be trained, where each such model would be trained to estimate or model at least one dependent variable, and where the union of said dependent variables is a variable that can in principle be modeled on the data set as a whole.

[0085] For example, the spectra in Fig. 4 are from different species of organisms. A data set containing multiple spectra from each of the three species could be used to train three models, one to recognize each species. If a thing is a specific species, then said thing is also an organism, so the union of the dependent variables is a variable "contains Enterococcus faecalis, Staphylococcus epidermidis, and/or Escherichia coli." This union is a subset of the variable "contains an organism."

[0086] In this example, the above hypothetical data set would not produce a sensitive model for detecting organisms, because only three species of organisms are in the training data, and there are a very large number of organisms. As this example shows, it is specifically not necessary to the present invention that a data set be available that contains data for all or even many of the things comprising the class of the union of models. Nor is it necessary for the data to be sufficient to train a model of any particular type or with any particular characteristics for any particular entire dependent variable or union of the outputs of anticipated models. It is not even necessary that an entire dependent variable be chosen, known, or knowable. It is furthermore not necessary that all of the dependent variables of interest be knowable or chosen, nor is it necessary that the models that are desired to be trained on available or future data sets be knowable or chosen, nor is it even necessary that there is training data available for all or any anticipated models.

[0087] To the contrary, this presently discussed method of the present invention requires only that (i) there exists a set of data, (ii) at least two dependent variables are identified for which models could be trained on the aforementioned training data, (iii) at least one "entire dependent variable" (possibly unknown) exists that a model could be trained on using the entire training data, and (iv) the union of the at least two dependent variables is a subset of the entire dependent variable.
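
A minimal sketch of how conditions (i)-(iv) can be represented for the Fig. 4 example is given below; the arrays and names are hypothetical and illustrate only that the union (logical OR) of the two chosen dependent variables is a subset of the entire dependent variable "contains an organism."

```python
# Hypothetical per-sample labels for the Fig. 4 example.
# Each row is a sample; each dependent variable is a binary label.
import numpy as np

contains_e_coli        = np.array([1, 0, 0, 1, 0])   # dependent variable 1
contains_s_epidermidis = np.array([0, 1, 0, 0, 0])   # dependent variable 2
contains_an_organism   = np.array([1, 1, 1, 1, 0])   # "entire" dependent variable

# Union of the two chosen dependent variables (condition iv): a sample is
# positive if it contains either species.
union = contains_e_coli | contains_s_epidermidis

# Every sample positive for the union is also positive for the entire
# dependent variable, i.e. the union is a subset of it.
assert np.all(union <= contains_an_organism)
```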

[0088] When conditions i-iv are met, the presently discussed method of the present invention can be applied. The result will be a mathematical map from the starting set of data to a space of synthetic features. Herein, we will call the aforementioned map a "consensus feature map." Herein, we will call the mathematical space produced by a consensus feature map a "synthetic feature space."

[0089] In a particular situation, consensus feature learning may produce better results if the number of known models is increased, the number of known models for which training data is available is increased, and/or the amount of training data is increased, but the present invention has no specific requirements for the amount of data, the number of models (beyond a minimum of two), or the types of models.

[0090] Consider a set of spectra S produced by some system, for example mass spectrometry. There is a space 𝒮 such that every spectrum s ∈ S is a vector in 𝒮. We call 𝒮 the "spectrum space." A feature map M for S consists of a synthetic feature space F and a transform vector T. F is a vector space, where each dimension i of the vector is a synthetic feature. T is a vector of functions T_i, such that T_i(s) produces the ith synthetic feature f_i of any s ∈ S.

[0091] Feature mapping is the determination of a feature map M for some S. For example, let S be a set of spectra ranging from 0 to 100 Da in discrete 0.1 Da increments and let F be a space of 100 synthetic variables, each corresponding to a unique 1 Da window from 0 to 100 Da. Let s be a vector where s_j is the jth spectral position of the spectrum represented by s. T_i is then a unit pulse function: T_i(s_j) = 1 if i ≤ j < i+1, and T_i(s_j) = 0 otherwise.
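
A minimal sketch of this unit-pulse (binning) feature map is given below, assuming the spectrum is stored as an array of intensities sampled every 0.1 Da and interpreting each synthetic feature as the aggregate of the spectral positions falling within its 1 Da window; the names, shapes, and the choice of summation as the aggregate are illustrative assumptions rather than limitations.

```python
# Sketch of the unit-pulse feature map described above: each synthetic
# feature i aggregates the spectral positions s_j with i <= j < i+1 Da.
import numpy as np

def unit_pulse_feature_map(spectrum, step_da=0.1, window_da=1.0):
    """Map a spectrum sampled every `step_da` Da onto 1 Da synthetic features."""
    points_per_window = int(round(window_da / step_da))   # e.g. 10 samples per window
    n_windows = len(spectrum) // points_per_window        # e.g. 100 windows
    usable = spectrum[: n_windows * points_per_window]
    return usable.reshape(n_windows, points_per_window).sum(axis=1)

# Example: a spectrum of 1000 intensities covering 0 to 100 Da in 0.1 Da steps.
s = np.random.default_rng(1).random(1000)
features = unit_pulse_feature_map(s)
print(features.shape)   # (100,) -- one synthetic feature per 1 Da window
```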

[0092] It will be appreciated by one skilled in the art that while the aforementioned example is a practical example, the functions T_i are not limited to unit pulse functions and can in principle be any function of s.

[0093] Herein, "consensus feature learning" is the determination of a feature map M for some S using the method described above for determining one or more synthetic features, where a "consensus" of positive sample data produces the same value for each said synthetic feature.

[0094] Variables selected via consensus feature learning will tend to produce different models than will variables selected randomly or by some other method. For example, variables selected via consensus feature learning, when applied to a training data set (the data set from which the features were learned or some other data set), may result in models that are better classifiers than if different variables are used. In addition or alternatively, the result may be models that are less overfit than if different variables are used.

[0095] When we talk herein about one group of models being better by some measure than another group of models, we mean that the average performance of the former group is better than that of the latter group, that the variance of the former group is lower than that of the latter, that the former group on average or as a whole has some other advantageous property compared to the latter, or some combination of the preceding.

[0096] The union of the at least two dependent variables may be exactly the entire dependent variable, but this is not necessary and often is not possible. Returning to Fig. 4 by way of example, a hypothetical data set may exist (for i) containing mass spectra from samples of a multitude of species of microbial organisms. It is possible that it is known what species are present in each sample, but it is also possible that some of the species in the aforementioned samples are known while some of the species in the aforementioned samples are not known. Further by way of example, we might select two dependent variables, "contains Escherichia coli" and "contains Staphylococcus epidermidis," for which models could be trained (for ii). Whether it is known or not, "contains an organism" is an entire dependent variable for the aforementioned hypothetical data set, and the data set could be used to train a model for this dependent variable (for iii), and the union of "contains Escherichia coli" and "contains Staphylococcus epidermidis" is a subset of "contains an organism" (for iv). Thus, the conditions of the present invention are met in this example.

[0097] In at least one embodiment of the present invention, the aforementioned feature selection technique is applied to a particular vector or multidimensional array, such vector or multidimensional array representing a series or other collection of measurements. In at least one embodiment of the present invention, the aforementioned feature selection technique is applied to a data set constructed by performing one or more mathematical operations on one, some, or all dimensions of a particular vector or multidimensional array as described above. In at least one embodiment of the present invention, each case of the aforementioned mathematical operation is an integral transform, a Fourier transform, an inverse Fourier transform, a Laplace transform, a Z-transform, a cepstrum, a real cepstrum, a phase cepstrum, a complex cepstrum, or some other mathematical operation. In at least one embodiment of the present invention, the aforementioned feature selection technique is applied to synthetic variables that are a mathematical function of variables present in the original data, or to a mixture of original and synthetic variables. One skilled in the art will appreciate that, in any particular data, some or all of the aforementioned synthetic variables may be synthetic peaks corresponding to stochastic peak groups, as described above.

[0098] Herein, except where noted or obvious by context, by "model" we mean a Machine Learning Model, a structure and corresponding interpretation that summarizes or partially summarizes a set of data. Often, a model is a computational or mathematical rule which may be called a "function" or "mapping." Depending on the type of model and how it is used, a model may be or may be a critical component of a "classifier," "regressor," or "estimator".

[0099] Herein, except where noted or obvious by context, by "modeling" of data, we mean the construction of a model. Depending on the modeling technique used and the modeling task, modeling may be described as or comprise "training," "learning," "machine learning," or "induction." The learning process may be supervised or unsupervised, as those terms are used in machine learning by those skilled in the art.

[0100] Herein, except where noted or obvious by context, by "estimator," we mean an estimator in the machine learning sense: a function for calculating a particular dependent variable from one or more independent variables. The independent and dependent variables may be of any kind, including continuous, binary, or categorical variables. Some estimators can be constructed via machine learning.

[0101] Fig. 6 illustrates a method for feature mapping from spectra. The operations of the method in Fig. 6 are intended to be illustrative. In at least one embodiment of the present invention, feature mapping may be accomplished by performing one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of feature mapping are illustrated in Fig. 6 and described below is not intended to be limiting.

[0102] In some embodiments, feature mapping may be implemented in one or more processing devices (e.g., a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information). The one or more processing devices may include one or more devices executing some or all of the operations of feature mapping in response to instructions stored electronically on an electronic storage medium. The one or more processing devices may include one or more devices configured through hardware, firmware, and/or software to be specifically designed for execution of one or more of the operations of feature mapping.

[0103] In Fig. 6, feature mapping begins with spectra 78. From the spectra 78, stochastic grouping 80 produces spectra with stochastic base peaks 82. From spectra with stochastic base peaks 82, consensus feature learning 84 produces a feature map 86.

[0104] Fig. 7 illustrates a method for applying a feature map to a group of spectra. The operations of the method in Fig. 7 are intended to be illustrative. In at least one embodiment of the present invention, applying a feature map to a group of spectra may be accomplished by performing one or more additional operations not described, and/or without performing one or more of the operations discussed. Additionally, the order in which the operations of applying a feature map to a group of spectra are illustrated in Fig. 7 and described below is not intended to be limiting.

[0105] In some embodiments of the present invention, applying a feature map to a group of spectra may be implemented in one or more processing devices (e.g., a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information). The one or more processing devices may include one or more devices executing some or all of the operations of feature mapping in response to instructions stored electronically on an electronic storage medium. The one or more processing devices may include one or more devices configured through hardware, firmware, and/or software to be specifically designed for execution of one or more of the operations of applying a feature map to a group of spectra.

[0106] Applying a feature map to a group of spectra may apply to a group of any number of spectra, including a single spectrum.

[0107] In Fig. 7, applying a feature map to a group of spectra begins with spectra 88. From the spectra 88, stochastic grouping 90 produces spectra with stochastic base peaks 92. From the spectra with stochastic base peaks 92, feature selection 94 produces a new data set with new features 98, said features derived from the spectra with stochastic base peaks 92. Feature selection 94 makes use of a feature map 96. In some embodiments of the present invention, feature map 96 is produced by the process of feature mapping as shown in Fig. 6. The new data set with new features 98 may resemble spectra, or may have different mathematical properties. The new data set with new features 98 is input to a model 100. Machine learning can be used to train the model 100 on the new data set with new features 98. Alternatively, the model 100 may already be trained, in which case the model 100 can be used to estimate or classify the spectra 88 via the data set with new features 98.
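
The sketch below illustrates the data flow of Fig. 7 under simplifying assumptions: the stochastic grouping step is omitted, a 1 Da binning map (as in the earlier sketch) stands in for the feature map 96, and a scikit-learn support vector machine stands in for the model 100; all names and data are hypothetical.

```python
# Illustration of applying a (previously determined) feature map to spectra
# and feeding the resulting synthetic features to a model. The binning map
# here is only a stand-in for feature map 96; stochastic grouping is omitted.
import numpy as np
from sklearn.svm import SVC

def apply_feature_map(spectrum, points_per_window=10):
    """Stand-in feature map: sum intensities in each 1 Da window (cf. earlier sketch)."""
    n = len(spectrum) // points_per_window
    return spectrum[: n * points_per_window].reshape(n, points_per_window).sum(axis=1)

rng = np.random.default_rng(2)
spectra = rng.random((30, 1000))          # 30 spectra, 0-100 Da at 0.1 Da steps
labels = rng.integers(0, 2, size=30)      # hypothetical "contains species X" labels

# Apply the feature map to every spectrum, producing the new data set (98).
features = np.array([apply_feature_map(s) for s in spectra])

# Train a model (100) on the new features, or use an already trained model
# to classify new spectra via their mapped features.
model = SVC(kernel="linear").fit(features, labels)
print(model.predict(features[:5]))
```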

[0108] Fig. 8 illustrates a method for analyzing spectra with one or more integral transforms. The operations of the method in Fig. 8 are intended to be illustrative. In at least one embodiment of the present invention, the method for analyzing spectra with one or more integral transforms may be accomplished by performing one or more additional operations not described, and/or without performing one or more of the operations discussed. Additionally, the order in which the operations of analyzing spectra with one or more integral transforms are illustrated in Fig. 8 and described below is not intended to be limiting.

[0109] In some embodiments, analyzing spectra with one or more integral transforms may be implemented in one or more processing devices (e.g., a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information). The one or more processing devices may include one or more devices executing some or all of the operations of feature mapping in response to instructions stored electronically on an electronic storage medium. The one or more processing devices may include one or more devices configured through hardware, firmware, and/or software to be specifically designed for execution of one or more of the operations of analyzing spectra with one or more integral transforms.

[0110] In Fig. 8, analyzing spectra with one or more integral transforms begins with one or more spectra 102. An integral transform 104 is applied to the spectra 102, producing transformed spectra 106. Said integral transform may be combined with one or more other mathematical operations. For example, in at least one embodiment of the present invention, an inverse integral transform is used in addition to or instead of an integral transform. As a further example, in at least one embodiment of the present invention, a cepstrum transform comprising the squared magnitude of the inverse Fourier transform of the logarithm of the squared magnitude of the Fourier transform of the spectra 102 is performed as step 104.
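
A minimal NumPy sketch of the cepstrum-style transform just described is given below; the small constant added before taking the logarithm, to guard against log of zero, is an assumed implementation detail rather than something taken from the foregoing description.

```python
# Sketch of the transform described above: squared magnitude of the inverse
# Fourier transform of the logarithm of the squared magnitude of the Fourier
# transform of a spectrum. The epsilon guard is an assumed implementation detail.
import numpy as np

def cepstrum_transform(spectrum, eps=1e-12):
    power = np.abs(np.fft.fft(spectrum)) ** 2        # squared magnitude of FFT
    log_power = np.log(power + eps)                  # logarithm (guarded against zeros)
    return np.abs(np.fft.ifft(log_power)) ** 2       # squared magnitude of inverse FFT

# Example usage on a synthetic spectrum:
s = np.random.default_rng(3).random(1000)
transformed = cepstrum_transform(s)
print(transformed.shape)   # same length as the input spectrum
```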

[0111] The transformed spectra 106 are input to a model 108. Machine learning can be used to train the model 108 on the new data set. Alternatively, the model 108 may already be trained, so the model 108 can be used to estimate or classify the spectra 102 via the transformed spectra 106.

[0112] In some embodiments of the present invention, the model 108 comprises a linear or non-linear model. In some embodiments of the present invention, the model 108 comprises either or both of the processes shown in Fig. 6-7. In some embodiments of the present invention, the spectra 102 comprise a data set with new features 98 and/or spectra with stochastic base peaks 92.

[0113] In Fig. 9 can be seen graphs of the results of consensus feature learning applied to MALDI mass spectrometry data. In Fig. 9, (a) and (b) show sensitivity and specificity for data from the MALDI Biotyper (MBT), while (c) and (d) show sensitivity and specificity for BACLIB lipid data.

[0114] In each graph, a pair of whisker charts is shown for support vector machines (SVMs) for each of several species of bacteria. In each left-hand whisker plot, spectra were presented as-is to SVMs for training and classifying. In each right-hand whisker plot, spectra were first subjected to consensus feature learning, then presented to SVMs for training and classifying. The graphs show that consensus feature learning generally improves the quality of machine learning results using the described data.

[0115] In Fig. 10 can be seen a graph of results from subjecting MALDI mass spectrometry data from the BACLIB assay of bacterial lipids to an integral transform, consensus feature learning, and ensemble learning.

[0116] The BACLIB process was used to extract lipids from bacterial samples, producing lipid samples. Said lipid samples were analyzed on a MALDI mass spectrometer, producing spectra. Said spectra were transformed using a cepstrum transform, described elsewhere herein. Said transformed spectra were subjected to consensus feature learning, producing a feature map. Said feature map was used to select features in the aforementioned transformed spectra, and said selected features were then used to train an SVM ensemble. Said SVM ensemble was then used, as described above, to classify other data sets derived from bacterial samples. The results show that the method is effective. Sensitivity results from species that appear in both Figs. 9 and 10 can be compared.
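
As a sketch of the ensemble step only, the code below trains several SVMs on bootstrap resamples of a feature matrix and classifies by majority vote; the feature matrix, labels, number of models, and voting scheme are hypothetical assumptions and show one common way to build such an ensemble, not the specific ensemble used to produce Fig. 10.

```python
# Sketch of a simple SVM ensemble trained on selected features and used for
# classification by majority vote. Data here are random stand-ins for the
# selected cepstrum-domain features described in the text.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
selected_features = rng.random((60, 100))       # 60 samples, 100 selected features
labels = rng.integers(0, 2, size=60)            # hypothetical species labels

ensemble = []
for _ in range(11):                             # 11 SVMs, each on a bootstrap resample
    idx = rng.integers(0, len(labels), size=len(labels))
    ensemble.append(SVC(kernel="linear").fit(selected_features[idx], labels[idx]))

def ensemble_predict(models, X):
    """Majority vote over the predictions of the individual SVMs."""
    votes = np.array([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

print(ensemble_predict(ensemble, selected_features[:5]))
```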

[0117] The various embodiments described above can be combined to provide further embodiments. All of the U.S. patents, U.S. patent application publications, U.S. patent applications, foreign patents, foreign patent applications and non-patent publications referred to in this specification and/or listed in the Application Data Sheet, including U.S. provisional patent application Serial No. 62/584,618, filed November 10, 2017, are incorporated herein by reference, in their entirety. Aspects of the embodiments can be modified, if necessary, to employ concepts of the various patents, applications and publications to provide yet further embodiments.