Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
METHOD OF CONSTRUCTING A REUSABLE LOW-DIMENSIONALITY MAP OF HIGH-DIMENSIONALITY DATA
Document Type and Number:
WIPO Patent Application WO/2018/165530
Kind Code:
A1
Abstract:
In one embodiment, there is a computer-implemented method of constructing a reusable low- dimensionality map. The method may comprise the steps performed at one or more computing devices. The one or more computing devices may create a plurality of high-dimensionality bins using a binning algorithm and using a subset of the measured variables representing each object. The one or more computing devices may calculate high-dimensionality bin centroid coordinates for each of the high-dimensionality bins. The one or more computing devices may embed the high- dimensionality bin centroid coordinates in a low dimensionality space using a dimensionality reduction algorithm. The one or more computing devices may produce a low dimensionality map representing the high-dimensionality space, the low-dimensionality map including a plurality of bin representations. The one or more computing devices may apply the mapping from the high- dimensionality space to the low-dimensionality space to a new data set comprised of one or more samples, previously withheld or unavailable, generated with a data collection method consistent with the data collection method from which the plurality of high-dimensional bins was generated.

Inventors:
ROGERS WADE (US)
Application Number:
PCT/US2018/021711
Publication Date:
September 13, 2018
Filing Date:
March 09, 2018
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
CYTOVAS LLC (US)
International Classes:
G01N21/64; G01N33/49; G01N33/50; G06F17/18; G06F17/30
Foreign References:
US20160070950A12016-03-10
US20090307248A12009-12-10
US20090097733A12009-04-16
US5739000A1998-04-14
Other References:
LUGLI ET AL.: "Data Analysis in Flow Cytometry: The Future Just Started", CYTOMETRY PART A, vol. 77, no. 7, July 2010 (2010-07-01), pages 705 - 713, XP055154784, Retrieved from the Internet
ROGERS ET AL.: "Cytometric fingerprinting: quantitative characterization of multivariate ' distributions", CYTOMETRY PART A, vol. 73, no. 5, 27 March 2008 (2008-03-27), pages 430 - 431, XP055154560, Retrieved from the Internet
CARTER ET AL.: "Dimensionality reduction of flow cytometric data through information preservation", 2008 IEEE WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, 16 October 2008 (2008-10-16) - 19 October 2008 (2008-10-19), pages 462 - 467, XP055549300, Retrieved from the Internet
FINN ET AL.: "Analysis of clinical flow cytometric immunophenotyping data by clustering on statistical manifolds: Treating flow cytometry data as high-dimensional objects", CYTOMETRY PART B: CLINICAL CYTOMETRY, vol. 76, no. 1, 18 July 2008 (2008-07-18), pages 1 - 7, XP055154746, Retrieved from the Internet
LEE ET AL.: "Cha"Graph-based dimensionality reduction."", IMAGE PROCESSING AND ANALYSIS WITH GRAPHS: THEORY AND PRACTICE, 2012, CRC press, pages 351 - 382, XP055549329, Retrieved from the Internet
Attorney, Agent or Firm:
RYAN, Michael et al. (US)
Download PDF:
Claims:
CLAIMS

1. A computer-implemented method of constructing a reusable low-dimensionality map, the low-dimensionality map being constructed from a high-dimensionality data set comprising one or more data samples, each of the data samples comprising a plurality of objects, each object being represented by three or more measured variables, the dimensionality of the low-dimensionality map being less than the number of measured variables representing each object, said method comprising the steps of:

at one or more computing devices:

creating a plurality of high-dimensionality bins using a binning algorithm and using a subset of the measured variables representing each object, each high-dimensionality bin representing a subset of the objects in the high-dimensionality data set, each high-dimensionality bin representing a bounded volume in a multi-dimensional space;

calculating high-dimensionality bin centroid coordinates for each of the high- dimensionality bins, the high-dimensionality bin centroid coordinates representing an average location of the objects of the respective bin in a high dimensional space; embedding the high-dimensionality bin centroid coordinates in a low dimensionality space using a dimensionality reduction algorithm;

producing a low dimensionality map representing the low-dimensionality space, the low-dimensionality map including a plurality of bin representations, each bin representation corresponding to a high-dimensionality bin representing the respective subset of the objects in the high-dimensionality data set.

2. The computer-implemented method of claim 1, further comprising:

at the one or more computing devices:

computing an average value of each measured variable of the data samples for each high-dimensionality bin,

displaying one or more graphical depictions of the low dimensionality map, each graphical depiction using a graphical representation characteristic to illustrate:

the average value of the respective variable across the low dimensionality map and the number of individual objects represented by each high-dimensionality bin representation.

3. The computer-implemented method of claim 1, further comprising:

at the one or more computing devices:

after embedding the high-dimensionality bin centroid coordinates in the low dimensionality space using the dimensionality reduction algorithm, clustering portions of the bin representations. 4. The computer-implemented method of claim 3, wherein each of the clustered portions of the bin representations represents collections of similar objects in the high-dimensionality data set.

5. The computer-implemented method of claim 3, further comprising:

at the one or more computing devices:

after clustering the portions of the bin representations:

defining a plurality of feature vectors based on the clustered portions of the bin representations, wherein each feature vector comprises a plurality of clustered portions of the bin representations, and

assigning subsets of the data samples of the high-dimensionality data set to one or more classes based on the plurality of feature vectors, each class comprising a subset of the data samples of the high-dimensionality data set.

6. The computer-implemented method of claim 5, further comprising:

at the one or more computing devices:

after assigning subsets of the data samples of the high-dimensionality data set to one or more classes based on the plurality of feature vectors, receiving a new data sample, and classifying the new data sample as corresponding to one or more classes of the high-dimensionality data set. 7. The computer-implemented method of claim 1, wherein the high-dimensionality data set is flow cytometry data.

8. The computer-implemented method of claim 7, wherein the flow cytometry data are based on a panel comprised of one or more tubes, each tube stained with a plurality of markers.

9. The computer-implemented method of claim 8, wherein the panel comprises more than one tube and each tube contains at least one common marker and at least one variable marker, wherein the at least one common marker is a marker shared between all tubes.

10. The computer-implemented method of claim 1, wherein the binning algorithm is probability binning.

11. The computer-implemented method of claim 1, wherein the binning algorithm is uniform binning.

12. The computer-implemented method of claim 1, wherein the binning algorithm is oblique recursive subdivision binning.

13. The computer-implemented method of claim 1, wherein the dimensionality reduction algorithm is a linear dimensionality reduction algorithm. 14. The computer-implemented method of claim 13, wherein the linear dimensionality reduction algorithm is principal component analysis.

15. The computer-implemented method of claim 1, wherein the dimensionality reduction algorithm is a non-linear dimensionality reduction algorithm.

16. The computer-implemented method of claim 15, wherein the non-linear dimensionality reduction algorithm is a manifold learning algorithm.

17. A non-transitory computer readable storage medium having stored thereon computer- executable instructions which, when executed by a processor, perform the steps of any of claims 1- 16.

18. A system comprising: one or more memory units each operable to store at least one program; and at least one processor communicatively coupled to the one or more memory units, in which the at least one program, when executed by the at least one processor, causes the at least one processor to perform the steps of and of claims 1-16.

Description:
METHOD OF CONSTRUCTING A REUSABLE LOW-DIMENSIONALITY MAP

OF HIGH-DIMENSIONALITY DATA

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Patent Application No.

62/469,014 filed March 9, 2017 entitled "METHOD OF CONSTRUCTING A REUSABLE LOW- DIMENSIONALITY MAP OF HIGH-DFMENSIONALITY DATA", which is incorporated by reference herein in its entirety.

BRIEF SUMMARY

[0002] In one embodiment, there is a computer-implemented method of constructing a reusable low-dimensionality map. The low-dimensionality map may be constructed from a high- dimensionality data set comprising one or more data samples. Each of the data samples may comprise a plurality of objects and each object may be represented by three or more measured variables. The dimensionality of the low-dimensionality map may be less than the number of measured variables representing each object. The method may comprise the steps performed at one or more computing devices. The one or more computing devices may create a plurality of high- dimensionality bins using a binning algorithm and using a subset of the measured variables representing each object. Each high-dimensionality bin may represent a subset of the objects in the high-dimensionality data set. Each high-dimensionality bin may represent a bounded volume in a multi-dimensional space. The one or more computing devices may calculate high-dimensionality bin centroid coordinates for each of the high-dimensionality bins. The one or more computing devices may create the high-dimensionality bin centroid coordinates representing an average location of the objects of the respective bin in a high dimensional space. The one or more computing devices may embed the high-dimensionality bin centroid coordinates in a low

dimensionality space using a dimensionality reduction algorithm. The one or more computing devices may produce a low dimensionality map representing the low-dimensionality space, the low- dimensionality map including a plurality of bin representations. Each bin representation may correspond to a high-dimensionality bin representing the respective subset of the objects in the high- dimensionality data set. BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

[0003] The foregoing summary, as well as the following detailed description of embodiments of the devices and methods described herein, will be better understood when read in conjunction with the appended drawings of an exemplary embodiment. It should be understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown.

[0004] In the drawings:

[0005] Figs. 1 A-1C are graphical representations of binning data in accordance with an exemplary embodiment of the present invention;

[0006] Fig. 2 is a graphical representation of a binning result in accordance with an exemplary embodiment of the present invention;

[0007] Fig. 3 A-3D are graphical representations of embedding data and constructing a reusable low dimensionality map in accordance with an exemplary embodiment of the present invention;

[0008] Figs. 4A-4D are graphical representations of data clustering in accordance with an exemplary embodiment of the present invention;

[0009] Figs. 5A-5C are graphical representations illustrating classification of data samples using feature vectors in accordance with an exemplary embodiment of the present invention;

[0010] Fig. 6 is an example of AML case study data in accordance with an exemplary embodiment of the present invention;

[0011] Fig. 7 is an example of a t-SNE map of bin centers in accordance with an exemplary embodiment of the present invention;

[0012] Fig. 8 is an example of a t-SNE map of bin centers with marker expression in accordance with an exemplary embodiment of the present invention;

[0013] Fig. 9 is an example of phenotype representation in accordance with an exemplary embodiment of the present invention;

[0014] Fig. 10 is an example of bivariate graphs in accordance with an exemplary embodiment of the present invention;

[0015] Fig. 11 is an example of an AML case study in accordance with an exemplary embodiment of the present invention; and

[0016] Fig. 12 is an example of clustering of resulting binning data in accordance with an exemplary embodiment of the present invention. DETAILED DESCRIPTION

[0017] Flow Cytometry (FC), may represent complex, high dimensional aspects of data including high dimensional aspects of cells and other organelles (which may be referred to herein as "events"). Using Flow Cytometry, samples may be stained with markers, including but not limited to antibody - fluorophore conjugates, that specifically label molecules expressed on or inside of the cell or organelle. Different cells may express different combinations of molecules. By labeling cells with different markers, many different cell types and sub-types may be differentiated and counted. An advantage of Flow Cytometry can be that it is capable of measuring many thousands of events per second, enabling the study of large numbers of individual events. Another advantage of Flow Cytometry may be that modern instruments are capable of measuring up to 30, and in some cases even more, simultaneous markers.

[0018] Although this high dimensional capability can allow for the enumeration of many different cell types, sometimes the enumeration is not enough. To overcome this limitation, researchers sometimes create multiple combinations of markers, and stain separate tubes of samples with each of several marker combinations. The entire collection of tubes is often referred to as a "panel", which contains n tubes, and in each tube, m markers are measured. Thus, in shorthand researchers may refer to an «-tube, w-marker panel. Sometimes, in order to correlate information about cells across the n tubes, a subset of the m markers may be repeated in each of the tubes. These are typically referred to as "backbone" or "common" markers.

[0019] FC data may be analyzed using graphically driven approaches. Subsets of cells (events) may be delineated usually in one-dimensional or two-dimensional histograms or "dotplots" in a procedure termed "gating." Gates of differing shapes including rectangular, circular, elliptical, or arbitrary polygonal contours may be specified. The gating process may be frequently applied in a sequential fashion, with the numbers of events inside successive gates falling monotonically from step to step. Subsets determined via gating may then be quantified with respect to their expression patterns in the dimensions of multi-parameter space not utilized for gating, by counting proportions of the subsets that are positive or negative for each of the markers of interest for that subset.

[0020] Manual gating can be a highly effective means for analyzing flow cytometry data, especially in cases where the application of expert judgment in the visual design of gating strategies may be able to isolate events of biological interest in the presence of confounding experimental (or biological) variations that will be difficult to account for automatically. Nevertheless, manual gating can have three main drawbacks. First, the choice of gates may be subjective, particularly in the not- unusual situation where the distribution may be broad and smooth. This lack of objective criteria may be problematic, especially when different samples may show different types of "excursions" from the average/normal case. Second, because gates may be specified by manually drawing regions on a graph using a computer mouse, the process may be very labor intensive and time consuming. Finally, because gating and regions of interest may be determined by the data analyst based on his or her experience, there may be interesting and informative features that exist within the full ungated multivariate distribution of events but that nevertheless escape detection.

[0021] Sometimes, even this high dimensional approach is not enough. Some embodiments of the present invention may provide the following: (a) eliminate the need for manual sequential gating, (b) provide a convenient, intuitive visual representation that greatly facilitates the

interpretation of high-dimensionality data, and (c) provide a pathway to construct a computational classifier capable of assisting in the pattern recognition task which may be crucial to diagnosing disease or to other scientific research goals.

[0022] Referring to the drawings in detail, wherein like reference numerals indicate like elements throughout, there is shown in Figure 1A-12 methods and devices for constructing a reusable low-dimensionality map in accordance with exemplary embodiments of the present invention.

[0023] In some embodiments, there is a computer-implemented method, implemented at one or more computing devices, for constructing a reusable low-dimensionality map from a high- dimensionality data set. The low-dimensionality map may have at least one less dimension than the number of dimensions that comprise the high-dimensionality data set. For example, the low- dimensionality map may be a two dimensional map representing a data set of at least three dimensions. Alternatively, the low-dimensionality map may be a three dimensional map

representing a data set of at least four dimensions.

[0024] Figures 1 A-4D are plots and maps that illustrate a computer-implemented method of constructing a reusable-low dimensionality map, according to at least some embodiments of the invention.

[0025] In some embodiments, the high-dimensionality data set may comprise one or more data samples. The data samples may comprise a plurality of objects. The objects may be represented by three or more parameters (e.g., measured variables). In some embodiments, the data samples may be taken from test subjects (e.g., humans). The objects may be components of the test subjects (e.g., cells). The three or more measured variables may be markers expressed on or in the cells. [0026] In some embodiments, the computing device 1200 creates a plurality of high- dimensionality bins using a binning algorithm and using at least a subset of the measured variables representing each object. In some embodiments, a binning algorithm is a technique that subdivides a high-dimensionality space into a set of smaller high-dimensionality regions, called bins, that occupy portions of the volume of the high-dimensionality space. Examples of binning algorithms may include, but are not limited to, Cytometric Fingerprinting, probability binning, uniform binning and oblique recursive subdivision binning.

[0027] The high-dimensionality bins may represent a subset (optionally, less than all) of the objects in the high-dimensionality data set. The high-dimensionality bins may represent a group of objects that share similarities between one or more variables. In some embodiments, each high- dimensionality bin may represent a bounded volume based on the corresponding values of the measured variables in a multi-dimensional space. In some embodiments, the high-dimensionality bins may be non-overlapping in the multi-dimensional space. In some embodiments, at least two high-dimensionality bins may overlap.

[0028] Figures 1A-1C show plots of a multi-dimensional space and illustrate a Cytometric

Fingerprinting binning algorithm that uses a subset of parameters, according to at least some embodiments of the invention. Figure 1 A shows a plot of a plurality of objects 10 in a

multidimensional space r. In this example, the multi-dimensional space (also referred to herein as a region) may be a two dimensional space because the plurality of objects 10 (e.g., objects 10a- 10c) have two measured variables PI, P2. Each of the objects 10 is plotted based on the corresponding values of each of the two measured variables.

[0029] In some embodiments, the computing device 1200 may calculate a variance of the objects among each of the measured variables. After calculating the variances, the computing device 1200 may determine which of the measured variables for the objects in the region r has the highest variance. After determining the highest variance, the computing device 1200 may calculate a median of the measured variable having the highest variance for objects in the region r. After calculating the median, the computing device 1200 may divide the region r into two subregions at the median of the measured variable having the highest variance. In Figure IB, the computing device 1200 determined that measured variable P2 has a higher variance than measured variable PI . The computing device 1200 then calculates a median of 550 for the objects in the region r for measured variable P2 and divides region r into sub-regions rl, r2. In Figure 1C, the computing device 1200 repeats, but instead uses each of the sub-regions rl, r2 as an individual region. In Figure 1C, for sub-region rl, the computing device 1200 determines that measured variable PI has the highest variance and calculates a median of 580. The computing device 1200 then divides sub region rl into sub-regions rl 1, rl2 at 580 for the objects in the sub-region rl of measured variable PI . For sub-region r2, the computing device 1200 determines that measured variable P2 has the highest variance and calculates a median of 500 for the objects in the sub-region r2 for measured variable P2 and divides sub region r2 into sub-regions r21, r22. The computing device 1200 repeats the preceding steps, recursively, until a number of objects 10 in a sub-region is less than a predetermined threshold (e.g., 10 objects). Once the process is completed, each sub region may correspond to a high-dimensionality bin in a multi-dimensional space. Each high-dimensionality bin may be bounded in the multi-dimensional space r. For example, bin rl 1 may be bounded by a value of 580 for P2 and a value of 600 for PI, such that bin rl 1 includes objects having values above 580 for measured variable P2 and less than 600 for measured variable PI .

[0030] While Figures 1 A-1C illustrate a Cytometric Fingerprinting binning algorithm, it is understood that other binning algorithms may be used to generate the high-dimensionality bins.

[0031] It will be appreciated by those skilled in the art that although Figs. 1 A-1C are depicted in two dimensions, the same binning algorithm may be applied to a data set in a space of any dimensionality, and particularly in three or more dimensions.

[0032] In some embodiments, the computing device calculates high-dimensionality bin centroid coordinates for each of the high-dimensionality bins. The high-dimensionality bin centroid coordinate may represent the average location (e.g., average, median, mean) of all of the objects in a high-dimensionality bin in the multi-dimensional space. In some embodiments, the computing device 1200 calculates the high-dimensionality bin centroid coordinate by determining the position of each object in the bin, and calculating the median value. For example, Figure 2 shows a plot of a plurality of bins ra-rp, and their bin centroid coordinates 20a-20p, in a multi-dimensional space r. In Figure 2, bin ra (upper left corner) has a bin centroid coordinate of (400, 700) for measured variables PI, P2. This bin centroid coordinate (400, 700) represents the average position of the objects in bin ra.

[0033] In some embodiments, the computing device 1200 embeds the high-dimensionality bin centroid coordinates in a low dimensionality space using a dimensionality reduction algorithm. The dimensionality reduction algorithm, as used herein, may refer to a process of reducing the number of random variables required to represent a distribution of objects. Examples of dimensionality reduction algorithms include, but are not limited to linear dimensionality reduction algorithms such as principal component analysis, non-negative matrix factorization, kernel principal component analysis and linear discriminant analysis , or non-linear dimensionality reduction algorithms such as t-distributed Stochastic Neighbor Embedding, generalized non-linear discriminant analysis, Isomap, Locally Linear Embedding, Modified Locally Linear Embedding, Hessian Eigenmapping, Spectral Embedding, Local Tangent Space Alignment and Multi-Dimensional Scaling.

[0034] In some embodiments, the computing device produces a low dimensionality map representing the high-dimensionality space. The low-dimensionality map may include a plurality of bin representations. Each bin representation may correspond to a high-dimensionality bin representing the respective subset of the objects in the high-dimensionality data set. Fig. 3 A shows a representation of a low-dimensionality map produced after the computing device 1200 embeds the high-dimensionality bin centroid coordinates in a low dimensionality space, according to at least some embodiments of the invention. In Fig. 3A, a 2-dimensional embedding of a 3-dimensional set of centroid coordinates of each high-dimensionality bin has been performed using a dimensionality reduction algorithm known as the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm. In this example, the number of high-dimensionality bins created using the binning algorithm was 256 bins. Each high-dimensionality bin included 3-dimensional centroid coordinates. The t-SNE embedding produced bin representations, each bin representation having 2-dimensional coordinates representing a low (e.g., two) dimensional embedding of a high (e.g., three) dimensional space. Thus, each of the 256 points (i.e., points 30a-30c) in Figure 3A represents one of the bins that was previously created by the computing device 1200.

[0035] In some embodiments, the computing device 1200 computes an average value of each measured variable of the data samples for each high-dimensionality bin to determine a set of coordinates that represent the location of the high-dimensionality bin. The average value for each measured variable may be calculated as the mean value of each dimension of the set of objects contained within the high-dimensionality bin. Alternatively, the average value for each measured variable may be calculated as the median value of each dimension of the set of objects contained within the high-dimensionality bin. In some embodiments, the computing device 1200 displays one or more graphical depictions of the low dimensionality map, each graphical depiction using a graphical representation characteristic to illustrate: the average value of the respective measured variable across the low dimensionality map and the number of individual objects represented by each high-dimensionality bin representation. Figures 3B-D show, respectively, the relative expression levels of three of the measured variables (e.g., markers) in a data sample (e.g., panel), according to at least some embodiments of the invention. FS Lin, CD45-ECD and CD2-PC7 may each correspond to one of the measured variables. The size of the corresponding dots represents the relative marker expression in each of the 256 bins, with large dots (e.g., dots 31a-31c) indicating high relative expression and small dots (e.g., dots 33a-33c) indicating low relative expression.

[0036] In some embodiments, after embedding the high-dimensionality bin centroid coordinates in the low dimensionality space using the dimensionality reduction algorithm, the computing device 1200 clusters portions of the bin representations having collections of similar objects in the high-dimensionality data set (e.g., linear distance-based clustering methods). Because the high-dimensional manifold has been learned using t-S E, in this example, the distance relationships between the various bin centroids are much more amenable to linear distance-based clustering methods.

[0037] Figure 4 A shows clustering of the bin representations, according to at least some embodiments of the invention. In Figure 4A, the clustering was performed using hierarchical agglomerative clustering in the 2-dimensional map space, resulting in clusters (e.g., clusters 40a- 40b). Hierarchical agglomerative clustering may be a method of cluster analysis in which each bin representation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy, until arriving at a desired number of clusters. Figures 4B-4D show the same clusters superimposed on the marker expression maps as shown in Figures 3B-3D, according to at least some embodiments of the invention. From Figs. 4B-4D it is evident that the relative marker expressions within each cluster are fairly homogeneous. For example, Cluster 40b (in the upper right corner of Figures 4A-4D) consists of points in which "FS Lin" has low to intermediate expression, as represented by relatively small dots, "CD45-ECD" has high expression, as represented by larger dots, and "CD2-PC7" has high expression, as represented by large dots. Conversely, Cluster 40a consists of points in which "FS Lin" has very low expression, as represented by small dots, "CD45- ECD" has low to moderate expression, as represented by small to intermediate dots and "CD2-PC7" has very low expression as represented by very small dots.

[0038] In some embodiments, after clustering the portions of the bin representations, the computing device 1200 defines a plurality of feature vectors based on the clustered portions of the bin representations. Each feature vector may comprise a plurality of clustered portions of the bin representations. Each feature vector may also comprise the number of objects of a data sample in each clustered portion of bin representations. In some embodiments, the computing device 1200 may assign subsets of the data samples of the high-dimensionality data set to one or more classes based on the plurality of feature vectors, each class comprising a subset of the data samples of the high-dimensionality data set. [0039] Figures 5A-5C are graphical map representations illustrating classification of the data samples using the feature vectors. In some embodiments, given a fixed binning model, the binning model can be applied to data from any data sample (provided that the data sample was generated using a compatible panel). When a model is applied to a data sample, the computing device 1200 may count the objects (e.g. cells) for the data sample in each high-dimensionality bin. The number of objects in each bin can be represented by the size of the corresponding dot in the low- dimensionality map.

[0040] In some embodiments, the binning model may be created from a dataset that had some data samples belonging to a first class "Class 1", and some other data samples belonging to a second class "Class 2". Class 1 may identify a group of data samples that have data to indicate a potentially problematic or abhorrent characteristic (e.g., AML, described below). Class 2 may identify a group of data samples that have data to indicate a normal characteristic (e.g., normal bone marrow, described below).

[0041] Figure 5A shows a low-dimensional embedding of a high-dimensional dataset comprising the two classes Class 1 and Class 2. In this example, there are 15 clusters of bins in this low-dimensionality map. Cluster 50 is an example of one of the clusters of bins.

[0042] Figure 5B shows a low-dimensionality map illustrating a representation of a data sample from the dataset that was a member of Class 1, while Figure 5C shows a low-dimensionality map illustrating a representation of a data sample from the dataset that was a member of Class 2. In these representations the dot size indicates the number of objects in each corresponding bin, thus we can see that the objects are distributed in high-dimensional space differently in the two samples shown in Figure 5B and 5C respectively.

[0043] A feature vector describing the sample depicted in Figure 5B can be derived by counting the objects in each of the 15 clusters. Likewise, another feature vector can be derived for the sample depicted in Figure 5C. Because the number of objects of the size of the dots in Figures 5B and 5C represent the number of objects in each bin, it can be visually clear to one of ordinary skill in the art from the differently-sized dots in figures 5B and 5C that these two feature vectors will be different.

[0044] If a new data sample is obtained or received by the computing device and that data sample was not part of the data set from which the binning model was derived, the computing device 1200 may apply the binning model to the new sample, and derive a feature vector describing the new sample analogously to the feature vectors for samples in the original dataset. The new feature vector can be then be compared with those feature vectors derived from data samples of Class 1 and with data samples of Class 2. The new data sample can thus be easily classified as being a member of either Class 1 or Class 2 by virtue of similarities of its feature vector to those of the members of the respective classes.

[0045] To better understand the advantages of at least some embodiments of the present invention, a case study is provided herein. Acute Myeloid Leukemia (AML) may be characterized by the presence of cells with abnormal expression patterns of markers that are not generally found, or found in very low numbers, on healthy cells. In order to determine first, the presence of AML, and second, the specific subtype of AML, a large number of markers may be profiled. Figure 6 shows an example of a panel used to characterize AML, according to at least one embodiment of the invention. This may be a 4-tube 8-color panel. In this panel, 5 of the markers may be common across all four tubes (backbone markers), facilitating correlation of, for example, CD56 in Tube 1 with CD133 or with CD15 in Tube 4. Analysis of data from a patient sample derived from this panel may consist of sequential gating. Considerable expertise and training may be required, first, to construct and adjust appropriate gates, and second, to interpret the resulting graphical patterns to visually detect aberrant expression patterns.

[0046] At least some embodiments of the present invention may be carried by implementing at least one of the following steps provided below (or alternatively one of the preceding steps described herein):

1. Create a Cytometric Fingerprinting (CF) binning model or models using a subset of the available parameters and a subset of available samples comprising one or more data sets. 2. Compute the bin centroids for all bins in the models, representing bin locations in a high dimensional space.

3. Perform an imbedding of the high dimensional bin coordinates in a lower dimensional space using a dimensionality reduction algorithm such as t-S E, producing a low

dimensionality (which may be 2-dimensional) "map" of the bins in the model or models. 4. Compute the average marker expression for all markers in the panel for each bin of the models.

5. Display a sequence of graphical depictions of the maps, using graphical attribute (e.g. color or dot size) to encode the expression of individual markers across the map as well as the numbers of cells in each bin, as a means of visual interpretation of expression patterns for normal versus aberrant expression.

6. Perform clustering (which may involve the use of agglomerative hierarchical clustering) in the 2-dimensional embedding, to create groups of bins of similar phenotype. 7. Train a supervised classifier (which may involve the use of linear discriminant analysis, support vector machine, etc.) to derive expression patterns of clusters that relate to disease classification.

8. Use the classifier developed in Step 7 to automatically classify prospective patient samples.

[0047] For this AML case study, Step 1 as described above may be performed by (a) aggregating several AML samples and several normal bone marrow ( BM) samples, and (b) computing CF binning models separately for the AML aggregate and the NBM aggregate. With reference to Figure 6, the 5 backbone markers, plus the Side Scatter (SSC) signal, may be used as the parameters of the CF binning model. After computing the bin centroids with respect to these 6 markers (Step 2), t-SNE embedding may be performed (Step 3), which may result in the map in Figure 7. Each red (blue) dot in the map may represent a bin in the AML (NBM) model.

[0031] Figure 8 may show the expression of all of the markers in the panel for each model bin (Steps 4 and 5). In addition, agglomerative hierarchical clustering within the 2-dimensional embedding, resulting in 77 clusters of bins may be performed. Cluster names may be shown in the central pane of Figure 8. Clusters may be assigned AML, NBM or both depending on whether the preponderance of bins in the cluster may be derived from the AML model, or the NBM model, or that they were roughly equally represented.

[0032] In addition to showing color-coded maps as a means of summarizing a very large quantity of data pertaining to a sample, it may be possible to depict in a more quantitative fashion the phenotypes of individual clusters of bins. Figure 9, may show such a depiction. In this figure, each bargraph may be a representation of the expression pattern of the bins belonging to a single cluster. In this Figure, all of the AML clusters are shown, and each bargraph may quantitatively show the expression of each marker in the panel with respect to its cluster. By consulting the AML literature, or an expert in the field, it may be possible to assign names to most, if not all, of these clusters. For example, cluster AML 5 may correspond to erythroid cells by virtue of the fact that this cluster is negative for all markers in the panel (and the panel lacks erythroid markers such as CD235a). Other clusters, including AML 72-77 may correspond to various sub-populations of immature blasts, a hallmark of AML.

[0033] It may be appreciated by those skilled in the art that the spread of pictures in Figure 8 depicts an extremely complex multivariate probability distribution in a highly organized, intuitive way that enables a visual interpretive analysis to be rapidly performed by a physician or researcher. In contrast, the same information may be depicted in a large series of bivariate plots derived from a sequential gating analysis of the same data (refer to Figure 10). However, the number of bivariates that need to be examined may be large, and integrating information gleaned from them requires a great deal of training and skill.

[0034] Figure 11 may represent a map, color coded according to the number of cells in each CF bin. From this figure it may be appreciated that (a) there are regions that are largely unpopulated among the BM samples, (b) there are regions that are under-populated among all of the AML samples, (c) the NBM distributions may be highly similar, and (d) the AML distributions may be highly heterogeneous. None of these observations will be surprising to those skilled in the art of AML diagnosis; however, the ease and obviousness of these observations given the methods of the present invention may serve to illustrate at least part of its utility.

[0035] A computer program may be used as a method of creating a computed map to visualize and analyze FC, and more particularly, Mass Cytometry (MC) data, using t-SNE. The computer program may be applied in the setting of AML. However, a key limitation may be that the computed map changes as different data are fed into it. This is intrinsic to the nature of the t-SNE algorithm in particular, and other non-linear dimensionality reduction algorithms in general. Thus, in order to compare multiple samples, events from all samples may need to be collected and aggregated, and used to compute a low-dimensionality embedding of the high-dimensional coordinates of the events. By keeping track of which events came from which samples, the distributions of the events in the samples may be compared to each other. However, the map so produced has no relevancy or utility with respect to any samples other than those that were collected and aggregated for the purposes of sample comparison. This may significantly limit the utility of a computer program implementing t-SNE for the prospective evaluation of samples not available at the time of model building. By contrast, the present invention, rather than embedding coordinates of individual events, may embed the coordinates of CF bins. These bins may be computed from a given set of samples (which may be referred to as the training set or training sets) and stored for later reuse. Once these binning models and t-SNE maps are stored, additional samples that are not members of the training sets may be rapidly projected onto the stored maps, thereby allowing perpetual reuse of the maps to analyze new data prospectively.

[0036] In one embodiment, the computing device 1200 includes one or more computers having one or more processors and memory (e.g., one or more nonvolatile storage devices). In some embodiments, memory or computer readable storage medium of memory stores programs, modules and data structures, or a subset thereof for a processor to control and run the various systems and methods disclosed herein. In one embodiment, a non-transitory computer readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, perform one or more of the methods disclosed herein.

[0037] Figure 12 shows a block diagram that illustrates a computing device 1200 according to at least one embodiment of the present invention. Examples of a computing device 1200 may include a smart phone, tablet or a personal computer, among others. In one embodiment, computing device 1200 may represent multiple client devices, each of which is capable of performing the functions specified for computing device 1200.

[0038] Computing device 1200 may include communication infrastructure 1211, processor 1212, memory 1213, user interface 1214 and communication interface 1215.

[0039] Processor 1212 may be any type of processor, including but not limited to a special purpose or a general-purpose processor. Processor 1212 is connected to a communication infrastructure 1211 (for example, a bus or network) that also connects memory 1213, user interface 1214 and communications interface 1215. Various software implementations are described in terms of this exemplary computer system.

[0040] Memory 1213 may include at least one of: random access memory (RAM), a hard disk drive and a removable storage drive, such as a floppy disk drive, a magnetic tape drive, or an optical disk drive, etc. The removable storage drive reads from and/or writes to a removable storage unit. The removable storage unit can be a floppy disk, a magnetic tape, an optical disk, etc., which is read by and written to a removable storage drive. Memory 1213 may include a computer usable storage medium having stored therein computer software programs and/or data to perform any of the computing functions of computing device 1200. Computer software programs (also called computer control logic), when executed, enable computing device 1200 to implement embodiments of the present invention as discussed herein. Accordingly, such computer software programs represent controllers of computing device 1200.

[0041] Memory 1213 may include one or more data stores that store data such as test data, images, software files or any other types of data files.

[0042] User interface 1214 may be a program that controls a display (not shown) of computing device 1200. User interface 1214 may include one or more peripheral user interface components, such as a keyboard or a mouse. The user may use the peripheral user interface components to interact with computing device 1200. User interface 1214 may receive user inputs, such as mouse inputs or keyboard inputs from the mouse or keyboard user interface components.

[0043] Communication interface 1215 allow data to be transferred between computing device 1200 and external devices. Examples of communication interface 1215 may include a modem, a network interface (such as an Ethernet card), a communication port, a Personal Computer Memory Card International Association (PCMCIA) slot and card, etc. Data transferred via communication interface 1215 are in the form of signals, which may be electronic, electromagnetic, optical, or other signals capable of being transmitted or received by communication interface. These signals are provided to or received from communication interface 1215 via network 1230.

[0044] Network 1230 connects computing device 1200 and with external devices by carrying signals. Network 1230 may be implemented using wire or cable, fiber optics, a phone line, a wireless link, a cellular phone link, a radio frequency link, or any other suitable communication channel. For instance, network 1230 may be implemented using a combination of channels.

Network 1230 may be implemented as an intranet and/or an internet.

[0045] It will be appreciated by those skilled in the art that changes could be made to the exemplary embodiments shown and described above without departing from the broad inventive concepts thereof. It is understood, therefore, that this invention is not limited to the exemplary embodiments shown and described, but it is intended to cover modifications within the spirit and scope of the present invention as defined by the claims. For example, specific features of the exemplary embodiments may or may not be part of the claimed invention and various features of the disclosed embodiments may be combined. Unless specifically set forth herein, the terms "a", "an" and "the" are not limited to one element but instead should be read as meaning "at least one".

[0046] It is to be understood that at least some of the figures and descriptions of the invention have been simplified to focus on elements that are relevant for a clear understanding of the invention, while eliminating, for purposes of clarity, other elements that those of ordinary skill in the art will appreciate may also comprise a portion of the invention. However, because such elements are well known in the art, and because they do not necessarily facilitate a better understanding of the invention, a description of such elements is not provided herein.

[0047] Further, to the extent that the methods of the present invention do not rely on the particular order of steps set forth herein, the particular order of the steps should not be construed as limitation on the claims. Any claims directed to the methods of the present invention should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the steps may be varied and still remain within the spirit and scope of the present invention.