

Title:
METHOD FOR DATA FUSION
Document Type and Number:
WIPO Patent Application WO/2022/029218
Kind Code:
A1
Abstract:
The present invention is in the field of data fusion. In particular, the present invention provides a method for fusing images from a sample, for example a tissue sample. The present invention also relates to uses of the method for fusing medical images from a sample and computer programs configured to perform the method for fusing medical images from a sample.

Inventors:
DE KEYSER TOM (BE)
DE MOOR BART (BE)
SMETS TINA (BE)
WAELKENS ETIENNE (BE)
Application Number:
PCT/EP2021/071852
Publication Date:
February 10, 2022
Filing Date:
August 05, 2021
Assignee:
UNIV LEUVEN KATH (BE)
International Classes:
G06T3/00
Domestic Patent References:
WO2019186965A12019-10-03
WO2013177189A12013-11-28
Other References:
WALID M. ABDELMOULA ET AL: "Automatic Generic Registration of Mass Spectrometry Imaging Data to Histology Using Nonlinear Stochastic Embedding", ANALYTICAL CHEMISTRY, vol. 86, no. 18, 26 August 2014 (2014-08-26), pages 9204 - 9211, XP055767903, ISSN: 0003-2700, DOI: 10.1021/ac502170f
SKRÁSKOVÁ KAROLINA ET AL: "Precise Anatomic Localization of Accumulated Lipids in Mfp2 Deficient Murine Brains Through Automated Registration of SIMS Images to the Allen Brain Atlas", JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY, ELSEVIER SCIENCE INC, US, vol. 26, no. 6, 28 April 2015 (2015-04-28), pages 948 - 957, XP035498798, ISSN: 1044-0305, [retrieved on 20150428], DOI: 10.1007/s13361-015-1146-6
WALID M. ABDELMOULA ET AL.: "Automatic generic registration of mass spectrometry imaging data to histology using nonlinear stochastic embedding", ANAL. CHEM., vol. 86, 2014, pages 9204 - 9211, XP055767903, DOI: 10.1021/ac502170f
FATMA EL-ZAHRAA AHMED EL-GAMAL ET AL.: "Current trends in medical image registration and fusion", 2016
Attorney, Agent or Firm:
DE CLERCQ & PARTNERS (BE)
Claims:

CLAIMS

1. A computer-implemented method for fusing data sets obtained from images from a sample, the method comprising the steps of:
a1) receiving a (nxp) first data set from the sample;
a2) receiving a (mxq) second data set from the sample;
b) transforming, preferably reducing, the (nxp) first data set to a (nxq') first data set, using a manifold learning algorithm comprising a manifold approximation step and a manifold projection step, wherein q' is smaller than or equal to q;
c) using a registration algorithm to construct a (mxn) correspondence matrix of the data received in steps a1) and a2);
d) transforming the (mxq) second data set to a (mxq') transformed second data set, using a manifold learning algorithm comprising a manifold approximation step and a correspondence-aware manifold projection step;
thereby obtaining a fused image for the sample;
wherein the correspondence-aware manifold projection step in step d) comprises at least 3 inputs, comprising: input from the (mxn) correspondence matrix obtained in step c); input from the (nxq') first data set obtained in step b); and input from the (mxq) second data set after the manifold approximation step; and,
wherein the correspondence-aware manifold projection step in step d) comprises a cost function that is constrained based on these at least 3 inputs.

2. The method according to claim 1, wherein q' is equal to 3.

3. The method according to any one of claims 1 or 2, wherein q is equal to 3 and wherein p is not equal to 3.

4. The method according to any one of claims 1 to 3, wherein the first data set from the sample comprises hyperspectral data, preferably mass spectrometry data; preferably wherein step a1) comprises the step of obtaining a set of hyperspectral data using a hyperspectral method, preferably MSI.

5. The method according to any one of claims 1 to 4, wherein the second data set from the sample comprises microscopy data, preferably obtained from staining, more preferably hematoxylin and eosin stains; preferably wherein step a2) comprises the step of obtaining a set of microscopy data using microscopy, preferably by staining, preferably hematoxylin and eosin stains.

6. The method according to any one of claims 1 to 5, wherein step b) comprises a nonlinear dimensionality reduction algorithm to project all information from the first data set, preferably comprising hyperspectral data, onto the RGB space.

7. The method according to any one of claims 1 to 6, wherein step b) and/or step d) comprises using a manifold learning algorithm selected from the group comprising: UMAP, t-SNE, LargeVis, Isomap, neural network based approaches such as autoencoders; preferably UMAP.

8. The method according to any one of claims 1 to 7, wherein the cost function takes into account the differences between the two data sets, preferably the difference between the two images.

9. The method according to any one of claims 1 to 8, wherein the cost function is subjected to the distances between the colours associated with the corresponding instances.

10. The method according to any one of claims 1 to 9, wherein the fused image is the result of minimising the local distances within the manifold of the second data set, preferably from microscopy data, and the differences with the target colours from the first data set, preferably from hyperspectral data.

11. The method according to any one of claims 1 to 10, wherein the correspondence-aware manifold projection step in step d) aims to minimise the repulsion between corresponding points, whereby the repulsion may be defined as the average difference between corresponding points in the two data sets.

12. The method according to any one of claims 1 to 11, comprising the step of: visualising molecular information obtained at a higher spatial resolution for i) molecular trends present in an entire data set and/or for ii) a single feature or molecule of interest (MOI).

13. The method according to any one of claims 1 to 12, comprising the step of: expanding a distribution of molecules to regions not measured.

14. Use of the method according to any one of claims 1 to 13 in pathology and/or multi-omics data fusion.

15. A data processing device comprising means for carrying out the method according to any one of claims 1 to 13.

Description:
METHOD FOR DATA FUSION

TECHNICAL FIELD

The present invention is in the field of data fusion. In particular, the present invention provides a method for fusing images from a sample, for example a tissue sample. The present invention also relates to uses of the method for fusing medical images from a sample and computer programs configured to perform the method for fusing medical images from a sample.

BACKGROUND

High-resolution microscopy images (such as H&E, immuno- and fluorescent stainings) give insight into the correlation between structure and function of cells and tissues. Pathologists have been relying on morphology-based methods for decades to study and diagnose diseases. Microscopic approaches are, however, constrained by the limited number of markers that can be evaluated in a single tissue slide. Moreover, an object less than half the wavelength of the microscope's illumination source will not be visible under that microscope. Molecular measurements are needed to gain insight into pathway defects, to improve the stratification of patients, and to improve the prediction of prognosis, survival, etc. Furthermore, H&E images require an experienced pathologist to interpret them.

Molecular measurements such as Mass Spectrometry Imaging (MSI) and Spatial transcriptomics (ST) enable the assessment of thousands of molecules simultaneously without specifying these molecules of interest upfront. In addition to measuring the molecular components, these spatial omics techniques also take the spatial context into account. All this information should be taken together to understand the complex interactions taking place in biological systems. Measuring such a large amount of information also comes with some disadvantages; one well-known disadvantage being the curse of dimensionality. Omics measurements are typically expensive, and generate large amounts of data, which poses data management and interpretation challenges.

Moreover, multi-omics data measurements are complementary when obtained from the same tissue sample, and their integration or data fusion can reveal complex and heterogeneous molecular interactions that would stay undetected when analysing the data sources by themselves. Walid M. Abdelmoula et al., "Automatic generic registration of mass spectrometry imaging data to histology using nonlinear stochastic embedding", Anal. Chem., 2014, 86, 9204-9211, relates to an automated generic approach for registering mass spectrometry imaging data to histology.

Therefore, there remains a need for data fusion approaches for the integration of these heterogeneous data sources to understand complex biological systems. Currently available methods are limited: they typically take only one single molecular feature into account when building a data fusion model. Moreover, such methods are based on linear models such as, for example, PLS (Partial Least Squares) regression. Given the inherent non-linear nature of biological systems, properly taking these non-linearities into account is important to obtain reliable results. A general challenge in data fusion methods concerns the registration between different modalities due to the large differences in resolution between different measurement modalities (e.g. in the case of molecular and microscopic images).

SUMMARY OF THE INVENTION

The present invention overcomes one or more of these issues. Preferred embodiments of the present invention overcome one or more of these issues.

It is an advantage of the present invention, or embodiments thereof, that molecular information obtained from spatial omics data can be (in its entirety) combined with the microscopy image of the same tissue slide or sample. The fused image can provide a single representation including the complete feature space that constitutes the molecular trends in 3D with the corresponding microscopy data. The combination makes it possible to interpret the molecular measurements at a cellular resolution; as such morphological parameters can be taken into account when evaluating spatial omics measurements. This can support a pathologist in identifying invasive tumour cells or groups of cells, whereas without data fusion these molecular measurements can only be evaluated at the tissue level due to limited spatial resolution of the molecular measurement instruments. Combining molecular and morphological information at cellular resolution is important to improve the stratification of patient populations to assess survival and prognosis prediction or the identification of potential treatments. It is an advantage of the present invention, or embodiments thereof, that the biological trends observed in the high-dimensional omics data can be visualised at a higher spatial resolution. In addition, given the high costs associated with omics measurements and the limited amount of tissue that is typically available, this method can also be used for out-of-sample prediction.

It is an advantage of the present invention, or embodiments thereof, that elements can be revealed that would not be detected taking into account the separate data modalities individually.

It is an advantage of the present invention, or embodiments thereof, that the registration step has a limited impact on the result. The invention exploits the construction of the manifold to take into account the local information or pairwise information between pixels. The method as described herein is therefore robust to small deviations due to registration difficulties, which is particularly challenging for data of different resolutions (i.e. the resolution of microscopy data varies around 0.2 micrometre, while the molecular measurements have, for example, a resolution of 10 to 200 micrometre).

It is an advantage of the present invention, or embodiments thereof, that the use of nonlinearity enables the preservation of complex interactions.

It is an advantage of the present invention, or embodiments thereof, that the need to build a model in advance for each feature can be avoided. Instead, the complete feature space can be used in the preferred number of dimensions, typically 3 dimensions when visualising data. This makes the method faster and subject to fewer computational constraints.

It is an advantage of the present invention, or embodiments thereof, that colour can be transferred to black and white images. In this way for example, a coloured image can be obtained from a black and white image, for example in case use is made of old images when coloured images were not available yet.

The present invention relates to a computer-implemented method for fusing data sets, preferably obtained from images from a sample, such as a tissue sample.

The method preferably comprises the steps of:
a1) receiving a (nxp) first data set from the sample;
a2) receiving a (mxq) second data set from the sample;
b) transforming, preferably reducing, the (nxp) first data set to a (nxq') first data set, using a manifold learning algorithm comprising a manifold approximation step and a manifold projection step, wherein q' is smaller than or equal to q;
c) using a registration algorithm to construct a (mxn) correspondence matrix of the data received in steps a1) and a2);
d) transforming the (mxq) second data set to a (mxq') transformed second data set, using a manifold learning algorithm comprising a manifold approximation step and a correspondence-aware manifold projection step;
thereby obtaining a fused image for the sample;
wherein the correspondence-aware manifold projection step in step d) comprises at least 3 inputs, comprising: input from the (mxn) correspondence matrix obtained in step c); input from the (nxq') first data set obtained in step b); and input from the (mxq) second data set after the manifold approximation step; and,
wherein the correspondence-aware manifold projection step in step d) comprises a cost function that is constrained based on these at least 3 inputs.

According to a preferred embodiment q' is equal to 3.

According to a preferred embodiment q is equal to 3.

According to a preferred embodiment p is not equal to 3; preferably larger than 3.

In some preferred embodiments, the first data set from the sample comprises hyperspectral or high-dimensional omics data, preferably mass spectrometry data.

In some preferred embodiments, step a1) comprises the step of obtaining a set of hyperspectral data using a hyperspectral method, preferably MSI.

In some preferred embodiments, the second data set from the sample comprises microscopy data, preferably obtained from staining, more preferably hematoxylin and eosin stains.

In some preferred embodiments, step a2) comprises the step of obtaining a set of microscopy data using microscopy, preferably by staining, more preferably hematoxylin and eosin stains.

In some preferred embodiments, the first data set from the sample comprises hyperspectral or high-dimensional omics data, preferably mass spectrometry data; and the second data set from the sample comprises microscopy data, preferably obtained from staining, more preferably hematoxylin and eosin stains.

In some preferred embodiments, step a1) comprises the step of obtaining a set of hyperspectral data using a hyperspectral method, preferably MSI; and step a2) comprises the step of obtaining a set of microscopy data using microscopy, preferably by staining, more preferably hematoxylin and eosin stains.

In some preferred embodiments, step b) comprises a non-linear dimensionality reduction algorithm to project all information from the first data set, preferably comprising hyperspectral data, onto the RGB space.

In some preferred embodiments, step b) and/or step d) comprises using a manifold learning algorithm selected from the group comprising: UMAP, t-SNE, LargeVis, Isomap, neural network-based approaches such as autoencoders; preferably UMAP.

In some preferred embodiments, the cost function takes into account the differences between the two data sets, preferably the difference between the two images.

In some preferred embodiments, the cost function is subjected to the distances between the colours associated with the corresponding instances.

In some preferred embodiments, the difference may be captured by a distance measure, including but not limited to Minkowski-style metrics (Euclidean, Manhattan, Chebyshev, general Minkowski, weighted Minkowski, standardised Euclidean), Canberra, Bray-Curtis, haversine, Mahalanobis, cosine, correlation, Hamming, Jaccard, Dice, Russell-Rao, Kulsinski, Rogers-Tanimoto, Sokal-Michener, Sokal-Sneath, and/or Yule distance measures. Said distance measure may also be obtained in the form of a custom distance metric through the application of a separate metric learning step.
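For illustration, a few of the distance measures named above can be computed directly with NumPy. This is a minimal sketch (the vectors are invented; v = 2u is chosen deliberately so the cosine distance is zero):

```python
import numpy as np

def minkowski(u, v, p):
    """Minkowski distance of order p; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])

d_manhattan = minkowski(u, v, 1)        # sum of absolute differences
d_euclidean = minkowski(u, v, 2)        # straight-line distance
d_chebyshev = np.max(np.abs(u - v))     # largest per-coordinate difference
# cosine distance is 0 here because v is a scalar multiple of u
d_cosine = 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

Which metric is appropriate depends on the data modality; the patent leaves this open as a design choice, including a learned custom metric.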

In some preferred embodiments, the fused image is the result of minimising the local distances within the manifold of the second data set, preferably from microscopy data, and the differences with the target colours from the first data set, preferably from hyperspectral data.

In some preferred embodiments, the correspondence-aware manifold projection step in step d) aims to minimise the repulsion between corresponding points, whereby the repulsion may be defined as the average difference between corresponding points in the two data sets.
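A minimal numerical sketch of such a cost function (an assumption on my part, not the patent's exact formulation): gradient descent on lam * tr(Y'LY) + (1 - lam) * ||Y - T||^2, where L is the graph Laplacian encoding the local structure of the second data set's manifold and T holds the target colours from the embedded first data set:

```python
import numpy as np

def fuse_embedding(T, W, lam=0.5, iters=300, lr=0.1):
    """Hypothetical sketch: minimise  lam * tr(Y' L Y) + (1 - lam) * ||Y - T||^2
    by gradient descent, i.e. keep neighbouring pixels close in colour while
    keeping every pixel near its target colour T."""
    L = np.diag(W.sum(axis=1)) - W   # graph Laplacian of the neighbour weights W
    Y = T.copy()
    for _ in range(iters):
        grad = 2.0 * lam * (L @ Y) + 2.0 * (1.0 - lam) * (Y - T)
        Y = Y - lr * grad
    return Y

# toy example: two neighbouring pixels with different target colours
T = np.array([[0.0], [1.0]])             # 1-channel "colours" for brevity
W = np.array([[0.0, 1.0], [1.0, 0.0]])   # the two pixels are neighbours
Y = fuse_embedding(T, W)
```

At the minimiser each pixel's colour is a compromise between its neighbours and its own target colour (in this toy case the targets 0 and 1 are pulled inward to 1/3 and 2/3), which mirrors the balance between local manifold distances and target-colour differences described above.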

In some preferred embodiments, the method comprises the step of: visualising molecular information obtained at a higher spatial resolution for i) molecular trends present in an entire data set and/or for ii) a single feature or molecule of interest (MOI).

In some preferred embodiments, the method comprises the step of: expanding a distribution of molecules to regions not measured (i.e. out-of-sample prediction).

The present invention also relates to the use of the method as described herein, or embodiments thereof, in pathology and/or multi-omics data fusion.

The present invention also relates to a computer program, or a computer program product directly loadable into the internal memory of a computer, or a computer program product stored on a computer readable medium, or a combination of such computer programs or computer program products, configured for performing the method as described herein, or embodiments thereof.

DESCRIPTION OF THE FIGURES

The following description of the figures of the invention is only given by way of example and is not intended to limit the present explanation, its application or use.

FIG. 1 illustrates a conceptual overview of an embodiment of the method according to the invention.

FIG. 2 illustrates the application of a method according to an embodiment of the invention.

FIG. 3A illustrates the low dimensional representation of a lymphoma MSI data set (500000 pixels x 8000 m/z features, 10 μm resolution). In FIG. 3B, the fused results show that the molecular trends in the data can be visualised at a much higher resolution.

FIG. 4 illustrates the use of the method according to an embodiment of the invention to obtain an out-of-sample prediction.

FIG. 5A shows an H&E staining at 0.2 μm resolution. In FIG. 5B, the low dimensional representation of a spatial transcriptomics data set is shown at 200 μm resolution. The data fusion results obtained for inputs from FIG. 5A and FIG. 5B are shown in FIG. 5C, illustrating that the molecular trends in the data can be visualised at the resolution of the microscopy image.

FIG. 6 illustrates how it is possible to highlight differences between cells and link these to their molecular function.

FIG. 7 illustrates blood vessels surrounded by a muscle layer or a single endothelial cell layer in the fused result and corresponding microscopy or H&E image.

FIG. 8 illustrates the CAML (Correspondence-aware manifold learning) validation and generality.

DETAILED DESCRIPTION OF THE INVENTION

As used below in this text, the singular forms "a", "an", "the" include both the singular and the plural, unless the context clearly indicates otherwise.

The terms "comprise", "comprises" as used below are synonymous with "including", "include" or "contain", "contains" and are inclusive or open and do not exclude additional unmentioned parts, elements or method steps. Where this description refers to a product or process which "comprises" specific features, parts or steps, this refers to the possibility that other features, parts or steps may also be present, but may also refer to embodiments which only contain the listed features, parts or steps.

The enumeration of numeric values by means of ranges of figures comprises all values and fractions in these ranges, as well as the cited end points.

The term "approximately" as used when referring to a measurable value, such as a parameter, an amount, a time period, and the like, is intended to include variations of +/-10% or less, preferably +/-5% or less, more preferably +/-1% or less, and still more preferably +/-0.1% or less, of and from the specified value, in so far as the variations apply to the invention disclosed herein. It should be understood that the value to which the term "approximately" refers per se has also been disclosed.

All references cited in this description are hereby deemed to be incorporated in their entirety by way of reference.

Percentages as used herein may also be noted as dimensionless fractions or vice versa. A value of 50% may for example also be written as 0.5 or ½.

Unless defined otherwise, all terms disclosed in the invention, including technical and scientific terms, have the meaning which a person skilled in the art usually gives them. For further guidance, definitions are included to further explain terms which are used in the description of the invention.

The present invention relates to a computer-implemented method for fusing data sets, preferably obtained from images from a sample.

The method preferably comprises the steps of:
a1) receiving a (nxp) first data set from the sample;
a2) receiving a (mxq) second data set from the sample;
b) transforming, preferably reducing, the (nxp) first data set to a (nxq') first data set, using a manifold learning algorithm comprising a manifold approximation step and a manifold projection step, wherein q' is smaller than or equal to q;
c) using a registration algorithm to construct a (mxn) correspondence matrix of the data received in steps a1) and a2);
d) transforming the (mxq) second data set to a (mxq') transformed second data set, using a manifold learning algorithm comprising a manifold approximation step and a correspondence-aware manifold projection step;
thereby obtaining a fused image for the sample;
wherein the correspondence-aware manifold projection step in step d) comprises at least 3 inputs, comprising: input from the (mxn) correspondence matrix obtained in step c); input from the (nxq') first data set obtained in step b); and input from the (mxq) second data set after the manifold approximation step; and,
wherein the correspondence-aware manifold projection step in step d) comprises a cost function that is constrained based on these at least 3 inputs.
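The shape of these steps can be sketched numerically. In this hedged illustration (all data and helper names are invented), PCA via SVD stands in for the manifold learning algorithm of step b), nearest-neighbour matching of registered pixel coordinates stands in for the registration algorithm of step c), and a simple convex combination stands in for the correspondence-aware projection of step d):

```python
import numpy as np

def pca3(X):
    """Linear stand-in for the manifold learning step: project to 3 components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:3].T

rng = np.random.default_rng(1)
n, p = 50, 40                     # a1) (n x p) first data set, e.g. MSI
m, q = 200, 3                     # a2) (m x q) second data set, e.g. RGB microscopy
first = rng.normal(size=(n, p))
second = rng.normal(size=(m, q))

first_3d = pca3(first)            # b) (n x q') embedding with q' = 3

coords_low = rng.uniform(size=(n, 2))   # registered pixel coordinates (shared frame)
coords_high = rng.uniform(size=(m, 2))
C = np.zeros((m, n))              # c) (m x n) correspondence matrix
for i in range(m):
    C[i, np.argmin(np.sum((coords_low - coords_high[i]) ** 2, axis=1))] = 1.0

targets = C @ first_3d            # target colours for each high-resolution pixel
fused = 0.5 * second + 0.5 * targets   # d) crude stand-in for the
                                       #    correspondence-aware projection
```

The dimensions make the claim's bookkeeping concrete: the (m x n) matrix C carries each of the m high-resolution pixels to one of the n low-resolution pixels, and the fused result has shape (m x q').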

As used herein, the notation "(nxq)" refers to the dimensions of the data set being n times q according to classic matrix notation, whereby n and q are (non-zero) natural numbers. The term "correspondence-aware manifold projection", as known to the person skilled in the art, refers to the fact that some kind of mapping (also called registration) is performed between the two different datasets/image before data fusion takes place. The source of the correspondence may thus be the pixel coordinates of each dataset and the correspondence matrix may be the result after mapping or registration of the two images/datasets to each other.

The manifold is then built by taking into account the information per corresponding pair of pixels. In the case of combining MSI with histology data, this means that the molecular information can be taken into account together with the morphological details enabling the molecular distributions to be interpreted at the single cell level, where this is not possible based on evaluating the molecular information by itself.

Steps b) and c) may be performed interchangeably or (partially) in parallel. Step d) is performed after steps b) and c).

Preferably, q' is equal to 3. Preferably, q and q' are equal to 3.

The present method is applicable to a wide variety of imaging modalities as long as a correspondence between the modalities is present, for example when they are obtained from the same subject and/or the same sample. Therefore, the first and second data sets (for example MSI data and H&E stains) are preferably obtained from the same sample, preferably a tissue sample (for example when used in pathology or biology). According to the present method, for example, the colours corresponding to different molecular compositions can be reflected in the fused image. According to a specific embodiment, when a sample comprises specific cells that have, for example, a cytoplasm with a green colour and are known to produce certain immunoglobulins, some of these cells can be observed in other regions as infiltrating cells when fusing molecular data obtained at, for example, a scale of 10 micrometre or higher with microscopic information at a much smaller scale, for example 0.2 micrometre. For such samples, the method is thus advantageous in fusing MSI and microscopy data.

According to a preferred embodiment p is not equal to 3; preferably larger than 3. In some preferred embodiments, the first data set from the sample comprises hyperspectral data, preferably mass spectrometry data.

In some preferred embodiments, step al) comprises the step of obtaining a set of hyperspectral data using a hyperspectral method, preferably MSI.

As used herein, the term 'hyperspectral data' comprises spatial transcriptomic and proteomic studies, herein also referred to as 'spatial omics'. Mass spectrometry imaging (herein also referred to as MSI) is a technique that uses mass spectrometry measurements to visualise the spatial distribution of molecules, metabolites, peptides or proteins by their mass-over-charge (m/z) values. The technique is used in pathology and the pharmaceutical industry for the detection of biomarkers and drug development. Mass spectrometry data is usually large in pixel space and in feature space.

However, the proposed method is not only suited for MSI, but can also be used for other high-dimensional omics data such as spatial transcriptomics. The method is capable of dealing with data where the dimensionality of the pixel space is much larger than that of the feature space (MSI) but also the other way around (spatial transcriptomics).

For example, the method could be used for any combination of datasets where there is a correspondence present; for example to fuse MRI, PET, CT images, fluorescent microscopy images and the like. For example, correspondence may be present if the datasets are obtained from the same subject, or from the same tissue. The method is particularly advantageous when use is made of a dataset comprising molecular data with a higher spatial resolution, as such datasets are easier to fuse with another dataset.

In some preferred embodiments, the second data set from the sample comprises microscopy data, preferably obtained from staining, more preferably hematoxylin and eosin stains.

In some preferred embodiments, step a2) comprises the step of obtaining a set of microscopy data using microscopy, preferably by staining, more preferably hematoxylin and eosin stains.

As stated above, the method could be used to fuse any combination of datasets (e.g. images) where there is a correspondence present. However, a second data set comprising microscopy data with a sufficiently high spatial resolution is advantageous, preferably a spatial resolution in the range of 5 micrometre down to and including 0.2 micrometre. A significant difference in resolution between the data modalities will ensure exploitation of the complementarity when dealing with lower resolution (molecular) measurements on the one hand, and a higher spatial resolution of the second modality on the other hand.

Hematoxylin and eosin stains are herein referred to as 'H&E stains'.

In some embodiments, q' is equal to q. In some embodiments q is equal to 3. In some embodiments, q' is equal to 3. In some preferred embodiments, q is equal to 3 and q' is equal to q, or in other words q' is also equal to 3. The microscopy data usually already comprises three-dimensional (e.g. RGB) data, so in that case there is no dimensionality reduction for the second data set.

The method as described herein is a general data fusion method, while the combination of hyperspectral and microscopy data is preferred. Therefore, the first data set from the sample comprises hyperspectral data, preferably mass spectrometry data; and the second data set from the sample comprises microscopy data, preferably obtained from staining, more preferably hematoxylin and eosin stains.

Nevertheless, the method may be applied to any combination of images, preferably when one or more conditions are fulfilled:

(i) the images have some underlying shared latent space (i.e. they are obtained from the same source, patient, sample, ...);

(ii) registration between the modalities is possible; and/or,

(iii) there is complementarity present (e.g. high spatial resolution + low molecular resolution vs. low(er) spatial resolution + high(er) molecular resolution).

The same principle could apply to other measurements, for example without a spatial component and not limited by 3 dimensions.

The advantage of fusing images is that one can immediately and visually evaluate the fused result (in the preferred case, the higher resolution of the fused MSI data may allow one to actually see the location of the cell nuclei). When applying the method to measurements without a spatial component, the interpretation may become more difficult, but the result may still be valuable. This can be compared to the application of a dimensionality reduction method, which is typically used to reduce an n-feature space to 2 or 3 dimensions because this enables visualisation of the outcomes. However, sometimes more than 2 or 3 dimensions are used to work with representations of data.

In some embodiments, the dimensionality of the first data set is not reduced in step b). For example, the method may be used starting from an existing image. Therefore, in some embodiments, in step b) p equals q and q' equals q. In this case step b) can be performed efficiently.

In some embodiments, the dimensionality of the first data set is reduced in step b). In some preferred embodiments, in step b) the dimensionality of the first data set is reduced. Therefore, in some preferred embodiments, in step b) p>q. More preferably q equals 3 and/or q' equals 3, which is advantageous for visualisation purposes. In such situations, step b) allows to have a 3D representation of the first data set, for example the hyperspectral data. Since dimensionality is reduced in 3 dimensions, it can be visualised in RGB.

In some preferred embodiments, step b) comprises a non-linear dimensionality reduction algorithm to project all information from the first data set, preferably comprising hyperspectral data, onto the RGB space. While the goal of data fusion is to identify elements that would not be identified based on a single modality, being able to take into account a complete high-dimensional molecular dataset is a strong advantage in this regard. In FIG. 6 of the example section, it is illustrated that the green colour corresponds to the cytoplasm of immunoglobulin producing cells. It can be seen that some of these cells behave invasively, as such, the method can support pathologists in assessing the invasiveness of individual tumour cells. In other words, the methodology turns dimensionality reduction into a data fusion approach to improve the interpretability of molecular trends measured at a lower spatial resolution through the enrichment with cell shape information. Another example is shown in FIG. 7 where blood vessels with a single cell endothelial layer, a blood vessel surrounded by a muscle layer and some collagen structures can be distinguished.
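A hedged sketch of such a projection onto the RGB space, using PCA (via SVD) as a linear stand-in for the preferred non-linear method (e.g. UMAP), with each of the 3 components rescaled to the 0-255 range so the embedding can be displayed as a colour image:

```python
import numpy as np

def embed_to_rgb(X):
    """Project an (n x p) hyperspectral matrix to 3 components and rescale
    each component to 0..255 so the embedding can be displayed as RGB.
    PCA is an illustrative stand-in for the non-linear reduction in the text."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt[:3].T                               # (n x 3) embedding
    lo, hi = Y.min(axis=0), Y.max(axis=0)
    return np.round(255.0 * (Y - lo) / (hi - lo)).astype(np.uint8)

rng = np.random.default_rng(0)
rgb = embed_to_rgb(rng.normal(size=(100, 50)))      # 100 pixels, 50 channels
```

In this rendering, pixels with similar spectra receive similar colours, which is what lets molecular trends be read off visually in the fused result.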

In some embodiments, q'>3. This may depend on the downstream application. The manifold approximation is applied with the aim of fusing the data, not necessarily with the aim of reducing it to a lower-dimensional space (which does not have to be 3 dimensions except for visualisation purposes). In some embodiments, the method is used to fuse the data to any number of dimensions as a pre-processing step for clustering (as an example). The choice between a transformation step and a reduction step depends on the dimensionality of the input data associated with the second dataset. The registration algorithm in step c) has several advantages. An advantage of the present method is that an imperfect registration is not a limiting factor. The method exploits the construction of the manifold to take into account the local information or pairwise information between pixels. The method is therefore robust to small deviations due to registration difficulties, which are particularly challenging for data of different resolutions (i.e. the resolution of microscopy data varies around 0.2 micrometre, while the molecular measurements may have a resolution of 10 to 200 micrometres). It would be difficult using prior art methods to obtain a fused result at the microscopic resolution, because registration between these modalities would be challenging.

For example, after registration of the microscopy image with the hyperspectral image, a series of corresponding pixels is found. The pixel values from the hyperspectral RGB space may be the target colours.

In some preferred embodiments, registration step c) occurs prior to step b). In some preferred embodiments, registration step c) occurs in parallel with step b). In some embodiments, registration step c) occurs after step b).

The registration step c) can, for example, be performed as set out below.

First and second data sets (for example MSI data and H&E stains) are gathered from the same tissue sample. Although the resolution is different, their spatial similarities make it possible to find pairs of pixels in both data sets that correspond to one another.

Finding matching pairs can be achieved with image registration techniques. Geometric transformation algorithms can estimate a projection between the two coordinate spaces using only a small set of matching pixel pairs. The output is a transformation matrix that can be applied to warp all other pixels from one image to the other.
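As a rough sketch of this idea (not the patented implementation), an affine transformation matrix can be estimated from a handful of matching pixel pairs by linear least squares; in practice a registration library would typically be used, and the function names here are illustrative:

```python
import numpy as np

def estimate_affine(src, dst):
    """Least-squares affine transform mapping src (k, 2) points onto
    dst (k, 2). Returns a 3x3 homogeneous transformation matrix."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    k = src.shape[0]
    src_h = np.hstack([src, np.ones((k, 1))])             # homogeneous coords
    params, *_ = np.linalg.lstsq(src_h, dst, rcond=None)  # shape (3, 2)
    T = np.eye(3)
    T[:2, :] = params.T
    return T

def warp(points, T):
    """Apply the homogeneous transform T to (k, 2) points."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    return (T @ pts_h.T).T[:, :2]

# three matching pixel pairs related by a 10x scaling and a (5, 5) shift
src = np.array([[0, 0], [1, 0], [0, 1]], dtype=float)
dst = src * 10 + 5
T = estimate_affine(src, dst)
```

Once `T` is known, it can be applied to warp every other pixel coordinate from one image's coordinate space into the other's.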

For example, the matrix A_(n×p) may denote the flattened high-resolution data and the matrix B_(m×q) may denote the low-resolution spectral data. The correspondence between these two data sets may be recorded in the binary matrix C_(n×m) such that C_ij = 1 if pixel i of A corresponds to pixel j of B, and C_ij = 0 otherwise. It is clear to the person skilled in the art that alternative embodiments of setting up the correspondence matrix are possible.
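One simple way to set up such a correspondence matrix, assuming the two images are already registered and differ only by an integer scale factor, is sketched below (the function and its arguments are illustrative assumptions, not the patented procedure):

```python
import numpy as np

def correspondence_matrix(coords_hi, coords_lo, scale):
    """Binary C of shape (n, m): C[i, j] = 1 when high-resolution pixel i
    falls inside the footprint of low-resolution pixel j after dividing
    its coordinates by `scale`.

    coords_hi: (n, 2) integer pixel coordinates of the high-resolution image.
    coords_lo: (m, 2) integer pixel coordinates of the low-resolution image.
    """
    n, m = len(coords_hi), len(coords_lo)
    C = np.zeros((n, m), dtype=np.uint8)
    mapped = coords_hi // scale  # which low-resolution cell each pixel lands in
    lo_index = {tuple(c): j for j, c in enumerate(map(tuple, coords_lo))}
    for i, cell in enumerate(map(tuple, mapped)):
        j = lo_index.get(cell)
        if j is not None:
            C[i, j] = 1
    return C

# a 4x4 high-resolution grid matched against a 2x2 low-resolution grid
hi = np.array([(y, x) for y in range(4) for x in range(4)])
lo = np.array([(y, x) for y in range(2) for x in range(2)])
C = correspondence_matrix(hi, lo, scale=2)
```

Here every low-resolution pixel collects the four high-resolution pixels covering the same tissue area; with a non-trivial registration, the warped coordinates would be used instead of a plain integer division.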

Manifold learning is preferably used for non-linear dimensionality reduction; it achieves this by focusing on preserving local distances between instances. Manifold learning algorithms can be used to solve dimensionality reduction and data visualisation tasks. The basic idea behind manifold learning for dimensionality reduction is that a high-dimensional data set can be represented by a lower-dimensional counterpart that shares a similar underlying topological subspace. The manifold approximations as used herein therefore capture both global and local relationships. These improved relationships provide for an improved fused image.

Applying manifold learning methods to real world data sets however poses additional problems. Usually data is not uniformly distributed and therefore the 'true' underlying manifold can only be approximated. In addition, high-dimensional omics data like MSI and also high-resolution images such as H&E stains are usually large, often going up to multiple GBs of data.

Non-linear dimensionality reduction methods such as t-SNE are often used for the visualisation of biological data. Not only are these methods capable of detecting non-linear trends, they can also capture the complete feature space when reducing data to two or three components, which is not always the case for linear methods such as, for example, PCA. UMAP is part of this family of methods, but shows major improvements in terms of scalability, enabling the analysis of large spatial omics data such as MSI. With growing data sizes, scalability is considered crucial to analyse state-of-the-art data sets. The application of scalable dimensionality reduction approaches such as UMAP therefore makes it possible to analyse and fuse state-of-the-art datasets, which would be impossible using, for example, t-SNE. As used herein, the term UMAP refers to uniform manifold approximation and projection. UMAP may be replaced by similar algorithms like t-SNE or LargeVis because these methods belong to the same family of algorithms as UMAP. However, UMAP has proven to work well with hyperspectral data and has some other advantages such as the aforementioned scalability, out-of-sample prediction and parametric learning. Therefore, in some preferred embodiments, step b) comprises using a manifold learning algorithm selected from the group comprising: UMAP, t-SNE, LargeVis, Isomap, and neural network based approaches such as autoencoders; preferably UMAP.

In some preferred embodiments, step d) comprises using a manifold learning algorithm selected from the group comprising: UMAP, t-SNE, LargeVis, Isomap, neural network based approaches such as autoencoders; preferably UMAP.

Step b) and step d) most preferably use the same manifold learning algorithm. Alternatively, although less preferred, the method could also work by combining different methods over the different steps.

UMAP creates a topological structure that represents the high-dimensional data by assembling approximations of local manifolds, and assembles an equivalent topological structure for a low-dimensional representation of the data. It then optimises the low-dimensional representation to fit the high-dimensional data by minimising the cross-entropy between the two topological structures.
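The cross-entropy between two such topological structures can be written down directly for the edge weights of the two weighted graphs; the sketch below is a simplified illustration of that objective and omits UMAP's sampling-based optimisation:

```python
import numpy as np

def fuzzy_cross_entropy(w_high, w_low, eps=1e-9):
    """Cross-entropy between two sets of edge weights in (0, 1), matching
    the low-dimensional graph to the high-dimensional one. Zero when the
    two weight sets agree exactly."""
    w_high = np.clip(np.asarray(w_high, dtype=float), eps, 1 - eps)
    w_low = np.clip(np.asarray(w_low, dtype=float), eps, 1 - eps)
    return np.sum(w_high * np.log(w_high / w_low)
                  + (1 - w_high) * np.log((1 - w_high) / (1 - w_low)))

# edge weights of a tiny high-dimensional fuzzy graph
w_h = np.array([0.9, 0.1, 0.5])
```

Minimising this quantity over the low-dimensional coordinates (which determine `w_low`) pulls neighbouring points together and pushes non-neighbours apart.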

In general, UMAP fits well within the family of algorithms such as t-SNE or LargeVis that focus on preserving local distances over global distances. These algorithms rely on different mathematical principles, although their implementations have a lot of common ground. Like t-SNE and LargeVis, UMAP implements its manifold approximations as weighted k-neighbour graphs. Any of these algorithms would suit the data fusion method explained here.

UMAP works especially well for MSI data, making it an excellent choice as a general-purpose algorithm for high-dimensional omics data. UMAP is therefore preferably used and extended to fit the desired data fusion goals. In the present method, UMAP can be used as a dimensionality reduction algorithm for high-dimensional omics data, and adapted to fuse the resulting low-dimensional representation with high-resolution imaging. Further, according to the present method, UMAP provides the further advantages of custom distance metric support and better conservation of global relationships while also modelling the local relationships.

After the registration step c), the target embedding of step b) is used as a constraint to embed the second data set (preferably a microscopy image) in step d), resulting in data fusion. For example, step d) can result in data fusion of the molecular information with the histological or microscopy data. More generically, the method may be used beyond images and/or general molecular information.

As used herein, the terms "registration" and "fusion" do not refer to the same process. The registration step c) occurs prior to the fusion step d). Fatma El-Zahraa Ahmed El-Gamal et al. in "Current trends in medical image registration and fusion", 2016 describe the difference between registration and fusion as follows:

"The intent of image registration is to align images with respect to each other. The input for this process is two images: the original image is known as the reference image while the image that will be aligned with the reference image is known as the sensed image. The result for this step can help in further analysis processes including image fusion. Image fusion is in turn the process of producing more informative and better descriptive images based on the input ones."

Such a method allows visualising the molecular information obtained at a higher spatial resolution either for i) the molecular trends present in an entire data set or for ii) a single feature or molecule of interest (MOI). Therefore, in some preferred embodiments, the method comprises the step of: visualising the molecular information obtained at a higher spatial resolution either for i) the molecular trends present in an entire data set or for ii) a single feature or molecule of interest (MOI). Instead of visualising only one molecule based on a priori information about a sample, the method now makes it possible to fuse two complete data sets, thereby obtaining a fused image with a reduced representation.

In some preferred embodiments, the method comprises the step of: expanding the distribution of molecules to regions not measured, which may also be referred to as out-of-sample prediction. The distribution of these molecules can be expanded to regions not measured by the molecular measurement technique such that, for example, a complete microscopy slide can be enriched with the available biological signals or omics measurements. This is advantageous as molecular measurements are expensive. When molecular information is only available for a small area, it now for example also becomes possible to verify a hypothesis in the neighbouring tissue.
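A very simplified stand-in for such out-of-sample prediction is nearest-neighbour extrapolation of the fused colours to unmeasured coordinates; a parametric manifold learner would offer a proper transform for new samples, so the sketch below (with illustrative names) only conveys the idea:

```python
import numpy as np

def oos_predict(measured_coords, measured_colours, query_coords, k=3):
    """Assign fused colours to unmeasured pixels by averaging the colours
    of the k nearest measured pixels."""
    measured_coords = np.asarray(measured_coords, dtype=float)
    query_coords = np.asarray(query_coords, dtype=float)
    colours = np.asarray(measured_colours, dtype=float)
    out = np.empty((len(query_coords), colours.shape[1]))
    for i, q in enumerate(query_coords):
        d = np.linalg.norm(measured_coords - q, axis=1)
        nearest = np.argsort(d)[:k]  # indices of the k closest measured pixels
        out[i] = colours[nearest].mean(axis=0)
    return out

# three measured pixels with known fused colours; predict one new pixel
measured = np.array([[0.0, 0.0], [0.0, 10.0], [10.0, 0.0]])
colours = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255]])
pred = oos_predict(measured, colours, np.array([[0.0, 1.0]]), k=1)
```

In practice the prediction would be based on the learned embedding model rather than raw coordinates, but the principle of extending limited molecular measurements over a whole slide is the same.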

The correspondence-aware manifold projection (CAMP) step in step d) performs the fusion of the two data sets based on correspondence information that maps instances of one data set onto the other. The goal of the CAMP step in step d) is to create a fused representation of the two data sets such that the corresponding instances of both data sets lie close to each other in the fused representation in order to exploit the complementarity between data sources.

The CAMP step may fuse information while projecting a manifold approximation onto a low-dimensional representation. Specifically, the CAMP step may take the first data set and align it to the second data set based on the corresponding instances. It optimises the fused representation according to the manifold approximation of the first data set and constrains the result to move closer to corresponding instances from the second data set. The fused representation therefore has the same resolution (or another desirable property, depending on the application) as the first data set.

In some embodiments, the concept of distance between corresponding points is captured as an interplay of attraction and repulsion. In some preferred embodiments, the CAMP step aims to minimise the repulsion between corresponding points, whereby the repulsion may be defined as the average difference between corresponding instances in the two data sets. Such a way of modelling the data is advantageous as it retains the topological data structures as well as possible.

For example, in order to formally define the repulsion between A and B, the information of the correspondence matrix C_(n×m) may be reconstructed as a mapping between the index sets of both data sets.

Definition 1. Consider matrices A_(n×p) and B_(m×q) and their correspondence matrix C_(n×m). The correspondence map γ : {1, …, n} → 2^{1, …, m} assigns to each index i of A its set of corresponding indices in B: γ(i) = { j ∈ {1, …, m} | C_ij = 1 }.

The correspondence map helps to formulate the concept of repulsion between corresponding instances. It may, for example, be defined as follows:

Definition 2. Consider matrices A_(n×d) and B_(m×d). Let γ be the correspondence map between the index sets of A and B. Define φ : ℝ^d → ℝ, the repulsive strength between these matrices, as

φ(A, B) = (1/n) Σ_{i=1..n} (1/|γ(i)|) Σ_{j ∈ γ(i)} ‖A_i − B_j‖_2

wherein the index 2 refers to the L2-norm and is related to the Euclidean distance between points. The Euclidean distance in the formula may be replaced by another distance measure, including Minkowski-style metrics (Euclidean, Manhattan, Chebyshev, Minkowski) and the Canberra, Bray-Curtis, haversine, Mahalanobis, weighted Minkowski, standardised Euclidean, cosine, correlation, Hamming, Jaccard, Dice, Russell-Rao, Kulsinski, Rogers-Tanimoto, Sokal-Michener, Sokal-Sneath, and/or Yule distance metrics. Said distance measure may also be obtained in the form of a custom distance metric through the application of a separate metric learning step.
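One plausible reading of these definitions — the repulsion as the average, over the instances of A, of the mean Euclidean distance to the corresponding instances in B — can be sketched as follows (an illustrative implementation, not the patented one):

```python
import numpy as np

def correspondence_map(C):
    """Map each row index i of A to the set of column indices j of B
    with C[i, j] == 1 (the correspondence map gamma of Definition 1)."""
    return {i: set(np.flatnonzero(row)) for i, row in enumerate(C)}

def repulsion(A, B, gamma):
    """Average over the instances of A of the mean L2 distance to their
    corresponding instances in B (rows of A and B share dimension d)."""
    total = 0.0
    for i, js in gamma.items():
        if js:
            total += np.mean([np.linalg.norm(A[i] - B[j]) for j in js])
    return total / len(A)

# two tiny data sets in d = 2 with a one-to-one correspondence
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.0], [3.0, 4.0]])
C = np.array([[1, 0], [0, 1]])
gamma = correspondence_map(C)
phi = repulsion(A, B, gamma)
```

Averaging per instance gives every corresponding instance equal weight even when a single row of A has several matches in B.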

The repulsion φ is an average because a single instance in A_(n×p) can have multiple corresponding instances in B_(m×q) when n < m. All corresponding instances should preferably receive equal weight. Note that in Definition 2, A_(n×d) and B_(m×d) have the same dimension d. As used herein, d is preferably 3, such that (when q and/or q' is already 3) one only needs to bring the high-dimensional omics data set to 3 dimensions. Given two data sets A_(n×d) and B_(m×d) and their correspondence map γ, CAMP can be formulated as a constrained optimisation problem:

min_A C(A)  subject to  φ(A, B) = 0

wherein C is the cost function of the manifold embedding, which focuses on preserving local distances within the data.

The present invention is characterized by choosing a specific step during which to implement a cost function. The present invention is also characterized by choosing specific inputs for the cost function.

In some preferred embodiments, the cost function takes into account the differences between the two data sets, preferably the difference between the two images. In some preferred embodiments, the cost function is subject to the distances between the colours associated with the corresponding instances.

The advantage of the constrained cost function is that the differences between points are taken into account. In some preferred embodiments, the cost function is subject to the distances between the colours represented by RGB values, but other cost functions could be used. Such an embodiment of the cost function is advantageous, as without the cost function it would not be possible to obtain a fused image. Furthermore, the cost function enables the incorporation of information of interest related to the corresponding instances. For example, the equality-constrained problem can be transformed into the following quadratic penalty function, which constitutes the CAMP cost function:

C_CAMP(A) = C(A) + ρ · φ(A, B)²

The first term in the equation above is the cost function that focuses on preserving local distances within the data. The second term penalises the repulsion between the corresponding instances in the two data sets. The hyperparameter ρ as used herein controls the balance between the two terms.
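Under the same reading of the repulsion term, the quadratic penalty can be sketched as below; `manifold_cost` stands in for whichever embedding cost is used (e.g. a cross-entropy term), and the formulation is illustrative rather than the patented implementation:

```python
import numpy as np

def camp_penalty(A, B, gamma, manifold_cost, rho):
    """Quadratic-penalty form of the constrained problem: the manifold
    embedding cost plus rho times the squared repulsion between the
    corresponding instances of A and B."""
    # repulsion: average per-instance mean L2 distance to corresponding rows
    rep = sum(np.mean([np.linalg.norm(A[i] - B[j]) for j in js])
              for i, js in gamma.items() if js) / len(A)
    return manifold_cost(A) + rho * rep ** 2

# toy example: a single corresponding pair at distance 5, zero manifold cost
A = np.array([[0.0, 0.0]])
B = np.array([[3.0, 4.0]])
gamma = {0: {0}}
cost = camp_penalty(A, B, gamma, lambda A_: 0.0, rho=1.0)
```

Raising `rho` makes the optimiser prioritise pulling corresponding instances together, while `rho` near zero leaves the manifold embedding essentially unconstrained.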

A high value for ρ increases the importance of the correspondence information, while a low value for ρ aims to preserve the manifold embedding. The inventors have found that the penalising term can be unforgiving, so preferably one starts with a low value for ρ. The value for ρ can be larger if the number of corresponding instances is low.

The result of the method according to the invention is a fused image which combines the information comprised within the first and second data set.

In some preferred embodiments, the fused image is the result of minimising the local distances within the manifold of the second data set, preferably from microscopy data, and the differences with the target colours from the first data set representing molecular distributions, preferably from hyperspectral data.

In some preferred embodiments, the method is used in pathology, in particular to enrich histological stainings and/or microscopy images with molecular measurements obtained from the same tissue sample. This is advantageous as typically evaluating a microscopy slide is not enough for a pathologist, and afterwards additional stainings have to be performed. Using the preferred method it becomes possible to visualize molecular measurements together with the microscopy image, such that a pathologist can immediately take into account molecular differences to make a better assessment.

In some preferred embodiments, the method is used for multi-omics data fusion (with or without a spatial component), for example to exploit the complementarity of these measurements obtained from the same patient or tissue sample. The fused information can be used to support applications for personalized medicine such as improved biomarker discovery, drug development, identification of suitable immunotherapies, and improved stratification of patients to prevent overtreatment and improve the quality of life. The fused information for example makes it possible to interpret the microscopy data in light of the molecular information, and the other way around, in one slide. In this way, elements might be detected that would not be detected by evaluating one of the data sources individually. This is advantageous, for example in the context of an improved assessment of tissue heterogeneity for personalized medicine.

The present invention also relates to a computer program, or a computer program product directly loadable into the internal memory of a computer, or a computer program product stored on a computer readable medium, or a combination of such computer programs or computer program products, configured for performing the method as described herein, or embodiments thereof.

The present invention also relates to a data processing apparatus/device/system comprising means for carrying out [the steps of] the method as described herein, or embodiments thereof.

The present invention also relates to a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method as described herein, or embodiments thereof.

The present invention also relates to a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method as described herein, or embodiments thereof.

EXAMPLES

FIG. 1 illustrates a conceptual overview of an embodiment of the method according to the invention. The example is shown for MSI data where the data set is first reduced to three dimensions. This representation is subsequently fused with the microscopy data, such that all molecular trends are visualized at a higher resolution. FIG. 2 illustrates the application of a method according to an embodiment of the invention.

The molecular data, consisting of n pixels and p features, is reduced to the target embedding (n×3) in step b) (1). The pixel coordinates of the molecular data and the (m×3) microscopy image undergo a registration step c) such that a correspondence matrix is obtained (2). Subsequently, the microscopy image is subjected to a dimensionality reduction step (3), wherein each pixel is evaluated as a function of its correspondence to the target embedding in step d). Specifically, the projection step in the dimensionality reduction method is constrained based on the target embedding, causing pixels in the microscopy image to receive a similar colour based on the reduced target embedding of the molecular data (4). This approach enables the transfer of information obtained from a complete high-dimensional molecular data set to a single microscopy slide, but can also be used to transfer the information from a single feature or molecular image to the microscopy slide.

FIG. 3A illustrates the low-dimensional representation of a lymphoma MSI data set (500 000 pixels × 8000 m/z features, 10 µm resolution). This hyperspectral visualisation represents the complete 8000 m/z feature space compressed into 3 dimensions, such that each colour is connected to a molecular trend present in the data. The information present in this image is subsequently used to perform data fusion with the corresponding microscopy image. In FIG. 3B, the fused results show that the molecular trends in the data can be visualised at a much higher resolution.

FIG. 4 illustrates the use of the method according to an embodiment of the invention to obtain an out-of-sample prediction. In addition to performing data fusion for the area covered by the MSI measurements, the method can be used to extend the information obtained to the complete microscopy image through out-of-sample prediction (OOS). This opens the door to interpreting much larger microscopic areas based on a limited amount of molecular measurements.

FIG. 5A shows the H&E staining or microscopic image and FIG. 5B shows the low-dimensional representation of the spatial transcriptomics molecular information (281 pixels × 16 416 features, 200 µm resolution) obtained from the same tissue sample. FIG. 5C illustrates the fused results obtained by a method corresponding to an embodiment of the invention. FIG. 6 shows how, thanks to the data fusion approach, it is possible to highlight differences between cells and link these to their molecular function. The plasma cells produce immunoglobulins in their cytoplasm (green colour), and some invading cells can clearly be distinguished. Moreover, because these cells need to produce a lot of immunoglobulins, their cell nuclei are pushed towards the sides of the cell, which is even visible in the fused results.

FIG. 7 shows some blood vessels surrounded by a muscle layer or a single endothelial cell layer in the fused result and corresponding microscopy or H&E image. Collagen structures have become visible as well.

FIG. 8 illustrates the CAML (correspondence-aware manifold learning) validation and generality. A multi-modal dataset based on MNIST, the handwritten digits dataset, was built. (1) Using a Sobel filter, corresponding data with a shared latent space was created. (2) Fusion: the correspondence matrix is the diagonal matrix. (3) Instances are transformed based on the learned model from (2). (4) UMAP embeddings of the multi-modal dataset and the fused data: the fusion still produces valid digits, adapted to also reflect the property of the second modality.

It is thus clear from the embodiments above that the molecular results measured using MSI can now be brought to the single-cell level. This means that valuable information from the microscopic images can be retained, such that a pathologist who for example needs to identify a tumour cell can now also distinguish between cells, for example based on the colour of their cytoplasm. According to such embodiments it is then possible to distinguish cells which produce certain proteins from those which do not, which is advantageous.