METHOD OF ANALYSIS' PATTERN RECOGNITION AND COMPUTER PROGRAM - UNIV ESTADUAL PAULISTA JÚLIO DE MESQUITA FILHO

Title:

METHOD OF ANALYSIS' PATTERN RECOGNITION AND COMPUTER PROGRAM

Document Type and Number:

WIPO Patent Application WO/2016/183647

Abstract:

The present invention relates to a method of analysis' pattern recognition. It further relates to the computer program associated with this method. More specifically, the method allows the structural identification / elucidation and classification of chemical substances through spectral data originating from one-, two- or three-dimensional analyses that generate peaks.

Inventors:

TEIXEIRA FREIRE RAFAEL (BR)
CASTRO-GAMBOA IAN (BR)

Application Number:

PCT/BR2015/000075

Publication Date:

November 24, 2016

Filing Date:

May 18, 2015

Assignee:

UNIV ESTADUAL PAULISTA JÚLIO DE MESQUITA FILHO - UNESP (BR)

International Classes:

G06F19/24; G01N24/08; G01R33/46; G16C20/20

Domestic Patent References:

WO2013150098A1	2013-10-10

Foreign References:

US20090299653A1	2009-12-03
US6895340B2	2005-05-17
US4719582A	1988-01-12

Other References:

CROASMUN, W. R. ET AL.: "Two-dimensional NMR spectroscopy: applications for chemists and biochemists", 2ND ED. EXPANDED AND UPDATED TO INCLUDE MULTIDIMENSIONAL WORK., 1994, pages 581 - 583
FREIRE, R. T.: "Development of a pattern recognition and dereplication software applied to nuclear magnetic resonance spectroscopy", REGISTER PRESENT IN INSTITUTIONAL REPOSITORY OF UNESP RELATED TO D.SC. THESIS, 15 August 2014 (2014-08-15), Retrieved from the Internet [retrieved on 20160127]
"Desenvolvimento de um software de reconhecimento de padrões e de desreplicação aplicado a espectroscopia de ressonância magnética nuclear", REGISTER PRESENT IN VIRTUAL LIBRARY OF FAPESP RELATED TO THE DOCTORATE SCHOLARSHIP PROJECT, Retrieved from the Internet [retrieved on 20160129]
FREIRE, R. T.: "Development of a pattern recognition and dereplication software applied to nuclear magnetic resonance spectroscopy", REGISTER PRESENT IN VIRTUAL LIBRARY OF FAPESP RELATED TO THE D.SC. THESIS, 15 August 2014 (2014-08-15), Retrieved from the Internet [retrieved on 20160129]

Attorney, Agent or Firm:

DE MORAES SPIANDORELLO, Fabíola (271 Bloco II,Barra Funda - São Paulo/SP, CEP: -070, BR)

Claims:

1. Method of analysis' pattern recognition characterized by the fact that it comprises the following steps:

a) Perform the analysis of the substance and treat the results obtained;

b) Prepare tables where the rows represent the peaks' chemical shifts and the columns represent the selected nuclei;

c) Insert the data into the computer program.

2. Method of analysis' pattern recognition according to claim 1 characterized by the fact that, in step (a), the analysis of the substance can be performed by any kind of analysis method which generates peaks in a space having up to three dimensions.

3. Method of analysis' pattern recognition according to claims 1 or 2 characterized by the fact that the analysis can be selected from 1H NMR, HMBC, COSY, 13C NMR, UV, IR and mass spectrometry, among others.

4. Method of analysis' pattern recognition according to claim 1 characterized by the fact that, in step (b), the columns must be ordered by ascending mass of the nuclei.

5. Computer program that causes a computer to perform a method of analysis' pattern recognition characterized by the fact that it comprises a function of pattern recognition developed by using the basic principle of the mathematical distance between the coordinates of peaks from two samples, where each peak consists of a point in a Cartesian space with coordinates x, y and z.

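6. Computer program according to claim 5 characterized by the fact that the correlation peak n for each of samples 1 and 2 can be described as:

$$s1_n = (s1_{x}\,\delta_{\text{ppm}},\; s1_{y}\,\delta_{\text{ppm}},\; s1_{z}\,\delta_{\text{ppm}})$$

$$s2_n = (s2_{x}\,\delta_{\text{ppm}},\; s2_{y}\,\delta_{\text{ppm}},\; s2_{z}\,\delta_{\text{ppm}})$$

and that the distance between the points is calculated by the Euclidean distance, which is given by the equation:

$$\text{distance} = \sqrt{(s2_{n,x}\,\delta_{\text{ppm}} - s1_{n,x}\,\delta_{\text{ppm}})^2 + (s2_{n,y}\,\delta_{\text{ppm}} - s1_{n,y}\,\delta_{\text{ppm}})^2 + (s2_{n,z}\,\delta_{\text{ppm}} - s1_{n,z}\,\delta_{\text{ppm}})^2}$$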

7. Computer program according to claims 5 and 6 characterized by the fact that a similarity coefficient is represented by the equation

$$\text{Similarity} = \frac{\sum n_{s2} \times 100}{n_{s1}}$$

wherein:

$\sum n_{s2}$ = number of peaks found similar in sample 2

$n_{s1}$ = number of peaks present in sample 1

wherein the score represents, as a percentage, the number of peaks in sample 2 attributed as similar relative to the total number of peaks from sample 1.

8. Computer program according to claims 5, 6 and 7, characterized by the fact that the pattern recognition function is:

[similar_peaks, similarity] = recog_pat(s1, s2, deltaX, deltaY, deltaZ, difNP)

wherein "s1" and "s2" are the correlation peaks from sample 1 and sample 2 and "difNP" is the modulus of the difference between the number of peaks from sample 1 and sample 2.

Description:

METHOD OF ANALYSIS' PATTERN RECOGNITION AND COMPUTER PROGRAM

Technical Field of the Invention:

The present invention relates to a method of analysis' pattern recognition. It further relates to the computer program associated with this method.

More specifically, the method allows the structural identification / elucidation and classification of chemical substances through spectral data originating from one-, two- or three-dimensional analyses that generate peaks.

Background of the invention:

Natural products are chemical compounds derived from natural sources such as plants, animals and microorganisms.

In the organic chemistry field, a natural product is an organic compound that has been purified and isolated from a natural source and that was produced by primary or secondary metabolism.

Natural products also have pharmacological or biological activity, which can provide therapeutic advantages in the treatment of diseases.

The use of natural products as active agents dates back to ancient civilizations and, to date, they remain an important source for the discovery of new drugs.

Classical phytochemical studies use techniques of extraction, isolation (such as chromatography columns and high performance liquid chromatography, among others) and structural elucidation (such as mass spectrometry and NMR techniques) to identify a natural product.

One of the most difficult and exhausting stages of these studies is the structural elucidation and identification of the compounds present in these organisms. There are many strategies to help in this process, such as dereplication and metabolomics.

Dereplication is, in theory, performed at an early stage to determine whether the active compound was previously disclosed and to avoid the re-elucidation of an already known compound.

It can be performed by using liquid chromatography and mass spectrometry (LC-MS) or NMR, wherein the data obtained are compared to the information found in databases of previously reported compounds. However, the lack of a strong database focused on natural products makes these studies more difficult.

Considering this challenge, spectral NMR data obtained by NuBBE (a virtual database of natural products and derivatives from Brazilian biodiversity) were grouped to create a spectral database comprising more than 30,000 NMR analyses.

Further, data from HMDB (The Human Metabolome Database) and MMCD (The Madison Metabolomics Consortium Database), available online, were also included.

The HMDB is a database of molecules found in the human body and has a collection of 41,511 substances. However, only a small part of these substances have their spectral data available. Only 895 sets of 1H-1D NMR analysis peaks and 895 sets of 13C HSQC NMR analysis peaks were downloaded from this site, along with the substance figures and names, generating 3,580 files (895*4).

The MMCD contains molecules present in biological samples and offers a catalog of 20,306 compounds. However, only 794 of them contain NMR data obtained from in-house experiments. For this database, 448 sets of NMR analysis peaks of 1H-1D, 13C, 13C-HSQC and TOCSY were downloaded along with their figures and names, resulting in 2,688 files (448*6).

Thus, the present invention relates to a method of analysis' pattern recognition and, further, to the computer program associated with this method.

Current technologies for identifying chemicals are based on nuclear magnetic resonance signals. The signals are entered into software that identifies the substance, where possible, through a pattern recognition system. However, if a group of peaks that does not belong to the database is entered, no information or conflicting information is generated for the user, making it impossible to identify the class or even the chemical compound.

Document DE 4436923 C1 of the state of the art discloses an identification system using a barcode that indicates both the structure and the purity of a chemical product, obtained via a spectrum analysis system. The proposed process aims at the declaration and automatic identification of chemical products in the trade of goods moved in transport containers from one place to another.

In the present invention, the computer program performs the search and the identification of the spectral parameters of the analysis. The method and computer program of analysis' pattern recognition disclosed in the present invention use solely peaks instead of the whole spectrum, which drastically reduces the data dimension.

Nearly 350 NMR 13C-gHSQC and 13C-gHMBC analyses obtained from isolated compounds arising from different endophytes were treated and their peaks were exported to the computer program, creating a database comprising spectral information from 1,344 compounds.

A pattern recognition function was developed that considers the spatial distance between the peaks of the analyses of two samples.

Calculations identify which peaks are similar and, at the end, the pattern recognition function attributes a similarity score to the correlation between two samples from the same experiment, generating a similarity profile.

Each chemical structure generates a different and unique profile, thus, this method can be considered as a fingerprint technique.

Two molecules with the same chemical skeleton that differ in only one substituent will display similar behavior towards each technique. Taking this concept into consideration and applying the method and computer program developed in the present invention, it became possible to group substances with similar structures using statistical techniques such as HCA and PCA. Thus, the molecular elucidation process became easier and faster, requiring only a few steps.

Summary:

The present invention relates to a method of analysis' pattern recognition. It further relates to the computer program associated with this method.

More specifically, the method allows the structural identification / elucidation of chemical substances through spectral data originating from one-, two- or three-dimensional analyses that generate peaks.

Brief Description of the Drawings:

Figures 1A-1B show a flowchart of the architecture of the pattern recognition function.

Figure 2 is a bar chart of the similarity values found for Roridin A, correlated with all substances present in the 13C-gHSQC database.

Figure 3 shows overlapped 13C-gHSQC spectra of Roridin A (blue) and sample 155, Iisineiacsf26fr04precip2 (green), with a similarity index of 100%.

Figure 4 is a bar chart of the similarity values found for sample 1, Roridin A, correlated with all substances present in the 13C-gHMBC database.

Figure 5 shows overlapped 13C-gHMBC spectra of Roridin A (blue) and sample 189, Iisineiacsf26fr04precip2 (green), with a similarity index of 100%.

Figure 6 is a hierarchical tree for all 1,344 compounds, computed on the 1H-1D similarity matrix.

Figures 7A, 7B and 7C are, respectively, the HCA from the 1H-1D NMR similarity matrix, the HCA from the 13C-HSQC NMR similarity matrix and the HCA from the concatenated 1H-1D / 13C-HSQC matrix.

Detailed Description of the Invention:

The invention disclosed herein relates to a method of analysis' pattern recognition and the computer program associated with this method. Specifically, the invention aims at the identification of chemical substances through spectral data originating from one-, two- or three-dimensional analyses that generate peaks.

The computer program makes use of the signals and, based on this information, generates a similarity matrix that makes it possible to group data (molecules) according to their chemical structure (for further details, see Examples 1 and 2).

To perform the similarity calculation for an unknown sample, it is necessary to compare the peaks from this substance with the peaks of known substances or standards. The data obtained are compared to the information found in databases of previously reported compounds, such as NuBBE, HMDB and MMCD.

Additionally, in case the compound does not exist in any of the databases, the computer program of the invention performs a hierarchical grouping of the signals, generating a similarity profile in which the molecules are grouped according to their structural similarities.

Thus, through this method, it is possible to group the molecules according to their chemical structures. So, even if the molecule of interest does not exist in a database, it is still possible to obtain information that assists in the structural elucidation, since it is possible to assign a class or even a structure for the compound.

- Method of analysis' pattern recognition:

The method of analysis' pattern recognition of the present invention comprises the following steps:

a) Perform the analysis of the substance and treat the results obtained;

b) Prepare tables where the rows represent the peaks' chemical shifts and the columns represent the selected nuclei;

c) Insert the data into the computer program.

In step (a), the analysis of the substance can be performed by any kind of analysis method that generates peaks in a space having up to three dimensions. Thus, the analysis can be selected from 1H NMR, HMBC, COSY, 13C NMR, UV, IR and mass spectrometry, among others.

In step (b), the columns must be ordered by ascending mass of the nuclei.
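
Step (b) can be illustrated with a minimal Python sketch. It is not part of the original disclosure: the function name `build_peak_table`, the `NUCLEUS_MASS` lookup table and the peak values are hypothetical. It only shows one way of arranging peaks as rows and nuclei as columns ordered by ascending mass.

```python
# Mass numbers of some common NMR nuclei (illustrative lookup, an assumption).
NUCLEUS_MASS = {"1H": 1, "13C": 13, "15N": 15, "19F": 19, "31P": 31}


def build_peak_table(peaks, nuclei):
    """Return (ordered_nuclei, rows) with columns sorted by ascending nucleus mass.

    peaks  -- list of tuples, one chemical shift (ppm) per nucleus, given in
              the same order as `nuclei`
    nuclei -- list of nucleus labels, e.g. ["13C", "1H"]
    """
    # Column indices sorted by the mass of the nucleus they hold.
    order = sorted(range(len(nuclei)), key=lambda i: NUCLEUS_MASS[nuclei[i]])
    ordered_nuclei = [nuclei[i] for i in order]
    rows = [tuple(peak[i] for i in order) for peak in peaks]
    return ordered_nuclei, rows


# Hypothetical HSQC peaks recorded as (13C shift, 1H shift).
columns, table = build_peak_table([(72.1, 3.40), (61.5, 3.75)], ["13C", "1H"])
print(columns)  # ['1H', '13C']
print(table)    # [(3.4, 72.1), (3.75, 61.5)]
```

Each row of `table` is one peak, and the lighter nucleus always occupies the first column regardless of how the peaks were exported.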

The details associated with step (c) are disclosed below.

- Computer program associated with the method of analysis' pattern recognition:

Based on the coordinates of the analysis peaks, a pattern recognition function was developed using the basic principle of the mathematical distance between the coordinates of peaks from two samples.

Each peak consists of a point in a Cartesian space with coordinates x, y and z. For example, in a one-dimensional NMR spectrum, the x-axis is the chemical shift of a nucleus and the y-axis is the intensity. In a UV or IR spectrum, the x-axis is the wavenumber and the y-axis is the absorbance. In mass spectrometry, the x-axis is the m/z and the y-axis is the relative abundance.

In a two-dimensional NMR spectrum, both x and y are chemical shifts of the nuclei selected in the experiment and z is the intensity. In HPLC-DAD, x is the time, y is the wavenumber and z is the absorbance. Lastly, in a three-dimensional NMR spectrum, x, y and z are the chemical shifts of the nuclei selected in the experiment.

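Consider sample 1 as s1 and sample 2 as s2. The correlation peak n for each sample can be described as:

$$s1_n = (s1_{x}\,\delta_{\text{ppm}},\; s1_{y}\,\delta_{\text{ppm}},\; s1_{z}\,\delta_{\text{ppm}})$$

$$s2_n = (s2_{x}\,\delta_{\text{ppm}},\; s2_{y}\,\delta_{\text{ppm}},\; s2_{z}\,\delta_{\text{ppm}})$$

One way to calculate the distance between the points is to use the Euclidean distance, given by:

$$\text{distance} = \sqrt{(s2_{n,x}\,\delta_{\text{ppm}} - s1_{n,x}\,\delta_{\text{ppm}})^2 + (s2_{n,y}\,\delta_{\text{ppm}} - s1_{n,y}\,\delta_{\text{ppm}})^2 + (s2_{n,z}\,\delta_{\text{ppm}} - s1_{n,z}\,\delta_{\text{ppm}})^2}$$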

By using the equation above and setting a maximum value for the distance, it is possible to assign two peaks in different samples as similar to each other.

This can be indicative of peaks from the same chemical environment, which can represent the same part of a molecule.
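
The distance criterion above can be sketched in Python as follows. This is an illustration under stated assumptions, not the program of the invention: the names `peak_distance` and `are_similar`, the threshold `max_distance` and the peak values are hypothetical.

```python
import math


def peak_distance(p1, p2):
    """Euclidean distance between two peaks given as (x, y, z) coordinates."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p1, p2)))


def are_similar(p1, p2, max_distance):
    """Treat two peaks as similar when their distance is within the limit."""
    return peak_distance(p1, p2) <= max_distance


# Hypothetical 2D peaks (1H ppm, 13C ppm); z is set to 0 when unused.
peak_a = (3.40, 72.1, 0.0)
peak_b = (3.43, 72.5, 0.0)
print(round(peak_distance(peak_a, peak_b), 3))
print(are_similar(peak_a, peak_b, max_distance=0.5))
```

Because the axes of a heteronuclear experiment span very different ppm ranges, a single distance limit weights the 13C axis more heavily than the 1H axis; separate limits per axis avoid this.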

Considering the total number of peaks of an analysis and the number of peaks of s1 that are found in s2, it is possible to generate a similarity coefficient, represented by the third equation:

$$\text{Similarity} = \frac{\sum n_{s2} \times 100}{n_{s1}}$$

wherein:

$\sum n_{s2}$ = number of peaks found similar in sample 2

$n_{s1}$ = number of peaks present in sample 1

This score, given by the third equation, is the percentage of the number of peaks in sample 2 attributed as similar relative to the total number of peaks in sample 1.
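
As a worked illustration of the third equation, the following Python sketch computes the score from the two counts. The function name `similarity_percent` and the example counts are hypothetical.

```python
def similarity_percent(n_similar_in_s2, n_peaks_s1):
    """Percentage of sample 1 peaks that have a similar peak in sample 2."""
    if n_peaks_s1 == 0:
        raise ValueError("sample 1 has no peaks")
    return n_similar_in_s2 * 100 / n_peaks_s1


# Hypothetical case: 6 of the 8 peaks of sample 1 are found in sample 2.
print(similarity_percent(6, 8))  # 75.0
```

Note that the score is normalized by sample 1 only, so comparing s1 against s2 need not give the same value as comparing s2 against s1.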

If this function is applied for one substance against the entire database, a similarity profile and a p-vals (p-value) profile are generated, with values ranging from 0 to 100 and from 0 to 1, respectively.

The similarity profile represents how similar this substance is to all others present in the database, and the p-vals profile gives the statistical significance of the results.

These profiles can be considered as the behavior of the substance relative to all others in the database using, for example, the 1D 1H NMR peaks.
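
A similarity profile of this kind can be sketched in Python as below. The sketch is an assumption-laden illustration: the helper names, the single `max_distance` limit and the tiny database are hypothetical, and the p-vals profile is omitted because the text does not state how it is computed.

```python
import math


def similarity(s1, s2, max_distance):
    """Percentage of peaks in s1 with at least one s2 peak within max_distance."""
    if not s1:
        return 0.0
    hits = sum(
        1 for p1 in s1 if any(math.dist(p1, p2) <= max_distance for p2 in s2)
    )
    return hits * 100 / len(s1)


def similarity_profile(query, database, max_distance):
    """Similarity (0 to 100) of `query` against every entry of `database`."""
    return {
        name: similarity(query, peaks, max_distance)
        for name, peaks in database.items()
    }


# Hypothetical 1D 1H peak lists (ppm); each peak is a 1-tuple.
database = {
    "compound_A": [(1.20,), (3.40,), (5.20,)],
    "compound_B": [(1.21,), (3.90,), (7.10,)],
    "compound_C": [(2.00,), (4.40,)],
}
query = [(1.20,), (3.41,), (5.19,)]
profile = similarity_profile(query, database, max_distance=0.02)
print(profile)
```

The resulting dictionary, ordered by database entry, is the vector that the text treats as the fingerprint of the query substance.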

Thus, the pattern recognition function is:

[similar_peaks, similarity] = recog_pat(s1, s2, deltaX, deltaY, deltaZ, difNP)

The function has six input variables (s1, s2, deltaX, deltaY, deltaZ, difNP) and two output variables. The input variables "s1" and "s2" are the correlation peaks from sample 1 and sample 2. The remaining inputs are the similarity limit values between s1 and s2.

The value of "difNP" is the modulus (absolute value) of the difference between the number of peaks from sample 1 and sample 2. This variable sets a maximum value for the difference between the numbers of peaks of the two samples. If this condition is not met, the similarity index is considered zero. The output "similarity" is the similarity value calculated by the pattern recognition function. The output "pontos_minimos" (Portuguese for "minimum points") is the matrix that contains all the peaks with the minimum distance.
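
The behavior described for this function can be approximated by the Python sketch below. It is not the original implementation, whose flowchart is given in Figures 1A-1B. In particular, it is an assumption that deltaX, deltaY and deltaZ act as per-axis limits applied to the nearest peak, and the snake_case parameter names and the sample data are hypothetical.

```python
import math


def recog_pat(s1, s2, delta_x, delta_y, delta_z, dif_np):
    """Illustrative sketch; returns (similar_peaks, similarity).

    s1, s2  -- lists of (x, y, z) peak coordinates for sample 1 and sample 2
    delta_x, delta_y, delta_z -- assumed per-axis limits for similar peaks
    dif_np  -- maximum allowed absolute difference in the number of peaks
    """
    # Similarity is zero when a list is empty or the peak counts differ too much.
    if not s1 or not s2 or abs(len(s1) - len(s2)) > dif_np:
        return [], 0.0

    limits = (delta_x, delta_y, delta_z)
    similar_peaks = []
    for p1 in s1:
        # Closest peak of sample 2 by Euclidean distance.
        nearest = min(s2, key=lambda p2: math.dist(p1, p2))
        # Assumption: the pair is similar only if every axis is within its limit.
        if all(abs(a - b) <= lim for a, b, lim in zip(p1, nearest, limits)):
            similar_peaks.append((p1, nearest))

    similarity = len(similar_peaks) * 100 / len(s1)
    return similar_peaks, similarity


# Hypothetical 2D peaks (1H ppm, 13C ppm, unused z).
s1 = [(3.40, 72.1, 0.0), (3.75, 61.5, 0.0), (5.20, 92.8, 0.0)]
s2 = [(3.41, 72.0, 0.0), (3.74, 61.6, 0.0), (4.60, 96.6, 0.0)]
peaks, sim = recog_pat(s1, s2, 0.05, 0.5, 0.0, 1)
print(len(peaks), round(sim, 1))
```

In this sketch several peaks of sample 1 may be matched to the same peak of sample 2; a stricter one-to-one assignment would be a different design choice.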

This function uses only the peak coordinates to perform all the calculations. This is an advantage over conventional methods, which use the entire spectrum. Processing time and power are drastically reduced, allowing any kind of computer to run this function in real time.

The flowchart of Figures 1A-1B shows the architecture of the pattern recognition function.

This method can be considered a fingerprint technique because each substance has its own profile. Accordingly, similar structures have similar profiles and, as more substances are added to the database, the profile becomes more specific.

The pattern recognition function compares all peaks from a spectrum with all peaks from every spectrum present in the database. This generates a molecule profile, a signature, which is based on the peaks.

With multivariate statistical analysis, it is possible to group the substances where each group contains substances with similar chemical structures.

Example 1: Evaluation and demonstration of the pattern recognition function applied to the NMR endophyte database:

For this demonstration, using proton-carbon-13 HSQC and proton-carbon-13 HMBC NMR spectra, the substance roridin A, isolated from the endophyte Myrothecium gramineum, was chosen. This substance is a trichothecene with high bactericidal activity (its structural formula, an image in the original, is not reproduced here).

After the NMR peaks were copied to a table, exported to MATLAB and the pattern recognition for this substance against all substances in the database was calculated, a similarity bar graph was plotted, as shown in Figure 2.

Values above 50% are considered high for this system when applied to real data, mainly because erroneous correlation peaks are assigned when a real NMR spectrum is not well processed. The most important factors that lead to these errors are phase correction and apodization.

Figure 2 shows the similarity bar graph for the 13C-gHSQC analysis. There are high similarity values for samples in the region between 150 and 160, where sample 155, named "Iisineiacsf26fr04precip2", had a similarity value of 100%. Once a high level of similarity is identified, an overlap plot of the peaks can be made.

Figure 3 shows this procedure, where the green dots are the peaks from sample 155 and the blue dots are the peaks from Roridin A. In the figure, it is possible to observe that the peaks of these two samples correspond to the same substance.

The pattern recognition calculations were repeated for the same substance, Roridin A, this time using the peaks from the 13C-gHMBC endophyte database. The similarity bar graph is shown in Figure 4, where sample 189 has a similarity value of 100%. The superposition of the 13C-gHMBC peaks (Figure 5), where the green dots are the peaks from sample 189 and the blue dots are the peaks obtained from the Roridin A peak table, shows that these groups of peaks are the same and represent the same substance.

The analyses found with 100% similarity in the two experiments both lead to the same name, "Iisineiacsf26fr04precip2".

Based on this information, it is possible to infer that the sample in question is Roridin A. In this case it was possible to name this FID as Roridin A, recovering the experimental data for this substance.

Example 2: Similarity Hierarchical Clustering

Two molecules with the same chemical skeleton that only differ in one substituent will display similar behaviors towards each NMR technique.

Taking this concept into consideration and applying the algorithms developed in the present invention, it became possible to group substances with similar structures using statistical techniques such as HCA and PCA.
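
The grouping idea can be illustrated with a deliberately naive hierarchical clustering sketch in Python. It is an assumption that average linkage and Euclidean distance between similarity profiles are used; the text does not specify the linkage. The function name `hca` and the profile values are hypothetical.

```python
import math


def hca(profiles, n_clusters):
    """Naive average-linkage agglomerative clustering of similarity profiles.

    profiles   -- dict mapping substance name to its similarity profile
    n_clusters -- stop merging once this many clusters remain
    """
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        # Mean Euclidean distance over all cross-cluster profile pairs.
        dists = [math.dist(profiles[a], profiles[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > max(n_clusters, 1):
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical similarity profiles (percent similarity to four database entries).
profiles = {
    "glucose_1": [100, 90, 10, 5],
    "glucose_2": [90, 100, 12, 8],
    "cellobiose": [60, 55, 100, 20],
    "alanine": [5, 8, 20, 100],
}
print(hca(profiles, n_clusters=3))
print(hca(profiles, n_clusters=2))
```

With these made-up profiles the two glucose entries merge first and the cellobiose entry joins them next, which mirrors the kind of structural grouping the text describes.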

Alpha-D-glucose from the HMDB database was selected as an example for this HCA. Its substance index is 1259, which means it is the 1259th compound in the database. This number is arbitrary: it comes from the HMDB indexing and is constant across all NMR similarity matrices.

Figure 6 shows the hierarchical tree for all 1,344 compounds, computed on the 1H-1D similarity matrix. The dashed line marks the position of alpha-D-glucose.

The HCA was performed in three different similarity matrices and the region which represents the glucose is displayed in Figures 7A, 7B and 7C.

In Figure 7A, it is possible to note that the algorithm could group the glucose entries based on the 1H-1D NMR peaks. Substances written inside brackets are from the HMDB database and substances outside brackets are from the MMCD database. There are three different sets of analysis peaks for glucose: two from HMDB and one from MMCD.

Comparing the glucose peaks from the different databases, it is possible to observe that these sets of peaks are very similar, but not equal. This is related to how the peaks were obtained and to the criteria for peak selection in each database, such as field strength or solution conditions.

Regardless of these differences, the algorithm gave accurate information for this substance using the 1H-1D NMR peaks, with the glucose entries grouped together.

The HCA for the 13C-HSQC peaks using the same algorithm can be observed in Figure 7B. In this case, only two of the glucose entries were grouped, both of them from the HMDB database.

The d-(+)-glucose was grouped with the cellobiose molecules. When the similarity matrices (13C-HSQC and 1H-1D) are concatenated and another HCA is generated, possible errors and misleading clusterings are corrected by the proposed method.

Figure 7C shows the HCA performed on this concatenated matrix. The three glucose entries were grouped together and the cellobiose is in the same cluster at 500 distance units.
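
The concatenation step can be sketched in Python as below. It is an assumption that the matrices are joined side by side (one longer profile per substance); the function name `concatenate_profiles` and the values are hypothetical.

```python
def concatenate_profiles(*matrices):
    """Join per-technique similarity profiles side by side for each substance.

    Each matrix is a dict mapping substance name to its similarity profile.
    Only substances present in every matrix are kept.
    """
    if not matrices:
        return {}
    common = set(matrices[0]).intersection(*matrices[1:])
    return {
        name: [value for matrix in matrices for value in matrix[name]]
        for name in sorted(common)
    }


# Hypothetical 1H-1D and 13C-HSQC similarity profiles for two substances.
h1_matrix = {"glucose": [100, 40], "cellobiose": [40, 100]}
hsqc_matrix = {"glucose": [100, 70], "cellobiose": [70, 100]}
combined = concatenate_profiles(h1_matrix, hsqc_matrix)
print(combined)
```

Clustering the combined profiles lets evidence from one technique compensate for ambiguous groupings in the other, which is the effect reported for Figure 7C.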

This whole demonstration used the glucose peaks as an example. Given the peaks of a new compound isolated by any kind of methodology, it is possible to perform the HCA and place this new compound in groups with similar structures.

After the structural elucidation of this compound, its peaks can be inserted into the database in order to make the HCA more efficient and robust. There is no limit to the number of compounds and, the more substances are added, the better the results.

One important fact is that the same substance will never, or almost never, have a distance of zero in the HCA. This is because small shifts in the peaks (which are related to the sample temperature, pH and concentration, among others) generate small differences in the similarity profiles of these analyses.

When the HCA is performed, this small difference in the profiles generates a small distance between these analyses. Only if all peaks have exactly the same chemical shifts will the distance value in the HCA be zero.

To compare the efficiency of the NMR pattern recognition algorithm across the different NMR techniques, the dendrograms were compared graphically by connectors that link the same sample on each side of a pair of dendrograms, using a function (as seen in Figure 8).

The sample positions are connected by lines that show the sample permutation between the dendrograms. Thus, it is possible to observe the general behavior of the hierarchical clustering.

Parallel lines indicate that the samples remained in the same position. Lines whose slope is different from zero indicate that the sample changed its position between the dendrograms.

The higher the modulus of the slope, the greater the Euclidean distance of the permutation. Samples that have the same slope value underwent permutations with equal Euclidean distance values.
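
The connector comparison can be sketched in Python as follows. This is an illustrative reading of the description, not the function used for Figure 8: the name `connector_slopes`, the `width` parameter (horizontal spacing between the two dendrograms) and the leaf orders are hypothetical.

```python
def connector_slopes(left_order, right_order, width=1.0):
    """Slope of the line joining each sample's position in two dendrograms.

    left_order, right_order -- leaf orders (lists of sample names) of the
                               two dendrograms being compared
    width                   -- assumed horizontal distance between them
    """
    right_pos = {name: i for i, name in enumerate(right_order)}
    return {
        name: (right_pos[name] - i) / width
        for i, name in enumerate(left_order)
    }


# Hypothetical leaf orders from two clusterings of the same four samples.
slopes = connector_slopes(["a", "b", "c", "d"], ["a", "c", "b", "d"])
print(slopes)  # {'a': 0.0, 'b': 1.0, 'c': -1.0, 'd': 0.0}
```

A slope of zero marks a sample that kept its position, and the absolute value of a non-zero slope grows with how far the sample moved between the two dendrograms.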
